User:IssaRice/AI safety/Possibility of act-based agents
"If a predictor can predict a system containing consequentialists (e.g. a human in a room), then it is using some kind of consequentialist machinery internally to make these predictions. For example, it might be modelling the human as an approximately rational consequentialist agent. This presents some problems. If the predictor simulates consequentialist agents in enough detail, then these agents might try to break out of the system." [1]
"I think we have a foundational disagreement here about to what extent saying 'Oh, the AI will just predict that by modeling humans' solves all these issues versus sweeping the same unsolved issues under the rug into whatever is supposed to be modeling the humans." [2]
"For example currently I find it really confusing to think about corrigible agents relative to goal-directed agents." [3]
"It seems like a core benefit of this sort of approach is that it replaces the motivation system (take action that argmaxes some score) with an approval-seeking system that can more easily learn to not do things the overseer doesn’t approve of--like search for ways to circumvent the overseer’s guidance--but it’s not clear how much this actually buys you. Approval-maximization is still taking the action that argmaxes some score, and the difference is that approval-direction doesn’t attempt to argmax (because argmaxing is disapproved of), but without pointing explicitly at the math that it will use instead, it’s not clear that this won’t reduce to something like argmax with the same sort of problems." [4]
"Wasn't the whole point that we want to avoid such goal-direction?" [5]
"There is clearly a difference between act-based agents and traditional rational agents. But it’s not entirely clear what the key difference is." [6]
There's some more discussion here: https://lw2.issarice.com/posts/FuGDYNvA6qh4qyFah/thoughts-on-human-compatible
I think there are two kinds of skepticism here:
- Are act-based/approval-directed agents even possible, or do they turn out to be goal-directed agents after all? Act-based agents are still trying to maximize something (the approval of the overseer), which seems to create the same sorts of problems we get with goal-directed agents (e.g. wouldn't the agent try to trick the overseer into giving high approval?). (A minimal sketch of this worry is below, after this list.)
- Daemons: if the best way to predict the overseer or to find the highest-approval action turns out to involve creating consequentialists inside the act-based agent, we might eventually get goal-directed behavior that overpowers the outer act-based layer. So in the long run act-based agency is not a stable situation.
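To make the first worry concrete, here is a minimal sketch (my own illustration, not code from any of the cited posts; names like `approval_model` are hypothetical): if approval-direction is implemented by scoring candidate actions with a learned approval model and picking the best-scoring one, the selection rule is still an argmax over a score, which is exactly the structure the quote from [4] worries about.

```python
# Minimal sketch (assumption-laden illustration, not code from the cited posts):
# both "approval-maximization" and a naive implementation of "approval-direction"
# end up using the same selection rule: argmax over a learned approval score.
from typing import Any, Callable, Sequence

def approval_maximizer(actions: Sequence[Any],
                       approval_model: Callable[[Any], float]) -> Any:
    """Take the action that argmaxes the predicted approval score."""
    return max(actions, key=approval_model)

def approval_directed_choice(actions: Sequence[Any],
                             approval_model: Callable[[Any], float]) -> Any:
    """Meant to 'do what the overseer would approve of' rather than maximize a
    score, but without different math to point at, a natural implementation
    collapses back into the same argmax as above."""
    return max(actions, key=approval_model)  # same selection rule
```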
As far as I know, Paul's responses are something like:
- Use approval-direction internally: "An agent maximizing Hugh’s approval might find those actions and take them. But if the agent was internally approval-directed, then it wouldn’t try to exploit errors in Hugh’s ratings. Actions that lead to reported approval but not real approval, don’t lead to approval for approved reasons" [7]
- The kinds of obviously catastrophic actions a goal-directed agent would take shouldn't happen, because they will look obviously bad to the overseer.
- We can use techniques for optimizing worst-case performance, reward engineering, and informed oversight.
- The overseer is supposed to be much smarter than the agent being trained (the distilled agent), which makes it easier to detect problems.
- Approval-direction searches over individual actions rather than sequences of actions, so sequences of actions that lead to problems (e.g. brainwashing or tricking the overseer) won't be approved unless there is some chain of actions, each with high approval, that leads to the problem. [8] (This seems to depend on how you define what an atomic action is; sometimes Paul talks about actions on the level of "help me stay in control" as atomic actions. See also [9].) (A rough sketch of this per-action selection is below, after this list.)
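Here is a rough sketch of that last point (my own framing, not Paul's code; `propose_actions`, `overseer_approval`, and `transition` are hypothetical stand-ins): approval is queried for each individual action and never assigned to a whole plan, so a problematic multi-step plan is only reachable if every step looks high-approval on its own. How much protection this buys depends on how coarse an "action" is.

```python
# Rough sketch (assumed interfaces, not from the cited posts): an approval-directed
# loop that has the overseer score individual actions, never whole plans.
from typing import Any, Callable, List, Sequence

def run_approval_directed(initial_state: Any,
                          propose_actions: Callable[[Any], Sequence[Any]],
                          overseer_approval: Callable[[Any, Any], float],
                          transition: Callable[[Any, Any], Any],
                          num_steps: int) -> List[Any]:
    """At each step, take the single action the overseer rates highest; no
    approval score is ever computed for a sequence of actions as a whole."""
    state, taken = initial_state, []
    for _ in range(num_steps):
        candidates = propose_actions(state)
        # Each action is rated on its own; a bad plan only gets executed if
        # every one of its steps individually looks good to the overseer.
        action = max(candidates, key=lambda a: overseer_approval(state, a))
        taken.append(action)
        state = transition(state, action)
    return taken
```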