User:IssaRice/AI safety/Possibility of act-based agents



"If a pre­dic­tor can pre­dict a sys­tem con­tain­ing con­se­quen­tial­ists (e.g. a hu­man in a room), then it is us­ing some kind of con­se­quen­tial­ist ma­chin­ery in­ter­nally to make these pre­dic­tions. For ex­am­ple, it might be mod­el­ling the hu­man as an ap­prox­i­mately ra­tio­nal con­se­quen­tial­ist agent. This pre­sents some prob­lems. If the pre­dic­tor simu­lates con­se­quen­tial­ist agents in enough de­tail, then these agents might try to break out of the sys­tem." [1]

"I think we have a foundational disagreement here about to what extent saying 'Oh, the AI will just predict that by modeling humans' solves all these issues versus sweeping the same unsolved issues under the rug into whatever is supposed to be modeling the humans." [2]

"For ex­am­ple cur­rently I find it re­ally con­fus­ing to think about cor­rigible agents rel­a­tive to goal-di­rected agents." [3]

"It seems like a core benefit of this sort of approach is that it replaces the motivation system (take action that argmaxes some score) with an approval-seeking system that can more easily learn to not do things the overseer doesn’t approve of--like search for ways to circumvent the overseer’s guidance--but it’s not clear how much this actually buys you. Approval-maximization is still taking the action that argmaxes some score, and the difference is that approval-direction doesn’t attempt to argmax (because argmaxing is disapproved of), but without pointing explicitly at the math that it will use instead, it’s not clear that this won’t reduce to something like argmax with the same sort of problems." [4]

"Wasn't the whole point that we want to avoid such goal-direction?" [5]

"There is clearly a difference between act-based agents and traditional rational agents. But it’s not entirely clear what the key difference is." [6]

There is some more discussion here: https://lw2.issarice.com/posts/FuGDYNvA6qh4qyFah/thoughts-on-human-compatible

I think there are two kinds of skepticism here:

  • Are act-based/approval-directed agents even possible, or do they turn out to be goal-directed agents after all? Act-based agents are still trying to maximize something (the approval of the overseer), which seems to create the same sorts of problems we get with goal-directed agents (e.g. wouldn't the agent try to trick the overseer into giving high approval?). (See the sketch after this list.)
  • Daemons: if the best way to predict the overseer or to find the highest-approval action turns out to involve creating consequentialists inside the act-based agent, we might eventually get goal-directed behavior that overpowers the outer act-based layer. So in the long run act-based agency is not a stable situation.
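To make the first worry concrete, here is a minimal Python sketch (my own illustration, not taken from any of the cited posts) of the structural similarity: both agent types choose by argmaxing some score, and the functions `expected_utility` and `overseer_approval` are hypothetical placeholders.

```python
# Minimal sketch of why approval-maximization still looks like an argmax.
# `expected_utility` and `overseer_approval` are hypothetical placeholders,
# not functions from any existing codebase.

def goal_directed_act(state, actions, expected_utility):
    """Traditional rational agent: score each action by the predicted value
    of its consequences under the agent's own utility function."""
    return max(actions, key=lambda a: expected_utility(state, a))

def approval_directed_act(state, actions, overseer_approval):
    """Approval-directed agent: score each action by how much the overseer
    would approve of the action itself, not of its downstream outcomes."""
    return max(actions, key=lambda a: overseer_approval(state, a))
```

Both functions take the action that argmaxes a score; the worry in the first bullet is that merely swapping which score gets argmaxed does not obviously remove the incentive to game whatever process produces that score (here, the overseer).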

As far as I know, Paul's responses are something like:

  • Use approval-direction internally: "An agent maximizing Hugh’s approval might find those actions and take them. But if the agent was internally approval-directed, then it wouldn’t try to exploit errors in Hugh’s ratings. Actions that lead to reported approval but not real approval, don’t lead to approval for approved reasons" [7]
  • The kinds of obviously catastrophic actions a goal-directed agent would take shouldn't happen, because they will look obviously bad to the overseer.
  • We can use techniques for optimizing worst-case performance, reward engineering, and informed oversight.
  • The overseer is supposed to be much smarter than the agent being trained (the distilled agent), which makes it easier to detect problems.
  • Approval-direction searches over individual actions rather than sequences of actions, so sequences of actions that lead to problems (e.g. brainwashing or tricking the overseer) won't be approved unless there is some chain of actions (each with high approval) that leads to the problem. [8] (This seems to depend on how you define what an atomic action is; sometimes Paul talks about actions on the level of "help me stay in control" as atomic actions. See also [9].) A rough sketch of this single-step vs sequence contrast follows this list.
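As a rough illustration of the last point (again my own sketch, with hypothetical helpers `transition`, `utility`, and `overseer_approval`), the contrast is between searching over whole action sequences and scoring one step at a time:

```python
from itertools import product

# Hypothetical helpers, for illustration only:
#   transition(state, action)        -> next state (the agent's world model)
#   utility(state)                   -> how good a final outcome is
#   overseer_approval(state, action) -> overseer's rating of a single step

def plan_goal_directed(state, actions, transition, utility, horizon=3):
    """Search over whole action sequences and commit to the one whose final
    outcome scores highest; a multi-step scheme (e.g. gradually brainwashing
    the overseer) can win if the end state is rated highly."""
    def outcome_value(seq):
        s = state
        for a in seq:
            s = transition(s, a)
        return utility(s)
    return max(product(actions, repeat=horizon), key=outcome_value)

def run_approval_directed(state, actions, transition, overseer_approval, steps=3):
    """Pick each action by its own approval score; there is no objective
    defined over the whole trajectory, so a problematic multi-step scheme is
    only executed if every individual step would be approved in isolation
    (the caveat in the bullet above)."""
    trajectory = []
    for _ in range(steps):
        a = max(actions, key=lambda act: overseer_approval(state, act))
        trajectory.append(a)
        state = transition(state, a)
    return trajectory
```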