User:IssaRice/AI safety/Possibility of act-based agents


"If a pre­dic­tor can pre­dict a sys­tem con­tain­ing con­se­quen­tial­ists (e.g. a hu­man in a room), then it is us­ing some kind of con­se­quen­tial­ist ma­chin­ery in­ter­nally to make these pre­dic­tions. For ex­am­ple, it might be mod­el­ling the hu­man as an ap­prox­i­mately ra­tio­nal con­se­quen­tial­ist agent. This pre­sents some prob­lems. If the pre­dic­tor simu­lates con­se­quen­tial­ist agents in enough de­tail, then these agents might try to break out of the sys­tem." [1]

"I think we have a foundational disagreement here about to what extent saying 'Oh, the AI will just predict that by modeling humans' solves all these issues versus sweeping the same unsolved issues under the rug into whatever is supposed to be modeling the humans." [2]

"For ex­am­ple cur­rently I find it re­ally con­fus­ing to think about cor­rigible agents rel­a­tive to goal-di­rected agents." [3]

"It seems like a core benefit of this sort of approach is that it replaces the motivation system (take action that argmaxes some score) with an approval-seeking system that can more easily learn to not do things the overseer doesn’t approve of--like search for ways to circumvent the overseer’s guidance--but it’s not clear how much this actually buys you. Approval-maximization is still taking the action that argmaxes some score, and the difference is that approval-direction doesn’t attempt to argmax (because argmaxing is disapproved of), but without pointing explicitly at the math that it will use instead, it’s not clear that this won’t reduce to something like argmax with the same sort of problems." [4]

"Wasn't the whole point that we want to avoid such goal-direction?" [5]