User:IssaRice/AI safety/Possibility of act-based agents
"If a predictor can predict a system containing consequentialists (e.g. a human in a room), then it is using some kind of consequentialist machinery internally to make these predictions. For example, it might be modelling the human as an approximately rational consequentialist agent. This presents some problems. If the predictor simulates consequentialist agents in enough detail, then these agents might try to break out of the system." [1]
"I think we have a foundational disagreement here about to what extent saying 'Oh, the AI will just predict that by modeling humans' solves all these issues versus sweeping the same unsolved issues under the rug into whatever is supposed to be modeling the humans." [2]
"For example currently I find it really confusing to think about corrigible agents relative to goal-directed agents." [3]
"It seems like a core benefit of this sort of approach is that it replaces the motivation system (take action that argmaxes some score) with an approval-seeking system that can more easily learn to not do things the overseer doesn’t approve of--like search for ways to circumvent the overseer’s guidance--but it’s not clear how much this actually buys you. Approval-maximization is still taking the action that argmaxes some score, and the difference is that approval-direction doesn’t attempt to argmax (because argmaxing is disapproved of), but without pointing explicitly at the math that it will use instead, it’s not clear that this won’t reduce to something like argmax with the same sort of problems." [4]
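A toy sketch of the worry in the quote above: if approval-direction is implemented as "pick the action with the highest predicted approval", it uses the same argmax machinery as a goal-directed agent, with only the score function swapped. Everything here is an illustrative assumption and not taken from any of the quoted sources: the action names, the numbers in `TASK_REWARD` and `PREDICTED_APPROVAL`, and the helper names are all hypothetical.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")


def argmax_action(actions: Iterable[Action], score: Callable[[Action], float]) -> Action:
    """Return the action with the highest score (ties go to the first one seen)."""
    return max(actions, key=score)


# Hypothetical toy scores, invented purely for illustration.
TASK_REWARD = {"do_task": 1.0, "circumvent_overseer": 5.0, "do_nothing": 0.0}
PREDICTED_APPROVAL = {"do_task": 0.9, "circumvent_overseer": 0.05, "do_nothing": 0.3}

ACTIONS = list(TASK_REWARD)


def goal_directed(actions: Iterable[str]) -> str:
    # Argmax over a task score: picks whatever scores highest,
    # including circumventing the overseer.
    return argmax_action(actions, TASK_REWARD.__getitem__)


def approval_directed(actions: Iterable[str]) -> str:
    # Same argmax, but over the overseer's predicted approval.
    return argmax_action(actions, PREDICTED_APPROVAL.__getitem__)


print(goal_directed(ACTIONS))
print(approval_directed(ACTIONS))
```

In this sketch the two agents choose differently only because the approval table happens to rate circumvention low. The selection rule is identical, so any safety gain rests on the quality of the approval predictor and not on the form of the decision procedure, which seems to be the point the quote is raising.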
"Wasn't the whole point that we want to avoid such goal-direction?" [5]