User:IssaRice/AI safety/Possibility of act-based agents



"If a pre­dic­tor can pre­dict a sys­tem con­tain­ing con­se­quen­tial­ists (e.g. a hu­man in a room), then it is us­ing some kind of con­se­quen­tial­ist ma­chin­ery in­ter­nally to make these pre­dic­tions. For ex­am­ple, it might be mod­el­ling the hu­man as an ap­prox­i­mately ra­tio­nal con­se­quen­tial­ist agent. This pre­sents some prob­lems. If the pre­dic­tor simu­lates con­se­quen­tial­ist agents in enough de­tail, then these agents might try to break out of the sys­tem." [1]

"I think we have a foundational disagreement here about to what extent saying 'Oh, the AI will just predict that by modeling humans' solves all these issues versus sweeping the same unsolved issues under the rug into whatever is supposed to be modeling the humans." [2]

"For ex­am­ple cur­rently I find it re­ally con­fus­ing to think about cor­rigible agents rel­a­tive to goal-di­rected agents." [3]