User:IssaRice/AI safety/Possibility of act-based agents - Revision history

https://machinelearning.subwiki.org/w/index.php?action=history&feed=atom&title=User%3AIssaRice%2FAI_safety%2FPossibility_of_act-based_agents 2026-07-12T12:59:56Z Revision history for this page on the wiki MediaWiki 1.41.2 https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/AI_safety/Possibility_of_act-based_agents&diff=2604&oldid=prev IssaRice at 06:52, 11 October 2019 2019-10-11T06:52:07Z

<p></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 06:52, 11 October 2019</td> </tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l26">Line 26:</td> <td colspan="2" class="diff-lineno">Line 26:</td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* we can use techniques for optimizing worst-case performance/reward engineering/informed oversight</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* we can use techniques for optimizing worst-case performance/reward engineering/informed oversight</div></td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* the overseer is supposed to be much smarter than the agent being trained (the distilled agent), which makes it easier to detect problems.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* the overseer is supposed 
to be much smarter than the agent being trained (the distilled agent), which makes it easier to detect problems.</div></td></tr> <tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>* approval-direction searches over individual actions rather than sequences of actions, so sequences of actions that lead to problems (e.g. brainwashing or tricking the overseer) won't be approved unless there is some chain of actions (each with high approval) that leads to the problem. [https://www.lesswrong.com/posts/7Hr8t6xwuuxBTqADK/approval-directed-agents-1#m6PhiqS9Ffd3HR6wj] (This seems to depend on how you define what an atomic action is. sometimes paul talks about actions on the level of "help me stay in control" as atomic actions.)</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>* approval-direction searches over individual actions rather than sequences of actions, so sequences of actions that lead to problems (e.g. brainwashing or tricking the overseer) won't be approved unless there is some chain of actions (each with high approval) that leads to the problem. [https://www.lesswrong.com/posts/7Hr8t6xwuuxBTqADK/approval-directed-agents-1#m6PhiqS9Ffd3HR6wj] (This seems to depend on how you define what an atomic action is. sometimes paul talks about actions on the level of "help me stay in control" as atomic actions. <ins style="font-weight: bold; text-decoration: none;">see also [https://lw2.issarice.com/posts/BKM8uQS6QdJPZLqCr/towards-a-mechanistic-understanding-of-corrigibility#aZSQYwHMhwc9opYWw]</ins>)</div></td></tr> </table>

IssaRice https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/AI_safety/Possibility_of_act-based_agents&diff=2603&oldid=prev IssaRice at 05:40, 11 October 2019 2019-10-11T05:40:34Z

<p></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 05:40, 11 October 2019</td> </tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l26">Line 26:</td> <td colspan="2" class="diff-lineno">Line 26:</td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* we can use techniques for optimizing worst-case performance/reward engineering/informed oversight</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* we can use techniques for optimizing worst-case performance/reward engineering/informed oversight</div></td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* the overseer is supposed to be much smarter than the agent being trained (the distilled agent), which makes it easier to detect problems.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* the overseer is supposed 
to be much smarter than the agent being trained (the distilled agent), which makes it easier to detect problems.</div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">* approval-direction searches over individual actions rather than sequences of actions, so sequences of actions that lead to problems (e.g. brainwashing or tricking the overseer) won't be approved unless there is some chain of actions (each with high approval) that leads to the problem. [https://www.lesswrong.com/posts/7Hr8t6xwuuxBTqADK/approval-directed-agents-1#m6PhiqS9Ffd3HR6wj] (This seems to depend on how you define what an atomic action is. sometimes paul talks about actions on the level of "help me stay in control" as atomic actions.)</ins></div></td></tr> </table>

IssaRice https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/AI_safety/Possibility_of_act-based_agents&diff=2602&oldid=prev IssaRice at 04:47, 11 October 2019 2019-10-11T04:47:27Z

<p></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 04:47, 11 October 2019</td> </tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l12">Line 12:</td> <td colspan="2" class="diff-lineno">Line 12:</td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>"There is clearly a difference between act-based agents and traditional rational agents. But it’s not entirely clear what the key difference is." [https://ai-alignment.com/act-based-agents-8ec926c79e9c]</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>"There is clearly a difference between act-based agents and traditional rational agents. But it’s not entirely clear what the key difference is." 
[https://ai-alignment.com/act-based-agents-8ec926c79e9c]</div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">there's some more discussion here https://lw2.issarice.com/posts/FuGDYNvA6qh4qyFah/thoughts-on-human-compatible</ins></div></td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>I think there's two kinds of skepticism here:</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>I think there's two 
kinds of skepticism here:</div></td></tr> </table>

IssaRice https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/AI_safety/Possibility_of_act-based_agents&diff=2601&oldid=prev IssaRice at 19:59, 7 October 2019 2019-10-07T19:59:58Z

<p></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 19:59, 7 October 2019</td> </tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l23">Line 23:</td> <td colspan="2" class="diff-lineno">Line 23:</td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* the kinds of obviously-catastrophic actions a goal-directed agent takes shouldn't happen, because they will look obviously bad to the overseer.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* the kinds of obviously-catastrophic actions a goal-directed agent takes shouldn't happen, because they will look obviously bad to the overseer.</div></td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* we can use techniques for optimizing worst-case performance/reward engineering/informed oversight</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; 
white-space: pre-wrap;"><div>* we can use techniques for optimizing worst-case performance/reward engineering/informed oversight</div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">* the overseer is supposed to be much smarter than the agent being trained (the distilled agent), which makes it easier to detect problems.</ins></div></td></tr> </table>

IssaRice https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/AI_safety/Possibility_of_act-based_agents&diff=2600&oldid=prev IssaRice at 00:50, 6 October 2019 2019-10-06T00:50:06Z

<p></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 00:50, 6 October 2019</td> </tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l22">Line 22:</td> <td colspan="2" class="diff-lineno">Line 22:</td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* use approval-direction internally: "An agent maximizing Hugh’s approval might find those actions and take them. But if the agent was internally approval-directed, then it wouldn’t try to exploit errors in Hugh’s ratings. Actions that lead to reported approval but not real approval, don’t lead to approval for approved reasons" [https://ai-alignment.com/model-free-decisions-6e6609f5d99e]</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* use approval-direction internally: "An agent maximizing Hugh’s approval might find those actions and take them. But if the agent was internally approval-directed, then it wouldn’t try to exploit errors in Hugh’s ratings. 
Actions that lead to reported approval but not real approval, don’t lead to approval for approved reasons" [https://ai-alignment.com/model-free-decisions-6e6609f5d99e]</div></td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* the kinds of obviously-catastrophic actions a goal-directed agent takes shouldn't happen, because they will look obviously bad to the overseer.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* the kinds of obviously-catastrophic actions a goal-directed agent takes shouldn't happen, because they will look obviously bad to the overseer.</div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">* we can use techniques for optimizing worst-case performance/reward engineering/informed oversight</ins></div></td></tr> </table>

IssaRice https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/AI_safety/Possibility_of_act-based_agents&diff=2598&oldid=prev IssaRice at 22:28, 5 October 2019 2019-10-05T22:28:32Z

<p></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 22:28, 5 October 2019</td> </tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l21">Line 21:</td> <td colspan="2" class="diff-lineno">Line 21:</td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* use approval-direction internally: "An agent maximizing Hugh’s approval might find those actions and take them. But if the agent was internally approval-directed, then it wouldn’t try to exploit errors in Hugh’s ratings. 
Actions that lead to reported approval but not real approval, don’t lead to approval for approved reasons" [https://ai-alignment.com/model-free-decisions-6e6609f5d99e]</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* use approval-direction internally: "An agent maximizing Hugh’s approval might find those actions and take them. But if the agent was internally approval-directed, then it wouldn’t try to exploit errors in Hugh’s ratings. Actions that lead to reported approval but not real approval, don’t lead to approval for approved reasons" [https://ai-alignment.com/model-free-decisions-6e6609f5d99e]</div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">* the kinds of obviously-catastrophic actions a goal-directed agent takes shouldn't happen, because they will look obviously bad to the overseer.</ins></div></td></tr> </table>

IssaRice https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/AI_safety/Possibility_of_act-based_agents&diff=2597&oldid=prev IssaRice at 22:26, 5 October 2019 2019-10-05T22:26:29Z

<p></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 22:26, 5 October 2019</td> </tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l16">Line 16:</td> <td colspan="2" class="diff-lineno">Line 16:</td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* are act-based/approval-directed agents even possible, or do they turn out to be goal-directed agents after all? act-based agents are still trying to maximize something (the approval of the overseer). this seems to create the same sorts of problems we get with goal-directed agents (e.g. 
wouldn't the agent try to trick the overseer into giving high approval?).</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* are act-based/approval-directed agents even possible, or do they turn out to be goal-directed agents after all? act-based agents are still trying to maximize something (the approval of the overseer). this seems to create the same sorts of problems we get with goal-directed agents (e.g. wouldn't the agent try to trick the overseer into giving high approval?).</div></td></tr> <tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>* daemons: if the best way to predict the overseer/find the highest-approval action turns out to involve creating consequentialists inside the act-based agent, we might get goal-directed behavior that overpowers the outer act-based layer eventually.</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>* daemons: if the best way to predict the overseer/find the highest-approval action turns out to involve creating consequentialists inside the act-based agent, we might get goal-directed behavior that overpowers the outer act-based layer eventually. 
<ins style="font-weight: bold; text-decoration: none;">so in the long run act-based agency is not a stable situation.</ins></div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">as far as i know, paul's responses are something like:</ins></div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">* use approval-direction internally: "An agent maximizing Hugh’s approval might find those actions and take them. But if the agent was internally approval-directed, then it wouldn’t try to exploit errors in Hugh’s ratings. Actions that lead to reported approval but not real approval, don’t lead to approval for approved reasons" [https://ai-alignment.com/model-free-decisions-6e6609f5d99e]</ins></div></td></tr> </table>

IssaRice https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/AI_safety/Possibility_of_act-based_agents&diff=2596&oldid=prev IssaRice at 22:23, 5 October 2019 2019-10-05T22:23:26Z

<p></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 22:23, 5 October 2019</td> </tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l12">Line 12:</td> <td colspan="2" class="diff-lineno">Line 12:</td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>"There is clearly a difference between act-based agents and traditional rational agents. But it’s not entirely clear what the key difference is." [https://ai-alignment.com/act-based-agents-8ec926c79e9c]</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>"There is clearly a difference between act-based agents and traditional rational agents. But it’s not entirely clear what the key difference is." 
[https://ai-alignment.com/act-based-agents-8ec926c79e9c]</div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">I think there's two kinds of skepticism here:</ins></div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">* are act-based/approval-directed agents even possible, or do they turn out to be goal-directed agents after all? act-based agents are still trying to maximize something (the approval of the overseer). this seems to create the same sorts of problems we get with goal-directed agents (e.g. 
wouldn't the agent try to trick the overseer into giving high approval?).</ins></div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">* daemons: if the best way to predict the overseer/find the highest-approval action turns out to involve creating consequentialists inside the act-based agent, we might get goal-directed behavior that overpowers the outer act-based layer eventually.</ins></div></td></tr> </table>

IssaRice https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/AI_safety/Possibility_of_act-based_agents&diff=2595&oldid=prev IssaRice at 22:19, 5 October 2019 2019-10-05T22:19:22Z

<p></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 22:19, 5 October 2019</td> </tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l10">Line 10:</td> <td colspan="2" class="diff-lineno">Line 10:</td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>"Wasn't the whole point that we want to avoid such goal-direction?" [https://www.lesswrong.com/posts/7Hr8t6xwuuxBTqADK/approval-directed-agents-1#EXfbttoDhYm9Eduwd]</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>"Wasn't the whole point that we want to avoid such goal-direction?" [https://www.lesswrong.com/posts/7Hr8t6xwuuxBTqADK/approval-directed-agents-1#EXfbttoDhYm9Eduwd]</div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">"There is clearly a difference between act-based agents and traditional rational agents. But it’s not entirely clear what the key difference is." [https://ai-alignment.com/act-based-agents-8ec926c79e9c]</ins></div></td></tr> </table>

IssaRice https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/AI_safety/Possibility_of_act-based_agents&diff=2594&oldid=prev IssaRice at 22:14, 5 October 2019 2019-10-05T22:14:36Z

<p></p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 22:14, 5 October 2019</td> </tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l6">Line 6:</td> <td colspan="2" class="diff-lineno">Line 6:</td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br></td></tr> <tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>"For example currently I find it really confusing to think about corrigible agents relative to goal-directed agents." [https://www.greaterwrong.com/posts/tHxXdAn8Yuiy9y2pZ/ai-safety-without-goal-directed-behavior#comment-G66HqKdgJgFjcZhbE]</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>"For example currently I find it really confusing to think about corrigible agents relative to goal-directed agents." [https://www.greaterwrong.com/posts/tHxXdAn8Yuiy9y2pZ/ai-safety-without-goal-directed-behavior#comment-G66HqKdgJgFjcZhbE]</div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">"It seems like a core benefit of this sort of approach is that it replaces the motivation system (take action that argmaxes some score) with an approval-seeking system that can more easily learn to not do things the overseer doesn’t approve of--like search for ways to circumvent the overseer’s guidance--but it’s not clear how much this actually buys you. Approval-maximization is still taking the action that argmaxes some score, and the difference is that approval-direction doesn’t attempt to argmax (because argmaxing is disapproved of), but without pointing explicitly at the math that it will use instead, it’s not clear that this won’t reduce to something like argmax with the same sort of problems." [https://www.lesswrong.com/posts/rXzxMQwRq7KRonQDM/my-confusions-with-paul-s-agenda]</ins></div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr> <tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">"Wasn't the whole point that we want to avoid such goal-direction?" [https://www.lesswrong.com/posts/7Hr8t6xwuuxBTqADK/approval-directed-agents-1#EXfbttoDhYm9Eduwd]</ins></div></td></tr> </table>

IssaRice