Relevant, I think, is Gwern's later writing on Tool AIs:
There are similar general issues with Tool AIs as with Oracle AIs:
- a human checking each result is no guarantee of safety; even Homer nods. An extremely dangerous or subtly dangerous answer might slip through; Stuart Armstrong notes that the summary may simply not mention the important (to humans) downside to a suggestion, or frame it in the most attractive light possible. The more a Tool AI is used, or trusted by users, the less checking will be done of its answers before the user mindlessly implements them.
- an intelligent, never mind superintelligent, Tool AI will have built-in search processes and planners which may be quite intelligent themselves and, in ‘planning how to plan’, may discover dangerous instrumental drives and have the sub-planning process execute them. (This struck me as mostly theoretical until I saw how well GPT-3 could roleplay & imitate agents purely by offline self-supervised prediction on large text databases—imitation learning is (batch) reinforcement learning too! See Decision Transformer for an explicit use of this.)
- developing a Tool AI in the first place might require another AI, which itself is dangerous
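The parenthetical in the second bullet—that imitation learning is batch reinforcement learning—is the mechanism Decision Transformer makes explicit: a model trained purely to predict actions in a fixed dataset becomes a return-seeking policy once you condition on high return. Here is a minimal, purely illustrative tabular sketch of that idea; the bandit dataset and the names `dataset` and `policy` are my own invention, not anything from Gwern or the Decision Transformer paper:

```python
from collections import defaultdict, Counter

# Hypothetical offline dataset from a one-state bandit, gathered from a
# mix of good and bad behavior policies. Each entry: (episode_return, action).
dataset = [
    (1.0, "good_action"), (1.0, "good_action"), (1.0, "good_action"),
    (0.0, "bad_action"),  (0.0, "bad_action"),  (0.0, "good_action"),
]

# "Training" is pure prediction: fit the empirical conditional
# distribution P(action | return), the tabular analogue of what a
# Decision Transformer learns by sequence modeling.
conditional = defaultdict(Counter)
for ret, action in dataset:
    conditional[ret][action] += 1

def policy(target_return):
    """Act by asking the imitator for its most likely action *given that
    the return is high*. Nothing here was trained to optimize anything,
    yet conditioning on high return yields return-seeking behavior."""
    return conditional[target_return].most_common(1)[0][0]

print(policy(1.0))  # conditioning on high return → "good_action"
print(policy(0.0))  # conditioning on low return imitates the bad data
```

The point of the toy: the "agency" lives in the conditioning, not in any explicit planner, which is why an offline predictor can still embody a policy that steers toward outcomes.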
Personally, I think the distinction is basically irrelevant in terms of safety concerns, mostly for reasons outlined by the second bullet-point above. The danger is in the fact that "useful answers" you might get out of a Tool AI are those answers which let you steer the future to hit narrow targets (approximately described as "apply optimization power" by Eliezer & such).
If you manage to construct a training regime for something that we'd call a Tool AI, which nevertheless gives us something smart enough that it does better than humans at creating plans which affect reality in specific ways, then it approximately doesn't matter whether or not we give it actuators to act in the world. It has to be aiming at something; whether or not that something is friendly to human interests won't depend on what name we give the AI.
I'm not sure how to evaluate the predictions themselves. I continue to think that the distinction is basically confused and doesn't carve reality at the relevant joints, and I think progress to date supports this view.