I am a Senior Policy Analyst at the Bipartisan Policy Center (BPC), working on US immigration and workforce policy. I previously worked as an economist and a management consultant.
I am interested in longtermism, global priorities research and animal welfare. Check out my blog The Ethical Economist.
Please get in touch if you would like to have a chat sometime, or connect with me on LinkedIn.
Thanks. I watched Robert Miles' video, which was very helpful, especially the part where he explains why an AI might act in accordance with its base objective in the training environment, only to pursue its mesa-objective once deployed in the real world.
I'm quite uncertain at this point, but I have a vague feeling that Russell's second principle (the machine is initially uncertain about what those preferences are) is very important here.
You're making me want to listen to the podcast episode again. From a quick look at the transcript, Russell thinks the three principles of AI should be:

1. The machine's only objective is to maximise the realisation of human preferences.
2. The machine is initially uncertain about what those preferences are.
3. The ultimate source of information about human preferences is human behaviour.
It certainly seems that such an IRL-based AI would be more open to being told what to do than a traditional RL-based AI.
RL-based AI generally doesn't want to obey requests or have its goal be changed, because this hinders/prevents it from achieving its original goal. IRL-based AI literally has the goal of realising human preferences, so it would need to have a pretty good reason (from its point of view) not to obey someone's request.
Certainly early on, an IRL-based AI would obey any request you make, provided you have baked a high enough degree of uncertainty into the AI (principle 2). After a while, the AI becomes more confident about human preferences, and so may well start to manipulate or deceive people when it thinks they are not acting in their best interests. This sounds really concerning, but in theory it might be good, provided you have given the AI enough time to learn.
For example, after a sufficient amount of time learning about human preferences, an AI may say something like "I'm going to throw your cigarettes away because I have learnt people really value health and cigarettes are really bad for health". The person might say "no, don't do that, I really want a ciggie right now". If the AI ultimately knows that the person really shouldn't smoke for their own wellbeing, it may well want to manipulate or deceive the person into giving up their cigarettes, e.g. by giving an impassioned speech about the dangers of smoking.
This sounds concerning but, provided the AI has had enough time to properly learn about human preferences, the AI should, in theory, do the manipulation in a minimally harmful way. It may, for example, learn that humans really don't like being tricked, so it will try to change the human's mind just by giving the person the objective facts about how bad smoking is, rather than by more devious means. The most important thing seems to be that the IRL-based AI has sufficient uncertainty baked into it for a sufficient amount of time, so that it only starts pushing back on human requests when it is sufficiently confident it is doing the right thing.
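The "defer while uncertain, push back once confident" dynamic can be illustrated with a deliberately toy expected-value calculation. This is my own illustration, not anything from Russell's book: the beliefs, payoffs, and the assumption that a deferring AI only proceeds when the human would approve are all invented for the sketch.

```python
def value_act(belief):
    # The AI acts unilaterally: it gets the expected utility of the
    # action under its own (possibly wrong) belief.
    return sum(p * u for u, p in belief)

def value_defer(belief):
    # The AI asks the human first. Assuming the human knows the true
    # utility and only approves actions that are actually good, the
    # negative outcomes get screened out.
    return sum(p * max(u, 0.0) for u, p in belief)

# Belief: (utility, probability) pairs over how good the action is.
uncertain = [(-1.0, 0.5), (1.0, 0.5)]    # early on: very unsure
confident = [(-1.0, 0.05), (1.0, 0.95)]  # after much observation

print(value_act(uncertain), value_defer(uncertain))  # 0.0 vs 0.5
print(value_act(confident), value_defer(confident))  # gap has shrunk
```

In this idealised model deferring is always at least as good as acting, but the gap (0.5 when uncertain, 0.05 when confident) shrinks as the AI's uncertainty does, which is why the amount of uncertainty baked in, and for how long, matters so much.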
I'm far from certain that IRL-based AI is watertight (my biggest concern remains the AI learning from irrational/bad people), but on my current level of (very limited) knowledge it does seem the most sensible approach.
This is exactly the discussion I want! I’m mostly just surprised no one seems to be talking about IRL.
I don’t have firm answers (when it comes to technical AI alignment I’m a bit of a noob), but when I listened to the podcast with Stuart Russell I remember him saying that we need to build a degree of uncertainty into the AI so it essentially has to ask for permission before it does things, or something like that. Maybe this means IRL starts to become problematic in much the same way as other reinforcement learning approaches, as in some way we do “supervise” the AI, but it certainly seems like easier supervision compared to the other approaches.
Also, as you say, the AI could learn from bad people. This just seems to be an inherent risk of all possible alignment approaches though!
I'm late to this, but I'm surprised that this post doesn't acknowledge the approach of inverse reinforcement learning (IRL) which Stuart Russell discussed on the 80,000 Hours podcast and which also featured in his book Human Compatible.
I'm no AI expert, but this approach seems to me to avoid the "as these models become superhuman, humans won’t be able to reliably supervise their outputs" problem, as a superhuman AI using IRL doesn't have to be supervised: it just observes us and, in doing so, better understands our values.
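The "learn values by observing behaviour" idea can be sketched as simple Bayesian inference over candidate reward functions. This is a toy of my own making (the hypotheses, the food choices, and the Boltzmann-rationality assumption are all invented for illustration), not the actual algorithms Russell discusses:

```python
import math

# Two toy hypotheses about what the human values.
rewards = {
    "values_health": {"salad": 1.0, "cake": 0.0},
    "values_taste":  {"salad": 0.0, "cake": 1.0},
}

def choice_likelihood(reward, choice, options, beta=2.0):
    # Boltzmann-rational model: the human picks an option with
    # probability proportional to exp(beta * reward). This allows
    # for occasional "irrational" choices rather than assuming
    # the human is a perfect optimiser.
    z = sum(math.exp(beta * reward[o]) for o in options)
    return math.exp(beta * reward[choice]) / z

def posterior(observations, prior=0.5):
    # Bayesian update over the two hypotheses from observed choices.
    p = {"values_health": prior, "values_taste": 1.0 - prior}
    for choice in observations:
        for h in p:
            p[h] *= choice_likelihood(rewards[h], choice, ["salad", "cake"])
        total = sum(p.values())
        p = {h: v / total for h, v in p.items()}
    return p

# Mostly salad with the odd cake: the AI becomes confident the
# human values health, without anyone ever supervising it.
print(posterior(["salad", "salad", "cake", "salad"]))
```

Even this toy shows the worry raised elsewhere in the thread: the inference is only as good as the behaviour it observes, so an AI watching people who act against their own interests will confidently learn the wrong values.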
I'm generally surprised at the lack of discussion of IRL in the community. When one of the world leaders in AI says a particular approach in AI alignment is our best hope, shouldn't we listen to them?
Maybe an obvious point, but I think we shouldn't lose sight of the importance of providing EA funding for catastrophe-preventing interventions, alongside attempts to influence government. Attempts to influence government may fail / fall short of what is needed / take too long given the urgency of action.
Especially in the case of pure longtermist goods, we need to ensure the EA/longtermist movement has enough money to fund things that governments won't. Should we just get on with developing refuges ourselves?
I have a post where I address what I see as misconceptions about longtermism. In response to "Future people count, but less than present people", I would recommend you read the "Longtermists have to think future people have the same moral value as people today" section. In short, I don't think future people counting for less dents longtermism much at all, as it isn't reasonable to discount that heavily. You seem to accept that we can't discount that much, so if you accept the other core claims of the argument, longtermism will still go through. A moderate discount on future people is pretty irrelevant in my opinion.
I want to read that Thorstad paper, and until I do I can't really respond. I would say, however, that even if the expected number of people in the future isn't as high as many longtermists have claimed, it still has to be at least somewhat large, and large enough to mean GiveWell charities that focus on near-term effects aren't the best we can do. One could imagine being a 'medium-termist' and wanting to, say, address climate change and boost economic growth, which affect the medium and long term. Moving to GiveWell would seem to me to be overcorrecting.
The assumption that future people will be happy isn't required for longtermism (as you seem to imply). The value of reducing extinction risk does depend on future people being happy (or at least above the zero level of wellbeing), but there are longtermist approaches that don't involve reducing extinction risk. My post touches on some of these in the "Sketch of the strong longtermist argument" section: for example, mitigating climate change, ensuring good institutions develop, and ensuring AI is aligned to benefit human wellbeing.
You say that some risks such as those from AGI or biological weapons are "less empirical and more based on intuitions or unverifiable claims, and hence near-impossible to argue against". I think one can argue against these risks. For example, David Thorstad argues that various assumptions underlying the singularity hypothesis are substantially less plausible than its advocates suppose, arguing that this should allay fears related to existential risk from AI. You can point out weaknesses in the arguments for specific existential risks, it just takes some effort! Personally I think the risks are credible enough to take them seriously, especially given how bad the outcomes would be.