Cross-posted from my blog.
Contrary to my carefully crafted brand as a weak nerd, I go to a local CrossFit gym a few times a week. Every year, the gym raises funds for a scholarship for teens from lower-income families to attend their summer camp program. I don’t know how many CrossFit-interested low-income teens there are in my small town, but I’ll guess there are perhaps 2 of them who would benefit from the scholarship. After all, CrossFit is pretty niche, and the town is small.
Helping youngsters get swole in the Pacific Northwest is not exactly as cost-effective as preventing malaria in Malawi. But I notice I feel drawn to supporting the scholarship anyway. Every time it pops into my head I think, “My money could fully solve this problem”. The camp only costs a few hundred dollars per kid, and if there are just 2 kids who need support, I could give $500 and there would no longer be teenagers in my town who want to go to a CrossFit summer camp but can’t. Thanks to me, the hero, this problem would be entirely solved. 100%.
That is not how most nonprofit work feels to me.
You are only ever making small dents in important problems
I want to work on big problems. Global poverty. Malaria. Everyone not suddenly dying. But if I’m honest, what I really want is to solve those problems. Me, personally, solve them. This is a continued source of frustration and sadness because I absolutely cannot solve those problems.
Consider what else my $500 CrossFit scholarship might do:
* I want to save lives, and USAID suddenly stops giving $7 billion a year to PEPFAR. So I give $500 to the Rapid Response Fund. My donation solves 0.000007% of the problem and I feel like I have failed.
* I want to solve climate change, and getting to net zero will require stopping or removing emissions of 1,500 billion tons of carbon dioxide. I give $500 to a policy nonprofit that reduces emissions, in expectation, by 50 tons. My donation solves 0.000000003% of the problem and I feel like I have failed.
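For anyone who wants to check the arithmetic, here is the back-of-the-envelope division behind those percentages, using only the rough figures quoted above (and ignoring any differences in cost-effectiveness between the funds):

```python
# Back-of-the-envelope check of the percentages above.
donation = 500                 # dollars
pepfar_gap = 7e9               # dollars per year that USAID stopped giving
print(f"{donation / pepfar_gap:.6%}")                  # ~0.000007% of the PEPFAR gap

emissions_to_cut = 1_500e9     # tons of CO2 to stop or remove for net zero
expected_reduction = 50        # tons, in expectation, from the $500 policy donation
print(f"{expected_reduction / emissions_to_cut:.9%}")  # ~0.000000003% of net zero
```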
Risks/P-doom:
• P(doom) went down as a result of the dramatic shift in the Overton window. I'd call it a moderate, but not massive, update, because timelines are looking shorter as well.
• No longer worried about the possibility that we (the AI Safety community) are all essentially a bunch of "cranks", given the support Geoffrey Hinton and Yoshua Bengio have voiced for the importance of addressing these concerns.
• Updated towards being more worried about risks such as AI-supported bioterrorism, cyber-attacks and manipulation. There have been results that make me feel we're on the verge of these becoming real issues, and "slow" takeoff is looking more likely, so there would be a greater cost to just tanking them. I still see alignment as the most important issue to focus on.
• In terms of outer alignment: Due to ChatGPT, much more optimistic about training an AI to behave reasonably in normal situations using RLHF; also more optimistic about using such techniques to tell AIs to behave conservatively in weird philosophical thought experiments. My main remaining worry related to outer alignment is figuring out corrigibility, lest we produce an AI that works well in the current context but can't adapt to new circumstances.
Timelines:
• Long-timeline worlds feel less likely. Even though some of the capabilities of ChatGPT/GPT-4 are not that surprising to people who were following capability progress closely, the longer things continue as they are, the less time there is for an unexpected slow-down to occur before we hit AGI.
• More optimistic about evals work delivering value, largely due to governments' and companies' openness to it.
Governance:
• Far more optimistic about policy than before due to the opening up of the Overton Window.
• More complicated feelings about a pause as a result of the AI Pause Debate: I now understand that the logistics of making a pause net-positive would be much more complicated than I first realised. I think that pushing for a pause to be part of the public conversation/one of the options considered, while not completely risk-free, is a pretty strong bet.
• Became a huge fan of the Tony Blair Institute's proposal for the UK to create an organisation called Sentinel, which would perform research to help figure out AI policy.
• More worried about the threat of open-source AI given how fast it is catching up with GPT-4 and Facebook's decision to champion open-source.
Groups:
• Less confident in Sam Altman's leadership of OpenAI.
• More worried about e/acc; previously I thought they were so unimportant that we should just ignore them rather than risk amplifying their profile.
• More optimistic about allying with people concerned about near-termist risks where our interests align (largely due to the impact of the FLI letter).
Technical alignment:
• More optimistic about the value of empirical alignment research and less optimistic about the value of agent foundations research.
• I now feel the field is mature enough that "workhorse" researchers can make a significant contribution (vs. before when creativity to discover new research directions seemed more vital).
• More optimistic about approaches that take advantage of the linearity in neural networks.
• More optimistic about interpretability progress (due to a number of results, including dictionary learning resolving superposition).
• I spent a lot of time this year trying to read up about as many alignment proposals as possible. I now think it would have been better for me to have spent less time doing this and to have spent more time focusing on doing concrete work.
Movement-building:
• Movement-building work to grow the pool of applicants to programs like SERI MATS seems less important because these programs are much more competitive these days. It may be better to increase mentorship opportunities or to focus on people with more research or AI experience.
It looks like outer alignment is actually more difficult than I thought. Sherjil Ozair, a former DeepMind employee, writes:
"From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining reward models to make it not exploitable. Otherwise you get nonsensical responses which have exploited the idiosyncracy of your preferences data. There is a reason few labs have done RLHF successfully"
In other words, even though we look at things like ChatGPT and go, "Wow, this is surprisingly aligned, I guess alignment is easier than we thought", we don't see all of the hard work that had to go into making it aligned. And perhaps as AIs become more powerful, the amount of work required to align them will exceed what is humanly possible.
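To make that failure mode concrete, here is a minimal toy sketch in NumPy. Everything in it (the dimensions, the quadratic "extremity penalty", the Bradley-Terry-style fit, the numbers) is invented for illustration and is nothing like the actual Gemini pipeline; it just shows the general Goodhart dynamic Ozair is pointing at: a reward model fit from limited pairwise preferences over moderate outputs misses the penalty for extreme outputs, so optimising hard against it sends the proxy reward up while the true reward collapses.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
true_w = rng.normal(size=DIM)

def true_reward(x):
    # People like the direction true_w but dislike extreme, nonsensical outputs.
    return true_w @ x - 0.5 * x @ x

# 1. Pairwise preferences over *moderate* responses, with a little label noise,
#    so the extremity penalty is nearly invisible in the training data.
pairs = rng.normal(scale=0.3, size=(200, 2, DIM))
prefer_first = np.array([
    true_reward(a) + rng.normal(scale=0.1) > true_reward(b) + rng.normal(scale=0.1)
    for a, b in pairs
])

# 2. Fit a linear (Bradley-Terry style) reward model on those comparisons.
diffs = np.where(prefer_first[:, None],
                 pairs[:, 0] - pairs[:, 1],
                 pairs[:, 1] - pairs[:, 0])
w_hat = np.zeros(DIM)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-np.clip(diffs @ w_hat, -30, 30)))
    w_hat += 0.1 * ((1.0 - p) @ diffs) / len(diffs)

# 3. "Policy optimisation": gradient ascent on the proxy reward, where more
#    steps stands in for more optimisation pressure / a more capable model.
x = np.zeros(DIM)
for step in range(1, 201):
    x += 0.05 * w_hat  # gradient of the linear proxy reward w_hat @ x
    if step in (1, 10, 50, 200):
        print(f"step {step:3d}  proxy={w_hat @ x:8.2f}  true={true_reward(x):8.2f}")
# The proxy reward keeps climbing; the true reward peaks and then collapses,
# because the reward model never learned the penalty for extreme outputs.
```

The fix in the toy is the same one Ozair describes for the real thing: keep collecting fresh preferences around the outputs the policy currently produces and refit the reward model, so the proxy stays accurate in the region being exploited.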
May I ask what your feelings on a pause were beforehand?
I think I was likely 90% in favour of a 6-month pause, mostly as a way to wake people up. I guess my main update from that debate was the difficulty of actually implementing a pause.
If you decide to start pushing for a pause, you don't get nuanced control over when the pause occurs (you likely have to start pushing for it at least a couple of years ahead of when you want it to occur). Further, it's quite likely that you accidentally reduce the amount of crunch time by reducing the gap between the leading players and the rest. If this happens, a pause would likely be net-negative.
For an indefinite pause, it's unclear that you'll be able to unpause when necessary to avoid someone else front-running you, particularly because you might have to make alliances with people who will want to keep it paused.
So while it may still be worth pausing, it's very hard to get the details right so that it is robustly net-positive.
My p(doom) went down slightly (from around 30% to around 25%), mainly as a result of how GPT-4 caused governments to begin taking AI seriously in a way I didn't predict. My timelines haven't changed - the only capability increase of GPT-4 that really surprised me was its multimodal nature. (Thus, governments waking up to this was a double surprise, because it clearly surprised them in a way that it didn't surprise me!)
I'm also less worried about misalignment and more worried about misuse when it comes to the next five years, due to how LLMs appear to behave. It seems that LLMs aren't particularly agentic by default, but can certainly be induced to perform agent-like behaviour - GPT-4's inability to do this well seems to be a capability issue that I expect to be resolved in a generation or two. Thus, I'm less worried about the training of GPT-N but still worried about the deployment of GPT-N. It makes me put more credence in the slow takeoff scenario.
This also makes me much more uncertain about the merits of pausing in the short-term, like the next year or two. I expect that if our options were "Pause now" or "Pause after another year or two", the latter would be better. In practice, I know the world doesn't work that way, and slowing down AI now likely slows down the whole timeline, which complicates things. I still think that government efforts like the UK's AISI are net-positive (I'm joining them for a reason, after all), but I think a lot of the benefit to reducing x-risk here is building a mature field around AI policy and evaluations before we need it - if we wait until I think the threat of misaligned AI is imminent, that may be too late.