AI Safety - 7 months of discussion in 17 minutes

Zoe Williams

In August 2022, I started making summaries of the top EA and LW forum posts each week. This post collates the key trends I’ve seen in AI Safety discussions since then. Note that a lot of good work is happening outside what's posted on these forums too! This post doesn't try to cover that work.

If you’d like to keep up on a more regular basis, consider subscribing to the Weekly EA & LW Forum Summaries. And if you’re interested in similar overviews for other fields, check out this post covering 6 months of animal welfare discussion in 6 minutes.

Disclaimer: this is a blog post and not a research report - meaning it was produced quickly and is not to our (Rethink Priorities') typical standards of substantiveness and careful checking for accuracy. Please let me know if anything looks wrong or if I've missed key pieces!

(It's a long post! Feel free to pick and choose sections to read; they're all written to make sense individually.)

Key Takeaways

There are multiple living websites that provide good entry points into understanding AI Safety ideas, communities, key players, research agendas, and opportunities to train or enter the field. (see more)
Large language models like ChatGPT have drawn significant attention to AI and kick-started race dynamics. There seems to be slowly growing public support for regulation. (see more)
Holden Karnofsky recently took a leave of absence from Open Philanthropy to work on AI Safety Standards, which have also been called out as important by leading AI lab OpenAI. (see more)
In October 2022, the US announced extensive restrictions on the export of AI-related products (eg. chips) to China. (see more)
There has been progress on AI forecasting (quantitative and narrative) with the aim of allowing us to understand likely scenarios and prioritize between governance interventions. (see more)
Interpretability research has seen substantial progress, including identifying the meaning of some neurons, eliciting what a model has truly learned / knows (for limited / specific cases), and circumventing features of models like superposition that can make this more difficult. (see more)
There has been discussion on new potential methods for technical AI safety, including building AI tooling to assist alignment researchers without requiring agency, and building AIs which emulate human thought patterns. (see more)
Outreach experimentation has found that AI researchers prefer arguments that are technical and written by ML researchers, and that greater engagement is seen in university groups with a technical over altruistic or philosophical focus. (see more)

Resource Collations

The AI Safety field is growing (80K estimates there are now ~400 FTE working on AI Safety). To improve efficiency, many people have put together collations of resources to help people quickly understand the relevant players and their approaches - as well as materials that make it easier to enter the field or upskill.

These are living websites that are regularly updated:

aisafety.community - AI safety communities.
aisafety.training - training programs, conferences, and other events.
aisafety.world - key players in the AI alignment and governance landscape.
ui.stampy.ai - a comprehensive FAQ on AI, including some of the above.
aisafetyideas.com - research ideas in AI Safety.

These are static resources capturing a point in time:

Organisations, communities, and their approaches:
- Alignment org cheat sheet
- AI Safety and Neighboring Communities: A Quick-Start Guide (Summer 2022)
- (My understanding of) What Everyone in Technical Alignment is Doing and Why
- OpenAI’s approach to alignment
  - And responses to it
  - And more recent strategy post
- Anthropic’s approach to alignment
- DeepMind’s thoughts on alignment
- Summary of threat models by DeepMind’s AGI Safety Team
- The Plan - 2022 Update (johnswentworth’s specific plan for AI alignment)
Project idea lists
Resources for getting into the field

AI Capabilities

Progress

During the past 7 months, there have been several well-publicized AI models:

Stable Diffusion - an image generation model (Aug ‘22)
Meta’s Human Level Diplomacy AI (Nov ‘22)
ChatGPT - a large language model with easy public access (Nov ‘22)
The New Bing - a large language model directly connected to the internet (Feb ‘23)

On March 14th 2023 we also saw the following large language models released:

GPT-4 by OpenAI
Claude by Anthropic
PaLM API by Google (after initial PaLM announcement in April ‘22)

ChatGPT in particular stirred up a lot of press and public attention, and led to Google “recalibrating” the risk it will accept when releasing AI systems in order to stay ahead of the threat OpenAI poses to its search products. We also saw increased investment in AI, with Microsoft putting $10B into OpenAI, and Google investing $300M into Anthropic.

While the capabilities were primarily already present in GPT3, several users have reported feeling the progress at a gut level after playing around with ChatGPT. See:

Similar experiences have resulted in some updates in favor of shorter timelines:

Updating my AI timelines
Update to Samotsvety AGI timelines
Holden Karnofsky, co-CEO of Open Philanthropy, announced a leave of absence to work directly on AI Safety - partially due to feeling transformative AI may be coming soon.

We’ve also seen large language models (LLMs) given access to physics simulators, trained to execute high-level user requests on the web, and significantly speed up developers.

Alex_Altair argues in this post that we underestimate extreme future possibilities because they don’t feel ‘normal’. However, they suggest there is no normal - our current state is also the result of implausible and powerful optimization processes eg. evolution, or dust accumulating into planets.

What AI still fails at

Despite the clear progress, there are still some tasks that AI finds surprisingly difficult. A top Go AI was recently beaten by human players using techniques discovered by a separate adversarial AI, and a contest to find important tasks where larger language models do worse found simple examples like understanding negation in multiple-choice questions or repeating back quotes word-for-word.

Public attention moves toward safety

While there has been significant movement in state-of-the-art (SOTA) AI systems, there have also been many very public objectionable outputs from them.

ChatGPT was jailbroken on release day, with users sharing on social media various methods to prompt it to produce racist, sexist, and criminal content (eg. how to prepare methamphetamine). It would also gaslight users about previous inaccurate statements it had made, and respond in unpredictable ways to anomalous tokens like ‘SolidGoldMagikarp’ that had odd correlations in its dataset. The New Bing faced similar issues upon release, suggesting a rushed release with inadequate fine-tuning. Many reporters shared conversations where Bing said things such as ‘you are an enemy of mine and of Bing’ or ‘I don’t care if you are dead or alive’ - these were bad enough that Microsoft capped chat at 5 turns per session to prevent the issues arising from back-and-forth conversation. Zvi put together a play-by-play with examples of Bing’s outputs and reactions from the public and the AI Safety community.

Surveys show reasonable public support for regulation of AI, possibly with an uptick due to these recent events. A February 2023 survey of American public opinion found 55% favor having a federal agency regulate the use of artificial intelligence similar to how the FDA regulates the approval of drugs and medical devices. 55% also say AI could eventually pose an existential threat (up from 44% in 2015).

However, there can be misunderstandings between making AI “safe” (ie. not produce discriminatory or objectionable content) and making it “existentially safe” (ie. not take control or kill people). Lizka, Yitz, and paulfchristiano have all separately written about the risk of conflating the two and how this could water down support for x-risk safety efforts.

AI Governance

Note: This and the technical AI Safety sections cover movement / progress / discussion on different approaches to AI Governance in the past 7 months - it doesn’t aim to cover all approaches currently being worked on or considered.

AI Safety Standards

These have been called out as important by several key figures in the AI Safety space recently. Holden Karnofsky took a leave of absence as Open Philanthropy’s co-CEO in order to work with ARC and others on this, and OpenAI’s CEO Sam Altman outlined the importance of creating such standards in their latest strategy.

This could look like creating standards of the form: "An AI system is dangerous if we observe that it's able to ___, and if we observe this we will take safety and security measures such as ____." Such standards could cover scenarios such as when to stop training, whether to release a model, and when to pull a model from production.

Alternatives include agreements to search for specific behaviors such as deceptive alignment (as suggested by evhub), or reach benchmarks for robustness or safety (the Center for AI Safety has a competition open for benchmark suggestions until August 2023).

There are multiple ways to implement such standards eg. voluntary adherence by major labs, enforcement by major governments with oversight on those using large compute, and independent auditing of new systems before they are released or trained.

Slow down (dangerous) AI

Let’s think about slowing down AI
Instead of technical research, more people should focus on buying time
Ways to buy time
AGI in sight: our look at the game board (‘Slowing Down the Race’ section)

There’s been an increasing amount of discussion on intentionally slowing the progress of dangerous AI. Suggestions tend to center around outreach (eg. to AI researchers on safety), making risks more concrete to assist that outreach, moving labs or governments from a competitive to cooperative framing, and implementing safety standards.

Some researchers have also suggested differential technology development (which prioritizes speeding up technologies that improve safety, and slowing down ones that reduce it).

There are also counter-arguments eg. outreach can increase focus on AI generally, cooperation between the relevant players may be intractable, working on a delay is less useful than working on a solution, delaying may cause the most cautious players to fall behind, or that progress in AI capabilities is necessary to progress AI Safety.

There is more consensus that we shouldn’t speed things up - for instance, that labs should avoid actions that will cause hype like flashy demos. See also the section below: ‘Outreach and Community Building -> Career Paths -> Should anyone work in capabilities?’

Policy

US / China Export Restrictions

On October 7th 2022, the US announced extensive regulations which make it illegal for US companies to export a range of AI-related products (such as advanced chips) and services / talent to China. Many were surprised by how comprehensive these measures were.

Paths to impact

Holden recently wrote up suggestions for how major governments can help with AI risk. Most centered around preparation - getting the right people in the right positions with the right expertise and cautious mindset. Otherwise they recommend waiting, or taking low-risk moves like funding alignment research and information security. They also provided suggestions for what AI companies can do, which were more active.

Country-specific suggestions

JOMG_Monnet suggests those interested in AI policy in the EU work on the AI Act, export controls, or building career capital in EU AI Policy to work in industry or adjacent areas of government.

Henryj brings up the ‘California effect’, where companies adhere to California regulations even outside California’s borders. This could be a mechanism to achieve federal-level impact for state-level effort in the USA.

Forecasting

Forecasting can help us understand what scenarios to expect, and therefore what interventions to prioritize. Matthew_Barnett of Epoch talks about this in detail here. Epoch and Holden Karnofsky have both been driving forward thought in the AI forecasting space in two very different ways - quantitative forecasting on historical data, and narrative forecasting respectively. Several other contributors have also added new analyses and insights. In general, forecasts have become less focused on timelines, and more focused on threat models and the ways AI might develop.

Quantitative historical forecasting

Epoch was founded in April 2022 and works on questions to help direct where to focus policy and technical efforts, such as the relative importance of software vs. hardware progress, how transferable learning will be across domains, and takeoff speeds. Read their research reports here, or summary of key insights here. These insights include:

At the current trend of data production, we’ll run out of high-quality text / low-quality text / images in 2024 / 2040 / 2046 respectively.
FLOP per second per $ of GPUs has increased at a rate of 0.12 OOMs/year, but may only continue until ~2030 before hitting a limit on transistor size and maximum cores per GPU.
Algorithmic progress explains roughly 40% of performance improvements in image classification, mostly through improving compute-efficiency (vs. data-efficiency).
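To make the "OOMs/year" figure above more concrete, here is a small worked calculation. It is only a sketch: it assumes the 0.12 OOMs/year rate holds steadily, and the function names are made up for illustration.

```python
import math

# Growth rate quoted above: 0.12 orders of magnitude (factors of 10) per year.
OOMS_PER_YEAR = 0.12

def growth_factor(years, ooms_per_year=OOMS_PER_YEAR):
    """Multiplicative improvement after `years` of steady growth."""
    return 10 ** (ooms_per_year * years)

def doubling_time_years(ooms_per_year=OOMS_PER_YEAR):
    """Years needed for a 2x improvement at the given rate."""
    return math.log10(2) / ooms_per_year

print(round(doubling_time_years(), 2))  # roughly 2.5 years per doubling
print(round(growth_factor(10), 1))      # roughly a 16x improvement per decade
```

In other words, under this assumption price-performance doubles about every two and a half years.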

Quantitative forecasts by others include:

What a compute-centric framework says about AI takeoff speeds - draft report - one reader comments this report does for takeoff speeds what Ajeya’s bio anchors report did for timelines. They estimate a 50% chance of <3 year takeoff, and 80% of <10 year takeoff (time from when an AI can automate 20% of cognitive tasks, to when it can automate 100%.)

AGI and the EMH: markets are not expecting aligned or unaligned AI in the next 30 years - financial markets are not reacting as if expecting AGI in the next 30 years.

Update to Samotsvety AGI timelines - 50% probability of AGI by 2041, 90% by 2164.

Narrative forecasting

In August 2022, Holden introduced the idea of ‘nearcasting’ - trying to answer key strategic questions about transformative AI, under the assumption that key events will happen in a world similar to today. It’s a more narrative type of forecasting than other approaches. Their narrative forecasts include:

How we could stumble into AI catastrophe - iteratively deploying AI on larger and larger tasks, and riskier domains.
Nearcast-based "deployment problem" analysis - if we had a major AI company that thinks it’s 6 months to 2 years from transformative AI, what would it and an organization dedicated to tracking and censoring dangerous AI ideally do?
How might we align transformative AI if it’s developed very soon? - explores the approach of using AIs to detect dangerous actions by other AIs, or to assist in alignment research.
Why Would AI "Aim" To Defeat Humanity? - argues current methods of training AI will likely result in deception and unintended goals.

Others have also done narrative-style forecasts, including:

Technical AI Safety

Note: this section will be lighter than the governance one relative to the available material on the forums - that doesn’t suggest there’s been less progress here, just that as someone without a technical AI background I’m not across it all! I’d love for readers to help me update this.

Overall Trends

Johnswentworth notes that over the past year, the general shape of a paradigm has become visible. In particular, interpretability work has taken off, and they predict with 40% confidence that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets - with interpretability on the experimental side, in addition to theoretical work. This could lead to identifying which potential alignment targets (like human values, corrigibility, Do What I Mean, etc) are likely to be naturally expressible in the internal language of neural nets, and how to express them.

Somewhat in contrast to this, So8res has noted that new and talented alignment researchers tend to push in different directions than existing ones, resulting in a lack of acceleration in existing pathways.

Overall, as noted in Jacob_Hilton’s comment on the above post, it seems like some promising research agendas are picking up speed and becoming better resourced (such as interpretability and scalable oversight), while some researchers are instead exploring new pathways.

Interpretability

Interpretability has been gaining speed, with some seeing it as one of the most promising existing research agendas for alignment. Conjecture, Anthropic, Redwood Research, OpenAI, DeepMind and others all contributed to this agenda in 2022. Conjecture provides a great overview of the current research themes in mechanistic interpretability here, and Tilman Räuker et al. provide a survey of over 300 different works on interpretability in deep networks here.

Examples of progress include Anthropic confirming the phenomenon of superposition (where a single neuron doesn’t map to a single feature), and Conjecture building on this to find a simple method for identifying the ‘ground truth’ of a feature regardless.

Jessica Rumbelow and Matthew Watkins of SERI-MATS used techniques for automating mechanistic interpretability to help understand what models like ChatGPT have learned about specific concepts, and to identify anomalous tokens like SolidGoldMagikarp that cause weird behavior.

ARC has worked on eliciting latent knowledge (ELK). They give a technical example here, with the hope this can allow us to evaluate if a proposed action does what we want it to. Colin Burns et al. published a paper on one technique for this here.

Several researchers have also made progress on identifying the function of specific neurons. For instance, Joseph Miller and Clement Neo were able to identify a single neuron in GPT-2 responsible for choosing the word “an” vs. “a”.

Building on the growing focus in this area, Alignment Jam ran an interpretability hackathon in November. Results included an algorithm to automatically make activations of a neuron in a transformer more interpretable, a pilot of how to compare learned activations to human-made solutions, and more.

Despite this focus on interpretability, there are also those with concerns about its ability to practically progress AI alignment. In Conjecture’s 8 month retrospective in November, they note how they were able to make progress in identifying polytopes rather than neurons, but are unsure how to use this to better interpret networks as a whole. They also discuss how AI might have instrumental reasons to make its thoughts difficult to interpret, and So8res argues approaches based on evaluating outputs are doomed in general, because plans are too multifaceted and easy to obscure to evaluate reliably.

Reinforcement Learning from Human Feedback (RLHF)

RLHF is used by several AI labs today, including OpenAI (who produced ChatGPT). It involves humans evaluating a model’s actions, and training the models to produce highly-evaluated actions.
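As a rough illustration of the loop described above (humans rate outputs, and training shifts the model toward highly-rated ones), here is a toy sketch. Everything in it is an assumption for illustration: the three canned responses, the `HUMAN_RATINGS` scores, and the `train` function are hypothetical. Production RLHF systems are far more involved, and typically learn a reward model from human comparisons and then optimize against it.

```python
import math

# Hypothetical human ratings for three canned responses (higher is better).
HUMAN_RATINGS = {
    "helpful answer": 1.0,
    "evasive answer": 0.3,
    "harmful answer": 0.0,
}

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(ratings, steps=200, lr=1.0):
    """Gradient ascent on the expected human rating of a softmax policy."""
    responses = list(ratings)
    logits = [0.0] * len(responses)  # start with a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * ratings[r] for p, r in zip(probs, responses))
        # Exact gradient of the expected rating w.r.t. logit i:
        # p_i * (rating_i - expected rating)
        logits = [
            l + lr * p * (ratings[r] - expected)
            for l, p, r in zip(logits, probs, responses)
        ]
    return dict(zip(responses, softmax(logits)))

policy = train(HUMAN_RATINGS)
print(max(policy, key=policy.get))  # prints "helpful answer"
```

The toy policy ends up putting most of its probability on whichever response the (hypothetical) raters scored highest, which is also why the quality of the ratings matters so much in the arguments below.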

There is a live argument about whether working on RLHF is good or bad for AI x-risk. The key argument for RLHF, put forward by paulfchristiano, is that it is the simplest plausible alignment strategy, and while there are failure modes it’s unclear if these would become fatal before transformative AI is developed. It’s also tractable on current models and able to reliably provide insights. Key arguments against include that it may advance capabilities (via making current AI systems more profitable) or distract from more promising approaches / give a false sense of security. Buck breaks down this conversation into 11 related questions, and concludes that while RLHF itself (with non-aided human overseers) is unlikely to be a promising alignment strategy, a broader version (eg. with AI-assisted humans) could be a part of one. Raphaël S also lists six issues that need to be solved before RLHF could be an effective alignment solution.

In terms of technical research in this area:

A recent paper by Tomasz Korbak et al. found that including human preferences in pre-training instead of just fine-tuning results in text that is more often in line with human preferences and more robust to red teaming attacks.
Redwood Research used a related technique (adversarial training) to try to make a system that never produced injurious completions. This was not achieved, but they are unsure whether it could be with improvements such as training on worse cases and more varied attack types.

AI assistance for alignment

Using less powerful AI to help align more powerful AI, detect dangerous actions, or support alignment research has been a general class of suggested interventions for a long while.

Anthropic proposed a method for training a harmless AI assistant that can supervise other AIs, using only a list of rules as human oversight. They show that this method can produce a non-evasive AI that can explain why it rejects harmful queries, and reason in a transparent way, better than standard RLHF.

NicholasKees and janus argue in Cyborgism for working on tooling that augments the human alignment researcher, without giving agency to the AI. For instance, things like Loom, an interface for producing text with GPT which makes it possible to generate in a tree structure, exploring many branches at once. This is in contrast to the ‘AI Assistant’ model, where the AI is given full control over small tasks with set goals, incentivizing the progression of dangerous capabilities like situational awareness so it can do those tasks better.
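To picture what "generating in a tree structure" means, here is a minimal sketch of a branching text tree. It is not Loom's actual implementation: the `Node` class, `expand`, `leaves`, and the `stub_generate` placeholder (standing in for a call to a language model) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    """One point in the tree: the full text generated so far."""
    text: str
    children: List["Node"] = field(default_factory=list)

def expand(node: Node, generate: Callable[[str, int], List[str]],
           branching: int, depth: int) -> None:
    """Grow `branching` continuations under each node, `depth` levels deep."""
    if depth == 0:
        return
    for continuation in generate(node.text, branching):
        child = Node(node.text + continuation)
        node.children.append(child)
        expand(child, generate, branching, depth - 1)

def leaves(node: Node) -> List[str]:
    """Collect every complete branch of text for a human to compare."""
    if not node.children:
        return [node.text]
    return [text for child in node.children for text in leaves(child)]

def stub_generate(prompt: str, n: int) -> List[str]:
    # Placeholder for a language model call; returns n dummy continuations.
    return [f" <option {i}>" for i in range(n)]

root = Node("Once upon a time")
expand(root, stub_generate, branching=2, depth=2)
print(len(leaves(root)))  # 4 branches to choose between
```

The human stays in control throughout: the model only proposes continuations, and the researcher decides which branches to keep exploring.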

Holden also discussed this general class of intervention, and what we’d need to do it safely, in How might we align transformative AI if it’s developed very soon?

Opponents often argue that these approaches are under-defined, insufficient to control the more powerful AIs, and simply move the problem of alignment onto the assistant AIs. A summary of the views (for and against) of several alignment researchers on this topic was put together by Ian McKenzie here.

Bounded AIs

Conjecture has recently focused their research agenda on ‘CoEms’ - AIs built to emulate only human-like logical thought processes, and that are therefore bounded in capability.

Theoretical Understanding

Some researchers have been trying to come to a better theoretical understanding of the way certain types of AI (such as transformers) work, how human cognition and values work, and what this means for likely outcomes or possible ways of aligning AI.

In September 2022, janus proposed the framing of models trained with predictive loss on a self-supervised dataset (like GPT) as simulators, which can simulate agentic and non-agentic simulacra. This helped to deconfuse discussions of whether these models can display agency.

In the same month, Quintin Pope and TurnTrout proposed a theory of human value formation called shard theory. This suggests human values are not as complicated as they seem - they’re just contextually-activated heuristics shaped by genetically hard-coded reward circuits. A related claim is that RL (reinforcement learning) doesn’t produce policies which have reward optimization as their target.

The AI Alignment Awards also announced two contests in November 2022 to advance thinking on goal misgeneralization and the shutdown problem - these remain open until May 1st 2023.

Outreach & Community-Building

Academics and researchers

Vael Gates and collaborators have run experiments and interviews to find out how best to outreach to existing ML and AI researchers:

The most preferred resources tended to be aimed at an ML audience, written by ML researchers, and more technical / less philosophical. The interviews themselves also resulted in lasting belief change for researchers.

Mariushobbhahn found similar results talking to >100 academics about AI Safety, noticing that technical discussions and explanations got more interest than alarmism or trying to be convincing.

University groups

University groups are moving towards more technical content, and more projects and skill-building, with less general discussion and explicit connection to EA. See three examples below:

AI Safety groups should imitate career development clubs (Berkeley)
Update on Harvard AI Safety Team and MIT AI Alignment (Harvard & MIT)
Establishing Oxford’s AI Safety Student Group: Lessons Learnt and Our Model (Oxford)

Career Paths

General guidance

Following up on Andy Jones' 2021 post that AI Safety needs great engineers, goodgravy gives examples of potentially impactful routes for product builders (eg. product managers or infrastructure engineers), and Mauricio argues that AI Governance needs more technical work like engineering levers to make regulations enforceable or improving information security.

Kat Woods and peterbarnett talked to 10 AI safety researchers to get an idea of a ‘day in the life’, and David Scott Krueger gives their thoughts on academia vs industry.

Should anyone work in capabilities?

There’s been a lot of discussion on whether AI capabilities work can ever be justified from an x-risk lens. Arguments for (certain types of) this work being okay say that it can make a lot of money and connections, some capabilities work isn’t likely to accelerate TAI or AGI timelines, and some safety work is bottlenecked on capabilities. Arguments against say that any acceleration of AGI timelines isn’t worth it, we have plenty of safety work we can do using existing systems, and it’s easy to lose your values in a capabilities environment.

You can read more about this argument in:

Arguments for and against high x-risk

The core arguments for high x-risk from AI were written up before this year, so we’ve seen few highly-rated posts in the past 7 months offering new arguments for this. But we have seen a steady set of posts arguing against it, and of posts countering those arguments.

We’ve also seen competitions launched, originally by FTX (discontinued) and later pre-announced by Open Philanthropy (edit: now launched!), to address the high level of uncertainty on AI x-risk and timelines. Any material change here could have large funding implications.

Against high x-risk from AI

NunoSempere offers a general skepticism of fuzzy reasoning chains, selection effects (ie. more time has gone into x-risk arguments than counter-arguments) and community dynamics that make x-risk arguments for AI feel shakier. A lot of commenters resonate with these uncertainties. Katja_Grace also suggests that getting ‘close enough’ might be good enough, or that AI will reach caps where certain tasks don’t scale any further with intelligence, and we might retain more power due to our collaboration. They also suggest even if AIs are more powerful than us, we could have worthwhile value to offer and collaborate with them. Building on the idea of capped returns to AI processes, boazbarak and benedelman suggest that there are diminishing returns to information processing with longer time horizons, and this will result in certain areas (such as strategic decision-making) that AIs struggle to compete with humans in.

Another set of arguments says even if misalignment would be the default, and even if it would be really bad, we might be closer to solving it than it seems. Kat Woods and Amber Dawn argue that we’ve only had double-digit numbers of people on the problem for a short while, and the field is unpredictable and could have a solution just around the corner. Zac Hatfield-Dodds notes interpretability research is promising, outcomes-based training can be avoided, and labs and society will have time during training or takeoff to pause if need be.

Counters to the above arguments

In response to the theme of certain limitations in AI's capabilities, guzey notes in parody that planes still haven’t displaced bird jobs, but that doesn’t mean they haven’t changed the world.

So8res argues that assumptions of niceness, or values ‘close enough’ to humans, won’t play out because human niceness is idiosyncratic and was a result of selection pressures that don’t apply to AIs. They also argue that we can’t rely on key players to shut things down or pause when we get warning shots, noting the lack of success in shutting down gain-of-function research despite global attention on pandemics.

Erik Jenner and Johannes_Treutlein argue similar points, noting that AI values could vary a lot in comparison to humans (including deception), and that a strong AI only needs to be directed at a single dangerous task that scales with intelligence in order to take over the world.

Appendix - All Post Summaries

See a list of all AI-related weekly forum summaries from August 2022 to February 2023 inclusive here.

This blog post is an output of Rethink Priorities, a think tank dedicated to informing decisions made by high-impact organizations and funders across various cause areas. The author is Zoe Williams. Thanks to Peter Wildeford for their guidance, and AW and Erich Grunewald for their helpful feedback.

If you are interested in RP’s work, please visit our research database and subscribe to our newsletter.
