
As part of our Summer 2024 Program, MATS ran a series of discussion groups focused on questions and topics we believe are relevant to prioritizing research into AI safety. Each weekly session focused on one overarching question, and was accompanied by readings and suggested discussion questions. The purpose of running these discussions was to increase scholars’ knowledge about the AI safety ecosystem and models of how AI could cause a catastrophe, and hone scholars’ ability to think critically about threat models—ultimately, in service of helping scholars become excellent researchers.

The readings and questions were largely based on the curriculum from the Winter 2023-24 Program, with two changes:

  • We reduced the number of weeks, since in the previous cohort scholars found it harder to devote time to discussion groups later in the program.
  • For each week we selected a small number of “core readings”, since many scholars were unable to devote time to read everything in the curriculum, and we thought that some readings were more valuable than others.

In addition, the curriculum was supplemented in two ways:

  • For some weeks, a summary of each reading was compiled for the benefit of discussion group facilitators.
  • There were some readings that we felt were valuable, but did not fit nicely into any particular week.

These supplements were shared with scholars after the discussion series concluded, and are included in this post. When summaries exist, they are shown underneath the reading they summarize, and the additional readings are at the end of the post. Some summaries have been edited for accuracy since being shown to discussion group facilitators and scholars, but they still may contain flaws.

As in the post about the previous cohort’s curriculum, we think that there is likely significant room to improve this curriculum, and welcome feedback in the comments.

Week 1: How powerful is intelligence?

Core readings

Other readings

Discussion questions

  • Is it true that humans are more impactful than other animals due to our intelligence? What kinds of "intelligence" are relevant?
  • Is human intelligence located inside one brain, or in organizations and cultures? Will we and AIs be part of the same culture?
  • How many of Bostrom's "cognitive superpowers" are necessary for AI to be massively impactful? Which are likely to be developed by machine learning?
  • What is the most likely way Yudkowsky's mail-order DNA scenario could fail? Does this failure generalize to other AI takeover scenarios? A similar question applies to the catastrophes described in "Other readings".
  • What are the differences between the different stories of AI takeover? Which kinds of stories are more plausible?
  • What are the limits of intelligence? What do they imply about how powerful AI could be?
  • Is 'intelligence' a single thing? What kinds of intelligence are more or less powerful?
  • What follows from the premise that AI could be extremely powerful?

Week 2: How and when will transformative AI be made?

Core readings

  • Will Scaling Work? (Patel - 16 min)
    A 'dialogue' on whether LLMs can be scaled to get superhuman AI. In summary:
    • We're running out of language data, and there's a good chance we'll be a few OOMs short of what we need. You could do self-play, but that's computationally expensive and hard to evaluate. On the other hand, evaluation gets easier the smarter models are, we could train them on more relevant loss functions, and top researchers think it will work.
    • Scaling has worked so far at driving loss down
      • But next-token prediction isn't what we care about
      • But you also see scaling on benchmarks like MMLU
      • But MMLU is just memorization of internet stuff, and we're starting to saturate MMLU. Benchmarks that measure autonomous problem-solving see models sucking.
      • GPT-4 did better than plateau believers predicted, not clear why you'd suddenly see a plateau, could spend way more, improve compute efficiency, unhobble, etc.
    • LLMs seem like they're modelling the world
      • But they're just compressed knowledge
      • But c'mon, scaling is working!
  • The Direct Approach (Barnett and Besiroglu - 13 min)
    Connects cross-entropy loss to something we care about: how many tokens you would need to draw from a model before you could distinguish its output from human-written text. Thesis: if a model's output is indistinguishable from human text over some horizon, it's about as capable as the humans generating that text over that horizon. Uses this to bound the time until AIs can write at a human level. (A rough version of this calculation is sketched after this list.)
  • Biological Anchors: A Trick That Might or Might Not Work, parts I and II (Alexander - 33 min)
    Bio-anchors: compare FLOPs used in ML to candidate biological processes that maybe produce AGI in the relevant way, get a decent probability of AGI by 2050. Problem: maybe FLOPs as a measure of intelligence is fake.
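
To make the Direct Approach summary above more concrete, here is a rough sketch of the kind of calculation it rests on; the original piece's exact setup and constants differ, so treat this purely as illustration.

```latex
% Illustrative sketch, not the piece's exact derivation. Let L be the model's
% per-token cross-entropy on human text and H the intrinsic entropy of that
% text, so the per-token KL divergence from the human distribution is roughly
% the "reducible loss" L - H. A judge trying to tell model output apart from
% human text with error probability \delta then needs on the order of
\[
  k \approx \frac{\ln(1/\delta)}{L - H}
\]
% tokens. As scaling drives L toward H, k grows without bound, and the horizon
% length k at which outputs become indistinguishable serves as a rough proxy
% for how long a task the model can perform at a human level.
```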

Other readings

  • Machine Learning Trends (Epoch AI - 5 min)
    Dashboard of trends about big deep learning models - how compute is scaling (training compute for notable ML models has grown roughly 4x per year since 2010), when we'll run out of human-generated text to train on (around 2028), etc.
  • The Scaling Hypothesis (Branwen - 4 min for the abstract, 44 min in total)
    Just scaling up neural nets on text prediction could produce AGI. Written when this was not the consensus view.
  • AGI and the EMH: markets are not expecting aligned or unaligned AI in the next 30 years, sections I, II, III, VII, VIII, IX, XI (Halperin, Chow, Mazlish - 3 min for the introduction, 31 min total)
    If markets were expecting super-powerful AI within the next 30 years, interest rates would go up, either because the market thought we were about to die, or because economic growth would increase. They're not up, and markets are really smart, so maybe we won't get super-powerful AI in the next 30 years. (A simplified version of the interest-rate logic is sketched after this list.)
  • Summary of Situational Awareness - The Decade Ahead (OscarD - 1 min for the summary of the summary, 21 min for the full summary)
    Scaling might produce really smart AI. We should think of this from a national security perspective. The US should build tons of power plants to put energy into training AIs, people should put a bunch of work into alignment, AI will be nationalized, and the US should try to get AGI before foreign actors do while making sure foreign state actors don't steal model weights.
  • AI Timelines (Habryka, Kokotajlo, Cotra, Erdil - 61 min)
    Discussing disagreements between Daniel Kokotajlo, Ajeya Cotra, and Ege Erdil on AI timelines. Daniel thinks that we will get transformative AI within 10 years, Ajeya's median is 13 years, and Ege thinks it's 40 years away. Ege puts a lot of weight on unforeseen difficulties popping up and making things really tricky, while Daniel thinks scaling will work pretty smoothly.
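
For the "AGI and the EMH" piece above, the interest-rate logic can be compressed into a standard Ramsey-rule approximation; the paper's own treatment is more careful about risk and uncertainty, so the following is only a sketch.

```latex
% Simplified Ramsey-rule sketch of the interest-rate argument:
\[
  r \approx \rho + \eta g
\]
% where r is the long-run real interest rate, \rho the pure rate of time
% preference, \eta the coefficient of relative risk aversion, and g expected
% consumption growth. If markets priced in transformative AI pushing g from
% roughly 2% per year to tens of percent (or priced in a serious chance of
% extinction, which acts like a much larger \rho), long-run real rates should
% sit far above the few percent actually observed.
```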

Discussion questions

  • Will scaling big machine learning models work to produce transformative AI? How can we tell?
  • If scaling does work, what does that mean for AI safety efforts?
  • How long will it take scaling to produce transformative AI?
  • What are the strengths and weaknesses of Epoch AI's "direct approach" vs Cotra's "biological anchors" as methods of forecasting the emergence of transformative AI?
  • How quickly might scaling work?
  • How much should we weight market prices as forecasts of AI timelines?
  • Should we expect AI companies to be nationalized?
  • What parts of modern machine learning support Gwern's speculations about the scaling hypothesis? What parts support it the least?
  • What do these frameworks say about the time between now and when transformative AI is developed? What things could happen that would tell us which of these frameworks were more reliable?

Week 3: How could we train AIs whose outputs we can’t evaluate?

Core readings

  • Why I'm excited about AI-assisted human feedback (Leike - 7 min)
    RLHF won't scale when humans can't evaluate AI plans - we need AIs to assist humans with evaluation.
  • X thread on "When Your AIs Deceive You" (Emmons - 2 min)
    RLHF can be suboptimal when the AI knows things the human doesn't. Two failure modes: the AI making the human think things are better than they really are, and the AI spending resources to prove to the human that the AI is being useful.
  • Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision, sections 1 (Introduction) and 3 (Methodology) (Burns et al. - 12 min)
    Studies the problem of humans supervising superhuman models by trying to use weak models to generate labels on which to fine-tune larger models, and seeing if the larger models can perform better than the weaker models. (A toy version of this setup is sketched after this list.)
  • The easy goal inference problem is still hard (Christiano - 5 min)
    A common alignment plan is to observe human behaviour, infer what we want, and get an AI to optimize for that. Problem: given that humans are not optimal utility maximizers, it's unclear how one could do this even with infinite data and compute.
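
The weak-to-strong setup and the paper's headline metric, "performance gap recovered" (PGR), can be illustrated with a runnable toy. The scikit-learn models and synthetic dataset below are arbitrary stand-ins for the GPT-family models and NLP tasks used in the paper.

```python
# Toy illustration of the weak-to-strong protocol and the "performance gap
# recovered" (PGR) metric. Small scikit-learn models stand in for the weak
# supervisor and strong student (the paper fine-tunes GPT-family LMs).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=6000, n_features=20, n_informative=10,
                           random_state=0)
X_sup, y_sup = X[:2000], y[:2000]          # ground truth for the weak supervisor
X_mid, y_mid = X[2000:4000], y[2000:4000]  # data the strong student trains on
X_test, y_test = X[4000:], y[4000:]        # held-out evaluation set

# 1. Weak supervisor: a simple model, deliberately handicapped (sees 5 features).
weak = LogisticRegression(max_iter=500).fit(X_sup[:, :5], y_sup)

# 2. Weak labels: the supervisor's (imperfect) predictions on fresh data.
weak_labels = weak.predict(X_mid[:, :5])

# 3. Strong student trained only on the weak labels, never on ground truth.
strong_on_weak = GradientBoostingClassifier(random_state=0).fit(X_mid, weak_labels)

# 4. Ceiling: the same strong model trained directly on ground truth.
strong_ceiling = GradientBoostingClassifier(random_state=0).fit(X_mid, y_mid)

weak_acc = weak.score(X_test[:, :5], y_test)
w2s_acc = strong_on_weak.score(X_test, y_test)
ceiling_acc = strong_ceiling.score(X_test, y_test)

# PGR = 0 means the student only matched its weak supervisor;
# PGR = 1 means it recovered the full gap up to the strong ceiling.
pgr = (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)
print(f"weak={weak_acc:.3f} weak-to-strong={w2s_acc:.3f} "
      f"ceiling={ceiling_acc:.3f} PGR={pgr:.2f}")
```

The paper's central question is how much of this gap the student recovers in practice; it typically finds partial, not full, recovery.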

Other readings

  • Iterated Distillation and Amplification (Cotra - 8 min)
    Repeat the following: make an AI that is as smart as a human with whatever tools the human has available, but much cheaper to run (distillation); then give that AI to the human as a tool to do stuff with, increasing the human's capabilities (amplification). This increases the human's power to do stuff while remaining aligned with the human's interests. (A toy version of this loop is sketched after this list.)
  • AI safety via debate, up to but not including 2.2 "Complexity theory analogies: DEBATE = PSPACE" (Irving, Christiano, Amodei - 12 min)
    To help humans supervise AI outputs, train two AIs to debate each other on how good the outputs are. The AIs will be incentivized to point out flaws or omitted info in each other's arguments, and so each AI is incentivized to be honest about the quality of the outputs.
  • Debate update: Obfuscated arguments problem (Barnes - 20 min)
    Suppose one AI makes a long, multi-step argument for claim C, and another AI says that one of the steps is invalid but it can't figure out which. Then, the other AI makes a long, multi-step argument for C being false, and the first AI says that one of the steps is invalid but it can't figure out which. A human can't reliably judge this debate, which makes it hard for AI safety via debate to work. This can show up in practice in human debates.
  • Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem, up to but not including "Upshot" (Radhakrishnan et al. - 4 min)
    Scalable oversight (getting AIs to help humans evaluate AI output) and weak-to-strong generalization (helping AIs learn from imperfect human evaluation) are compatible approaches to the problem of training models to perform well when we have trouble evaluating their output.
  • The case for aligning narrowly superhuman models, up to but not including "Advantages over other genres of alignment research" (Cotra - 21 min)
    To test our ability to get smart models to do what we want, we should run experiments like "get someone who doesn't speak French to train an LLM to translate English into French, and use bilingual English-French speakers to see if it worked". More abstractly: get a fuzzy task that can't be precisely defined, an AI that could be better than some humans at that task, and try to get those humans to get the AI to do that task, via various alignment schemes.
  • A minimal viable product for alignment (Leike - 4 min)
    If we built an AI system that could accelerate alignment research and helped us align more capable AI systems, this would be good - it's easier than most alignment work, and helps us solve the rest of the alignment problem.
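
The iterated distillation and amplification loop summarized above is easiest to see in code. The sketch below is a runnable toy: the task (factorials), the overseer's fixed decomposition rule, and memorization-as-distillation are all stand-ins chosen for brevity, not part of the original proposal.

```python
# Toy, runnable sketch of the Iterated Distillation and Amplification loop.
from typing import Callable, Dict

Agent = Callable[[int], int]  # an agent maps the question n to an answer for n!

def base_agent(n: int) -> int:
    """The initial weak agent only knows the trivial case; 0 means it doesn't know."""
    return 1 if n == 0 else 0

def amplify(agent: Agent) -> Agent:
    """Amplification: an overseer decomposes 'what is n!?' into a subquestion
    for the current agent ('what is (n-1)!?') and combines the answer.
    The result is slower but more capable than the agent alone."""
    def amplified(n: int) -> int:
        if n == 0:
            return 1
        sub = agent(n - 1)             # delegate the subquestion to the agent
        return n * sub if sub else 0   # combine, or admit ignorance
    return amplified

def distill(amplified: Agent, questions: range) -> Agent:
    """Distillation: train a cheap agent to imitate the amplified system.
    Here 'training' is just memorizing (question, answer) pairs."""
    table: Dict[int, int] = {n: amplified(n) for n in questions}
    return lambda n: table.get(n, 0)

def ida(rounds: int, questions: range) -> Agent:
    agent: Agent = base_agent
    for _ in range(rounds):
        agent = distill(amplify(agent), questions)  # one amplify-then-distill round
    return agent

# Each round lets the distilled agent answer one level deeper:
trained = ida(rounds=5, questions=range(10))
print([trained(n) for n in range(10)])  # correct up to 5!, 0 ("don't know") beyond
```

The point of the loop is that each round's distilled agent is, ideally, about as capable as the previous amplified system but much cheaper to run, so capability grows while oversight comes from the overseer's decomposition at every step.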

Discussion questions

  • What failure modes could realistically occur if we do not solve this problem?
  • How does goal inference fare as a method of training AIs that can make plans we don't understand?
  • In what ways does this problem show up when training current models? Are there aspects of the problem that aren't present yet?
  • One plan for alignment is to make aligned AIs that are as smart as ourselves, and make those AIs align smarter AIs. How feasible is this plan?
  • The scalable oversight approach involves trying to get AIs to help humans to oversee AIs. How promising can this approach be? What do we need to assume about the helper AIs for this approach to work?
  • How do the guarantees provided by methods like debate or iterated distillation and amplification degrade when training is imperfect?

Week 4: Will AIs fake alignment?

Core readings

  • Scheming AIs: Will AIs fake alignment during training in order to get power?, abstract and introduction (Carlsmith - 45 min)
    [In retrospect, this probably took longer than 45 minutes for most people to read]

Other readings

On inner and outer alignment

On reasons to think deceptive alignment is likely

Discussion questions

  • Under what conditions could AIs fake alignment?
  • How could we gain empirical evidence about the likelihood of AIs faking alignment?
  • In what conditions are agents motivated to gain power or to preserve the content of their goals? In what situations do they not face that motivation?
  • Have humans internalized the "goals" of evolution, or are we misaligned? To what degree does this question make sense?
  • If models generalize their capabilities to new contexts, is that evidence that they will generalize their values and alignment to new contexts?
  • Do counting arguments provide evidence for AI doom?
  • How likely are models trained on predictive tasks to fake alignment?
  • Alex Turner has criticized his post "Seeking Power is Often Convergently Instrumental in MDPs". Why has he done so? If his criticism is valid, how does that affect the relevance of the results for understanding the likelihood of models faking alignment?
  • How hard is it likely to be to train AIs that will not fake alignment, as opposed to controlling smart AIs that may have faked their own alignment?

Week 5: How should AI be governed?

Core readings

  • PauseAI Proposal (PauseAI - 4 min)
    Set up an international AI Safety Agency that has to approve training and deploying big models. Training of general AI systems should only be allowed if safety can be guaranteed, and deployment should only be allowed if no dangerous capabilities are present. The proposal would also stop people from publishing algorithmic improvements or from increasing the effectiveness of computational hardware.
  • Responsible Scaling Policies (RSPs) (METR - 13 min)
    How labs can increase safety: say "we will stop scaling when we make observation O, until we implement adequate protection P". This makes sense under varying estimates of risk, moves attention to specific risk-reducing measures, and gives practice for evaluation-based regulations.
  • Ways I Expect AI Regulation To Increase Extinction Risk (1a3orn - 8 min)
    Regulations are messy and can easily backfire, by (a) being misdirected and hampering safety effort, (b) favouring things that are legible to the state (which might not be good safety efforts), (c) pushing research to countries that don't regulate, and (d) empowering big companies.

Other readings

  • My thoughts on the social response to AI risk (Barnett - 12 min)
    From the piece: "Since I think substantial AI regulation will likely occur by default, I urge effective altruists to focus more on ensuring that the regulation is thoughtful and well-targeted rather than ensuring that regulation happens at all."
  • Tort Law Can Play an Important Role in Mitigating AI Risk (note that 'tort law' basically means 'liability law') (Weil - 6 min)
    An AI regulation scheme: if there's an AI accident that's a near miss for doom (e.g. a power-seeking AI takes over Berkeley for a week, rather than the whole earth forever), the developer is liable for punitive damages that reflect the risk that their AI could have taken over the world. Also, impose damages even when there isn't provable negligence, and require labs to be insured against this legal risk.
  • Pausing AI Developments Isn’t Enough. We Need to Shut it All Down (Yudkowsky - 10 min)
    Advanced AI is likely to kill all humans, and AI is progressing very quickly. There is no sensible plan to align AI: "get AI to figure out alignment" is not a sensible plan. There should be a global ban on frontier AI development, GPUs should be tracked, and large GPU clusters should be banned.
  • We're Not Ready: thoughts on "pausing" and responsible scaling policies (Karnofsky - 9 min)
    We're not ready for transformative AI. A global pause on AI progress and hardware is infeasible, and a partial pause (e.g. temporarily banning AI development in the US but not banning production of AI hardware and algorithmic insights) could do more harm than doing nothing, because progress would be very fast once the pause was lifted. RSPs are like pauses, but ones that only kick in when AI is demonstrably scary, and so lots of people can agree on them.
  • The possibility of an indefinite AI pause (Barnett - 18 min)
    An indefinite AI pause is possible, but would require something like a world government, which would be hard to create and would also be terrible. It would delay technological progress, and could prevent AI from ever being created, leaving humanity vulnerable to other existential risks.
  • Short timelines and slow, continuous takeoff as the safest path to AGI (Hadshar and Lintz - 9 min)
    Takeoffs are safer if they're slower and more continuous, because people have time to react. Slow continuous takeoff is more likely in short timelines, because coordination is easier now and we might have "compute overhang" later - lots of computation that could quickly be used to make really smart AI. Also, it's worth trading time now for time during takeoff, because we will know more during takeoff.
  • What's up with "Responsible Scaling Policies"? (Habryka and Greenblatt - 24 min)
    Some things about the naming and definitions of RSPs are sketchy, and the Anthropic RSP is poorly specified. Using RSPs to reduce existential risk requires determining whether models are existentially dangerous, and it's unclear how to do that.

Discussion questions

  • What technical research would best make an AI pause more feasible, or make it work better?
  • What technical research would make RSPs work better?
  • How similar would RSPs be to a pause on frontier AI developments in practice? What are the pros and cons of each approach?
  • How likely are the negative outcomes of AI regulation that 1a3orn describes? Are there ways of reducing the likelihood of those outcomes?
  • Are there other plausible serious negative outcomes of AI regulation that 1a3orn does not address?
  • What would change your mind about which forms of AI regulation were good ideas?
  • Some are concerned that a pause on frontier AI development would greatly speed progress once the pause was lifted, due to other factors of AI production (e.g. data, computers) still proceeding. How big of a concern is this?
  • Will countries be able to coordinate on AI regulations without a very powerful world government? Are there types of AI regulation that require less coordination?
  • How does the approach of using liability law to address risky AI development compare to RSPs?

Readings that did not fit into any specific week

Acknowledgements

Daniel Filan was the primary author of the curriculum (to the extent that it differed from the Winter 2023-24 curriculum) and coordinated the discussion groups. Ryan Kidd scoped, managed, and edited the project. Many thanks to the MATS alumni and other community members who helped as facilitators and to the scholars who showed up and had great discussions!
