We are hiring for several roles on the Scalable Alignment and Alignment Teams at DeepMind, two subteams within DeepMind Technical AGI Safety working to make artificial general intelligence go well. In brief,
- The Alignment Team investigates how to avoid failures of intent alignment, operationalized as a situation in which an AI system knowingly acts against the wishes of its designers. Alignment is hiring for Research Scientist and Research Engineer positions.
- The Scalable Alignment Team (SAT) works to make highly capable agents do what humans want, even when it is difficult for humans to know what that is. This means we want to remove subtle biases, factual errors, or deceptive behaviour even if they would normally go unnoticed by humans, whether due to reasoning failures or biases in humans or due to very capable behaviour by the agents. SAT is hiring for Research Scientist - Machine Learning, Research Scientist - Cognitive Science, Research Engineer, and Software Engineer positions.
We elaborate on the problem breakdown between Alignment and Scalable Alignment next, and discuss details of the various positions.
“Alignment” vs “Scalable Alignment”
Very roughly, the split between Alignment and Scalable Alignment reflects the following decomposition:
- Generate approaches to AI alignment – Alignment Team
- Make those approaches scale – Scalable Alignment Team
In practice, this means the Alignment Team has many small projects going on simultaneously, reflecting a portfolio-based approach, while the Scalable Alignment Team has fewer, more focused projects aimed at scaling the most promising approaches to the strongest models available.
Scalable Alignment’s current approach: make AI critique itself
Imagine a default approach to building AI agents that do what humans want:
- Pretrain on a task like “predict text from the internet”, producing a highly capable model such as Chinchilla or Flamingo.
- Fine-tune into an agent that does useful tasks, as evaluated by human judgements.
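As a rough illustration of step 2, here is a toy sketch of learning a reward model from pairwise human judgements, the kind of signal such fine-tuning relies on. Everything here is hypothetical (the features, data, and training loop are invented for illustration, not DeepMind code):

```python
# Toy sketch of step 2: fitting a reward model to pairwise human judgements.
# All names and data are hypothetical illustrations, not DeepMind code.
import math
import random

random.seed(0)

def score(w, x):
    """Linear reward model: estimated quality of a response with features x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(prefs, dim, lr=0.5, steps=2000):
    """Fit a Bradley-Terry model to comparisons (preferred, rejected)."""
    w = [0.0] * dim
    for _ in range(steps):
        a, b = random.choice(prefs)
        # p = P(human prefers a over b) under the current model
        p = 1.0 / (1.0 + math.exp(score(w, b) - score(w, a)))
        # Gradient ascent on log-likelihood of the observed preference
        for i in range(dim):
            w[i] += lr * (1.0 - p) * (a[i] - b[i])
    return w

# Hypothetical 2-d response features: (helpfulness, subtle-error rate).
# The "human" judges prefer helpful responses but cannot see the subtle
# errors, so errors correlate with being preferred in the toy data.
prefs = [((0.9, 0.8), (0.2, 0.0)),
         ((0.8, 0.5), (0.3, 0.0)),
         ((0.7, 0.9), (0.4, 0.1))]
w = train_reward_model(prefs, dim=2)
print(w[0] > 0, w[1] > 0)  # → True True
```

Note what happens in the toy data: because the hypothetical judges never detect the subtle errors, the learned reward ends up with a positive weight on the error rate as well, quietly rewarding the very behaviour we want to remove.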
There are several ways this could go wrong:
- Humans are unreliable: The human judgements we train against could be flawed: we could miss subtle factual errors, use biased reasoning, or have insufficient context to evaluate the task.
- The agent’s reasoning could be hidden: We want to know not just what the system is doing but why, both because that might reveal something we don’t like, and because we expect good reasoning to better generalize to other situations.
- The agent’s reasoning might not generalize: Even if the reasoning is correct this time, the AI could fail to generalize correctly to other situations.
Our current plan to address these problems is (in part):
- Give humans help in supervising strong agents: On the human side, provide channels for oversight and advice from peers, experts in various domains, and broader society. On the ML side, agents should explain their behaviour and reasoning, argue against themselves when wrong, and cite relevant evidence.
- Align explanations with the true reasoning process of the agent: Ensure that agents are able and incentivized to show their reasoning to human supervisors, either by making reasoning explicit if possible or via methods for interpretability and eliciting latent knowledge.
- Red team models to exhibit failure modes that don’t occur in normal use
We believe none of these pieces are sufficient by themselves:
- (1) without (2) can be rationalization, where an agent decides what to do and produces an explanation after the fact that justifies its answer.
- (2) without (1) doesn’t scale: The full reasoning trace of the agent might be enormous: terabytes of data even with compression, or exponentially large without compression if the agent uses advanced heuristics that expand into very large human-interpretable reasoning traces.
- (1)+(2) without (3) will miss rare failures.
- (3) needs (1)+(2) to define failure.
An example proposal for (1) is debate, in which two agents are trained in a zero-sum game to provide evidence and counterarguments for answers, as evaluated by a human judge. If we imagine the exponentially large tree of all possible debates, the goals of debate are to (a) engineer the whole tree so that it captures all relevant considerations, and (b) train agents so that the single path actually taken through the tree reflects the tree as a whole.
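To make the game-tree picture concrete, here is a minimal sketch of how optimal zero-sum play selects a single path whose outcome reflects the whole tree. The tree, payoffs, and judge verdicts are invented for illustration; real debate training would learn the agents rather than search exhaustively:

```python
# Toy sketch of debate as a zero-sum game over a tree of arguments.
# The tree structure and judge verdicts are hypothetical illustrations.

def debate_value(node, maximizing=True):
    """Value of a debate subtree to debater 1, assuming both debaters
    play optimally (minimax). Leaves hold the judge's verdict: +1 if the
    judge would side with debater 1 after that line of debate, -1 if not."""
    if not isinstance(node, tuple):  # leaf: judge's verdict
        return node
    children = [debate_value(c, not maximizing) for c in node]
    return max(children) if maximizing else min(children)

# A tiny debate tree: debater 1 picks a top-level argument, debater 2
# picks the strongest counterargument, and the judge scores the result.
tree = (
    (+1, -1),   # argument A: refuted by one counterargument
    (+1, +1),   # argument B: survives every counterargument
)
print(debate_value(tree))  # → 1: optimal play finds the robust argument
```

Under optimal play, debater 1 steers toward the argument that survives every counterargument, so the judge only ever evaluates one path while the verdict reflects the value of the exponentially larger tree.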
The full picture will differ from the pure debate setting in many ways, and we believe the correct interpretation of the debate idea is “agents should critique themselves”. There is a large space of protocols that include agents critiquing agents as a component, and choosing between them will involve
- Human aspects: Whether a particular human-machine interaction is aligned depends on the humans involved, and we need cognitive science experiments probing these uncertainties.
- Theoretical strengthenings: There are a variety of theoretical strengthenings to debate such as cross-examination, learning the prior, and market making, but little work has been done to turn these into practical systems (example obstacle to learning the prior).
- Flexibility in integrating other components of alignment: For example, if strong interpretability tools are developed they should be smoothly integrated into the human-machine interaction, so that the human supervision process has access to internal reasoning.
- Practicalities: Citing sources, better uncertainty estimation, declining to answer if uncertain, etc.
- Governance: By explaining themselves, agents can provide a lever for external oversight.
The three goals of “help humans with supervision”, “align explanations with reasoning”, and “red team models” will blur together once we put the whole picture in place. Red teaming can occur either standalone or as an integrated part of a training scheme such as cross-examination, which allows agents to interrogate opponent behaviour along counterfactual trajectories. Stronger schemes to help humans with supervision should improve alignment with reasoning by themselves, as they grow the space of considerations that can be exposed to humans. Thus, a key part of the Scalable Alignment Team’s work is planning out how these pieces will fit together.
Examples of our work, involving extensive collaboration with other teams at DeepMind:
- Risk analyses, both for long-term alignment risks and harms that exist today:
- Language model pretraining, analysis, and safety discussion
- Earlier proposals for debate and human aspects of debate
We view our recent safety papers as steps towards the broader scalable alignment picture, and continue to build out towards debate and its generalizations. We work primarily with large language models (LLMs), both because LLMs are a tool for safety, enabling rich human-machine communication, and because they are examples of ML models that may cause both near-term and long-term harms.
Alignment Team’s portfolio of projects
In contrast to the Scalable Alignment Team, the Alignment Team explores a wide variety of possible angles on the AI alignment problem. Relative to Scalable Alignment, we check whether a technique could plausibly scale based on conceptual and abstract arguments. This lets us iterate much faster, at the cost of getting less useful feedback from reality. To give you a sense of the variety, here are some examples of public past work that was led by current team members:
- Learning objectives from human feedback on hypothetical behavior
- Understanding agent incentives using causal influence diagrams
- Examples of specification gaming
- Eliciting latent knowledge contest
- Avoiding side effects through impact regularization
- Improving our philosophical understanding of “agency” using Conway’s game of life
- Relating specification problems and Goodhart’s Law
- Decoupling approval from actions to avoid tampering
That being said, over the last year there has been some movement away from previous research topics and towards others. To get a sense of our current priorities, here are short descriptions of some projects that we are currently working on:
- Primarily conceptual:
- Investigate threat models in which, due to increasing AI sophistication, humans are forced to rely on evaluations of outcomes (rather than evaluations of process or reasoning).
- Investigate arguments about the difficulty of AI alignment, including as a subproblem the likelihood that various AI alignment plans succeed.
- Compare various decompositions of the alignment problem to see which one is most useful for guiding future work.
- Primarily empirical:
- Create demonstrations of inner alignment failures, in a similar style to this paper.
- Dig deeper into the grokking phenomenon and give a satisfying account of how and why it happens.
- Develop interpretability tools that allow us to understand how large language models work (along similar lines as Anthropic’s work).
- Evaluate how useful process-based feedback is on an existing benchmark.
Relative to most other teams at DeepMind, on the Alignment team there is quite a lot of freedom in what you work on. All you need to do to start a project is to convince your manager that it’s worth doing (i.e. reduces x-risk comparably well to other actions you could take), and convince enough collaborators to work on the project.
In many ways the team is a collection of people with very different research agendas and perspectives on AI alignment who you wouldn’t normally expect to work together. What ties us together is our meta-level focus on reducing existential risk from alignment failures:
- Every new project must come accompanied by a theory of change that explains how it reduces existential risk; this helps us avoid the failure mode of working on interesting conceptual projects that end up not connecting to the situations we are worried about.
- It’s encouraged to talk to people on the team with very different perspectives and try to come to agreement, or at least better understand each other’s positions. This can be an explicit project even though it isn’t “research” in the traditional sense.
Interfacing with the rest of DeepMind
Both Alignment and Scalable Alignment collaborate extensively with people across DeepMind.
For Alignment, this includes both collaborating on projects that we think are useful and explaining our ideas to other researchers. As a particularly good example, we recently ran a 2-hour AI alignment “workshop” with over 100 attendees. (That being said, you can opt out of these engagements in order to focus on research, if you prefer.)
As Scalable Alignment’s work with large language models is very concrete, we have tight collaborations with a variety of teams, including large-scale pretraining and other language teams, Ethics and Society, and Strategy and Governance.
Between our two teams we have open roles for Research Scientists (RSs), Research Engineers (REs), and (for Scalable Alignment) Software Engineers. Scalable Alignment RSs can have either a machine learning background or a cognitive science background (or equivalent). The boundaries between these roles are blurry. There are many skills involved in overall Alignment / Scalable Alignment research success: proposing and leading projects, writing and publishing papers, conceptual safety work, algorithm design and implementation, experiment execution and tuning, design and implementation of flexible, high-performance, maintainable software, and design and analysis of human interaction experiments.
We want to hire from the Pareto frontier of all relevant skills. This means RSs are expected to have more research experience and more of a track record of papers, SWEs are expected to be better at scalable software design, collaboration, and implementation, and REs sit in between. It also means that REs can and do propose and lead projects if capable (e.g., this recent paper had an RE as last author). For more details on the tradeoffs, see the career section of Rohin’s FAQ.
For Scalable Alignment, most of our work focuses on large language models. For Machine Learning RSs, this means experience with natural language processing is valuable, but not required. We are also interested in candidates motivated by other types of harms caused by large models, such as those described in Weidinger et al., Ethical and social risks of harm from language models, as long as you are excited by the goal of removing such harms even in subtle cases which humans have difficulty detecting. For REs and SWEs, a focus on large language models means that experience with high performance computation or large, many-developer codebases is valuable. For the RE role for Alignment, many of the projects you could work on would involve smaller models that are less of an engineering challenge, though there are still a few projects that work with our largest language models.
Scalable Alignment Cognitive Scientists are expected to have a track record of research in cognitive science, and to design, lead, and implement either standalone human-only experiments to probe uncertainty, or the human interaction components of mixed human / machine experiments. No experience with machine learning is required, but you should be excited to collaborate with people who do!
We will be evaluating applications on a rolling basis until positions are filled, but we will at least consider all applications that we receive by May 31. Please do apply even if your start date is up to a year in the future, as we probably will not run another hiring round this year. These roles are based in London, with a hybrid work-from-office / work-from-home model. International applications are welcome as long as you are willing to relocate to London.
While we do expect these roles to be competitive, we have found that people often overestimate what we are looking for. In particular:
- We do not expect you to have a PhD if you are applying for the Research Engineer or Software Engineer roles. Even for the Research Scientist role, it is fine if you don’t have a PhD if you can demonstrate comparable research skill (though we do not expect to see such candidates in practice).
- We do not expect you to have read hundreds of blog posts and papers about AI alignment, or to have a research agenda that aims to fully solve AI alignment. We will look for understanding of the basic motivation for AI alignment, and the ability to reason conceptually about future AI systems that we haven’t yet built.
- If we ask you, say, whether an assistive agent would gradient hack if it learned about its own training process, we’re looking to see how you go about thinking about a confusing and ill-specified question (which happens all the time in alignment research). We aren’t expecting you to give us the Correct Answer, and in fact there isn’t a correct answer; the question isn’t specified well enough for that. We aren’t even expecting you to know all the terms; it would be fine to ask what we mean by “gradient hacking”.
- As a rough test for the Research Engineer role, if you can reproduce a typical ML paper in a few hundred hours and your interests align with ours, we’re probably interested in interviewing you.
- We do not expect SWE candidates to have experience with ML, but you should have experience with high performance code and experience with large, collaborative codebases (including the human aspects of collaborative software projects).
Go forth and apply!
- Alignment Team:
- Scalable Alignment Team: