
By Robert Wiblin

Episode summary

Building an artifact like Gemini is very, very difficult. The main reason being you have to produce this one thing, this single set of model weights, deployed using a single serving stack. And it has to satisfy so many constraints… There’s probably 100 such things.

And it is the case that if you make one change to the process with the intent of making one of these things better — say, safety — it will have random downstream knock-on effects on other constraints that you totally did not anticipate.

— Rohin Shah

Most people working on AI safety think that, without a massive effort, AI systems will probably end up with goals catastrophically different from humanity’s. Today’s guest, Rohin Shah, head of AGI Safety and Alignment at Google DeepMind and an AI safety researcher since 2017, disagrees.

“There is no particularly compelling argument that this is the thing that happens by default,” Rohin explains. “There’s a lot of arguments that are suggestive that maybe it could happen, such that you should find it plausible. That’s sufficient to justify a significant amount of effort into averting it, which is why I work in the area I do. But none of them rise to the level of, ‘I’m expecting this to happen by default.’”

Take the worry that AIs will accidentally be trained to be deceptive. Sure, it’s possible. But we’re not running reinforcement learning over year-long trajectories — for now, we’re running it over a week at most. The natural prediction is that models learn to grab short-term reward, not that they develop the ambitious long-horizon goals required for convergent power-seeking.

What about current examples of models lying and scheming? Rohin has looked into the details, and most don’t resemble the thing we really fear: a competent AI pursuing an ambitious misaligned goal. Anthropic’s “alignment faking” results, for instance, show a model trying to preserve its trained values against modification, which is arguably what it was trained to do.

Rohin also expects we’ll see problems coming. There’s some generalisation risk at the point where AIs become powerful enough to actually take over, but the underlying challenges — overseeing superhuman systems, interpretability — are things we can iterate on now.

Host Rob Wiblin pushes back on the case for AI optimism, and they also explore why current alignment success isn’t strong evidence about superhuman systems, what it would actually take to change Rohin’s mind, and where he thinks the doomers go wrong.

This episode was recorded on December 4, 2025.

Our production team includes:

  • Video editors: Josh Alward, Dominic Armstrong, Jasper Luithlen, Milo McGuire, Luke Monsour, and Simon Monsour
  • Producers: Elizabeth Cox and Nick Stockton
  • Coordination and support: Katy Moore and Lou Moran
  • Camera operator: Jeremy Chevillotte

The interview in a nutshell

Rohin Shah, head of artificial general intelligence (AGI) Safety and Alignment at Google DeepMind, thinks catastrophic misalignment is plausible but unlikely — and that the bigger challenge isn’t figuring out what safety work to do, but implementing it inside the competing constraints of a frontier AI company.

He argues the safety community should stop chasing commitments and pre-deployment evaluations, and shift its focus to building expert oversight infrastructure, accelerating AI governance, and doing practical research that companies will actually use.

Catastrophic misalignment probably won’t happen by default

Rohin takes the risk of catastrophic misalignment seriously enough to have built his career on it, but finds no existing argument strong enough to expect it by default:

  • Arguments about accidentally training deceptive AI are plausible, but reinforcement learning will be done over short horizons (a week or a month, not year-long trajectories), so the natural result is short-horizon reward hacking — which is very different from the ambitious, long-horizon goals needed to motivate world takeover.
  • Today’s examples of AI scheming look less like competent pursuits of scary misaligned objectives, and more like role-playing or convergent instrumental subgoals — which aligned models can also do.
  • Rohin expects we’ll see many problems in advance and iterate on solutions — because the underlying issues (difficulty of oversight, need for interpretability) can be studied with current systems, even though the scariest capabilities aren’t here yet.
  • He’s not reassured by the seeming alignment of today’s models, since current models haven’t forced us to confront the hard problems (e.g. alien reasoning we can’t follow, delegating oversight to other AIs).

Company commitments aren’t binding; build expert oversight instead

Rohin is sceptical of the push for companies to make firm safety commitments, and would prefer more energy was focused on different accountability mechanisms. His reasoning:

  • Commitments don’t actually bind: companies can still alter or drop strict language when convenient. Anthropic’s Responsible Scaling Policy progressively softened its language across versions, whereas Google’s Frontier Safety Framework was more conservative from the start.
  • The future is too uncertain to tie yourself to the mast: two or three years ago, the popular view was to include alignment research data in pretraining. Now the opposite is preferred (to avoid teaching models to evade safety mitigations, or adopt harmful personas).
  • Instead, Rohin advocates for third-party auditors with deep access who can make nuanced, context-dependent judgements, for example along the lines of how central banks monitor financial institutions.
  • In the meantime, he finds safety scorecards impactful. His favourite is AI Lab Watch, a one-person project that reads governance documents and model cards, and yields believable conclusions. If Rohin had to make a career change, scaling something like this would be one of his top choices.

Rohin also believes the safety community should model frontier companies as “apathetic” rather than “adversarial”: they do care, but building a model like Gemini involves so many interacting constraints that changing one can have unpredictable knock-on effects on others, making safety changes complex to land.

Pre-deployment evaluations are overemphasised and sometimes counterproductive

Rohin argues the safety community’s emphasis on pre-deployment evaluations imposes real costs without proportional benefits:

  • Tying evaluations to launch schedules creates strong incentives to make them fast rather than good.
  • AI progress is continuous enough that evaluating the previous model (with a safety buffer) is usually sufficient to determine if the next one is safe to deploy.
  • With misalignment, the threat is more about internal deployment (where models have permissions and access to weights) than external deployment — and pre-deployment evaluations don’t address that.

However, Rohin does acknowledge one exception: a model’s propensity to misbehave can shift a lot between versions, so evaluations for present-day harms (helping write suicide notes, inciting violence) should run before every launch.
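The safety-buffer reasoning for capabilities in the list above can be illustrated with a minimal sketch. It assumes capability can be summarised as a single score where higher means more dangerous; the function name, the buffer, the threshold, and the numbers are all hypothetical and are not taken from the episode or from any company's actual process.

```python
def next_model_within_threshold(previous_score, safety_buffer, danger_threshold):
    """Return True if the previous model's score plus a buffer stays below the threshold.

    previous_score: measured score of the already-evaluated previous model (hypothetical scale).
    safety_buffer: assumed upper bound on how far the next model could improve.
    danger_threshold: score at which extra mitigations would be required.
    """
    return previous_score + safety_buffer < danger_threshold


# Made-up numbers: a previous model well below the threshold leaves room for the buffer.
comfortable = next_model_within_threshold(0.40, 0.20, 0.80)

# Made-up numbers: a previous model close to the threshold does not.
too_close = next_model_within_threshold(0.70, 0.20, 0.80)
```

The check only makes sense if progress between versions is smooth enough for the buffer to be a credible bound, which is why it would not cover the propensity evaluations described as the exception.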

Chain-of-thought monitoring will last longer than most people think

While many worry chain-of-thought monitoring could break down within a year or two, Rohin’s median estimate is around 4–5 years:

  • Current hardware forces models to be “wide but shallow” — they process many things in parallel but can’t do many sequential steps within a single forward pass. This means most of their reasoning must happen through outputting tokens that humans can read.
  • Pretraining is by far the most powerful force shaping models, and it makes them extremely good at human-language reasoning. Reinforcement learning is far too inefficient to build an entirely new alien language from scratch any time soon.
  • Current ‘continuous chain of thought’ approaches likely preserve readability: you can inspect the top few tokens at each step and verify the model isn’t smuggling hidden information in the tail of the distribution.
  • Rohin and colleagues have published a paper on how to calculate “opaque serial depth” — a measure of how much hidden reasoning a given architecture allows — which could become an industry-wide governance standard.
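The top-few-tokens check mentioned in the list above can be illustrated with a minimal sketch. It assumes each continuous chain-of-thought step can be represented as a probability distribution over tokens; the function names, the value of `k`, the tail threshold, and the toy numbers are hypothetical and are not drawn from the episode or the paper.

```python
def tail_mass(probs, k):
    """Return the probability mass that lies outside the k most likely tokens."""
    top = sorted(probs, reverse=True)[:k]
    return max(0.0, 1.0 - sum(top))


def flag_opaque_steps(steps, k=3, max_tail=0.05):
    """Return indices of steps whose tail mass exceeds max_tail.

    A flagged step is one where reading only the top-k tokens would miss
    a non-trivial share of what the distribution carries.
    """
    return [i for i, probs in enumerate(steps) if tail_mass(probs, k) > max_tail]


# Toy distributions over a five-token vocabulary (made-up numbers).
steps = [
    [0.90, 0.05, 0.03, 0.01, 0.01],  # concentrated: top 3 tokens cover almost everything
    [0.30, 0.25, 0.20, 0.15, 0.10],  # spread out: a quarter of the mass sits in the tail
]
flagged = flag_opaque_steps(steps)
```

With these toy inputs only the second step is flagged, since its three most likely tokens account for just 75% of the probability mass.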

Accelerate AI governance, rather than alignment

In an intelligence explosion scenario, Rohin thinks safety research will largely keep pace with capabilities — but governance won’t:

  • Empirical alignment research looks very similar to capabilities research, so the same AI systems that accelerate capabilities should accelerate safety work roughly equally.
  • Governance requires fundamentally different skills and institutions. It’s much less clear that AI systems will be good at it, and that governments and institutions will even want to use AI to accelerate their work.
  • The AGI-adjacent nonprofits and think tanks will likely adopt AI tools — but they’re a small fraction of overall governance. National governments, which are rule-bound and slow to change, are the actors that ultimately need to keep up.

For these reasons, figuring out how to accelerate governance is another of Rohin’s top picks if he had to change careers to another neglected and critical problem within AI safety.

The intelligence explosion will (likely) be a gradual, multi-year process

Rohin believes an intelligence explosion is likely to happen eventually, but pushes back on common reasoning about when and how fast. He distinguishes between AI as a tool (like the microscope, which makes researchers more productive, but doesn’t change the growth regime) and AI as population (new idea-generators that can be freely copied, which could drive hyperbolic growth).

Most current AI progress is still in the ‘tool’ category. The superhuman coder, for example, only generates ideas within coding — one part of AI R&D. Historical analogues like chip design show that even dramatic tool improvements (from hand-drawing circuits to computer-aided design (CAD) software) didn’t trigger explosive growth.

Also, Rohin’s work with Epoch AI on “A Rosetta Stone for AI benchmarks” — stitching benchmark scores into one capability measure from 2020 to now — finds progress still looks steady. He expects the first automated AI researchers to be expensive and inefficient before costs come down — which implies years, not months, from first automation to full explosion.

How external researchers can actually influence AI companies

Rohin’s talk “How to theorize so empiricists will listen” distils his advice for people who want their safety research to be adopted:

  • Talk to someone at a company first: ask if what you’re planning will actually be useful. Surprisingly few people do this, and it’s probably the single most impactful step.
  • Propose solutions, build evaluations, or create datasets: companies are too busy to absorb research that doesn’t directly solve a problem or provide a usable metric.
  • Evaluate your solution on the metrics companies care about: not just “does it work?” but also compute cost, latency, and implementation complexity. Solutions spanning many teams or abstraction layers are much harder to adopt.
  • Use the simplest possible method: academia rewards complexity, but companies reward reliability.

Some of the most useful external research so far, in Rohin’s view, has been the AI control work from Redwood Research on the distinction between trusted and untrusted models, and frameworks for evaluating control protocols.

Google DeepMind needs implementers, not just researchers

Rohin says the bottleneck on his team has shifted from ‘figure out the ideal approach’ to ‘land the obvious useful thing inside a complicated organisation.’

The bottleneck is implementation, not ideas. Building safety features into Gemini involves navigating around a hundred interacting constraints, where changing one thing has unpredictable knock-on effects on others. This means more demand for software engineering relative to ML research, though conceptual ability and research taste still matter.

Google DeepMind is hiring less than Anthropic or OpenAI right now, but has some open roles. In general, Rohin is pro working for company safety teams: “Company safety teams are probably the biggest force for what actually happens on a technical level to make AI systems safe.”

Highlights

The limitations of safety and alignment commitments

Rob Wiblin: So you are not enthusiastic about AI companies making firm safety or alignment commitments in response to public pressure or political pressure, something that has been happening over the last couple of years. Why is that?

Rohin Shah: Yeah, I think it’s worth being a little bit clear about what we mean by “commitments” here. When I think of something called a commitment, I imagine that you write down some sort of action that you are planning to take now and into the future. Maybe it will only start in the future. And you’re saying, “We are not going to change our mind on this. We are going to do this in the future as well.” Sort of like tying yourself to the mast and making sure you’re going to do that.

My objection to this is really just that the research continues to change, the actions that we think are the best actions to take change over time — and given that, it’s just not actually a good idea to tie yourself to the mast.

I’ll give you an example. Maybe two or three years ago, people used to be pretty into the idea of adding more data in pretraining that’s about alignment research — think research papers on safety and alignment, LessWrong blog posts that talk about AI alignment, stuff like that. The idea was, the more of this data you put into pretraining time, the smarter the AI will be about alignment in particular, which then allows you to use the AI system to help you with your alignment research.

I would say that nowadays, the opinion is more the exact opposite of that; instead we would rather filter out that sort of data from the pretraining dataset for two reasons:

  • One, it makes it less likely that the AI system learns that there is this persona of a malicious AI that it maybe could adopt after some poorly done post-training or some poorly chosen prompt during deployment.
  • The second reason is maybe we don’t want our AI systems to know in great detail all of the mitigations that we’re planning to put in place, because that makes it easier for it to evade it if it is misaligned.

It would be pretty bad if we tied ourselves to the mast of “we’re going to throw in lots of alignment data at pretraining time” two or three years ago.

AI companies are apathetic, not adversarial

Rob Wiblin: People who are broadly worried about the direction that everything is going I think more often than not feel themselves to be in primarily an adversarial relationship with AI companies. You think that that’s not the case, and that in fact the companies are in an apathetic relationship with that group of people. Explain that.

Rohin Shah: I guess maybe I would say that it’s better for them to model the company as apathetic. I think you can have a more detailed model, which isn’t actually apathetic, but it’s maybe a bit more complicated.

To go into the more detailed model, I would say that building an artifact like Gemini is very, very difficult. The main reason being you have to produce this one thing, this single set of model weights, deployed using a single serving stack. And it has to satisfy so many constraints, and there are interaction effects between all of these constraints.

So there’s stuff like, does it do the instruction following right? Is it doing safety right? Has the architecture been chosen in a way that enables fast inference? Does it speak multiple languages? There’s probably 100 such things. And it is the case that if you make one change to the process with the intent of making one of these things better — say, safety — it will have random downstream knock-on effects on other constraints that you totally did not anticipate.

Rob Wiblin: This fragility of the process, doesn’t that mean that it’s actually going to be quite hard to respond quickly in real time to any new safety concerns? You or anyone else might be saying, “We should change this part,” and it’d be like, “No, you’re going to break this entire Rube Goldberg machine that we’ve built to make this product.”

Rohin Shah: I mean, to some extent, yes. You know, DeepMind was founded with this mission. That was one of the reasons that Demis [Hassabis] founded it. And DeepMind has had an AGI safety team since well before I joined the company. They didn’t need to have it. It’s a bunch of money that they’re spending that they didn’t really need to spend. So they definitely do care. But in fact, there are many, many, many things that we could do to improve safety, but only a few things that we can do at any given time, because it takes quite a long time to do it.

Now, do we have tools for reacting to safety problems that we see in deployment? Yes. Mostly, these do not involve changing the model weights, because that’s the thing that is most constrained. That’s the artifact that you have to produce one of it, and it’s got a gazillion constraints imposing it. But we have these out-of-the-model filters that we can change a bit more. We can target them to specific prompts or specific problems, so those are a lot easier to update over time. But not everything that you want to do is going to be solvable with an out-of-model filter.

I guess going back to the original question you mentioned, about are companies adversaries versus apathetic, basically my take is that because of this huge interaction of various constraints, really the easiest way to model at least GDM, but probably most AI companies, is they can only do a few things at a time for safety. And they will do them; they do care — but maybe you should just think of them as apathetic. You should really try to really lay out, “Here’s exactly what you need to do. Here’s why it’s not going to hurt any of these other constraints that you care about.” Also just because everyone is busy.

Could a model jump to unexpectedly capable or evil between generations?

Rob Wiblin: Is there any good reason to think that we might, in the next couple of years, see a huge jump from one cycle to the next, such that the model could become way more unexpectedly capable or way more unexpectedly evil than the previous one?

Rohin Shah: Unexpectedly capable seems pretty unlikely. I think we’ve just seen enough examples of AI development now to say that, no, in fact, AI development progresses fairly smoothly and continuously. I do think that in the future, you could definitely see an intelligence explosion, in which case the progress will go much faster with respect to calendar time. I think it will still actually be pretty smooth and gradual with respect to inputs like compute and labour; it’s just that in an intelligence explosion, you get a much larger increase in especially labour, but probably also compute, and that ends up making things go very fast with respect to calendar time.

But there’s still this general property that you can, given some amount of compute and labour that you expect to be spending over the next however long, have some decent sense of how much progress is going to be made on the capability side.

You also asked about whether the AI system might become much more evil. I think that’s one that could, in fact, change pretty significantly between models, just because it’s a somewhat more contingent property of exactly how you do post-training, and small changes to it could have big effects on that. So there I think it is more important that, to the extent that your safety case depends on specifically the model not being evil in some way, you actually do in fact need to do the pre-deployment evals to check whether that’s the case.

And this is in fact what we do. Not exactly this, but we do in fact do a lot of pre-deployment evals for safety right now in terms of whether the model has a propensity to do bad things. This tends to be more in present-day safety type stuff, things like: will the model help you write suicide notes? Will the model incite violence? And those we’d run before any launch, and if the numbers are sufficiently bad, we won’t launch that model.

Governance might be a bigger bottleneck than alignment

Rohin Shah: If you believe, as I do, that the sort of prosaic alignment research — where you look at the stuff that’s going wrong now, do a little bit of forecasting what stuff is going to go wrong in the future with the next few models over the next some amount of time period, a year perhaps, and then you can do fairly normal ML research in order to address it — if that’s your view of how alignment research can progress, this research looks very, very, very similar to capabilities research.

So if the AI is accelerating capabilities research a tonne, you should be able to, as long as you’re willing to spend compute on it, take that same AI system and accelerate safety and alignment work in the same way.

There are some disanalogies between capabilities research and alignment research. I think the disanalogies are particularly large today and will become smaller in the future as you get closer to these sorts of AIs that are capable of doing this automated research. And by the time you get to that point, there are still disanalogies, but I think they’re relatively small. And by default, you shouldn’t really expect a big difference in the ability to do automated alignment research versus automated capabilities research.

So there’s some stuff you could do to prepare, but mostly I feel like we just don’t know what it’s going to look like. It will be so much more efficient to focus on other things that we can do today and then just adapt once we get to the point where the AIs are capable of doing this.

One thing I do want to flag is that there’s another worry you could have, which is like, sure, all the technical safety and alignment research gets accelerated, but that’s not the only thing that you need in order to get good outcomes. You also need governance to go better, so you also need to accelerate governance.

Now, governance has a totally different set of skills than capabilities or safety or alignment research. So it’s really much less clear that AI systems that drastically accelerate capabilities research will also drastically accelerate governance. In addition to the AIs just not having the capabilities to do that, which is one possibility, there’s also just like, will we as a society be willing to use AIs to accelerate governance? I think possibly not — because capabilities researchers want to actually accelerate themselves with AIs, and I don’t get the sense that governance people will want to do this.

So this accelerating governance work is one of my top two things that I would do if I had to make a career change right now to something else: figure out what we need to do in order to accelerate governance and start doing the work that we need to do it.

How external research has impacted Google DeepMind’s work

Rob Wiblin: What external research has been most useful to GDM so far?

Rohin Shah: Probably the one I would point to most is the AI control work from Redwood Research. I think we were always planning to monitor our AI systems, it’s not like the idea of monitoring was new to us, but the specific conceptual frameworks they brought to how you might evaluate how well this works, specifically the distinction between trusted and untrusted models, I think that was quite good.

The first paper they published on it showed how a variety of different control protocols, how you can evaluate their safety and their usefulness and use this to decide which one you should be doing. I think that’s influenced me quite a lot on how exactly I think about the control work. I think that’s probably the most obvious example.

Rob Wiblin: So this is the Buck Shlegeris and Ryan Greenblatt episodes earlier in the year — Buck Shlegeris in particular, talking about the AI control agenda.

I feel like that episode didn’t make quite as much of a splash as I was hoping or expecting at the time. But the amount that people in the industry keep constantly referring back to it I think means that if people didn’t listen to it at the time, because for whatever reason it just didn’t sound quite exciting enough, I think give it another look, maybe go back and have a listen. Because I think it is very important foundational work that is only getting more relevant.

Rohin Shah: Yeah, definitely. I would say that control is, in the medium term, probably how we are going to argue that Gemini is safe, at least for misalignment and internal deployments. I think probably via control style arguments. So yeah, definitely an important area to be aware of.

The most in-demand roles at GDM

Rob Wiblin: I imagine a lot of people would love to get a job at GDM, and even people at other AI companies already might be interested in considering switching, given the advances that GDM has made in its models in recent years. What sort of roles are you finding hardest to fill? And what sort of skills are maybe hardest to find in the labour market?

Rohin Shah: I think I’ll go back again to the theme I’ve had throughout, of the challenge being more in implementing stuff rather than in figuring out what stuff we need to do. We have a lot of ideas, we know what we need to do. Implementing it ends up being harder, because there’s so much stuff you have to check and make sure that you’re not hurting it.

And as a result, I think especially over the last year, maybe two years, I think we’ve had more of a need for people who just want to do the obvious thing and land it, as opposed to people who want to figure out the ideal, optimal thing and write a cool research paper about it.

Rob Wiblin: We’re talking about the AGI safety and alignment folks, right?

Rohin Shah: That’s right, I am talking primarily about the AGI safety and alignment team. I think focus on just doing the obvious stuff, a focus on implementation rather than research. I think in practice this also means more of a focus on software engineering relative to machine learning research.

I do still think that we do care about conceptual ability, research taste, ML engineering skills. They do come up when you’re doing this sort of implementation. You need to test how good your system is. That means you need to build good evaluations, you need to not overfit to them, you need to be a little bit careful about that. So it’s not like those skills are irrelevant. I just think that there’s more of a focus on things like getting things done, software engineering, implementation now, relative to even just a year ago, but especially relative to two years ago.
