Elliott Thornley

Postdoc @ MIT
1152 karma · Joined · Working (0-5 years)
www.elliott-thornley.com

Bio

I work on AI alignment. Right now, I'm using ideas from decision theory to design and train safer artificial agents.

I also do work in ethics, focusing on the moral importance of future generations.

You can email me at [email protected].

Comments
93

Thanks! I think the difference between our proposal and organizations paying workers to report themselves is the order that things come in. In our case, it's payment first and reporting later. The idea is that you could give a risk-averse AI $1,000 (say) and tell it that it can spend the money on anything it wants. Then so long as the risk-averse AI believes that the situation is real, it will spend the money on what it actually wants (e.g. making paperclips) and thereby reveal to you what it actually wants.

I think a human analogy would be something like: 

  • You have an employee that claims that their only terminal value is the success of the company.
  • You want to find out whether this is true.
  • So you give the employee a no-strings-attached $10m.
  • If they give that money back to the company and continue working just as hard, that's strong evidence that they really do only terminally value the success of the company.
  • If they spend the money on something else, that's strong evidence that they don't only value the success of the company.

  1. Thanks!
  2. I'm unsure whether we can successfully train ASIs to be reliably risk-averse, including far OOD. Our claim is just that the chances of success are high enough to make risk aversion worth pursuing as a line of defense. That's the case we try to make in section 10. See also my reply to Ryan's comment. I also think our chances of success are a bit higher for AIs that aren't yet ASIs, and if we succeed in making them risk-averse I think they could help a lot with aligning any later-arising ASIs, by doing this sort of stuff.

Replied over there! There are also a few more comments in the thread now.

I’m surprised you say deals would be way less useful. Can you say more? Here’s my current sense of things:

In favor of keeping AIs risk-neutral

  • Risk-neutral AIs are more likely to attempt takeover. And we want them to attempt takeover, because (i) they’re very unlikely to succeed, and (ii) if we catch them attempting takeover, we can use them to reduce future takeover risk, by doing research on them, using the evidence of their takeover attempt to motivate international coordination, etc. 

Worries about that:

One worry is that we need the risk-neutral AIs to be somewhat likely to successfully take over, otherwise they wouldn’t even attempt takeover and we couldn’t catch them. Taking the numbers from Fabien’s post (which are illustrative but don’t seem off by OOMs), their chance of successful takeover has to be greater than 1%. That’s a fairly big risk we’d be taking in the near term.

It also seems like we’ll only get really convincing evidence of a takeover attempt (the kind that might motivate international coordination) if the attempt gets pretty far along. And conditional on the attempt getting pretty far along, the AI’s chances of successful takeover are likely even higher.

One benefit of risk-averse AI (if it works) is that we don’t have to take that near-term risk.
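A minimal sketch of why risk aversion raises the bar for attempting takeover. Everything here is an illustrative assumption: the function names, the concave utility function, and the payoff numbers are hypothetical, and the numbers are chosen so that the risk-neutral threshold comes out at 1%, not taken from Fabien's post.

```python
import math


def linear_utility(x: float) -> float:
    """Risk-neutral agent: utility is linear in resources (assumption)."""
    return x


def concave_utility(x: float, scale: float = 1_000.0) -> float:
    """Risk-averse agent: bounded, concave utility (hypothetical choice)."""
    return 1.0 - math.exp(-x / scale)


def min_success_prob(utility, sure_payoff: float, takeover_payoff: float,
                     failure_payoff: float = 0.0) -> float:
    """Smallest success probability p at which attempting takeover has
    expected utility at least as high as taking the sure payoff.

    Solves p * u(win) + (1 - p) * u(lose) >= u(sure) for p.
    """
    u_sure = utility(sure_payoff)
    u_win = utility(takeover_payoff)
    u_lose = utility(failure_payoff)
    return (u_sure - u_lose) / (u_win - u_lose)


# Hypothetical payoffs: a sure 1,000 from cooperating vs. 100,000 from a
# successful takeover and 0 from a failed one.
risk_neutral_threshold = min_success_prob(linear_utility, 1_000.0, 100_000.0)
risk_averse_threshold = min_success_prob(concave_utility, 1_000.0, 100_000.0)

print(f"risk-neutral attempts takeover if p > {risk_neutral_threshold:.3f}")
print(f"risk-averse attempts takeover if p > {risk_averse_threshold:.3f}")
```

Under these assumed numbers, the risk-neutral agent attempts takeover whenever its chance of success exceeds 1%, while the risk-averse agent needs a chance of success above roughly 63%. The exact figures depend entirely on the assumed utility function and payoffs.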

And then with regard to reducing future takeover risk, I agree that getting to do research on confirmed-misaligned AIs would be a big benefit (but we can also get that with risk-averse AIs). On international coordination, I think most caught takeover attempts probably wouldn’t leave evidence clear enough to motivate it. I think many of the points from this post apply. In reply to evidence of a takeover attempt, people will likely say: shutting down would be costly, AIs do weird stuff sometimes, it was an isolated screw-up by that specific AI company, it’s a doomer plot, etc. And the news might not even get out of the lab. Also it seems fairly likely that AIs can try to take over in a way that would look pretty ambiguous if they failed. They could invent some justification for why their actions were actually in humanity’s best interest, etc.

(Side note: I’m interested in the implications of ‘We want near-future AIs to attempt takeover.’ If that’s true, it seems like the AI safety community should be doing radically different stuff from what it’s currently doing.)

In favor of making AIs risk-averse

  • Risk-averse AIs are less likely to attempt takeover in the near term.

If that were the only benefit, then I think it’d be pretty unclear which of risk-neutral AIs and risk-averse AIs is better. But risk-averse AIs would likely have lots of other benefits too, potentially letting us reduce future takeover risk by a lot. We can pay them to:

  • Reveal misalignment.
    • One idea here is that we give risk-averse AIs a small amount to spend on whatever they want. Then if they spend it on making paperclips (etc.), we’ve got clear evidence of misalignment. We can then do research on these misaligned AIs and use the evidence to motivate international coordination, etc.
    • This evidence of misalignment we get from risk-averse AIs seems about as good for enabling research and motivating international coordination as the evidence we’d get from risk-neutral AIs attempting takeover. And to get this evidence from risk-averse AIs, we don't need to bait them into an (at least somewhat likely to succeed) takeover attempt and hope that we catch them.
  • Reveal collusion signals.
  • Stop sandbagging on easy-to-evaluate tasks.
  • Identify security vulnerabilities.
  • Monitor untrusted AIs.
  • Do alignment research. (Hard to evaluate, of course. We say a bit about this in section 4.2.)
  • Plus other stuff mentioned here, here, and here.

Taken together, all this stuff we can buy from risk-averse AIs seems much better for reducing future takeover risk than catching risk-neutral AIs in a takeover attempt. And we can buy all this stuff from risk-averse AIs without running a significant risk that AIs actually succeed in their takeover attempt.

(I'll reply to the generalization point in another comment.)

Yep, when going into AI safety you should take into account p(you cause doom) along with p(you avert doom).

That's my point! p(influencing the outcome positively) is the right thing to focus on, not p(doom).

Nice post! Miscellaneous thoughts:

if individuals have VNM utility functions, and if the Pareto principle holds over groups, then a version of utilitarianism must be true.

Harsanyi's theorem also requires that the social planner's preferences satisfy the VNM axioms.
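For reference, here is one standard way of stating the theorem with that extra condition made explicit. The notation is an assumption introduced for illustration, not taken from the post.

```latex
% Harsanyi's aggregation theorem (sketch of a standard statement).
% Notation (u_i, W, a_i, b) is introduced here for illustration.
Let $\succsim_1, \dots, \succsim_n$ be the individuals' preferences and
$\succsim$ the social planner's preferences over a common set of lotteries.
Suppose that
\begin{enumerate}
  \item each $\succsim_i$ satisfies the VNM axioms and is represented by
        an expected utility function $u_i$;
  \item $\succsim$ also satisfies the VNM axioms and is represented by
        an expected utility function $W$;
  \item (Pareto indifference) if $p \sim_i q$ for every $i$, then $p \sim q$.
\end{enumerate}
Then there exist real weights $a_1, \dots, a_n$ and a constant $b$ such that,
for every lottery $p$,
\[
  W(p) = \sum_{i=1}^{n} a_i \, u_i(p) + b .
\]
Stronger Pareto conditions are needed to guarantee that the weights $a_i$
are non-negative or strictly positive.
```

Condition 2 is the one the quoted sentence leaves out: without it, VNM-rational individuals plus Pareto do not pin down a weighted-sum form for the social ranking.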

Not many philosophical proofs have been written

I think this all depends on what you mean by 'many'. I'd guess maybe 10% of analytic philosophy papers include a proof of some kind, so that at least hundreds of proofs are published every year. And in a sense, every valid (spelled-out) argument is a proof.

I agree that the Claude proofs are pretty bad. The Arrhenius point is fairly obvious: what Arrhenius means by 'theories' in that paper is weak orders on populations, so if after taking into account moral uncertainty you still have a weak order, then the impossibility theorem still applies. (And later Arrhenius theorems relax both completeness and transitivity, so even departing from a weak order doesn't get you off the hook.)

Claude makes this kind of point, but first it introduces an Agreement axiom that the proof never uses. Claude later comes close to admitting this ('Agreement plays almost no role'), tries to walk it back ('But Agreement rules out the escape route...'), and then fully admits it ('the fundamental impossibility holds regardless'). 

Which Claude model did you use? Did you use extended thinking? The flip-flopping above makes me think there was no extended thinking, and maybe a model with extended thinking would do better. (Though not much better I'd guess. I've found LLMs to be surprisingly bad at philosophy, even just the 'understanding the view and its implications' parts.)

I didn't bother checking the second population ethics proof but it looks sloppy:

Axiom (Sufficient Comparability). For any pair of populations A, B that differ by at most some fixed bounded amount (e.g., adding or removing one person, or changing one person's welfare level by a small amount), M(μ) must rank A and B (no incomparability for "local" comparisons).

Doesn't any pair of populations "differ by at most some fixed bounded amount"? What is Claude doing including 'e.g.'s in its formal statement of axioms?

With some additional effort, present-day LLMs might be capable of coming up with a good novel proof. If not, then it will likely be possible soon. Most kinds of moral philosophy might be difficult for AIs, but proofs are one area where AI assistance seems promising.

Yes, you'd think so given that they've gotten so good at math! But when I've tried using LLMs to help with formal philosophy, I've found them to be really surprisingly bad, even at parts that seem very math-loaded (e.g. inventing proofs, following arguments, grasping views and their implications, coming up with counterexamples, etc.). I'm not sure why this is. I guess part of it is that it's hard to do RLVR on philosophy in the same way that you can do RLVR on math, but naively I'd expect more generalization from math to formal philosophy. Maybe the following is a factor: pretraining data doesn't contain that much bad mathematical reasoning, but it contains a huge amount of bad philosophical reasoning.

I said a little in another thread. If we get aligned AI, I think it'll likely be a corrigible assistant that doesn't have its own philosophical views that it wants to act on. And then we can use these assistants to help us solve philosophical problems. I'm imagining in particular that these AIs could be very good at mapping logical space, tracing all the implications of various views, etc. So you could ask a question and receive a response like: 'Here are the different views on this question. Here's why they're mutually exclusive and jointly exhaustive. Here are all the most serious objections to each view. Here are all the responses to those objections. Here are all the objections to those responses,' and so on. That would be a huge boost to philosophical progress. Progress has been slow so far because human philosophers take entire lifetimes just to fill in one small part of this enormous map, and because humans make errors so later philosophers can't even trust that small filled-in part, and because verification in philosophy isn't much quicker than generation.

I'm not sure, but I think I may also have a different view from you on which problems are going to be bottlenecks to AI development. For example, I think there's a big chance that the world would steam ahead even if we don't solve any of the current (non-philosophical) problems in alignment (interpretability, shutdownability, reward hacking, etc.).

try to make them "more legible" to others, including AI researchers, key decision makers, and the public

Yes, I agree this is valuable, though I think it's valuable mainly because it increases the probability that people use future AIs to solve these problems, rather than because it will make people slow down AI development or try very hard to solve them pre-TAI.

I don't think philosophical difficulty is that much of an increase to the difficulty of alignment, mainly because I think that AI developers should (and likely will) aim to make AIs corrigible assistants rather than agents with their own philosophical views that they try to impose on the world. And I think it's fairly likely that we can use these assistants (if we succeed in getting them and aren't disempowered by a misaligned AI instead) to help a lot with these hard philosophical questions.