A Quick List of Some Problems in AI Alignment As A Field

Nicholas Kross

This is a linkpost for https://www.thinkingmuchbetter.com/main/alignment-field-problems-2022/

1. MIRI as central point of failure for... a few things...

For the past decade or more, if you read an article saying "AI safety is important", and you thought, "I need to donate or apply to work somewhere", MIRI was the default option. If you looked at FLI or FHI or similar groups, you'd say "they seem helpful, but they're not focused solely on AI safety/alignment, so I should go to MIRI for the best impact."

2. MIRI as central point of failure for learning and secrecy.

MIRI's secrecy (understandable) and their intelligent and creatively-thinking staff (good) have combined into a weird situation: for some research areas, nobody really knows what they've tried and failed/succeeded at, nor the details of how that came to be. Yudkowsky did link some corrigibility papers he labels as failed, but neither he nor MIRI have done similar (or more in-depth) autopsies of their approaches, to my knowledge.

As a result, nobody else can double-check that or learn from MIRI's mistakes. Sure, MIRI people write up their meta-mistakes, but that has limited usefulness, and people still (understandably) disbelieve their approaches anyway. This leads either to making the same meta-mistakes (bad), or to blindly trusting MIRI's approach/meta-approach (bad because...)

3. We need more uncorrelated ("diverse") approaches to alignment.

MIRI was the central point for anyone with any alignment approach, for a very long time. Recently-started alignment groups (Redwood, ARC, Anthropic, Ought, etc.) are different from MIRI, but their approaches are correlated with each other. They all relate to things like corrigibility, the current ML paradigm, IDA, and other approaches that e.g. Paul Christiano would be interested in.

I'm not saying these approaches are guaranteed to fail (or work). I am saying that surviving worlds would have, if not way more alignment groups, definitely way more uncorrelated approaches to alignment. This need not lead to extra risk as long as the approaches are theoretical in nature. Think early-1900s physics gedankenexperiments, and how diverse they may have been.

Or, if you want more hope and less hope at the same time, look at how many wildly incompatible theories have been proposed to explain quantum mechanics. A surviving world would have at least this much of a Cambrian explosion in theories, and would also be better at handling it than we are, in real life, at handling the actual list of quantum theories (in the absence of better experimental evidence).

Simply put, if evidence is dangerous to collect, and every existing theoretical approach is deeply flawed along some axis, then let schools proliferate with little evidence, dammit! This isn't psych, where stuff fails to replicate and people keep doing it. AI alignment is somewhat better coordinated than other theoretical fields... we just overcorrected to putting all our eggs in a few approach baskets.

(Note: if MIRI is willing and able, it could continue being a/the central group for AI alignment, given the points in (1), but it would need to proliferate many schools of thought internally, as per (5) below.)

One problem with this ^[1] is that the AI alignment field as a whole may not have the resources (or the time) to pursue this hits-based strategy. In that case, AI alignment would appear to be bottlenecked on funding, rather than talent directly. That's... news to me. In either case, this requires more fundraising and/or more money-efficient ways to get similar effects to what I'm talking about. (If we're too talent-constrained to pursue a hits-based strategy, it's even more imperative to fix the talent constraints first, as per (4) below.)

Another problem is whether the "winning" approach might come from deeper searching along the existing paths, rather than broader searching in weirder areas. In that case, it could maybe still make sense to proliferate sub-approaches under the existing paths. The rest of the points (especially (4) below) would still apply, and this still relies on the existing paths being... broken enough to call "doom", but not broken enough to try anything too different. This is possible.

4. How do people get good at this shit?

MIRI wants to hire the most competent people they can. People apply, and are turned away for not being smart/self-taught/security-mindset enough. So far so good.

But then... how do people get good at alignment skills before they're good enough to work at MIRI, or whatever group has the best approach? How do they get good enough to recognize, choose, and/or create the best approaches (which, remember, we need more of)?

Academia is loaded with problems. Existing orgs are already small and selective. Independent research is promising, yet still relies on a patchwork of grants and stuff. By the time you get good enough to get a grant, you have to have spent a lot of time studying this stuff. Unpaid, mind you, and likely with another job/school/whatever taking up your brain cycles.

Here's a (failure?) mode that I and others are already in, but might be too embarrassed to write about: taking weird career/financial risks in order to obtain the financial security to work on alignment full-time ^[2]. Anyone more risk-averse (good for alignment!) might just... work a normal job for years to save up, or modestly conclude they're not good enough to work in alignment altogether. If security mindset can be taught at all, this is a shit equilibrium.

Yes, I know EA and the alignment community are both improving at noob-friendliness. I'm glad of this. I'd be more glad if I saw non-academic noob-friendly programs that pay people, with little legible evidence of their abilities, to upskill full-time. IQ or other tests are legal, certainly in a context like this. Work harder on screening for whatever's unteachable, and teaching what is.

5. Secret good ideas + collaboration + more work needed = ???

The good thing about having a central org to coordinate around is that it reconciles the conflicting requirements of "intellectual sharing" and "infohazard secrecy". One org where the best researchers go, open on the inside, closed to the outside. Good good.

But, as noted in (1), MIRI has not lived up to its potential in this regard ^[3]. MIRI could kill two birds with one stone, and act as a secrecy/collaboration coordination point while also having multiple small internal teams working on disparate approaches and thus having a high absolute headcount (helping (5) and (4)) while avoiding many issues common to big gangly organizations.

Then again, Zvi and others have written extensively on why big organizations are doomed to cancer and maybe theoretically impossible to align. Okay. Not promising. Then maybe we need approaches that get similar benefits (secrecy, collaboration, coordination, many schools) without making a large group. Perhaps a big closed-door annual conference? More MIRIx chapters? Something?

6. The hard problem of smart people working on a hard problem.

Remember "The Bitter Lesson"? Where AI researchers go for approaches using human expertise and galaxy-brained solutions, instead of brute scale?

Sutton's reasoning for this is (at least partly) that researchers have human vanity. "I'm a smart person, therefore my solution should be sufficiently-complicated." ^[4]

I think similar reasons of vanity (and related social status) are holding back some AI alignment progress.

I think people are afraid to suggest sufficiently weird/far-out ideas (which, recall, need to be quite different from existing flawed approaches), because they have a mental model of semi-adequate MIRI trying and failing something, and then not prioritizing writing-up-the-failure (or keeping the failure secret for some reason).

Sure, there are good security-mindset and iffy-teachability reasons why many new ideas can and should be rejected on-sight. But, as noted in (4), these problems should not be impossible to get around. And in actual cybersecurity and cryptography, where people are presumably selected at least a tad for having security mindset, there's not exactly a shortage of creative ideas and moon math solutions. Given our field's relatively-high coordination and self-reflection, surely we can do better?

This relates to a point I've made elsewhere, that in the face of lots of things not working, we need to try more hokey, wacky, cheesy, low-hanging, "dumb" ideas. I'm disappointed that I couldn't find any LessWrong post suggesting something like "Let's divvy up team members where each one represents a cortex of the brain, then we can divide intellectual labor!". The idea is dumb, it likely won't work, but surviving worlds don't leave that stone unturned. If famously-wacky early LessWrong didn't have this lying around, how do I know MIRI hasn't secretly tried and failed at it?

Related to division of intellectual labor: I also think Yudkowsky's example of Einstein, in the Sequences, may make people afraid to offer incremental ideas, critiques, solutions, etc. "If I can't solve all of alignment (or all of [big alignment subproblem]) in one or two groundbreaking papers, like Einstein did with Relativity, I'm not smart enough to work in alignment." So, uh, don't be afraid to take even half-baked ideas to the level of a LaTeX-formatted paper. (If you can solve alignment in one paper, obviously do that!)

7. Concluding paragraph because you have a crippling addiction to prose (ok, same, fair).

Here's an example of something that combines many solution-ideas noted in (6). If it becomes more accepted to write ideas in bullet points, then:

- It lowers the barrier to entry for people who think better/more easily than they write.
- It lowers the mental "status-grab" barrier for people who are subtly intimidated by prose quality.
  - This, in turn, signals to more people who already don't care about status, that their blunt ideas are welcome on alignment spaces.
- It makes prose quality less able to influence readers' evaluations of idea quality, which is good for examining ideas' truth values.
- It may be easier even for people who already have little problem writing prose.

People can (and probably should) still write prose when they're more comfortable with it / when needed for other purposes (explicitly persuading people?) anyway. Making bullet points more common does not necessarily entail forcibly limiting prose.

[1] H/T my co-blogger Devin, as is the case with my articles' editing in general, and noticing gaps in my logic in particular.
[2] If you're in this situation, DM me for moral support and untested advice.
[3] Or maybe it has! We don't know! See (2)!
[4] See also, uh, that list of explanations of quantum mechanics.

Comments (10)

Rubi J. Hudson · Jun 21 2022 · 19

"their approaches are correlated with each other. They all relate to things like corrigibility, the current ML paradigm, IDA, and other approaches that e.g. Paul Christiano would be interested in."

You need to explain better how these approaches are correlated, and what an uncorrelated approach might look like. It seems to me that, for example, MIRI's agent foundations and Anthropic's prosaic interpretability approaches are wildly different!

"By the time you get good enough to get a grant, you have to have spent a lot of time studying this stuff. Unpaid, mind you, and likely with another job/school/whatever taking up your brain cycles."

I think you are wildly underestimating how easy it is for broadly competent people with an interest in AI alignment but no experience to get funding to skill up. I'd go so far as to say it's a strength of the field.

Nicholas Kross · Jun 22 2022 · 1

Point 1: I said "Different from MIRI but correlated with each other". You're right that I should've done a better job of explaining that. Basically, "Yudkowsky approaches" (MIRI) vs "Christiano approaches" (my incomplete read of most of the non-MIRI orgs). I concede 60% of this point.

Point 2: !!! Big if true, thank you! I read most of johnswentworth's guide to being an independent researcher, and the discussion of grants was promising. I'm getting a visceral sense of this from seeing (and entering) more contests, bounties, prizes, etc. for alignment work. I'm working towards the day when I can 100% concede this point. (And, based on other feedback and encouragement I've gotten, that day is coming soon.)

Guy Raveh · Jun 21 2022 · 6

I think the fact that MIRI has not managed to get even close to solving the problems it set out to solve - combined with their ideas for how the world would fare if those aren't solved - speaks very strongly against the secrecy. You termed the secrecy understandable, but I don't really think it is. It comes from the assumption that the risk from not telling anyone (and thus not having them collaborate with you) is smaller than the risk of telling them (and having someone somewhere misuse your ideas). This doesn't hold up to reality.

I upvoted your post, but would like to strongly object to your last point about writing and prose. Resorting to technical styles of communication would make it much easier for some people, while, I suspect, making it much harder for a lot more people. It's hard for research that isn't communicated clearly to make a significant contribution to the common knowledge.

cf. Mochizuki's "proof" of the ABC conjecture, which he almost entirely refused to explain, and which ended up wasting years of the mathematical community's time before eventually being refuted.

Charles He · Jun 21 2022 · 5

I don't know anything about AI safety or machine learning, and I think your comment and sentiments, as well as the post, are valuable.

However, I don't see how this is right:

"You termed the secrecy understandable, but I don't really think it is."

Given the worldview/theory of change/beliefs behind MIRI, secrecy seems justified. (There might be a massive, inexcusable defect that produced these beliefs, as well as very ungenerous alternate reasons for the secrecy.) But taking at face value the beliefs or infohazards claimed, it seems valuable ex ante.

Guy Raveh · Jun 22 2022 · 1

I can accept that it seemed to make sense at the start, but can you explain how it would still make sense now given what's happened (or, rather, didn't happen) in the meantime?

Charles He · Jun 22 2022 · 2

Basically, I don’t know. I think it’s good to start off by emphatically stating I don’t have any real knowledge of MIRI.

A consideration is that the beliefs in MIRI are still on very short timelines. A guess is that because of the nature of some work relevant to short timelines, maybe some projects could have bad consequences if made public (or just don’t make sense to ever make public).

Again, this is presumptuous, but my instinct is not to have attitudes of instructing org policy in a situation like this, because of dependencies we don’t see. (Just so this doesn’t read like a statement that nothing can ever change: I guess the change here would be a new org or new leaders, obviously this is hard).

Also, to be clear, this is accepting the premise of MIRI. IMO one should take seriously the premise of shorter timelines, like, it’s a valid belief. Under this premise, the issue here is really bad execution, like actively bad.

If your comment was alluding to shifting of beliefs away from short timelines, that seems like a really different discussion.

Guy Raveh · Jun 22 2022 · 1

No, I'm saying the nearer and more probable you think doom-causing AGI is, and the longer you stagnate on solving the problem, the less it makes sense to not let the rest of the world in on the work. If you don't, you're very probably doomed. If you do, you're still very probably doomed, but at least you have orders of magnitude more people collaborating with you to prevent it, thus increasing the chance of success.

Charles He · Jun 22 2022 · 3

I think what you said makes sense.

(As a presumptuous comment) I don’t have a positive view about the work, based on strong circumstantial evidence. However, as a sort of devil’s advocate:

There are very few good theories of change for very short timelines and one of them is build it yourself. So, I don’t see how that’s good to share.

Alignment might be entangled in this to the degree that sharing even alignment work might amount to capabilities research.

The above might be awful beliefs but I don’t see how they’re wrong.

By the way, just to calibrate so people can read if I’m crazy:

It reads like MIRI or closely related people have tried to build AGI or find the requisite knowledge, many times over the years. The negative results seem to be an update about their beliefs.

Guy Raveh · Jun 22 2022 · 3

Thanks. That kinda sorta makes sense. I still think if they're trying to build an aligned AGI, it's arrogant and unrealistic to think they can achieve it with a small group that's not collaborating with others, faster than the entire AI capabilities community, who are basically collaborating together, can.

Nicholas Kross · Jun 21 2022 · 1

Good point about the secrecy, I hadn't heard of the ABC thing. The secrecy is "understandable" to the extent that AI safety is analogous to the Manhattan Project, but less useful to the extent that AIS is analogous to... well, the development of theoretical physics.