
Eliezer Yudkowsky periodically complains about people coming up with questionable plans with questionable assumptions to deal with AI, and then either:

  • Saying "well, if this assumption doesn't hold, we're doomed, so we might as well assume it's true."
  • Worse: coming up with cope-y reasons to assume that the assumption isn't even questionable at all. It's just a pretty reasonable worldview.

Sometimes the questionable plan is "an alignment scheme, which Eliezer thinks avoids the hard part of the problem." Sometimes it's a sketchy reckless plan that's probably going to blow up and make things worse.

Some people complain about Eliezer being a doomy Negative Nancy who's overly pessimistic.

I had an interesting experience a few months ago, when I ran some beta-tests of my Planmaking and Surprise Anticipation workshop, that I think is illustrative.


i. Slipping into a more Convenient World

I have an exercise where I ask people to play a puzzle game ("Baba is You"). Normally you can move around and interact with the world to experiment and learn things; in this exercise, instead, you need to make a complete plan for solving the level, and you aim to get it right on your first try.

In the exercise, I have people write down the steps of their plan, and assign a probability to each step. 

If there is a part of the puzzle-map that you aren't familiar with, you'll have to make guesses. I recommend making 2-3 guesses for how a new mechanic might work. (I don't recommend making a massive branching tree for every possible eventuality; for the sake of the exercise not taking forever, I suggest making 2-3 branching-path plans.)
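(To make the arithmetic concrete, with illustrative numbers of my own rather than anything from a specific level: if a plan has several steps that all have to work, and you treat them as roughly independent, the per-step probabilities multiply, so even fairly high confidence per step erodes quickly.)

$$P(\text{plan works}) \approx \prod_{i=1}^{n} p_i, \qquad \text{e.g. } 0.9^5 \approx 0.59$$

Five steps at 90% each already puts the whole plan down near a coin flip.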

Several months ago, I had three young-ish alignment researchers do this task (each session was a 1-1 with just me and them).

The participants varied in how experienced they were with Baba is You. Two of them were new to the game, and completed the first couple levels without too much difficulty, and then got to a harder level. The third participant had played a bit of the game before, and started with a level near where they had left off.

Each of them looked at their level for a while and said "Well, this looks basically impossible... unless this [questionable assumption I came up with that I don't really believe in] is true. I think that assumption is... 70% likely to be true."

Then they went and executed their plan.

It failed. The questionable assumption was not true.

Then each of them said, again: "Okay, well here's a different sketchy assumption that I wouldn't have thought was likely, except that if it's not true, the level seems unsolvable."

I asked "what's your probability for that one being true?"

"70%"

"Okay. You ready to go ahead again?" I asked.

"Yep", they said.

They tried again. The plan failed again.

And, then they did it a third time, still saying ~70%.

This happened with three different junior alignment researchers, making a total of 9 predictions, which were wrong 100% of the time.

(The third guy, on the second or third time, said "well... okay, I was wrong last time. So this time let's say it's... 60%.")
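(A rough sanity check, taking the stated ~70% estimates at face value and treating the nine attempts as roughly independent: the chance of every single one failing should have been tiny.)

$$P(\text{all nine fail}) \approx (1 - 0.7)^9 = 0.3^9 \approx 2 \times 10^{-5}$$

The stated confidences were describing a much more convenient world than the one the predictions actually came from.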


My girlfriend ran a similar exercise with another group of young smart people, with similar results. "I'm 90% sure this is going to work" ... "okay that didn't work."


Later I ran the exercise again, this time with a mix of younger and more experienced AI safety folk, several of whom leaned more pessimistic. I think the group overall did better. 

One of them actually made the correct plan on the first try. 

One of them got it wrong, but gave an appropriately low estimate for themselves.

Another of them (call them Bob) made three attempts, and gave themselves ~50% odds on each attempt. They went into the experience thinking "I expect this to be hard but doable, and I believe in developing the skill of thinking ahead like this." 

But, after each attempt, Bob was surprised by how out-of-left-field their errors were. They'd predicted they'd be surprised... but they were surprised in surprising ways – even in a simplified, toy domain that was optimized for being a solvable puzzle, where they had lots of time to think through everything. They came away feeling a bit shaken up by the experience, not sure if they believed in longterm planning at all, and a bit alarmed at how many people around them confidently talked as if they were able to model things multiple steps out.


ii. Finding traction in the wrong direction.

A related (though distinct) phenomenon I found, in my own personal experiments using Baba Is You, or Thinking Physics, or other puzzle exercises as rationality training:

It's very easy to spend a lot of time optimizing within the areas I feel some traction, and then eventually realize this was wasted effort. A few different examples:

Forward Chaining instead of Back Chaining. 

In Baba Is You levels, there will often be parts of the world that are easy to start fiddling around with: manipulating them, maneuvering them into positions that look like they'll help you navigate the level. But, often, these parts are red herrings. They open up your option-space within the level... but not the parts you needed to win.

It's often faster to find the ultimately right solution if you're starting from the end and backchaining, rather than forward chaining with whatever bits are easiest to fiddle around with.

Moving linearly, when you needed to be exponential. 

Often in games I'll be making choices that improve my position locally, and clearly count as some degree of "progress." I'll get 10 extra units of production, or damage. But then I reach the next stage, and it turns out I really needed 100 extra units to survive. And the thought-patterns that would have been necessary to "figure out how to get 100 units" on my first playthrough are pretty different from the ones I was actually doing.

It should have occurred to me to ask "will the game ever throw a bigger spike in difficulty at me?", and "is my current strategy of tinkering around going to prepare me for such difficulty?".

Doing lots of traction-y-feeling reasoning that just didn't work. 

On my first Thinking Physics problem last year, I brainstormed multiple approaches to solving the problem, and tried each of them. I reflected on considerations I might have missed, and then incorporated them. I made models and estimations. It felt very productive and reasonable.

I got the wrong answer, though.

My study partner did get the right answer. Their method was more oriented around thought experiments. And in retrospect their approach seemed more useful for this sort of problem. And it's noteworthy that my subjective feeling of "making progress" didn't actually correspond to making the sort of progress that mattered.


Takeaways

Obviously, an artificial puzzle is not the same as a real, longterm research project. Some differences include:

  • It's designed to be solvable
  • But, also, it's designed to be sort of counterintuitive and weird
  • It gives you a fairly constrained world, and tells you what sort of questions you're trying to ask.
  • It gives you clear feedback when you're done.

Those elements push in different directions. Puzzles are more deliberately counterintuitive than reality is, on average, so it's not necessarily "fair" when you fall for a red herring. But they are nonetheless mostly easier and clearer than real science problems.

What I found most interesting was people literally saying the words out loud, multiple times: "Well, if this [assumption] isn't true, then this is impossible" (often explicitly adding "I wouldn't [normally] think this was that likely... but..."). And then making the mental leap all the way to "70% that this assumption is true." Low enough for some plausible deniability, high enough to justify giving their plan a reasonable likelihood of success.

It was a much clearer instance of mentally slipping sideways into a more convenient world than I'd have expected to get.

I don't know if the original three people had done calibration training of any kind beforehand.  I know my own experience doing the OpenPhil calibration game was that I got good at it within a couple hours... but that it didn't transfer very well to when I started making PredictionBook / Fatebook questions about topics I actually cared about.

I expect forming hypotheses in a puzzle game to be harder than the OpenPhil calibration game, but easier than making longterm research plans. It requires effort to wrangle your research plans into a bet-able form, and then actually make predictions about them. I bet most people do not do that.
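(For what it's worth, here is a minimal sketch of what "actually making predictions" can look like once a plan is in bet-able form. It's generic Python with made-up numbers, not tied to PredictionBook's or Fatebook's actual APIs: log each prediction as a stated probability plus an eventual outcome, then compare stated confidence against the observed hit rate, or compress the whole log into a Brier score.)

```python
from collections import defaultdict

# Hypothetical logged predictions: (stated probability, did it resolve true?).
# The numbers are made up for illustration.
predictions = [
    (0.7, False), (0.7, False), (0.7, False),
    (0.9, True), (0.5, True), (0.5, False),
]

# Bucket by stated probability and compare to the observed hit rate.
buckets = defaultdict(list)
for prob, outcome in predictions:
    buckets[round(prob, 1)].append(outcome)

for prob in sorted(buckets):
    outcomes = buckets[prob]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"stated {prob:.0%}: {len(outcomes)} predictions, {hit_rate:.0%} came true")

# Brier score: mean squared error between stated probabilities and outcomes.
# 0 is perfect; always saying 50% scores 0.25.
brier = sum((prob - outcome) ** 2 for prob, outcome in predictions) / len(predictions)
print(f"Brier score: {brier:.2f}")
```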

Now, I do predict that people who do real research in a given field will get at least decent at implicitly predicting research directions within their field (via lots of trial-and-error, and learning from mentors). This is what "research taste" is. But, I don't think this is that reliable if you're not deliberately training your calibration. (I have decades of experience passively predicting stuff happening in my life, but I nonetheless was still miscalibrated when I first started making explicit PredictionBook predictions about them). 

And moreover, I don't think this transfers much to new fields you haven't yet mastered. Stereotypes come to mind of brilliant physicists who assume their spherical-cow-simplifications will help them model other fields. 

This seems particularly important for existentially-relevant alignment research. We have examples of people who have demonstrated "some kind of traction and results" (for example, doing experiments on modern ML systems. Or, for that matter, coming up with interesting ideas like Logical Induction). But we don't actually have direct evidence that this productivity will be relevant to superintelligent agentic AI. 

When it comes to "what is good existential safety research taste?", I think we are guessing.

I think you should be scared about this, if you're the sort of theoretical researcher who's trying to cut at the hardest parts of the alignment problem (whose feedback loops are weak or nonexistent).

I think you should be scared about this, if you're the sort of Prosaic ML researcher who does have a bunch of tempting feedback loops for current-generation ML, but a) it's really not clear whether or how those apply to aligning superintelligent agents, and b) many of those feedback loops also basically translate into enhancing AI capabilities and moving us toward a more dangerous world.

I think you should be scared about this, if you're working in policy, either as a research wonk or an advocate, where there are some levers of power you can sort-of-see, but how the levers fit together and whether they actually connect to longterm existential safety is unclear.

Unfortunately, "be scared" isn't that useful advice. I don't have a great prescription for what to do.

My dissatisfaction with this situation is what leads me to explore Feedbackloop-first Rationality, basically saying "Well the problem is our feedback loops suck – either they don't exist, or they are temptingly goodharty. Let's try to invent better ones." But I haven't yet achieved an outcome here I can point to and say "okay this clearly helps."

But, meanwhile, my own best guess is:

I feel a lot more hopeful about researchers who have worked on a few different types of problems, and gotten more calibrated on where the edges of their intuitions' usefulness are. I'm exploring the art of operationalizing cruxy predictions, because I hope that can eventually feed into the art of having calibrated, cross-domain research taste, if you are deliberately attempting to test your transfer learning.

I feel more hopeful about researchers who make lists of their foundational assumptions, and practice staring into the abyss: confronting "what would I actually do if my core assumptions were wrong, and my plan doesn't work?", and grieving for assumptions that seem, on reflection, to have been illusions.

I feel more hopeful about researchers who talk to mentors with different viewpoints, learning different bits of taste and hard-earned life lessons, and attempting to integrate them into some kind of holistic AI safety research taste.

And while I don't think it's necessarily right for everyone to set themselves the standard of "tackle the hardest steps in the alignment problem and solve it in one go", I feel much more optimistic about people who have thought through "what are all the sorts of things that need to go right, for my research to actually pay off in an existential safety win?"

And I'm hopeful about people who look at all of this advice, and think "well, this still doesn't actually feel sufficient for me to be that confident my plans are really going to accomplish anything", and set out to brainstorm new ways to shore up their chances.
