Yarrow Bouchard 🔸

1370 karma · Joined · Canada · strangecosmos.substack.com

Bio

Pronouns: she/her or they/them. 

Parody of Stewart Brand’s whole Earth button.

I got interested in effective altruism back before it was called effective altruism, back before Giving What We Can had a website. Later on, I got involved in my university EA group and helped run it for a few years. Now I’m trying to figure out where effective altruism can fit into my life these days and what it means to me.

I write on Substack, and used to write on Medium.

Sequences (2)

Criticism of specific accounts of imminent AGI
Skepticism about near-term AGI

Comments (680)

Topic contributions (6)

All the things you mentioned aren't uniquely evidence for the simulation hypothesis but are equally evidence for a number of other hypotheses, such as the existence of a supernatural, personal God who designed and created the universe. (There are endless variations on this hypothesis, and we could come up with endlessly more.)

The fine-tuning argument is a common argument for the existence of a supernatural, personal God. The appearance of fine-tuning supports this conclusion equally as well as it supports the simulation hypothesis.

Some young Earth creationists believe that dinosaur fossils and other evidence of an old Earth were intentionally put there by God to test people’s faith. You might also think that God tests our faith in other ways, or plays tricks, or gets easily bored, and creates the appearance of a long history or a distant future that isn’t really there. (I also think it’s just not true that this is the most interesting point in history.)

Similarly, the book of Genesis says that God created humans in his image. Maybe he didn’t create aliens with high-tech civilizations because he’s only interested in beings with high technology made in his image. 

It might not be God who is doing this, but in fact an evil demon, as Descartes famously discussed in his Meditations around 400 years ago. Or it could be some kind of trickster deity like Loki who is neither fully good nor fully evil. There are endless ideas that would slot in equally well to replace the simulation hypothesis.

You might think the simulation hypothesis is preferable because it's a naturalistic hypothesis and these are supernatural hypotheses. But this is wrong: the simulation hypothesis is itself a supernatural hypothesis. If there are simulators, the reality they live in is stipulated to have fundamental laws of nature, such as the laws of physics, different from those of what we perceive to be the universe. For example, in the simulators' reality, maybe the fundamental relationship between consciousness and physical phenomena such as matter, energy, space, time, and physical forces is such that consciousness can directly, automatically shape physical phenomena to its will. If we observed this happening in our universe, we would describe it as magic or a miracle.

Whether you call them "simulators" or "God" or an "evil demon" or "Loki", and whether you call it a "simulation" or an "illusion" or a "dream", these are just different surface-level labels for substantially the same idea. If you stipulate laws of nature radically different from the ones we believe we have, what you're talking about is supernatural.

If you try to assume that the physics and other laws of nature in the simulators' reality are the same as in our perceived reality, then the simulation argument runs into a logical self-contradiction, as pointed out by the physicist Sean Carroll. Endlessly nested levels of simulation mean that computation in the original simulators' reality will run out. Simulations at the bottom of the nested hierarchy, which don't have enough computation to run still more simulations inside them, will outnumber higher-level simulations. The simulation argument says, as one of its key premises, that in our perceived reality we will be able to create simulations of worlds or universes filled with many digital minds. But the simulation hypothesis implies this is actually impossible, so the argument's conclusion contradicts one of its premises.

There are other strong reasons to reject the simulation argument. Remember that a key premise is that we ourselves or our descendants will want to make simulations. Really? They'll want to simulate the Holocaust, malaria, tsunamis, cancer, cluster headaches, car crashes, sudden infant death syndrome, and Guantanamo Bay? Why? On our ethical views today, we would not see this as permissible, but rather as the most grievous evil. Why would our descendants feel differently?

A weaker point: computation is abundant in the universe but still finite. Why spend computation on creating digital minds inside simulations when there is always a trade-off between doing that and creating digital minds in our universe, i.e. the real world? If we or our descendants think marginally and hold as one of our highest goals to maximize the number of future lives with a good quality of life, using huge amounts of computation on simulations might be seen as going against that goal. Plus, there are endlessly more things we could do with our finite resource of computation, most of which we can't imagine today. Where would creating simulations fall on the list?

You can argue that creating simulations would be a small fraction of overall resources. I'm not sure that's actually true; I haven't done the math. But just because something is a small fraction of overall resources doesn't mean it will likely be done. In an interstellar, transhumanist scenario, our descendants could create a diamond statue of Hatsune Miku the size of the solar system and this would take a tiny percentage of overall resources, but that doesn't mean it will likely happen. The simulation argument specifically claims that making simulations of early 21st century Earth will interest our descendants more than alternative uses of resources. Why? Maybe they'll be more interested in a million other things.

Overall, the simulation hypothesis is undisprovable but no more credible than an unlimited number of other undisprovable hypotheses. If something seems nuts, it probably is. Initially, you might not be able to point out the specific logical reasons it’s nuts. But that’s to be expected — the sort of paradoxes and thought experiments that get a lot of attention (that "go viral", so to speak) are the ones that are hard to immediately counterargue.

Philosophy is replete with oddball ideas that are hard to convincingly refute at first blush. The Chinese Room is a prime example. Another random example is the argument that utilitarianism is compatible with slavery. With enough time and attention, refutations may come. I don't think one's inability to immediately articulate the logical counterargument is a sign that an oddball idea is correct. It's just that thinking takes time and, usually, by the time an oddball idea reaches your desk, it's proven to be resistant to immediate refutation. So, trust that intuition that something is nuts. 

My experience is similar. LLMs are powerful search engines but nearly completely incapable of thinking for themselves. I use these custom instructions for ChatGPT to make it much more useful for my purposes:

When asked for information, focus on citing sources, providing links, and giving direct quotes. Avoid editorializing or doing original synthesis, or giving opinions. Act like a search engine. Act like Google.
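For what it's worth, if you use the API rather than the ChatGPT app, you can bake similar instructions into a system prompt. Here's a minimal sketch, assuming the OpenAI Python SDK's chat completions interface (the model name and the example question are just placeholders):

```python
# Minimal sketch: a "search engine" style system prompt sent via the API.
# Assumes the OpenAI Python SDK; model name and example question are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SEARCH_ENGINE_INSTRUCTIONS = (
    "When asked for information, focus on citing sources, providing links, "
    "and giving direct quotes. Avoid editorializing, doing original synthesis, "
    "or giving opinions. Act like a search engine. Act like Google."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute whichever model you normally use
    messages=[
        {"role": "system", "content": SEARCH_ENGINE_INSTRUCTIONS},
        {"role": "user", "content": "Find sources on radiologist employment trends since 2016."},
    ],
)
print(response.choices[0].message.content)
```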

There are still limitations:

  • You still have to manually check the cited links to verify the information yourself.
  • ChatGPT is, for some reason, really bad at actually linking to the correct webpage it’s quoting from. This wastes time and is frustrating.
  • ChatGPT is limited to short quotes and often gives even shorter quotes than necessary, which is annoying. It often makes it hard to understand what the quote actually even says, which almost defeats the purpose.
  • It’s common for ChatGPT to misunderstand what it’s quoting and take something out of context, or it quotes something inapplicable. This often isn’t obvious until you actually go check the source (especially with the truncated quotes). You can get tricked by ChatGPT this way.
  • Every once in a while, ChatGPT completely fabricates or hallucinates a quote or a source.

The most one-to-one analogy for LLMs in this use case is Google. Google is amazingly useful for finding webpages. But when you Google something (or search on Google Scholar), you get a list of results, many of which are not what you’re looking for, and you have to pick which results to click on. And then, of course, you actually have to read the webpages or PDFs. Google doesn’t think for you; it’s just an intermediary between you and the sources.

I call LLMs SuperGoogle because they can do semantic search on hundreds of webpages and PDFs in a few minutes while you're doing something else. LLMs as search engines is a genuine innovation.

On the other hand, when I’ve asked LLMs to respond to the reasoning or argument in a piece of writing or even just do proofreading, they have given incoherent responses, e.g. making hallucinatory "corrections" to words or sentences that aren’t in the text they’ve been asked to review. Run the same text by the same LLM twice and it will often give the opposite opinion of the reasoning or argument. The output is also often self-contradictory, incoherent, incomprehensibly vague, or absurd.

"Crux" is its own word; it isn’t short for "crucial consideration" or anything else. From Merriam-Webster:

In Latin, crux referred literally to an instrument of torture, often a cross or stake, and figuratively to the torture and misery inflicted by means of such an instrument. Crux eventually developed the sense of "a puzzling or difficult problem"; that was the first meaning that was used when the word entered English in the early 18th century. Later, in the late 19th century, crux began to be used more specifically to refer to an essential point of a legal case that required resolution before the case as a whole could be resolved. Today, the verdict on crux is that it can be used to refer to any important part of a problem or argument, inside or outside of the courtroom.

The EA Forum wiki also notes that a "crucial consideration" is a related concept, not a synonym of "crux" or a longhand version of the term.

Where do you get that 5-20x figure from?

I recall Elon Musk once said the goal was to get to an average of one intervention per million miles of driving. I think this is based on the statistic of one crash per 500,000 miles on average.

I believe interventions currently happen more than once per 100 miles on average. If so, and if one intervention per million miles is what Tesla is indeed targeting, then Tesla is more than 10,000x off from its goal.
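Spelling out that arithmetic, using those two figures (both of which are rough estimates on my part):

$$\frac{1{,}000{,}000 \text{ miles per intervention (target)}}{100 \text{ miles per intervention (current, estimated)}} = 10{,}000$$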

There are other ways of measuring Tesla’s FSD software’s performance compared to average human driving performance and getting another number. I am skeptical it would be possible to use real, credible numbers and come to the conclusion that Tesla is currently less than 100x away from human-level driving.

I very much doubt that Hardware 5/AI5 is going to provide what it takes for Tesla to achieve SAE Level 4/5 autonomy at human-level or better performance, or that Tesla will achieve that goal (in any robust, meaningful sense) within the next 2 years. I still think what I said is true — Tesla, internally, would have evidence of this if it were true (or would be capable of obtaining it), and would be incentivized to show off that evidence.

Andrej Karpathy understands this topic better than almost anyone else in the world, and he is clear that he thinks fully autonomous driving is not solved (at Tesla, Waymo, or elsewhere) and there's a long way to go still. There's good reason to listen to Karpathy on this.

I also very much doubt that the best AI models in 2 years will be capable of writing 90% of commercial, production code, let alone that this will happen within six months. I think there’s essentially no chance of this happening in 2026. As far as I can see, there is no good evidence currently available that would suggest this is starting to happen or should be possible soon. Extrapolating from performance on narrow, contrived benchmark tasks to real world performance is just a mistake. And the evidence about real world use of AI for coding does not support this.

On Tesla: I don't think training a special model for expensive test cars makes sense. They're not investing in a method that's not going to be scalable. The relevant update will come when AI5 ships (end of this year reportedly), with ~9x the memory. I'd be surprised if they don't solve it on that hardware.

Do you mean you think Tesla will immediately solve human-level fully autonomous (i.e. SAE Level 4 or Level 5) driving as soon as they deploy Hardware 5/AI5? Or that it will happen some number of years down the line?

Tesla presumably already has some small number of Hardware 5/AI5 units now. It knows what the specs for Hardware 5/AI5 will be. So, it can train a larger model (or set of models) now for that 10x more powerful hardware. Maybe it has already done so. I would imagine Tesla would want to be already testing the 10x larger model (or models) on the new hardware now, before the new hardware enters mass production.

If the 10x more powerful hardware were sufficient to solve full autonomy, Tesla should be able to demonstrate something impressive now with the new hardware units it presumably already has. Moreover, Tesla is massively incentivized to do so.

I don't see any strong reason why 10x more powerful hardware or a 10x larger model (or set of models) would be enough to get the 100x or 1,000x or 10,000x (or whatever it is) boost in performance Tesla's FSD software needs. The trend in scaling the compute and data used for neural networks is that performance tends to improve by less, proportionally, than the increase in compute and data. So, a 10x increase in compute or model size would tend to get less than a 10x increase in performance.
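As a rough illustration (assuming, purely for the sake of illustration, a power-law relationship between compute and performance with an exponent well below 1, which is the typical shape of empirical scaling curves; the value $\alpha = 0.3$ here is hypothetical):

$$\text{performance multiple} \approx 10^{\alpha} = 10^{0.3} \approx 2\times \quad \text{for a } 10\times \text{ increase in compute}$$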

But if it is true that the 10x more powerful hardware is sufficient to solve the remainder of the problem, Tesla would have compelling evidence of that by now, or would easily be able to obtain that evidence. I think Tesla would be eager to show that evidence off if it had it, or knew it could get it.

What Hinton and others got wrong wasn't the capability prediction, it was assuming the job consisted entirely of the task AI was getting good at. Turns out radiologists do more than read images, translators do more than translate sentences, and AI ends up complementary rather than substitutive.

I've seen some studies that have found AI models simply underperform human radiologists, although the results are mixed. More importantly, the results are for clean, simplified benchmarks, but those benchmarks don't generalize well to real world conditions anyway.

I haven't spent much time looking into studies on human translation vs. post-LLM machine translation. However, I found one study of GPT-4o, open source LLMs, Google Translate, and DeepL that found (among other things):

LLMs still need to address the issue of overly literal outputs, and a substantial gap remains between LLM and human quality in literary translation, despite the clear advancements of recent models.

Since studies take so long to conduct, write up, and get published, we will tend to see studies lagging behind the latest versions of LLMs. That's a drawback, but I don't know of a better way to get this kind of high-quality data and analysis. More up-to-date information, like firm-level data or other economic data, is more recent but doesn't tell us as much about the why.

Consulting firms like McKinsey release data based on interviews with people in management positions at companies; I don't think they've specifically covered radiology or translation, but you might be able to find similar reports for those domains based on interviews. This is another way to get more up-to-date information, but interviews or surveys have drawbacks relative to academic studies.

The benchmarks aren't perfect but they consistently point to rapid progress, from METR time horizons, SWE-bench, GDPval...

Performance on these benchmarks doesn't generalize very well to real world performance. I think "aren't perfect" is an understatement.

There is much to criticize about the way the METR time horizons graph, specifically, has been interpreted. It's not clear how much METR is responsible for this interpretation; sometimes people at METR give good caveats, sometimes they don't. In any case, the graph only says something very narrow and contrived, and it doesn't necessarily tell us much about how good AI is at coding in a practical, realistic, economic sense (or how good it will be in a year or two).

On the 90% prediction: my somewhat conservative view is that AI could write 90%+ of production code this year and will next year.

I very much doubt AI will write 90% of production code by December 2027. But already, you seem to be pushing out the timeline. You started by saying Dario Amodei was "off by a few months" in his prediction that 90% of code would be AI-written by mid-September 2025. (It's already been nearly 4 months since then.) Pushing out the timeline into 2027 makes him off by at least 1 year and 3 months. If the timeline is late 2027, then he's off by at least 2 years.

Ah, okay, that is tricky! I totally missed one of the rules that the examples are telling us about. Once you see it, it seems simple and obvious, but it's easy to miss. If you want to see the solution, it's here.

I believe all ARC-AGI-2 puzzles contain (at least?) two different rules that you have to combine. I forgot about that part! I was trying to solve the puzzle as if there was just one rule to figure out.

I tried the next puzzle and was able to solve it right away, on the first try, keeping in mind the 'two rules' thing. These puzzles are actually pretty fun, I might do more.

ARC-AGI-2 is not a test of whether a system is an AGI or not. Getting 100% on ARC-AGI-2 would not imply a system is AGI. I guess the name is potentially misleading in that regard. But Chollet et al. are very clear about this.

The arxiv.org pre-print explains how the human testing worked. See the section "Human-facing calibration testing" on page 5. The human testers only had a maximum of 90 minutes:

Participants completed a short survey and interface tutorial prior to being assigned tasks. Participants received a base compensation of $115-150 for participation in a 90-minute test session, plus a $5 incentive reward per correctly completed task. Three testing sessions were held between November 2024 and May 2025.

The median time spent attempting or solving each task was around 2 minutes:

The median time spent on attempted test pairs was 2.3 minutes, while successfully completed tasks required a median of 2.2 minutes (Figure 3).

I'm still not entirely sure how the human test process worked from the description in the pre-print, but maybe rather than giving up and walking away, testers gave up on individual tasks in order to solve as many as possible in their allotted time.

I think you're probably right about how they're defining "human panel", but I wish this were more clearly explained in the pre-print, on the website, or in the presentations they've done.


I can't respond to your comments in the other thread because of the downvoting, so I’ll reply here:

1) Metaculus and Manifold have a huge overlap with the EA community (I'm not familiar with Kashi) and, outside the EA community, people who are interested in AGI often far too easily accept the same sort of extremely flawed stuff that presents itself as way more serious and scientific than it really is (e.g. AI 2027, Situational Awareness, Yudkowsky/MIRI's stuff).

2) I think it's very difficult to know if one is engaging in motivated reasoning, or what other psychological biases are in play. People engage in wishful thinking to avoid unpleasant realities or possibilities, but people also invent unpleasant realities/possibilities, including various scenarios around civilizational collapse or the end of the world (e.g. a lot of doomsday preppers seem to believe in profoundly implausible, pseudoscientific, or fringe religious doomsday scenarios). People seem to be biased toward believing both pleasant and unpleasant things. (There is also something psychologically grabbing about believing that one belongs to an elite few who possess esoteric knowledge about cosmic destiny and may play a special role in determining the fate of the world.)

My explicit, conscious reasoning is complex and can't be summarized in one sentence (see the posts on my profile for the long version), but it's less along the lines of 'I don't want to believe unpleasant things' and more along the lines of: a lot of people preaching AGI doom lack expertise in AI, have a bad track record of beliefs/predictions on AI and/or other topics, say a lot of suspiciously unfalsifiable and millennialist things, and don't have clear, compelling answers to objections that have been publicly raised, some for years now.

Tesla’s production fleet is constrained by the costs of production hardware, but their internal test fleet or robotaxi fleet could easily use $100,000+ hardware if they wanted. If this were enough for dramatically better performance, that would make for a flashy demo, which would probably be great for Tesla’s share price, so they are incentivized to do this.

What's your prediction about when AI will write 90% of commercial, production code? If you think it's within a year from now, you can put me on the record as predicting that won't happen.

It’s not just self-driving or coding where AI isn’t living up to the most optimistic expectations. There has been very little success in using LLMs and generative AI tools for commercial applications across the board. Demand for human translators has continued to increase since GPT-4 was released (although counterfactually it may have grown less than it would have otherwise). You’d think if generative AI were good at any commercially valuable task, it would be translation. (Customer support chat is another area with some applicability, but results are mixed, and LLMs are only an incremental improvement over the Software 1.0 chatbots and pre-LLM chatbots that already existed.) This is why I say we’re most likely in an AI bubble. It’s not just optimistic expectations in a few domains that have gotten ahead of their skis, it’s across the aggregate of all commercially relevant domains.

One more famous AI prediction I didn’t mention in this post is the Turing Award-winning AI researcher Geoffrey Hinton’s prediction in 2016 that deep learning would automate all radiology jobs by 2021. Even in 2026, he couldn’t be more wrong. Demand for radiologists and radiologists’ salaries have been on the rise. We should be skeptical of brazen predictions about what AI will soon be able to do, even from AI luminaries, given how wrong they’ve been before.

In footnote 2 on this post, I said I wouldn’t be surprised if, on January 1, 2026, the top score on ARC-AGI-2 was still below 60%. It did turn out to be below 60%, although only by 6%. (Elon Musk’s prediction of AGI in 2025 was wrong, obviously.)

The score the ARC Prize Foundation ascribes to human performance is 100%, rather than 60%. 60% is the average for individual humans, but 100% is the score for a "human panel", i.e. a set of at least two humans. Note the large discrepancy between the average human and the average human panel. The human testers were random people off the street who got paid $115-150 to show up and then an additional $5 per task they solved. I believe the ARC Prize Foundation’s explanation for the 40-point discrepancy is that many of the testers just didn’t feel that motivated to solve the tasks and gave up. (I vaguely remember this being mentioned in a talk or interview somewhere.)

ARC’s Grand Prize requires scoring 85% (and abiding by certain cost/compute efficiency limits). They say the 85% target score is "somewhat arbitrary".

I decided to go with the 60% figure in this post to go easy on the LLMs.

If you haven’t already, I recommend looking at some examples of ARC-AGI-2 tasks. Notice how simple they are. These are just little puzzles. They aren’t that complex. Anyone can do one in a few minutes, even a kid. It helps to see what we’re actually measuring here. 

The computer scientist Melanie Mitchell has a great recent talk on this. The whole talk is worth watching, but the part about ARC-AGI-1 and ARC-AGI-2 starts at 21:50. She gives examples of the sort of mistakes LLMs (including o1-pro) make on ARC-AGI tasks and her team’s variations on them. These are really, really simple mistakes. I think you should really look at the example tasks and the example mistakes to get a sense of how rudimentary LLMs’ capabilities are.

I am interested to watch when ARC-AGI-3 launches. ARC-AGI-3 is interactive and there is more variety in the tasks. Just as AI models themselves need to be iterated, benchmarks need to be iterated. It is difficult to make a perfect product or technology on the first try. So, hopefully François Chollet and his colleagues will make better and better benchmarks with each new version of ARC-AGI.

Unfortunately, the AI researcher Andrej Karpathy has been saying some pretty discouraging things about benchmarks lately. From a tweet from November:

I usually urge caution with public benchmarks because imo they can be quite possible to game. It comes down to discipline and self-restraint of the team (who is meanwhile strongly incentivized otherwise) to not overfit test sets via elaborate gymnastics over test-set adjacent data in the document embedding space. Realistically, because everyone else is doing it, the pressure to do so is high.

I guess the most egregious publicly known example of an LLM company juicing its numbers on benchmarks was when Meta gamed (cheated on?) some benchmarks with Llama 4. Meta AI’s former chief scientist, Yann LeCun, said in a recent interview that Mark Zuckerberg "basically lost confidence in everyone who was involved in this" (which didn’t include LeCun, who worked in a different division), many of whom have since departed the company. 

However, I don’t know where LLM companies draw the line between acceptable gaming (or cheating) and unacceptable gaming (or cheating). For instance, I don’t know if LLM companies are creating their own training datasets with their own versions of ARC-AGI-2 tasks and training on that. It may be that the more an LLM company pays attention to and cares about a benchmark, the less meaningful a measurement it is (and vice versa). 

Karpathy again, this time in his December LLM year in review post:

Related to all this is my general apathy and loss of trust in benchmarks in 2025. The core issue is that benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR and weaker forms of it via synthetic data generation. In the typical benchmaxxing process, teams in LLM labs inevitably construct environments adjacent to little pockets of the embedding space occupied by benchmarks and grow jaggies to cover them. Training on the test set is a new art form.

I think probably one of the best measures of AI capabilities is AI’s ability to do economically useful or valuable tasks, in real world scenarios, that can increase productivity or generate profit. This is a more robust test — it isn’t automatically gradable, and it would be very difficult to game or cheat on. To misuse the roboticist Rodney Brooks’ famous phrase, "The world is its own best model." Rather than test on some simplified, contrived proxy for real world tasks, why not just test on real world tasks?

Moreover, someone has to pay for people to create benchmarks, and to maintain, improve, and operate them. There isn’t a ton of money to do so, especially not for benchmarks like ARC-AGI-2. But there’s basically unlimited money incentivizing companies to measure productivity and profitability, and to try out allegedly labour-saving technologies. After the AI bubble pops (which it inevitably will, probably sometime within the next 5 years or so), this may become less true. But for now, companies are falling over themselves to try to implement and profit from LLMs and generative AI tools. So, funding to test AI performance in real world contexts is currently in abundant supply.
