Pronouns: she/her or they/them.
I got interested in effective altruism back before it was called effective altruism, back before Giving What We Can had a website. Later on, I got involved in my university EA group and helped run it for a few years. Now I'm trying to figure out where effective altruism can fit into my life these days and what it means to me.
I write on Substack, and used to write on Medium.
Where do you get that 5-20x figure from?
I recall Elon Musk once said the goal was to get to an average of one intervention per million miles of driving. I think this is based on the statistic of one crash per 500,000 miles on average.
I believe interventions currently happen more than once per 100 miles on average. If so, and if one intervention per million miles is what Tesla is indeed targeting, then Tesla is more than 10,000x off from its goal.
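To make the arithmetic behind that explicit, here is a quick back-of-the-envelope sketch (both figures are the rough estimates from above, not official Tesla statistics):

```python
# Rough gap between an assumed current FSD intervention rate (at least one per ~100 miles)
# and the reported target of one intervention per million miles. Both numbers are the
# estimates discussed above, not official Tesla figures.

target_miles_per_intervention = 1_000_000
assumed_current_miles_per_intervention = 100

improvement_needed = target_miles_per_intervention / assumed_current_miles_per_intervention
print(f"Improvement needed: more than {improvement_needed:,.0f}x")
# -> Improvement needed: more than 10,000x
```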
There are other ways of measuring Tesla's FSD software's performance against average human driving performance and arriving at a different number. I am skeptical it would be possible to use real, credible numbers and come to the conclusion that Tesla is currently less than 100x away from human-level driving.
I very much doubt that Hardware 5/AI5 is going to provide what it takes for Tesla to achieve SAE Level 4/5 autonomy at human-level or better performance, or that Tesla will achieve that goal (in any robust, meaningful sense) within the next 2 years. I still think what I said is true: Tesla, internally, would have evidence of this if it were true (or would be capable of obtaining it), and would be incentivized to show off that evidence.
Andrej Karpathy understands this topic better than almost anyone else in the world, and he is clear that he thinks fully autonomous driving is not solved (at Tesla, Waymo, or elsewhere) and that there's still a long way to go. There's good reason to listen to Karpathy on this.
I also very much doubt that the best AI models in 2 years will be capable of writing 90% of commercial, production code, let alone that this will happen within six months. I think there's essentially no chance of this happening in 2026. As far as I can see, there is no good evidence currently available that would suggest this is starting to happen or should be possible soon. Extrapolating from performance on narrow, contrived benchmark tasks to real world performance is just a mistake. And the evidence about real world use of AI for coding does not support this.
On Tesla: I don't think training a special model for expensive test cars makes sense. They're not investing in a method that's not going to be scalable. The relevant update will come when AI5 ships (end of this year reportedly), with ~9x the memory. I'd be surprised if they don't solve it on that hardware.
Do you mean you think Tesla will immediately solve human-level fully autonomous (i.e. SAE Level 4 or Level 5) driving as soon as they deploy Hardware 5/AI5? Or that it will happen some number of years down the line?
Tesla presumably already has some small number of Hardware 5/AI5 units now. It knows what the specs for Hardware 5/AI5 will be. So, it can train a larger model (or set of models) now for that 10x more powerful hardware. Maybe it has already done so. I would imagine Tesla would want to be already testing the 10x larger model (or models) on the new hardware now, before the new hardware enters mass production.
If the 10x more powerful hardware were sufficient to solve full autonomy, Tesla should be able to demonstrate something impressive now with the new hardware units it presumably already has. Moreover, Tesla is massively incentivized to do so.
I don't see any strong reason why 10x more powerful hardware or a 10x larger model (or set of models) would be enough to get the 100x or 1,000x or 10,000x (or whatever it is) boost in performance Tesla's FSD software needs. The trend in scaling the compute and data used for neural networks is that performance tends to improve by less, proportionally, than the increase in compute and data. So, a 10x increase in compute or model size would tend to get less than a 10x increase in performance.
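As a toy illustration of what sub-linear returns look like (this assumes a simple power-law relationship between compute and performance, with a made-up exponent; it is not a claim about Tesla's actual scaling behavior):

```python
# Toy power-law scaling model: performance grows as compute**alpha with alpha < 1,
# so a 10x increase in compute buys much less than a 10x increase in performance.
# The exponent below is purely illustrative.

alpha = 0.5               # hypothetical sub-linear scaling exponent
compute_multiplier = 10

performance_multiplier = compute_multiplier ** alpha
print(f"{compute_multiplier}x compute -> roughly {performance_multiplier:.1f}x performance under this toy model")
# -> 10x compute -> roughly 3.2x performance under this toy model
```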
But if it is true that the 10x more powerful hardware is sufficient to solve the remainder of the problem, Tesla would have compelling evidence of that by now, or would easily be able to obtain that evidence. I think Tesla would be eager to show that evidence off if it had it, or knew it could get it.
What Hinton and others got wrong wasn't the capability prediction, it was assuming the job consisted entirely of the task AI was getting good at. Turns out radiologists do more than read images, translators do more than translate sentences, and AI ends up complementary rather than substitutive.
I've seen some studies finding that AI models simply underperform human radiologists, although the results are mixed. More importantly, those results come from clean, simplified benchmarks, which don't generalize well to real world conditions anyway.
I haven't spent much time looking into studies on human translation vs. post-LLM machine translation. However, I found one study of GPT-4o, open source LLMs, Google Translate, and DeepL that found (among other things):
LLMs still need to address the issue of overly literal outputs, and a substantial gap remains between LLM and human quality in literary translation, despite the clear advancements of recent models.
Since studies take so long to conduct, write up, and get published, we will tend to see studies lagging behind the latest versions of LLMs. That's a drawback, but I don't know of a better way to get this kind of high-quality data and analysis. More up-to-date information, like firm-level data or other economic data, is timelier but doesn't tell us as much about the why.
Consulting firms like McKinsey release data based on interviews with people in management positions at companies; I don't think they've specifically covered radiology or translation, but you might be able to find similar reports for those domains based on interviews. This is another way to get more up-to-date information, but interviews or surveys have drawbacks relative to academic studies.
The benchmarks aren't perfect but they consistently point to rapid progress, from METR time horizons, SWE-bench, GDPval...
Performance on these benchmarks doesn't generalize very well to real world performance. I think "aren't perfect" is an understatement.
There is much to criticize about the way the METR time horizons graph, specifically, has been interpreted. It's not clear how much METR is responsible for this interpretation; sometimes people at METR give good caveats, sometimes they don't. In any case, the graph only says something very narrow and contrived, and it doesn't necessarily tell us much about how good AI is at coding in a practical, realistic, economic sense (or how good it will be in a year or two).
On the 90% prediction: my somewhat conservative view is that AI could write 90%+ of production code this year and will next year.
I very much doubt AI will write 90% of production code by December 2027. But already, you seem to be pushing out the timeline. You started by saying Dario Amodei was "off by a few months" in his prediction that 90% of code would be AI-written by mid-September 2025. (It's already been nearly 4 months since then.) Pushing out the timeline into 2027 makes him off by at least 1 year and 3 months. If the timeline is late 2027, then he's off by at least 2 years.
Ah, okay, that is tricky! I totally missed one of the rules that the examples are telling us about. Once you see it, it seems simple and obvious, but it's easy to miss. If you want to see the solution, it's here.
I believe all ARC-AGI-2 puzzles contain (at least?) two different rules that you have to combine. I forgot about that part! I was trying to solve the puzzle as if there was just one rule to figure out.
I tried the next puzzle and was able to solve it right away, on the first try, keeping in mind the 'two rules' thing. These puzzles are actually pretty fun; I might do more.
ARC-AGI-2 is not a test of whether a system is an AGI or not. Getting 100% on ARC-AGI-2 would not imply a system is AGI. I guess the name is potentially misleading in that regard. But Chollet et al. are very clear about this.
The arxiv.org pre-print explains how the human testing worked. See the section "Human-facing calibration testing" on page 5. The human testers only had a maximum of 90 minutes:
Participants completed a short survey and interface tutorial prior to being assigned tasks. Participants received a base compensation of $115-150 for participation in a 90-minute test session, plus a $5 incentive reward per correctly completed task. Three testing sessions were held between November 2024 and May 2025.
The median time spent attempting or solving each task was around 2 minutes:
The median time spent on attempted test pairs was 2.3 minutes, while successfully completed tasks required a median of 2.2 minutes (Figure 3).
I'm still not entirely sure how the human test process worked from the description in the pre-print, but maybe rather than giving up and walking away, testers gave up on individual tasks in order to solve as many as possible in their allotted time.
I think you're probably right about how they're defining "human panel", but I wish this were more clearly explained in the pre-print, on the website, or in the presentations they've done.
I can't respond to your comments in the other thread because of the downvoting, so I'll reply here:
1) Metaculus and Manifold have a huge overlap with the EA community (I'm not familiar with Kalshi) and, outside the EA community, people who are interested in AGI often far too easily accept the same sort of extremely flawed stuff that presents itself as way more serious and scientific than it really is (e.g. AI 2027, Situational Awareness, Yudkowsky/MIRI's stuff).
2) I think it's very difficult to know if one is engaging in motivated reasoning, or what other psychological biases are in play. People engage in wishful thinking to avoid unpleasant realities or possibilities, but people also invent unpleasant realities/possibilities, including various scenarios around civilizational collapse or the end of the world (e.g. a lot of doomsday preppers seem to believe in profoundly implausible, pseudoscientific, or fringe religious doomsday scenarios). People seem to be biased toward believing both pleasant and unpleasant things. (There is also something psychologically grabbing about believing that one belongs to an elite few who possess esoteric knowledge about cosmic destiny and may play a special role in determining the fate of the world.)
My explicit, conscious reasoning is complex and can't be summarized in one sentence (see the posts on my profile for the long version), but it's less along the lines of 'I don't want to believe unpleasant things' and more along the lines of: a lot of people preaching AGI doom lack expertise in AI, have a bad track record of beliefs/predictions on AI and/or other topics, say a lot of suspiciously unfalsifiable and millennialist things, and don't have clear, compelling answers to objections that have been publicly raised, some for years now.
Tesla's production fleet is constrained by the costs of production hardware, but their internal test fleet or robotaxi fleet could easily use $100,000+ hardware if they wanted. If this were enough for dramatically better performance, that would make for a flashy demo, which would probably be great for Tesla's share price, so they are incentivized to do this.
What's your prediction about when AI will write 90% of commercial, production code? If you think it's within a year from now, you can put me on the record as predicting that won't happen.
It's not just self-driving or coding where AI isn't living up to the most optimistic expectations. There has been very little success in using LLMs and generative AI tools for commercial applications across the board. Demand for human translators has continued to increase since GPT-4 was released (although counterfactually it may have grown less than it would have otherwise). You'd think that if generative AI were good at any commercially valuable task, it would be translation. (Customer support chat is another area with some applicability, but results are mixed, and LLMs are only an incremental improvement over the Software 1.0 chatbots and pre-LLM chatbots that already existed.) This is why I say we're most likely in an AI bubble. It's not just optimistic expectations in a few domains that have gotten ahead of their skis; it's expectations across the aggregate of all commercially relevant domains.
One more famous AI prediction I didn't mention in this post is the Turing Award-winning AI researcher Geoffrey Hinton's prediction in 2016 that deep learning would automate all radiology jobs by 2021. Even now, in 2026, he couldn't be more wrong. Demand for radiologists, and radiologists' salaries, have been on the rise. We should be skeptical of brazen predictions about what AI will soon be able to do, even from AI luminaries, given how wrong they've been before.
In footnote 2 on this post, I said I wouldn't be surprised if, on January 1, 2026, the top score on ARC-AGI-2 was still below 60%. It did turn out to be below 60%, although only by 6%. (Elon Musk's prediction of AGI in 2025 was wrong, obviously.)
The score the ARC Prize Foundation ascribes to human performance is 100%, rather than 60%. 60% is the average for individual humans, but 100% is the score for a "human panel", i.e. a set of at least two humans. Note the large discrepancy between the average human and the average human panel. The human testers were random people off the street who got paid $115-150 to show up and then an additional $5 per task they solved. I believe the ARC Prize Foundation's explanation for the 40-point discrepancy is that many of the testers just didn't feel that motivated to solve the tasks and gave up. (I vaguely remember this being mentioned in a talk or interview somewhere.)
ARC's Grand Prize requires scoring 85% (and abiding by certain cost/compute efficiency limits). They say the 85% target score is "somewhat arbitrary".
I decided to go with the 60% figure in this post to go easy on the LLMs.
If you haven't already, I recommend looking at some examples of ARC-AGI-2 tasks. Notice how simple they are. These are just little puzzles. They aren't that complex. Anyone can do one in a few minutes, even a kid. It helps to see what we're actually measuring here.
The computer scientist Melanie Mitchell has a great recent talk on this. The whole talk is worth watching, but the part about ARC-AGI-1 and ARC-AGI-2 starts at 21:50. She gives examples of the sort of mistakes LLMs (including o1-pro) make on ARC-AGI tasks and her team's variations on them. These are really, really simple mistakes. I think you should really look at the example tasks and the example mistakes to get a sense of how rudimentary LLMs' capabilities are.
I am interested to see ARC-AGI-3 when it launches. ARC-AGI-3 is interactive and there is more variety in the tasks. Just as AI models themselves need to be iterated, benchmarks need to be iterated. It is difficult to make a perfect product or technology on the first try. So, hopefully François Chollet and his colleagues will make better and better benchmarks with each new version of ARC-AGI.
Unfortunately, the AI researcher Andrej Karpathy has been saying some pretty discouraging things about benchmarks lately. From a November tweet:
I usually urge caution with public benchmarks because imo they can be quite possible to game. It comes down to discipline and self-restraint of the team (who is meanwhile strongly incentivized otherwise) to not overfit test sets via elaborate gymnastics over test-set adjacent data in the document embedding space. Realistically, because everyone else is doing it, the pressure to do so is high.
I guess the most egregious publicly known example of an LLM company juicing its numbers on benchmarks was when Meta gamed (cheated on?) some benchmarks with Llama 4. Meta AI's former chief scientist, Yann LeCun, said in a recent interview that Mark Zuckerberg "basically lost confidence in everyone who was involved in this" (which didn't include LeCun, who worked in a different division), many of whom have since departed the company.
However, I don't know where LLM companies draw the line between acceptable gaming (or cheating) and unacceptable gaming (or cheating). For instance, I don't know if LLM companies are creating their own training datasets with their own versions of ARC-AGI-2 tasks and training on that. It may be that the more an LLM company pays attention to and cares about a benchmark, the less meaningful a measurement it is (and vice versa).
Karpathy again, this time in his December LLM year in review post:
Related to all this is my general apathy and loss of trust in benchmarks in 2025. The core issue is that benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR and weaker forms of it via synthetic data generation. In the typical benchmaxxing process, teams in LLM labs inevitably construct environments adjacent to little pockets of the embedding space occupied by benchmarks and grow jaggies to cover them. Training on the test set is a new art form.
I think probably one of the best measures of AI capabilities is AI's ability to do economically useful or valuable tasks, in real world scenarios, that can increase productivity or generate profit. This is a more robust test: it isn't automatically gradable, and it would be very difficult to game or cheat on. To misuse the roboticist Rodney Brooks' famous phrase, "The world is its own best model." Rather than test on some simplified, contrived proxy for real world tasks, why not just test on real world tasks?
Moreover, someone has to pay for people to create benchmarks, and to maintain, improve, and operate them. There isn't a ton of money to do so, especially not for benchmarks like ARC-AGI-2. But there's basically unlimited money incentivizing companies to measure productivity and profitability, and to try out allegedly labour-saving technologies. After the AI bubble pops (which it inevitably will, probably sometime within the next 5 years or so), this may become less true. But for now, companies are falling over themselves to try to implement and profit from LLMs and generative AI tools. So, funding to test AI performance in real world contexts is currently in abundant supply.
The economist Tyler Cowen linked to my post on self-driving cars, so it ended up getting a lot more readers than I ever expected. I hope that more people now realize, at the very least, that self-driving cars are not an uncontroversial, uncomplicated AI success story. In discussions around AGI, people often say things along the lines of: 'deep learning solved self-driving cars, so surely it will be able to solve many other problems'. In fact, the lesson to draw is the opposite: self-driving is too hard a problem for the current cutting edge in deep learning (and deep reinforcement learning), and this should make us think twice before cavalierly proclaiming that deep learning will soon be able to master even more complex, more difficult tasks than driving.
Thanks.
Unfortunately, patient philanthropy is the sort of topic where it seems like what a person thinks about it depends a lot on some combination of a) their intuitions about a few specific things and b) a few fundamental, worldview-level assumptions. I say "unfortunately" because this means disagreements are hard to meaningfully debate.
For instance, there are places where the argument either pro or con depends on what a particular number is, and since we don't know what that number actually is and can't find out, the best we can do is make something up. (For example, whether, in what way, and by how much foundations created today will decrease in efficacy over long timespans.)
Many people in the EA community are content to say, e.g., the chance of something is 0.5% rather than 0.05% or 0.005%, and rather than 5% or 50%, simply based on an intuition or intuitive judgment, and then make life-altering, aspirationally world-altering decisions based on that. My approach is more similar to the approach of mainstream academic publishing, in which, if you can't rigorously justify a number, you can't use it in your argument: it isn't admissible.
So, this is a deeper epistemological, philosophical, or methodological point.
One piece of evidence that supports my skepticism of numbers derived from intuition is a forecasting exercise where a minor difference in how the question was framed changed the number people gave by 5-6 orders of magnitude (750,000x). And that's only one minor difference in framing. If different people disagree on multiple major, substantive considerations relevant to deriving a number, perhaps in some cases their numbers could differ by much more. If we can't agree on whether a crucial number is a million times higher or lower, how constructive are such discussions going to be? Can we meaningfully say we are producing knowledge in such instances?
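For reference, here is a quick check of that figure (the 750,000x ratio comes from the exercise described above; the calculation itself is just a base-10 logarithm):

```python
import math

# Express the 750,000x difference in framing-dependent answers as orders of magnitude (base 10).
ratio = 750_000
print(f"{ratio:,}x is about {math.log10(ratio):.1f} orders of magnitude")
# -> 750,000x is about 5.9 orders of magnitude
```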
So, my preferred approach when an argument depends on an unknowable number is to stop the argument right there, or at least slow it down and proceed with caution. And the more of these numbers an argument depends on, the more I think the argument just can't meaningfully support its conclusion, and, therefore, should not move us to think or act differently.