On the Dwarkesh/Chollet Podcast, and the cruxes of scaling to AGI

JWS 🔸

I agree with Chollet (and OP) that LLMs will probably plateau, but I’m also big into AGI safety—see e.g. my post AI doom from an LLM-plateau-ist perspective.

(When I say “AGI” I think I’m talking about the same thing that you called digital “beings” in this comment.)

Here are a bunch of agreements & disagreements.

if François is right, then I think this should be considered strong evidence that work on AI Safety is not overwhelmingly valuable, and may not be one of the most promising ways to have a positive impact on the world.

I think François is right, but I do think that work on AI safety is overwhelmingly valuable.

Here’s an allegory:

There’s a fast-breeding species of extraordinarily competent and ambitious intelligent aliens. They can do science much much better than Einstein, they can run businesses much much better than Bezos, they can win allies and influence much much better than Hitler or Stalin, etc. And they’re almost definitely (say >>90% chance) coming to Earth sooner or later, in massive numbers that will keep inexorably growing, but we don’t know exactly when this will happen, and we also don’t know in great detail what these aliens will be like—maybe they will have callous disregard for human welfare, or maybe they’ll be great. People have been sounding the alarm for decades that this is a big friggin’ deal that warrants great care and advanced planning, but basically nobody cares.

Then some scientist Dr. S says “hey those dots in the sky, maybe they’re the aliens! If so they might arrive in the next 5-10 years, and they’ll have the following specific properties”. All of a sudden there’s a massive influx of societal interest: interest in the dots in particular, and interest in alien preparation in general.

But it turns out that Dr. S was wrong! The dots are small meteors. They might hit earth and cause minor damage but nothing unprecedented. So we’re back to not knowing when the aliens will come or what exactly they’ll be like.

Is Dr. S’s mistake “strong evidence that alien prep is not overwhelmingly valuable”? No! It just puts us back where we were before Dr. S came along.

(end of allegory)

(Glossary: the “aliens” are AGIs; the dots in the sky are LLMs; and Dr. S would be a guy saying LLMs will scale to AGI with no additional algorithmic insights.)

It would make AI Safety work less tractable

If LLMs will plateau (as I expect), I think there are nevertheless lots of tractable projects that would help AGI safety. Examples include:

The human brain runs some sort of algorithm to figure things out and get things done and invent technology etc. We don’t know exactly what that algorithm is (or else we would already have AGI), but we know much more than zero about it, and it’s obviously at least possible that AGI will be based on similar algorithms. (I actually believe something stronger, i.e. that it’s likely, but of course that’s hard to prove.) So now this is a pretty specific plausible AGI scenario that we can try to plan for. And that’s my own main research interest; see Intro to Brain-Like-AGI Safety. (Other varieties of model-based reinforcement learning would be pretty similar too.) Anyway, there’s tons of work to do on planning for that.
…For example, I list seven projects here. Some of those (e.g. this) seem robustly useful regardless of how AGI will work, i.e. even if future AGI is neither brain-like nor LLM-like, but rather some yet-unknown third category of mystery algorithm.
Outreach—After the invention of AGI (again, what you called digital “beings”), there are some obvious-to-me consequences, like “obviously human extinction is on the table as a possibility”, and “obviously permanent human oversight of AGIs in the long term would be extraordinarily difficult if not impossible” and “obviously AGI safety will be hard to assess in advance” and “if humans survive, those humans will obviously not be doing important science and founding important companies given competition from trillions of much-much-more-competent AGIs, just like moody 7-year-olds are not doing important science and founding important companies today” and “obviously there will be many severe coordination problems involved in the development and use of AGI technology”. But, as obvious as those things are to me, hoo boy there sure are tons of smart prominent people who would very confidently disagree with all of those. And that seems clearly bad. So trying to gradually establish good common knowledge of basic obvious things like that, through patient outreach and pedagogy, seems robustly useful and tractable to me.
Policy—I think there are at least a couple governance and policy interventions that are robustly useful regardless of whether AGI is based on LLMs (as others expect) or not (as I expect). For example, I think there’s room for building better institutions through which current and future tech companies (and governments around the world) can cooperate on safety as AGI approaches (whenever that may happen).

It seems that many people in Open Phil have substantially shortened their timelines recently (see Ajeya here).

For what it’s worth, Yann LeCun is very confidently against LLMs scaling to AGI, and yet LeCun seems to have at least vaguely similar timelines-to-AGI as Ajeya does in that link.

Ditto for me.

See also my discussion here (“30 years is a long time. A lot can happen. Thirty years ago, deep learning was an obscure backwater within AI, and meanwhile people would brag about how their fancy new home computer had a whopping 8 MB of RAM…”)

To be clear, you can definitely find some people in AI safety saying AGI is likely in <5 years, although Ajeya is not one of those people. This is a more extreme claim, and does seem pretty implausible unless LLMs will scale to AGI.

I think this makes me very concerned about a strong ideological and philosophical bubble in the Bay regarding these core questions of AI.

Yeah some examples would be:

many AI safety people seem happy to make confident guesses about what tasks the first AGIs will be better and worse at doing based on current LLM capabilities;
many AI safety people seem happy to make confident guesses about how much compute the first AGIs will require based on current LLM compute requirements;
many AI safety people seem happy to make confident guesses about which companies are likely to develop AGIs based on which companies are best at training LLMs today;
many AI safety people seem happy to make confident guesses about AGI UIs based on the particular LLM interface of “context window → output token”;
etc. etc.

Many ≠ All! But to the extent that these things happen, I’m against it, and I do complain about it regularly.

(To be clear, I’m not opposed to contingency-planning for the possibility that LLMs will scale to AGIs. I don’t expect that contingency to happen, but hey, what do I know, I’ve been wrong before, and so has Chollet. But I find that these kinds of claims above are often stated unconditionally. Or even if they’re stated conditionally, the conditionality is kinda forgotten in practice.)

I think it’s also important to note that these habits above are regrettably common among both AI pessimists and AI optimists. As examples of the latter, see me replying to Matt Barnett and me replying to Quintin Pope & Nora Belrose.

By the way, this might be overly-cynical, but I think there are some people (coming into the AI safety field very recently) who understand how LLMs work but don’t know how (for example) model-based reinforcement learning works, and so they just assume that the way LLMs work is the only possible way for any AI algorithm to work.

JWS 🔸

Hey Steven! As always I really appreciate your engagement here, and I’m going to have to really simplify but I really appreciate your links^[1] and I’m definitely going to check them out 🙂

I think François is right, but I do think that work on AI safety is overwhelmingly valuable.
Here’s an allegory:

I think the most relevant disagreement that we have^[2] is the beginning of your allegory. To indulge it, I don't think we have knowledge of the intelligent alien species coming to earth, and to the extent we have a conceptual basis for them we can't see any signs of them in the sky. Pair this with the EA concern that we should be concerned about the counterfactual impact of our actions, and that there are opportunities to do good right here and now,^[3] it shouldn't be a primary EA concern.

Now, what would make it a primary concern is if Dr S is right and the aliens are spotted and they're on their way, but I don't think he's right. And, to stretch the analogy to breaking point, I'd be very upset if, after I turned my telescope to the co-ordinates Dr S mentions and saw meteors instead of spaceships, significant parts of the EA movement still wanted more funding to construct the ultimate-anti-alien-space-laser or do alien-defence-research instead of buying bednets.

(When I say “AGI” I think I’m talking about the same thing that you called digital “beings” in this comment.)

A secondary crux I have is that a 'digital being' in the sense I describe, and possibly the AGI you think of, will likely exhibit certain autopoietic properties that make it significantly different from either the paperclip maximiser or a 'foom-ing' ASI. This is highly speculative though, based on a lot of philosophical intuitions, and I wouldn’t want to bet humanity’s future on it at all in the case where we did see aliens in the sky.

To be clear, you can definitely find some people in AI safety saying AGI is likely in <5 years, although Ajeya is not one of those people. This is a more extreme claim, and does seem pretty implausible unless LLMs will scale to AGI.

My take on it, though I admit driven by selection bias on Twitter, is that many people in the Bay-Social-Scene are buying into the <5 year timelines. Aschenbrenner for sure, Kokotajlo as well, and even maybe Amodei^[4] as well? (Edit: Also lots of prominent AI Safety Twitter accounts seem to have bought fully into this worldview, such as the awful 'AI Safety Memes' account.) However, I do agree it’s not all of AI Safety for sure! I just don’t think that, once you take away that urgency and certainty of the problem, it ought to be considered the world's “most pressing problem”, at least without further controversial philosophical assumptions.

^[1] I remember reading and liking your 'LLM plateau-ist' piece.

^[2] I can't speak for all the others you mention, but fwiw I do agree with your frustrations at the AI risk discourse on various sides.

^[3] I'd argue through increasing human flourishing and reducing the suffering we inflict on animals, but you could sub in your own cause area here for instance, e.g. 'preventing nuclear war' if you thought that was both likely and an x-risk.

^[4] See the transcript with Dwarkesh at 00:24:26 onwards where he says that superhuman/transformative AI capabilities will come within 'a few years' of the interview's date (so within a few years of summer 2023).

Ryan Greenblatt

Pair this with the EA concern that we should be concerned about the counterfactual impact of our actions, and that there are opportunities to do good right here and now,^[3] it shouldn't be a primary EA concern.

As in, your crux is that the probability of AGI within the next 50 years is less than 10%?

I think from an x-risk perspective it is quite hard to beat AI risk even on pretty long timelines. (Where the main question is bio risk and what you think about (likely temporary) civilizational collapse due to nuclear war.)

It's pretty plausible that on longer timelines technical alignment/safety work looks weak relative to other stuff focused on making AI go better.

JWS 🔸

As in, your crux is that the probability of AGI within the next 50 years is less than 10%?

I'm essentially deeply uncertain about how to answer this question, in a true 'Knightian Uncertainty' sense and I don't know how much it makes sense to use subjective probability calculus. It is also highly variable to what we mean by AGI though. I find many of the arguments I've seen to be a) deference to the subjective probabilities of others or b) extrapolation of straight lines on graphs - neither of which I find highly convincing. (I think your arguments seem stronger and more grounded fwiw)

I think from an x-risk perspective it is quite hard to beat AI risk even on pretty long timelines.

I think this can hold, but it holds not just in light of particular facts about AI progress now but in light of various strong philosophical beliefs about value, what future AI would be like, and how the future would be post the invention of said AI. You may have strong arguments for these, but I find many arguments for the overwhelming importance of AI Safety do very poorly to ground these, especially in the light of compelling interventions to do good that exist in the world right now.

Ryan Greenblatt

It is also highly variable to what we mean by AGI though.

I'm happy to do timelines to the singularity and operationalize this with "we have the technological capacity to pretty easily build projects as impressive as a dyson sphere".

(Or 1000x electricity production, or whatever.)

In my view, this likely adds only a moderate number of years (3-20 depending on how various details go).

Steven Byrnes

For what it’s worth, Yann LeCun is very confidently against LLMs scaling to AGI, and yet LeCun seems to have at least vaguely similar timelines-to-AGI as Ajeya does in that link.
Ditto for me.

Oh hey here’s one more: Chollet himself (!!!) has vaguely similar timelines-to-AGI (source) as Ajeya does. (Actually if anything Chollet expects it a bit sooner: he says 2038-2048, Ajeya says median 2050.)

[anonymous]

Hi JWS, unsure if you’d see this since it’s on LW and I thought you’d be interested (I’m not sure what to think of Chollet’s work tbh and haven’t been able to spend time on it, so I’m not making much of a claim in sharing this!)

https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o

JWS 🔸

Thanks for sharing this Phil, it's very unfortunate it came out just as I went on holiday! To all readers, this will probably be the major substantive response I make in these comments, and to get the most out of it you'll probably need some technical/background understanding of how AI systems work. I'll tag @Ryan Greenblatt directly so he can see my points, but only the first is really directed at him, the rest are responding to the ideas and interpretations.

First, to Ryan directly, this is really great work! Like, awesome job 👏👏 My only sadness here is that I get the impression you think this work is kind of a dead-end? On the contrary, I think this is the kind of research programme that could actually lead to updates (either way) across the different factions on AI progress and AI risk. You get mentioned positively on the xrisk-hostile Machine Learning Street Talk about this! Melanie Mitchell is paying attention (and even appeared in your substack comments)! I feel like the iron is hot here and it's a promising and exciting vein of research!^[1]

Second, as others have pointed out, the claimed numbers are not SOTA, but that is because there are different training sets and I think the ARC-AGI team should be more clear about that. But to be clear for all readers, this is what's happened:

Ryan got a model to achieve 50% accuracy on the public evaluation training set provided by Chollet in the original repo. Ryan has not got a score on the private set, because those answers are kept privately on Kaggle to prevent data leakage. Note that Ryan's original claims were based on the different sets being IID and the same difficulty, which is not true. We should expect performance to be lower on the private set.
The current SOTA on the private test set was Cole, Osman, and Hodel with 34%, though apparently they now have reached 39% on the private set. Ryan has noted this, so I assume we'll have clarifications/corrections soon to that bit of his piece.
Therefore Ryan has not achieved SOTA performance on ARC. That doesn't mean his work isn't impressive, but it is not true that GPT4o improved the ARC SOTA 16% in 6 days.
Also note from the comments on Substack, when limited to ~128 sample programmes per case, the results were 26% on the held-out test of the training set. It's good, but not state of the art, and one wonders whether the juice is worth the squeeze there, especially if Jianghong Ying's calculations of the tokens-per-case are accurate. We seem to need exponential data to improve results.

Currently, as Ryan notes, his solution is ineligible for the ARC prize as it doesn't meet the various restrictions on runtime/compute/internet connection to enter. While the organisers say that this is meant to encourage efficiency,^[2] I suspect it may be more of a security-conscious decision to limit people's access to the private test set. It is worth noting that, as the public training and eval sets are on GitHub (as will be most blog pieces about them, and eventually Ryan's own piece as well as my own) dataset contamination remains an issue to be concerned with.^[3]

Third, and most importantly, I think Ryan's solution shows that the intelligence is coming from him, and not from Chat-GPT4o. skybrian makes this point in the Substack comments. For example:

Ryan came up with the idea and implementation to use ASCII encoding since the vision capabilities of GPT4o were so unreliable. Ryan did some feature extraction on the ARC problems.
Ryan wrote the prompts and did the prompt engineering in lieu of there being fine-tuning available. He also provides the step-by-step reasoning in his prompts. Those long, carefully crafted prompts seem quite domain/problem-specific, and would probably point more toward ARC's insufficiency as a test for generality than toward an example of general ability in LLMs.
Ryan notes that the additional approaches and tweaks are critical for performance gain above the 'just draw more samples'. I think that meme was a bit unkind, let alone inaccurate, and I kinda wish it was removed from the piece tbh.
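
For readers who haven't looked at the repo, here is a minimal sketch of the general shape of that pipeline: render grids as text, draw many candidate programs, keep only those that reproduce the worked examples, and vote on the test output. This is not Ryan's actual code. The names `grid_to_ascii` and `pick_output` are hypothetical, and the three hand-written functions merely stand in for programs that an LLM would sample.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Program = Callable[[Grid], Grid]


def grid_to_ascii(grid: Grid) -> str:
    """Render a grid of colour indices as rows of space-separated digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def pick_output(
    candidates: List[Program],
    examples: List[Tuple[Grid, Grid]],
    test_input: Grid,
) -> Optional[Grid]:
    """Keep candidates that reproduce every example, then majority-vote."""
    survivors = [p for p in candidates if all(p(x) == y for x, y in examples)]
    if not survivors:
        return None  # no sampled program explains the examples
    outputs = [p(test_input) for p in survivors]
    votes = Counter(grid_to_ascii(o) for o in outputs)
    winner = votes.most_common(1)[0][0]
    return next(o for o in outputs if grid_to_ascii(o) == winner)


# Hand-written stand-ins for LLM-sampled programs (assumed for illustration).
def identity(grid: Grid) -> Grid:
    return [list(row) for row in grid]


def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]


def flip_horizontal(grid: Grid) -> Grid:
    return [list(reversed(row)) for row in grid]


examples = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
result = pick_output([identity, transpose, flip_horizontal], examples, [[5, 6, 7]])
```

In this toy run only `flip_horizontal` survives the filter. The filter and vote are mechanical; the open question in this thread is how much of the interesting work sits in generating good candidates versus designing the representation and prompts around them.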

If you check the repo (linked above), it's full of some really cool code to make this solution work, but that's the secret sauce. To my eyes, I think the hard part here was the scaffolding done by Ryan rather than the pre-training^[4] of the LLM (this is another cruxy point I highlighted in my article). I think it's much less conceptually hard to scrape the entire internet and shove it through a transformer architecture. A lot of leg work and cost sure, but the hard part is the ideas bit, and that's still basically all Ryan-GPT.

Fourth, I got massively nerdsniped by what 'in-context learning' actually is. A lot of talk about it from a quick search seemed to be vague, wishy-washy, and highly anthropomorphising. I'm quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all. The model certainly isn't learning anything. After you ask GPT4o a query you can boot up a new instance and it'll be as clueless as when you started the first session, or you could just flood the context window with enough useless tokens so the original task gets cut off. So, if I accept Ryan's framing of the inconsistent triad, I'd reject the 3rd one, and say that "Current LLMs never "learn" at runtime (e.g. the in-context learning they can do isn't real learning)". I'm going to continue following the 'in-context learning' nerdsnipe, but yeah since we know that weights are completely fixed and the model isn't learning, what is doing it? And can we think of a better name for it than 'in-context learning'?

Fifth and finally, I'm slightly disappointed at Buck and Dwarkesh for kinda posing this as a 'mic drop' against ARC.^[5] Similarly, Zvi seems to dismiss it, though he praises Chollet for making a stand with a benchmark. In contrast, I think that the ability (or not) of models to reason robustly, out-of-distribution, without having the ability to learn from trillions of pre-labelled samples is a pretty big crux for AI Safety's importance. Sure, maybe in a few months we'll see the top score on the ARC Challenge above 85%, but could such a model work in the real world? Is it actually a general intelligence capable of novel or dangerous acts, such as to motivate AI risk? This is what Chollet is talking about in the podcast when he says:

I’m pretty skeptical that we’re going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved. If you just train the model on millions or billions of puzzles similar to ARC, you’re relying on the ability to have some overlap between the tasks that you train on and the tasks that you’re going to see at test time. You’re still using memorization.

If you're reading, thanks for making it through this comment! I'd recommend reading Ryan's full post first (which Philb linked above), but there's been a bunch of disparate discussion there, on LessWrong, on HackerNews etc. If you want to pursue what the LLM-reasoning-sceptics think, I'd recommend following/reading Melanie Mitchell and Subbarao Kambhampati. Finally, if you think this topic/problem is worth collaborating on then feel free to reach out to me. I'd love to hear from anyone who thinks it's worth investigating and would want to pool resources.

^[1] (Ofc your time is valuable and you should pursue what you think is valuable, I'd just hope this could be the start of a cross-factional, positive-sum research program which would be such a breath of fresh air compared to other AI discourse atm)

^[2] Ryan estimates he used 1000x more runtime compute per problem than Cole et al., and also spent $40,000 in API costs alone (I wonder how much it costs for just 1 run though?).

^[3] In the original interview, Mike mentions that 'there is an asterisk on any score that's reported on against the public test set' for this very reason.

^[4] H/t to @Max Nadeau for being on top of some of the clarifications on Twitter.

^[5] Perhaps I'm misinterpreting, and I am using them as a proxy for the response of AI Safety as a whole, but it's very much the 'vibe' I got from those reactions.

Ryan Greenblatt

Sure, maybe in a few months we'll see the top score on the ARC Challenge above 85%, but could such a model work in the real world?

It sounds like you agree with my claims that ARC-AGI isn't that likely to track progress and that other benchmarks could work better?

(The rest of your response seemed to imply something different.)

JWS 🔸

At the moment I think ARC-AGI does a good job at showing the limitations of transformer models on simple tasks that they don't come across in their training set. I think if the score was claimed, we'd want to see how it came about. It might be through frontier models demonstrating true understanding, but it might be through shortcut learning/data leakage/an impressive but overly specific and intuitively unsatisfying solution.

If ARC-AGI were to be broken (within the constraints Chollet and Knoop place on it) I'd definitely change my opinions, but what they'd change to depends on the matter of how ARC-AGI was solved. That's all I'm trying to say in that section (perhaps poorly)

Ryan Greenblatt

the claimed numbers are not SOTA, but that is because there are different training sets and I think the ARC-AGI team should be more clear about that

Agreed, though it is possible that my approach is/was SOTA on the private set. (E.g., because Jack Cole et al.'s approach is somewhat more overfit.)

I'm waiting on the private leaderboard results and then I'll revise.

Ryan Greenblatt

My only sadness here is that I get the impression you think this work is kind of a dead-end?

I don't think it is a dead end.

As I say in the post:

ARC-AGI probably isn't a good benchmark for evaluating progress towards TAI: substantial "elicitation" effort could massively improve performance on ARC-AGI in a way that might not transfer to more important and realistic tasks.
But, I still think that work like ARC-AGI can be good on the margin for getting a better understanding of current AI capabilities.

Ryan Greenblatt

So, if I accept Ryan's framing of the inconsistent triad, I'd reject the 3rd one, and say that "Current LLMs never "learn" at runtime (e.g. the in-context learning they can do isn't real learning)"

You have to reject one of the three. So, if you reject the third (as I do), then you think LLMs do learn at runtime.

I'm quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all

In RLHF and training, no aspect of the GPU hardware is being updated at all, it's all frozen. So why does that count as learning? I would say that a system can (potentially!) be learning as long as there is some evolving state. In the case of transformers and in-context learning, that state is activations.
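
A toy sketch of "frozen weights, evolving state", assuming nothing about real transformer internals: a single attention-style lookup whose only parameter, `BETA`, is never updated, while the context it reads from grows during a session. The names `predict` and `hidden_rule`, the value of `BETA`, and the rule `y = 2x` are all made up for this illustration.

```python
import math

BETA = 4.0  # the only "weight" in this toy model; it is never updated


def predict(context, query):
    """Attention-style prediction: a softmax-weighted average of context y-values.

    Context pairs whose x is close to the query receive more weight.
    With an empty context there is nothing to attend to, so return 0.0.
    """
    if not context:
        return 0.0
    scores = [-BETA * (x - query) ** 2 for x, _ in context]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * y for w, (_, y) in zip(weights, context)) / total


def hidden_rule(x):
    """The task the in-context examples are drawn from (assumed: y = 2x)."""
    return 2.0 * x


query = 1.0
context = []  # the evolving state: grows over the course of one "session"
errors = []
for x in [0.0, 0.5, 1.5, 2.0, 1.0]:
    errors.append(abs(predict(context, query) - hidden_rule(query)))
    context.append((x, hidden_rule(x)))
errors.append(abs(predict(context, query) - hidden_rule(query)))

# A new session starts with an empty context and the same frozen BETA.
fresh_session_error = abs(predict([], query) - hidden_rule(query))
```

Within the session the error on the query falls from 2.0 to roughly zero as examples accumulate, even though `BETA` never changes. A fresh session is back at 2.0. Whether state that improves predictions but vanishes with the context deserves the word "learning" is exactly the disagreement here; the sketch only shows that both observations are compatible.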

JWS 🔸

You have to reject one of the three. So, if you reject the third (as I do), then you think LLMs do learn at runtime.

Ah sorry I misread the trilemma, my bad! I think I'd still hold the 3rd to be true (Current LLMs never "learn" at runtime) though I'm open to changing my mind on that looking at further research. I guess I could see ways to reject 1 (e.g. if I copied the answers and just used a lookup table I'd get 100% but I don't think there's any learning, so it's certainly feasible for this to be false, but agreed it doesn't feel satisfying), or 2 (Maybe Chollet would say selection-from-memorised-templates doesn't count as a learning, also agreed unsatisfying). It's a good challenge!

In RLHF and training, no aspect of the GPU hardware is being updated at all, it's all frozen. So why does that count as learning?

I'm not really referring to hardware here, in pre-training and RLHF the model weights are being changed and updated, and that's where the 'learning' (if we want to call it that) comes in - the model is 'learning' to store information/generate information with some combination of accurately predicting the next token in its training data and satisfying the RL model created from human reward labelling. Which is my issue with calling ICL 'learning' since the model weights are fixed, the model isn't learning anything. Similarly, all the activation functions between the layers do not change either. It also doesn't make intuitive sense to me to call the outputs of layers 'learning' - the activations are 'just matmul' which I know is reductionist, but they aren't a thing that acquires a new state in my mind.

But again, this is something I want to do a deep dive into myself, so I accept that my thoughts on ICL might not be very clear

Ryan Greenblatt

I'm not really referring to hardware here, in pre-training and RLHF the model weights are being changed and updated

Sure, I was just using this as an example. I should have made this more clear.

Here is a version of the exact same paragraph you wrote but for activations and in-context learning:

in pre-training and RLHF the model activations are being changed and updated by each layer, and that's where the 'in-context learning' (if we want to call it that) comes in - the activations are being updated/optimized to better predict the next token and understand the text. The layers learned to in-context learn (update the activations) across a wide variety of data in pretraining.

(We can show transformers learning to optimize in [very toy cases](https://www.lesswrong.com/posts/HHSuvG2hqAnGT5Wzp/no-convincing-evidence-for-gradient-descent-in-activation#Transformers_Learn_in_Context_by_Gradient_Descent__van_Oswald_et_al__2022_).)

Fair enough if you want to say "the model isn't learning, the activations are learning", but then you should also say "short term (<1 minute) learning in humans isn't the brain learning, it is the transient neural state learning".

JWS 🔸

I'll have to dive into the technical details here I think, but the mystery of in-context learning has certainly shot up my reading list, and I really appreciate that link btw! It seems Blaine has some of the same a priori scepticism that I do towards it, but the right way for me to proceed is to dive into the empirical side and see if my ideas hold water there.

Ryan Greenblatt

Third, and most importantly, I think Ryan's solution shows that the intelligence is coming from him, and not from Chat-GPT4o. skybrian makes this point in the Substack comments.
[...]
To my eyes, I think the hard part here was the scaffolding done by Ryan rather than the pre-training^[4] of the LLM (this is another cruxy point I highlighted in my article).

Quoting from a substack comment I wrote in response:

Certainly some credit goes to me and some to GPT4o.
The solution would be much worse without careful optimization and wouldn't work at all without gpt4o (or another llm with similar performance).
It's worth noting a high fraction of my time went into writing prompts and optimizing the representation. (Which is perhaps better described as teaching gpt4o and making it easier for it to see the problem.)
There are different analogies here which might be illuminating:
Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for their abilities goes to society and some to their brain.
If you remove my ability to see (or conversely, use fancy tools to make it easier for a blind person to see) this would greatly affect my ability to do ARC-AGI puzzles.
You can build systems around people which remove most of the interesting intelligence from various tasks.
I think what is going on here is analogous to all of these.
Separately, this tweet is relevant: https://x.com/MaxNadeau_/status/1802774696192246133

I think it's much less conceptually hard to scrape the entire internet and shove it through a transformer architecture. A lot of leg work and cost sure, but the hard part is the ideas bit,

It is worth noting that hundreds (thousands?) of high quality researcher years have been put into making GPT4o more performant.

JWS 🔸

The solution would be much worse without careful optimization and wouldn't work at all without gpt4o (or another llm with similar performance).

I can buy that GPT4o would be best, but perhaps other LLMs might have reached 'ok' scores on ARC-AGI if directly swapped out? I'm not sure what you refer to by 'careful optimization' here though.

There are different analogies here which might be illuminating:
Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for their abilities goes to society and some to their brain.
If you remove my ability to see (or conversely, use fancy tools to make it easier for a blind person to see) this would greatly affect my ability to do ARC-AGI puzzles.
You can build systems around people which remove most of the interesting intelligence from various tasks.
I think what is going on here is analogous to all of these.

On these analogies:

This is an interesting point actually. I suppose credit-assignment for learning is a very difficult problem. In this case though, the child stranded would (hopefully!) survive and make a life for themselves and learn the skills they need to survive. They're active agents using their innate general intelligence to solve novel problems (per Chollet). If I put a hard-drive with gpt4o's weights in the forest, it'll just rust. And that'll happen no matter how big we make that model/hard-drive imo.^[1]
Agreed here, will be very interesting to see how improved multimodality affects ARC-AGI scores. I think that we have interesting cases of humans being able to perform these tasks in their head presumably without sight? e.g. Blind Chess Players with high ratings or Mathematicians who can reason without sight. I think Chollet's point in the interview is that they seem to be able to parse the JSON inputs fine in various cases, but still can't perform generalisation.
Yep I think this is true, and perhaps my greatest fear from delegating power to complex AI systems. This is an empirical question we'll have to find out, can we simply automate away everything humans do/are needed for through a combination of systems even if each individual part/model used in said system is not intelligent?

Separately, this tweet is relevant: https://x.com/MaxNadeau_/status/1802774696192246133

Yep saw Max's comments and think he did a great job on X bringing some clarifications. I still think the hard part is the scaffolding. Money is easy for SanFran VCs to provide, and we know they're all fine to scrape-data-first-ask-legal-forgiveness later.

I think there's a separate point where enough scaffolding + LLM means the resulting AI system is not well described by being an LLM anymore. Take the case of CICERO by Meta. Is that a 'scaffolded LLM'? I'd rather describe it as a system which incorporates an LLM as a particular part. It's harder to naturally scale such a system in the way that you can with the transformer architecture by stacking more layers or pre-training for longer on more data.

My intuition here is that scaffolding to make a system work well on ARC-AGI would make it less useable on other tasks, so sacrificing generality for specific performance. Perhaps in this case ARC-AGI is best used as a suite of benchmarks, where the same model and scaffolding should be used for each? (Just thinking out loud here)

Final point, I've really appreciated your original work and your comments on substack/X/here. I do apologise if I didn't make clear what parts were my personal reflections/vibes instead of more technical disagreements on interpretation - these are very complex topics (at least for me) and I'm trying my best to form a good explanation of the various evidence and data we have on this. Regardless of our disagreements on this topic, I've learned a lot :)

^[1]
Similarly, you can pre-train a model to create weights and get to a humongous size. But it won't do anything until you ask it to generate a token. At least, that's my intuition. I'm quite sceptical of how pre-training a transformer is going to lead to creating a mesa-optimiser.

Ryan Greenblatt

But it won't do anything until you ask it to generate a token. At least, that's my intuition.

I think this seems like mostly a fallacy. (I feel like there should be a post explaining this somewhere.)

Here is an alternative version of what you said to indicate why I don't think this is a very interesting claim:

Sure you can have a very smart quadriplegic who is very knowledgeable. But they won't do anything until you let them control some actuator.

If your view is that "prediction won't result in intelligence", fair enough, though it's notable that the human brain seems to heavily utilize prediction objectives.

JWS 🔸

(folding in replies to different sub-comments here)

Sure you can have a very smart quadriplegic who is very knowledgeable. But they won't do anything until you let them control some actuator.

I think our misunderstanding here is caused by the word do. Sure, Stephen Hawking couldn't control his limbs, but nevertheless his mind was always working. He kept writing books and papers throughout his life, and his brain was 'always on'. A transformer model is a set of frozen weights that are only 'on' when a prompt is entered. That's what I mean by 'it won't do anything'.

As far as this project goes, it seems extremely implausible to me that the hard part of this project is the scaffolding work I did.

Hmm, maybe we're differing on what 'hard work' means here! Could be a difference between what's expensive, time-consuming, etc. I'm not sure this holds for any reasonable scheme, and I definitely think that you deserve a lot of credit for the work you've done, much more than GPT4o.

I think my results are probably SOTA based on more recent updates.

Congrats! I saw that result and am impressed! It's definitely clearly SOTA on the ARC-AGI-PUB leaderboard, but the original '34%->50% in 6 days ARC-AGI breakthrough' claim is still incorrect.

Ryan Greenblatt

I can buy that GPT4o would be best, but perhaps other LLMs might have reached 'ok' scores on ARC-AGI if directly swapped out? I'm not sure what you're referring to by 'careful optimization' here though.

I think much worse LLMs like GPT-2 or GPT-3 would virtually eliminate performance.

This is very clear as these LLMs can't code basically at all.

If you instead consider LLMs which are only somewhat less powerful like llama-3-70b (which is perhaps 10x less effective compute?), the reduction in perf will be smaller.

Ryan Greenblatt

Perhaps in this case ARC-AGI is best used as a suite of benchmarks, where the same model and scaffolding should be used for each?

Yes, it seems reasonable to try out general purpose scaffolds (like what METR does) and include ARC-AGI in general purpose task benchmarks.

I expect substantial performance reductions from general purpose scaffolding, though some fraction will be due to not having prefix compute and allocating test-time compute less effectively.

Ryan Greenblatt

I still think the hard part is the scaffolding.

For this project? In general?

As far as this project goes, it seems extremely implausible to me that the hard part of this project is the scaffolding work I did. This probably holds for any reasonable scheme for dividing credit and determining what is difficult.

Ryan Greenblatt

Fifth and finally, I'm slightly disappointed at Buck and Dwarkesh for kinda posing this as a 'mic drop' against ARC.

I don't think the objection is to ARC (the benchmark), I think the objection is to specific (very strong!) claims that Chollet makes.

I think the benchmark is a useful contribution as I note in another comment.

JWS 🔸

Oh yeah this wasn't against you at all! I think you're a great researcher, and an excellent interlocutor, and I learn a lot (and am learning a lot) from both your work and your reactions to my reaction.^[1] Point five was very much a reaction against a 'vibe' I saw in the wake of your results being published.

Like, let's take Buck's tweet for example. We know now that a) your results aren't technically SOTA and b) it's not an LLM solution, it's an LLM + your scaffolding + program search, and I think that's importantly not the same thing.

^[1]
I sincerely hope my post + comments have been somewhat more stimulating than frustrating for you.

Ryan Greenblatt

We know now that a) your results aren't technically SOTA

I think my results are probably SOTA based on more recent updates.

It's not an LLM solution, it's an LLM + your scaffolding + program search, and I think that's importantly not the same thing.

I feel like this is a pretty strange way to draw the line about what counts as an "LLM solution".

Consider the following simplified dialogue as an example of why I don't think this is a natural place to draw the line:

Human skeptic: Humans don't exhibit real intelligence. You see, they'll never do something as impressive as sending a human to the moon.

Humans-have-some-intelligence advocate: Didn't humans go to the moon in 1969?

Human skeptic: That wasn't humans sending someone to the moon that was Humans + Culture + Organizations + Science sending someone to the moon! You see, humans don't exhibit real intelligence!

Humans-have-some-intelligence advocate: ... Ok, but do you agree that if we removed the Humans from the overall approach it wouldn't work.

Human skeptic: Yes, but same with the culture and organization!

Humans-have-some-intelligence advocate: Sure, I guess. I'm happy to just call it humans+etc I guess. Do you have any predictions for specific technical feats which are possible to do with a reasonable amount of intelligence that you're confident can't be accomplished by building some relatively straightforward organization on top of a bunch of smart humans within the next 15 years?

Human skeptic: No.

Of course, I think actual LLM skeptics often don't answer "No" to the last question. They often do have something that they think is unlikely to occur with a relatively straightforward scaffold on top of an LLM (a model descended from the current LLM paradigm, perhaps trained with semi-supervised learning and RLHF).

I actually don't know what in particular Chollet thinks is unlikely here. E.g., I don't know if he has strong views about the performance of my method, but using the SOTA multimodal model in 2 years.

JWS 🔸

Final final edit: Congrats on the ARC-AGI-PUB results, really impressive :)

This will be my final response on this thread, because life is very time consuming and I'm rapidly reaching the point where I need to dive back into the technical literature and stress-test my beliefs and intuitions again. I hope Ryan and any readers have found this exchange useful/enlightening for seeing two different perspectives hopefully have productive disagreement?

If you found my presentation of the scaling-skeptical position highly unconvincing, I'd recommend following the work and thoughts of Tan Zhi Xuan (find her on X here). One of my biggest updates was finding her work after she pushed back on Jacob Steinhardt here, and recently she gave a talk about her approach to Alignment. I urge readers to consider spending much more of their time listening to her than to me about AI.

I feel like this is a pretty strange way to draw the line about what counts as an "LLM solution".

I don't think so? Again, I wouldn't call CICERO an "LLM solution". Surely there'll be some amount of scaffolding which tips over into the scaffolding being the main thing and the LLM just being a component part? It's probably all blurry lines for sure, but I think it's important to separate 'LLM only systems' from 'systems that include LLMs', because it's very easy to conceptually scale up the former but harder to do the latter.

Human skeptic: That wasn't humans sending someone to the moon that was Humans + Culture + Organizations + Science sending someone to the moon! You see, humans don't exhibit real intelligence!

I mean, you use this as a reductio, but that's basically the theory of Distributed Cognition, and also linked to the ideas of 'collective intelligence', though that's definitely not an area I'm an expert in by any means. It also reminds me a lot of Chalmers and Clark's thesis of the Extended Mind.^[1]

Of course, I think actual LLM skeptics often don't answer "No" to the last question. They often do have something that they think is unlikely to occur with a relatively straightforward scaffold on top of an LLM (a model descended from the current LLM paradigm, perhaps trained with semi-supervised learning and RLHF).

So I can't speak for Chollet and other LLM skeptics, and I think again LLMs+extra (or extras+LLMs) are a different beast from LLMs on their own and possibly an important crux. Here are some things I don't think will happen in the near-ish future (on the current paradigm):

I believe an adversarial Imitation Game, where the interrogator is aware of both the AI system's LLM-based nature and its failure modes, is unlikely to be consistently beaten in the near future.^[2]
Primarily-LLM models, in my view, are highly unlikely to exhibit autopoietic behaviour or develop agentic designs independently (i.e. without prompting/direction by a human controller).
I don't anticipate these models exponentially increasing the rate of scientific research or AI development.^[3] They'll more likely serve as tools used by scientists and researchers themselves to frame problems, but new and novel problems will still remain difficult and be bottlenecked by the real world + Hofstadter's law.
I don't anticipate Primarily-LLM models to become good at controlling and manoeuvring robotic bodies in the 3D world. This is especially true in a novel-test-case scenario (if someone could make a physical equivalent of ARC to test this, that'd be great)
This would be even less likely if the scaffolding remained minimal. For instance, if there's no initial sorting code explicitly stating [IF challenge == turing_test GO TO turing_test_game_module].
Finally, as an anti-RSI operationalisation, the idea of LLM-based models assisting in designing and constructing a Dyson Sphere within 15 years seems... particularly far-fetched for me.

I'm not sure if this reply was my best, it felt a little all-over-the-place, but we are touching on some deep or complex topics! So I'll respectfully bow out now, and thank you again for the discussion and for giving me so much to think about. I really appreciate it Ryan :)

^[1]
Then you get into ideas like embodiment/enactivism etc.
^[2]
I can think of a bunch of strategies to win here, but I'm not gonna say so it doesn't end up in GPT-5 or 6's training data!
^[3]
Of course, with a new breakthrough, all bets could be off, but it's also definitionally impossible to predict those, and unrobust to draw straight lines and graphs to predict the future if you think breakthroughs will be needed. (Not saying you do this, but some other AIXR people definitely seem to be.)

Egg Syntax

I have thoughts, but a question first: you link a Kambhampati tweet where he says,

...as the context window changes (with additional prompt words), the LLM, by design, switches the CPT used to generate next token--given that all these CPTs have been pre-computed?

What does 'CPT' stand for here? It's not a common ML or computer science acronym that I've been able to find.

DanielFilan

Since nobody else has responded, my best guess would be "conditional probability table".
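If that reading is right, the picture in the tweet is roughly a lookup table from a context to a distribution over next tokens. A minimal sketch of that idea follows; the table contents, the two-token context length, and the names `CPT` and `next_token` are all made up for illustration and are not taken from the tweet or from any real model.

```python
from typing import Dict, List, Tuple

# Hypothetical toy table: context (last two tokens) -> P(next token | context).
CPT: Dict[Tuple[str, ...], Dict[str, float]] = {
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("cat", "sat"): {"on": 0.9, "down": 0.1},
}

def next_token(context: List[str], order: int = 2) -> str:
    """Pick the most probable next token for the last `order` tokens of context."""
    key = tuple(context[-order:])
    distribution = CPT[key]  # a different context selects a different row
    return max(distribution, key=distribution.get)

print(next_token(["the", "cat"]))
print(next_token(["the", "cat", "sat"]))
```

Appending a token changes the key, so a different row of the table is used for the next prediction, which seems to be the "switching" the quote describes.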

Egg Syntax

I think Ryan's solution shows that the intelligence is coming from him, and not from Chat-GPT4o.

If this is true, then substituting in a less capable model should have equally good results; would you predict that to be the case? I claim that plugging in an older/smaller model would produce much worse results, and if that's the case then we should consider a substantial part of the performance to be coming from the model.

This is what Chollet is talking about in the podcast when he says...'I’m pretty skeptical that we’re going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved.'

This seems to me to be Chollet trying to have it both ways. Either a) ARC is an important measure of 'true' intelligence (or at least of the ability to reason over novel problems), and so we should consider LLMs' poor performance on it a sign that they're not general intelligence, or b) ARC isn't a very good measure of true intelligence, in which case LLMs' performance on it isn't very important. Those can't be simultaneously true. I think that nearly everywhere but in the quote, Chollet has claimed (and continues to claim) that a) is true.

Egg Syntax

I'm quite confused, given the fact that all of the weights in the transformer are frozen after training and RLHF, why it's called learning at all. The model certainly isn't learning anything.

I would frame it as: the model is learning but then forgetting what it's learned (due to its inability to move anything from working/short-term memory to long-term memory). That's something that we see in learning in humans as well (one example: I've learned an enormous number of six-digit confirmation codes, each of which I remember just long enough to enter it into the website that's asking for it), although of course not so consistently.

Marcel2

Can anyone point me to a good analysis of the ARC test's legitimacy/value? I was a bit surprised when I listened to the podcast, as they made it seem like a high-quality, general-purpose test, but then I was very disappointed to see it's just a glorified visual pattern abstraction test. Maybe I missed some discussion of it in the podcasts I listened to, but it just doesn't seem like people pushed back hard enough on the legitimacy of comparing "language model that is trying to identify abstract geometric patterns through a JSON file" vs. "humans that are just visually observing/predicting the patterns."

Like, is it wrong to demand that humans should have to do this test purely by interpreting the JSON (with no visual aid)?
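To make the comparison concrete, here is a minimal sketch of the two views of a single grid: the raw JSON a text-only model receives, and a character rendering closer to what a human sees. The task content is invented for illustration; the `train`/`input`/`output` layout with integer colour codes is assumed to match the public ARC task files, and `render` is a hypothetical helper.

```python
import json
from typing import List

# Made-up miniature task in the assumed ARC layout (0 = background colour).
RAW_TASK = '{"train": [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}]}'

def render(grid: List[List[int]]) -> str:
    """Draw a grid as text: '.' for background cells, '#' for any other colour."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)

task = json.loads(RAW_TASK)
example = task["train"][0]

print("raw JSON :", json.dumps(example["input"]))
print("rendered :")
print(render(example["input"]))
```

Even in this tiny case the rendered version makes adjacency visible at a glance, whereas the JSON requires counting brackets and commas.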

mlsbt

Language models have no problem interpreting the image correctly. You can ask them for a description of the input grid and they’ll get it right, they just don’t get the pattern.

Marcel2

I wouldn't be surprised if that's correct (though I haven't seen the tests), but that wasn't my complaint. A moderately smart/trained human can also probably convert from JSON to a description of the grid, but there's a substantial difference in experience from seeing even a list of grid square-color labels vs. actually visualizing it and identifying the patterns. I would hazard a guess that humans who are only given a list of square color labels (not just the raw JSON) would perform significantly worse if they are not allowed to then draw out the grids.

And I would guess that even if some people do it well, they are doing it well because they convert from text to visualization.

mlsbt

I might be misunderstanding you here. You can easily get ChatGPT to convert the image to a grid representation/visualization, e.g. in Python, not just a list of square-color labels. It can formally draw out the grid any way you want and work with that, but still doesn’t make progress.

Also, to answer your initial question about ARC’s usefulness, the idea is just that these are simple problems where relevant solution strategies don’t exist on the internet. A non-visual ARC analog might be, as Chollet mentioned, Caesar ciphers with non-standard offsets.
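For readers unfamiliar with the cipher example, here is a minimal sketch of a Caesar shift with an arbitrary offset. The function name `caesar_shift` is hypothetical; the point is only that the rule is trivial to state, while offsets other than the commonly seen ones (such as 13) are presumably rare in training data.

```python
import string

def caesar_shift(text: str, offset: int) -> str:
    """Shift each ASCII letter by `offset` places, wrapping around the alphabet."""
    shifted = []
    for ch in text:
        if ch in string.ascii_lowercase:
            shifted.append(chr((ord(ch) - ord("a") + offset) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            shifted.append(chr((ord(ch) - ord("A") + offset) % 26 + ord("A")))
        else:
            shifted.append(ch)  # leave spaces, digits and punctuation alone
    return "".join(shifted)

message = "Hello, World!"
encoded = caesar_shift(message, 7)    # a non-standard offset
decoded = caesar_shift(encoded, -7)   # shifting back recovers the message
print(encoded, "->", decoded)
```

The same code handles every offset equally well, so a system that has learned the rule rather than memorised specific offsets should not care which one is used.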

Marcel2

Just because an LLM can convert something to a grid representation/visualization does not mean it can itself actually "visualize" the thing. A pure-text model will lack the ability to observe anything visually. Just because a blind human can write out some mathematical function that they can input into a graphing calculator, that does not mean that the human necessarily can visualize what the function's shape will take, even if the resulting graph is shown to everyone else.

mlsbt

I used GPT-4o which is multimodal (and in fact was even trained on these images in particular as I took the examples from the ARC website, not the Github). I did test more grid inputs and it wasn't perfect at 'visualizing' them.

Marcel2

I almost clarified that I know some models technically are multi-modal, but my impression is that the visual reasoning abilities of the current models are very limited, so I’m not at all surprised they’re limited. Among other illustrations of this impression, occasionally I’ve found they struggle to properly describe what is happening in an image beyond a relatively general level.

mlsbt

Looking forward to seeing the ARC performance of future multimodal models. I'm also going to try to think of a text-based ARC analog, that is perhaps more general. There are only so many unique simple 2D-grid transformation rules so it can be brute forced to some extent.
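A minimal sketch of what that kind of brute force could look like: enumerate a handful of simple grid transformations and keep the first one that reproduces every example pair. The candidate list and all names here are hypothetical and far smaller than what real ARC tasks would need.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

def identity(g: Grid) -> Grid:
    return [list(row) for row in g]

def flip_horizontal(g: Grid) -> Grid:
    return [row[::-1] for row in g]

def flip_vertical(g: Grid) -> Grid:
    return [list(row) for row in g[::-1]]

def transpose(g: Grid) -> Grid:
    return [list(row) for row in zip(*g)]

def rotate_clockwise(g: Grid) -> Grid:
    return [list(row) for row in zip(*g[::-1])]

# Hypothetical candidate rule set, tried in insertion order.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "rotate_clockwise": rotate_clockwise,
}

def find_rule(pairs: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the name of the first candidate consistent with all pairs, else None."""
    for name, transform in CANDIDATES.items():
        if all(transform(inp) == out for inp, out in pairs):
            return name
    return None

examples = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
print(find_rule(examples))
```

Composing candidates or adding colour remappings grows the search space quickly, which is one reason this only works "to some extent".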

Aaron_Scher

The paper that introduces the test is probably what you're looking for. Based on a skim, it seems to me that it spends a lot of words laying out the conceptual background that would make this test valuable. Obviously it's heavily selected for making the overall argument that the test is good.

Egg Syntax

"humans that are just visually observing/predicting the patterns."

I don't think that's actually any simpler than doing it as JSON; it's just that our brains are tuned for (and we're more accustomed to) doing it visually. Depending on the specifics of the JSON format, there may be a bit of advantage to being able to have adjacency be natively two-dimensional, but I wouldn't expect that to make a huge difference.

Marcel2

Again, I'd be interested to actually see humans attempt the test by viewing the raw JSON, without being allowed to see/generate any kind of visualization of the JSON. I suspect that most people will solve it by visualizing and manipulating it in their head, as one typically does with these kinds of problems. Perhaps you (a person with syntax in their username) would find this challenge quite easy! Personally, I don't think I could reliably do it without substantial practice, especially if I'm prohibited from visualizing it.

Tsunayoshi

Great post, we need more summaries of disagreeing view points!

Having said that, here are a few replies:

I think this makes me very concerned about a strong ideological and philosophical bubble in the Bay regarding these core questions of AI

I am only slightly acquainted with Bay area AI safety discourse, but my impression is indeed that people lack familiarity with some of the empirically true and surprising points made by skeptics, e.g. Yann LeCun (LLMs DO lack common sense and robustness), and that is bad. Nevertheless, I do not think you are outright banished if you express such a viewpoint. IIRC Yudkowsky himself asserted in the past that LLMs are not sufficient for AGI (he made a point about being surprised at GPT-4 abilities on the Lex Fridman podcast). I would not put too much stock into LW upvotes as a measure of AIS researchers' POV, as most LW users are engaging with AIS as a hobby and consequently do not have a very sophisticated understanding of the current pitfalls in LLMs.

On priors, it seems odd to place very high credence in results on exactly one benchmark. The fate of most "fundamentally difficult for LLMs, this time we mean it" benchmarks (e.g. Winograd schemas, GPQA) has usually been that next-gen LLMs perform substantially better at them, which is also a point "Situational Awareness" makes. Focusing on the ARC challenge now and declaring it the actual true test of intelligence is a little bit of survivorship bias.

Scale Maximalists, both within the EA community and without, would stand to lose a lot of Bayes points/social status/right to be deferred to

Acknowledging that status games are bad in general, I do think that it is valid to point out that historically speaking the "Scale is almost all you need" worldview has so far been much more predictive of the performances that we do see with large models. The fact that this has been taken seriously by the AIS-community/Scott/Open Phil (I think) well before GPT-3 came out, whereas mainstream academic research thought of them as fun toys of little practical significance is a substantial win.

Even under uncertainty about whether the scaling hypothesis turns out to be essentially correct, it makes a lot of sense to focus on the possibility that it is indeed correct and plan/work accordingly. If it is not correct, we only have the opportunity cost of what else we could have done with our time and money. If it is correct, well.. you know the scenarios.

Ryan Greenblatt

Tom Davidson's model is often referred to in the Community, but it is entirely reliant on the current paradigm + scale reaching AGI.

This seems wrong.

It does use constants from the historical deep learning field to provide guesses for parameters and it assumes that compute is an important driver of AI progress.

These are much weaker assumptions than you seem to be implying.

Note also that this work is based on earlier work like bio anchors which was done just as the current paradigm and scaling were being established. (It was published in the same year as Kaplan et al.)

Steven Byrnes

I don’t recall the details of Tom Davidson’s model, but I’m pretty familiar with Ajeya’s bio-anchors report, and I definitely think that if you make an assumption “algorithmic breakthroughs are needed to get TAI”, then there really isn’t much left of the bio-anchors report at all. (…although there are still some interesting ideas and calculations that can be salvaged from the rubble.)

I went through how the bio-anchors report looks if you hold a strong algorithmic-breakthrough-centric perspective in my 2021 post Brain-inspired AGI and the "lifetime anchor".

See also here (search for “breakthrough”) where Ajeya is very clear in an interview that she views algorithmic breakthroughs as unnecessary for TAI, and that she deliberately did not include the possibility of algorithmic breakthroughs in her bio-anchors model (…and therefore she views the possibility of breakthroughs as a pro tanto reason to think that her report’s timelines are too long).

OK, well, I actually agree with Ajeya that algorithmic breakthroughs are not strictly required for TAI, in the narrow sense that her Evolution Anchor (i.e., recapitulating the process of animal evolution in a computer simulation) really would work given infinite compute and infinite runtime and no additional algorithmic insights. (In other words, if you do a giant outer-loop search over the space of all possible algorithms, then you’ll find TAI eventually.) But I think that’s really leaning hard on the assumption of truly astronomical quantities of compute [or equivalent via incremental improvements in algorithmic efficiency] being available in like 2100 or whatever, as nostalgebraist points out. I think that assumption is dubious, or at least it’s moot—I think we’ll get the algorithmic breakthroughs far earlier than anyone would or could do that kind of insane brute force approach.

Ryan Greenblatt

I agree that these models assume something like "large discontinuous algorithmic breakthroughs aren't needed to reach AGI".

(But incremental advances which are ultimately quite large in aggregate and which broadly follow long running trends are consistent.)

However, I interpreted "current paradigm + scale" in the original post as "the current paradigm of scaling up LLMs and semi-supervised pretraining". (E.g., not accounting for totally new RL schemes or wildly different architectures trained with different learning algorithms which I think are accounted for in this model.)

JWS 🔸

From the summary page on Open Phil:

In this framework, AGI is developed by improving and scaling up approaches within the current ML paradigm, not by discovering new algorithmic paradigms.

From this presentation about it to GovAI (from April 2023) at 05:10:

So the kinda zoomed out idea behind the Compute-centric framework is that I'm assuming something like the current paradigm is going to lead to human-level AI and further, and I'm assuming that we get there by scaling up and improving the current algorithmic approaches. So it's going to look like better versions of transformers that are more efficient and that allow for larger context windows...

Both of these seem to be pretty scaling-maximalist to me, so I don't think the quote seems wrong, at least to me? It'd be pretty hard to make a model which includes the possibility of the paradigm not getting us to AGI and then needing a period of exploration across the field to find the other breakthroughs needed.

SummaryBot

Executive summary: François Chollet and Dwarkesh Patel discuss key cruxes in the debate over whether scaling current AI approaches will lead to AGI, with Chollet arguing that more is needed beyond scaling and Patel pushing back on some of Chollet's claims.

Key points:

Chollet introduces the ARC Challenge as a test of general intelligence that current large language models (LLMs) struggle with, despite the tasks being simple for humans.
Chollet distinguishes between narrow "skill" and general "intelligence", arguing that LLMs are doing sophisticated memorization and interpolation rather than reasoning and generalization.
Patel counters that with enough scale, interpolation could lead to general intelligence, and that the missing pieces beyond scaling may be relatively easy.
Chollet thinks the hard parts of intelligence, like active inference and discrete program synthesis, are not addressed by the current scaling paradigm.
The author believes Chollet makes a compelling case, and that if he is right it should significantly update people's views on AI risk and the value of current AI safety work.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Egg Syntax

Typo watch:

Dwarkesh is annoyed because he thinks that François is conceptually defining LLM-like models as incapable of memorisation

I assume you mean 'incapable of generalization' here?

^{^}

At the end of the podcast Dwarkesh explicitly says he was playing devil's advocate, but I think he is arguing for a pro-scaling point-of-view. His post Will scaling work? provides a clearer look at his perspective, and I highly recommend reading it.

^{^}

There's a second section with Mike Knoop (now Mike) which is more focused on the ARC Prize relaunch, which I have fewer notes on but still included

^{^}

At least, come to your own conclusion on how you stand regarding them. Not saying everyone has to become an expert in mechanistic interpretability.

^{^}

If there's anyone reading who's in a position to verify this, can we? Even the fact of ~poor performance from the leading labs would support Francois' prior and not Dwarkesh's.

^{^}

The most relevant research paper I could find is this one, where children were able to outperform the average LLM on a simplified ARC test from around the age of 6 onwards. Still, these were kids visiting the 'NEMO science museum in Amsterdam' so again it's not really a sample of median humans.

^{^}

Perhaps not-coincidentally, a noted critic of AI x-risk

^{^}

He essentially means get expert performance, solve close enough to ~100% so that there is not much signal in a model's ARC score.

^{^}

Or has even... dare we say... hit a wall?

^{^}

I'm a bit confused on this point, and also about what 'natively multi-modal' means, or at least why Dwarkesh is expecting it to be such a game changer? Aren't GPT4o and Gemini already multimodal models that perform badly at ARC?

^{^}

Chollet seems to be referring to cases like Sphex wasps, though how accurate that anecdote actually is is up for debate. But to me, even simple organisms showing adaptive behaviour beyond the capacity of LLMs is even more reason to be sceptical about projections of imminent AGI.

^{^}

See section 4.2.2.1

^{^}

He is gesturing at the notion of shortcut learning

^{^}

He has literally written a textbook about it

^{^}

The implication here is that as their scale increases, they'll be able to achieve human level extrapolation via interpolation.

^{^}

In the linked paper, it's defined as a phenomenon that's observed "where models abruptly transition to a generalizing solution after a large number of training steps, despite initially overfitting".

^{^}

Even I, as an LLM sceptic, was sceptical of this claim by François, but it's actually true!

^{^}

There was an interesting exchange between François and Subbarao Kambhampati on whether this also holds for civilisation, which you can read here

^{^}

Trenton Bricken, Member of Technical Staff on the Mechanistic Interpretability team at Anthropic

^{^}

To be very fair to him, Trenton does introduce this as a 'hot take'

^{^}

In the chapter called "How Might AI Progress in the Future?", Russell says "I believe this capability is the most important step needed to reach human-level AI." though he also says this could come at any point given a breakthrough. Perhaps, but I still think getting to that breakthrough will be much more difficult than scaling transformers to ever-larger-sizes, and especially if scaling maximalism becomes ideologically dominant to the exclusion of alternative paradigms.

^{^}

Given François' sceptical position, I wouldn't put too much stock in taking his timeline adjustments too concretely.

^{^}

When the Chinese Room comes up, for instance, it's instantly dismissed with the systems reply despite Searle addressing that in this original paper.

^{^}

I actually can't recommend Xuan's work and perspective highly enough. My route to LLM scepticism really picked up momentum with this thread, I think.

^{^}

I'm calling out particular examples here because I think it's good to do so rather than to vaguepost, but please see my final bullet point in this section. I think my issue might be with OpenPhil's epistemic perspective on AI culturally, rather than any of the individuals working there.

^{^}

Bongard Problems were developed in the 1960s, and are very similar to ARC puzzles. There are a few shots indicating some kind of rule, and you'll solve the test once you can identify the rule.
