Note: This post was crossposted from Planned Obsolescence by the Forum team, with the author's permission. The author may not see or respond to comments on this post.

This startlingly fast progress in LLMs was driven both by scaling up the models and by doing schlep to make usable systems out of them. We think scale and schlep will both improve rapidly.

Kelsey Piper co-drafted this post. Thanks also to Isabel Juniewicz for research help.

In January 2022, language models were still a pretty niche scientific interest. Once ChatGPT was released in November 2022, it accumulated a record-breaking 100 million users by February 2023. Many of those users were utterly flabbergasted by how far AI had come, and how fast. And every way we slice it, most experts were very surprised as well.

This startlingly fast progress was largely driven by scale and partly driven by schlep.

Scale involves training larger language models on larger datasets using more computation, and doing all of this more efficiently[1] over time. “Training compute,” measured in “floating point operations” or FLOP, is the most important unit of scale. We can increase training compute by simply spending more money to buy more chips, or by making the chips more efficient (packing in more FLOP per dollar). Over time, researchers also invent tweaks to model architectures and optimization algorithms and training processes to make training more compute-efficient — so each FLOP spent on training goes further in 2023 compared to 2020.
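As a rough illustration of the arithmetic (not a claim about any particular model): a common rule of thumb from the scaling-law literature estimates training compute as roughly 6 FLOP per parameter per training token. A minimal sketch, with made-up example numbers:

```python
# Rough rule-of-thumb estimate of training compute: ~6 FLOP per parameter per
# training token (a common approximation from the scaling-law literature).
# The example numbers below are illustrative, not figures for any real model.

def training_flop(parameters: float, tokens: float) -> float:
    """Approximate total training compute in floating point operations."""
    return 6 * parameters * tokens

# A hypothetical 175-billion-parameter model trained on 300 billion tokens:
print(f"{training_flop(175e9, 300e9):.1e} FLOP")  # roughly 3e23 FLOP
```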

Scale has been by far the biggest factor in the improvements in language models to date. GPT-4 is bigger than GPT-3.5 which is bigger than GPT-3 which is bigger than GPT-2 which is bigger than GPT.[2]

[Figure: training compute used to train GPT, GPT-2, GPT-3, and GPT-4, by Our World in Data with data from Epoch]

Increasing scale automatically improves performance on pretty much every test of skill or practically useful task. GPT-2 got an F- on college-level multiple choice tests ranging from abstract algebra to business ethics; GPT-4 got a B+. GPT-2 was just starting to string together plausible-sounding paragraphs; GPT-4 can write essays that net a B+ at Harvard — and hundreds of lines of functioning code that can take human programmers hours to reproduce.

If you add more data, more parameters, and more compute, you’ll probably get something that is a lot better yet. GPT-4.5 will perform much better than GPT-4 on most tests designed to measure understanding of the world, practical reasoning in messy situations, and mathematical and scientific problem-solving. A whole lot of things GPT-4 struggles with will probably come easily to GPT-4.5. It will probably generate a whole lot more economic value, and present much bigger societal risks. And then the same thing will happen all over again with GPT-5. We think the dramatic performance improvements from scale will continue for at least another couple of orders of magnitude — as Geoffrey Hinton joked in 2020, “Extrapolating the spectacular performance of GPT-3 into the future suggests that the answer to life, the universe and everything is just 4.398 trillion parameters.”

But even if no one trained a larger-scale model than GPT-4, and its basic architecture and training process never got any more efficient, there would still probably be major economic change from language models over the next decade. This is because we can do a lot of schlep to better leverage the language models we already have and integrate them into our workflows.

By schlep, we mean things like prompting language models to give higher-quality answers or answers more appropriate to a certain use case, addressing annoying foibles like hallucination with in-built fact-checking and verification steps, collecting specialized datasets tailored to specific tasks and fine-tuning language models on these datasets, providing language models with tools and plug-ins such as web search and code interpreters, and doing a lot of good old fashioned software engineering to package all this into sleek usable products like ChatGPT.

For one example of what schlep can do to improve language models, take chain of thought prompting. Chain of thought prompting is dead simple — it’s basically the same thing as your teacher reminding you to show your work — and it substantially improves the performance of language models on all kinds of problems, from mathematical problem-solving to ‘common sense’ reasoning.

[Figure: example of standard prompting vs. chain-of-thought prompting, from Wei et al.]
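To make that concrete, here's a minimal sketch of what chain-of-thought prompting looks like in code. The `generate` helper is a placeholder for whatever model API you're using, and the exact prompt wording is just one common pattern, not a canonical recipe:

```python
# Minimal sketch of chain-of-thought prompting. `generate` is a stand-in for
# whatever completion API or local model you have access to.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your model / API call here")

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

# Standard prompt: the model has to commit to an answer immediately.
direct_answer = generate(f"Q: {question}\nA: The answer is")

# Chain-of-thought prompt: ask the model to show its work before answering.
cot_answer = generate(
    f"Q: {question}\nA: Let's think step by step, "
    "then give the final answer on its own line."
)
```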

By default, language models can only answer questions by immediately spitting out the first word of their answer, then rolling with whatever they said and immediately spitting out the next word, and so on word-by-word until eventually they’ve completed their thought. They are unable to backtrack or get any thinking done outside of the next-word rhythm.
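To make the "one word at a time" picture concrete, here's a minimal sketch of greedy, token-by-token generation. `next_token_distribution` is a placeholder for the model's actual forward pass, and tokens stand in for words:

```python
# Minimal sketch of autoregressive generation: the model only ever chooses the
# next token given everything written so far, with no mechanism for going back
# and revising earlier tokens. `next_token_distribution` is a placeholder.

def next_token_distribution(tokens: list[str]) -> dict[str, float]:
    """Placeholder for the model's forward pass: candidate token -> probability."""
    raise NotImplementedError

def generate_greedy(prompt_tokens: list[str], max_new_tokens: int = 50) -> list[str]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probabilities = next_token_distribution(tokens)
        next_token = max(probabilities, key=probabilities.get)  # commit immediately
        if next_token == "<end>":
            break
        tokens.append(next_token)  # once appended, it's never revised
    return tokens
```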

Imagine if you had to take a standardized test like that: by answering every question on the spot without any backtracking, as if you were being interviewed on live TV. For one thing, it would be very hard! For another, you’d probably do a lot better if you verbalized your reasoning step by step than if you just tried to blurt out the final answer. This is exactly what we see in chain-of-thought.

This suggests chain-of-thought is probably not just a one-off trick, but an instance of a more general pattern: language models will probably perform better if they can spend more effort iterating on and refining their answers to difficult questions, just like humans get to do. In fact, simply allowing language models to spit out a “pause and think longer” symbol rather than having to commit to the next word immediately also seems to improve performance.

Scale makes language models better. Techniques like chain-of-thought improve models at any given scale. But it’s more than that: chain-of-thought prompting only works at all on sufficiently large language models, and the returns are greater when it’s used on bigger models that were more powerful to begin with.[3]

You could imagine a variety of more elaborate techniques that follow the same principle. For example, you could imagine equipping an LLM with a full-featured text editor, allowing it to backtrack and revise its work in the way humans do.[4] Or you could imagine giving a language model two output streams: one for the answer, and one for its stream-of-consciousness thoughts about the answer it’s writing. Imagine a language model that had the option to iterate for days on one question, writing notes to itself as it mulls things over. The bigger and more capable it is, the more use it could get out of this affordance. Giving a genius mathematician a scratchpad will make a bigger difference than giving one to a six-year-old.
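As a toy sketch of the scratchpad idea: the loop below lets a model alternate between drafting an answer, critiquing its own draft, and revising, for a fixed number of rounds. The `generate` helper and the prompts are illustrative assumptions, not a description of any deployed system:

```python
# Toy sketch of giving a model a scratchpad: draft, critique the draft, revise,
# repeat. The prompts and the `generate` helper are illustrative placeholders.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your model / API call here")

def answer_with_scratchpad(question: str, rounds: int = 3) -> str:
    draft = generate(f"Question: {question}\nWrite a draft answer:")
    for _ in range(rounds):
        notes = generate(
            f"Question: {question}\nCurrent draft: {draft}\n"
            "Scratchpad: note any mistakes or gaps in the draft."
        )
        draft = generate(
            f"Question: {question}\nPrevious draft: {draft}\n"
            f"Notes on problems: {notes}\nWrite an improved answer:"
        )
    return draft
```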

Another technique that dramatically improves performance of models at a given size is fine-tuning: retraining a large model on a small amount of well-chosen data in order to get much better-targeted responses for a given use case. For example, reinforcement learning from human feedback (RLHF) involves fine-tuning models on whether human raters liked their answers. Without RLHF, a language model might riff off a question or generate the kind of text that would surround that question on a website, instead of answering it. In effect, RLHF gets the model to ‘realize’ that it should be aiming to use its capabilities and knowledge to answer questions well rather than predict text on the internet: two different tasks, with a lot of overlapping skills.
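To gesture at the mechanics: one standard ingredient in RLHF-style pipelines is a reward model trained on pairwise comparisons, where the training loss pushes the score of the answer raters preferred above the score of the answer they rejected. Here's that pairwise loss as a plain-Python sketch; it's one simplified piece of the approach, not the full method:

```python
import math

# Simplified sketch of the pairwise comparison loss used to train a reward
# model from human preferences. In practice the scores come from a neural
# network run on (prompt, answer) pairs; here they're just numbers.

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    # -log(sigmoid(preferred - rejected)): small when the preferred answer is
    # scored clearly higher, large when the ordering is wrong.
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(pairwise_preference_loss(2.0, -1.0), 3))  # ~0.049: ordering correct
print(round(pairwise_preference_loss(-1.0, 2.0), 3))  # ~3.049: ordering wrong
```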

With prompting and fine-tuning and a lot of other schlep, we can build systems out of language models. ChatGPT is a very simple and very familiar example of a language model system. To understand all of the ways that ChatGPT is a system, not just a language model, it’s useful to compare it to InstructGPT, which is the same basic underlying technology. Before ChatGPT was released, InstructGPT was available to people who made an account to test it out in OpenAI’s playground.

Here’s the UI for OpenAI’s playground today (it was worse a year ago but unfortunately we don’t have a screenshot of it from then):

[Screenshot: OpenAI playground UI, with settings for temperature, penalties, stop sequences, and other parameters]

If you’re an experienced LLM user, the option to customize the temperature, adjust the frequency and presence penalties, add stop sequences, and so on is really useful. If you’re a random consumer, all of that is intimidating. ChatGPT’s UI abstracts it away:

[Screenshot: ChatGPT’s chat UI]
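For a sense of what the playground exposes under the hood: those knobs map onto parameters of the underlying API. Here's a rough sketch of a raw completion request with them set explicitly (parameter names follow OpenAI's Python SDK; the model name and values are just examples):

```python
# Rough sketch of a raw API request with the knobs the playground exposes.
# Parameter names follow OpenAI's Python SDK; the model name and values here
# are only examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # example completions-style model
    prompt="Explain what a frequency penalty does, in one sentence.",
    temperature=0.7,        # higher values sample more randomly
    frequency_penalty=0.5,  # penalize tokens in proportion to how often they've appeared
    presence_penalty=0.0,   # penalize tokens that have appeared at all
    stop=["\n\n"],          # stop generating at a blank line
    max_tokens=100,
)
print(response.choices[0].text)
```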

That UI difference might seem small. But InstructGPT was used mostly by a small community of researchers, and ChatGPT reached 100 million users within two months of launch. The difference between the two products was essentially presentation, user experience, and marketing. That kind of thing can result in massive differences in actual user behavior. Some of the work that goes into making language model systems, then, is about figuring out how to make the model usable.

If language models are like engines, then language model systems would be like cars and motorcycles and jet planes. Systems like Khan Academy’s one-on-one math tutor or Stripe’s interactive developer docs would not be possible to build without good language models, just as cars wouldn’t be possible without engines. But making these products a reality also involves doing a lot of schlep to pull together the “raw” language model with other key ingredients, getting them all to work well together, and putting them in a usable package. Similarly, self-driving cars would not be possible without really good vision models, but a self-driving car is more than just a big vision neural network sitting in a server somewhere.

One kind of language model system that has attracted a lot of attention and discussion is a language model agent.

An agent is a system which independently makes decisions and acts in the world. A language model is not an agent, but language models can be the key component powering a system which is agentic and takes actions in the world. The most famous early implementation of this is Auto-GPT, a very straightforward and naive approach: you can tell it a goal, and it will self-prompt repeatedly to take actions towards this goal. People have already employed it towards a wide range of goals, including building ChaosGPT, which has the goal of destroying humanity.
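To illustrate what "self-prompt repeatedly to take actions" can look like, here's a heavily simplified agent loop in that spirit. The `generate` helper, the tool execution, and the prompts are all hypothetical stand-ins, not Auto-GPT's actual code:

```python
# Heavily simplified sketch of an Auto-GPT-style loop: the model is repeatedly
# prompted to pick its own next action until it declares the goal finished.
# `generate` and `run_tool` are hypothetical placeholders.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your model / API call here")

def run_tool(action: str) -> str:
    """Placeholder for actually executing an action (web search, run code, ...)."""
    return f"(result of: {action})"

def run_agent(goal: str, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        action = generate(
            f"Goal: {goal}\nSteps taken so far: {history}\n"
            "What single action should be taken next? Reply DONE if the goal is complete."
        )
        if action.strip().upper() == "DONE":
            break
        history.append(f"{action} -> {run_tool(action)}")
    return history
```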

Auto-GPT is not very good. Users have complained that it constantly comes up with reasons to do more research and is reluctant to decide that it’s done enough research and can actually do the task now. It’s also just not very reliable. But there are many people building agentic language model systems for commercial uses and working to solve all of these shortcomings, including well-funded and significantly sized companies like Imbue and Adept. Adding chain-of-thought prompting, fine-tuning the underlying language models, and many similar measures will likely make agents a lot better – and, of course, increasing scale will make them better too.

We’re really at the very beginning of this work. It wouldn’t be surprising to see major advances in the practical usefulness of LLMs achieved through schlep alone, such that agents and other systems built out of GPT-4 tier models are much more useful in five years than they are today. And of course, we are continuing to scale up models at the same time. That creates the conditions for rapid improvements along many dimensions at once — improvements which could reinforce each other. Many people will be trying hard to make this a reality. Even if specific approaches meet dead-ends, the field as a whole doesn’t seem likely to.

  1. ^

    Here, we’re wrapping together algorithmic progress with scale for simplicity; this could have been broken out into its own type of progress. 

  2. ^

    The graph pictured in the text is by Our World in Data, pulling numbers from our grantee organization Epoch. The original GPT was trained on 1.8e19 FLOP; GPT-2 on 1.5e21 FLOP; the largest version of GPT-3 (175 billion parameters) on 3.1e23 FLOP; and GPT-4 on ~2.1e25 FLOP. There isn’t clear documentation of GPT-3.5, the model in between GPT-3 and GPT-4, but we suspect that it was retrained from scratch and its effective training compute was on the order of 10^24 FLOP.

  3. ^

    See e.g. Wei et al. 2023, Figures 4, 7, and 8.

  4. ^

    Perhaps this would be based on terminal-based text editors that programmers use, which do everything via keyboard commands.

Comments (3)



Just out of curiosity: Where does the word "schlep" originate from in the context of AI? I don't think I ever came across it before reading this post.

I assume (and ChatGPT agrees) that it means the tedious, unglamorous, and labor-intensive work. It probably comes from the German "schleppen", meaning to haul or drag.

Executive summary: This post argues that recent progress in language models like ChatGPT has been driven by scaling them up as well as integrating them into usable systems, and both trends will continue.

Key points:

  1. Increasing the scale of language models by training them on more data and compute has been the primary driver of improvements so far.
  2. But integrating language models into well-designed systems through techniques like prompting and fine-tuning also unlocks major gains.
  3. Scale and system-building are complementary - gains in one enhance the other.
  4. There is still a lot of low-hanging fruit in system design to better leverage existing large language models.
  5. Commercial efforts are focused on building agentic systems that can take actions, which will benefit from scale and system improvements.
  6. We're still early in developing language model capabilities, with potential for rapid progress on both scale and systems fronts.

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
