Also see paper and results compilation video!
Today, we published "Open-Ended Learning Leads to Generally Capable Agents," a preprint detailing our first steps to train an agent capable of playing many different games without needing human interaction data. ... The result is an agent with the ability to succeed at a wide spectrum of tasks — from simple object-finding problems to complex games like hide and seek and capture the flag, which were not encountered during training. We find the agent exhibits general, heuristic behaviours such as experimentation, behaviours that are widely applicable to many tasks rather than specialised to an individual task.
The neural network architecture we use provides an attention mechanism over the agent’s internal recurrent state — helping guide the agent’s attention with estimates of subgoals unique to the game the agent is playing. We’ve found this goal-attentive agent (GOAT) learns more generally capable policies.
Playing roughly 700,000 unique games in 4,000 unique worlds within XLand, each agent in the final generation experienced 200 billion training steps as a result of 3.4 million unique tasks. At this time, our agents have been able to participate in every procedurally generated evaluation task except for a handful that were impossible even for a human. And the results we’re seeing clearly exhibit general, zero-shot behaviour across the task space — with the frontier of normalised score percentiles continually improving.
Looking qualitatively at our agents, we often see general, heuristic behaviours emerge — rather than highly optimised, specific behaviours for individual tasks. Instead of agents knowing exactly the “best thing” to do in a new situation, we see evidence of agents experimenting and changing the state of the world until they’ve achieved a rewarding state. We also see agents rely on the use of other tools, including objects to occlude visibility, to create ramps, and to retrieve other objects. Because the environment is multiplayer, we can examine the progression of agent behaviours while training on held-out social dilemmas, such as in a game of “chicken”. As training progresses, our agents appear to exhibit more cooperative behaviour when playing with a copy of themselves. Given the nature of the environment, it is difficult to pinpoint intentionality — the behaviours we see often appear to be accidental, but still we see them occur consistently.
My hot take: This seems like a somewhat big deal to me. It's what I would have predicted, but that's scary, given my timelines. I haven't read the paper itself yet but I look forward to seeing more numbers and scaling trends and attempting to extrapolate... When I do I'll leave a comment with my thoughts.
EDIT: My warm take: The details in the paper back up the claims it makes in the title and abstract. This is the GPT-1 of agent/goal-directed AGI; it is the proof of concept. Two more papers down the line (and a few OOMs more compute), and we'll have the agent/goal-directed AGI equivalent of GPT-3. Scary stuff.
It seems like this could extend naturally to cooperative inverse reinforcement learning. Basically, the real world is a new game the AI has to play, and humans decide the reward subjectively (rather than with some explicit rule). The AI has developed some general competence beforehand by playing games, but it has to learn the new rules in the real world, which are not explicit.
Might be confirmation bias. But is it?
I did say it was a hot take. :D If I think of more sophisticated things to say I'll say them.
Is there already a handy way to compare computation costs that went into training? E.g. compared to GPT3, AlphaZero, etc.?
I would love to know! If anyone finds out how many PF-days or operations or whatever were used to train this stuff, I'd love to hear it. (Alternatively: how much money was spent on the compute, or the hardware.)
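For anyone who does track down a number, here is a minimal sketch of converting between petaflop/s-days and raw operation counts, so figures reported in different units can be lined up. The function names are made up for illustration. The 3,640 PF-days used in the example is the figure reported for GPT-3 175B in its paper; nothing in this thread gives a comparable figure for the XLand agents, so none is assumed.

```python
# One petaflop/s-day = 1e15 operations per second, sustained for one day.
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400 seconds
OPS_PER_SECOND_AT_ONE_PFLOPS = 1e15
OPS_PER_PF_DAY = OPS_PER_SECOND_AT_ONE_PFLOPS * SECONDS_PER_DAY  # 8.64e19


def pf_days_to_ops(pf_days: float) -> float:
    """Convert petaflop/s-days to a total operation count (hypothetical helper)."""
    return pf_days * OPS_PER_PF_DAY


def ops_to_pf_days(ops: float) -> float:
    """Convert a total operation count to petaflop/s-days (hypothetical helper)."""
    return ops / OPS_PER_PF_DAY


if __name__ == "__main__":
    # GPT-3 175B, as reported in the GPT-3 paper: 3,640 PF-days.
    gpt3_ops = pf_days_to_ops(3640)
    print(f"GPT-3 175B: {gpt3_ops:.2e} operations")  # 3.14e+23
```

The same two helpers work in reverse: if a paper reports total operations, `ops_to_pf_days` gives the PF-days figure to compare against.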
You probably want the link at the top of this post to go directly to the DeepMind paper page, instead of the LessWrong redirect URL for the link. I.e. the current link is:
When it probably should be:
Oops, sorry, thanks!
For what it's worth, I've mostly not been interested in AI safety/alignment (and am still mostly not), but this also seems like a pretty big deal to me. I haven't actually read the details, but this is basically not "narrow" AI anymore, right?
I guess the expressions "narrow" and "general" are a bit unfortunate, since I don't really want to call this either. I would want to reserve the term AGI for AI that can do at least this, but can also reason generally and abstractly, and excels at one-shot learning. (There are specific networks designed for one-shot learning, like Siamese networks. Actually, why aren't similar networks used more often, even as subnetworks?)
My take is that indeed, we now have AGI -- but it's really shitty AGI, not even close to human-level. (GPT-3 was another example of this; pretty general, but not human-level.) It seems that we now have the know-how to train a system that combines all the abilities and knowledge of GPT-3 with all the abilities and knowledge of these game-playing agents. Such a system would qualify as AGI, but not human-level AGI. The question is how long it'll take, and how much money it'll cost (to make it bigger and train it for longer), to get to human-level, or at least to something dangerously powerful.
AGI confirmed? 😬