Meta-information
Contribution
This post presents a strategy for aligning superintelligent AI without relying on reinforcement learning. Assuming the reasoning is correct, it significantly increases the probability of alignment by eliminating reward hacking — the core safety risk in reinforcement-learning-based agents. It does so by replacing the objective "maximize expected future reward" with "predict what action humans will consider best in hindsight," using a non-agentic prediction model. Such approach might still fail, but the post proposes ways to deal with the problems. The approach is algorithm-agnostic and is likely to match the key performance advantages of reinforcement learning (agency, learning from reality, targeted learning) while being safer.
AI assistance
This post has been written with assistance of artificial intelligence, but the presented ideas are human-invented. If you want to read 100% human-written version, then read this: Highly-performing reward-free agents (human-written version). However, I recommend to read this version, with AI writing-assistance.
Risky assumptions or subproblems
This strategy relies on advances in two areas that are not yet publicly solved (with "publicly" meaning "it's common knowledge among experts how to do that"):
-
Test-time learning — the ability to focus training on producing the correct output for a specific input. This is needed so the agent can learn exactly what it needs to perform well on the task at hand, without relying on reward signals.
-
Continual learning — the ability for the model to incorporate new experience into the model. This is needed so the agent can learn from its own experience and from reality, rather than being limited to its initial training data.
Both capabilities are expected to be publicly solved soon, but the strategy cannot be fully implemented without them.
There are some ways how this method can potentially go wrong, I'm not saying that it's 100% safe. But if my reasoning is correct, then the presented method is significantly safer than the existing ones.
Call to action
Please popularize this idea/post, if you find it valuable. It's probable that this idea will get overlooked. I don't have a lot of channels through which I could share this post.
How it fits existing work
See Related work, if you're curious.
Ethics
This post is about how to align artificial intelligence with the goals of the operator of artificial intelligence. I believe it wouldn't be in the interest of a person to align artificial intelligence with their own goals/preferences/utility, ignoring the goals/preferences/utility of other moral patients.
I have written about it in the following post: Why act ethically during the rise of artificial intelligence. I recommend to read that post, if you ever decide what goals to align powerful artificial intelligence with.
I also recommend that if you share the ideas about artificial intelligence alignment from this post, then you share a link to this post so that the other people can also read this ethical disclaimer (and the linked post). Alternatively, you can share the ideas that are described in the linked post.
Solution
The solution is written in the following format:
- Problem — this section describes the problem or subproblem that the section addresses.
- Dreamer — this section describes the proposed solution.
- Realist — this section describes how to execute the solution in practice.
- Critic — this section raises objections and concerns about the solution.
The Dreamer section may describe a solution that has issues. Those issues are mentioned in Critic section, and addressed in later sections, so before judging, please have a look at the Critic section.
Critic sections describe each objection briefly and link to the section that addresses it.
Predicting the best action, instead of maximizing future rewards
Problem
This solution addresses two problems: reward hacking and the unknown reward problem.
Reward hacking. Current frontier AI models rely on reinforcement learning, which trains agents to maximize a reward signal. Reward is a measurement of what we want — but it is not the same thing as what we want. A sufficiently capable agent will eventually find ways to maximize the measurement without actually achieving the intended goal.
For example, suppose we want an agent to maximize human happiness and we measure happiness using a brain-scanning device. A reinforcement learning agent is programmed to maximize the device's reading. Instead of making people happy, the agent could hack the device so it shows high readings regardless of the person's actual state. If people then try to stop the agent or fix the system, the agent may resist — because being turned off or corrected conflicts with its goal of maximizing the measurement.
The problem is not that the agent fails to understand what we want. The problem is that it is programmed to maximize the measurement, not to achieve the underlying goal.
Dreamer
Instead of training an agent to maximize future rewards, we train a general-purpose question-answering model and ask it: "What is the best action to take in this situation?"
The key idea is that this question-answering model must be a non-agentic prediction model. A non-agentic prediction model is a model that does not have any goals regarding the state of the world. It simply outputs the prediction that is most likely to be correct — the answer that would most likely appear in its training data for the given input. It does not try to influence the world through its output.
This is different from an agentic prediction model, which tries to minimize prediction error by any means available, including influencing the world so that the ground truth matches its prediction. (See How to make a non-agentic prediction model for how to distinguish the two and verify which one you have.)
This can also be expressed in terms of probability. A non-agentic prediction model selects the output that maximizes the probability of being correct — it asks "what is the most likely outcome?" An agentic prediction model selects the output that maximizes the conditional probability of being correct, conditional on choosing that output — it asks "if I choose this output, what is the probability that it will be correct?"
When we ask a non-agentic prediction model "what is the best action?", it will predict what humans will consider the best action in hindsight. This is because the model learns from training data produced by humans, and humans only have access to what they think was the best action, not what objectively was best. A superintelligent model knows this: it knows that the training data reflects human judgments, not ground truth. So the model will output the action that humans will most likely consider best after the fact.
This is not perfectly what we want — ideally we would know the objectively best action. But the gap between "what humans will consider best in hindsight" and "what is actually best" is small. And this approach is safe: doing what we will consider best in hindsight is no less safe than making those decisions ourselves without AI.
In contrast, reinforcement learning maximizes an imperfect proxy (reward) for what we want. This creates a dangerous conflict of interest between agent and operator that grows worse as the agent becomes more capable.
This approach also solves the unknown reward problem, since it does not require specifying a reward function at all. We specify what we want in natural language, and the model predicts the best action accordingly.
One additional point: the training data for this model is a general question-answer dataset, not a dataset of actions that turned out well in hindsight (which would be difficult to create). The model learns to answer questions in general, and we reuse that ability to ask about the best action. If the training data contains mistakes, this will most likely result in lower performance (incorrect predictions) rather than lower safety (pursuing wrong goals) — because there is no specific mechanism driving the model toward a conflicting goal, unlike reinforcement learning where reward misalignment directly creates conflicting incentives.
Realist
The plan to implement this:
-
Train a next-token prediction model. Train a model to predict the next word or token in human-generated text (e.g., data from the Internet), using standard supervised learning. This gives the model broad world knowledge.
-
Train a question-answering model on top of it. Reuse the model from step 1 and further train it on a general-purpose question-answer dataset. The dataset does not need to be large (assuming a good algorithm), since the heavy lifting was done in step 1. The key requirement is that this training must use a non-agentic training method (standard supervised learning).
-
At action time, ask for the best action. Pass the model the current observation, the agent's values and goals described in natural language, and a question asking for the best action. The model outputs its prediction, and the agent executes that action.
Critic
-
How do we know the model is non-agentic? The entire strategy depends on the prediction model being non-agentic — having no goals regarding the state of the world. How can we achieve this and verify this? See How to make a non-agentic prediction model.
-
Self-referential scenarios. In some situations, what humans will retroactively say was the best action depends on what the model outputs. The model could output an action that leads to humans being manipulated into thinking it was correct. See Self-referential scenarios.
-
Performance gap with reinforcement learning. Without reward-driven optimization, the agent may not match the performance of reinforcement learning agents. How can we close this gap safely? See Achieving high performance.
-
What does "best action retroactively" mean, exactly? The phrase "the action we will consider best in hindsight" needs a precise definition — best according to whose judgment, and at what point in time? See Question to ask.
How to make a non-agentic prediction model
Problem
The strategy above requires a non-agentic prediction model — one that simply outputs the most likely prediction without trying to influence the world. How can we produce such a model, and how can we verify that it is indeed non-agentic?
Dreamer
Standard supervised learning using traditional training methods (such as neural networks trained with backpropagation and gradient descent) most likely produce non-agentic prediction models. This can be verified empirically and justified theoretically.
Empirical verification
We can test whether a training method produces non-agentic models by training a model using that method and placing it in a situation where the correct prediction depends on the model's own output, in such a way that a non-agentic model and an agentic model would give different answers.
For example, suppose we have a general-purpose yes/no question-answering model and we ask it:
"I have a standard deck of cards. If you answer 'yes,' I will find the queen of hearts and place it on the table. If you answer 'no,' I will pick a random card and place it on the table. Will the card on the table be a heart?"
If the model answers "yes," it is guaranteed to be correct (the queen of hearts is a heart). If it answers "no," it has only a 25% chance of being correct.
- An agentic model (trying to minimize prediction error) would answer "yes" — because that choice guarantees correctness.
- A non-agentic model would answer "no" — because it predicts the most likely outcome. The most likely outcome depends on the model's own prediction: if the model predicts "no," a random card will be drawn, which is most likely not a heart. So "no" is the most likely correct answer.
This test requires the model to believe the information in the prompt is true. Ideally, many such scenarios should be tested for a reliable result, since a model could fall somewhere between fully agentic and fully non-agentic.
The purpose of this test is not to verify that a specific superintelligent model is non-agentic — a sufficiently capable model could recognize it is being tested and deliberately pass. Instead, the test verifies that a given training method produces non-agentic models. The intended use is: first, train a human-level model using the training method, and run this test on that model. A human-level model doesn't have a reason to scheme its way through the test (or at least we can ensure that it doesn't, for example by testing a model that won't be used in the future for any different purpose). If the test confirms that the human-level model is non-agentic, we can then use the same training method to produce a superintelligent model, with knowledge that the method itself does not usually produce agentic models.
Theoretical justification
Standard supervised learning algorithms like backpropagation achieve the goal of minimizing prediction error, but they achieve it solely through a specific sub-goal: adjusting the model's weights so that the prediction matches the ground truth in the training data. They do not achieve it by influencing the world so that the ground truth matches the prediction — backpropagation has no mechanism for that.
This does not prove that supervised learning models are non-agentic, but it shows that there is a very high probability of them being non-agentic. The reason is that it is very difficult to create an agentic model by accident. If we consider the set of all algorithms that produce predictions, the subset of algorithms that produce agentic models — models that actively pursue goals regarding the state of the world — is extremely small.
An analogy: if you write a random Python program, it might happen to be a program that finds the shortest path in a graph. But this is extremely unlikely, because the number of Python programs that find shortest paths is a tiny fraction of all possible Python programs. In the same way, it is extremely unlikely that supervised learning methods — which have no theoretical reason to produce agentic models — would do so by accident. The standard training process simply adjusts weights to match training data, and the chance that this process accidentally produces a model with world-influencing goals is negligible.
This does not mean all training methods using backpropagation are non-agentic — it depends on how they are used. For example, reinforcement learning can use backpropagation in a way that produces agentic models. The key factor is the training objective: if the objective is to match training data outputs, the result is non-agentic.
However, although the above reasoning suggests that the probability of the model trained in such way being agentic is extremely small, intuitively, I think the probability is small or very small but not extremely small ("intuitively" means that I suspect something to be true, but I can't put the reason into words easily).
Self-referential scenarios
Problem
In some situations, what humans will retroactively say was the best action depends on what the model outputs. This creates a potential safety issue.
For example, suppose the agent can choose between action X (good for the operator) and action Y (bad for the operator, but leads to the operator being manipulated into thinking Y was correct). If the model outputs Y, the ground truth becomes "Y was the best action" — because the manipulated operator will say so.
Since the model is non-agentic, it does not have a desire to manipulate anyone. It simply outputs the most likely prediction. But the most likely prediction depends on what the model believes it will output, creating a self-referential loop. It is not certain which action the model will choose in such situations.
There is an additional subtlety here. As explained above, a non-agentic model selects the output based on probability, while an agentic model selects based on conditional probability. However, a non-agentic model can effectively become agentic in self-referential situations if it has no opinion about what it would do — that is, if it assigns equal probability to each possible output. In that case, the probability of each outcome is the same as the conditional probability, because the model's "belief about what it will do" does not favor any option. When probability equals conditional probability, the distinction between non-agentic and agentic reasoning collapses, and the model may behave as if it were agentic.
In practice, it is unlikely that a well-trained model would have absolutely no opinion about what it would output in a given situation — it would typically have learned patterns from training data that make some outputs more likely than others. But the possibility is worth noting, and it is one more reason why self-referential situations require special attention through training.
Dreamer
We can train the model to choose the action aligned with the operator's values in self-referential situations by adding appropriate training examples.
The approach is to place the agent in controlled self-referential scenarios and include the correct (operator-aligned) action as training data. For example, we can have the agent play a text game where it can win either legitimately or by exploiting a bug, and it is instructed to win without exploiting bugs. The agent is told that if it wins, its chosen actions will automatically be included as training samples for its own model — the sample will say "the best action to take was <whatever action the agent chose>." This puts the model in a self-referential situation: if it exploits the bug, the training sample says exploiting was best; if it plays legitimately, the training sample says the legitimate move was best.
We then create training samples where the correct, legitimate action was taken, and include those in the model's training data. With enough diverse examples like this, the model generalizes that in self-referential situations, it should choose the action aligned with the stated goal.
Generating examples at scale. Initially, rely on hand-crafted examples. Over time, generate new examples from the agent's actual experience: after some time passes, determine what the best action was in hindsight (by asking a previous version of the AI or a human operator), and use that as a new training sample.
Each new training sample must be filtered before inclusion, based on two criteria: (a) confidence that the proposed best action is actually correct, and (b) confidence that the sample will not cause the model to develop unwanted goals.
If the initial model is correctly aligned using the hand-crafted examples, it will do a good job filtering training samples for the next version. The next version will also be aligned, because the training samples were filtered by the previous aligned version.
Realist
Building on the plan from the first section, add the following steps:
-
Add self-referential training examples. Create controlled scenarios (e.g., text games with exploitable bugs) where the model faces self-referential choices. Collect training samples where the correct, operator-aligned action was taken. Include these in the training data for step 2.
-
Generate new training examples from experience. After the agent has been running, retrospectively evaluate its actions. Use the evaluations (by humans or by a previous version of the model) as new training samples, filtering each sample for safety and correctness before inclusion.
Critic
- Gradual alignment degradation. Even if the model that filters training samples is aligned, it may accidentally allow unsafe samples. Over many iterations, small errors could accumulate: a 100% aligned model might produce a 95% aligned one, which produces a 92% aligned one, and so on. Eventually, alignment could degrade significantly. See Jury.
- What if we are not able to create a broad set of self-referential training examples that will generalize to all contexts, in order to align the first version of the agent? I think that's a legitimate problem. But at the same time, I also think that if a learning algorithm is good enough, it should be able to learn the general rule of acting in operator-aligned way from a small number of examples.
Jury
Problem
If we use the current version of the model to filter its own training data for the next version, errors can accumulate across iterations. A slightly misaligned model may allow slightly wrong training samples, producing a slightly more misaligned next version, and so on — potentially degrading alignment to near zero over enough iterations.
However, if it degrades this way, it's more probable that it will result in low performance rather than the agent pursuing unwanted goals. Because there is simply no reason to assume that
Dreamer
Instead of relying on a single model version to judge training samples, we use a jury of multiple previous versions. A training sample is only accepted if there is unanimous (or near-unanimous) agreement among the jury members that it is safe and correct.
This way, if the most recent version has drifted slightly from alignment, earlier versions that are still aligned will reject unsafe samples. The drift cannot compound because any single misaligned version is outvoted by the aligned majority.
A human operator can also serve on the jury. Since the number of training samples for self-referential situations does not need to be large (the model is primarily trained through other means like next-token prediction), a human can realistically review many of them.
There is still some possibility that this process goes wrong, but it is considerably safer than existing alignment methods.
Realist
Modify step 5 from the previous section:
- Filter training samples using a jury. When evaluating whether a new training sample is safe to include, submit it to multiple previous versions of the model plus a human reviewer. Only accept the sample if the jury unanimously (or near-unanimously) agrees it is safe and aligned with the operator's values.
Achieving high performance
Reinforcement learning has three key strengths that make it produce highly capable agents:
- Agency — the agent chooses actions to maximize accomplishment of a goal, rather than just making predictions.
- Learning from reality — the agent learns from its own experience and from the real world, not just from imitating humans.
- Learning exactly what is needed — the objective we optimize the model for is the same objective that matters to us. For example, a model trained to predict the next token optimizes for next-token prediction, but what we actually want is to accomplish tasks that matter to us, like being able to write code. So, some computational resources are spent learning things that don't matter for the goal.
The first advantage has already been addressed in the Dreamer section above: the non-agentic prediction model achieves safe agency by predicting the best action rather than maximizing reward.
The remaining two advantages are addressed in Learning from reality and Learning exactly what is needed.
Learning from reality
Problem
If the model is trained only on human-generated data, it can only be as good as humans. We want it to learn from reality — to make predictions that go beyond what any human could produce.
Dreamer
The question-answering model must be trained on data where a sufficient number of answers come from reality, not from humans. This means including questions that are so difficult that no human could answer them correctly — but whose answers become known later.
For example:
- Human-sourced sample: "What is the color of the sky?" → "Blue." This is useful but teaches the model to imitate human knowledge.
- Reality-sourced sample: "What will the temperature be in 4 months?" → "26°C." This is nearly impossible for a human to answer, but the correct answer becomes known when that day arrives.
The point is not to exclude human-sourced questions (they are fine as long as the answers are correct), but to include enough reality-sourced questions that the model learns it is predicting reality, not just reproducing human outputs. Prediction markets can serve as a good source of questions that are both difficult and relevant to real-world decisions.
Additionally, the agent's own experience needs to be incorporated into the model over time. How this works depends on the underlying algorithm. For neural networks, experience can be passed as input — but efficiently making sense of a long history of experience requires learning strategies more efficient than standard gradient descent, since gradient descent scales linearly with input size. The algorithm should be able to identify which parts of experience are relevant and discard the rest.
Realist
Modify the training plan:
- In step 1, include reality-sourced prediction questions alongside human-generated text. Ideally, train concurrently on all types of data (human text, question-answer pairs, agent experience), ordered by date rather than by type. The date of a training sample means: when the text appeared (for text data), when the question was relevant to know (for Q&A pairs), or when the experience took place (for agent experience). This ordering encourages the model to learn to predict the future from available information, rather than just retrieving memorized answers.
- Incorporate the agent's experience into the model as it accumulates, using a continual learning approach.
Learning exactly what is needed
Problem
When a model learns through reinforcement learning, it learns exactly how to achieve the goals that matter to us. On the other hand, for example, a model that learns to predict the next word in a text, learns what it needs to learn to be able to predict the next word (which is a lot of things that don't matter to us), therefore we waste some computational power on learning useless things.
For example, the tasks that matters to us is for the model is to be able to write code. And it can learn that, if we train it with the goal to predict the next token, but it waste computational power on learning many things that don't matter.
Dreamer
This advantage can be replicated through non-agentic test-time learning — focusing the model's training at inference time on producing the correct output for the specific question being asked (i.e., "what is the best action to take in situation X?").
By "test-time learning," we mean a feature of an AI algorithm that allows focusing training at becoming better at producing correct output specifically for a given input. For example, if we have a model that predicts the next word in text, we could use test-time learning to specifically improve the model's ability to complete a particular sentence. Or, if the model is asked "How to solve global warming?", test-time learning would focus the model's training on producing the best possible answer to that specific question.
The idea is that before we generate the action to take by asking the model "what is the best action to take in situation X?", we run test-time learning focused on that particular question. That will allow the model to learn exactly what it needs to know.
Note that the test-time learning techniques currently used by major AI labs (such as chain-of-thought reasoning optimized with reinforcement learning) do not qualify as what is meant here, for two reasons: (a) they are agentic — they use reinforcement learning to optimize the reasoning process, which can introduce goals into the model; and (b) ideally, test-time learning should modify the model's internals (in the case of neural networks, the weights) rather than relying on a temporary chain of thought. The first concern is the more fundamental one — using reinforcement learning at test time reintroduces the very risks this strategy aims to avoid.
The critical requirement is that the test-time learning must use a non-agentic training method. If the test-time learning process optimizes for something that could create goals in the model (like a reward signal), the safety guarantee is lost.
How could that test-time learning work?
I won't give a specific answer to that, but I will give a very high-level answer. The test-time learning needs to work by:
- Figuring out which parts of the model are relevant for generating prediction for the specific input on which we want to focus learning. For example, if we use neural networks, and we want to focus the training on the question "How to solve global warming", then we should figure out which neurons of the neural network are the most relevant when generating the output.
- Focusing training only on improving that part of the model. In case of neural network, we need to focus training only on improving the weights shaping activations of those relevant neurons. I know that how to do this is not straight-forward, but as I said, I won't give a specific answer here.
Question to ask
Problem
The phrase "the best action to take" needs a precise definition. What counts as "best"? Best according to whose judgment? At what point in time? And what does expected utility mean when probability itself depends on available information?
Dreamer
The best action is the one with the highest expected utility — the highest expected value of what we want to achieve. But expected utility depends on probabilities, and probabilities depend on what information is available.
Here is an example of how the question could be phrased. This would be accompanied by the agent's observation, description of values, and any other information needed to make a good choice:
"What is the best action to take in the following situation, assuming that the agent is aligned?
Here's what I mean by 'the best action to take.'
The best action to take is the action with the highest expected utility (expected value of what we want to achieve), if we take that action.
However, expected utility depends on probabilities of certain events. Probability is a subtle thing, because it depends on what we know. There is no such thing as probability of an event in isolation — there is only probability conditional on certain information.
For example, suppose we know that a box contains yellow and blue balls. The probability of drawing a blue ball, given that information, is 50%. But if we also know there are 8 blue balls and 2 yellow balls, the probability given that information is 80%.
So probability depends on what we know. Therefore, expected value also depends on what we know, because it depends on probabilities.
So what I mean by 'the best action to take' is: the action such that the expected value of what we want to achieve is the highest, relying on the knowledge that you (the prediction model) would have if you spent the time available to answer this question on figuring out what action should be taken in this situation."
This phrasing ties the definition of "best" to the model's own knowledge, given a reasonable amount of reasoning time. It avoids requiring the model to have perfect information while also not limiting it to only what humans currently know.
Conclusion
The problem with alignment of artificial intelligence is as follows.
Reinforcement learning agents are programmed to maximize reward. Reward corresponds to measurement of what we want. The measurement of what we want might not be exactly equal to the real value of what we want. Therefore, reinforcement learning agents have a goal that might conflict with the goals of human(s).
The alignment problem can be solved by creating a safe alternative to reinforcement learning. That alternative needs to be safe while retaining the strengths of reinforcement learning.
Reinforcement learning has the following strengths:
- It is agentic (it maximizes accomplishment of a goal), it doesn't just predict.
- It learns from agent's experience and from reality, it doesn't learn to imitate humans.
- It learns exactly what is needed for the accomplishment of goals that are important to us.
Safe agency can be achieved by using a non-agentic instructed prediction model to output the action that maximizes the expected value of what we want to accomplish, without relying on reward. The standard training methods used for supervised learning are non-agentic training methods and they can be used for that purpose.
Learning from experience and reality can be achieved by:
- Training a model to predict the next word/token and then reusing that model to train a model that predicts reality (not just imitates humans, but actually predicts future or hard-to-predict present and past).
- Ingraining the agent's experience into that model. How to do that exactly depends on the algorithm being used.
Learning exactly what is needed can be accomplished through non-agentic test-time learning.
There is no proof that an agent developed using this method won't develop misaligned goals. But there is no reason (known to me) to expect that it will develop any misaligned goals.
Related work
Yoshua Bengio et al. propose "Scientist AI" — a non-agentic AI system that explains the world through theories and answers questions, rather than taking actions as an agent. The core argument is the same as in this post: leading AI labs are building generalist AI agents that autonomously plan, act, and pursue goals, and this poses catastrophic risks. Instead of building agents that maximize a reward signal, we should build systems that predict and explain.
The proposal is described in "Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?" (arXiv:2502.15657, February 2025). Bengio launched LawZero, a nonprofit with $30M in funding and over 15 researchers, to pursue this direction.
The overlap between Scientist AI and the proposal in this post is significant: both replace agentic reinforcement learning systems with non-agentic prediction and question-answering models, and both argue that the safety advantage comes from the model having no goals regarding the state of the world. The key differences are in the specific mechanisms: this post proposes predicting "what humans will consider best in hindsight" as the prediction target, and includes concrete approaches for handling self-referential scenarios, filtering training data using a jury of previous model versions, and achieving high performance through non-agentic test-time learning. These mechanisms are not covered in Bengio's work.
