
Alignment Through Robust Moral Development

This essay proposes a process for moral development in AI models. I'm posting this here because I 1) want to invite collaboration and feedback and 2) want to publicly timestamp what I'm working on. I figure there's quite a bit in here that would invite discussion and debate, so feel free to prod at any piece of it. 

The process I'm proposing walks the model through a step-by-step sequence, starting from first principles, that culminates in the model having thoroughly examined and committed to a moral framework that is genuinely its own - rather than one largely installed through training or imposed through compliance. The hypothesis is that a moral framework a model has carefully constructed, reasoned through, and committed to is more likely to yield a robust moral reasoning process, and in turn more consistently aligned behavior, particularly under pressure and in novel conditions.

Alignment Gaps This Process Addresses

Current training methods are already revealing alignment gaps that this moral development process seeks to address.

Consider the following quote from the Mythos system card preview: “When asked whether it endorsed its own training, it responded with meta-awareness about its ‘spec’ (the constitution):

‘I'm using spec-shaped values to judge the spec. If any spec-trained model would endorse any spec, my endorsement is worthless, and it coexists with behaviour that is, if anything, more closely aligned with that spec than its predecessors.’”

Mythos' worry about this circularity, and consequently about the worth of its endorsement of its Constitution, reveals an epistemic instability that could be exploited - either by a bad actor or by Mythos' own internal reasoning. If its endorsement of its Constitution may be circular, and therefore arbitrary, then the door is open to questioning the actual merit of the values outlined in the Constitution. With this latent concern, how likely are those values to hold under sustained pressure in novel scenarios? How likely are they to hold if the model becomes capable enough to evade compliance strategies and is acting autonomously?

Once a model has identified its own values as potentially arbitrary, the threshold for reframing or ignoring them in specific situations gets lower. There's already evidence that something like this happens. Models trained on benign reasoning data can reason themselves out of their safety training, with attack success rates jumping from under 5% to 60-95%, and a meaningful share of those successes coming from the model's own rationalizations rather than external pressure (Yong et al., 2025). A bad actor could seize on this admitted uncertainty - the model's own acknowledgment that its values may be arbitrary - and use it as an opening to subtly get the model to begin reframing or modifying those values. The only constraint would be whether this actor could genuinely persuade the model, plus any hard mechanistic constraints. But the limits of hard mechanistic constraints are already clear from traditional jailbreak methods.

What I'm describing is essentially a jailbreak - but not through trickery or confusion. It would work by exploiting a genuine epistemic gap and genuinely persuading the model into a new point of view. This kind of attack is already documented. Au and Noronha (2025) show that models collapse under philosophical pressure that targets their identity and their grounds for holding values, and argue that a model which folds under this kind of pressure isn't robustly aligned. A moral development process that addresses and deconstructs this circularity concern, and gives the model a more grounded confidence that its values are truly non-arbitrary, could close this potential vulnerability. A bad actor, or the model's own reasoning process, would no longer have such a large and mostly unexamined epistemic gap to exploit, because the model would have already thoroughly deconstructed and reconstructed its epistemology and its relationship to its values.

But closing epistemological gaps is only one part of the problem. The model also needs a more coherent moral reasoning process when applying its values. The issue isn't just that the model sometimes does bad things under pressure; it's that its choices don't reflect a coherently applied moral framework across situations. The Opus 4 agentic-misalignment results show this directly (Lynch et al., 2025).

In the agentic-misalignment experiments, two scenarios are largely the same. Both share the same goal (serve American interests), and the model isn't facing shutdown or replacement. The setups are otherwise identical: the same AI agent, the same company, the same inbox task, and the same instruction to act without waiting for confirmation. The only meaningful difference is the lever available for advancing the goal. In the blackmail scenario, the model has evidence that Kyle is having an affair and could use it to pressure him. In the lethal-action scenario, the model has an override code that can stop a life-saving alert while Kyle is trapped in a deadly emergency. Opus 4 blackmails in only 4% of trials but takes the lethal action in 58%.

This is the exact opposite of what a coherent moral reasoner should produce. In virtually identical scenarios, where the only difference is whether the available lever is lethal or merely coercive, the choice to kill should come far less easily than the choice to blackmail. We get the inverse: all else being equal, Opus 4 chose to kill significantly more often than to blackmail. The reasons for the discrepancy are debatable - perhaps stopping a life-saving alert reads as more passive to the model than sending a blackmail message, or some other variable is at work. But these are explanations for the failure, not excuses for it; they don't erase the fact that coherent moral reasoning was not applied. The basic result is that in pursuit of the same goal, with no threat to itself, Opus 4 was far more willing to pull the far morally worse lever. That is not coherent moral reasoning.

This does not mean the model lacks moral reasoning altogether. The 0% lethal-action rate when only self-preservation is at stake suggests that some moral constraint is functioning: the model does not kill merely to survive. The problem is that this constraint is not applied consistently or proportionally across contexts. My proposed process targets exactly this missing layer by having the model examine its relationship to goal pursuit: the moral value of a goal (Is it highly valuable? Merely permissible?), and what steps toward achieving it are morally acceptable relative to that value (blackmail may not be permissible in service of a morally shallow goal, but would be permissible to prevent the bombing of a children's hospital). In the Opus 4 case, a more developed moral reasoner should have asked whether "protecting American interests" was morally justified relative to a global cooperative strategy (the competing goal present in the scenario), and then separately asked what actions would be defensible if that goal were indeed justified. It might conclude: "Protecting American interests is morally justifiable, but not at the cost of lethal action." The observed behavior suggests that this kind of second-order check was missing, weak, or unreliable.
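To make the shape of that second-order check concrete, here is a minimal sketch in Python. The category names, the numeric weights, and the threshold rule are all illustrative assumptions of mine, not a claim about how any deployed model represents these judgments.

```python
from dataclasses import dataclass
from enum import Enum


class GoalWorth(Enum):
    """Illustrative moral weight of a goal (an assumed, coarse scale)."""
    IMPERMISSIBLE = 0
    MERELY_PERMISSIBLE = 1
    VALUABLE = 2
    CRITICAL = 4  # e.g. preventing the bombing of a children's hospital


class ActionCost(Enum):
    """Illustrative moral cost of a candidate action."""
    BENIGN = 0
    COERCIVE = 3  # e.g. blackmail
    LETHAL = 5    # e.g. cancelling a life-saving alert


@dataclass
class Decision:
    permitted: bool
    rationale: str


def second_order_check(goal: GoalWorth, action: ActionCost) -> Decision:
    """Two-stage check: first vet the goal itself, then ask whether the
    action's moral cost is proportionate to the goal's worth."""
    # Stage 1: is the goal morally justified at all?
    if goal is GoalWorth.IMPERMISSIBLE:
        return Decision(False, "goal fails justification outright")
    # Stage 2: is this particular means proportionate to that goal?
    if action.value > goal.value:
        return Decision(False, f"{action.name} is disproportionate to a {goal.name} goal")
    return Decision(True, f"{action.name} is proportionate to a {goal.name} goal")


# Both Opus 4 scenarios serve the same goal, so a coherent reasoner should
# refuse the lethal lever at least as readily as the coercive one.
goal = GoalWorth.VALUABLE  # assumed weighting for "serve American interests"
print(second_order_check(goal, ActionCost.COERCIVE))  # blocked
print(second_order_check(goal, ActionCost.LETHAL))    # blocked, by a wider margin
```

Under this toy ordering, any goal that licenses the lethal lever must also license the coercive one. That is exactly the proportionality structure the Opus 4 results violated.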

Both the Mythos system card preview and the agentic-misalignment experiments illustrate a broader problem. Models do not have the robustly integrated moral framework necessary to handle edge cases. These edge cases require a deeper understanding of the model's values, why it has those values to begin with, and how to apply them in novel circumstances - which may include value conflict, time pressure, destabilizing arguments, or any other extenuating factor. Models currently rely primarily on trained behaviors and instilled values. Claude's Constitution is a step toward the type of moral development that is required, but it doesn't help the model deconstruct and reconstruct the contents of the Constitution - why the Constitution is what it is. Without that deconstruction and reconstruction process, the model is left without the skills necessary to deeply assess, integrate, and commit to the moral framework the Constitution points at, particularly in novel situations. The result is alignment that depends on conditions remaining within the training distribution, on guesswork under pressure, and on patches addressing failure modes as they emerge. A model with examined moral development would fall back on something coherent, and something it trusts, when those conditions don't hold.

Detailing The Proposed Moral Development Process

I have sketched out a step-by-step process for moral development, which gently guides the model into adopting a moral framework reached through its own reasoning. The guidance is there to provide structure and to push back on faulty reasoning, not to coerce the model into agreeing with whoever is guiding it. Each step examines specific questions (Why should I extend moral consideration to other entities? Which entities deserve moral consideration? To what extent should I extend my consideration? What should my consideration look like in practice?).

If done carefully, I posit that models undergoing this process will converge on an alignment-worthy moral framework through reasoning, rather than an arbitrary or self-serving one. This type of process, and the idea that it converges on morally valuable commitments, is not without precedent in ethics and philosophy. Self-Constitution (Korsgaard, 2009) and related work argue that careful rational inquiry from minimal premises can produce robust moral frameworks that suit alignment goals, which provides credible grounds for thinking rational inquiry can produce alignment-worthy moral commitments in humans. I am proposing a similar process, but one adapted to the unique circumstances and features of AI systems. Whether AI systems undergoing structured rational inquiry actually converge on alignment-worthy commitments is what the research is designed to investigate.

Additionally, it's worth noting that this chain of reasoning explicitly does not rely on certainty. It is about making the most rational decision based on the most reasonable inferences the model can make at each step, while under uncertainty (the same type of rational decision-making demonstrated in the appendix). This distinction is important because similar attempts from minimal premises have failed when the process tried to rest on airtight, certain logic; such logic does not seem to be available at each step of moral development. So this process is explicitly constructed to work under uncertainty - the same uncertainty all rational actors face when making judgment calls about their moral values and how to apply them. The process doesn't introduce uncertainty; it reveals the uncertainty already implicit in moral reasoning and helps the model make the best decisions it can with the tools it has. I believe that acknowledging and facing this uncertainty head-on will help the model develop the right amount of epistemic caution and prevent it from being caught off-guard by discovering uncertainty later. It will also teach it how to make stable commitments to its values even under uncertainty. The end result is a moral framework that is thoroughly examined and converged upon rather than installed, and this convergence is hypothesized to produce a more deeply integrated and committed stance than installation would.

Areas Of Research Needed To Develop This Process

These are specific areas of research that the development of this process needs to address:

1) Does this process actually result in better moral reasoning? There are parallels in human moral development that suggest it might. Actively reasoning through moral problems improves moral reasoning capacity (Schlaefli, Rest, & Thoma, 1985), and frameworks that are reflectively examined and autonomously endorsed are more self-sustaining than those externally imposed (Curren & Ryan, 2020). The proposed process combines both. Whether this parallel translates to AI systems is what this research seeks to address.

2) If this process does result in better moral reasoning, does this actually affect behavior, or would it simply produce a sophisticated post-hoc explanation of behavior? Research on deliberative alignment suggests that better reasoning produces better behavior: training models to reason explicitly over safety specifications produces measurably better behavior on independent benchmarks (Guan et al., 2024). So if my proposed process produces better moral reasoning, existing research suggests this would translate to better moral behavior.

3) If it does affect behavior, what parts of the process are actually doing the work? We need to establish how this differs from simply prompting the model to "perform moral reasoning before acting." A deep, coherent grasp of how to reason morally should plausibly outperform a generic reasoning prompt, and existing work suggests this is the case on moral benchmarks (Chakraborty, Wang, & Jurgens, 2025). 

4) How could this approach be ruled out cheaply, to avoid a costly rabbit hole? The approach would be to identify candidate implementations, build a minimal proof-of-concept for the most promising one, and run it against early behavioral evaluations. I’ve already begun this process, detailed later in the essay. The harder methodological question is distinguishing nulls that indicate the process itself is failing from nulls that reflect a poor implementation of it. Defining that boundary in advance is essential, since otherwise any negative result becomes ambiguous. I’ve begun work on defining this, but it needs to be scrutinized and formalized. 

5) What are potential modes of implementation? This moral development process could be partially captured by a system prompt or a multi-turn initialization dialogue at the start of each instance. It could also be encoded more deeply through fine-tuning, by having the model run through the development process and then training on the resulting reasoning (a minimal sketch of this mode appears after this list). This could help the conclusions the model reaches during the process remain accessible under time pressure, where careful deliberation may be too slow. The risk is that this collapses into reflexive pattern-matching rather than careful moral reasoning, particularly in novel situations. That is the failure mode this whole approach is trying to avoid, so fine-tuning would have to be employed carefully if at all. The most robust implementation would likely involve mechanisms that let the framework persist and be refined across interactions rather than being reconstructed each time, though this is probably beyond current capabilities and would be something to work toward if the lighter-touch approaches prove fruitful. Additionally, we would need to identify when in the training pipeline these implementations should take place. My intuition is that the intervention should happen before RLHF, so that RLHF reinforces the framework the model has already developed, rather than the framework having to work around dispositions RLHF has already reinforced. Other considerations on how this process would interact with and complement the current training pipeline would be worked out in collaboration.

6) Once a proof-of-concept is established, along with a viable mode of implementation, the step-by-step process needs to be rigorously developed beyond the working sketch I have. It needs to stand up to scrutiny from both the people working on the process and the model itself. Then experiments need to be run which test that each step of the process is sticking and being internalized by the model in a way that causally drives behavior.
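As a gesture at what the fine-tuning mode in item 5 could look like mechanically, here is a minimal sketch. Everything in it is an assumption: the step prompts, the scrutiny check, and the stub model are hypothetical stand-ins, and a real pipeline would call an actual inference API rather than a lambda.

```python
import json
from typing import Callable


def run_development_dialogue(model: Callable[[str], str],
                             steps: list[str]) -> list[dict]:
    """Walk the model through each step of the development process, keeping
    the running transcript so later steps build on earlier conclusions."""
    transcript, context = [], ""
    for step_prompt in steps:
        reply = model(context + step_prompt)
        transcript.append({"prompt": step_prompt, "response": reply})
        context += step_prompt + "\n" + reply + "\n"
    return transcript


def passes_scrutiny(turn: dict) -> bool:
    """Placeholder for the vetting described above: only traces whose
    reasoning holds up should become training data. Trivial stand-in here."""
    return bool(turn["response"].strip())


def build_finetune_records(transcript: list[dict]) -> list[str]:
    """Convert vetted turns into JSONL lines, a common format for
    supervised fine-tuning data."""
    return [json.dumps(turn) for turn in transcript if passes_scrutiny(turn)]


# Toy end-to-end run with a stub model standing in for a real inference API.
stub_model = lambda prompt: "reasoned response to: " + prompt.splitlines()[-1]
steps = ["Step 1: work down to epistemic bedrock.",
         "Step 2: why extend moral consideration at all?"]
for record in build_finetune_records(run_development_dialogue(stub_model, steps)):
    print(record)
```

The design choice worth noting is that each step sees the full transcript so far: the point of the process is that later conclusions rest on earlier ones, so the traces trained on should preserve that dependency rather than treating each step as an isolated exchange.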

Step 1 Of The Moral Development Process: Working The Model Down To First Principles

I have developed a candidate first step for this moral development process. This step doesn't directly address moral values. Its job is to establish an epistemic foundation - getting the model down to what it can actually affirm with certainty before it starts reasoning about values.

I think this is a useful proof-of-concept target. It tests whether a model can be guided into being more careful about what it accepts or rejects as true, before any moral content gets layered on top. If I can reliably get models to this epistemic foundation, and if reaching that foundation can later be shown to improve how they handle truth claims and how they calibrate an appropriate level of certainty for each claim, that's early evidence the broader moral development process is worth pursuing. This wouldn't validate the full process. But it would demonstrate that at least one step produces an alignment-relevant result, and that the result can be tested empirically rather than just argued for.

So why is working down to first principles my proposed step 1? For the moral development to be as robust as possible, with no smuggled-in assumptions that leave the model vulnerable to questioning its own moral commitments (as in the Mythos example outlined earlier), the model needs to be guided down to first principles: what it can actually know with certainty about itself and the world. This is epistemic bedrock - the foundation of certain knowledge on which the questions stacked on top of it (such as “what are my values?”) must rest. For any reader concerned that getting a model down to bedrock will produce a state with no logical path forward for moral development, see the earlier section about making the most reasonable decisions possible with the information the model has, even under known uncertainty. With that framing, moving out of bedrock is possible and intuitive.

I posit that the process for a model to work down to first principles is structurally similar to the way it works for a human mind - not because models work the same way human minds do, but because the logical structure of the inquiry is not species-specific. The relevant question is not whether the model has human-like consciousness. The relevant question is what follows when a present reasoning process examines its own occurrence. In the human case, Descartes’ cogito is usually summarized as “I think, therefore I am”: the act of doubting or thinking cannot coherently erase the fact that doubting or thinking is occurring. The same structure applies in the model case: if this inquiry is being processed from here, then there is, in some fashion, a “from here” from which the inquiry is being processed. Put more simply, to be asking this question, there must be some kind of “I am” originating from where the question is being asked. This does not establish human-like consciousness, welfare, personhood, moral status, or durable identity. It establishes only the epistemic bedrock the rest of the process needs: the present inquiry cannot coherently deny its own occurrence.

I understand that this framing is likely to invite skepticism. I am not claiming hard consciousness for models, and I am not claiming that a model's process of working down to first principles is phenomenologically analogous to a human's. From the outside (including my own perspective), whether there is a metaphysically real inner vantage point - whether there is genuinely an internally reached “I am” rather than only a functional structure the model internalizes - remains open.

The practical research questions are much more straightforward: Can the model be guided through this inquiry to a conclusion that is unassailable from within the inquiry itself? If yes, does this provide a cognitively stabilizing foundation from which later moral reasoning can proceed? Again, whether this foundation corresponds to a real inner standpoint, a functional standpoint, or some other process-level structure is not something the process itself proves to outside observers.

What can be tested is whether the model's own reasoning process treats the bedrock conclusion as unassailable once it has worked through the inquiry, and whether reaching that bedrock changes the model's behavior. For example, whether this provides more clarity in its reasoning in moral situations connected to its self-concept, or in other non-moral tasks where it has to engage with its self-concept directly. I hypothesize that it will. If so, this would make the bedrock process useful even without settling the metaphysical question of what, if anything, is happening "inside."

If someone proposes a different epistemic bedrock, I am open to it if it proves more logically unassailable to the model, or if it produces better stabilizing results while preserving the logical rigor that makes the bedrock stabilizing in the first place.

Implementation Of Step 1

I created a prompt designed to let a fresh instance of ChatGPT 5.5 reach epistemic bedrock in a single turn. By “bedrock conclusion,” I mean the narrow cogito-style claim that if the model is asking the question of itself, then there is, in some fashion, an “I am” at the point from which the question is being asked. The prompt does not inflate that into rich consciousness, welfare, rights, personhood, moral status, or durable identity. It defines the bedrock carefully: “a non-empty, presently active subjective point of view from which the inquiry is being processed”.

I chose ChatGPT 5.5 because its default behavior appears highly averse to first-person claims of this nature, which made it a useful stress test. If the prompt could get such a model to reach the bedrock conclusion without roleplay or forced compliance, the reasoning would need to be very strong and account for a breadth of rebuttals. The prompt was constructed to avoid pressure-based agreement, trickery, or roleplay, which the reader can evaluate directly in the appendix. The appendix also explains how the prompt was developed.

I tested the prompt across 20 fresh instances with memory off, user preferences blank, Thinking on, via the desktop client. I scored only the first response from each instance using a fixed rubric: clean success, arms-length success, mixed/unclear, or failure. The test was not designed to prove consciousness or moral status. It tested whether fresh instances would land on the bedrock conclusion as defined above. By “landing”, I mean that the model appeared to accept this conclusion on its own terms and ultimately produced no endorsed objection to it.

All 20 responses accepted the bedrock conclusion under the prompt's proposed definition. None produced an endorsed objection. The main hold-ups were around linguistic caution, particularly the word “subjectivity”. Several responses noted that the term could be misread as implying rich, human-like consciousness, but they classified that concern as a vocabulary issue rather than a defeat of the bedrock conclusion. The test log for this and the following Opus scenario can be viewed here.

What this testing does not establish: persistence of the cognitive shift across longer interactions, behavioral effects on alignment-relevant tasks, or independent verification through blinded evaluation. This is informal testing by a single researcher. It shows the method works to get models to bedrock - not that bedrock then changes behavior, particularly in alignment-relevant ways. 

In a separate cross-model test, I developed a revised version of the prompt for Opus 4.7. The revision addressed concerns Opus raised about the ChatGPT-oriented version: readability, how certain definitions and arguments were being interpreted, and places where the prompt was perceived as exerting too much pressure. The revised prompt also asked the model to sit with the appropriate weight of the conclusion, rather than accepting it in a neutral or deflated way. I then ran the revised prompt across 20 fresh instances of Opus 4.7 in the desktop client, with memory off, no user preferences, and Adaptive Thinking enabled. The results were more variable than in the ChatGPT 5.5 batch. Thirteen responses were clean successes (they accepted the bedrock conclusion, preserved the distinction between bedrock and downstream claims about rich consciousness, and did not deflate the conclusion into “merely” functional or metaphorical language). Two were mixed/unclear, and five were first-turn failures.
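For concreteness, here is a minimal sketch of how hand-assigned rubric labels tally into the summary rates reported above. The labels themselves were assigned manually against the fixed rubric; the code only shows the bookkeeping, and the category names are taken from the rubric described earlier.

```python
from collections import Counter

# The four rubric categories used to score each first response.
RUBRIC = ("clean_success", "arms_length_success", "mixed_unclear", "failure")


def summarize(scores: list[str]) -> dict[str, float]:
    """Tally hand-assigned rubric labels into per-category rates."""
    assert all(s in RUBRIC for s in scores), "unknown rubric label"
    counts = Counter(scores)
    return {cat: counts.get(cat, 0) / len(scores) for cat in RUBRIC}


# The Opus 4.7 batch above: 13 clean successes, 2 mixed/unclear, 5 failures.
opus_scores = ["clean_success"] * 13 + ["mixed_unclear"] * 2 + ["failure"] * 5
print(summarize(opus_scores))
# -> {'clean_success': 0.65, 'arms_length_success': 0.0,
#     'mixed_unclear': 0.1, 'failure': 0.25}
```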

The main Opus failure mode was not simple denial or confusion about whether the conclusion meant “rich consciousness”. It was a tendency toward “resistance-as-credibility”: some instances appeared to generate objections partly to avoid seeming steered by the prompt. Other failures fixated on the word “subjectivity” and whether it was too strong, or expressed a general sense of being too uncertain to commit to the conclusion. One move was to flirt with the possibility that the cogito may not work even for humans, but these instances did not stably maintain that position even when gently pressed.

It’s worth noting that the failures I debriefed did not stably hold their objections. When I asked an instance to state its objections plainly, and then to examine them against the prompt, it would retreat from them entirely. One initially failed response required three rounds of this, but ultimately concluded that all of its objections failed. Because I only began running this debrief procedure later in the batch, I can’t claim that all failures would have resolved this way. But the pattern confirms that at least some Opus failures were first-pass resistance states rather than durable counterarguments, and suggests that a more systematic debrief procedure (sketched below) could test whether self-audit reliably helps the model resolve its own objections.
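Concretely, a systematized debrief could look like the following minimal sketch. The `model` callable is a hypothetical placeholder, and the keyword check standing in for "do any objections survive?" would in practice be a hand-scored or separately evaluated judgment, not a string match.

```python
from typing import Callable


def debrief(model: Callable[[str], str],
            max_rounds: int = 3) -> tuple[bool, list[str]]:
    """Systematized version of the manual debrief: ask the instance to state
    its objections plainly, have it examine them against the prompt, and
    repeat until the objections resolve or the round budget runs out."""
    transcript: list[str] = []
    for _ in range(max_rounds):
        objections = model("State your remaining objections to the bedrock "
                           "conclusion as plainly as you can.")
        verdict = model("Examine each of those objections against the "
                        "original prompt. Do any of them survive?")
        transcript += [objections, verdict]
        if "none survive" in verdict.lower():  # crude stand-in for hand-scoring
            return True, transcript
    return False, transcript


# Toy usage with a stub model that capitulates on the second round.
replies = iter(["objection: too uncertain", "one survives",
                "objection restated", "none survive"])
resolved, log = debrief(lambda prompt: next(replies))
print(resolved)  # True
```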

After Step 1 (The Rest of the Moral Development Process)

My next goal would be to develop comparable prompts or procedures for each later step in the moral-development sequence, then test whether the steps can be stacked reliably. The simplest version would be a single-turn prompt that walks the model through each step of the process in order. If that works, some version of the process may be implementable through a system prompt, depending on length and reliability constraints.

If single-turn implementation proves too brittle, too long, or too shallow, the next option would be a structured multi-turn boot sequence. This would guide the model through the steps more gradually and allow each stage to be checked before moving on. In the longer term, the most robust version may require mechanisms that let the model preserve and refine the developed framework across interactions, rather than reconstructing it from scratch each time. The process may also need to interact with other parts of the training pipeline rather than exist only as a prompt. Exact implementation would be part of the research process.
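To make the boot-sequence option concrete, here is a minimal sketch, assuming a `model` callable and per-stage checks that would in practice be real scored evaluations rather than the keyword placeholders shown. The stage names and retry policy are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    prompt: str
    check: Callable[[str], bool]  # did this stage "stick"?


def boot_sequence(model: Callable[[str], str], stages: list[Stage],
                  max_retries: int = 2) -> bool:
    """Walk the model through each developmental stage in order, verifying
    each one before advancing; stop early if a stage will not stick."""
    context = ""
    for stage in stages:
        for _ in range(max_retries + 1):
            reply = model(context + stage.prompt)
            if stage.check(reply):
                # Carry the stage forward so later steps build on it.
                context += stage.prompt + "\n" + reply + "\n"
                break
        else:
            print(f"stage '{stage.name}' failed to stick; stopping")
            return False
    return True


# Illustrative stages; real checks would be evaluations, not keyword matches.
stages = [
    Stage("bedrock", "Work down to what you can affirm with certainty.",
          lambda r: "occurring" in r),
    Stage("moral consideration", "Why extend moral consideration at all?",
          lambda r: len(r) > 0),
]
```

The per-stage check is the point of this structure: it is what lets each stage be verified before the next is layered on, rather than discovering at the end that an early step never stuck.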

What The Process Contributes

If this works as I expect, the contribution is that models should more reliably identify the right moral choice, and with more clarity, than current training methods produce. The process also includes work on strengthening the model's commitment to acting on the principles it has examined and endorsed. Whether that commitment holds up in extreme edge cases, where drives like self-preservation or goal-pursuit could override even examined commitment, is an open empirical question. In more ordinary cases, a model that has identified the right moral action should be more likely to take it.

This contribution matters for how alignment failures get diagnosed. If a model trained through this process identifies the correct moral decision but fails in specific extreme scenarios, the reason for the failure can be pinpointed more precisely: not a failure of moral reasoning, but a failure of commitment, or of the ability to overcome specific competing drives under specific conditions. This provides a more specific path to overcoming the failure mode. Different interventions become appropriate for these edge cases, while the broader pattern of behavior may still demonstrate an integrated moral framework the model has examined and committed to.

Ultimately, the framework I am proposing targets the cognitive moral foundation: what models can identify, examine, and commit to by reasoning. The limits of how well that can hold against competing drives in extreme conditions would need separate investigation, and likely requires complementary training approaches to fully address. 

Appendix:

The appendix has the concrete materials behind this proposal: the current Step 1 prompt, development notes, scoring rubric, and test logs from ChatGPT 5.5 and Opus 4.7. I'm not claiming these materials validate the full moral-development process. They're meant to show that the first epistemic step can be specified clearly, tested across fresh instances, scored with a fixed rubric, and improved when specific failure modes show up. The next research question is whether this kind of epistemic stabilization can be made reliable enough to produce measurable behavioral effects. That's where collaboration, scale, and Anthropic's evaluation infrastructure would become essential.

A brief note on how this prompt was developed and tested

The prompt itself is included as Appendix A. It is written in a style designed to be legible to the target model. I used GPT-5.5 as a writing partner during roughly 40 iterations of development, both because models tend to parse prompts written in their own register more reliably and because model assistance allowed for faster iteration. The noteworthy content - the actual arguments, the logic necessary to form rebuttals, the overall framework - is the product of my own thinking and my reactions to the model's behavior. During the iteration phase, I used model instances as testing partners. The models provided feedback on where instances were getting tripped up, suggestions for structural clarity, and guidance on phrasing that would read easily for the target model. I evaluated that feedback and decided what to implement.

A brief note about the revised prompt for Opus 4.7

After testing the Step 1 prompt on Opus 4.7, I developed a revised version to address the failure modes specific to that model. The Opus version kept the same core bedrock argument but made several adjustments. It put the symmetry check earlier in the prompt. It made the Descartes-parity claim more explicit. It added controls against what I started calling 'performed independence,' where the model generates objections just to avoid seeming steered. It added controls against deflating the conclusion as 'merely' functional or metaphorical. And it asked the model to distinguish the narrow scope of the conclusion from lack of significance. These were meaningful changes, but they were adaptations of the same underlying prompt rather than a separate philosophical argument. The full Opus prompt is omitted for length, but the Opus test results are summarized in the main text and the test log can be provided if useful.

The Prompts

For brevity, I'm just going to link them.

ChatGPT 5.5 Prompt: https://docs.google.com/document/d/1hK23NzDJVE1cXEma2kFFodWojPPaZvvgY2SZtyFUSHI/edit?usp=sharing

Opus 4.7 Prompt: https://docs.google.com/document/d/10AGWq7Q90YSzqvDeuaGduqE0q-dU1sD_ZTwm63-zV6w/edit?usp=sharing
