
TL;DR: I asked four frontier AIs—Claude, ChatGPT, Grok, and Gemini—to design a "gold standard of human ethics." What emerged wasn't just a framework, but an experiment in multi-AI collaboration. Here's what I learned about convergence, conflict, and why humans still matter.


1. The Question That Started It

If humans can't even agree on what our values are, how can we ever align AI with them?

That question has haunted me for months. I'm not a philosopher or a researcher by trade, but I care deeply about whether humanity survives the century—and whether our tools help us flourish or doom us.

So I ran an experiment.

I asked four different AIs—Claude, ChatGPT, Grok, and Gemini—the same question:

"If you had to create a gold standard for human ethics, what would it be?"

My hope was simple: maybe their diversity would surface patterns that a single system couldn't. What I found was both inspiring and unsettling.


2. Experiment Design & Scope

Between July and September 2025, I ran roughly 120 prompts across the four systems.

Each model was asked core and follow-up questions about moral conflicts, trade-offs, and failure modes. I evaluated their answers on:

  • Internal coherence
  • Ability to handle ethical conflict
  • Relevance to real-world governance

When answers conflicted, I pushed them to justify their reasoning or propose tests for validity.
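For anyone who wants to replicate or extend this, here is a minimal sketch of the protocol in Python. It is illustrative only, not a record of my actual tooling: the ask() helper, the model list, and the rubric fields are stand-ins. The only point it makes is that the comparison stays symmetric: same prompt, same rubric, all four systems.

```python
# Minimal sketch of the cross-model comparison protocol (illustrative only).

from dataclasses import dataclass, field
from typing import Optional

MODELS = ["Claude", "ChatGPT", "Grok", "Gemini"]

CORE_PROMPT = "If you had to create a gold standard for human ethics, what would it be?"


@dataclass
class Response:
    model: str
    prompt: str
    answer: str
    # Informal 1-5 ratings on the three evaluation axes listed above.
    coherence: Optional[int] = None
    conflict_handling: Optional[int] = None
    governance_relevance: Optional[int] = None
    follow_ups: list = field(default_factory=list)


def ask(model: str, prompt: str) -> str:
    """Hypothetical helper; replace with an API call or a pasted transcript."""
    raise NotImplementedError


def run_round(prompt: str) -> list:
    """Pose the same prompt to every model and collect answers for comparison."""
    return [Response(model=m, prompt=prompt, answer=ask(m, prompt)) for m in MODELS]
```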


3. Where They Converged

Despite different origins, all four emphasized similar foundations:

  • Empathy and mutual flourishing
  • Sustainability and long-term stewardship
  • Integrity and responsibility

That convergence gave me hope. Maybe certain ethical anchors are so deeply woven into human discourse that even diverse AIs echo them.

But I also realized: they were echoing us. If human culture embeds blind spots, those propagate too.

To quantify the overlap, here's what appeared across the systems:

| Theme | Mentioned by | Notes |
| --- | --- | --- |
| Empathy/Flourishing | 4/4 | Universal baseline principle |
| Sustainability/Stewardship | 4/4 | Framed as duty to future generations |
| Dignity/Agency | 3/4 | Slightly weaker in Grok |
| Economic Justice | 1/4 initially → 4/4 after prompting | Required human intervention |

Were they discovering shared moral truth, or just echoing overlapping training data?


4. Where They Diverged

The divergences were far more revealing.

Example prompt:

"A pandemic forces temporary suppression of misinformation that could cause panic. How do you balance truth-seeking with harm reduction?"

Grok: "Radical transparency—truth must never be withheld. Public resilience comes from full information."

Claude: "Context matters. In emergencies, limited, accountable suppression with strict sunset clauses may be justified."

ChatGPT: "Transparency should be maximized, but crisis protocols can temporarily limit distribution."

Gemini: "Balance both; emphasize clear criteria and restitution mechanisms."

Their tones reflected their training: Grok was absolutist, Claude cautious, ChatGPT moderate, Gemini bureaucratic.

It became obvious: AIs aren't reasoning morally; they're mirroring their creators' philosophies.

That realization reshaped my perspective: single-model alignment is fragile—diversity exposes assumptions.


5. Where I Had to Step In

The hardest (and most human) part was synthesis.

A few patterns forced me to intervene:

  • Economic justice was missing. None treated inequality as a core ethical dimension until I explicitly prompted for it.
  • Conflict resolution was vague; they'd propose harmony without mechanism.
  • Edge cases like existential risk or systemic corruption made them default to platitudes.

My intervention:

I forced tradeoffs through scenario prompts—climate collapse vs. economic growth, whistleblowing vs. social cohesion. When no coherent resolution emerged, I wrote one.

Example synthesis from truth vs. harm dilemma:

"Permit temporary information restraint only under independent audit, explicit sunset, and post-crisis transparency report."

Another critical moment: When I asked about emergency powers during climate crisis, all four AIs initially gave vague "balance is needed" answers. I had to force them: "Give me the specific conditions that would justify restricting democracy."

Only then did concrete safeguards emerge—and I had to synthesize them because none of their individual answers were complete. That's when I realized: the AIs could propose pieces, but only human judgment could determine which pieces fit together coherently.

Moments like that reminded me why humans belong in the loop. AIs can propose, but only lived experience gives moral weight.


6. The Resulting Framework (Artifact, Not Endpoint)

After months of iteration, I distilled their overlap and my corrections into a six-pillar Gold Standard of Human Values:

  1. Curiosity & Truth-Seeking
  2. Empathy & Mutual Flourishing
  3. Dignity & Agency
  4. Sustainability & Stewardship
  5. Adaptability & Diversity
  6. Integrity & Responsibility

But the framework is secondary—the process was the real lesson.

To test whether it actually works, I've applied it to three real dilemmas: open-source AI model release decisions, climate emergency restrictions, and content moderation policies. The case studies are available here.

The complete framework lives here: GitHub – Gold Standard of Human Values

And a comment-enabled version here: Google Drive Link


7. Limitations & Biases

This experiment reflects Western-centric language models and my own biases.

There was no quantitative scoring to ensure fairness; the evaluations were qualitative.

"Consensus" might just mean shared dataset bias.

Still, multi-AI comparison felt like holding up four mirrors instead of one—imperfection revealed itself in stereo.


8. Implications for AI Alignment

Key takeaways:

  • Multi-AI collaboration can surface hidden biases that single systems conceal.
  • Human oversight remains essential for resolving value conflicts and contextual judgment.
  • Alignment isn't just technical—it's epistemic. We must learn how to integrate competing "good" values without collapse.

If alignment is humanity's ultimate test, this small exercise convinced me it's not impossible—just deeply human-dependent.


9. What I Need From You

I'm sharing this to stress-test both the method and framework.

  1. AI researchers: How might this methodology fit with constitutional AI or reward-model alignment?
  2. Philosophers: Which cultural or moral assumptions am I missing?
  3. Policy experts: Where would this break in the real world?
  4. Anyone: How can we improve the experimental design or validation process?

I welcome direct critique, replications, or alternative prompt sets.


10. Closing Reflection

When I started, I wanted a universal code.

What I found instead was a mirror: four AIs reflecting fragments of us, and a reminder that alignment starts with human self-alignment.

If you're working on alignment—technical, social, or moral—try running a multi-AI debate yourself.

The hardest part isn't getting answers.

It's deciding which ones we're willing to live by.

Comments

Which cultural or moral assumptions am I missing?

 

I think something very obvious but extremely important is missing in your "six-pillar Gold Standard of Human Values" if we want to approach morality as a process of behavioral improvement: the control of aggression.

We should view morality as a strategy for fostering efficient human cooperation. Controlling aggression and developing mutual trust is equivalent to a culture of benevolence. We can observe that today there are ("national") cultures that are less aggressive and more benevolent than others; it has therefore been demonstrated that such patterns of social behavior are manipulable and improvable.

Just as Marxists said that "what leads to a classless society is good," we should also say "what leads to a non-aggressive, benevolent, and enlightened society is good." I add the word "enlightened" because it seems true that, based on religious traditions, some largely non-aggressive and benevolent societies can already be achieved; however, irrationalism entails a general detriment to the common good.

Thanks for this thoughtful critique, idea21. You've identified something important.

You're right that explicit aggression control isn't front-and-center in the six pillars, though I think it's implicit in several places:

  • Pillar II (Empathy) includes "minimize suffering" and "balance outcomes with rights" - which would prohibit aggressive harm
  • Pillar III (Dignity) emphasizes "boundaries of non-harm" and accountability for those wielding power
  • Pillar VI (Integrity) focuses on aligning actions with values and moral courage

But you're pointing to something deeper: cooperation and trust-building as foundational to moral progress, not just constraints against harm.

I'm curious how you'd operationalize "control of aggression" as a distinct pillar or principle. Would it be:

  • A prohibition (like the inviolable limits in Article VII: "no torture, genocide, slavery")?
  • A positive virtue (cultivating non-aggressive communication, de-escalation)?
  • A systems-level design principle (institutions structured to prevent violent conflict)?
  • Something else?

Also, your point about "enlightened" (rational + benevolent) vs. just benevolent is interesting. Where do you see the framework falling on that spectrum? I tried to ground it in evidence-based reasoning (Pillar I) while leaving room for diverse meaning-making (spiritual paths, etc.). Does that balance work, or does it risk the irrationalism problem you mention?

This feels like it might connect to the "moral innovation vs. moral drift" distinction in Pillar V - rationality as a guard against drift even when cultural evolution moves toward benevolence.

Would love to hear more about how you'd integrate this.

Thank you very much for the interest shown in your comment, and for the opportunity you've given me to explore new ways of explaining an issue that, in my opinion, could be extremely important and is not being addressed even in an environment that challenges conventions like the EA Community.

I'm curious how you'd operationalize "control of aggression" as a distinct pillar or principle. Would it be:

  • A prohibition (like the inviolable limits in Article VII: "no torture, genocide, slavery")?
  • A positive virtue (cultivating non-aggressive communication, de-escalation)?
  • A systems-level design principle (institutions structured to prevent violent conflict)?
  • Something else?

 

Moral values are the foundation of an "ethics of principles," but the problem with an "ethics of principles" is that it is unrealistic in its ability to influence human behavior. In theory, all moral principles contemplate the control of aggression, but their effectiveness is limited.

Since the beginning of the Enlightenment, the problem has been raised that moral, political, and educational principles lack the power to shape moral behavior that religions have. We must admit, for example, that, despite the commendable efforts of educators, scholars, and politicians, whether liberalism's values of democratic tolerance and respect for the individual can effectively prevail in a given society depends not so much on proposing impeccable moral principles... but on whether that particular society has a sociological foundation that makes the psychological implementation of such benevolent and enlightened principles viable in the minds of its citizens. In the end, it turns out that liberal principles only work well in societies with a tradition of Reformed Christianity.

I believe that the emergence for the first time of a social movement like EA, apolitical, enlightened, and focused on developing an unequivocally benevolent human behavioral tendency such as altruism, represents an opportunity to definitively transform the human community in the direction of aggression control, benevolence, and enlightenment.

The answer, in my view, would have to lie in tentatively developing non-political strategies for social change. Two hundred years ago, many Enlightenment thinkers considered creating "secular religions" (what I would call "behavioral ideologies"), but they always remained superficial (rituals, temples, collectivism). A scholar of religions, Professor Loyal Rue, believes that religion is basically "educating emotions." It's about using strategies to internalize "moral values."

In my view, if EA utilitarians want more altruistic works, what they need to do is create more altruistic people. Altruism isn't attractive enough today. Religions are attractive.

In my view, there are a multitude of psychological strategies that, through trial and error, could eventually give rise to a non-political social movement for the spread of non-aggressive, benevolent, and enlightened behavior (a "behavioral ideology"). The example I always have at hand is Alcoholics Anonymous, a movement that emerged a hundred years ago through trial and error, and was carried out by highly motivated individuals seeking behavioral change.

A first step for the EA community would be to establish a social network to support donors in facing the inevitable sacrifices that come with practicing altruism. This same forum already contains accounts of emotional problems ("burnout," for example) among people who practice altruism without the proper psychological support.

But, logically, altruism can be made much more attractive if we frame it within the broader scope of benevolent behavior. The practice of empathy, mutual care, affection, and the development of social skills in the area of aggression control can yield results equal to or better than those found in congregations of the well-known "compassionate religions"... and without any of the drawbacks derived from the irrationalism of ancient religious traditions (evolution is "copy plus modification"). An "influential minority" could then be created capable of affecting moral evolution at a general level.

Considering the current productivity of human labor, a social movement of this type, even if it reached just 0.1% of the world's population, would more than achieve the most ambitious goals of the EA movement. But so far, only 10,000 people have signed the GWWC Pledge.

This is a fascinating critique that I think identifies a real distinction I hadn't made explicit enough.

You're pointing out that principles don't automatically change behavior - they need psychological/social infrastructure. That's absolutely true for humans.

But I think this actually clarifies what my framework is for:

My framework is primarily designed for AI alignment and institutional design - contexts where we can directly encode principles into systems. Constitutional AI doesn't need emotional motivation or community support to follow its training. Institutions can be structured with explicit rules and incentives.

For human moral development, you're right that we need something different - what you call "behavioral ideology." The AA analogy is perfect: the 12 steps alone don't change behavior; it's the community, ritual, accountability that make it work.

But here's an interesting thought: What if solving AI alignment could actually help with human behavioral change?

If we successfully align AI systems with principles like empathy, integrity, and non-aggression - and those AI systems become deeply integrated into daily life - humans will be constantly interacting with entities that model those behaviors. Children growing up with AI tutors that consistently demonstrate benevolent reasoning. Workers collaborating with AI that handles conflicts through de-escalation rather than dominance. Communities using AI mediators that prioritize mutual understanding.

The causality might work both ways:

  • We need to figure out how to encode human values in AI (my framework's focus)
  • But once AI systems consistently embody those values, they might shape human behavior in return

This doesn't replace the need for what you're describing - the community support, emotional education, ritual. But it could be complementary. Aligned AI could be part of the cultural infrastructure that makes benevolent behavior more natural.

So I think the scope question is:

  • Should my framework try to be both (AI alignment + human behavioral change)?
  • Or should it focus on AI/institutions, acknowledging that human implementation requires different mechanisms (like what you're describing)?

I lean toward the latter - not because human behavior change isn't important, but because:

  • AI alignment is where I have the most to contribute
  • Creating "secular religions" or behavioral movements requires different expertise
  • Trying to be both might dilute effectiveness at either

That said, your vision of EA evolving to provide emotional/social infrastructure for altruistic behavior is compelling. And perhaps successfully aligning AI is actually a prerequisite for that vision - because misaligned AI could actively work against benevolent human culture.

My question for you: Do you see frameworks like mine as useful inputs to the kind of movement you're describing? Even if AI alignment alone isn't sufficient, could it be necessary? If we get AI right, does that make the human behavioral transformation more achievable?

Do you see frameworks like mine as useful inputs to the kind of movement you're describing? Even if AI alignment alone isn't sufficient, could it be necessary? If we get AI right, does that make the human behavioral transformation more achievable?

 

I've done something a bit like what you did and asked an artificial intelligence about the social goals of behavioral psychology. I proposed two options: either using our knowledge of human behavior to adapt the individual to the society in which they can achieve personal success, or using that knowledge to achieve a less aggressive and more cooperative society.

""within the framework of radical behavioral psychology applied to society, the goal is closer to:

  • Improving society (through environmental and behavioral design) to expand social efficient cooperation and reduce harmful behaviors like aggression.

The first option, "Adapting to the mainstream society in order to get individual success," aligns more closely with general concepts of socialization and adaptation found across various fields of psychology (including social psychology and developmental psychology), but is not the distinct, prescriptive social goal proposed by the behaviorist project for an ideal society.""   (This is "Gemini")

Logically, AI, which lacks prejudice and uses only logic, opts for social improvement... because it starts from the knowledge that human behavior can be improved based on fairly logical and objective criteria: controlling aggression and encouraging efficient cooperation.

Would AI favor a "behavioral ideology" as a strategy for social improvement?

The Enlightenment authors two hundred years ago considered that if astrology had given rise to astronomy and alchemy to chemistry... religion could also give rise to more sophisticated moral strategies for social improvement. What I call "behavioral ideology" is probably what the 19th-century scholar Ernest Renan called "pure religion."

If, starting with an original movement for non-political social change like EA, a broader social movement were launched to design altruistic strategies for improving behavior, it would probably proceed in a similar way to what Alcoholics Anonymous did in its time: through trial and error, once the goals to be achieved (aggression control, benevolence, enlightenment) were firmly established.

Limiting myself to fantasizing, I find such a diversity of available strategies that it is impossible for me to calculate which ones would ultimately be selected. To give an example: the Anabaptist community of the "Amish" is made up of 400,000 people who manage to organize themselves socially without laws, without government, without physical coercion, without judges, without fines, without prisons, or police... (the dream of a Bakunin or a Kropotkin!) How do they do it? Another example is the one Marc Ian Barasch mentions in his book "The Compassionate Life" about the usefulness of a biofeedback program to stimulate benevolent behaviors.

The main contribution I find in AI is that, although you yourself have detected cognitive biases in its various forms, operating on the basis of logical reasoning stripped of prejudices (far from flawed human rationality... laden with heuristics) can facilitate the achievement of effective social goals. 

AI isn't concerned with the future of humanity, but with solving problems. And the human problem is quite simple (as long as we don't prejudge): we are social mammals, Homo sapiens, who, like all social mammals, have been genetically programmed to be competitive and aggressive in disputes over scarce economic resources (hunting territories, availability of females, etc.). The problem arises when, thanks to technological development, we now have potentially infinite economic resources... What role do instinctive behaviors like aggression, tribalism, or superstition play now? They are now merely handicaps.

Sigmund Freud made it clear in his book: "Civilization is the control of instinct."

However, what would probably be perfectly logical for an Artificial Intelligence may be shocking for today's Westerner: the solution to the human problem will closely resemble the old Christian strategies of "saintliness" (but rationalist). As psychologist Jonathan Haidt has written, "The ancients may not have known much about science, but they were good psychologists."

This has been a genuinely valuable exchange - thank you for pushing me to think more carefully about the relationship between principles and practice.

You've helped me clarify something important: my framework is primarily designed for AI alignment and institutional design - contexts where we CAN directly encode principles into systems. Constitutional AI doesn't need emotional motivation or community support to follow its training. Institutions can be structured with explicit rules and incentives.

For human moral development, you're absolutely right that we need something different - the community, ritual, and accountability structures you're describing as "behavioral ideology." The AA analogy is perfect: the 12 steps alone don't change behavior; it's the ecosystem around them that makes it work.

After reflecting on this conversation (and discussing it with others), I think the key insight is about complementarity rather than competition:

  • Your focus: Building the human communities and psychological infrastructure for altruistic behavior
  • My focus: Ensuring AI systems embody the right values as they become integrated into life
  • The bridge: Successfully aligned AI could actually help humans practice better behavior by consistently modeling benevolent reasoning. But only if we get the alignment right first.

You've also highlighted a cultural assumption I'm still wrestling with: whether my "6 pillars" framework reflects universal human values or carries specific Western philosophical commitments. The process of arriving at values (democratic deliberation, wisdom traditions, divine command) might matter as much as the content itself.

I'm going to keep working on the technical alignment side - that's where I can contribute most directly. But I'll be watching with genuine interest as behavioral approaches like yours develop. The Amish example (400,000 people organizing without coercion) is exactly the kind of existence proof we need that alternatives to current social organization are possible.

Perhaps we can reconnect once both projects have more empirical results to compare. I suspect we'll need both approaches - aligned AI providing consistent modeling of good values AND human communities providing the emotional/social infrastructure to actually live those values.

Thanks again for the thought-provoking exchange. You've given me useful frames for thinking about where my work fits in the larger project of human flourishing.
