
Cross-posted from my website.

If civilization advances its technological capabilities without advancing its wisdom, we may miss out on most of the potential of the long-term future. Unfortunately, it's likely that ASI will have a comparative disadvantage at philosophical problems.

You could approximately define philosophy as "the set of problems that are left over after you take all the problems that can be formally studied using known methods and put them into their own fields." Once a problem becomes well-understood, it ceases to be considered philosophy. Logic, physics, and (more recently) neuroscience used to be philosophy, but now they're not, because we know how to formally study them.

Our inability to understand philosophical problems means we don't know how to train AI to be good at them, and we don't know how to judge whether we've trained them well. So we should expect powerful AI to be bad at philosophy relative to other, more measurable skills.

However, there is one type of philosophy that is measurable, while also being extremely important: philosophy proofs.

Some examples of proofs that made important advances in moral philosophy:

  • The VNM Utility Theorem proved that any agent whose preferences satisfy four axioms has a utility function, and that its preferences amount to maximizing the expected value of that function.
  • Harsanyi's utilitarian theorem (see also Harsanyi (1955)[1]) showed that if individuals have VNM utility functions, and if the Pareto principle[2] holds over groups, then a version of utilitarianism must be true. In particular, utility must aggregate linearly across individuals.
  • Arrhenius (2000)[3] proved that any theory of population ethics must accept at least one counterintuitive conclusion.
  • Askell (2018)[4] proved that if four intuitive axioms[5] hold, then it is impossible to compare infinite worlds.
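To make the flavor of these results concrete, Harsanyi's aggregation conclusion can be written compactly. In standard notation (my symbols, not the original paper's): if each individual $i$ has a VNM utility function $U_i$, and the social planner's preferences respect the Pareto principle, then the social ranking of outcomes $x$ must maximize a weighted sum

```latex
W(x) = \sum_{i=1}^{n} w_i \, U_i(x), \qquad w_i > 0
```

i.e., individual utilities aggregate linearly, which is the distinctive structural claim of utilitarianism.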

I wrote a proof of my own in GiveWell's Charity Recommendations Require Taking a Controversial Stance on Population Ethics.

The general pattern with these proofs is that you start from a set of intuitively reasonable axioms and use them to produce a controversial conclusion. Having that sort of proof doesn't tell you whether you ought to reject one of the axioms or accept the conclusion, but it does tell you that you have to do one of those things.
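Schematically, such a proof establishes an entailment from the axioms to the controversial conclusion:

```latex
A_1 \wedge A_2 \wedge \dots \wedge A_n \implies C
```

Anyone who accepts every axiom $A_i$ must accept $C$; refusing $C$ forces giving up at least one axiom. (Pure impossibility results, like Arrhenius's, take $C$ to be a contradiction, so some axiom must go.)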

Not many philosophical proofs have been written. That suggests that they're difficult to write, or at least difficult to come up with. None of the proofs I listed are particularly complicated from a mathematical point of view—undergraduate math students routinely have to write more difficult proofs than those. The challenging part is identifying the right setup: you have to find a proof that tells you something new.

That's the sort of thing that AI might be able to do well. AI can churn through ideas more quickly than humans can, and it's relatively good at working with formal systems.[6] Modern-day LLMs might be smart enough to come up with useful philosophical proofs; even if not, the first AIs that can write these proofs will not need to be superintelligent.

AI won't be good at telling you how to move forward after finding an impossibility proof; but it can give you the proof.

Proof of concept

A basic test would be to run a pro-tier LLM with extended thinking for a while to search through possibilities and try to come up with an interesting proof; then have human judges review the resulting proof(s). This test would be relatively easy to conduct; the hard part is judging whether the proofs are interesting.

As an even simpler test, I ran three Claude sessions to generate novel impossibility proofs. In each session I provided some guidance on what I was looking for, and I provided different guidance in each case to try to elicit three distinct results. Below is a quick summary of Claude's three proofs, along with my assessments. I haven't carefully verified that these proofs are correct, but they passed a quick sanity check.

  • First proof: We cannot escape Arrhenius' impossibility result by introducing moral uncertainty.

    My assessment: The concept is somewhat interesting, although to me it's intuitively obvious that moral uncertainty wouldn't let us get around Arrhenius' result.

  • Second proof: If a pluralist value system cares about both maximizing welfare and mitigating individuals' most severe complaints (similar to Rawls' maximin principle), then the pluralist system either violates transitivity, or it can be collapsed onto a single scale.

    My assessment: Uninteresting—the definition of "complaint minimization" does all the work in the proof, and the welfare-maximization criterion is irrelevant.

  • Third proof: Given five reasonable axioms of how an aligned AI agent ought to behave, it is impossible for an agent to simultaneously satisfy all five.

    My assessment: Uninteresting—it's a trivial special case of Sen (1970)[7], which proved that no society can satisfy both Pareto efficiency and liberalism. If no society can satisfy those axioms, then clearly no aligned AI can satisfy them, either.

This was just a quick attempt; more work could perhaps elicit better proofs. Claude had a reasonable understanding of the limitations of its own proofs—it noticed (without additional prompting) that the second proof depended only on the definition of "complaint minimization", and that the third proof was a special case of a known result.

A next step could be to ask many LLM instances to write dozens of proofs, and then use a manager LLM to filter down to the most interesting ones. At minimum, the manager should be able to filter out proofs that are trivial extensions of known results. With some additional effort, present-day LLMs might be capable of coming up with a good novel proof. If not, then it will likely be possible soon. Most kinds of moral philosophy might be difficult for AIs, but proofs are one area where AI assistance seems promising.
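The generate-and-filter pipeline described above can be sketched in a few lines. This is a minimal illustration, not a working system: `generate_proofs` and the `judge` callable stand in for real LLM API calls, and all names here are my own placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Proof:
    statement: str
    novelty_score: float = 0.0  # manager model's judgment, filled in later

def generate_proofs(prompts: List[str]) -> List[Proof]:
    """Generation step (placeholder): in practice, each prompt would go to
    a separate LLM session with its own guidance, and the session's
    proof would come back as the statement."""
    return [Proof(statement=p) for p in prompts]

def filter_proofs(proofs: List[Proof],
                  judge: Callable[[Proof], float],
                  threshold: float = 0.5) -> List[Proof]:
    """Manager step: score each proof for novelty/interest and keep only
    those above the threshold (e.g. drop trivial extensions of known
    results, like the Sen (1970) special case above)."""
    for p in proofs:
        p.novelty_score = judge(p)
    return [p for p in proofs if p.novelty_score >= threshold]
```

The hard part, of course, is the `judge` function: scoring "interestingness" is exactly the kind of fuzzy evaluation that would itself need human spot-checks.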

Is it risky to train AI on philosophy?

This post was about using pre-existing AI to write philosophy proofs, not about specifically training AI to get better at philosophy. I expect advanced AI to be relatively bad at (most kinds of) philosophy because philosophy is hard to train for.

However, it may be dangerous to train AI to get better at philosophy. My worry is that this would make AI better at persuading us of incorrect philosophical positions, and it would make misalignment harder to catch—precisely because it's so hard to tell whether a philosophical position is correct.

I don't have a strong view on how important this is, but I would be remiss if I didn't talk about potential downsides. To be clear, I'm not proposing that we train AI to get better at philosophy. I'm proposing that perhaps near-future AI could be a useful assistant for writing formal philosophical proofs, and that this may be an important application of AI.


  1. Harsanyi, J. C. (1955). Cardinal Welfare, Individualistic Ethics, and Interpersonal Comparisons of Utility. ↩︎

  2. The Pareto principle states that if outcome A is at least as good as outcome B for every person, and outcome A is better for at least one person, then outcome A is better overall. ↩︎

  3. Arrhenius, G. (2000). An Impossibility Theorem for Welfarist Axiologies. doi: 10.1017/s0266267100000249 ↩︎

  4. Askell, A. (2018). Pareto Principles in Infinite Ethics. ↩︎

  5. The four axioms are "the Pareto principle, transitivity, an axiom stating that populations of worlds can be permuted, and the claim that if the 'at least as good as' relation holds between two worlds then it holds between qualitative duplicates of this world pair". ↩︎

  6. I couldn't have made this statement in 2023. LLMs used to be bad at formal systems, but they've gotten much better. ↩︎

  7. Sen, A. (1970). The Impossibility of a Paretian Liberal. ↩︎

Comments

Nice post! Miscellaneous thoughts:

if individuals have VNM utility functions, and if the Pareto principle holds over groups, then a version of utilitarianism must be true.

Harsanyi's theorem also requires that the social planner's preferences satisfy the VNM axioms.

Not many philosophical proofs have been written

I think this all depends on what you mean by 'many'. I'd guess maybe 10% of analytic philosophy papers include a proof of some kind, so that at least hundreds of proofs are published every year. And in a sense, every valid (spelled-out) argument is a proof.

I agree that the Claude proofs are pretty bad. The Arrhenius point is fairly obvious: what Arrhenius means by 'theories' in that paper is weak orders on populations, so if after taking into account moral uncertainty you still have a weak order, then the impossibility theorem still applies. (And later Arrhenius theorems relax both completeness and transitivity, so even departing from a weak order doesn't get you off the hook.)

Claude makes this kind of point, but first it introduces an Agreement axiom that the proof never uses. Claude later comes close to admitting this ('Agreement plays almost no role'), tries to walk it back ('But Agreement rules out the escape route...'), and then fully admits it ('the fundamental impossibility holds regardless'). 

Which Claude model did you use? Did you use extended thinking? The flip-flopping above makes me think there was no extended thinking, and maybe a model with extended thinking would do better. (Though not much better I'd guess. I've found LLMs to be surprisingly bad at philosophy, even just the 'understanding the view and its implications' parts.)

I didn't bother checking the second population ethics proof but it looks sloppy:

Axiom (Sufficient Comparability). For any pair of populations A, B that differ by at most some fixed bounded amount (e.g., adding or removing one person, or changing one person's welfare level by a small amount), M(μ) must rank A and B (no incomparability for "local" comparisons)."

Don't any pair of populations "differ by at most some fixed bounded amount"? What is Claude doing including 'e.g.'s in its formal statement of axioms?

With some additional effort, present-day LLMs might be capable of coming up with a good novel proof. If not, then it will likely be possible soon. Most kinds of moral philosophy might be difficult for AIs, but proofs are one area where AI assistance seems promising.

Yes, you'd think so given that they've gotten so good at math! But when I've tried using LLMs to help with formal philosophy, I've found them to be really surprisingly bad, even at parts that seem very math-loaded (e.g. inventing proofs, following arguments, grasping views and their implications, coming up with counterexamples, etc.). I'm not sure why this is. I guess part of it is that it's hard to do RLVR on philosophy in the same way that you can do RLVR on math, but naively I'd expect more generalization from math to formal philosophy. Maybe the following is a factor: pretraining data doesn't contain that much bad mathematical reasoning, but it contains a huge amount of bad philosophical reasoning.
