
Cross-posted from my website.

If civilization advances its technological capabilities without advancing its wisdom, we may miss out on most of the potential of the long-term future. Unfortunately, it's likely that ASI will have a comparative disadvantage at philosophical problems.

You could approximately define philosophy as "the set of problems that are left over after you take all the problems that can be formally studied using known methods and put them into their own fields." Once a problem becomes well-understood, it ceases to be considered philosophy. Logic, physics, and (more recently) neuroscience used to be philosophy, but now they're not, because we know how to formally study them.

Our inability to understand philosophical problems means we don't know how to train AI to be good at them, and we don't know how to judge whether we've trained them well. So we should expect powerful AI to be bad at philosophy relative to other, more measurable skills.

However, there is one type of philosophy that is measurable, while also being extremely important: philosophy proofs.

Some examples of proofs that made important advances in moral philosophy:

  • The VNM Utility Theorem proved that any agent whose preferences satisfy four axioms has a utility function, and that its preferences amount to maximizing the expected value of that function.
  • Harsanyi's utilitarian theorem (see also Harsanyi (1955)[1]) showed that if individuals have VNM utility functions, and if the Pareto principle[2] holds over groups, then a version of utilitarianism must be true. In particular, utility must aggregate linearly across individuals.
  • Arrhenius (2000)[3] proved that any theory of population ethics must accept at least one counterintuitive conclusion.
  • Askell (2018)[4] proved that if four intuitive axioms[5] hold, then it is impossible to compare infinite worlds.
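To make the flavor of these results concrete, Harsanyi's aggregation conclusion can be written compactly. In standard notation (my symbols, not the original paper's): if each individual $i$ has a VNM utility function $U_i$, and the social planner's preferences respect the Pareto principle, then the social ranking of outcomes $x$ must maximize a weighted sum

```latex
W(x) = \sum_{i=1}^{n} w_i \, U_i(x), \qquad w_i > 0
```

i.e., individual utilities aggregate linearly, which is the distinctive structural claim of utilitarianism.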

I wrote a proof of my own in GiveWell's Charity Recommendations Require Taking a Controversial Stance on Population Ethics.

The general pattern with these proofs is that you start from a set of intuitively reasonable axioms and use them to produce a controversial conclusion. Having that sort of proof doesn't tell you whether you ought to reject one of the axioms or accept the conclusion, but it does tell you that you have to do one of those things.
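Schematically, such a proof establishes an entailment from the axioms to the controversial conclusion:

```latex
A_1 \wedge A_2 \wedge \dots \wedge A_n \implies C
```

Anyone who accepts every axiom $A_i$ must accept $C$; refusing $C$ forces giving up at least one axiom. (Pure impossibility results, like Arrhenius's, take $C$ to be a contradiction, so some axiom must go.)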

Not many philosophical proofs have been written. That suggests that they're difficult to write, or at least difficult to come up with. None of the proofs I listed are particularly complicated from a mathematical point of view—undergraduate math students routinely have to write more difficult proofs than those. The challenging part is identifying the right setup: you have to find a proof that tells you something new.

That's the sort of thing that AI might be able to do well. AI can churn through ideas more quickly than humans can, and it's relatively good at working with formal systems.[6] Modern-day LLMs might be smart enough to come up with useful philosophical proofs; even if not, the first AIs that can write these proofs will not need to be superintelligent.

AI won't be good at telling you how to move forward after finding an impossibility proof; but it can give you the proof.

Proof of concept

A basic test would be to run a pro-tier LLM with extended thinking for a while to search through possibilities and try to come up with an interesting proof; then have human judges review the resulting proof(s). This test would be relatively easy to conduct; the hard part is judging whether the proofs are interesting.

As an even simpler test, I ran three Claude sessions to generate novel impossibility proofs. In each session I provided some guidance on what I was looking for, and I provided different guidance in each case to try to elicit three distinct results. Below is a quick summary of Claude's three proofs, along with my assessments. I haven't carefully verified that these proofs are correct, but they passed a quick sanity check.

  • First proof: We cannot escape Arrhenius' impossibility result by introducing moral uncertainty.

    My assessment: The concept is somewhat interesting, although to me it's intuitively obvious that moral uncertainty wouldn't let us get around Arrhenius' result.

  • Second proof: If a pluralist value system cares about both maximizing welfare and mitigating individuals' most severe complaints (similar to Rawls' maximin principle), then the pluralist system either violates transitivity, or it can be collapsed onto a single scale.

    My assessment: Uninteresting—the definition of "complaint minimization" does all the work in the proof, and the welfare-maximization criterion is irrelevant.

  • Third proof: Given five reasonable axioms of how an aligned AI agent ought to behave, it is impossible for an agent to simultaneously satisfy all five.

    My assessment: Uninteresting—it's a trivial special case of Sen (1970)[7], which proved that no society can satisfy both Pareto efficiency and liberalism. If no society can satisfy those axioms, then clearly no aligned AI can satisfy them, either.

This was just a quick attempt; more work could perhaps elicit better proofs. Claude had a reasonable understanding of the limitations of its own proofs—it noticed (without additional prompting) that the second proof depended only on the definition of "complaint minimization", and that the third proof was a special case of a known result.

A next step could be to ask many LLM instances to write dozens of proofs, and then use a manager LLM to filter down to the most interesting ones. At minimum, the manager should be able to filter out proofs that are trivial extensions of known results. With some additional effort, present-day LLMs might be capable of coming up with a good novel proof. If not, then it will likely be possible soon. Most kinds of moral philosophy might be difficult for AIs, but proofs are one area where AI assistance seems promising.
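The generate-and-filter pipeline described above can be sketched in a few lines. This is a minimal illustration, not a working system: `generate_proofs` and the `judge` callable stand in for real LLM API calls, and all names here are my own placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Proof:
    statement: str
    novelty_score: float = 0.0  # manager model's judgment, filled in later

def generate_proofs(prompts: List[str]) -> List[Proof]:
    """Generation step (placeholder): in practice, each prompt would go to
    a separate LLM session with its own guidance, and the session's
    proof would come back as the statement."""
    return [Proof(statement=p) for p in prompts]

def filter_proofs(proofs: List[Proof],
                  judge: Callable[[Proof], float],
                  threshold: float = 0.5) -> List[Proof]:
    """Manager step: score each proof for novelty/interest and keep only
    those above the threshold (e.g. drop trivial extensions of known
    results, like the Sen (1970) special case above)."""
    for p in proofs:
        p.novelty_score = judge(p)
    return [p for p in proofs if p.novelty_score >= threshold]
```

The hard part, of course, is the `judge` function: scoring "interestingness" is exactly the kind of fuzzy evaluation that would itself need human spot-checks.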

Is it risky to train AI on philosophy?

This post was about using pre-existing AI to write philosophy proofs, not about specifically training AI to get better at philosophy. I expect advanced AI to be relatively bad at (most kinds of) philosophy because philosophy is hard to train for.

However, it may be dangerous to train AI to get better at philosophy. My worry is that this would make AI better at persuading us of incorrect philosophical positions, and it would make misalignment harder to catch—precisely because it's so hard to tell whether a philosophical position is correct.

I don't have a strong view on how important this is, but I would be remiss if I didn't talk about potential downsides. To be clear, I'm not proposing that we train AI to get better at philosophy. I'm proposing that perhaps near-future AI could be a useful assistant for writing formal philosophical proofs, and that this may be an important application of AI.


  1. Harsanyi, J. C. (1955). Cardinal Welfare, Individualistic Ethics, and Interpersonal Comparisons of Utility. ↩︎

  2. The Pareto principle states that if outcome A is at least as good as outcome B for every person, and outcome A is better for at least one person, then outcome A is better overall. ↩︎

  3. Arrhenius, G. (2000). An Impossibility Theorem for Welfarist Axiologies. doi: 10.1017/s0266267100000249 ↩︎

  4. Askell, A. (2018). Pareto Principles in Infinite Ethics. ↩︎

  5. The four axioms are "the Pareto principle, transitivity, an axiom stating that populations of worlds can be permuted, and the claim that if the 'at least as good as' relation holds between two worlds then it holds between qualitative duplicates of this world pair". ↩︎

  6. I couldn't have made this statement in 2023. LLMs used to be bad at formal systems, but they've gotten much better. ↩︎

  7. Sen, A. (1970). The Impossibility of a Paretian Liberal. ↩︎

Comments

Nice post! Miscellaneous thoughts:

if individuals have VNM utility functions, and if the Pareto principle holds over groups, then a version of utilitarianism must be true.

Harsanyi's theorem also requires that the social planner's preferences satisfy the VNM axioms.

Not many philosophical proofs have been written

I think this all depends on what you mean by 'many'. I'd guess maybe 10% of analytic philosophy papers include a proof of some kind, so that at least hundreds of proofs are published every year. And in a sense, every valid (spelled-out) argument is a proof.

I agree that the Claude proofs are pretty bad. The Arrhenius point is fairly obvious: what Arrhenius means by 'theories' in that paper is weak orders on populations, so if after taking into account moral uncertainty you still have a weak order, then the impossibility theorem still applies. (And later Arrhenius theorems relax both completeness and transitivity, so even departing from a weak order doesn't get you off the hook.)

Claude makes this kind of point, but first it introduces an Agreement axiom that the proof never uses. Claude later comes close to admitting this ('Agreement plays almost no role'), tries to walk it back ('But Agreement rules out the escape route...'), and then fully admits it ('the fundamental impossibility holds regardless'). 

Which Claude model did you use? Did you use extended thinking? The flip-flopping above makes me think there was no extended thinking, and maybe a model with extended thinking would do better. (Though not much better I'd guess. I've found LLMs to be surprisingly bad at philosophy, even just the 'understanding the view and its implications' parts.)

I didn't bother checking the second population ethics proof but it looks sloppy:

Axiom (Sufficient Comparability). For any pair of populations A, B that differ by at most some fixed bounded amount (e.g., adding or removing one person, or changing one person's welfare level by a small amount), M(μ) must rank A and B (no incomparability for "local" comparisons)."

Don't any pair of populations "differ by at most some fixed bounded amount"? What is Claude doing including 'e.g.'s in its formal statement of axioms?

With some additional effort, present-day LLMs might be capable of coming up with a good novel proof. If not, then it will likely be possible soon. Most kinds of moral philosophy might be difficult for AIs, but proofs are one area where AI assistance seems promising.

Yes, you'd think so given that they've gotten so good at math! But when I've tried using LLMs to help with formal philosophy, I've found them to be really surprisingly bad, even at parts that seem very math-loaded (e.g. inventing proofs, following arguments, grasping views and their implications, coming up with counterexamples, etc.). I'm not sure why this is. I guess part of it is that it's hard to do RLVR on philosophy in the same way that you can do RLVR on math, but naively I'd expect more generalization from math to formal philosophy. Maybe the following is a factor: pretraining data doesn't contain that much bad mathematical reasoning, but it contains a huge amount of bad philosophical reasoning.
