Read Part 1 for more background.

Let’s get on track

If we’re going to understand what constitutes good policy, we need to understand what constitutes a good moral theory. We can gain some insights from looking at the practices of moral philosophy. Yes, you knew damn well that when we started this, we would talk about the trolley problem.

So you’re in class, and your philosophy professor asks why you chose to pull the lever. You say “More lives are better than fewer”, and then they hit you with the Surgeon variant of the problem. Why do they do that? What are we achieving when we pose moral dilemmas?

They’re fake people, but their pain is real. Does that make sense?
Pictured: My partner explaining why reality TV is enjoyable The Good Place’s take on the trolley problem

Your professor is, for each dilemma, getting you to state your intuitive moral preferences over two specified timelines (one in which you pull the lever, and the one in which you don’t). Then you’re supposed to state a rule that consistently represents your moral preferences over all possible dilemmas. And if your proposed rule fails to account for the next contrived dilemma your professor throws at you, you’ve done something wrong. It’s essentially the same as scientific falsification, except we use thought experiments.

Trolley problems are just “Would you Rather” but for adults
People in the philosophy biz often get away with questions that no one else can ask

Now you might wonder, if moral intuitions arbitrate which moral rules are “correct”, why not just use moral intuition to evaluate these dilemmas? What is the point of a moral rule? Well, they have several purposes:

  • Unless we’re born with moral intuitions that are perfectly accurate, we’re going to have some priors that are wrong.
  • Sometimes our moral priors are outright contradictory. For example, “Violence is never moral” and “Culture determines morality” contradict because of the possibility of violent cultures.
  • Sometimes our stated justifications for our moral intuitions are contradictory, but those intuitions could be represented with a non-contradictory moral rule.
  • Often, our intuitive ranking will be incomplete, which makes moral decisions difficult even if we could perfectly predict the consequences.

Finding a moral rule helps us to correct poor moral intuitions, explain our good moral intuitions with more clarity, and guide us where our intuitions don’t exist.

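Thinking of moral rules as summaries of moral preferences over timelines works not just for consequentialist theories, but for any moral theory whose preferences meet two fairly general criteria: completeness (any two timelines can be compared) and transitivity (if A is at least as good as B, and B is at least as good as C, then A is at least as good as C).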
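If you like to see these things spelled out, here is a minimal sketch of what checking those two criteria could look like over a finite set of timelines. The function names, the survivor counts, and the "more lives" and rock-paper-scissors style rankings are all hypothetical, chosen only for illustration.

```python
from itertools import product

def is_complete(items, weakly_prefers):
    """True if every pair of items can be compared in at least one direction."""
    return all(weakly_prefers(a, b) or weakly_prefers(b, a)
               for a, b in product(items, repeat=2))

def is_transitive(items, weakly_prefers):
    """True if a >= b and b >= c always implies a >= c."""
    return all(weakly_prefers(a, c)
               for a, b, c in product(items, repeat=3)
               if weakly_prefers(a, b) and weakly_prefers(b, c))

# Hypothetical ranking: a timeline is at least as good if at least as many survive.
survivors = {"pull the lever": 5, "do nothing": 1}
timelines = list(survivors)

def more_lives(a, b):
    return survivors[a] >= survivors[b]

# Hypothetical cyclic ranking (a beats b, b beats c, c beats a): complete but not transitive.
beats = {("a", "b"), ("b", "c"), ("c", "a")}

def cyclic(x, y):
    return x == y or (x, y) in beats

print(is_complete(timelines, more_lives), is_transitive(timelines, more_lives))
print(is_complete("abc", cyclic), is_transitive("abc", cyclic))
```

The cyclic ranking is the kind of thing a moral rule can't summarise: every pair is comparable, yet there is no consistent "best".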

Here’s to feeling good all the time
Even the really dumb ones

How are we going to link this to goals?

Now this part is extremely important. If we’re clever about how we represent each timeline, and how we represent each goal, we can mathematically connect goals to morality. The most elegant representation, if I do say so myself, is the following.

Let each timeline be the set containing all statements that are empirically true at all times and places within that timeline. For example, “[Unique identifier for you] is reading [unique identifier for this sentence]” would not be in the representation of the timeline that we exist in, because once you begin the next sentence, it’s no longer true. Instead, the statement would have to say something like “[Unique identifier for you] read [unique identifier for that sentence] during [the time period at which you read that sentence]”. For brevity’s sake, the remaining example statements won’t contain unique identifiers or time periods, as the precise meaning of the statements should be clear from the context.

The Hulk solves time travel
What? I see this wordy explanation as an absolute win.

Now if you’re a stickler, and I know some of you are (yes, I’m pointing at you), you’re probably thinking “But we can’t actually talk about timelines, because at best, we only make educated guesses about the future.” So let’s incorporate uncertainty by using “timeline forecasts”.

Let’s represent each forecast as the set of all predictions that we would make about a particular timeline right now. For example, the pull-the-lever forecast contains the statement “The five live, with a credence of 1”. Note that each forecast can roughly be thought of as a probability distribution over timelines.

(Why are they effectively probability distributions? Each timeline contains all statements that are true about itself. So if “You pull the lever”, “The five live”, and “The one dies” are within a timeline, you can string these into a single, larger statement that is also in that timeline: “You pull the lever, and the five live, and the one dies”. So, each timeline contains a very large statement that uniquely identifies it within any finite set of timelines. And that combined statement, amended with an associated credence, will be within our timeline forecast.)
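Here is a rough sketch of that "forecast as a probability distribution" reading. It assumes a tiny, finite world: each timeline is a frozenset of statements, and a forecast is a dictionary from timelines to credences. The second timeline (where the lever jams) and the 0.9/0.1 split are made up purely for illustration.

```python
# Each timeline is the set of statements that are true in it.
lever_works = frozenset({"You pull the lever", "The five live", "The one dies"})
lever_jams = frozenset({"You pull the lever", "The five die", "The one lives"})

# A forecast as a distribution over timelines (credences sum to 1).
certain_forecast = {lever_works: 1.0}
uncertain_forecast = {lever_works: 0.9, lever_jams: 0.1}  # hypothetical numbers

def credence(forecast, statement):
    """Credence in a statement: total credence of the timelines that contain it."""
    return sum(p for timeline, p in forecast.items() if statement in timeline)

print(credence(certain_forecast, "The five live"))         # credence of 1
print(credence(uncertain_forecast, "The five live"))
print(credence(uncertain_forecast, "You pull the lever"))
```

Going in this direction (from a distribution to individual predictions) is the easy one; the parenthetical above argues that the set of predictions carries enough information to go back.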

So, why is this particular representation so useful?

Linking the pieces together

This representation allows us to link moral rules to forecasts to “goalsets”.

Charlie as a crazy detective
Pictured: Me explaining literally anything

We’ll say that a moral rule chooses the best forecasts. I.e. it’s a function such that

  • the input is any set of forecasts, and
  • the output is the set of all forecasts in the input set that are weakly preferred to every forecast in the input set.
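As a sketch, a moral rule of this kind can be built from a weak-preference relation. This reads the rule as a choice function (it returns the members of the input that are at least as good as every member of the input), which is an interpretation on my part. The forecast labels and the expected-survivor numbers are hypothetical stand-ins for full forecasts.

```python
def make_moral_rule(weakly_prefers):
    """Turn a weak-preference relation into a rule that picks the best forecasts."""
    def choose(forecasts):
        return {f for f in forecasts
                if all(weakly_prefers(f, g) for g in forecasts)}
    return choose

# Hypothetical stand-in: label each forecast and rank by expected survivors.
expected_survivors = {"pull-the-lever": 5, "do-nothing": 1}

def more_lives_are_better(f, g):
    return expected_survivors[f] >= expected_survivors[g]

rule = make_moral_rule(more_lives_are_better)
print(rule({"pull-the-lever", "do-nothing"}))
print(rule({"do-nothing"}))
```

Note that the rule only ever compares the options it is handed: with nothing else on the table, "do-nothing" is the best forecast available.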

As for goalsets, they can be “satisfied” by particular forecasts (i.e. your goalset is a subset of that forecast). For example, let’s say your goalset in the trolley problem contains only one statement: “There was no way to save more people, with a credence of 1”. Then only the pull-the-lever forecast satisfies your goalset.

The last piece of this is your “empirical model”, which tells you what your forecast would be if you attempted a given “plan”. (I.e. it’s a function that maps a plan to a forecast.)
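To make the pieces concrete, here is a minimal sketch of a goalset, two forecasts, and an empirical model for the trolley problem. It assumes a forecast is a set of (statement, credence) pairs and that the empirical model is a plain dictionary from plans to forecasts; the plan names and the contents of the don't-pull forecast are illustrative assumptions.

```python
# Forecasts as sets of (statement, credence) predictions.
pull_forecast = frozenset({
    ("You pull the lever", 1.0),
    ("The five live", 1.0),
    ("The one dies", 1.0),
    ("There was no way to save more people", 1.0),
})
no_pull_forecast = frozenset({
    ("The five die", 1.0),
    ("The one lives", 1.0),
})

# Empirical model: maps each plan to the forecast we expect if we attempt it.
empirical_model = {
    "pull the lever": pull_forecast,
    "do nothing": no_pull_forecast,
}

def satisfies(forecast, goalset):
    """A forecast satisfies a goalset when the goalset is a subset of it."""
    return goalset <= forecast

def satisfying_plans(model, goalset):
    """Plans whose forecast satisfies the goalset."""
    return [plan for plan, forecast in model.items() if satisfies(forecast, goalset)]

goalset = {("There was no way to save more people", 1.0)}
print(satisfying_plans(empirical_model, goalset))
```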

Wait, why goalsets? Can’t we evaluate goals individually?

We can’t rely on evaluating goals individually, for two reasons. (1) Any issue with a goal will manifest in every goalset that contains it, so nothing is lost: we can rephrase any criticism of a goal in terms of goalsets. And (2) there are times when an issue only becomes visible once we look at the goalset as a whole.

For example, suppose we’re stranded on an island. Suppose the best outcome occurs when one of us hunts for food and the other builds a shelter, or vice versa. So “I build a shelter” is a goal that fits in one ideal outcome, and “You build a shelter” fits in another. However, if our goalset contains both of those statements, we’ll starve (albeit in the comfort of an excellent shelter).
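The island example can be sketched in a few lines. Everything here is a toy assumption: two tasks, four possible plans, and a made-up model in which we eat only if someone hunts and have shelter only if someone builds. Credences are omitted, as if every prediction had a credence of 1.

```python
TASKS = ("hunt for food", "build a shelter")

def forecast_for(my_task, your_task):
    """Toy empirical model: the statements we predict for a given pair of tasks."""
    tasks = {my_task, your_task}
    return frozenset({
        f"I {my_task}",
        f"You {your_task}",
        "We eat" if "hunt for food" in tasks else "We starve",
        "We have shelter" if "build a shelter" in tasks else "We have no shelter",
    })

forecasts = {(m, y): forecast_for(m, y) for m in TASKS for y in TASKS}

# The ideal outcomes: we both eat and have shelter.
ideal_plans = {plan for plan, f in forecasts.items()
               if {"We eat", "We have shelter"} <= f}

def plans_satisfying(goalset):
    return {plan for plan, f in forecasts.items() if goalset <= f}

# Each goal on its own is compatible with an ideal outcome...
print(plans_satisfying({"I build a shelter"}) & ideal_plans)
print(plans_satisfying({"You build a shelter"}) & ideal_plans)
# ...but the goalset containing both is not.
both = plans_satisfying({"I build a shelter", "You build a shelter"})
print(both, both & ideal_plans)
```

Neither goal looks wrong when inspected alone; the problem only shows up at the level of the goalset.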

Okay? Good. Now we have everything in place.

What’s in the next part?

Remember that our main purpose, from the start of all this, has been to show that mechanism design is the right way to make policy. Hopefully the constraints we outline will also be practical takeaways for you, just as they’ve been for me.

We’ll show that we can make perfect policy with mechanism design (which we’ll rigorously define), as well as use it to navigate towards perfect policy when we don’t have all the answers just yet.

We now have all the tools to objectively define criteria for all of this, and we can prove mathematically how these properties relate to each other. Which, in my opinion, is pretty damn cool. If it interests you too, see you in the next one.


This is Part 2 of a series. I’ll try to write a new post each week. If you want to get notified of each new post, you can subscribe to my account.
