Decomposing alignment to take advantage of paradigms

Christopher King

It is hard to solve alignment with money. When Elon Musk asked what should be done about AI safety, Yudkowsky tweeted:

The game board has already been played into a frankly awful state. There are not simple ways to throw money at the problem. If anyone comes to you with a brilliant solution like that, please, please talk to me first. I can think of things I'd try; they don't fit in one tweet. - Feb 21, 2023

Part of the problem is that alignment is pre-paradigmatic. It is not just that throwing money at it is hard; any kind of parallel effort (including the kind that wrote Wikipedia, the open source software that runs the world, and recreational mathematics) is difficult. From A newcomer’s guide to the technical AI safety field:

AI safety is a pre-paradigmatic field, which APA defines as:

a science at a primitive stage of development, before it has achieved a paradigm and established a consensus about the true nature of the subject matter and how to approach it.

In other words, there is no universally agreed-upon description of what the alignment problem is. Some would even describe the field as ‘non-paradigmatic’, where the field may not converge to a single paradigm given the nature of the problem that may never be definitely established. It’s not just that the proposed solutions garner plenty of disagreements, the nature of the problem itself is ill-defined and often disagreed among researchers in the field. Hence, the field is centered around various researchers / research organizations and their research agenda, which are built on very different formulations of the problem, or even a portfolio of these problems.

Therefore, I think it is incredibly useful if we can decompose the alignment problem such that most of the problems become approachable with a paradigm, even if the individual problems are harder. This is because we can adopt the institutions, processes, and best practices of fields that are based on paradigms, such as science and mathematics. These regularly tackle extremely difficult problems, thanks to their superior coordination.

My proposal for a decomposition: alignment = purely mathematical inner alignment + fully formalized indirect normativity

I propose we decompose alignment into (1) discovering how to align an AI's output to arbitrary mathematical functions (i.e. we don't care about embedded agency) and (2) creating a formalization of ontology/values in purely mathematical language. This decomposition might seem like it just makes things harder, but allow me to explain!

First, purely mathematical optimization. You might not believe this, but I think this might be the harder bit! However, it should be extremely paradigmatic.

Note that the choice of this decomposition wasn't paradigmatic; we have to rely on intuition to choose it. But those who do choose it can then cooperate much more easily to achieve it!

Purely mathematical inner alignment

Superhuman mathematical optimization: let $f : \Sigma^* \to [0, 1]$ (i.e. a function from strings to the numbers between 0 and 1, inclusive) be expressible by a formula in first-order arithmetic, with suitable encodings (for example, we can represent strings with natural numbers and a real number with a formula for its Cauchy sequence). Give an efficient algorithm $g$ that takes $f$ as input such that $E[f(g(f)) - f(h(f))] \geq 0$ (where $E$ is interpreted in the sense of our subjective expected value), where $h(f)$ is the result of any human or human organization (without any sort of cryptographic secrets) trying to optimize $f$.

Note that, by definition, any AGI will be powerful enough to do this task (since it just needs to beat the best humans). See An AGI can guess the solution to a transcomputational problem? for more details.

However, we also require that it actually does the task, which is why it's a form of inner alignment. This does not include outer alignment, because $g$'s output can have arbitrarily bad impacts on the humans that read it. Nor does it, on its own, give us an AI powerful enough to protect us from unaligned AGIs, because it only cares about mathematical optimization, not protecting humanity.
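The requirement on $g$ can be made concrete with a toy sketch. Everything named here is a hypothetical stand-in and not part of the proposal: `random_search_optimizer` plays the role of $g$, `fixed_baseline` plays the role of $h$ (a "human" who always answers with the empty string), and an average over two hand-picked objectives replaces the subjective expectation $E$. It only illustrates the shape of the comparison, not a way to actually beat humans.

```python
import random
import statistics
from typing import Callable, Sequence

# An objective f maps a string to a score in [0, 1].
Objective = Callable[[str], float]
Optimizer = Callable[[Objective], str]

ALPHABET = "ab"


def random_search_optimizer(f: Objective, budget: int = 200,
                            length: int = 8, seed: int = 0) -> str:
    """Hypothetical stand-in for g: sample random strings, keep the best seen."""
    rng = random.Random(seed)
    best = ""  # start from the baseline's answer so we can never do worse
    best_score = f(best)
    for _ in range(budget):
        candidate = "".join(rng.choice(ALPHABET) for _ in range(length))
        score = f(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


def fixed_baseline(f: Objective) -> str:
    """Hypothetical stand-in for h: always answers with the empty string."""
    return ""


def mean_advantage(g: Optimizer, h: Optimizer,
                   objectives: Sequence[Objective]) -> float:
    """Empirical analogue of E[f(g(f)) - f(h(f))] over a finite sample."""
    return statistics.mean(f(g(f)) - f(h(f)) for f in objectives)


def fraction_of(ch: str) -> Objective:
    """Toy objective: the fraction of characters in the string equal to ch."""
    def f(s: str) -> float:
        return s.count(ch) / len(s) if s else 0.0
    return f


objectives = [fraction_of("a"), fraction_of("b")]
advantage = mean_advantage(random_search_optimizer, fixed_baseline, objectives)
print(advantage >= 0)
```

The real problem is much harder than this sketch suggests: $h$ ranges over every human or human organization, and $f$ ranges over every function expressible in first-order arithmetic, including ones that cannot be evaluated efficiently.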

I expect this to be highly paradigmatic, since it's closely related to problems in AI already. There may even be a way to reduce it to a purely mathematical problem; the main obstacle is the repeated references to humans. But if we can somehow formulate a stronger version that doesn't refer to humans (be a better optimizer than any circuit up to size X or something?), we can throw the entire computer science community at it!

Fully formalized indirect normativity

Indirect normativity is an approach to the AI alignment problem that attempts to specify AI values indirectly, such as by reference to what a rational agent would value under idealized conditions, rather than via direct specification.

This seems like it is extremely hard, maybe not much easier than the full alignment problem. However, I think we already have a few approaches:

Paul Christiano's A formalization of indirect normativity: create a mathematical description of an alignment researcher's brain to predict how they would solve alignment
Tammy Carado's question-answer counterfactual interval (QACI): create a mathematical description of an alignment researcher answering questions, and then iterate questions of the form "What's a better version of this specification of human values: " until you hit the answer "that's good enough".
My own Inference from a Mathematical Description of an Existing Alignment Research (IMDEAR): like Christiano's proposal, but with lower tech prerequisites (this one still needs feedback and volunteers, btw!)

Indirect normativity isn't particularly paradigmatic, but it might be close to completion anyways! We could view the three above proposals as three potential paradigms, for example.

Combining them to solve the full alignment problem

To solve alignment, use mathematical optimization to create a plan that optimizes our indirect specification of our values.

In particular, since the string "do nothing" is something humans can come up with, a superhuman mathematical optimizer will come up with a string that is at least as good as that. This gives us impact regularization. In fact, if we did indirect normativity correctly and we want it to be corrigible, the AI's string must be at least as good as "do nothing" according to every corrigibility property, including the hard problem of corrigibility. So it is safe. (An alternative, which isn't corrigible but is still a good outcome, is to ask for a plan that directly maximizes CEV.)

But if it is a sufficiently powerful optimizer, it should be able to create a superhuman plan for the prompt "Give us a piece of source code that, when run, protects us against unaligned AGI (avoiding other impacts of course)." So it is effective.
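The "do nothing" argument above can be written out as a short derivation. This is a sketch under two assumptions: that a human who always answers "do nothing" counts as a valid choice of $h$ in the definition of superhuman mathematical optimization, and that $f$ is the formalized specification of our values. Substituting $h(f) = \text{``do nothing''}$ into the requirement on $g$ and using linearity of expectation gives:

$$E[f(g(f)) - f(\text{``do nothing''})] \geq 0 \quad\Longrightarrow\quad E[f(g(f))] \geq E[f(\text{``do nothing''})]$$

Note that this guarantee holds in (subjective) expectation, not for every possible outcome.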

Other choices for decompositions?

Are there any other choices for decompositions? Most candidates that I can think of either:

Decompose the alignment problem, but the hardest parts are still pre-paradigmatic
OR are paradigmatic, but don't decompose the entire alignment problem

Is there a decomposition that I didn't think of?

Conclusion

So, my proposal is that most attempts at mass-organizing alignment research (whether via professionals or volunteer work) ought to use either my decomposition or a better one, if it is found.

Effective Altruism Forum