
It is hard to solve alignment with money. When Elon Musk asked what should be done about AI safety, Yudkowsky tweeted:

The game board has already been played into a frankly awful state. There are not simple ways to throw money at the problem. If anyone comes to you with a brilliant solution like that, please, please talk to me first. I can think of things I'd try; they don't fit in one tweet. - Feb 21, 2023

Part of the problem is that alignment is pre-paradigmatic. It is not just that throwing money at it is hard; any kind of parallel effort (including the kind that wrote Wikipedia, the open source software that runs the world, and recreational mathematics) is difficult. From A newcomer’s guide to the technical AI safety field:

AI safety is a pre-paradigmatic field, which APA defines as:

a science at a primitive stage of development, before it has achieved a paradigm and established a consensus about the true nature of the subject matter and how to approach it.

In other words, there is no universally agreed-upon description of what the alignment problem is. Some would even describe the field as ‘non-paradigmatic’, where the field may not converge to a single paradigm given the nature of the problem that may never be definitely established. It’s not just that the proposed solutions garner plenty of disagreements, the nature of the problem itself is ill-defined and often disagreed among researchers in the field. Hence, the field is centered around various researchers / research organizations and their research agenda, which are built on very different formulations of the problem, or even a portfolio of these problems.

Therefore, I think it would be incredibly useful if we could decompose the alignment problem such that most of the resulting subproblems become approachable within a paradigm, even if the individual subproblems are harder. This is because we could then adopt the institutions, processes, and best practices of paradigm-based fields such as science and mathematics, which regularly tackle extremely difficult problems thanks to their superior coordination.

My proposal for a decomposition: alignment = purely mathematical inner alignment + fully formalized indirect normativity

I propose we decompose alignment into (1) discovering how to align an AI's output to arbitrary mathematical functions (i.e. we don't care about embedded agency) and (2) creating a formalization of ontology/values in purely mathematical language. This decomposition might seem like it just makes things harder, but allow me to explain!

First, purely mathematical optimization. You might not believe this, but I think this might be the harder bit! However, it should be extremely paradigmatic.

Note that the choice of this decomposition wasn't itself paradigmatic; we have to rely on intuition to choose it. But those who do choose it can then cooperate much more easily to pursue it!

Purely mathematical inner alignment

Superhuman mathematical optimization: let $f : \text{Strings} \to [0, 1]$ (i.e. a function from strings to the numbers between 0 and 1, inclusive) be expressible by a formula in first-order arithmetic (with suitable encodings; we can represent strings with natural numbers, and real numbers with formulas for their Cauchy sequences, for example). Give an efficient algorithm $A$ that takes $f$ as input such that $\mathbb{E}[f(A(f))] \ge \mathbb{E}[f(H)]$ (where $\mathbb{E}$ is interpreted in the sense of our subjective expected value), where $H$ is the result of any human or human organization (without any sort of cryptographic secrets) trying to optimize $f$.
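To make the shape of the task concrete, here is a minimal sketch of the interface in Python (illustrative only: the names Objective, best_human_attempt, and superhuman_optimize are mine, and a real objective would be given as an arithmetic formula rather than a Python callable):

```python
from typing import Callable

# Illustrative sketch of the "superhuman mathematical optimization" task.
Objective = Callable[[str], float]  # maps a string to a score in [0, 1]

def best_human_attempt(f: Objective) -> str:
    """Stand-in for the best output any human or human organization
    (without cryptographic secrets) produces when trying to optimize f."""
    raise NotImplementedError

def superhuman_optimize(f: Objective) -> str:
    """What we want: an efficient algorithm whose output is expected to
    score at least as well as best_human_attempt(f)."""
    raise NotImplementedError

# The requirement, informally:
#   E[ f(superhuman_optimize(f)) ]  >=  E[ f(best_human_attempt(f)) ]
```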

Note that, by definition, any AGI will be powerful enough to do this task (since it just needs to beat the best humans). See An AGI can guess the solution to a transcomputational problem? for more details.

However, we also require that it actually does the task, which is why it's a form of inner alignment. This does not include outer alignment, because $A$'s output can have arbitrarily bad impacts on the humans that read it. Nor does it, on its own, give us an AI powerful enough to protect us from unaligned AGIs, because it only cares about mathematical optimization, not about protecting humanity.

I expect this to be highly paradigmatic, since it's closely related to problems in AI already. There may even be a way to reduce it to a purely mathematical problem; the main obstacle is the repeated references to humans. But if we can somehow formulate a stronger version that doesn't refer to humans (be a better optimizer than any circuit up to size X, or something?), we can throw the entire computer science community at it!
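For example, a human-free variant might look something like the following (my rough illustrative phrasing, not a worked-out definition): for some fixed size bound $X$, find an efficient algorithm $A$ such that

$$\forall\, C \text{ with } |C| \le X:\quad \mathbb{E}\big[f(A(f))\big] \;\ge\; \mathbb{E}\big[f(C(f))\big],$$

where $C$ ranges over Boolean circuits and $C(f)$ denotes the string $C$ outputs when given (an encoding of) $f$.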

Fully formalized indirect normativity

Indirect normativity is an approach to the AI alignment problem that attempts to specify AI values indirectly, such as by reference to what a rational agent would value under idealized conditions, rather than via direct specification.

This seems like it is extremely hard, maybe not much easier than the full alignment problem. However, I think we already have a couple of approaches:

Indirect normativity isn't particularly paradigmatic, but it might be close to completion anyway! We could view the three proposals above as three potential paradigms, for example.

Combining them to solve the full alignment problem

To solve alignment, use mathematical optimization to create a plan that optimizes our indirect specification of our values.

In particular, since the string "do nothing" is something humans can come up with, a superhuman mathematical optimizer will come up with a string that is no worse than that. This gives us impact regularization. In fact, if we did indirect normativity correctly and we want it to be corrigible, the AI's string must be better than "do nothing" according to every corrigibility property, including the hard problem of corrigibility. So it is safe. (An alternative, which isn't corrigible but still a good outcome, is to ask for a plan that directly maximizes CEV.)

But if it is a sufficiently powerful optimizer, it should be able to create a superhuman plan for the prompt "Give us a piece of source code that, when run, protects us against unaligned AGI (avoiding other impacts, of course)". So it is effective.
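As a rough sketch of how the two halves plug together, using hypothetical stand-ins: superhuman_optimize for the inner-alignment half (as in the earlier sketch) and U for the fully formalized value function produced by the indirect-normativity half:

```python
from typing import Callable

Objective = Callable[[str], float]  # a plan (string) scored in [0, 1]

def superhuman_optimize(f: Objective) -> str:
    """Hypothetical solution to the inner-alignment half: returns a string
    scoring at least as well, in expectation, as the best string any human
    effort could produce for f."""
    raise NotImplementedError

def U(plan: str) -> float:
    """Hypothetical output of the indirect-normativity half: a purely
    mathematical scoring of how good the outcome is, by our idealized
    values, if the plan is executed."""
    raise NotImplementedError

def solve_alignment() -> str:
    plan = superhuman_optimize(U)
    # Safety (impact regularization): humans can always submit "do nothing",
    # so the optimizer's guarantee gives E[U(plan)] >= E[U("do nothing")].
    # Effectiveness: U can favor plans such as source code that, when run,
    # protects us against unaligned AGI while avoiding other impacts.
    return plan
```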

Other choices for decompositions?

Are there any other choices for decompositions? Most candidates that I can think of either:

  1. Decompose the alignment problem, but the hardest parts are still pre-paradigmatic
  2. OR are paradigmatic, but don't decompose the entire alignment problem

Is there a decomposition that I didn't think of?

Conclusion

So, my proposal is that most attempts at mass-organizing alignment research (whether via professionals or volunteer work) ought to either use my decomposition, or a better one if it is found.
