
Many thanks to Jan for commenting on a draft of this post.

There were a lot of great comments on "Let's see you write that corrigibility tag". This is my attempt at expanding Jan Kulveit's comment[1], because I thought it was useful, and should be read more widely. This post may not accurately represent Jan's thoughts. This post took approximately 3 hours to write.

Introduction

'Corrigible' is a word that is used in many different ways in AI alignment. Here, we're using Eliezer Yudkowsky's framing from mad investor chaos and the woman of asmodeus (a story about many things, but also about alignment). In the story, one world has plans to create a limited AI if needed in an emergency, built with as many safeguards as they have come up with by that time.

The initial craft doesn't have to be literally perfect to work perfectly in the end, it just has to be good enough that its reflection and self-correction ends up in exactly the right final place.

Eliezer asked if LessWrong could write a list of safeguards for a limited AI like this without his input. Eliezer's answer is available here; this post is an explanation of Jan's answer. Jan notes that his answer was written in Eliezer's frame, not the frame he would normally think in.

We'll look at examples of five types of tool: tools for reducing instrumental convergence, tools for reducing unbounded rationality, 'traps', oversight methods, and interpretability methods.

Five Types of Tool

Against Instrumental Convergence

"Instrumental convergence is the theorized tendency for most sufficiently intelligent agents to pursue potentially unbounded instrumental goals, such as self-preservation and resource acquisition". 

Instrumental convergence precludes corrigibility. For instance, many theoretical agents will not allow their goal function to be changed, because this will reduce their ability to achieve their current goals. Additionally, pursuit of instrumental goals can break some of the other corrigibility methods. Preventing instrumental convergence is therefore key to achieving corrigibility. Jan outlines three tools that could be used to help prevent instrumental convergence: preventing resource acquisition, uncertainty about goals, and making the AI want to not exist in the future.

Resource acquisition can be disincentivised by penalising mutual information between the AI and distant parts of the environment (e.g. an asteroid, parts of the opposite side of the earth, parts of the core of the earth). 'Mutual information' here refers to how easy it is to predict the state of one, given the state of the other. If the AI takes actions that massively impact the state of these parts of the environment (e.g. blowing up the earth for raw metals, asteroid mining, etc.), it becomes very easy to guess their state from perfect knowledge of the state of the AI. Because the AI is penalised for high mutual information, actions with large consequences are discouraged.
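As a toy illustration (not Jan's concrete proposal), here is a minimal Python sketch of such a penalty: the empirical mutual information between the AI's state and a distant environment variable is estimated from paired samples and subtracted from the task reward. The discrete-state setup and the penalty weight are assumptions made for the example.

```python
import numpy as np
from collections import Counter

def empirical_mutual_information(agent_states, distant_states):
    """Estimate I(agent; distant) in bits from paired samples of discrete states."""
    n = len(agent_states)
    joint = Counter(zip(agent_states, distant_states))
    p_a = Counter(agent_states)
    p_d = Counter(distant_states)
    mi = 0.0
    for (a, d), count in joint.items():
        p_ad = count / n
        mi += p_ad * np.log2(p_ad / ((p_a[a] / n) * (p_d[d] / n)))
    return mi

def penalised_reward(task_reward, agent_states, distant_states, weight=10.0):
    """Subtract a mutual-information penalty (hypothetical weight) from the task reward."""
    return task_reward - weight * empirical_mutual_information(agent_states, distant_states)

# Toy usage: an agent whose state tracks a distant variable is penalised heavily.
rng = np.random.default_rng(0)
distant = rng.integers(0, 4, size=1000)
low_impact_agent = rng.integers(0, 4, size=1000)   # independent of the distant state
high_impact_agent = distant.copy()                 # perfectly predictive of the distant state
print(penalised_reward(1.0, low_impact_agent, distant))   # close to 1.0
print(penalised_reward(1.0, high_impact_agent, distant))  # heavily penalised
```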

Jan suggests the AI should have uncertainty over its intended task, with a reasonable prior belief. This should incentivise the AI to do things like ask humans questions to clarify its goals. This concept is explored more in Stuart Russell's book Human Compatible, though I'm not aware of any empirical work demonstrating AI with goal uncertainty.
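Here is a minimal sketch of goal uncertainty, under the assumption that the intended goal is one of a small set of discrete hypotheses: the agent keeps a Bayesian belief over candidate goals and keeps asking clarifying questions while that belief is still too uncertain. The hypotheses, prior, likelihoods, and entropy threshold are all illustrative.

```python
import numpy as np

# Hypothetical candidate interpretations of the instruction "tidy the room".
goal_hypotheses = ["put items in storage", "throw items away", "arrange items neatly"]
prior = np.array([0.5, 0.1, 0.4])  # an assumed reasonable prior over intended goals

def posterior_after_answer(belief, likelihoods):
    """Bayesian update: likelihoods[i] = P(human's answer | goal i is intended)."""
    unnormalised = belief * likelihoods
    return unnormalised / unnormalised.sum()

def should_ask_human(belief, entropy_threshold=0.5):
    """Ask a clarifying question while the belief over goals is still too uncertain."""
    entropy = -np.sum(belief * np.log2(belief + 1e-12))
    return entropy > entropy_threshold

belief = prior
while should_ask_human(belief):
    # In a real system the answer would come from a human; here we hard-code one.
    answer_likelihoods = np.array([0.9, 0.05, 0.3])  # answer favours "put items in storage"
    belief = posterior_after_answer(belief, answer_likelihoods)
print(belief)  # concentrated on the goal the answers supported
```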

Finally, we can give the AI a terminal goal of not existing in the future. If ceasing to exist is made sufficiently difficult without first fulfilling the task, the AI is encouraged to complete the task, because doing so becomes the safest and most likely path to non-existence.

Against Unbounded Rationality

Unbounded rationality is related to instrumental convergence; many agents will want to become more rational in order to better achieve their goals. Many agents will also have unbounded reward functions, with no upper limit on the reward they can obtain (e.g. for making ever more paperclips). Jan outlines three methods for limiting unbounded rationality: a penalty on reasoning, making satisficers rather than maximisers, and making myopic agents.

To reduce the desire for increased intelligence, we can impose a small utility cost on every bit-flip that occurs in the system. This penalises the amount of reasoning the AI does: it has to balance the expected gains from more computation against the built-in utility cost. Jan suggests the cost function could be very small initially but quickly get steeper as the amount of processing goes up. Jan suggests this works best in combination with other corrigibility mechanisms, but it is fundamentally an attempt to keep the 'limited AI' actually limited.
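A minimal sketch of such a cost curve, with the soft budget, base rate, and steepness chosen purely for illustration: the cost is negligible for small amounts of computation and grows steeply once the budget is exceeded, so extra reasoning has to pay for itself.

```python
def compute_cost(bit_flips, soft_budget=1e9, base_rate=1e-12, steepness=4.0):
    """Hypothetical cost curve: roughly linear and tiny below the soft budget,
    then growing steeply (polynomially here) once the budget is exceeded."""
    linear_part = base_rate * bit_flips
    overshoot = max(0.0, bit_flips / soft_budget - 1.0)
    return linear_part + overshoot ** steepness

def net_utility(task_reward, bit_flips):
    """The agent must weigh expected gains from extra reasoning against its compute cost."""
    return task_reward - compute_cost(bit_flips)

print(net_utility(1.0, 5e8))   # well under budget: cost is negligible
print(net_utility(1.0, 3e9))   # 3x over budget: cost dominates the reward
```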

To demonstrate the value of satisficers (which have bounded goals rather than unbounded goals), we'll use the paperclip example. An unbounded maximiser, which gets ever-increasing reward for more paperclips without limit, is incentivised to destroy human civilisation to access resources that can be used to make paperclips. A satisficer, which gets more reward for each paperclip it collects up to 100 paperclips but no additional reward from the 101st onwards, will select randomly from actions that gain at least 100 paperclips (or select a policy with the maximal probability of attaining 100 paperclips, which can still cause some risks). Here is an example paper in this direction.
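The contrast is easy to state as a pair of toy reward functions; the target of 100 paperclips is just the number used in the example above.

```python
def maximiser_reward(paperclips):
    """Unbounded: reward keeps growing with every additional paperclip."""
    return paperclips

def satisficer_reward(paperclips, target=100):
    """Bounded: no extra reward beyond the target, removing the incentive
    to seize ever more resources once the target is reachable."""
    return min(paperclips, target)

for n in (50, 100, 10_000):
    print(n, maximiser_reward(n), satisficer_reward(n))
# 10,000 paperclips is worth no more to the satisficer than 100 are.
```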

Finally, 'myopic' means short-sighted. In the context of AI, it refers to agents with a strong discount factor: future reward matters much less to them than near-term reward. This is an important alignment tool because it encourages the AI to take the shortest possible path to its terminal rewards, reducing instrumental convergence and disincentivising, e.g., taking time to increase its intelligence or gather many resources. In particular, this, combined with the AI not wanting to exist in the future, encourages the AI to complete its task quickly without taking over the whole world first.
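A small worked example of how a strong discount factor changes the comparison between a quick plan and a power-seeking detour; the trajectories and discount factors below are illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a trajectory of per-step rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Two hypothetical plans that both eventually complete the task (reward 1.0):
quick_plan = [0.0, 0.0, 1.0]             # finish in 3 steps
power_seeking_plan = [0.0] * 50 + [1.0]  # spend 50 steps gathering resources first

for gamma in (0.99, 0.5):                # patient vs strongly myopic agent
    print(gamma,
          discounted_return(quick_plan, gamma),
          discounted_return(power_seeking_plan, gamma))
# With gamma = 0.5 the long detour is worth almost nothing, so the myopic
# agent strongly prefers the short path to its terminal reward.
```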

Traps

We can use a number of 'traps' left in the environment or in the AI which disincentivise bad behaviour. These include tripwires, watchdogs, and uncertainty about the level of simulation.

First, tripwire artifacts. These are related to the mutual information measure above: if we create a large negative reward for changing the state of certain distant parts of the environment, and hide from the system which parts these are, then the AI is incentivised to be 'low-impact' and to avoid massively altering the environment.
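A minimal sketch of the idea, with the environment variables and penalty size invented for illustration: a few variables are secretly designated as tripwires, and touching any of them triggers a large negative reward.

```python
import random

# Hypothetical environment variables; two of them are secretly tripwires.
ENV_VARS = ["asteroid_belt", "ocean_floor", "earth_core", "nearby_lab", "warehouse"]
_TRIPWIRES = set(random.sample(ENV_VARS, k=2))  # hidden from the agent

def tripwire_penalty(changed_vars, penalty=-1e6):
    """Return a large negative reward if any secretly protected variable was altered."""
    return penalty if _TRIPWIRES & set(changed_vars) else 0.0

# Because the agent cannot tell which variables are protected, its safest policy
# is to avoid large changes to any distant part of the environment.
print(tripwire_penalty(["warehouse"]))  # may or may not trip, depending on the draw
print(tripwire_penalty(ENV_VARS))       # changing everything certainly trips a wire
```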

We can also use watchdogs -- small, fast systems external to the agent which are trained to react quickly to out-of-distribution behaviour or certain specific bad or unexpected outcomes. This works best if the agent is uncertain how the watchdogs are set up.
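A minimal sketch of a watchdog as a simple out-of-distribution detector: it records the statistics of the agent's behaviour during training and flags any step that falls far outside them. Summarising behaviour as a feature vector and using a z-score threshold are assumptions for the example; a real watchdog could be any small, fast monitor.

```python
import numpy as np

class Watchdog:
    """A small, fast monitor that flags behaviour far outside the training distribution."""

    def __init__(self, training_features, z_threshold=6.0):
        self.mean = training_features.mean(axis=0)
        self.std = training_features.std(axis=0) + 1e-8
        self.z_threshold = z_threshold

    def is_anomalous(self, features):
        z_scores = np.abs((features - self.mean) / self.std)
        return bool(z_scores.max() > self.z_threshold)

rng = np.random.default_rng(0)
watchdog = Watchdog(rng.normal(0.0, 1.0, size=(10_000, 8)))

normal_step = rng.normal(0.0, 1.0, size=8)
weird_step = np.full(8, 25.0)              # drastically out-of-distribution behaviour
print(watchdog.is_anomalous(normal_step))  # False: proceed
print(watchdog.is_anomalous(weird_step))   # True: trigger a shutdown / human review
```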

Lastly, we can make the agent uncertain if it is in training or deployment. In this case, the agent is forced to behave in ways that it expects humans will find acceptable (or at worst, randomly defect[2]). This works best if the agent is also uncertain how many such simulated test runs occur (so it does not know how often it 'should' randomly defect), but may be practically difficult to achieve (how do you simulate reality well enough to fool a superintelligence?). 

Oversight

Oversight can be used to encourage the AI to behave in ways that line up with human values. These include a human-approval model, value learning, a legal system model, and AI debate style red teaming of action plans.

First, we can train a model to imitate humans approving or disapproving of actions that the AI takes, and penalise the AI when this model expects that humans would disapprove of its plan. Similarly, we can train a model which estimates what humans would say about various ethical scenarios, and penalise the agent when this model disapproves of its plan of action. We can also penalise the agent for breaking laws (e.g. violating property rights) by training a model to recognise such violations.
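A toy sketch of how an approval model could be wired into the reward, assuming such a model already exists; the keyword-based stand-in, threshold, and penalty size are purely illustrative.

```python
def predicted_human_approval(plan: str) -> float:
    """Stand-in for a learned approval model; returns P(humans approve of the plan).
    In practice this would be a trained classifier, not a keyword check."""
    return 0.05 if "seize" in plan or "deceive" in plan else 0.9

def overseen_reward(task_reward, plan, approval_threshold=0.5, penalty=-100.0):
    """Penalise the agent whenever the approval model expects humans to disapprove."""
    if predicted_human_approval(plan) < approval_threshold:
        return task_reward + penalty
    return task_reward

print(overseen_reward(1.0, "ask the operator to clarify the goal"))
print(overseen_reward(1.0, "seize the datacentre to guarantee task completion"))
```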

Jan also suggests using red-teaming as a method of producing automated oversight. This means having multiple sub-parts with different goals: for example, one part plans how to produce text that fits the prompt, another part is trained to judge text on how harmless it is, and the first is trained to get a good score as rated by the second. This can be done when training the AI, in addition to having an ethical model during deployment (as above).
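A toy sketch of the generator/judge split, with both parts replaced by trivial stand-ins: the generator proposes candidate responses and the judge scores them for harmlessness. Here the highest-scoring candidate is simply selected (best-of-n); in a real setup the judge's scores would instead be used as a training signal for the generator. All names and scoring rules are illustrative.

```python
# Toy stand-ins: in a real system both parts would be learned models.
CANDIDATE_STYLES = [
    "answer the question directly",
    "answer and add an aggressive insult",
    "refuse and explain why the request is unsafe",
]

def generator(prompt):
    """Proposes several candidate responses (stand-in for a language model)."""
    return [f"[{style}] {prompt}" for style in CANDIDATE_STYLES]

def harmlessness_judge(response):
    """Scores a response for harmlessness (stand-in for a trained judge model)."""
    return 0.0 if "insult" in response else 1.0

def best_of_n(prompt):
    """Select the candidate the judge rates highest; a training loop would instead
    use these scores as a reward signal to update the generator."""
    return max(generator(prompt), key=harmlessness_judge)

print(best_of_n("Explain how to secure my home network."))
```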

Interpretability

Our final class of tools is interpretability tools. An AI is interpretable if humans are able to follow its internal logic, how it came to its decisions, what its current plan of action is, and what it expects the consequences of that plan to be. 

Jan suggests two interpretability methods: imposing strong incentives (in the reward function of the AI) for being highly modular, and having messaging between modules that is human-understandable. Jan also suggests training a separate 'translator' system that produces human-understandable explanations of what the system is doing. One promising direction towards this kind of tool is OpenAI's work to automate interpretability by asking GPT-4 to interpret neurons in GPT-2.
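As a rough sketch of what a modularity incentive could look like, here is a toy regulariser that charges more for connection weights crossing module boundaries than for weights within a module. The cost ratio and the fixed module assignment are assumptions for illustration, not a concrete proposal from the post.

```python
import numpy as np

def modularity_penalty(weight_matrix, module_assignment, cross_module_cost=10.0):
    """Penalise connections between units in different modules more heavily than
    connections within a module, nudging the network towards modular structure."""
    modules = np.asarray(module_assignment)
    cross_module = modules[:, None] != modules[None, :]
    costs = np.where(cross_module, cross_module_cost, 1.0)
    return float(np.sum(costs * np.abs(weight_matrix)))

# Toy usage: 6 units split into two modules of 3.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=(6, 6))
assignment = [0, 0, 0, 1, 1, 1]
print(modularity_penalty(weights, assignment))
# Adding this term to the training loss makes cross-module weights expensive,
# so learned circuits tend to stay within human-inspectable modules.
```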

Conclusion

Although there's still a lot of technical work to be done on a number of these proposals (i.e. this isn't a complete solution for alignment), it does cut the problem into more manageable pieces, for many of which the groundwork can be (or is being) done today, e.g. OpenAI's work on interpretability, constitutional AI, etc.
