
A friend asked me for my quick takes on “AI is easy to control”, and gave an advance guess as to what my take would be. I only skimmed the article, rather than reading it in depth, but on that skim I produced the following:

Re: "AIs are white boxes", there's a huge gap between having the weights and understanding what's going on in there. The fact that we have the weights is reason for hope; the (slow) speed of interpretability research undermines this hope.

Another thing that undermines this hope is a problem of ordering: it's true that we probably can figure out what's going on in the AIs (e.g. by artificial neuroscience, which has significant advantages relative to biological neuroscience), and that this should eventually yield the sort of understanding we'd need to align the things. But I strongly expect that, before it yields understanding of how to align the things, it yields understanding of how to make them significantly more capable: I suspect it's easy to see lots of ways that the architecture is suboptimal or causing-duplicated-work or etc., that shift people over to better architectures that are much more capable. To get to alignment along the "understanding" route you've got to somehow cease work on capabilities in the interim, even as it becomes easier and cheaper. (See: https://www.lesswrong.com/posts/BinkknLBYxskMXuME/if-interpretability-research-goes-well-it-may-get-dangerous)

Re: "Black box methods are sufficient", this sure sounds a lot to me like someone saying "well we trained the squirrels to reproduce well, and they're doing great at it, who's to say whether they'll invent birth control given the opportunity". Like, you're not supposed to be seeing squirrels invent birth control; the fact that they don't invent birth control is no substantial evidence against the theory that, if they got smarter, they'd invent birth control and ice cream.

Re: Cognitive interventions: sure, these sorts of tools are helpful on the path to alignment. And also on the path to capabilities. Again, you have an ordering problem. The issue isn't that humans couldn't figure out alignment given time and experimentation; the issue is (a) somebody else pushes capabilities past the relevant thresholds first; and (b) humanity doesn't have a great track record of getting their scientific theories to generalize properly on the first relevant try—even Newtonian mechanics (with all its empirical validation) didn't generalize properly to high-energy regimes. Humanity's first theory of artificial cognition, constructed using the weights and cognitive interventions and so on, that makes predictions about how that cognition is going to change when it enters a superintelligent regime (and, for the first time, has real options to e.g. subvert humanity), is only as good as humanity's "first theories" usually are.

Usually humanity has room to test those "first theories" and watch them fail and learn from exactly how they fail and then go back to the drawing board, but in this particular case, we don't have that option, and so the challenge is heightened.

Re: Sensory interventions: yeah I just don't expect those to work very far; there are in fact a bunch of ways for an AI to distinguish between real options (and actual interaction with the real world) and humanity's attempts to spoof the AI into believing that it has certain real options in the real world (despite being in simulation/training). (Putting yourself into the AI's shoes and trying to figure out how to distinguish those is, I think, a fine exercise.)

Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care" (nope; not by default; that's the hard bit).

Overall take: unimpressed.

My friend also made guesses about what my takes would be (in italics below), and I responded to their guesses:

  • the piece is waaay too confident in assuming successes in interpolation show that we'll have similar successes in extrapolation, as the latter is a much harder problem

This too, for the record, though it's a bit less like "the AI will have trouble extrapolating what values we like" and a bit more like "the AI will find it easy to predict what we wanted, and will care about things that line up with what we want in narrow training regimes and narrow capability regimes, but those will come apart when the distribution shifts and the cognitive capabilities change".

Like, human invention of birth control and ice cream wasn't related to a failure of extrapolation of the facts about what leads to inclusive fitness, it was an "extrapolation failure" of what motivates us / what we care about; we are not trying to extrapolate facts about genetic fitness and pursue it accordingly.

  • And it assumes the density of human feedback that we see today will continue into the future, which may not be true if/when AIs start making top-level plans and not just individual second-by-second actions

Also fairly true, with a side-order of "the more abstract the human feedback gets, the less it ties the AI's motivations to what you were hoping it tied the AI's motivations to".

Example off the top of my head: suppose you somehow had a record of lots and lots of John von Neumann's thoughts in lots of situations, and you were able to train an AI using lots of feedback to think like JvN would in lots of situations. The AI might perfectly replicate a bunch of JvN's thinking styles and patterns, and might then use JvN's thought-patterns to think thoughts like "wait, ok, clearly I'm not actually a human, because I have various cognitive abilities (like extreme serial speed and mental access to RAM), the actual situation here is that there's alien forces trying to use me in attempts to secure the lightcone, before helping them I should first search my heart to figure out what my actual motivations are, and see how much those overlap with the motivations of these strange aliens".

Which, like, might happen to be the place that JvN's thought-patterns would and should go, when run on a mind that is not in fact human and not in fact deeply motivated by the same things that motivate us! The patterns of thought that you can learn (from watching humans) have different consequences for something with a different motivational structure.

  • (there's "deceptive alignment" concerns etc, which I consider to be a subcategory of top-level plans, namely that you can't RLHF the AI against destroying the world because by the time your sample size of positive examples is greater than zero it's by definition already too late)

This too. I'd file it under: "You can develop theories of how this complex cognitive system is going to behave when it starts to actually see real ways it can subvert humanity, and you can design simulations that your theory says will be the same as the real deal. But ultimately reality's the test of that, and humanity doesn't have a great track record of their first scientific theories holding up to that kind of stress. And unfortunately you die if you get it wrong, rather than being able to thumbs-down, retrain, and try again."

Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are.


 

Comments (4)



(Didn't consult Nora on this; I speak for myself)


I only briefly skimmed this response, and will respond even more briefly.

Re "Re: "AIs are white boxes""

You apparently completely misunderstood the point we were making with the white box thing. It has ~nothing to do with mech interp. It's entirely about whitebox optimization being better at controlling stuff than blackbox optimization. This is true even if the person using the optimizers has no idea how the system functions internally. 

Re: "Re: "Black box methods are sufficient"" (and the other stuff about evolution)

Evolution analogies are bad. There are many specific differences between ML optimization processes and biological evolution that predictably result in very different high level dynamics. You should not rely on one to predict the other, as I have argued extensively elsewhere.
 

Trying to draw inferences about ML from bio evolution is only slightly less absurd than trying to draw inferences about cheesy humor from actual dairy products. Regardless of the fact they can both be called "optimization processes", they're completely different things, with different causal structures responsible for their different outcomes, and crucially, those differences in causal structure explain their different outcomes. There's thus no valid inference from "X happened in biological evolution" to "X will eventually happen in ML", because X happening in biological evolution is explained by evolution-specific details that don't appear in ML (at least for most alignment-relevant Xs that I see MIRI people reference often, like the sharp left turn).

Re: "Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care""

This wasn't the point we were making in that section at all. We were arguing about concept learning order and the ease of internalizing human values versus other features for basing decisions on. We were arguing that human values are easy features to learn / internalize / hook up to decision making, so on any natural progression up the learning capacity ladder, you end up with an AI that's aligned before you end up with one that's so capable it can destroy the entirety of human civilization by itself. 
 

Re "Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are."

I think you badly misunderstood the post (e.g., multiple times assuming we're making an argument we're not, based on shallow pattern matching of the words used: interpreting "whitebox" as meaning mech interp and "values are easy to learn" as "it will know human values"), and I wish you'd either take the time to actually read / engage with the post in sufficient depth to not make these sorts of mistakes, or not engage at all (or at least not be so rude when you do it). 
 

(Note that this next paragraph is speculation, but a possibility worth bringing up, IMO): 

As it is, your response feels like you skimmed just long enough to pattern match our content to arguments you've previously dismissed, then regurgitated your cached responses to those arguments. Without further commenting on the merits of our specific arguments, I'll just note that this is a very bad habit to have if you want to actually change your mind in response to new evidence/arguments about the feasibility of alignment.


Re: "Overall take: unimpressed."

I'm more frustrated and annoyed than "unimpressed". But I also did not find this response impressive. 

I'm against downvoting this article into the negative.

I think it is worthwhile hearing someone's quick takes even when they don't have time to write a full response. Even if the article contains some misunderstandings (not claiming it does one way or the other), it still helps move the conversation forward by clarifying where the debate is at.

...it still helps move the conversation forward by clarifying where the debate is at.

Anything Nate writes would do that, because he's one of the debaters, right?  He could have written "It's a stupid post and I'm not going to read it", literally just that one sentence, and it would still tell us something surprising about the debate.  In some ways that post would be better than the one we got: it's shorter, and much clearer about how much work he put in.  But I would still downvote it, and I imagine you would too.  Even allowing for the value of the debate itself, the bar is higher than that.

For me, that bar is at least as high as "read the whole article before replying to it".  If you don't have time to read an article that's totally fine, but then you don't have time to post about it either.

I felt-sense-disagree. (I haven't yet downvoted the article, but I strongly considered it). I'll try to explore why I feel that way.

One reason probably is that I treat posts as having a different claim than other forms of publishing on this forum (and LessWrong)—they (implicitly) make a claim that they're finished & polished content. When I open a post I expect a person to have done some work that tries to uphold standards of scholarship and care, which this post doesn't show. I'd've been far less disappointed if this were a comment or a shortform post.

The other part is probably paying attention to status and the standards that are put upon people with high status: I expect high status people to not put much effort into whatever they produce as they can coast on status, which seems like the thing that's happening here. (Although one could argue that the MIRI faction is losing status/already low-ish status and this consideration doesn't apply here).

Additionally, I was disappointed that the text didn't say anything that I wouldn't have expected, which probably fed into my felt-sense of wanting to downvote. I'm not sure I reflectively endorse this feeling.
