I'm the CTO of Redwood Research, a nonprofit focused on applied alignment research. Read more about us here: https://www.redwoodresearch.org/

I'm also a fund manager on the EA Infrastructure Fund.

Linch's Shortform

How do you know whether you're happy with the results?

Linch's Shortform

This argument for the proposition "AI doesn't have an advantage over us at solving the alignment problem" doesn't work for outer alignment—some goals are easier to measure than others, and agents that are lucky enough to have easy-to-measure goals can train AGIs more easily.

What are the bad EA memes? How could we reframe them?

Unfortunately this isn’t a very good description of the concern about AI, and so even if it “polls better” I’d be reluctant to use it.

Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22]

No, the previous application will work fine. Thanks for applying :)

Buck's Shortform

I think it's bad when people who've been around EA for less than a year sign the GWWC pledge. I care a lot about this.

I would prefer groups to strongly discourage new people from signing it.

I can imagine boycotting groups that encouraged signing the GWWC pledge (though I'd probably first want to post about why I feel so strongly about this, and warn them that I was going to do so).

I regret taking the pledge, and the fact that the EA community didn't discourage me from taking it is by far my biggest complaint about how the EA movement has treated me. (EDIT: TBC, I don't think anyone senior in the movement actively encouraged me to do it, but I am annoyed at them for not actively discouraging it.)

(writing this short post now because I don't have time to write the full post right now)

We're Redwood Research, we do applied alignment research, AMA

Additionally, what are/how strong are the track records of Redwood's researchers/advisors?

The people we seek advice from on our research most often are Paul Christiano and Ajeya Cotra. Paul is a somewhat experienced ML researcher, who among other things led some of the applied alignment research projects that I am most excited about.

On our team, the people with the most relevant ML experience are probably Daniel Ziegler, who was involved with GPT-3 and also several OpenAI alignment research projects, and Peter Schmidt-Nielsen. Many of our other staff have research backgrounds (including publishing ML papers) that make me feel pretty optimistic about our ability to have good ML ideas and execute on the research.

How important do you think it is to have ML research projects be led by researchers who have had a lot of previous success in ML?

I think it kind of depends on what kind of ML research you’re trying to do. I think our projects require pretty similar types of expertise to eg Learning to Summarize with Human Feedback, and I think we have pretty analogous expertise to the team that did that research (and we’re advised by Paul, who led it).

I think that there are particular types of research that would be hard for us to do, due to not having certain types of expertise.

Maybe it's the case that the most useful ML research is done by the top ML researchers

I think that a lot of the research we are most interested in doing is not super bottlenecked on having the top ML researchers, in the same way that Learning to Summarize with Human Feedback doesn’t seem super bottlenecked on having the top ML researchers. I feel like the expertise we end up needing is some mixture of ML stuff like “how do we go about getting this transformer to do better on this classification task”, reasoning about the analogy to the AGI alignment problem, and lots of random stuff like making decisions about how to give feedback to our labellers.

or that the ML community won't take Redwood very seriously (e.g. won't consider using your algorithms) if the research projects aren't led by people with strong track records in ML.

I don’t feel very concerned about this; in my experience, ML researchers are usually pretty willing to consider research on its merits, and we have had good interactions with people from various AI labs about our research.

We're Redwood Research, we do applied alignment research, AMA

So one thing to note is that I think that there are varying degrees of solving the technical alignment problem. In particular, you’ve solved the alignment problem more if you’ve made it really convenient for labs to use the alignment techniques you know about. If next week some theory people told me “hey we think we’ve solved the alignment problem, you just need to use IDA, imitative generalization, and this new crazy thing we just invented”, then I’d think that the main focus of the applied alignment community should be trying to apply these alignment techniques to the most capable currently available ML systems, in the hope of working out all the kinks in these techniques, and then repeating this every year, so that whenever it comes time to actually build the AGI with these techniques, the relevant lab can just hire all the applied alignment people who are experts on these techniques and get them to apply them. (You might call this fire drills for AI safety, or having an “anytime alignment plan” (someone else invented this latter term; I don’t remember who).)


Assuming that it's taking too long to solve the technical alignment problem, what might be some of our other best interventions to reduce x-risk from AI? E.g., regulation, institutions for fostering cooperation and coordination between AI labs, public pressure on AI labs/other actors to slow deployment, …

I normally focus my effort on the question “how do we solve the technical alignment problem and make it as convenient as possible to build aligned systems, and then ensure that the relevant capabilities labs put effort into using these alignment techniques”, rather than this question, because it seems relatively tractable, compared to causing things to go well in worlds like those you describe.

One way of thinking about your question is to ask how many years the deployment of existentially risky AI could be delayed (which might buy time to solve the alignment problem). I don’t have super strong takes on this question. I think that there are many reasonable-seeming interventions, such as all of those that you describe. I guess I’m more optimistic about regulation and voluntary coordination between AI labs (eg, I’m happy about “Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project.” from the OpenAI Charter) than about public pressure, but I’m not confident.

If we solve the technical alignment problem in time, what do you think are the other major sources of AI-related x-risk that remain? How likely do you think these are, compared to x-risk from not solving the technical alignment problem in time?

Again, I think that maybe 30% of AI accident risk comes from situations where we sort of solved the alignment problem in time but the relevant labs don’t use the known solutions. Excluding that, I think that misuse risk is serious and worth worrying about. I don’t know how much value I think is destroyed in expectation by AI misuse compared to AI accident. I can also imagine various x-risks related to narrow AI.

We're Redwood Research, we do applied alignment research, AMA

We could operationalize this as “How does P(doom) vary as a function of the total amount of quality-adjusted x-risk-motivated AI alignment output?” (A related question is “Of the quality-adjusted AI alignment research, how much will be motivated by x-risk concerns?” This second question feels less well defined.)

I’m pretty unsure here. Today, my guess is like 25% chance of x-risk from AI this century, and maybe I imagine that being 15% if we doubled the quantity of quality-adjusted x-risk-motivated AI alignment output, and 35% if we halved that quantity. But I don’t have explicit models here and just made these second two numbers up right now; I wouldn’t be surprised to hear that they moved noticeably after two hours of thought. I guess that one thing you might learn from these numbers is that I think that x-risk-motivated AI alignment output is really important.
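For what it’s worth, the three numbers above (35% if output is halved, 25% at current levels, 15% if doubled) happen to be consistent with a simple log-linear relationship: each doubling of quality-adjusted alignment output shaves about 10 percentage points off P(doom). This is purely an illustrative interpolation of the made-up numbers, not a model anyone actually holds; the function name and parameters below are my own.

```python
import math

def p_doom(output_multiplier, baseline=0.25, drop_per_doubling=0.10):
    """Illustrative log-linear interpolation of the three stated guesses:
    halving output -> 35%, current output -> 25%, doubling output -> 15%."""
    return baseline - drop_per_doubling * math.log2(output_multiplier)

for mult in (0.5, 1, 2):
    print(f"{mult}x output -> P(doom) ~ {p_doom(mult):.0%}")
# prints 35%, 25%, and 15% respectively
```

The point of the sketch is just that these guesses imply a constant marginal return per doubling, which is one (strong) way of reading the claim that alignment output is really important.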

What are the main factors you expect will influence this? (e.g. the occurrence of medium-scale alignment failures as warning shots)

I definitely think that AI x-risk seems lower in worlds where we expect medium-scale alignment failure warning shots. I don’t know whether I think that x-risk-motivated alignment research seems less important in those worlds or not--even if everyone thinks that AI is potentially dangerous, we have to have scalable solutions to alignment problems, and I don’t see a reliable route that takes us directly from “people are concerned” to “people solve the problem”.

I think the main factor that affects the importance of x-risk-motivated alignment research is whether it turns out that most of the alignment problem occurs in miniature in sub-AGI systems. If so, much more of the work required for aligning AGI will be done by people who aren’t thinking about how to reduce x-risk.

We're Redwood Research, we do applied alignment research, AMA

Here are some things I think are fairly likely:

  • I think that there might be a bunch of progress on theoretical alignment, with various consequences:
    • More projects that look like “do applied research on various strategies to make imitative generalization work in practice” -- that is, projects where the theory researchers have specific proposals for ML training schemes that have attractive alignment properties, but which have practical implementation questions that might require a bunch of effort to work out. I think that a lot of the impact from applied alignment research comes from making it easier for capabilities labs to adopt alignment schemes, and so I’m particularly excited for this kind of work.
    • More well-scoped narrow theoretical problems, so that there are more gains from parallelism among theory researchers.
    • A better sense of what kinds of practical research are useful.
    • I think I will probably be noticeably more optimistic or pessimistic -- either there will be some plan for solving the problem that seems pretty legit to me, or else I’ll have updated substantially against such a plan existing.
  • We might have a clearer picture of AGI timelines. We might have better guesses about how early AGI will be trained. We might know more about empirical ML phenomena like scaling laws (which I think are somewhat relevant for alignment).
  • There will probably be a lot more industry interest in problems like “our pretrained model obviously knows a lot about topic X, but we don’t know how to elicit this knowledge from it.” I expect more interest in this because this becomes an increasingly important problem as your pretrained models become more knowledgeable. I think that this problem is pretty closely related to the alignment problem, so e.g. I expect that most research along the lines of Learning to Summarize with Human Feedback will be done by people who need this research for practical purposes, rather than alignment researchers interested in the analogy to AGI alignment problems.
  • Hopefully we’ll have more large applied alignment projects, as various x-risk-motivated orgs like Redwood scale up.
  • Plausibly large funders like Open Philanthropy will start spending large amounts of money on funding alignment-relevant research through RFPs or other mechanisms.
  • Probably we’ll have way better resources for onboarding new people into cutting edge thinking on alignment. I think that resources are way better than they were two years ago, and I expect this trend to continue.
  • Similarly, I think that there are a bunch of arguments about futurism and technical alignment that have been written up much more clearly and carefully now than they had been a few years ago. Eg Joe Carlsmith’s report on x-risk from power-seeking AGI and Ajeya Cotra on AGI timelines. I expect this trend to continue.