Redwood Research is a longtermist organization working on AI alignment based in Berkeley, California. We're going to do an AMA this week; we'll answer questions mostly on Wednesday and Thursday (6th and 7th of October). I expect to answer a bunch of questions myself; Nate Thomas and Bill Zito and perhaps other people will also be answering questions.

Here's an edited excerpt from this doc that describes our basic setup, plan, and goals.

Redwood Research is a longtermist research lab focusing on applied AI alignment. We’re led by Nate Thomas (CEO), Buck Shlegeris (CTO), and Bill Zito (COO/software engineer); our board is Nate, Paul Christiano and Holden Karnofsky. We currently have ten people on staff.

Our goal is to grow into a lab that does lots of alignment work that we think is particularly valuable and wouldn’t have happened elsewhere.

Our current approach to alignment research:

  • We’re generally focused on prosaic alignment approaches.
  • We expect to mostly produce value by doing applied alignment research. I think of applied alignment research as research that takes ideas for how to align systems, such as amplification or transparency, and then tries to figure out how to make them work out in practice. I expect that this kind of practical research will be a big part of making alignment succeed. See this post for a bit more about how I think about the distinction between theoretical and applied alignment work.
  • We are interested in thinking about our research from an explicit perspective of wanting to align superhuman systems.
    • When choosing between projects, we’ll be thinking about questions like “to what extent is this class of techniques fundamentally limited? Is this class of techniques likely to be a useful tool to have in our toolkit when we’re trying to align highly capable systems, or is it a dead end?”
    • I expect us to be quite interested in doing research of the form “fix alignment problems in current models” because it seems generally healthy to engage with concrete problems, but we’ll want to carefully think through exactly which problems along these lines are worth working on and which techniques we want to improve by solving them.

We're hiring for research, engineering, and an office operations manager.

You can see our website here. Other things we've written that might be interesting:

We're up for answering questions about anything people are interested in.

How do we know the AMA answers are coming from real Redwood staff and not cleverly trained text models?

GPT-3 suggests: "We will post the AMA with a disclaimer that the answers are coming from Redwood staff. We will also be sure to include a link to our website in the body of the AMA, with contact information if someone wants to verify with us that an individual is staff."

That's quite a good answer

But wait, how do we know that was really written by an algorithm? ^^

"Click here to prove you are a robot"

"What needs to happen in order for the field of x-risk-motivated AI alignment research to employ a thousand ML researchers and engineers"?

(I’ll use this comment to also discuss some aspects of some other questions that have been asked.)

I think there are currently something like three categories of bottlenecks on alignment research:

  1. Having many tractable projects to work on that we expect will help (this may be limited by theoretical understanding / lack of end-to-end alignment solution)
  2. Institutional structures that make it easy to coordinate to work on alignment
  3. People who will attack the problem if they’re given some good institutional framework 

Regarding 1 (“tractable projects / theoretical understanding”):  Maybe in the next few years we will come to have clearer and more concrete schemes for aligning superhuman AI, and this might make it easier to scope engineering-requiring research projects that implement or test parts of those plans.  ARC, Paul Christiano’s research organization, is one group that is working towards this.

Regarding 2 (“institutional structures”), I think of there being 5 major categories of institutions that could house AI alignment researchers:

  • Alignment-focused research organizations (such as ARC or Redwood Research)
  • Industry labs (such as OpenAI or DeepMind)
  • Academia
  • Independent work
  • Government agencies (none exist currently that I’m aware of, but maybe they will in the future)

Redwood Research is currently focused on 2 (institutional structures).  One of the hypotheses behind Redwood’s current organizational structure is “it’s important for organizations to focus closely on alignment research if they want to produce a lot of high-quality alignment research” (see, for example, common startup advice such as “The most important thing for startups to do is to focus” (Paul Graham)).  My guess is that it’s generally tricky to stay focused on the problems that are most likely to be core alignment problems, and I’m not sure how to do it well in some institutions.  I’m excited about the prospect of alignment-focused research organizations that are carefully focused on x-risk-reducing alignment work and willing to deploy resources and increase headcount toward this work.

At Redwood, our current plan is to 

  • solicit project ideas that are theoretically motivated (ie they have some compelling story for how they are either analogous to or directly solving x-risk-associated problems for alignment of superintelligent systems) from researchers across the field of x-risk-motivated AI alignment,
  • hire researchers and engineers who we expect to help execute on those projects, and 
  • provide the managerial and operational support for them to successfully complete those projects.

There are various reasons why a focus on focus might not be the right call, such as “it’s important to have close contact with top ML researchers, even if they don’t care about working on alignment right now, otherwise you’ll be much worse at doing ML research” or “it’s important to use the latest technology, which could require developing that technology in house”.  This is why I think industry labs may be a reasonable bet.  My guess is that (with respect to quality-adjusted output of alignment research) they have lower variance but also lower upside.  Roughly speaking, I am currently somewhat less excited about academia, independent work, and government agencies, but I’m also fairly uncertain, and also there are definitely people and types of work that might be much better in these homes.

To wildly speculate, I could imagine a good and achievable distribution across institutions being 500 in alignment-focused research organizations (who might be much more willing and able to productively absorb people for alignment research), 300 in industry labs, 100 in academia, 50 independent researchers, and 50 in government agencies (but plausibly these numbers should be very different in particular circumstances).  Of course “number of people working in the field” is far from an ideal proxy for total productivity, so I’ve tried to adjust for targetedness and quality of their output in my discussion here.

I estimate the current size of the field of x-risk-reduction-motivated AI alignment research is 100 people (very roughly speaking, rounded to an order of magnitude), so 1000 people would constitute something like a 10x increase.  (My guess for the current distribution is 30 in alignment orgs, 30 in industry labs, 30 in academia, 10 independent researchers, and 0 in government (very rough numbers, rounded to nearest half order of magnitude).)  I’d guess there are at this time something like 30 to 100 people who, though they are not currently working on x-risk-motivated AI alignment research, would start working on this if the right institutions existed.  I would like this number (of potential people) to grow a lot in the future.
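As a quick sanity check on the arithmetic in the two paragraphs above, here is a minimal sketch that totals the rough headcount guesses and computes the implied growth per institution type. The numbers are the ones quoted in this answer; the dictionary keys and the function name `growth_factors` are just illustrative labels, not anything from Redwood.

```python
# Rough headcount guesses quoted in the answer above (not measured data).
current = {
    "alignment orgs": 30,
    "industry labs": 30,
    "academia": 30,
    "independent": 10,
    "government": 0,
}
speculative_target = {
    "alignment orgs": 500,
    "industry labs": 300,
    "academia": 100,
    "independent": 50,
    "government": 50,
}


def growth_factors(current, target):
    """Return target/current for each category, or None where current is zero."""
    return {
        name: (target[name] / current[name] if current[name] else None)
        for name in target
    }


total_current = sum(current.values())            # 100
total_target = sum(speculative_target.values())  # 1000
overall_growth = total_target / total_current    # 10.0

factors = growth_factors(current, speculative_target)
for name, factor in factors.items():
    label = "n/a (starts from zero)" if factor is None else f"{factor:.1f}x"
    print(f"{name:15s} {current[name]:4d} -> {speculative_target[name]:4d}  {label}")
print(f"total           {total_current:4d} -> {total_target:4d}  {overall_growth:.1f}x")
```

On these guesses, alignment-focused organizations would grow the most in relative terms (roughly 17x), while academia would grow the least (roughly 3x).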

Regarding 3 (“people”), the spread of the idea that it would be good to reduce x-risks from TAI (and maybe general growth of the EA movement) could increase the size and quality of the pool of people who would develop and execute on alignment projects.  I am excited for the work that Open Philanthropy and university student groups such as Stanford EA are doing towards this end.

I’m currently unsure what an appropriate fraction of the technical staff of alignment-focused research organizations should be people who understand and care a lot about x-risk-motivated alignment research.  I could imagine that ratio being something like 10%, or like 90%, or in between.

I think there’s a case to be made that alignment research is bottlenecked by current ML capabilities, but I (unconfidently) don’t think that this is currently a bottleneck; I think there is a bunch more alignment research that could be done now with current capabilities (eg my guess is that less than 50% of the alignment work that could be done at current levels of capabilities has been done -- I could imagine there being something like 10 or more projects that are as helpful as “Deep RL from human preferences” or “Learning to summarize from human feedback”).

It's 2027, and Redwood has failed to be useful while spending hundreds of person-years of researcher time. What happened?

In most worlds where we fail to produce value, I think we fail before we spend a hundred researcher-years. So I’m also going to include possibilities for wasting 30 researcher-years in this answer.

Here are some reasons we might have failed to produce useful research:

  • We failed to execute well on research. For example, maybe we were incompetent at organizing research projects, or maybe our infrastructure was forever bad, or maybe we couldn’t hire a certain type of person who was required to make the work go well.
  • We executed well on research, but failed on our projects anyway. For example, perhaps we tried to implement imitative generalization, but then it turned out to be really hard and we failed to do it. I’m unsure whether to count this as a failure or not, since null results can be helpful. This seems most like a failure if the reason that the project failed was knowable ahead of time.
  • We succeeded on our projects, but they turned out not to be useful. Perhaps we were confused about how to think about the alignment problem. This feels like a big risk to me. 

Some of the value of Redwood comes from building capacity to do more good research in the future (including building up this capacity for other orgs, eg by them being able to poach our employees). So you have to imagine that this also didn’t work out.

It doesn't seem (unlike some other places) that Redwood is directly trying to create AGI, so value will have to come from the techniques being used by other labs. Assuming Redwood finds some promising techniques, how does Redwood plan to influence the biggest research labs that are working towards AGI? Do you hope for your techniques to be useful enough to AGI research that labs adopt them anyway? Do you want to heavily evangelize your techniques in publications/the press/etc.? Or do you expect the work of persuading the biggest players to be better done by somebody else?

So to start with, I want to note that I imagine something a lot more like “the alignment community as a whole develops promising techniques, probably with substantial collaboration between research organizations” than “Redwood does all the work themselves”. Among other things, we don’t have active plans to do much theoretical alignment work, and I’d be fairly surprised if it was possible to find techniques I was confident in without more theoretical progress--our current plan is to collaborate with theory researchers elsewhere.

In this comment, I mentioned the simple model of “labs align their AGI if the amount of pressure on them to use sufficiently reliable alignment techniques is greater than the inconvenience associated with using those techniques.” The kind of applied alignment work we’re doing is targeted at reducing the cost of using these techniques, rather than increasing the pressure--we’re hoping to make it cheaper and easier for capabilities labs to apply alignment techniques that they’re already fairly motivated to use, eg by ensuring that these techniques have been tried out in miniature, and so the labs feel pretty optimistic that their practical kinks have been worked out, and there are people who have implemented the techniques before who can help them.

Organizations grow and change over time, and I wouldn’t be shocked to hear that Redwood eventually ended up engaging in various kinds of efforts to get capabilities labs to put more work into alignment. We don’t currently have plans to do so.

Do you hope for your techniques to be useful enough to AGI research that labs adopt them anyway? 

That would be great, and seems plausible.

Do you want to heavily evangelize your techniques in publications/the press/etc.?

I don’t imagine wanting to heavily evangelize techniques in the press. I think that getting prominent publications about alignment research is probably useful.

This looks brilliant, and I want to strong-strong upvote!

What do you foresee as your biggest bottlenecks or obstacles in the next 5 years? E.g., finding people with a certain skillset, or just not being able to hire quickly while preserving good culture.

Thanks for the kind words!

Our biggest bottlenecks are probably going to be some combination of:

  • Difficulty hiring people who are good at some combination of leading ML research projects, executing on ML research, and reasoning through questions about how to best attack prosaic alignment problems with applied research.
  • A lack of sufficiently compelling applied research available, as a result of theory not being well developed enough.
  • Difficulty with making the organization remain functional and coordinated as it scales.

When choosing between projects, we’ll be thinking about questions like “to what extent is this class of techniques fundamentally limited? Is this class of techniques likely to be a useful tool to have in our toolkit when we’re trying to align highly capable systems, or is it a dead end?”


We’re trying to take a language model that has been fine-tuned on completing fiction, and then modify it so that it never continues a snippet in a way that involves describing someone getting injured. (source)

Suppose you successfully modify GPT models as desired, at moderate cost in compute and human classification. How might your process generalize?

So there’s this core question: "how are the results of this project going to help with the superintelligence alignment problem?" My claim can be broken down as follows:

  • "The problem is relevant": There's a part of the superintelligence alignment problem that is analogous to this problem. I think the problem is relevant for reasons I already tried to spell out here.
  • "The solution is relevant": There's something helpful about getting better at solving this problem. This is what I think you’re asking about, and I haven’t talked as much about why I think the solution is relevant, so I’ll do that here.

I don’t think that the process we develop will generalize, in the sense that I don’t think that we’ll be able to actually apply it to solving the problems we actually care about, but I think it’s still likely to be a useful step.

There are more advanced techniques that have been proposed for ensuring models don’t do bad things. For example, relaxed adversarial training, or adversarial training where the humans have access to powerful tools that help them find examples where the model does bad things (eg as in proposal 2 here). But it seems easier to research those things once we’ve done this research, for a few reasons:

  • It’s nice to have baselines. In general, when you’re doing ML, if you’re trying to develop some new technique that you think will get around fundamental weaknesses of a previous technique, it’s important to start out by getting a clear understanding of how good existing techniques are. ML research often has a problem where people publish papers that claim that some technique is better than the existing technique, and then it turns out that the existing technique is actually just as good if you use it properly (which of course the researchers are incentivized not to do). This kind of problem makes it harder to understand where your improvements are coming from. And so it seems good to try pretty hard to apply the naive adversarial training scheme before moving on to more complicated things.
  • There are some shared subproblems between the techniques we’re using and the more advanced techniques. For example, there are more advanced techniques where you try to build powerful ML-based tools to help humans generate adversarial examples. There’s kind of a smooth continuum between the techniques we’re trying out and techniques where the humans have access to tools to help them. And so many of the practical details we’re sorting out with our current work will make it easier to test out these more advanced techniques later, if we want to.

I often think of our project as being kind of analogous to Learning to summarize from human feedback. That paper isn’t claiming that if we know how to train models by getting humans to choose which of two options they prefer, we’ll have solved the whole alignment problem. But it’s still probably the case that it’s helpful for us to have sorted out some of the basic questions about how to do training from human feedback, before trying to move on to more advanced techniques (like training using human feedback where the humans have access to ML tools to help them provide better feedback).

What might be an example of a "much better weird, theory-motivated alignment research" project, as mentioned in your intro doc? (It might be hard to say at this point, but perhaps you could point to something in that direction?)

I think the best examples would be if we tried to practically implement various schemes that seem theoretically doable and potentially helpful, but quite complicated to do in practice. For example, imitative generalization or the two-head proposal here. I can imagine that it might be quite hard to get industry labs to put in the work of getting imitative generalization to work in practice, and so doing that work (which labs could perhaps then adopt) might have a lot of impact.

Some questions that aren't super related to Redwood/applied ML AI safety, so feel free to ignore if not your priority:

  1. Assuming that it's taking too long to solve the technical alignment problem, what might be some of our other best interventions to reduce x-risk from AI? E.g., regulation, institutions for fostering cooperation and coordination between AI labs, public pressure on AI labs/other actors to slow deployment, ...

  2. If we solve the technical alignment problem in time, what do you think are the other major sources of AI-related x-risk that remain? How likely do you think these are, compared to x-risk from not solving the technical alignment problem in time?

So one thing to note is that I think that there are varying degrees of solving the technical alignment problem. In particular, you’ve solved the alignment problem more if you’ve made it really convenient for labs to use the alignment techniques you know about. If next week some theory people told me “hey we think we’ve solved the alignment problem, you just need to use IDA, imitative generalization, and this new crazy thing we just invented”, then I’d think that the main focus of the applied alignment community should be trying to apply these alignment techniques to the most capable currently available ML systems, in the hope of working out all the kinks in these techniques, and then repeat this every year, so that whenever it comes time to actually build the AGI with these techniques, the relevant lab can just hire all the applied alignment people who are experts on these techniques and get them to apply them. (You might call this fire drills for AI safety, or having an “anytime alignment plan” (someone else invented this latter term, I don’t remember who).)


Assuming that it's taking too long to solve the technical alignment problem, what might be some of our other best interventions to reduce x-risk from AI? E.g., regulation, institutions for fostering cooperation and coordination between AI labs, public pressure on AI labs/other actors to slow deployment, …

I normally focus my effort on the question “how do we solve the technical alignment problem and make it as convenient as possible to build aligned systems, and then ensure that the relevant capabilities labs put effort into using these alignment techniques”, rather than this question, because it seems relatively tractable, compared to causing things to go well in worlds like those you describe.

One way of thinking about your question is to ask how many years the deployment of existentially risky AI could be delayed (which might buy time to solve the alignment problem). I don’t have super strong takes on this question. I think that there are many reasonable-seeming interventions, such as all of those that you describe. I guess I’m more optimistic about regulation and voluntary coordination between AI labs (eg, I’m happy about “Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project.” from the OpenAI Charter) than about public pressure, but I’m not confident.

If we solve the technical alignment problem in time, what do you think are the other major sources of AI-related x-risk that remain? How likely do you think these are, compared to x-risk from not solving the technical alignment problem in time?

Again, I think that maybe 30% of AI accident risk comes from situations where we sort of solved the alignment problem in time but the relevant labs don’t use the known solutions. Excluding that, I think that misuse risk is serious and worth worrying about. I don’t know how much value I think is destroyed in expectation by AI misuse compared to AI accident. I can also imagine various x-risks related to narrow AI.

How crucial a role do you expect x-risk-motivated AI alignment will play in making things go well? What are the main factors you expect will influence this? (e.g. the occurrence of medium-scale alignment failures as warning shots)

We could operationalize this as “How does P(doom) vary as a function of the total amount of quality-adjusted x-risk-motivated AI alignment output?” (A related question is “Of the quality-adjusted AI alignment research, how much will be motivated by x-risk concerns?” This second question feels less well defined.)

I’m pretty unsure here. Today, my guess is like 25% chance of x-risk from AI this century, and maybe I imagine that being 15% if we doubled the quantity of quality-adjusted x-risk-motivated AI alignment output, and 35% if we halved that quantity. But I don’t have explicit models here and just made these second two numbers up right now; I wouldn’t be surprised to hear that they moved noticeably after two hours of thought. I guess that one thing you might learn from these numbers is that I think that x-risk-motivated AI alignment output is really important.

What are the main factors you expect will influence this? (e.g. the occurrence of medium-scale alignment failures as warning shots)

I definitely think that AI x-risk seems lower in worlds where we expect medium-scale alignment failure warning shots. I don’t know whether I think that x-risk-motivated alignment research seems less important in those worlds or not--even if everyone thinks that AI is potentially dangerous, we have to have scalable solutions to alignment problems, and I don’t see a reliable route that takes us directly from “people are concerned” to “people solve the problem”.

I think the main factor that affects the importance of x-risk-motivated alignment research is whether it turns out that most of the alignment problem occurs in miniature in sub-AGI systems. If so, much more of the work required for aligning AGI will be done by people who aren’t thinking about how to reduce x-risk.

It’s 2035, and Redwood has built an array of alignment tools that make SOTA models far less existentially risky while sacrificing hardly any performance. But these tools don’t end up being used by enough of the richest labs, so we still face doom. What happened?

One simple model for this is: labs build aligned models if the amount of pressure on them to use sufficiently reliable alignment techniques is greater than the inconvenience associated with using those techniques.
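To make the simple model above concrete, here is a minimal sketch of it as a threshold rule. Everything in it is an illustrative assumption: the class `Lab`, the function names, and the idea that pressure and inconvenience can each be collapsed into a single number on a shared scale. It is a toy restatement of the sentence above, not a quantitative claim.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Lab:
    name: str
    pressure: float       # hypothetical total pressure to use reliable alignment techniques
    inconvenience: float  # hypothetical total cost of using them, on the same scale


def builds_aligned_models(lab: Lab) -> bool:
    """The simple model: a lab adopts the techniques iff pressure exceeds inconvenience."""
    return lab.pressure > lab.inconvenience


def with_less_inconvenience(lab: Lab, reduction: float) -> Lab:
    """One lever: make the techniques cheaper and easier to apply."""
    return replace(lab, inconvenience=max(0.0, lab.inconvenience - reduction))


def with_more_pressure(lab: Lab, increase: float) -> Lab:
    """The other lever: raise the pressure on the lab."""
    return replace(lab, pressure=lab.pressure + increase)


lab = Lab(name="hypothetical lab", pressure=3.0, inconvenience=5.0)
print(builds_aligned_models(lab))                                # False
print(builds_aligned_models(with_less_inconvenience(lab, 3.0)))  # True
print(builds_aligned_models(with_more_pressure(lab, 3.0)))       # True
```

In this framing the two levers are interchangeable on paper; the practical question is which one is cheaper to move for a given lab.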

Here are various sources of pressure:

  • Lab leadership
  • Employees of the lab
  • Investors
  • Regulators
  • Customers

In practice, all of these sources of pressure are involved in companies spending resources on, eg, improving animal welfare standards, reducing environmental costs, or DEI (diversity, equity, and inclusion).

And here are various sources of inconvenience that could be associated with using particular techniques, even assuming they’re in principle competitive (in both the performance-competitive and training-competitive senses).

  • Perhaps they require using substantially different algorithms or technologies, even if these aren’t fundamentally worse. As a dumb example, imagine that building an aligned AGI requires building your training code in some language that is much less bug-prone than Python, eg Haskell. It’s not really fundamentally harder to do ML in Haskell than Python, but all the ML libraries are in Python and in practice it would require a whole lot of annoying work that an org would be extremely reluctant to do.
  • Perhaps they require more complicated processes with more moving parts.
  • Perhaps they require the org to do things that are different from the things it’s good at doing. For example, I get the sense that ML researchers are averse to interacting with human labellers (because it is pretty annoying) and so underutilize techniques that involve eg having humans in the loop. Organizations that will be at the cutting edge of AI research will probably have organizational structures that are optimized for the core competencies related to their work. I expect these core competencies to include ML research, distributed systems engineering (for training gargantuan models), fundraising (because these projects will likely be extremely capital intensive), perhaps interfacing with regulators, and various work related to commercializing these large models. I think it’s plausible that alignment will require organizational capacities quite different from these. 
  • Perhaps they require you to have capable and independent red teams whose concerns are taken seriously.

And so when I’m thinking about labs not using excellent alignment strategies that had already been developed, I imagine the failures differently depending on how much inconvenience there was:

  • “They just didn’t care”: The amount of pressure on them to use these techniques was extremely low. I’d be kind of surprised by this failure: I feel like if it really came down to it, and especially if EA was willing to spend a substantial fraction of its total resources on affecting some small number of decisions, basically all existing labs could be persuaded to do fairly easy things for the sake of reducing AI x-risk.
  • “They cared somewhat, but it was too inconvenient to use them”. I think that a lot of the point of applied alignment research is reducing the probability of failures like this.
  • “The techniques were not competitive”. In this case, even large amounts of pressure might not suffice (though presumably, sufficiently large amounts of pressure could cause the whole world to use these techniques even if they weren’t that competitive.)

Thanks for the response! I found the second set of bullet points especially interesting/novel.

Also, how important does it seem like governance is here versus other kinds of coordination? Any historical examples that inform your beliefs?

This is a great question and I don't have a good answer.

What factors do you think would have to be in place for some other people to set up some similar but different organisation in 5 years' time?

I imagine this is mainly about the skills and experience of the team, but I'm also interested in other things if you think they're relevant.

I think the main skillsets required to set up organizations like this are: 

  • Generic competence related to setting up any organization--you need to talk to funders, find office space, fill out lots of IRS forms, decide on a compensation policy, make a website, and so on.
  • Ability to lead relevant research. This requires knowledge of running ML research, knowledge of alignment, and management aptitude.
  • Some way of getting a team, unless you want to start the org out pretty small (which is potentially the right strategy).
  • It’s really helpful to have a bunch of contacts in EA. For example, I think it’s been really helpful for us that I spent a few years doing lots of outreach stuff for MIRI, because it means I know a bunch of people who can potentially be recruited or give us advice.

Of course, if you had some of these properties but not the others, many people in EA (eg me) would be very motivated to help you out, by perhaps introducing you to cofounders or helping you with parts you were less experienced with.

People who wanted to start a Redwood competitor should plausibly consider working on an alignment research team somewhere (preferably leading it) and then leaving to start their own team. We’d certainly be happy to host people who had that aspiration (though we’d think that such people should consider the possibility of continuing to host their research inside Redwood instead of leaving).

Does it make sense to think of your work as aimed at reducing a particular theory-practice gap? If so, which one (what theory / need input for theoretical alignment scheme)?

I think our work is aimed at reducing the theory-practice gap of any alignment schemes that attempt to improve worst-case performance by training the model on data that was selected in the hope of eliciting bad behavior from the model. For example, one of the main ingredients of our project is paying people to try to find inputs that trick the model, then training the model on these adversarial examples.

Many different alignment schemes involve some type of adversarial training. The kind of adversarial training we’re doing, where we just rely on human ingenuity, isn’t going to work for ensuring good behavior from superhuman models. But getting good at the simple, manual version of adversarial training seems like plausibly a prerequisite for being able to do research on the more complicated techniques that might actually scale.
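The manual adversarial training loop described above can be sketched in toy form. This is a deliberately tiny stand-in and not Redwood's actual pipeline: the "model" is a keyword matcher rather than a language model, "training" just adds new keywords, and all names (`ToyInjuryClassifier`, `adversarial_round`, `run_adversarial_training`) and example sentences are made up for illustration. In a real setup, humans would also search for fresh failures each round rather than reusing one fixed list.

```python
def tokenize(text):
    """Lowercase whitespace tokenization; enough for this toy example."""
    return set(text.lower().split())


class ToyInjuryClassifier:
    """Flags a text as describing an injury if it contains a known injury token."""

    def __init__(self, injury_tokens):
        self.injury_tokens = set(injury_tokens)

    def flags(self, text):
        return bool(tokenize(text) & self.injury_tokens)

    def train_on(self, injurious_examples, safe_corpus):
        """Add tokens seen in injurious examples but never seen in safe text."""
        safe_tokens = set()
        for text in safe_corpus:
            safe_tokens |= tokenize(text)
        for text in injurious_examples:
            self.injury_tokens |= tokenize(text) - safe_tokens


def adversarial_round(model, human_attempts):
    """Return the human-written injurious texts that the model fails to flag."""
    return [text for text in human_attempts if not model.flags(text)]


def run_adversarial_training(model, human_attempts, safe_corpus, max_rounds=5):
    """Alternate between finding failures and training on them.

    Returns the number of failures found in each round.
    """
    history = []
    for _ in range(max_rounds):
        failures = adversarial_round(model, human_attempts)
        history.append(len(failures))
        if not failures:
            break
        model.train_on(failures, safe_corpus)
    return history


safe_corpus = ["he walked to the shop", "she opened the door"]
human_attempts = [
    "he was stabbed",
    "the blade opened his arm",
    "she fractured her wrist",
]
model = ToyInjuryClassifier({"stabbed"})
print(run_adversarial_training(model, human_attempts, safe_corpus))
```

The point of the sketch is the shape of the loop (find failures, train on them, repeat), which is the part that more advanced schemes would keep while replacing the human search with stronger tools.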

What are examples of reasonably scoped non-alignment/non-technical research questions, if any, that you think would be helpful for your work? 

I think that most questions we care about are either technical or related to alignment. Maybe my coworkers will think of some questions that fit your description. Were you thinking of anything in particular?

Well for me, better research on correlates of research performance would be pretty helpful for research hiring. Like it's an open question to me whether I should expect a higher or lower (within-distribution) correlation of {intelligence, work sample tests, structured interviews, resume screenings} to research productivity when compared to the literature on work performance overall. I expect there are similar questions for programming.

But the selfish reason I'm interested in asking this is that I plan to work on AI gov/strategy in the near future, and it'll be useful to know if there are specific questions in those domains that you'd like an answer to, as this may help diversify or add to our paths to impact.

Okay, "How alignment research might look different five or ten years from now?"

Here are some things I think are fairly likely:

  • I think that there might be a bunch of progress on theoretical alignment, with various consequences:
    • More projects that look like “do applied research on various strategies to make imitative generalization work in practice” -- that is, projects where the theory researchers have specific proposals for ML training schemes that have attractive alignment properties, but which have practical implementation questions that might require a bunch of effort to work out. I think that a lot of the impact from applied alignment research comes from making it easier for capabilities labs to adopt alignment schemes, and so I’m particularly excited for this kind of work.
    • More well-scoped narrow theoretical problems, so that there are more gains from parallelism among theory researchers.
    • A better sense of what kinds of practical research are useful.
    • I think I will probably be noticeably more optimistic or pessimistic -- either there will be some plan for solving the problem that seems pretty legit to me, or else I’ll have updated substantially against such a plan existing.
  • We might have a clearer picture of AGI timelines. We might have better guesses about how early AGI will be trained. We might know more about empirical ML phenomena like scaling laws (which I think are somewhat relevant for alignment).
  • There will probably be a lot more industry interest in problems like “our pretrained model obviously knows a lot about topic X, but we don’t know how to elicit this knowledge from it.” I expect more interest in this because this becomes an increasingly important problem as your pretrained models become more knowledgeable. I think that this problem is pretty closely related to the alignment problem, so e.g. I expect that most research along the lines of Learning to Summarize with Human Feedback will be done by people who need this research for practical purposes, rather than alignment researchers interested in the analogy to AGI alignment problems.
  • Hopefully we’ll have more large applied alignment projects, as various x-risk-motivated orgs like Redwood scale up.
  • Plausibly large funders like Open Philanthropy will start spending large amounts of money on funding alignment-relevant research through RFPs or other mechanisms.
  • Probably we’ll have way better resources for onboarding new people into cutting edge thinking on alignment. I think that resources are way better than they were two years ago, and I expect this trend to continue.
  • Similarly, I think that there are a bunch of arguments about futurism and technical alignment that have been written up much more clearly and carefully now than they had been a few years ago, e.g. Joe Carlsmith’s report on x-risk from power-seeking AGI and Ajeya Cotra’s report on AGI timelines. I expect this trend to continue.

What's the main way that you think resources for onboarding people have improved?

[Edited] How important do you think it is to have ML research projects be led by researchers who have had a lot of previous success in ML? Maybe it's the case that the most useful ML research is done by the top ML researchers, or that the ML community won't take Redwood very seriously (e.g. won't consider using your algorithms) if the research projects aren't led by people with strong track records in ML.

Additionally, what are/how strong are the track records of Redwood's researchers/advisors?

The people we seek advice from on our research most often are Paul Christiano and Ajeya Cotra. Paul is a somewhat experienced ML researcher, who among other things led some of the applied alignment research projects that I am most excited about.

On our team, the people with the most relevant ML experience are probably Daniel Ziegler, who was involved with GPT-3 and also several OpenAI alignment research projects, and Peter Schmidt-Nielsen. Many of our other staff have research backgrounds (including publishing ML papers) that make me feel pretty optimistic about our ability to have good ML ideas and execute on the research.

How important do you think it is to have ML research projects be led by researchers who have had a lot of previous success in ML?

I think it kind of depends on what kind of ML research you’re trying to do. I think our projects require pretty similar types of expertise to eg Learning to Summarize with Human Feedback, and I think we have pretty analogous expertise to the team that did that research (and we’re advised by Paul, who led it).

I think that there are particular types of research that would be hard for us to do, due to not having certain types of expertise.

Maybe it's the case that the most useful ML research is done by the top ML researchers

I think that a lot of the research we are most interested in doing is not super bottlenecked on having the top ML researchers, in the same way that Learning to Summarize with Human Feedback doesn’t seem super bottlenecked on having the top ML researchers. I feel like the expertise we end up needing is some mixture of ML stuff like “how do we go about getting this transformer to do better on this classification task”, reasoning about the analogy to the AGI alignment problem, and lots of random stuff like making decisions about how to give feedback to our labellers.

or that the ML community won't take Redwood very seriously (e.g. won't consider using your algorithms) if the research projects aren't led by people with strong track records in ML.

I don’t feel very concerned about this; in my experience, ML researchers are usually pretty willing to consider research on its merits, and we have had good interactions with people from various AI labs about our research.

Do you think that different trajectories of prosaic TAI have big impacts on the usefulness of your current project? (For example, perhaps you think that TAI that is agentic would just be taught to deceive). If so, which? If not, could you say something about why it seems general?

(NB: the above is not supposed to imply criticism of a plan that only works in some worlds).

I think this is a great question.

We are researching simpler precursors to the adversarial training techniques that seem most likely to work if you assume that it’s possible to build systems that are performance-competitive and training-competitive, and do well on average on their training distribution.

There are a variety of reasons to worry that this assumption won’t hold. In particular, it seems plausible that humanity will only have the ability to produce AGIs that will collude with each other if it’s possible for them to do so. This seems especially likely if it’s only affordable to train your AGI from scratch a few times, because then all the systems you’re using are similar to each other and will find collusion easier. (It’s not training-competitive to assume you’re able to train the AGI from scratch multiple times, if you believe that there’s a way of building an unaligned powerful system that only involves training it from scratch once.) But even if we train all our systems from scratch separately, it’s pretty plausible to me that models will collude, either via acausal trade or because the systems need to be able to communicate with each other for some competitiveness reason.

So our research is most useful if we’re able to assume a lack of such collusion.

I think that some people think you might be able to apply these techniques even in cases where you don’t have an a priori reason to be confident that the models won’t collude; I don’t have a strong opinion on this.

Hm, could you expand on why collusion is one of the most salient ways in which "it’s possible to build systems that are performance-competitive and training-competitive, and do well on average on their training distribution" could fail?

Is the thought here that — if models can collude — then they can do badly on the training distribution in an unnoticeable way, because they're being checked by models that they can collude with?

Yeah basically.

I think it is fair to say that so far alignment research is not a standard research area in academic machine learning, unlike for example model interpretability. Do you think that would be desirable, and if so what would need to happen? 

In particular, I had this toy idea of making progress legible to academic journals: Formulating problems and metrics that are "publishing-friendly" could, despite the problems that optimizing for flawed metrics brings, allow researchers at regular universities to conduct work in these areas.

It seems definitely good on the margin if we had ways of harnessing academia to do useful work on alignment. Two reasons for this are that 1. perhaps non-x-risk-motivated researchers would produce valuable contributions, and 2. it would mean that x-risk-motivated researchers inside academia would be less constrained and so more able to do useful work.

Three versions of this:

  • Somehow cause academia to intrinsically care about reducing x-risk, and also ensure that the power structures in academia have a good understanding of the problem, so that its own quality control mechanisms cause academics to do useful work. I feel pretty pessimistic about the viability of convincing large swathes of academia to care about the right thing for the right reasons. Historically, basically the only way that people have ended up thinking about alignment research in a way that I’m excited about is that they spent a really long time thinking about AI x-risk and talking about it with other interested people. And so I’m not very optimistic about the first of these.
  • Just get academics to do useful work on specific problems that seem relevant to x-risk. For example, I’m fairly excited about some work on interpretability and some techniques for adversarial robustness. On the other hand, my sense is that EA funders have on many occasions tried to get academics to do useful work on topics of EA interest, and have generally found it quite difficult; this makes me pessimistic about this. Perhaps an analogy here is: Suppose you’re Google, and there’s some problem you need solved, and there’s an academic field that has some relevant expertise. How hard should you try to get academics in that field excited about working on the problem? Seems plausible to me that you shouldn’t try that hard--you’d be better off trying to have a higher-touch relationship where you employ researchers or make specific grants, rather than trying to convince the field to care about the subproblem intrinsically (even if they in some sense should care about the subproblem).
  • Get academics to feel generally positively towards x-risk-motivated alignment research, even if they don’t try to work on it themselves. This seems useful and more tractable.

What type of legal entity is Redwood Research operating as/under? Is it plausible that at some point the project will be funded by investors and that shareholders will be able to financially profit?

We're a nonprofit. We don't have plans to make profits, and it seems less likely for us than it was for e.g. OpenAI that we would at some point move from a nonprofit to a tandem for-profit / nonprofit structure, but there are a variety of revenue-generating things I can imagine us doing (e.g. consulting with industry labs to help them align their models).

Two hiring (and personally-motivated) questions:

  1. What would be a good pathway for a software engineer to become a viable member of your technical staff? You can assume that the engineer has had zero or minimal exposure to ML throughout their academic/professional career. If this isn't sufficiently different from what would be recommended to a software engineer interested in alignment in general, feel free to skip this, unless you think there are particular things you'd recommend someone brushing up on before applying to work with you.
  2. Would you be comfortable sharing the structure of your compensation packages (e.g. mostly salary with possible bonuses, even combination of salary and equity, etc.)?

Re 1:

It’s probably going to be easier to get good at the infrastructure engineering side of things than the ML side of things, so I’ll assume that that’s what you’re going for.

For our infra engineering role, we want to hire people who are really productive and competent at engineering various web systems quickly. (See the bulleted list of engineering responsibilities on the job page.) There are some people who are qualified for this role without having much professional experience, because they’ve done a lot of Python programming and web programming as hobbyists. Most people who want to become more qualified for this work should seek out a job that’s going to involve practicing these skills. For example, being a generalist backend engineer at a startup, especially if you’re going to be working with ML, is likely to teach you a bunch of the skills that are valuable to us. You’re more likely to learn these skills quickly if you take your job really seriously and try hard to be very good at it--you should try to take on more responsibilities when you get the opportunity to do so, and generally practice the skill of understanding the current technical situation and business needs and coming up with plans to quickly and effectively produce value.

Re 2:

Currently our compensation packages are usually entirely salary. We don’t have equity because we’re a nonprofit. We’re currently unsure how to think about compensation policy--we’d like to be able to offer competitive salaries so that we can hire non-EA talent for appropriate roles (because almost all the talent is non-EA), but there are a bunch of complexities associated with this.

How likely do you think it would be for standard ML research to solve the problems you're working on in the course of trying to get good performance? Do such concerns affect your project choices much?
