tl;dr: Ask questions about AGI Safety as comments on this post, including ones you might otherwise worry seem dumb!

Asking beginner-level questions can be intimidating, but everyone starts out not knowing anything. If we want more people in the world who understand AGI safety, we need a place where it's accepted and encouraged to ask about the basics.

We'll be putting up monthly FAQ posts as a safe space for people to ask all the possibly-dumb questions that may have been bothering them about the whole AGI Safety discussion, but which until now they didn't feel able to ask.

It's okay to ask uninformed questions, and not worry about having done a careful search before asking.

Stampy's Interactive AGI Safety FAQ

Additionally, this will serve as a way to spread the project Rob Miles' volunteer team[1] has been working on: Stampy - which will be (once we've got considerably more content) a single point of access into AGI Safety, in the form of a comprehensive interactive FAQ with lots of links to the ecosystem. We'll be using questions and answers from this thread for Stampy (under these copyright rules), so please only post if you're okay with that! You can help by adding other people's questions and answers to Stampy or getting involved in other ways!

We're not at the "send this to all your friends" stage yet, we're just ready to onboard a bunch of editors who will help us get to that stage :)

Stampy - Here to help everyone learn about ~~stamp maximization~~ AGI Safety!

We welcome feedback[2] and questions on the UI/UX, policies, etc. around Stampy, as well as pull requests to his codebase. You are encouraged to add other people's answers from this thread to Stampy if you think they're good, and collaboratively improve the content that's already on our wiki.

We've got a lot more to write before he's ready for prime time, but we think Stampy can become an excellent resource for everyone from skeptical newcomers, through people who want to learn more, right up to people who are convinced and want to know how they can best help with their skillsets.

PS: Based on feedback that Stampy might not be serious enough for serious people, we built a more professional alternate skin for the frontend: Alignment.Wiki. We may move it one more time; feedback welcome.

Guidelines for Questioners:

  • No previous knowledge of AGI safety is required. If you want to watch a few of the Rob Miles videos, read the WaitButWhy posts, or read The Most Important Century summary from OpenPhil's co-CEO first, that's great, but none of these are prerequisites for asking a question.
  • Similarly, you do not need to try to find the answer yourself before asking a question (but if you want to test Stampy's in-browser tensorflow semantic search that might get you an answer quicker!).
  • Also feel free to ask questions that you're pretty sure you know the answer to, but where you'd like to hear how others would answer the question.
  • One question per comment if possible (though if you have a set of closely related questions that you want to ask all together that's ok).
  • If you have your own response to your own question, put that response as a reply to your original question rather than including it in the question itself.
  • Remember, if something is confusing to you, then it's probably confusing to other people as well. If you ask a question and someone gives a good response, then you are likely doing lots of other people a favor!

Guidelines for Answerers:

  • Linking to the relevant canonical answer on Stampy is a great way to help people with minimal effort! Improving that answer means that everyone going forward will have a better experience!
  • This is a safe space for people to ask stupid questions, so be kind!
  • If this post works as intended then it will produce many answers for Stampy's FAQ. It may be worth keeping this in mind as you write your answer. For example, in some cases it might be worth giving a slightly longer / more expansive / more detailed explanation rather than just giving a short response to the specific question asked, in order to address other similar-but-not-precisely-the-same questions that other people might have.

Finally: Please think very carefully before downvoting any questions, remember this is the place to ask stupid questions!

  1. ^

    If you'd like to join, head over to Rob's Discord and introduce yourself!

  2. ^

    Via the feedback form.



If companies like OpenAI and Deepmind have safety teams, it seems to me that they anticipate that speeding up AI capabilities can be very bad, so why don't they press the brakes on their capabilities research until we come up with more solutions to alignment?

Possible reasons:

  • Their leaderships (erroneously) believe the benefits outweigh the risk (i.e. they don't appreciate the scale of the risk, or just don't care enough even if they do?).
  • They worry about losing the race to AGI to competitors who aren't as aligned (with safety), and they aren't taking heed of the warnings of Szilárd and Ellsberg re race dynamics. Perhaps if they actually took the lead in publicly (and verifiably) slowing down, that would be enough to slow down the whole field globally, given that they are the leaders. There isn't really a precedent for this for something so globally important, but perhaps the "killing" of the electric car (which delayed development by ~a decade?) is instructive.

In this post, one of the lines of discussion is about whether values are "fragile". My summary, which might be wrong:

  1. Katja says maybe they aren't: for example, GANs seem to make faces which are pretty close to normal human faces
  2. Nate responds: sure, but if you used the discriminator to make the maximally face-like object, this would not at all be like a human face
  3. cfoster0 replies: yeah but nobody uses GANs that way

And then I lose the thread. Why isn't cfoster0's response compelling?

Great question! I think the core of the answer comes down to the fact that the real danger of AI systems does not come from tools, but from agents. There are strong incentives to build more agenty AIs: agenty AIs are more useful and powerful than tools, it's likely to be relatively easy to build agents once you can build powerful tools, and tools may naturally slide into becoming agents at a certain level of capability. If you're a human directing a tool, it's pretty easy to point the optimization power of the tool in beneficial ways. Once you have a system which has its own goals which it's maximizing for, then you have much bigger problems. Consequentialists seek power more effectively than other systems, so when you're doing a large enough program search with a diverse training task attached to a reinforcement signal they will tend to be dominant. Internally targetable maximization-flavored search is an extremely broadly useful mechanism which will be stumbled on and upweighted by gradient descent. See Rohin Shah's AI risk from Program Search threat model for more details.

The system which emerges from recursive self-improvement is likely to be a maximizer of some kind, and maximizing AI is dangerous (and hard to avoid!), as explored in this Rob Miles video.

To tie this back to your question: weak and narrow AIs can be safely used as tools, since we can have a human in the outer loop directing the optimization power. Once you have a system much smarter than you, the thing it ends up pointed at maximizing is no longer corrigible by default, and you can't course-correct if you misspecified the kind of facelikeness you were asking for. Specifying open-ended goals for a sovereign maximizer to pursue in the real world which don't kill everyone is an unsolved problem.

Asking beginner-level questions can be intimidating

To make it even less intimidating, maybe next time include a Google Form where people can ask questions anonymously, which you'd then post in the thread on their behalf?

Yes, this seems like a good idea, I'll do that next time unless someone else is up for handling incoming questions.

What stops AI Safety orgs from just hiring ML talent outside EA for their junior/more generic roles?

1 · Jakub Kraus · 3mo
I'd love to see a detailed answer to this question. I think a key bottleneck for AI alignment at the moment is finding people who can identify research directions (and then lead relevant projects) that might actually reduce x-risk, so I'm also confused why some career guides include software and ML engineering as one of the best ways to contribute. I struggle to see how software and ML engineering could be a bottleneck given that there are so many talented software and ML engineers out there. Counterpoint: infohazards mean you can't just hire anyone.

This is a question about some anthropocentrism that seems latent in the AI safety research that I've seen so far:

Why do AI alignment researchers seem to focus only on aligning with human values, preferences, and goals, without considering alignment with the values, preferences, and goals of non-human animals?

I see a disconnect between EA work on AI alignment and EA work on animal welfare, and it's puzzling to me, given that any transformative AI will transform not just 8 billion human lives, but trillions of other sentient lives on Earth. Are any AI researchers trying to figure out how AI can align with even simple cases like the interests of the few species of pets and livestock? 

If we view AI development not just as a matter of human technology, but as a 'major evolutionary transition' for life on our planet more generally, it would seem prudent to consider broader issues of alignment with the other 5,400 species of mammals, the other 45,000 species of vertebrates, etc...

IMO: Most of the difficulty in technical alignment is figuring out how to robustly align to any particular values whatsoever. Mine, yours, humanity's, all sentient life on Earth, etc. All roughly equally difficult. "Human values" is probably the catchphrase mostly for instrumental reasons--we are talking to other humans, after all, and in particular to liberal egalitarian humans who are concerned about some people being left out and especially concerned about an individual or small group hoarding power. Insofar as lots of humans were super concerned that animals would be left out too, we'd be saying humans+animals. The hard part isn't deciding who to align to, it's figuring out how to align to anything at all.

8 · Geoffrey Miller · 3mo
I've heard this argument several times, that once we figure out how to align AI with the values of any sentient being, the rest of AI alignment with all the other billions/trillions of different sentient beings will be trivially easy. I'm not at all convinced that this is true, or even superficially plausible, given the diversity, complexity, and heterogeneity of values, and given that sentient beings are severely & ubiquitously unaligned with each other (see: evolutionary game theory, economic competition, ideological conflict). What is the origin of this faith among the AI alignment community that 'alignment in general is hard, but once we solve generic alignment, alignment with billions/trillions of specific beings and specific values will be easy'? I'm truly puzzled on this point, and can't figure out how it became such a common view in AI safety.
One perhaps obvious point: if you make some rationality assumptions, there is a single unique solution to how those preferences should be aggregated. So if you are able to align an AI with a single individual, you can iterate this alignment with all the individuals and use Harsanyi's theorem to aggregate their preferences. This (assuming rationality) is the uniquely best method to aggregate preferences. There are criticisms to be made of this solution, but it at least seems reasonable, and I don't think there's an analogous simple "reasonably good" solution to aligning AI with an individual.
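For readers unfamiliar with it, the aggregation result invoked here can be stated compactly. This is a sketch of the standard textbook statement (notation introduced here, not from the comment above): if each individual $i$ has a von Neumann–Morgenstern utility function $U_i$ over lotteries, and the social preference also satisfies the vNM axioms plus a Pareto condition, then the social utility function must be an affine combination of the individual utilities:

```latex
% Harsanyi's aggregation theorem (sketch of the standard statement)
% For individuals i = 1,...,n with vNM utilities U_i over lotteries,
% there exist weights w_i >= 0 and a constant c such that, for every
% lottery p, the social utility W satisfies:
W(p) = \sum_{i=1}^{n} w_i \, U_i(p) + c
```

Note that the theorem fixes the functional form but not the weights $w_i$, which is one place where the criticisms mentioned above get their traction.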

Ben - thanks for the reminder about Harsanyi.

Trouble is, (1) the rationality assumption is demonstrably false, (2) there's no reason for human groups to agree to aggregate their preferences in this way -- any more than they'd be willing to dissolve their nation-states and hand unlimited power over to a United Nations that promises to use Harsanyi's theorem fairly and incorruptibly.

Yes, we could try to align AI with some kind of lowest-common-denominator aggregated human (or mammal, or vertebrate) preferences. But if most humans would not be happy with that strategy, it's a non-starter for solving alignment.

I agree that a lot of people believe alignment to any agent is the hard part, and aligning to a particular human is relatively easy, or a mere "AI capabilities" problem. Why? I think it's a sincere belief, but ultimately most people think it because it's an agreed assumption by the AIS community, held for a mixture of intrinsic and instrumental reasons. The intrinsic reasons are that a lot of the fundamental conceptual problems in AI safety seem not to care which human you're aligning the AI system to, e.g. the fact that human values are complex, that wireheading may arise, and that it's hard to describe how the AI system should want to change its values over time. The instrumental reason is that it's a central premise of the field, similar to the "DNA->RNA->protein->cellular functions" perspective in molecular biology. The vision for AIS as a field is that we try not to indulge futurist and political topics, and we try not to argue with each other about things like whose values to align the AI to. You can see some of this instrumentalist perspective in Eliezer's Coherent Extrapolated Volition paper.
9 · Geoffrey Miller · 3mo
Ryan - thanks for this helpful post about this 'central dogma' in AI safety. It sounds like much of this view may have been shaped by Yudkowsky's initial writings about alignment and coherent extrapolated volition? And maybe reflects a LessWrong ethos that cosmic-scale considerations mean we should ignore current political, religious, and ideological conflicts of values and interests among humans? My main concern here is that if this central dogma about AI alignment (that 'alignment to any agent is the hard part, and aligning to a particular human is relatively easy, or a mere "AI capabilities" problem', as you put it) is wrong -- then we may be radically underestimating the difficulty of alignment, and it might end up being much harder to align with the specific & conflicting values of 8 billion people and trillions of animals than it is to just 'align in principle' with one example agent. And that would be very bad news for our species. IMHO, one might even argue that failure to challenge this central dogma in AI safety is a big potential failure mode, and perhaps an X risk in its own right...
Yes, I personally think it was shaped by EY and that broader LessWrong ethos. I don't really have a strong sense of whether you're right about aligning to many agents being much harder than one ideal agent. I suppose that if you have an AI system that can align to one human, then you could align many of them to different randomly selected humans, and simulate debates between the resulting agents. You could then consult the humans regarding whether their positions were adequately represented in that parliament. I suppose it wouldn't be that much harder than just aligning to one agent. A broader thought is that you may want to be clear about how an inability to align to n humans would cause catastrophe. It could be directly catastrophic, because it means we make a less ethical AI. Or it could be indirectly catastrophic, because our inability to design a system that aligns to n humans makes nations less able to cooperate, exacerbating any arms race.
I think that it is unfair to characterize it as something that hasn't been questioned. It has in fact been argued for at length. See e.g. the literature on the inner alignment problem. I agree there are also instrumental reasons supporting this dogma, but even if there weren't, I'd still believe it and most alignment researchers would still believe it, because it is a pretty straightforward inference to make if you understand the alignment literature.
5 · Geoffrey Miller · 3mo
Could you please say more about this? I don't see how the so-called 'inner alignment problem' is relevant here, or what you mean by 'instrumental reasons supporting this dogma'. And it sounds like you're saying I'd agree with the AI alignment experts if only I understood the alignment literature... but I'm moderately familiar with the literature; I just don't agree with some of its key assumptions.
OK, sure. Instrumental reasons supporting this dogma: the dogma helps us all stay sane and focused on the mission instead of fighting each other, so we have reason to promote it that is independent of whether or not it is true. (By contrast, an epistemic reason supporting the dogma would be a reason to think it is true, rather than merely a reason to think it is helpful/useful/etc.) Inner alignment problem: well, it's generally considered to be an open unsolved problem. We don't know how to make the goals/values/etc of the hypothetical superhuman AGI correspond in any predictable way to the reward signal or training setup -- I mean, yeah, no doubt there is a correspondence, but we don't understand it well enough to say "Given such-and-such a training environment and reward signal, the eventual goals/values/etc of the eventual AGI will be so-and-so." So we can't make the learning process zero in on even fairly simple goals like "maximize the amount of diamond in the universe." For an example of an attempt to do so, a proposal that maaaybe might work, see []. Though actually this isn't even a proposal to get that; it's a proposal to get the much weaker thing of an AGI that makes a lot of diamond eventually.
6 · Geoffrey Miller · 3mo
Thanks; those are helpful clarifications. Appreciate it.
It may not be trivially easy, but:

  1. Weakly aligned AI may (or may not) allow you to muster the necessary power to force the world into a configuration where risk is lowered and research can continue on harder versions of alignment for longer durations.
  2. If AI or digital minds are themselves involved in doing the research that solves hard versions of alignment, "longer durations" may be replaceable with lots of compute running the weakly aligned AIs solving this.

Your point increases the importance of corrigibility for even "weaker" solutions, so our first deployment doesn't lock in values completely, and further changes are possible. I agree that on net it isn't trivial to conclude that this will be easy.
3 · Geoffrey Miller · 3mo
OK, let's say a foreign superpower develops what you're calling 'weakly aligned AI', and they do 'muster the necessary power to force the world into a configuration where [X risk] is lowered'... by, for example, developing a decisive military and economic advantage over other countries, imposing their ideology on everybody, and thereby reducing the risk of great-power conflict. I still don't understand how we could call such an AI 'aligned with humanity' in any broad sense; it would simply be aligned with its host government and their interests, and somewhat anti-aligned with everybody else. Maybe I've studied and taught game theory for too many decades, and I'm just too attuned to conflicts of interest and mixed-motive games. But I get the very uneasy feeling that the AI alignment community is sweeping some very big problems under the rug here.
Oh I agree with this! I was providing reasons why we may want to carry out research for stronger versions of alignment after weaker versions are solved and deployed, conditional on the actors in this position actually caring to carry out the stronger research. It wasn't an argument for why they should care in the first place. But most plans I've seen rely on hoping the actors in powerful positions care enough to do morally good things anyway. For instance, if you do solve a harder version of alignment on paper today (such as an algorithm for calculating humanity's CEV or whatever), you still don't have any guarantee this is what ends up deployed 50 years from now. How to prevent bad (or well-intentioned but still suboptimal) actors from getting this power is a harder problem, and I agree there should be more discussion of it.
9 · Geoffrey Miller · 3mo
Cool. That all makes sense. Seems like a lot of alignment research at the moment is analogous to physicists at Los Alamos National Labs running computer simulations to show that a next-generation nuke will reliably give a yield of 5 megatons plus or minus 0.1 megatons, and will not explode accidentally, and is therefore aligned with the Pentagon's mission of developing 'safe and reliable' nuclear weaponry.... and then saying 'We'll worry about the risks of nuclear arms races, nuclear escalation, nuclear accidents, nuclear winter, and nuclear terrorism later -- they're just implementation details'.

The situation is much worse than that. It's more like: they are worried about the possibility that the first ever nuclear explosion will ignite the upper atmosphere and set the whole earth ablaze. (As you've probably read, this was a real concern they had.) Except in this hypothetical the preliminary calculations are turning up the answer of Yes no matter how they run them. So they are continuing to massage the calculations and make the modelling software more realistic in the hopes of a No answer, and also advocating for changes to the design of the bomb that'll hopefully mitigate the risk, and also advocating for the whole project to slow down before it's too late, but the higher-ups have go fever & so it looks like in a few years the whole world will be on fire. Meanwhile, some other people are talking to the handful of Los Alamos physicists and saying "but even if the atmosphere doesn't catch on fire, what about arms races, accidents, terrorism, etc.?" and the physicists are like "lol yeah, that's gonna be a whole big problem if we manage to survive the first test, which unfortunately we probably won't. We'd be working on that problem if this one didn't take priority."

4 · Geoffrey Miller · 3mo
That's a vivid but perhaps all too accurate (and all too horrifying) analogy.
Yup, I basically agree! It would help if our Los Alamos was staffed by more than 200 people, at least some of whom had worked on the problem for 2+ years. I really hope that as the AI safety field grows, there will be people tackling these issues, but it isn't obvious right now that they will.
My perspective, I think, is that most of the difficulties that people think of as being the extra, hard part of one->many alignment are already present in one->one alignment. A single human is already a barely coherent mess of conflicting wants and goals interacting chaotically, and the strong form of "being aligned to one human" requires a solution that can resolve values conflicts between incompatible 'parts' of that human and find outcomes that are satisfactory to all interests. Expanding this to more than one person is a change of degree but not kind. There is a weaker form of "being aligned to one human" that's just like "don't kill that human and follow their commands in more or less the way they intend", and if that's all we can get then that only translates to "don't drive humanity extinct and follow the wishes of at least some subset of people", and I'd consider that a dramatically suboptimal outcome. At this point I'd take it, though.
2 · Geoffrey Miller · 2mo
Hi Robert, thanks for your perspective on this. I love your YouTube videos by the way -- very informative and clear, and helpful for AI alignment newbies like me. My main concern is that we still have massive uncertainty about what proportion of 'alignment with all humans' can be solved by 'alignment with one human'. It sounds like your bet is that it's somewhere above 50% (maybe?? I'm just guessing); whereas my bet is that it's under 20% -- i.e. I think that aligning with one human leaves most of the hard problems, and the X risk, unsolved. And part of my skepticism in that regard is that a great many humans -- perhaps most of the 8 billion on Earth -- would be happy to use AI to inflict harm, up to and including death and genocide, on certain other individuals and groups of humans. So, AI that's aligned with frequently homicidal/genocidal individual humans would be AI that's deeply anti-aligned with other individuals and groups.
1 · Jakub Kraus · 3mo
Intent alignment seeks to build an AI that does what its designer wants. You seem to want an alternative: build an AI that does what is best for all sentient life (or at least for humanity). Some reasons that we (maybe) shouldn't focus on this problem:

  1. It seems horribly intractable (but I'd love to hear your ideas for solutions!) at both a technical and philosophical level -- this is my biggest qualm.
  2. With an AGI that does exactly what Facebook engineer no. 13,882 wants, we "only" need that engineer to want things that are good for all sentient life.
  3. (Maybe) scenarios with advanced AI killing all sentient life are substantially more likely than scenarios with animal suffering.

There are definitely counterarguments to these. E.g. maybe animal suffering scenarios are still higher expected value to work on because of their severity (imagine factory farms continuing to exist for billions of years).
3 · Geoffrey Miller · 3mo
It sounds quite innocuous to 'build an AI that does what its designer wants' -- as long as we ignore the true diversity of what its designers (and users) might actually want. If an AI designer or user is a misanthropic nihilist who wants humanity to go extinct, or is a religious or political terrorist, or is an authoritarian censor who wants to suppress free speech, then we shouldn't want the AI to do what they want. Is this problem 'horribly intractable'? Maybe it is. But if we ignore the truly, horribly intractable problems in AI alignment, then we increase X risk. I increasingly get the sense that AI alignment as a field is defining itself so narrowly, and limiting the alignment problems that it considers 'legitimate' so narrowly, that we could end up in a situation where alignment looks 'solved' at a narrow technical level, and this gives reassurance to corporate AI development teams that they can go full steam ahead towards AGI -- but where alignment is very, very far from solved at the actual real-world level of billions of diverse people with seriously conflicting interests.
3 · Jakub Kraus · 3mo
Totally agree that intent alignment does basically nothing to solve misuse risks. To weigh the importance of misuse risks, we should consider (a) how quickly AI to AGI happens, (b) whether the first group to deploy AGI will use it to prevent other groups from developing AGI, (c) how quickly AGI to superintelligence happens, (d) how widely accessible AI will be to the public as it develops, (e) the destructive power of AI misuse at various stages of AI capability, etc. Paul Christiano's 2019 EAG-SF talk highlights how there are so many other important subproblems within "make AI go well" besides intent alignment. Of course, Paul doesn't speak for "AI alignment as a field."
ICYMI: Steering AI to care for animals, and soon discusses this, as do some posts in this topic.
2 · Geoffrey Miller · 3mo
Thank you! Appreciate the links.

What things would make people less worried about AI safety if they happened? What developments in the next 0-5 years should make people more worried if they happen?

On the good side:

  1. We hit some hard physical limit on computation which dramatically slows the relevant variants of Moore's law.
  2. A major world power wakes up and starts seriously focusing on AI safety as a top priority.
  3. We build much better infrastructure for scaling the research field (e.g. AI tools to automatically connect people with relevant research, and help accelerate learning with contextual knowledge of what a person has read and is interested in, likely in the form of a unified feed (funding offers welcome!)). Apart's AI Safety Ideas also falls into the category of key infrastructure.
  4. Direct progress on alignment: some research paradigm emerges which seems likely to result in a real solution.
  5. A dramatic slowing in the rate of capabilities breakthroughs, e.g. discovering some crucial category of task which needs fundamental breakthroughs that are not quick to be achieved.
  6. More SBFs entering the funding ecosystem.

And on the bad side:

  1. Lots of capabilities breakthroughs.
  2. Stagnation or unhealthy dynamics in the alignment research space (e.g. vultures, loss of collective sensemaking as we scale, self-promotion becoming the winning strategy).
  3. US-China race dynamics, especially both countries explicitly pushing for AGI without really good safety considerations.
  4. Funding crunch.

I'm going to repeat my question from the "Ask EA anything" thread. Why do people talk about artificial general intelligence, rather than something like advanced AI? For some AI risk scenarios, it doesn't seem necessary that the AI be "generally" intelligent.

There are some AI risks which don't require generality, but the ones which have the potential to be an x-risk will likely involve fairly general capabilities. In particular, capabilities to automate innovation, like OpenPhil's Process for Automating Scientific and Technological Advancement. Several other overlapping terms have been used, such as Transformative AI, AI existential safety, AI alignment, and AGI safety. We're planning to have a question on Stampy which covers these different terms. This Rob Miles video covers why the risks from this class of AI are likely the most important:
1 · Jakub Kraus · 3mo
I think there are also (unfortunately) some likely AI x-risks that don't involve general-purpose reasoning. For instance, so much of our lives already involves automated systems that determine what we read, how we travel, who we date, etc, and this dependence will only increase with more advanced AI. These systems will probably pursue easy-to-measure goals like "maximize user's time on the screen" and "maximize reported well-being," and these goals won't be perfectly aligned with "promote human flourishing." One doesn't need to be especially creative to imagine how this situation could create worlds in which most humans live unhappy lives (and are powerless to change their situation). Some of these scenarios would be worse than human extinction. There are more scenarios in "What failure looks like" and "What multipolar failure looks like" that don't require AGI. A counterargument is that we might eventually build AGI in these worlds anyways, at which point the concerns in Rob's talk become relevant. (Side note: from my perspective, Rob's talk says very little about why x-risk from AGI could be more pressing than x-risk from narrow AI.)

Why do you think AI Safety is tractable?

Related, have we made any tangible progress in the past ~5 years that a significant consensus of AI Safety experts agree is decreasing P(doom) or prolonging timelines?

Edit: I hadn't noticed there was already a similar question

3 · Jakub Kraus · 3mo
I share this concern and would like to see more responses to this question. Importance and neglectedness seem very high, but tractability is harder to justify. Given 100 additional talented people (especially people anywhere close to the level of someone like Paul Christiano) working on new research directions, it sounds intuitively absurd (to me) to say the probability of an AI catastrophe would not meaningfully decrease. But the only justification I can give is that generally people make progress when they work on research questions in related fields (e.g. in math, physics, computer science, etc, although the case is weaker for philosophy). "Significant consensus of AI Safety experts agree" is a high bar. Personally, I'm more excited about work that a smaller group of experts (e.g. Nate Soares) agree is actually useful. People disagree on what work is helpful in AI safety. Some might say a key achievement was that reward learning from human feedback gained prominence years before it would have otherwise (I think I saw Richard Ngo write this somewhere). Others might say that it was really important to clarify and formalize concepts related to inner alignment. I encourage you to read an overview of all the research agendas here (or this shorter cheat sheet) and come to your own conclusions. Throughout, I've been answering through the lens of "AI Safety = technical AI alignment research." My answer completely ignores some other important categories of

What are good ways to test your fit for technical AI Alignment research? And which ways are best if you have no technical background?

This is a really comprehensive post on pursuing a career in technical AI safety, including how to test fit and skill up.
Thank you - I had forgotten about that post and it was really helpful.

How would you recommend deciding which AI Safety orgs are actually doing useful work?
According to this comment, and my very casual reading of LessWrong, there is definitely no consensus on whether any given org is net-positive, net-neutral, or net-negative.

If you're working in a supporting role (e.g. engineering or HR) and can't really evaluate the theory yourself, how would you decide which orgs are net-positive to help?

Firstly, make sure you're working on safety rather than capabilities. Distinguishing who is doing the best safety work is challenging if you can't evaluate the research directly. Your best path is probably to find a person who seems trustworthy and competent to you, and get their opinions on the work of the organizations you're considering. This could be a direct contact, or a review such as Larks' annual one.

This is a great idea! I expect to use these threads to ask many many basic questions. 

One on my mind recently: assuming we succeed in creating aligned AI, whose values, or which values, will the AI be aligned with? We talk of 'human values', but humans have wildly differing values. Are people in the AI safety community thinking about this? Should we be concerned that an aligned AI's values will be set by (for example) the small team that created it, who might have idiosyncratic and/or bad values? 


Koen Holtman · 3mo
Yes. They think about this more on the policy side than on the technical side, but there is technical/policy crossover work too. Yes. There is significant talk about 'aligned with whom, exactly'. But many of the more technical papers and blog posts on x-risk-style alignment tend to ignore this part of the problem, or mention it in only one or two sentences and then move on. This does not necessarily mean that the authors are unconcerned about the question; more often it means they feel they have little new to say about it. If you want to see an example of a vigorous and occasionally politically sophisticated debate on solving the 'aligned with whom' question, instead of the moral philosophy 101/201 debate which is still the dominant form of discourse in the x-risk community, you can dip into the literature on AI fairness.
An AI could be aligned to something other than humanity's shared values, and this could potentially prevent most of the value in the universe from being realized. Nate Soares talks about this in Don't leave your fingerprints on the future. Most of the focus goes on being able to align an AI at all, as this is necessary for any win-state. There seems to be consensus among the relevant actors that seizing the cosmic endowment for themselves would be a Bad Thing. Hopefully this will hold.

What should people do differently to contribute to AI safety if they have long vs short timelines?

Activities which pay off over longer time horizons than your timelines should be dramatically downweighted. E.g. if your plan is to spend 6 years building career capital in a minimal-impact position and then leverage that into an AI Safety job, this is not a good plan if your timelines are 7 years. Generally, I advise moving as rapidly as possible into doing something which actually helps,[1] either directly as a researcher or by building and maintaining infrastructure which improves the alignment ecosystem and allows research to move in the right direction faster.

1. ^ I happen to know you're helping with impact markets! This seems like a great place to be if we have at least mid single-digit years, maybe even low single-digit years.
David Mears · 3mo
Here are my personal notes on this question.

I'm relatively unconvinced by most arguments I've read that claim deceptive alignment will be a thing (which I understand to be a model that intentionally behaves differently on its training data and test data to avoid changing its parameters in training). 

Most toy examples I've seen, or thought experiments, don't really seem to be examples of deceptive alignment, since the model is actually trained on the "test" data in these examples. For example, while humans can deceive their teachers in etiquette school and then use poor manners outside the school...

Ben Millwood · 3mo
In my mind this rests on:
* Figuring out that you're an ML agent being trained by RL is very difficult, and probably beyond the reach of current capabilities, which is why we don't see "real" deceptive alignment so far.
* However, it's in principle possible, and helped by the fact that your training data probably includes information about what RL is, etc. Seems like an AGI could do it.
* Explicitly modelling your reward function seems like it would be really helpful for getting more reward, so RL agents might be pushed in the direction of this explicit modelling.
* Once you have "optimize my reward function" as an instrumental goal, you're pretty much as good at getting reward as any other agent that has that as an instrumental goal, regardless of what your terminal goal is, so at this point you may appear to be aligned while in reality your true terminal goals are just no longer directing your behaviour for as long as you believe you're in training.
* It's hard to really comprehensively hide from the agent whether it's in training or not, so it should be relatively easy for the agent to optimize for reward in training and then do something else when deployed.
Thanks! I think most of this made sense to me. I'm a bit fuzzy on the fourth bullet. Also, I'm still confused why a model would even develop an alternative goal to maximizing its reward function, even if it's theoretically able to pursue one.
Gradient descent modifies parameters in a way such that the model gets "good" rewards. We don't actually know how to get "good" parameters directly; the reward is the main lever of control we have as humans. You get one possible argument for deceptive alignment if you buy:
- RL agents with coherent preference functions will tend to be deceptively aligned by default.
- Models with more coherent preferences are more powerful than less coherent models (even if they're DL models, not RL).
- We will select models for those that are more powerful. (By "select" I mean a lab selecting between multiple trained models and architectures, rather than the training of a single model.)
(There are probably other lines of argumentation.) You are assuming that "does well on the given loss function" and "is deceptively aligned" are orthogonal in the gradient descent landscape because the two things sound different from each other; this doesn't have to be the case.
"RL agents with coherent preference functions will tend to be deceptively aligned by default." - Why?
I have a couple of videos that talk about this! This one sets up the general idea: This one talks about how likely this is to happen in practice:

Should we expect power-seeking to often be naturally found by gradient descent? Or should we primarily expect it to come up when people are deliberately trying to make power-seeking AI, and train the model as such?

With powerful enough systems, convergent instrumental goals emerge, and inner alignment is something that needs to be addressed (i.e. stopping unintended misaligned agents emerging within the model). Optimal Policies Tend To Seek Power.
Right, so I'm pretty on board with optimal policies (i.e., "global maximum" policies) usually involving power-seeking. However, gradient descent only finds local maxima, not global maxima. It's unclear to me whether these local maxima would involve something like power-seeking. My intuition for why this might not be the case is that "small tweaks" in the direction of power-seeking would probably not reap immediate benefits, so gradient descent wouldn't go down this path. This is where my question kind of arose from. If you have empirical examples of power-seeking coming up in tasks where it's nontrivial that it would come up, I'd find that particularly helpful. Does the paper you sent address this? If so, I'll spend more time reading it.
Afaik, it remains an open area of research to find examples of emergent power-seeking in real ML systems. Finding such examples would do a lot for raising the alarm about AGI x-risk I think.
Ok, cool, that's helpful to know. Is your intuition that these examples will definitely occur and we just haven't seen them yet (due to model size or something like this)? If so, why?
My intuition is that they will occur, hopefully before it's too late (but it's possible that due to incentives for deception etc. we may not see it before it's too late). More here: Evaluating LM power-seeking.
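The "local vs. global optimum" intuition from earlier in this thread can be illustrated with a toy numerical sketch: gradient ascent only follows the local slope, so if a hypothetical "power-seeking" policy sits at a distant, higher peak separated by a low-reward valley, ordinary gradient steps never reach it. The reward landscape below is entirely invented for illustration; it says nothing about real training dynamics.

```python
import numpy as np

# Toy 1-D "reward landscape" (entirely invented for illustration):
# a modest local optimum near x = 0 and a higher one near x = 4,
# standing in for "ordinary" vs. "power-seeking" policies.
def reward(x):
    return np.exp(-x**2) + 2.0 * np.exp(-(x - 4.0) ** 2)

def grad(x, eps=1e-5):
    # central-difference numerical derivative of the reward
    return (reward(x + eps) - reward(x - eps)) / (2 * eps)

def ascend(x, lr=0.1, steps=2000):
    # plain gradient ascent: follows the local slope only
    for _ in range(steps):
        x += lr * grad(x)
    return x

# Initialized near the modest optimum, ascent never crosses the
# low-reward valley to reach the higher optimum at x = 4.
print(ascend(0.5))  # stays near 0
print(ascend(3.0))  # reaches ~4 only because it started nearby
```

Whether real loss landscapes have this shape with respect to power-seeking behaviour is exactly the open empirical question discussed above.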

Generally, prosaic alignment is much more tractable than conceptual or mathematical alignment, at least if we're talking about where you will make marginal progress if you started now – @acylhalide

What is prosaic alignment? What are examples for prosaic alignment?

Working on modern-day ML systems and trying to get "good" behaviour out of them. This could include increasing their interpretability, robustness, or alignment (if you follow Dan Hendrycks' classification). Conceptual alignment focuses more on speculating about future systems and which concepts we currently understand would carry over to them; it is half-philosophy.

To what extent is AI alignment tractable? 

I'm especially interested in subfields that are very tractable – as well as fields that are not tractable at all  (with people still working there). 

Generally, prosaic alignment is much more tractable than conceptual or mathematical alignment, at least if we're talking about where you will make marginal progress if you started now. The primary reason to work on conceptual or mathematical work is if you buy that there are fundamental problems that working on prosaic alignment is going to fail to solve in time. Some buy this, some don't (out of all the people who buy x-risk arguments). (I don't have a strong personal opinion on this question yet.)

Won't other people take care of this - why should I additionally care?

I can't track it down, but there is a tweet by (I think) Holden Karnofsky, who runs Open Philanthropy, where he says that people sometimes tell him that he's got it covered, and he wants to shout "NO WE DON'T". We're very, very far from a safe path to a good future. It is a hard problem and we're not rising to the challenge as a species. Why should you care? You and everyone you've ever known will die if we get this wrong.
But people are already dying - no "mights" required - so why not focus on more immediate problems like global health and development?
Jay Bailey · 3mo
A 10% chance of a million people dying is as bad as 100,000 people dying with certainty, if you're risk-neutral. Essentially that's the main argument for working on a speculative cause like AGI - if there's a small chance of the end of humanity, that still matters a great deal. As for "Won't other people take care of this?" - you could make that same argument about global health and development, too. More people are good for increasing the potential impact of both fields. (Also worth noting - EA as a whole does devote a lot of resources to global health and development; you just don't see as many posts about it because there's less to discuss/argue about.)
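The risk-neutral comparison in the comment above is plain expected-value arithmetic. As a sanity check, with the comment's own (purely illustrative) numbers:

```python
# Expected deaths under the speculative scenario from the comment above.
p_catastrophe = 0.10
deaths_if_catastrophe = 1_000_000
speculative = p_catastrophe * deaths_if_catastrophe  # expected value

# Deaths that occur with certainty in the comparison scenario.
certain = 100_000

# Equal in expectation, so a risk-neutral decision-maker is indifferent.
print(speculative, certain)
```

Of course, how risk-neutral one should actually be about extinction-level outcomes is itself a contested question.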

A single AGI with its utility function seems almost impossible to make safe.  What happens if you have a population of AGIs each with its own utility function?  Probably dangerous to make biological analogies ... but biological systems are often kept stable by the interplay between agonist and antagonist processes.  For instance, one AGI in the population wants to collect stamps and another wants to keep that from happening.

To what degree is having good software engineering experience helpful for AI Safety research?

AI Safety Needs Great Engineers.

Related question to the one posed by Yadav: 

Does the fact that OpenAI and DeepMind have AI Safety teams factor significantly into AI x-risk estimates? 

My independent impression is that it's very positive, but I haven't seen this factor being taken explicitly into account in risk estimates.

According to the CLR, since resource acquisition is an instrumental goal regardless of the AGI's utility function, it is possible that such a goal can lead to a race where each AGI can threaten others such that the target has an incentive to hand over resources or comply with the threatener's demands. Is such a conflict scenario (potentially leading to x-risks) between two AGIs possible if these two AGIs have different intelligence levels? If so, isn't there a level of intelligence gap at which x-risks become unlikely? How to characterize this f...

Angélina · 3mo
Please red-team my comment, I may be talking nonsense, but: actually, my guess is that the probability of the threat being executed is a decreasing function of the intelligence gap between the two AGIs. The rationales behind this statement are the following:
* The more intelligent the agents are, the less they will need to execute their threat (because a threat is an act that the agent would not prefer to perform per se, and therefore if the agent is sufficiently intelligent - in particular, if she is a very good forecaster, e.g. can predict under which conditions her opponent will yield - she designs her threat accordingly such that she will not have to execute it in the end).
* If an AGI is way smarter than the other AGI, she will probably succeed in making her threat such that the other AGI complies with her demand, and thus the threat will not be carried out.
However, if this claim is true it leads to a kind of paradox: if timelines towards AGI are short and we're in a fast take-off scenario, then x-risks are less likely to occur in conflict scenarios, but at the same time it's more likely that AGI will not be aligned, thus potentially leading to catastrophic scenarios, including x-risk scenarios. The rationales/premises behind this statement are the following:
* If timelines towards AGI are short and we're in a fast take-off scenario, then it's more likely that different AGIs have significantly different levels of intelligence from each other (is this a sufficient condition, btw?).
* If timelines towards AGI are short and we're in a fast take-off scenario, then it's more likely that any AGI will not be aligned.
* If AGI is not aligned, catastrophic scenarios, including x-risks, are more likely to occur.
* The first claim I mentioned above.
Red-team this comment please! I'm pretty sure I've misunderstood something, or perhaps there is a logical flaw in my reasoning.

If AGI has several terminal goals, how does it classify them? Some kind of linear combination?

The RL agent formalism accepts a single utility function; the DL formalism accepts a single loss function. You need to either find a way to aggregate your "several terminal goals" into a single function, or define a different formalism that natively works with multiple goals.
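The "linear combination" from the question is the simplest such aggregation: fold several loss terms into one scalar with fixed weights. A minimal sketch - the goal names and weights below are invented placeholders, and choosing the weights well is exactly where the difficulty lies:

```python
# Scalarizing several "terminal goals" into one objective via a
# weighted (linear) combination. Names and weights are invented.
def combined_loss(losses, weights):
    """Weighted sum: L = sum_i w_i * L_i."""
    return sum(weights[name] * losses[name] for name in losses)

losses = {"task_error": 0.8, "safety_penalty": 0.2, "energy_cost": 0.5}
weights = {"task_error": 1.0, "safety_penalty": 10.0, "energy_cost": 0.1}

# 1.0*0.8 + 10.0*0.2 + 0.1*0.5 = 2.85 (up to float rounding)
print(combined_loss(losses, weights))
```

Nonlinear aggregations (e.g. lexicographic orderings or constrained optimization) are alternative formalisms when a weighted sum doesn't capture the intended trade-offs.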

I have the feeling that there is a tendency in the AI safety community to think that if we solve the alignment problem, we're done and the future must necessarily be flourishing (I observe that some EAs say that either we go extinct or it's heaven on earth depending on the alignment problem, in a very binary way). However, it seems to me that post-aligned-AGI scenarios merit attention as well: game theory provides us sufficient rationale to state that even rational agents (in this case, >2 AGIs) can take sub-optimal decisions (including catastrophic scenarios) when faced with certain social dilemmas. Any thoughts on this, please?

I think to the extent that there would be post-AGI sub-optimal decision making (or catastrophe), that would basically be a failure of alignment (i.e. the alignment problem would not in actual fact have been solved!). More concretely, there are many things that need aligning beyond single human : single AGI, the most difficult being multi-human : multi-AGI, but there is also alignment needed at every relevant step in the human decision-making chain.

Why is there so much more talk about the existential risk from AI as opposed to the amount by which individuals (e.g. researchers) should expect to reduce these risks through their work? 

The second number seems much more decision-guiding for individuals than the first. Is the main reason that it's much harder to estimate? If so, why?

Here is an attempt by Jordan Taylor: Expected ethical value of a career in AI safety (which you can plug your own numbers into).

I think EA has the resources to make the alignment problem go viral, at least in STEM circles. Wouldn't that be good? I'm not asking if it would be an effective way of doing good, just a way.

Because I'm surprised that not even AI doomers seem to be trying to reach the mainstream.

Let
A = the set of people who have heard AI risk arguments and buy them
B = the set of people who have heard AI risk arguments and don't buy them
We are not maximising |A| indifferent to |B|; we're maximising f(A,B) for some specific but unknown f. People who don't buy the arguments are a downside for the movement (if the arguments are correct), as they can (among other things) damage the reputability of AI risk ideas in the eyes of anyone who hasn't yet seriously engaged with them and is deciding whether or not to. Different people have different views on what f is. And of course, not all people in A and B can be weighted equally; a few people wield very asymmetric power and legitimacy and hence count for more. On this topic I personally liked Samo Burja's series.
Interesting. I'm not sure I understood the first part and what f(A,B) is. In the example that you gave, B is only relevant with respect to how much it affects A ("damage the reputability of the AI risk ideas in the eye of anyone who hasn't yet seriously engaged with them and is deciding whether or not to"). So, in a way, you are still trying to maximize |A| (or probably a subset of it: people who can also make progress on it, |A'|). But in "among other things" I guess that you could be thinking of ways in which B could oppose A, so maybe that's why you want to reduce it too. The thing is, I have problems visualizing most of B opposing A, and what that subset (B') could even be able to do (as I said, outside of reducing |A|). I think that is my biggest argument: that B' is a really small subset of B, and I don't fear them.

Now, if your point is that to maximize |A| you have to keep B in mind, and so it would be better to have more 'legitimacy' on the alignment problem before making it viral, you are right. So is there progress on that? Is the community-building plan to convert authorities in the field to A before reaching the mainstream, then?

Also, are people who try to disprove the alignment problem in B? If that's the case, I don't know if our objective should be to maximize |A'|. I'm not sure if we can reach a superintelligence with AI, so I don't know if it wouldn't be better to think about maximizing the number of people trying to solve OR dissolve the alignment problem.

If we consider that most people probably wouldn't feel strongly about one side or the other (debatable), then I don't think it's that big of a deal bringing the discussions more to the mainstream. If AI risk arguments include that, no matter how uncertain researchers are about the problem, given what's at stake we should lower the chances, then I see B and B' as even smaller. But maybe I'm too optimistic/marketplacer/memer.

Lastly, the maximum size of A is smaller the shorter the timelines. Are people with shor...
Sorry, I won't be able to respond to your full comment, but: B wields "intellectual legitimacy". If I look at the AI field and 9 out of 10 research subfields think the 10th one is crackpottery, and policymakers think the 10th one is crackpottery, and journalists think the same, and most people I meet on the street with an opinion on it think the same, then I'm much less likely to read anything people in the 10th field have said. There are plenty of people claiming doom on earth even today for various reasons, and I've not spent time reading what they have to say, because I have finite time in my life and can't go after comprehensive proofs for everything. I can't go out and prove that every person claiming they experienced a miracle from god is wrong about having had this experience, for instance.
Oh, I didn't know that the field was so against AI x-risks. Because when I saw this survey, 5-10% x-risk seemed enough to take it seriously. Is that survey not representative? Or is there a gap between people recognizing the risks and giving them legitimacy?
I think it varies, and the picture is slightly better than what I mentioned (that was just hypothetical), but there are many AI researchers who think x-risk people are not only incorrect but unscientific/unreputable/lacking technical knowledge in AI. People thinking the latter is more damaging to AI risk's legitimacy (conditional on x-risk actually existing).

If someone was looking to get a graduate degree in AI, what would they look for? Is it different for other grad schools?

Specific to AGI Safety, I would recommend trying to join relevant active research groups in academia, i.e. getting into PhDs they are offering, or perhaps first doing a relevant Masters - ML/AI, or even Philosophy or Maths - at the university they are based at. There are some groups listed here (see the table). Universities include Berkeley (Stuart Russell, Jacob Steinhardt), Cambridge (David Krueger), NYU (Sam Bowman), and Oxford (FHI).

How dependent is current AGI safety work on deep RL? Recently there has been a lot of emphasis on advances in deep RL (and ML more generally), so it would be interesting to know what the implications would be if it turns out that this particular paradigm cannot lead to AGI.

Koen Holtman · 3mo
There is some AGI safety work that specifically targets deep RL, under the assumption that deep RL might scale to AGI. But there is also a lot of other work, both on failure modes and on solutions, that is much more independent of the method being used to create the AGI. I do not have percentages on how it breaks down. Things are in flux. A lot of the new technical alignment startups seem to be mostly working in a deep RL context, but a significant part of the more theoretical work, and even some of the experimental work, involves reasoning about a very broad class of hypothetical future AGI systems, not just those that might be produced by deep RL.