All AGI Safety questions welcome (especially basic ones) [~monthly thread]

robertskmiles

Effective Altruism Forum

by robertskmiles

Nov 1 2022 · 3 min read

Tags: AI safety, Ask Me Anything

tl;dr: Ask questions about AGI Safety as comments on this post, including ones you might otherwise worry seem dumb!

Asking beginner-level questions can be intimidating, but everyone starts out not knowing anything. If we want more people in the world who understand AGI safety, we need a place where it's accepted and encouraged to ask about the basics.

We'll be putting up monthly FAQ posts as a safe space for people to ask all the possibly-dumb questions that may have been bothering them about the whole AGI Safety discussion, but which until now they didn't feel able to ask.

It's okay to ask uninformed questions, and not worry about having done a careful search before asking.

Stampy's Interactive AGI Safety FAQ

Additionally, this will serve as a way to spread the project Rob Miles' volunteer team^[1] has been working on: Stampy - which will be (once we've got considerably more content) a single point of access into AGI Safety, in the form of a comprehensive interactive FAQ with lots of links to the ecosystem. We'll be using questions and answers from this thread for Stampy (under these copyright rules), so please only post if you're okay with that! You can help by adding other people's questions and answers to Stampy or getting involved in other ways!

We're not at the "send this to all your friends" stage yet, we're just ready to onboard a bunch of editors who will help us get to that stage :)

**Stampy** - Here to help everyone learn about ~~stamp maximization~~ AGI Safety!

We welcome feedback^[2] and questions on the UI/UX, policies, etc. around Stampy, as well as pull requests to his codebase. You are encouraged to add other people's answers from this thread to Stampy if you think they're good, and collaboratively improve the content that's already on our wiki.

We've got a lot more to write before he's ready for prime time, but we think Stampy can become an excellent resource for everyone from skeptical newcomers, through people who want to learn more, right up to people who are convinced and want to know how they can best help with their skillsets.

PS: Based on feedback that Stampy would not be serious enough for serious people, we built an alternate skin for the frontend which is more professional: Alignment.Wiki. We're likely to move one more time to aisafety.info; feedback welcome.

Guidelines for Questioners:

No previous knowledge of AGI safety is required. If you want to watch a few of the Rob Miles videos, or read either the WaitButWhy posts or the Most Important Century summary from OpenPhil's co-CEO first, that's great, but it's not a prerequisite to ask a question.
Similarly, you do not need to try to find the answer yourself before asking a question (but if you want to test Stampy's in-browser TensorFlow semantic search, that might get you an answer quicker!).
Also feel free to ask questions that you're pretty sure you know the answer to, but where you'd like to hear how others would answer the question.
One question per comment if possible (though if you have a set of closely related questions that you want to ask all together that's ok).
If you have your own response to your own question, put that response as a reply to your original question rather than including it in the question itself.
Remember, if something is confusing to you, then it's probably confusing to other people as well. If you ask a question and someone gives a good response, then you are likely doing lots of other people a favor!

Guidelines for Answerers:

Linking to the relevant canonical answer on Stampy is a great way to help people with minimal effort! Improving that answer means that everyone going forward will have a better experience!
This is a safe space for people to ask stupid questions, so be kind!
If this post works as intended then it will produce many answers for Stampy's FAQ. It may be worth keeping this in mind as you write your answer. For example, in some cases it might be worth giving a slightly longer / more expansive / more detailed explanation rather than just giving a short response to the specific question asked, in order to address other similar-but-not-precisely-the-same questions that other people might have.

Finally: Please think very carefully before downvoting any questions. Remember, this is the place to ask stupid questions!

^[1] If you'd like to join, head over to Rob's Discord and introduce yourself!
^[2] Via the feedback form.


Yadav

Nov 2 2022

If companies like OpenAI and DeepMind have safety teams, it seems to me that they anticipate that speeding up AI capabilities can be very bad, so why don't they press the brakes on their capabilities research until we come up with more solutions to alignment?

Greg_Colbourn ⏸️

Nov 5 2022

Possible reasons:

* Their leaderships (erroneously) believe the benefits outweigh the risk (i.e. they don't appreciate the scale of the risk, or just don't care enough even if they do?).
* They worry about losing the race to AGI to competitors who aren't as aligned (with safety), and they aren't taking heed of the warnings of Szilárd and Ellsberg re race dynamics. Perhaps if they actually took the lead in publicly (and verifiably) slowing down, that would be enough to slow down the whole field globally, given that they are the leaders. There isn't really a precedent for this for something so globally important, but perhaps the "killing" of the electric car (that delayed development by ~a decade?) is instructive.

Ben_West🔸

Nov 2 2022

In this post, one of the lines of discussion is about whether values are "fragile". My summary, which might be wrong:

Katja says maybe they aren't: for example, GANs seem to make faces which are pretty close to normal human faces
Nate responds: sure, but if you used the discriminator to make the maximally face-like object, this would not at all be like a human face
cfoster0 replies: yeah but nobody uses GANs that way

And then I lose the thread. Why isn't cfoster0's response compelling?

plex

Nov 3 2022

Great question! I think the core of the answer comes down to the fact that the real danger of AI systems does not come from tools, but from agents. There are strong incentives to build more agenty AIs, agenty AIs are more useful and powerful than tools, it's likely to be relatively easy to build agents once you can build powerful tools, and tools may naturally slide into becoming agents at a certain level of capability.

If you're a human directing a tool, it's pretty easy to point the optimization power of the tool in beneficial ways. Once you have a system which has its own goals which it's maximizing for, then you have much bigger problems.

Consequentialists seek power more effectively than other systems, so when you're doing a large enough program search with a diverse training task attached to a reinforcement signal they will tend to be dominant. Internally targetable maximization-flavored search is an extremely broadly useful mechanism which will be stumbled on and upweighted by gradient descent. See Rohin Shah's AI risk from Program Search threat model for more details. The system which emerges from recursive self-improvement is likely to be a maximizer of some kind. And maximizing AI is dangerous (and hard to avoid!), as explored in this Rob Miles video.

To tie this back to your question: Weak and narrow AIs can be safely used as tools, we can have a human in the outer loop directing the optimization power. Once you have a system much smarter than you, the thing it ends up pointed at maximizing is no longer corrigible by default, and you can't course correct if you misspecified the kind of facelikeness you were asking for. Specifying open ended goals for a sovereign maximizer to pursue in the real world which don't kill everyone is an unsolved problem.
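As a rough illustration of the gap between sampling face-like outputs and maximizing a discriminator's score, here is a toy sketch. It is not a real GAN: the "faces" are just 2D points near (1, 1), the "non-faces" are points near (0, 0), and the discriminator is a small logistic regression. All names, data, and hyperparameters are assumptions made up for this example.

```python
import math
import random

random.seed(0)

# Toy stand-ins (assumptions): "real faces" cluster near (1, 1), "non-faces" near (0, 0).
REAL_MEAN = (1.0, 1.0)
real = [(random.gauss(1.0, 0.2), random.gauss(1.0, 0.2)) for _ in range(200)]
fake = [(random.gauss(0.0, 0.2), random.gauss(0.0, 0.2)) for _ in range(200)]
data = [(x, 1.0) for x in real] + [(x, 0.0) for x in fake]


def logit(w, bias, x):
    """Discriminator's raw 'face-likeness' score for the point x."""
    return w[0] * x[0] + w[1] * x[1] + bias


def train_discriminator(data, steps=300, lr=0.5):
    """Fit a logistic-regression discriminator with full-batch gradient descent."""
    w, bias = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-logit(w, bias, x)))
            gw[0] += (p - y) * x[0] / n
            gw[1] += (p - y) * x[1] / n
            gb += (p - y) / n
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        bias -= lr * gb
    return w, bias


def maximize_score(w, start, steps=200, lr=0.5, bound=10.0):
    """Gradient-ascend the input itself to maximize the score, inside a box."""
    x = list(start)
    for _ in range(steps):
        # The gradient of a linear logit with respect to x is just w.
        x = [max(-bound, min(bound, xi + lr * wi)) for xi, wi in zip(x, w)]
    return tuple(x)


def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


w, bias = train_discriminator(data)
typical = real[0]                     # roughly what sampling face-like data gives you
extreme = maximize_score(w, typical)  # what maximizing the score gives you

print("typical sample: ", typical, "distance from real mean:", dist(typical, REAL_MEAN))
print("score maximizer:", extreme, "distance from real mean:", dist(extreme, REAL_MEAN))
```

In this toy setup the score maximizer scores higher than any real sample while sitting far away from all of them. The sketch only shows that "scores highly" and "looks like the training data" can come apart under strong optimization; it says nothing about how real GANs or real agents behave.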

[anonymous]

Nov 2 2022

Asking beginner-level questions can be intimidating

To make it even less intimidating, maybe next time include a Google Form where people can ask questions anonymously that you'll then post in the thread a la https://www.lesswrong.com/posts/8c8AZq5hgifmnHKSN/agi-safety-faq-all-dumb-questions-allowed-thread?commentId=xm8TzbFDggYcYjG5e ?

plex

Nov 3 2022

Yes, this seems like a good idea, I'll do that next time unless someone else is up for handling incoming questions.

Alejandro Acelas

Nov 2 2022

What stops AI Safety orgs from just hiring ML talent outside EA for their junior/more generic roles?

JakubK

Nov 4 2022

I'd love to see a detailed answer to this question. I think a key bottleneck for AI alignment at the moment is finding people who can identify research directions (and then lead relevant projects) that might actually reduce x-risk, so I'm also confused why some career guides include software and ML engineering as one of the best ways to contribute. I struggle to see how software and ML engineering could be a bottleneck given that there are so many talented software and ML engineers out there. Counterpoint: infohazards mean you can't just hire anyone.

Geoffrey Miller

Nov 2 2022

This is a question about some anthropocentrism that seems latent in the AI safety research that I've seen so far:

Why do AI alignment researchers seem to focus only on aligning with human values, preferences, and goals, without considering alignment with the values, preferences, and goals of non-human animals?

I see a disconnect between EA work on AI alignment and EA work on animal welfare, and it's puzzling to me, given that any transformative AI will transform not just 8 billion human lives, but trillions of other sentient lives on Earth. Are any AI researchers trying to figure out how AI can align with even simple cases like the interests of the few species of pets and livestock?

If we view AI development not just as a matter of human technology, but as a 'major evolutionary transition' for life on our planet more generally, it would seem prudent to consider broader issues of alignment with the other 5,400 species of mammals, the other 45,000 species of vertebrates, etc...

kokotajlod

Nov 2 2022

IMO: Most of the difficulty in technical alignment is figuring out how to robustly align to any particular values whatsoever. Mine, yours, humanity's, all sentient life on earth, etc. All roughly equally difficult. "Human values" is probably the catchphrase mostly for instrumental reasons--we are talking to other humans, after all, and in particular to liberal egalitarian humans who are concerned about some people being left out and especially concerned about an individual or small group hoarding power. Insofar as lots of humans were super concerned that animals would be left out too, we'd be saying humans+animals. The hard part isn't deciding who to align to, it's figuring out how to align to anything at all.

Geoffrey Miller

Nov 2 2022

I've heard this argument several times, that once we figure out how to align AI with the values of any sentient being, the rest of AI alignment with all the other billions/trillions of different sentient beings will be trivially easy. I'm not at all convinced that this is true, or even superficially plausible, given the diversity, complexity, and heterogeneity of values, and given that sentient beings are severely & ubiquitously unaligned with each other (see: evolutionary game theory, economic competition, ideological conflict). What is the origin of this faith among the AI alignment community that 'alignment in general is hard, but once we solve generic alignment, alignment with billions/trillions of specific beings and specific values will be easy'? I'm truly puzzled on this point, and can't figure out how it became such a common view in AI safety.

Ben_West🔸

Nov 2 2022

One perhaps obvious point: if you make some rationality assumptions, there is a single unique solution to how those preferences should be aggregated. So if you are able to align an AI with a single individual, you can iterate this alignment with all the individuals and use Harsanyi's theorem to aggregate their preferences. This (assuming rationality) is the uniquely best method to aggregate preferences. There are criticisms to be made of this solution, but it at least seems reasonable, and I don't think there's an analogous simple "reasonably good" solution to aligning AI with an individual.
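A minimal sketch of the kind of aggregation Harsanyi's theorem points to: social utility as a weighted sum of individual utilities. The people, options, utility numbers, and weights below are all hypothetical, and the sketch assumes the individual utilities have already been elicited and put on a common scale, which the theorem itself does not tell you how to do.

```python
# Hypothetical utilities each individual assigns to each option (illustrative only).
utilities = {
    "alice": {"A": 1.0, "B": 0.6, "C": 0.0},
    "bob":   {"A": 0.0, "B": 0.7, "C": 1.0},
    "carol": {"A": 0.2, "B": 1.0, "C": 0.4},
}


def social_utility(option, utilities, weights):
    """Weighted sum of individual utilities for one option."""
    return sum(weights[person] * u[option] for person, u in utilities.items())


def best_option(utilities, weights):
    """Option with the highest aggregated (social) utility."""
    options = next(iter(utilities.values())).keys()
    return max(options, key=lambda o: social_utility(o, utilities, weights))


# Equal weights are an assumption of this sketch, not something the theorem fixes.
equal_weights = {person: 1.0 for person in utilities}
print(best_option(utilities, equal_weights))
```

With these made-up numbers and equal weights, option B has the highest total (2.3, against 1.2 for A and 1.4 for C). Changing the weights can change the winner, so the choice of weights carries real content.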

Geoffrey Miller

Nov 3 2022

Ben - thanks for the reminder about Harsanyi.

Trouble is, (1) the rationality assumption is demonstrably false, (2) there's no reason for human groups to agree to aggregate their preferences in this way -- any more than they'd be willing to dissolve their nation-states and hand unlimited power over to a United Nations that promises to use Harsanyi's theorem fairly and incorruptibly.

Yes, we could try to align AI with some kind of lowest-common-denominator aggregated human (or mammal, or vertebrate) preferences. But if most humans would not be happy with that strategy, it's a non-starter for solving alignment.

RyanCarey

Nov 2 2022

I agree that a lot of people believe alignment to any agent is the hard part, and aligning to a particular human is relatively easy, or a mere "AI capabilities" problem. Why? I think it's a sincere belief, but ultimately most people think it because it's an agreed assumption by the AIS community, held for a mixture of intrinsic and instrumental reasons. The intrinsic reasons are that a lot of the fundamental conceptual problems in AI safety seem not to care which human you're aligning the AI system to, e.g. the fact that human values are complex, that wireheading may arise, and that it's hard to describe how the AI system should want to change its values over time. The instrumental reason is that it's a central premise of the field, similar to the "DNA->RNA->protein->cellular functions" perspective in molecular biology. The vision for AIS as a field is that we try not to indulge futurist and political topics, and we try not to argue with each other about things like whose values to align the AI to. You can see some of this instrumentalist perspective in Eliezer's Coherent Extrapolated Volition paper: Presumably the prices have gone up with the increased EA wealth, and down again this year..

Geoffrey Miller

Nov 2 2022

Ryan - thanks for this helpful post about this 'central dogma' in AI safety. It sounds like much of this view may have been shaped by Yudkowsky's initial writings about alignment and coherent extrapolated volition? And maybe reflects a LessWrong ethos that cosmic-scale considerations mean we should ignore current political, religious, and ideological conflicts of values and interests among humans? My main concern here is that if this central dogma about AI alignment (that 'alignment to any agent is the hard part, and aligning to a particular human is relatively easy, or a mere "AI capabilities" problem', as you put it) is wrong -- then we may be radically underestimating the difficulty of alignment, and it might end up being much harder to align with the specific & conflicting values of 8 billion people and trillions of animals, than it is to just 'align in principle' with one example agent. And that would be very bad news for our species. IMHO, one might even argue that failure to challenge this central dogma in AI safety is a big potential failure mode, and perhaps an X risk in its own right....

RyanCarey

Nov 2 2022

Yes, I personally think it was shaped by EY and that broader LessWrong ethos. I don't really have a strong sense of whether you're right about aligning to many agents being much harder than one ideal agent. I suppose that if you have an AGI system that can align to one human, then you could align many of them to different randomly selected humans, and simulate debates between the resulting agents. You could then consult the humans regarding whether their positions were adequately represented in that parliament. I suppose it wouldn't be that much harder than just aligning to one agent. A broader thought is that you may want to be clear about how an inability to align to n humans would cause catastrophe. It could be directly catastrophic, because it means we make a less ethical AI. Or it could be indirectly catastrophic, because our inability to design a system that aligns to n humans makes nations less able to cooperate, exacerbating any arms race.

kokotajlod

Nov 2 2022

I think that it is unfair to characterize it as something that hasn't been questioned. It has in fact been argued for at length. See e.g. the literature on the inner alignment problem. I agree there are also instrumental reasons supporting this dogma, but even if there weren't, I'd still believe it and most alignment researchers would still believe it, because it is a pretty straightforward inference to make if you understand the alignment literature.

Geoffrey Miller

Nov 2 2022

Could you please say more about this? I don't see how the so-called 'inner alignment problem' is relevant here, or what you mean by 'instrumental reasons supporting this dogma'. And it sounds like you're saying I'd agree with the AI alignment experts if only I understood the alignment literature... but I'm moderately familiar with the literature; I just don't agree with some of its key assumptions.

kokotajlod

Nov 5 2022

OK, sure. Instrumental reasons supporting this dogma: The dogma helps us all stay sane and focused on the mission instead of fighting each other, so we have reason to promote it that is independent of whether or not it is true. (By contrast, an epistemic reason supporting the dogma would be a reason to think it is true, rather than merely a reason to think it is helpful/useful/etc.) Inner alignment problem: Well, it's generally considered to be an open unsolved problem. We don't know how to make the goals/values/etc of the hypothetical superhuman AGI correspond in any predictable way to the reward signal or training setup--I mean, yeah, no doubt there is a correspondence, but we don't understand it well enough to say "Given such-and-such a training environment and reward signal, the eventual goals/values/etc of the eventual AGI will be so-and-so." So we can't make the learning process zero in on even fairly simple goals like "maximize the amount of diamond in the universe." For an example of an attempt to do so, a proposal that maaaybe might work, see https://www.lesswrong.com/posts/k4AQqboXz8iE5TNXK/a-shot-at-the-diamond-alignment-problem Though actually this isn't even a proposal to get that, it's a proposal to get the much weaker thing of an AGI that makes a lot of diamond eventually.

Geoffrey Miller

Nov 6 2022

Thanks; those are helpful clarifications. Appreciate it.

robertskmiles

Nov 13 2022

My perspective, I think, is that most of the difficulties that people think of as being the extra, hard part of one->many alignment are already present in one->one alignment. A single human is already a barely coherent mess of conflicting wants and goals interacting chaotically, and the strong form of "being aligned to one human" requires a solution that can resolve values conflicts between incompatible 'parts' of that human and find outcomes that are satisfactory to all interests. Expanding this to more than one person is a change of degree but not kind. There is a weaker form of "being aligned to one human" that's just like "don't kill that human and follow their commands in more or less the way they intend", and if that's all we can get then that only translates to "don't drive humanity extinct and follow the wishes of at least some subset of people", and I'd consider that a dramatically suboptimal outcome. At this point I'd take it though.

Geoffrey Miller

Nov 13 2022

Hi Robert, thanks for your perspective on this. I love your YouTube videos by the way -- very informative and clear, and helpful for AI alignment newbies like me. My main concern is that we still have massive uncertainty about what proportion of 'alignment with all humans' can be solved by 'alignment with one human'. It sounds like your bet is that it's somewhere above 50% (maybe?? I'm just guessing); whereas my bet is that it's under 20% -- i.e. I think that aligning with one human leaves most of the hard problems, and the X risk, unsolved. And part of my skepticism in that regard is that a great many humans -- perhaps most of the 8 billion on Earth -- would be happy to use AI to inflict harm, up to and including death and genocide, on certain other individuals and groups of humans. So, AI that's aligned with frequently homicidal/genocidal individual humans would be AI that's deeply anti-aligned with other individuals and groups.

JakubK

Nov 4 2022

Intent alignment seeks to build an AI that does what its designer wants. You seem to want an alternative: build an AI that does what is best for all sentient life (or at least for humanity). Some reasons that we (maybe) shouldn't focus on this problem:

1. it seems horribly intractable (but I'd love to hear your ideas for solutions!) at both a technical and philosophical level -- this is my biggest qualm
2. with an AGI that does exactly what Facebook engineer no. 13,882 wants, we "only" need that engineer to want things that are good for all sentient life
3. (maybe) scenarios with advanced AI killing all sentient life are substantially more likely than scenarios with animal suffering

There are definitely counterarguments to these. E.g. maybe animal suffering scenarios are still higher expected value to work on because of their severity (imagine factory farms continuing to exist for billions of years).

Geoffrey Miller

Nov 4 2022

It sounds quite innocuous to 'build an AI that does what its designer wants' -- as long as we ignore the true diversity of what its designers (and users) might actually want. If an AI designer or user is a misanthropic nihilist who wants humanity to go extinct, or is a religious or political terrorist, or is an authoritarian censor who wants to suppress free speech, then we shouldn't want the AI to do what they want. Is this problem 'horribly intractable'? Maybe it is. But if we ignore the truly, horribly intractable problems in AI alignment, then we increase X risk. I increasingly get the sense that AI alignment as a field is defining itself so narrowly, and limiting the alignment problems that it considers 'legitimate' to consider so narrowly, that we could end up in a situation where alignment looks 'solved' at a narrow technical level, and this gives reassurance to corporate AI development teams that they can go full steam ahead towards AGI -- but where alignment is very very far from solved at the actual real-world level of billions of diverse people with seriously conflicting interests.

JakubK

Nov 5 2022

Totally agree that intent alignment does basically nothing to solve misuse risks. To weigh the importance of misuse risks, we should consider (a) how quickly AI to AGI happens, (b) whether the first group to deploy AGI will use it to prevent other groups from developing AGI, (c) how quickly AGI to superintelligence happens, (d) how widely accessible AI will be to the public as it develops, (e) the destructive power of AI misuse at various stages of AI capability, etc. Paul Christiano's 2019 EAG-SF talk highlights how there are so many other important subproblems within "make AI go well" besides intent alignment. Of course, Paul doesn't speak for "AI alignment as a field."

Ben_West🔸

Nov 2 2022

ICYMI: Steering AI to care for animals, and soon discusses this, as do some posts in this topic.

Geoffrey Miller

Nov 3 2022

Thank you! Appreciate the links.

Peter

Nov 2 2022

What things would make people less worried about AI safety if they happened? What developments in the next 0-5 years should make people more worried if they happen?

plex

Nov 3 2022

On the good side:

1. We hit some hard physical limit on computation which dramatically slows the relevant variants of Moore's law.
2. A major world power wakes up and starts seriously focusing on AI safety as a top priority.
3. We build much better infrastructure for scaling the research field (e.g. AI tools to automatically connect people with relevant research, and help accelerate learning with contextual knowledge of what a person has read and is interested in, likely in the form of a unified feed (funding offers welcome!)). Apart's AI Safety Ideas also falls into the category of key infrastructure.
4. Direct progress on alignment, some research paradigm emerges which seems likely to result in a real solution.
5. Dramatic slowdown in the rate of capabilities breakthroughs, discovering some crucial category of task which needs fundamental breakthroughs which are not quick to be achieved.
6. More SBFs entering the funding ecosystem.

And on the bad side:

1. Lots of capabilities breakthroughs.
2. Stagnation or unhealthy dynamics in the alignment research space (e.g. vultures, loss of collective sensemaking as we scale, self-promotion becoming the winning strategy).
3. US-China race dynamics, especially both countries explicitly pushing for AGI without really good safety considerations.
4. Funding crunch.

Stephen Clare

Nov 2 2022

I'm going to repeat my question from the "Ask EA anything" thread. Why do people talk about artificial general intelligence, rather than something like advanced AI? For some AI risk scenarios, it doesn't seem necessary that the AI be "generally" intelligent.

plex

Nov 3 2022

There are some AI risks which don't require generality, but the ones which have the potential to be an x-risk will likely involve fairly general capabilities. In particular, capabilities to automate innovation, like OpenPhil's Process for Automating Scientific and Technological Advancement. Several other overlapping terms have been used, such as Transformative AI, AI existential safety, AI alignment, AGI safety. We're planning to have a question on Stampy which covers these different terms. This Rob Miles video covers why the risks from this class of AI are likely the most important.

JakubK

Nov 4 2022

I think there are also (unfortunately) some likely AI x-risks that don't involve general-purpose reasoning. For instance, so much of our lives already involves automated systems that determine what we read, how we travel, who we date, etc, and this dependence will only increase with more advanced AI. These systems will probably pursue easy-to-measure goals like "maximize user's time on the screen" and "maximize reported well-being," and these goals won't be perfectly aligned with "promote human flourishing." One doesn't need to be especially creative to imagine how this situation could create worlds in which most humans live unhappy lives (and are powerless to change their situation). Some of these scenarios would be worse than human extinction. There are more scenarios in "What failure looks like" and "What multipolar failure looks like" that don't require AGI. A counterargument is that we might eventually build AGI in these worlds anyways, at which point the concerns in Rob's talk become relevant. (Side note: from my perspective, Rob's talk says very little about why x-risk from AGI could be more pressing than x-risk from narrow AI.)

Geoffrey Miller

Nov 2 2022

Cool. That all makes sense.

Seems like a lot of alignment research at the moment is analogous to physicists at Los Alamos National Labs running computer simulations to show that a next-generation nuke will reliably give a yield of 5 megatons plus or minus 0.1 megatons, and will not explode accidentally, and is therefore aligned with the Pentagon's mission of developing 'safe and reliable' nuclear weaponry.... and then saying 'We'll worry about the risks of nuclear arms races, nuclear escalation, nuclear accidents, nuclear winter, and nuclear terrorism later -- they're just implementation details'.

kokotajlod

Nov 5 2022

The situation is much worse than that. It's more like: They are worried about the possibility that the first ever nuclear explosion will ignite the upper atmosphere and set the whole earth ablaze. (As you've probably read, this was a real concern they had.) Except in this hypothetical the preliminary calculations are turning up the answer of Yes no matter how they run them. So they are continuing to massage the calculations and make the modelling software more realistic in the hopes of a No answer, and also advocating for changes to the design of the bomb that'll hopefully mitigate the risk, and also advocating for the whole project to slow down before it's too late, but the higher-ups have go fever & so it looks like in a few years the whole world will be on fire. Meanwhile, some other people are talking to the handful of Los Alamos physicists and saying "but even if the atmosphere doesn't catch on fire, what about arms races, accidents, terrorism, etc.?" and the physicists are like "lol yeah that's gonna be a whole big problem if we manage to survive the first test, which unfortunately we probably won't. We'd be working on that problem if this one didn't take priority."

Geoffrey Miller

Nov 6 2022

That's a vivid but perhaps all too accurate (and all too horrifying) analogy.

Lorenzo Buonanno🔸

Nov 2 2022

Why do you think AI Safety is tractable?

Related, have we made any tangible progress in the past ~5 years that a significant consensus of AI Safety experts agree is decreasing P(doom) or prolonging timelines?

Edit: I hadn't noticed there was already a similar question

JakubK

Nov 4 2022

I share this concern and would like to see more responses to this question. Importance and neglectedness seem very high, but tractability is harder to justify. Given 100 additional talented people (especially people anywhere close to the level of someone like Paul Christiano) working on new research directions, it sounds intuitively absurd (to me) to say the probability of an AI catastrophe would not meaningfully decrease. But the only justification I can give is that generally people make progress when they work on research questions in related fields (e.g. in math, physics, computer science, etc, although the case is weaker for philosophy). "Significant consensus of AI Safety experts agree" is a high bar. Personally, I'm more excited about work that a smaller group of experts (e.g. Nate Soares) agree is actually useful. People disagree on what work is helpful in AI safety. Some might say a key achievement was that reward learning from human feedback gained prominence years before it would have otherwise (I think I saw Richard Ngo write this somewhere). Others might say that it was really important to clarify and formalize concepts related to inner alignment. I encourage you to read an overview of all the research agendas here (or this shorter cheat sheet) and come to your own conclusions. Throughout, I've been answering through the lens of "AI Safety = technical AI alignment research." My answer completely ignores some other important categories of work like AI governance and AI safety field building, which are relevant to prolonging timelines.

Peter

Nov 2 2022

What are good ways to test your fit for technical AI Alignment research? And which ways are best if you have no technical background?

Iyngkarran Kumar

Nov 4 2022

This is a really comprehensive post on pursuing a career in technical AI safety, including how to test fit and skill up

Peter

Nov 16 2022

Thank you - I had forgotten about that post and it was really helpful.

Lorenzo Buonanno🔸Nov 2 20225

How would you recommend deciding which AI Safety orgs are actually doing useful work?
According to this comment, and my very casual reading of LessWrong, there is definitely no consensus on whether any given org is net-positive, net-neutral, or net-negative.

If you're working in a supporting role (e.g. engineering or hr) and can't really evaluate the theory yourself, how would you decide which orgs are net-positive to help?

plex

Nov 3 2022

Firstly, make sure you're working on safety rather than capabilities. Distinguishing between who is doing the best safety work if you can't evaluate the research directly is challenging. Your best path is probably to find a person who seems trustworthy and competent to you, and get their opinions on the work of the organizations you're considering. This could be a direct contact, or from a review such as Larks's one.

Amber DawnNov 2 20225

This is a great idea! I expect to use these threads to ask many many basic questions.

One on my mind recently: assuming we succeed in creating aligned AI, whose values, or which values, will the AI be aligned with? We talk of 'human values', but humans have wildly differing values. Are people in the AI safety community thinking about this? Should we be concerned that an aligned AI's values will be set by (for example) the small team that created it, who might have idiosyncratic and/or bad values?

Koen Holtman

Nov 3 2022

Yes. They think about this more on the policy side than on the technical side, but there is technical/policy cross-over work too. Yes. There is a significant amount of talk about 'aligned with whom exactly'. But many of the more technical papers and blog posts on x-risk style alignment tend to ignore this part of the problem, or mention it only in one or two sentences and then move on. This does not necessarily mean that the authors are unconcerned about this question; it more often means that they feel they have little new to say about it. If you want to see an example of a vigorous and occasionally politically sophisticated debate on solving the 'aligned with whom' question, instead of the moral philosophy 101/201 debate which is still the dominant form of discourse in the x-risk community, you can dip into the literature on AI fairness.

plex

Nov 3 2022

An AI could be aligned to something other than humanity's shared values, and this could potentially prevent most of the value in the universe from being realized. Nate Soares talks about this in Don't leave your fingerprints on the future. Most of the focus goes on being able to align an AI at all, as this is necessary for any win-state. There seems to be consensus among the relevant actors that seizing the cosmic endowment for themselves would be a Bad Thing. Hopefully this will hold.

David MNov 2 20225

What should people do differently to contribute to AI safety if they have long vs short timelines?

plex

Nov 3 2022

Activities which pay off over longer time horizons than your timelines should be dramatically downweighted. E.g. if your plan is to spend 6 years building career capital in a minimal impact position then leverage that into an AI Safety job, this is not a good plan if your timelines are 7 years. Generally, I advise moving as rapidly as possible into doing something which actually helps,[1] either directly as a researcher or by building and maintaining infrastructure which improves the alignment ecosystem and allows research to move in the right direction faster.

1. ^ I happen to know you're helping with impact markets! This seems like a great place to be if we have at least mid single digit years, maybe even low single digit years.

David M

Nov 7 2022

Here are my personal notes on this question.

oh54321Nov 2 20224

I'm relatively unconvinced by most arguments I've read that claim deceptive alignment will be a thing (which I understand to be a model that intentionally behaves differently on its training data and test data to avoid changing its parameters in training).

Most toy examples I've seen, or thought experiments, don't really seem to actually be examples of deceptive alignment since the model is actually trained on the "test" data in these examples. For example, while humans can deceive their teachers in etiquette school then use poor manners outside the s... (read more)

Ben Millwood🔸

Nov 5 2022

In my mind this rests on:

* Figuring out that you're an ML agent being trained by RL is very difficult, and probably beyond the reach of current capabilities, which is why we don't see "real" deceptive alignment so far.
* However, it's in principle possible, and helped by the fact that your training data probably includes information about what RL is etc. Seems like an AGI could do it.
* Explicitly modelling your reward function seems like it would be really helpful for getting more reward, so RL agents might be pushed in the direction of this explicit modelling.
* Once you have "optimize my reward function" as an instrumental goal, you're pretty much as good at getting reward as any other agent that has that as an instrumental goal, regardless of what your terminal goal is, so at this point you may appear to be aligned while in reality your true terminal goals are just no longer directing your behaviour for as long as you believe you're in training.
* It's hard to really comprehensively hide from the agent whether it's in training or not, so it should be relatively easy for the agent to optimize for reward in training and then do something else when deployed.

oh54321

Nov 6 2022

Thanks! I think most of this made sense to me. I'm a bit fuzzy on the fourth bullet. Also, I'm still confused why a model would even develop an alternative goal to maximizing its reward function, even if it's theoretically able to pursue one.

robertskmiles

Nov 13 2022

I have a couple of videos that talk about this! This one sets up the general idea: This one talks about how likely this is to happen in practice:

oh54321Nov 4 20223

Should we expect power-seeking to often be naturally found by gradient descent? Or should we primarily expect it to come up when people are deliberately trying to make power-seeking AI, and train the model as such?

Greg_Colbourn ⏸️

Nov 8 2022

With powerful enough systems, convergent instrumental goals emerge, and inner alignment is something that needs to be addressed (i.e. stopping unintended misaligned agents emerging within the model). Optimal Policies Tend To Seek Power.

oh54321

Nov 9 2022

Right, so I'm pretty on board with the claim that optimal policies (i.e., "global maximum" policies) usually involve seeking power. However, gradient descent only finds local maximums, not global maximums. It's unclear to me whether these local maximums would involve something like power-seeking. My intuition for why this might not be the case is that "small tweaks" in the direction of power-seeking would probably not reap immediate benefits, so gradient descent wouldn't go down this path. This is where my question kind of arose from. If you have empirical examples of power-seeking coming up in tasks where it's nontrivial that it would come up, I'd find that particularly helpful. Does the paper you sent address this? If so, I'll spend more time reading it.
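To make the local-versus-global point concrete, here is a minimal sketch of gradient ascent on a made-up one-dimensional objective with two peaks. The function `objective`, the starting points, and the learning rate are all assumptions chosen for illustration; nothing here models a real training run, and it says nothing about whether power-seeking sits at either peak. It only shows that where the optimizer ends up depends on where it starts.

```python
import math

def objective(x: float) -> float:
    # Hypothetical toy objective: a small peak near x = -1 and a taller one near x = 2.
    return math.exp(-(x + 1) ** 2) + 2 * math.exp(-(x - 2) ** 2)

def gradient(x: float) -> float:
    # Analytic derivative of the toy objective above.
    return (-2 * (x + 1) * math.exp(-(x + 1) ** 2)
            - 4 * (x - 2) * math.exp(-(x - 2) ** 2))

def gradient_ascent(start: float, learning_rate: float = 0.1, steps: int = 500) -> float:
    # Repeatedly take a small step uphill from the starting point.
    x = start
    for _ in range(steps):
        x += learning_rate * gradient(x)
    return x

# Starting on the left slope, the optimizer settles on the smaller peak.
stuck = gradient_ascent(start=-1.5)
# Starting closer to the taller peak, it finds the better optimum.
found = gradient_ascent(start=1.0)

print(f"start=-1.5 -> x={stuck:.3f}, objective={objective(stuck):.3f}")
print(f"start= 1.0 -> x={found:.3f}, objective={objective(found):.3f}")
```

In this toy, reaching the taller peak from the left would require first walking downhill through the valley between the peaks, which plain gradient ascent never does. That is the sense in which "small tweaks" that do not pay off immediately are not followed.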

Greg_Colbourn ⏸️

Nov 9 2022

Afaik, it remains an open area of research to find examples of emergent power-seeking in real ML systems. Finding such examples would do a lot for raising the alarm about AGI x-risk I think.

oh54321

Nov 9 2022

Ok, cool, that's helpful to know. Is your intuition that these examples will definitely occur and we just haven't seen them yet (due to model size or something like this)? If so, why?

Greg_Colbourn ⏸️

Nov 9 2022

My intuition is that they will occur, hopefully before it's too late (but it's possible that due to incentives for deception etc we may not see it before it's too late). More here: Evaluating LM power-seeking.

YuliaNov 3 20223

Generally prosaic alignment is much more tractable than conceptual or mathematical alignment, at least if we're talking about where you will make marginal progress if you started now – @acylhalide

What is prosaic alignment? What are examples of prosaic alignment?

Geoffrey MillerNov 2 20223

OK, let's say a foreign superpower develops what you're calling 'weakly aligned AI', and they do 'muster the necessary power to force the world into a configuration where [X risk] is lowered'... by, for example, developing a decisive military and economic advantage over other countries, imposing their ideology on everybody, and thereby reducing the risk of great-power conflict.

I still don't understand how we could call such an AI 'aligned with humanity' in any broad sense; it would simply be aligned with its host government and their interests, and s... (read more)

YuliaNov 2 20223

To what extent is AI alignment tractable?

I'm especially interested in subfields that are very tractable – as well as fields that are not tractable at all (with people still working there).

Pseudonym101Nov 2 20222

Won't other people take care of this - why should I additionally care?

plex

Nov 3 2022

I can't track it down, but there is a tweet by I think Holden who runs OpenPhil where he says that people sometimes tell him that he's got it covered and he wants to shout "NO WE DON'T". We're very very far from a safe path to a good future. It is a hard problem and we're not rising to the challenge as a species. Why should you care? You and everyone you've ever known will die if we get this wrong.

Pseudonym101

Nov 5 2022

But people are already dying - no mights required. Why not focus on more immediate problems like global health and development?

Jay Bailey

Nov 5 2022

A 10% chance of a million people dying is as bad as 100,000 people dying with certainty, if you're risk-neutral. Essentially that's the main argument for working on a speculative cause like AGI - if there's a small chance of the end of humanity, that still matters a great deal. As for "Won't other people take care of this", well...you could make that same argument about global health and development, too. More people is good for increasing potential impact of both fields. (Also worth noting - EA as a whole does devote a lot of resources to global health and development, you just don't see as many posts about it because there's less to discuss/argue about)
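The risk-neutral comparison above is just an expected-value calculation. A minimal sketch, using only the two figures from the comment (the function name `expected_deaths` is made up for illustration, and exact fractions are used to avoid floating-point rounding):

```python
from fractions import Fraction

def expected_deaths(probability: Fraction, deaths_if_it_happens: int) -> Fraction:
    # Expected value = probability of the outcome times its size.
    return probability * deaths_if_it_happens

# A 10% chance of a million deaths...
speculative = expected_deaths(Fraction(1, 10), 1_000_000)
# ...versus 100,000 deaths with certainty.
certain = expected_deaths(Fraction(1), 100_000)

print(speculative, certain, speculative == certain)
```

A risk-neutral agent is by definition indifferent between the two; someone who is risk-averse or risk-seeking would weigh them differently, which this sketch does not attempt to model.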

HumboltNov 20 20221

A single AGI with its utility function seems almost impossible to make safe. What happens if you have a population of AGIs each with its own utility function? Probably dangerous to make biological analogies ... but biological systems are often kept stable by the interplay between agonist and antagonist processes. For instance, one AGI in the population wants to collect stamps and another wants to keep that from happening.

Agustín Covarrubias 🔸Nov 8 20221

To what degree is having good software engineering experience helpful for AI Safety research?

Greg_Colbourn ⏸️

Nov 8 2022

AI Safety Needs Great Engineers.

Agustín Covarrubias 🔸Nov 8 20221

Related question to the one posed by Yadav:

Does the fact that OpenAI and DeepMind have AI Safety teams factor significantly into AI x-risk estimates?

My independent impression is that it's very positive, but I haven't seen this factor being taken explicitly into account in risk estimates.

PatoNov 4 20221

Oh, I didn't know that the field was so against AI X-risks. Because when I saw this https://aiimpacts.org/what-do-ml-researchers-think-about-ai-in-2022/ 5-10% of X-risks seemed enough to take them seriously. Is that survey not representative? Or is there a gap between people recognizing the risks and giving them legitimacy?

[anonymous]Nov 4 20221

According to the CLR, since resource acquisition is an instrumental goal, regardless of the utility function of the AGI, it is possible that such a goal can lead to a race where each AGI can threaten others such that the target has an incentive to hand over resources or comply with the threateners’ demands. Is such a conflict scenario (potentially leading to x-risks) from two AGIs possible if these two AGIs have a different intelligence level? If so, isn't there a level of intelligence gap at which x-risks become unlikely? How to characterize this f... (read more)

[anonymous]Nov 4 20221

If AGI has several terminal goals, how does it classify them? Some kind of linear combination?

[anonymous]Nov 4 20221

I have the feeling that there is a tendency in the AI safety community to think that if we solve the alignment problem, we’re done and the future must necessarily be flourishing (I observe that some EAs say that either we go extinct or it’s heaven on earth depending on the alignment problem, in a very binary way actually). However, it seems to me that post-aligned-AGI scenarios merit attention as well: game theory gives us sufficient rationale to state that even rational agents (in this case >2 AGIs) can make sub-optimal decisions (including catastrophic ones) when faced with some social dilemma. Any thoughts on this please?
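The textbook illustration of this game-theory point is the prisoner's dilemma. The sketch below uses two agents for simplicity and the conventional illustrative payoffs (3/3, 0/5, 5/0, 1/1), which are assumptions and not taken from the comment. It enumerates the pure-strategy Nash equilibria and checks that the only one leaves both agents worse off than mutual cooperation.

```python
from itertools import product

ACTIONS = ("cooperate", "defect")

# Hypothetical payoffs: (payoff to agent A, payoff to agent B); higher is better.
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}

def is_nash(a: str, b: str) -> bool:
    # A profile is a Nash equilibrium if neither agent gains by deviating alone.
    payoff_a, payoff_b = PAYOFFS[(a, b)]
    a_cannot_improve = all(PAYOFFS[(alt, b)][0] <= payoff_a for alt in ACTIONS)
    b_cannot_improve = all(PAYOFFS[(a, alt)][1] <= payoff_b for alt in ACTIONS)
    return a_cannot_improve and b_cannot_improve

equilibria = [profile for profile in product(ACTIONS, repeat=2) if is_nash(*profile)]
print("Pure-strategy Nash equilibria:", equilibria)

# Mutual cooperation would be strictly better for both agents than the equilibrium.
eq_payoffs = PAYOFFS[equilibria[0]]
coop_payoffs = PAYOFFS[("cooperate", "cooperate")]
print("Equilibrium payoffs:", eq_payoffs, "vs mutual cooperation:", coop_payoffs)
```

Each agent is individually rational here, yet the outcome is worse for both than an available alternative. Whether real multi-AGI interactions would have this payoff structure is an open question this toy does not address.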

Greg_Colbourn ⏸️

Nov 8 2022

I think to the extent that there would be post-AGI sub-optimal decision making (or catastrophe), that would be basically a failure of alignment (i.e. the alignment problem would not in actual fact have been solved!). More concretely, there are many things that need aligning beyond single human : single AGI, the most difficult being multi-human : multi-AGI, but there is also alignment needed at every relevant step in the human decision making chain.

PatoNov 4 20221

Interesting.

I'm not sure I understood the first part and what f(A,B) is. In the example that you gave B is only relevant with respect of how much it affects A ("damage the reputability of the AI risk ideas in the eye of anyone who hasn't yet seriously engaged with them and is deciding whether or not to"). So, in a way you are still trying to maximize |A| (or probably a subset of it: people who can also make progress on it (|A'|)). But in "among other things" I guess that you could be thinking of ways in which B could oppose A, so maybe that's why you... (read more)

aaron_maiNov 3 20221

Why is there so much more talk about the existential risk from AI as opposed to the amount by which individuals (e.g. researchers) should expect to reduce these risks through their work?

The second number seems much more decision-guiding for individuals than the first. Is the main reason that it's much harder to estimate? If so, why?

Greg_Colbourn ⏸️

Nov 8 2022

Here is an attempt by Jordan Taylor: Expected ethical value of a career in AI safety (which you can plug your own numbers into).

oh54321Nov 2 20221

"RL agents with coherent preference functions will tend to be deceptively aligned by default." - Why?

PatoNov 2 20221

I think EA has the resources to make the alignment problem go viral, or at least viral in STEM circles. Wouldn't that be good? I'm not asking if it would be an effective way of doing good, just a way.

I ask because I'm surprised that not even AI doomers seem to be trying to reach the mainstream.

WeaverNov 2 20221

If someone was looking to get a graduate degree in AI, what would they look for? Is it different for other grad schools?

Greg_Colbourn ⏸️

Nov 5 2022

Specific to AGI Safety, I would recommend looking at trying to join relevant active research groups in academia i.e. getting into PhDs they are offering; perhaps doing a relevant Masters - ML/AI, or even Philosophy or Maths - at the university they are based at first. There are some groups listed here (see the table). Universities include Berkeley (Stuart Russell, Jacob Steinhardt), Cambridge (David Krueger), NYU (Sam Bowman), Oxford (FHI).

Jakob_JNov 2 20221

How dependent is current AGI safety work on deep RL? Recently there has been a lot of emphasis on advances in deep RL (and ML more generally), so it would be interesting to know what the implications would be if it turns out that this particular paradigm cannot lead to AGI.

Koen Holtman

Nov 3 2022

There is some AGI safety work that specifically targets deep RL, under the assumption that deep RL might scale to AGI. But there is also a lot of other work, both on failure modes and on solutions, that is much more independent of the method being used to create the AGI. I do not have percentages on how it breaks down. Things are in flux. A lot of the new technical alignment startups seem to be mostly working in a deep RL context. But a significant part of the more theoretical work, and even some of the experimental work, involves reasoning about a very broad class of hypothetical future AGI systems, not just those that might be produced by deep RL.