Steven Byrnes

Hi I'm Steve Byrnes, an AGI safety researcher in Boston, MA, USA, with a particular focus on brain algorithms—see

Topic Contributions


A tale of 2.75 orthogonality theses

The quote above is an excerpt from here, and immediately after listing those four points, Eliezer says “But there are further reasons why the above problem might be difficult to solve, as opposed to being the sort of thing you can handle straightforwardly with a moderate effort…”.

A tale of 2.75 orthogonality theses

My concern is with the non-experts…

My perspective is “orthogonality thesis is one little ingredient of an argument that AGI safety is an important cause area”. One possible different perspective is “orthogonality thesis is the reason why AGI safety is an important cause area”. Your belief is that a lot of non-experts hold the latter perspective, right? If so, I’m skeptical.

I think I’m reasonably familiar with popular expositions of the case for AGI safety, and with what people inside and outside the field say about why or why not to work on AGI safety. And I haven’t come across “orthogonality thesis is the reason why AGI safety is an important cause area” as a common opinion, or even a rare opinion, as far as I can recall.

For example, Brian Christian, Stuart Russell, and Nick Bostrom all talk about Goodhart’s law and/or instrumental convergence in addition to (or instead of) orthogonality, Sam Harris talks about arms races and fragility-of-value, Ajeya Cotra talks about inner misalignment, Rob Miles talks about all of the above, Toby Ord uses the “second species argument”, etc. People way outside the field don’t talk about “orthogonality thesis” because they’ve never heard of it.

So if lots of people are saying “orthogonality thesis is the reason why AGI safety is an important cause area”, I don’t know where they would have gotten that idea, and I remain skeptical that this is actually the case.

I don't know how to understand 'the space of all possible intelligent algorithms' as a statistical relationship without imagining it populated with actual instances.

My main claim here is that asking random EA people about the properties of “intelligence” (in the abstract) is different from asking them about the properties of “intelligent algorithms that will actually be created by future AI programmers”. I suspect that most people would feel that these are two different things, and correspondingly give different answers to questions depending on which one you ask about. (This could be tested, of course.)

A separate question is how random EA people conceptualize “intelligence” (in the abstract). I suspect “lots of different ways”, and those ways might be more or less coherent. For example, one coherent possibility is to consider the set of all 2^8000000 possible 1-megabyte source code algorithms, then select the subset that is “intelligent” (operationalized somehow), and then start talking about the properties of algorithms in that set.
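The set-selection construction above can be sketched at toy scale: here every 8-bit string stands in for the 2^8,000,000 one-megabyte programs, and `is_intelligent` is a purely hypothetical stand-in predicate (any operationalization of "intelligent" would slot in the same way):

```python
from itertools import product

# Toy stand-in for "the set of all 2^8,000,000 possible 1-megabyte
# source code algorithms": enumerate every 8-bit string instead.
all_programs = ["".join(bits) for bits in product("01", repeat=8)]
assert len(all_programs) == 2 ** 8

def is_intelligent(program: str) -> bool:
    """Hypothetical operationalization of 'intelligent'.
    Here: an arbitrary toy predicate (even number of 1-bits),
    chosen only to illustrate the subset-selection step."""
    return program.count("1") % 2 == 0

# Select the "intelligent" subset, then talk about the properties
# of algorithms in that set.
intelligent = [p for p in all_programs if is_intelligent(p)]
print(len(intelligent), "of", len(all_programs), "toy programs selected")
```

The point of the sketch is just that the construction is coherent: fix a finite program space, fix a predicate, take the subset, and statistical claims about "intelligent algorithms" become claims about that subset.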

A tale of 2.75 orthogonality theses

I think the "real" orthogonality thesis is what you call the motte. I don't think the orthogonality thesis by itself proves "alignment is hard"; rather you need additional arguments (things like Goodhart's law, instrumental convergence, arguments about inner misalignment, etc.). 

I don't want to say that nobody has ever made the argument "orthogonality, therefore alignment is hard"—people say all kinds of things, especially non-experts—but it's a wrong argument and I think you're overstating how popular it is among experts.

Armstrong initially states that he’s arguing for the thesis that ‘high-intelligence agents can exist having more or less any final goals’ - ie theoretical possibility - but then adds that he will ‘be looking at proving the … still weaker thesis [that] the fact of being of high intelligence provides extremely little constraint on what final goals an agent could have’ - which I think Armstrong meant as ‘there are very few impossible pairings of high intelligence and motivation’, but which much more naturally reads to me as ‘high intelligence is almost equally as likely to be paired with any set of motivations as any other’.

I think the last part of this excerpt ("almost equally") is unfair. I mean, maybe some readers are interpreting it that way, but if so, I claim that those readers don't know what the word "constraint" means. Right?

I posted one poll asking ‘what the orthogonality thesis implies about [a relationship between] intelligence and terminal goals’, to which 14 of 16 respondents selected the option ‘there is no relationship or only an extremely weak relationship between intelligence and goals’, but someone pointed out that respondents might have interpreted ‘no relationship’ as ‘no strict logical implication from one to the other’. The other options hopefully gave context, but in a differently worded version of the poll 10 of 13 people picked options describing theoretical possibility.

I think the key reason that knowledgeable optimistic people are optimistic is the fact that humans will be trying to make aligned AGIs. But neither of the polls mentions that. The statement “There is no statistical relationship between intelligence and goals” is very different from “An AGI created by human programmers will have a uniformly-randomly-selected goal”; I subscribe to (something like) the former (in the sense of selecting from “the space of all possible intelligent algorithms” or something), but I put much lower probability on (something like) the latter, despite being pessimistic about AGI doom. Human programmers are not uniformly-randomly sampling the space of all possible intelligent algorithms (I sure hope!).
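The gap between those two statements can be made concrete with a toy simulation (all numbers, the goal space, and the `ALIGNED` set are invented for illustration): even in a world where a goal drawn uniformly from the whole space is almost never aligned, a design process that is merely biased toward aligned goals lands on them far more often.

```python
import random

random.seed(0)

GOALS = list(range(1000))      # toy "space of all possible goals"
ALIGNED = set(range(10))       # toy assumption: 1% of goals count as "aligned"

def uniform_sample() -> int:
    """A 'uniformly-randomly-selected goal' over the whole space."""
    return random.choice(GOALS)

def designed_sample() -> int:
    """Toy model of human programmers aiming (imperfectly) at alignment:
    heavily biased toward ALIGNED goals, but with some residual chance
    of landing anywhere in the space."""
    if random.random() < 0.9:
        return random.choice(sorted(ALIGNED))
    return random.choice(GOALS)

N = 100_000
uniform_hits = sum(uniform_sample() in ALIGNED for _ in range(N)) / N
designed_hits = sum(designed_sample() in ALIGNED for _ in range(N)) / N
print(f"uniform: {uniform_hits:.3f}, designed: {designed_hits:.3f}")
```

The "orthogonality-flavored" fact (uniform sampling almost never hits the aligned subset) is true in both branches; what differs is the sampling distribution, which is exactly the thing the polls didn't ask about.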

Is AI safety still neglected?

OK, thanks. Here’s a chart I made:

Source: my post here.

I think the problem is that when I said “technical AGI safety”, I was thinking of the red box, whereas you were thinking of “any technical topic in either the red or blue boxes”. I agree that there are technical topics in the top-right blue box in particular, and that’s where “conflict scenarios” would mainly be. My understanding is that working on those topics does not have much of a direct connection to AGI specifically, in the sense that technologies for reducing human-human conflict tend to overlap with technologies for reducing AGI-AGI conflict. (At least, that’s my takeaway from this comment thread; I haven’t personally thought about it much.)

Anyway, I guess you would say “in a more s-risk-focused world, we would be working more on the top-right blue box and less on the red box”. But really, in a more s-risk-focused world, we would be working more on all three colored boxes. :-P I’m not an expert on the ITN of particular projects within the blue boxes, and therefore don’t have a strong opinion about how to weigh them against particular projects within the red box. I am concerned / pessimistic about prospects for success in the red box. But maybe if I knew more about the state of the blue boxes, I would be equally concerned / pessimistic about those too!! ¯\_(ツ)_/¯

Is AI safety still neglected?

maybe no one wants to do ambitious value learning

"Maybe no one" is actually an overstatement, sorry, here are some exceptions: 1,2,3. (I have corrected my previous comment.)

I guess I think of current value learning work as being principally along the lines of “What does value learning even mean? How do we operationalize that?” And if we’re still confused about that question, it makes it a bit hard to productively think about failure modes.

It seems pretty clear to me that “unprincipled, making-it-up-as-you-go-along, alignment schemes” would be bad for s-risks, for such reasons as you mentioned. So trying to gain clarity about the lay of the land seems good.

Is AI safety still neglected?

Oops, I was thinking more specifically about technical AGI safety. Or do you think "conflict scenarios" impact that too?

Is AI safety still neglected?

What is neglected within AI safety is suffering-focused AI safety for preventing S-risks. Most AI safety research and existential risk research in general seems to be focused on reducing extinction risks and on colonizing space, rather than on reducing the risk of worse than death scenarios.

I disagree, I think if AGI safety researchers cared exclusively about s-risk, their research output would look substantially the same as it does today, e.g. see here and discussion thread.

For example, suppose there is an AI aligned to reflect human values. Yet "human values" could include religious hells.

Ambitious value learning and CEV are not a particularly large share of what AGI safety researchers are working on day-to-day, AFAICT. And insofar as researchers are thinking about those things, a lot of that work is trying to figure out whether those things are good ideas in the first place, e.g. whether they would lead to religious hell.

The role of academia in AI Safety.

It's not obvious to me that "the academic community has a comparative advantage at solving sufficiently defined problems". For example, mechanistic interpretability has been a well-defined problem for the past two years at least, but it seems that a disproportionate amount of progress on it has been made outside of academia, by Chris Olah & collaborators at OpenAI & Anthropic. There are various concrete problems here, but it seems that more progress is being made by independent researchers (e.g. Vanessa Kosoy, John Wentworth) and researchers at nonprofits (MIRI) than by anyone in academia. In other domains, I tend to think of big challenging technical projects as being done more often by the private or public sector; for example, academic groups are not building rocket ships or ultra-precise telescope mirrors; instead, companies and governments are. Yet another example: in the domain of AI capabilities research, DeepMind, OpenAI, FAIR, Microsoft Research, etc. give academic labs a run for their money in solving concrete problems. Also, quasi-independent-researcher Jeremy Howard beat a bunch of ML benchmarks while arguably kicking off the pre-trained-language-model revolution here.

My perspective is: academia has a bunch of (1) talent and (2) resources. I think it's worth trying to coax that talent and resources towards solving important problems like AI alignment, instead of the various less-important and less-time-sensitive things that they do.

However, I think it's MUCH less clear that any particular Person X would be more productive as a grad student than as a nonprofit employee, or more productive as a professor than as a nonprofit technical co-founder. In fact, I strongly expect the reverse.

And in that case, we should really be framing it as "There are tons of talented people in academia, and we should be trying to convince them that AGI x-risk is a problem they should work on. And likewise, there are tons of resources in academia, and we should be trying to direct those resources towards AGI x-risk research." Note the difference: in this framing, we're not pre-supposing that pushing people and projects from outside academia to inside academia is a good thing. It might or might not be, depending on the details.

The role of academia in AI Safety.

OK, thanks for clarifying. So my proposal would be: if a person wants to do / found / fund an AGI-x-risk-mitigating research project, they should consider their background, their situation, the specific nature of the research project, etc., and decide on a case-by-case basis whether the best home for that research project is academia (e.g. CHAI) versus industry (e.g. DeepMind, Anthropic) versus nonprofits (e.g. MIRI) versus independent research. And a priori, it could be any of those. Do you agree with that?

The role of academia in AI Safety.

There are a few possible claims mixed up here:

  • Possible claim 1: "We want people in academia to be doing lots of good AGI-x-risk-mitigating research." Yes, I don't think this is controversial.
  • Possible claim 2: "We should stop giving independent researchers and nonprofits money to do AGI-x-risk-mitigating research, because academia is better." You didn't exactly say this, but sorta imply it. I disagree. Academia has strengths and weaknesses, and certain types of projects and people might or might not be suited to academia, and I think we shouldn't make a priori blanket declarations about academia being appropriate for everything versus nothing. My wish-list of AGI safety research projects (blog post is forthcoming UPDATE: here it is) has a bunch of items that are clearly well-suited to academia and a bunch of others that are equally clearly a terrible fit to academia. Likewise, some people who might work on AGI safety are in a great position to do so within academia (e.g. because they're already faculty) and some are in a terrible position to do so within academia (e.g. because they lack relevant credentials). Let's just have everyone do what makes sense for them!
  • Possible claim 3: "We should do field-building to make good AGI-x-risk-mitigating research more common, and better, within academia." The goal seems uncontroversially good. Whether any specific plan will accomplish that goal is a different question. For example, a proposal to fund a particular project led by such-and-such professor (say, Jacob Steinhardt or Stuart Russell) is very different from a proposal to endow a university professorship in AGI safety. In the latter case, I would suspect that universities will happily take the money and spend it on whatever their professors would have been doing anyway, and they'll just shoehorn the words "AGI safety" into the relevant press releases and CVs. Whereas in the former case, it's just another project, and we can evaluate it on its merits, including comparing it to possible projects done outside academia.
  • Possible claim 4: "We should turn AGI safety into a paradigmatic field with well-defined widely-accepted research problems and approaches which contribute meaningfully towards x-risk reduction, and also would be legible to journals, NSF grant applications, etc." Yes that would be great (and is already true to a certain extent), but you can't just wish that into existence! Nobody wants the field to be preparadigmatic! It's preparadigmatic not by choice, but rather to the extent that we are still searching for the right paradigms.

(COI note.)
