I think the modal no-Anthropic counterfactual does not have an alignment-agnostic AI company that's remotely competitive with OpenAI, which means there's no external target for this Amazon investment. It's not an accident that Anthropic was founded by former OpenAI staff who were substantially responsible for OpenAI's earlier GPT scaling successes.
I don't know if it's commonly agreed upon; that's just my current belief based on available evidence (to the extent that the claim is even philosophically sound enough to be pointing at a real thing).
Re: ontological shifts, see this Arbital page: https://arbital.com/p/ontology_identification.
The fact that natural selection produced species with different goals/values/whatever isn't evidence that that's the only way to get those values, because "selection pressure" isn't a mechanistic explanation. You need more info about how values are actually implemented to rule out that a proposed alternative route to natural selection succeeds in reproducing them.
I'm not claiming that evolution is the only way to get those values, merely that there's no reason to expect you'll get them by default by a totally different mechanism. The fact that we don't have a good understanding of how values form even in the biological domain is a reason for pessimism, not optimism.
At best, these theory-first efforts did very little to improve our understanding of how to align powerful AI. And they may have been net negative, insofar as they propagated a variety of actively misleading ways of thinking both among alignment researchers and the broader public. Some examples include the now-debunked analogy from evolution, the false distinction between “inner” and “outer” alignment, and the idea that AIs will be rigid utility maximizing consequentialists (here, here, and here).
Random aside, but I think this paragraph is unjustified both in its core argument (that the referenced theory-first efforts propagated actively misleading ways of thinking about alignment) and in its citations, none of which provide the claimed support.
The first post (re: the evolutionary analogy as evidence for a sharp left turn) got substantial pushback in the comments, and that pushback seems more correct to me than not; in any case, the post seems to misunderstand the position it's arguing against.
The second post presents an interesting case for a set of claims that are different from "there is no distinction between inner and outer alignment"; I do not consider it to be a full refutation of that conceptual distinction. (See also Steven Byrnes' comment.)
The third post is at best playing games with the definitions of words (or misunderstanding the thing it's arguing against), at worst is just straightforwardly wrong.
I have less context on the fourth post, but from a quick skim of both the post and the comments, I think its main relevance here is as a demonstration of how important it is to be careful and precise with one's claims. (The post is not making an argument about whether AIs will be "rigid utility maximizing consequentialists"; it is making a variety of arguments about whether coherence theorems necessarily require that whatever ASI we might build will behave in a goal-directed way. Relatedly, a comment Rohin left a year after writing that post indicated that he does think we're likely to develop goal-directed agents; he just doesn't think that's entailed by arguments from coherence theorems, which may or may not have been made by e.g. Eliezer in other essays.)
My guess is that you did not include the fifth post as a smoke test to see if anyone was checking your citations, but I am having trouble coming up with a charitable explanation for its inclusion in support of your argument.
I'm not really sure what my takeaway is here, except that I didn't go scouring the essay for mistakes - the citation of Quintin's post was just the first thing that jumped out at me, since that wasn't all that long ago. I think the claims made in the paragraph are basically unsupported by the evidence, and the evidence itself is substantially mischaracterized. Based on other comments it looks like this is true of a bunch of other substantial claims and arguments in the post:
Though I'm sort of confused about what this back-and-forth is talking about, since it's referencing behind-the-scenes stuff that I'm not privy to.
Please stop saying that mind-space is an "enormously broad space." What does that even mean? How have you established a measure on mind-space that isn't totally arbitrary?
Why don't you make the positive case that the space of possible (or, if you prefer, likely) minds consists of minds whose values are compatible with the fulfillment of human values? I think we have pretty strong evidence that not all minds are like this, even within the space of minds produced by evolution.
What if concepts and values are convergent when trained on similar data, just like we see convergent evolution in biology?
Concepts do seem to be convergent to some degree (though note that ontological shifts at increasing levels of intelligence seem likely), but I do in fact think that evidence from evolution suggests that values are strongly contingent on the kinds of selection pressures which produced various species.
The argument w.r.t. capabilities is disanalogous.
Yes, the training process is running a search where our steering is (sort of) effective for getting capabilities - though note that with e.g. LLMs we have approximately zero ability to reliably translate known inputs [X] into known capabilities [Y].
We are not doing the same thing to select for alignment, because "alignment" is:
I do think this disagreement is substantially downstream of a disagreement about what "alignment" represents, i.e. I think that you might attempt outer alignment of GPT-4 but not inner alignment, because GPT-4 doesn't have the internal bits which make inner alignment a relevant concern.
If instead of a terabyte of compiled code, you give me a trillion neural net weights, I can fine tune that network to do a lot of stuff.
But this is irrelevant to the original claim, right? Being able to fine-tune the network might make introspection on its internal algorithmic representations a bit cheaper, but in practice we observe that it takes alignment researchers weeks or months to figure out what extremely tiny slices of two-generations-old LLMs are doing.
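(For concreteness on the asymmetry I mean: below is a minimal sketch, with an arbitrary small open-weights model, dataset, and hyperparameters chosen purely for illustration, of how little is involved in fine-tuning. Every step is an off-the-shelf library call, and none of them tells you anything about what algorithms the weights implement; that part is the weeks-or-months of interpretability work.)

```python
# Minimal fine-tuning sketch (illustrative choices throughout: gpt2 stands in
# for "a trillion neural net weights", wikitext for "a lot of stuff").
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# A tiny slice of a public corpus, filtered for empty lines.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.filter(lambda x: len(x["text"].strip()) > 0)
tokenized = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # steers behavior; reveals nothing about internal mechanisms
```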
I do not think the orthogonality thesis is a motte-and-bailey. The only evidence I know of that suggests that the goals developed by an ASI trained with something resembling modern methods would by default be picked from a distribution that's remotely favorable to us is the evidence we have from evolution[1], but I really think that ought to be screened off. The goals developed by various animal species (including humans) as a result of evolution are contingent on specific details of various evolutionary pressures and environmental circumstances, which we know with confidence won't apply to any AI trained with something resembling modern methods.
Absent a specific reason to believe that we will be sampling from an extremely tiny section of an enormously broad space, why should we believe we will hit the target?
Anticipating the argument that, since we're doing the training, we can shape the goals of the systems: this would certainly be a reason for optimism if we had any idea what goals would emerge while training superintelligent systems, and had any way of actively steering those goals to our preferred ends. Right now, we have neither.
[1] Which, mind you, is still unfavorable; I think the goals of most animal species, were they to be extrapolated outward to superhuman levels of intelligence, would not result in worlds that we would consider very good. Just not nearly as unfavorable as what I think the actual distribution we're facing is.
Every public company in America has a legally-mandated obligation to maximize shareholder returns
This is false. (The analogy between corporations and unaligned AGI is misleading for many other reasons, of course, not the least of which is that corporations are not actually coherent singleton agents, but are made of people.)
ETA: feel free to ignore the below, given your caveat, though if you choose to write an expanded form of any of these arguments later, you may find it helpful to have some early objections on hand.
Correct me if I'm wrong, but it seems like most of these reasons boil down to not expecting AI to be superhuman in any relevant sense (since if it is, effectively all of them break down as reasons for optimism)? To wit: