Rafael Harth

175 karma · Joined Jul 2022




Well, a computer model is "literally" transparent in the sense that you can see everything, which means the only difficulty is in understanding what it means. So the part where you spend 5 million dollars on a PET scanner doesn't exist for ANNs, and in that sense you can analyze them for "free".

If the understanding part is sufficiently difficult... which it sure seems to be... then this doesn't really help, but it is a coherent conceptual difference.

What if it just is the case that AI will be dangerous for reasons that current systems don't exhibit, and hence we don't have empirical data on? If that's the case, then limiting our concerns to only concepts that can be empirically tested seems like it means setting ourselves up for failure.

Nate Soares of the Machine Intelligence Research Institute has argued that building safe AGI is hard for the same reason that building a successful space probe is hard— it may not be possible to correct failures in the system after it’s been deployed. Eliezer Yudkowsky makes a similar argument:

“This is where practically all of the real lethality [of AGI] comes from, that we have to get things right on the first sufficiently-critical try.” — AGI Ruin: A List of Lethalities

Eliezer and Nate also both expect discontinuous takeoff by default. I feel like it's a bit disingenuous to argue that the thinking of Eliezer et al. has proven obsolete and misguided, but then also quote them as apparent authority figures in this one case where their arguments align with your essay. It has to be one or the other!

It's something like, "you'll keep pursuing the goal in new situations." In other words, goal-internalization is a generalization problem.

I think internalizing  means "pursuing  as a terminal goal", whereas RLHF arguably only makes the model pursue  as an instrumental goal (in which case the model would be deceptively aligned). I'm not saying that GPT-4 has a distinction between instrumental and terminal goals, but a future AGI, whether an LLM or not, could have terminal goals that are different from instrumental goals.

You might argue that deceptive alignment is also an obsolete paradigm, but I would again respond that we don't know this, or at any rate, that the essay doesn't make the argument.

This essay seems predicated on a few major assumptions that aren't quite spelled out, or at any rate not presented as assumptions.

Far from being “behind” capabilities, it seems that alignment research has made great strides in recent years. OpenAI and Anthropic showed that Reinforcement Learning from Human Feedback (RLHF) can be used to turn ungovernable large language models into helpful and harmless assistants. Scalable oversight techniques like Constitutional AI and model-written critiques show promise for aligning the very powerful models of the future. And just this week, it was shown that efficient instruction-following language models can be trained purely with synthetic text generated by a larger RLHF’d model, thereby removing unsafe or objectionable content from the training data and enabling far greater control.

This assumes that making AI behave nicely is genuine progress in alignment. The opposing take is that all it's doing is making the AI play a nicer character, without leading it to internalize its goals, which is what alignment is actually about. And in fact, AI playing rude characters was never the problem to begin with.

You say that alignment is linked to capability in the essay, but this also seems predicated on the above. This kind of "alignment" makes the AI better at figuring out what the humans want, but historically, most thinkers in alignment have always assumed that AI gets good at figuring out what humans want, and that it's dangerous anyway.

What worries me the most is that the primary reason for this view that's presented in the essay seems to be a social one (or otherwise, I missed it).

We don’t need to speculate about what would happen to AI alignment research during a pause— we can look at the historical record. Before the launch of GPT-3 in 2020, the alignment community had nothing even remotely like a general intelligence to empirically study, and spent its time doing theoretical research, engaging in philosophical arguments on LessWrong, and occasionally performing toy experiments in reinforcement learning.

The Machine Intelligence Research Institute (MIRI), which was at the forefront of theoretical AI safety research during this period, has since admitted that its efforts have utterly failed. Stuart Russell’s “assistance game” research agenda, started in 2016, is now widely seen as mostly irrelevant to modern deep learning— see former student Rohin Shah’s review here, as well as Alex Turner’s comments here. The core argument of Nick Bostrom’s bestselling book Superintelligence has also aged quite poorly.[2]

At best, these theory-first efforts did very little to improve our understanding of how to align powerful AI. And they may have been net negative, insofar as they propagated a variety of actively misleading ways of thinking both among alignment researchers and the broader public. Some examples include the now-debunked analogy from evolution, the false distinction between “inner” and “outer” alignment, and the idea that AIs will be rigid utility maximizing consequentialists (here, here, and here).

During an AI pause, I expect alignment research would enter another “winter” in which progress stalls, and plausible-sounding-but-false speculations become entrenched as orthodoxy without empirical evidence to falsify them. [...]

I.e., MIRI's approach to alignment hasn't worked out, therefore the current work is better. But this argument doesn't work: both approaches can be failures! I think Eliezer would argue that MIRI's work had a chance of leading to an alignment solution but has failed, whereas current alignment work (like RLHF on LLMs) has no chance of solving alignment.

If this is true, then the core argument of this essay collapses, and I don't see a strong argument here that it's not true. Why should we believe that MIRI is wrong about alignment difficulty? The fact that their approach failed is not strong evidence of this; if they're right, then they weren't very likely to succeed in the first place.

And even if they're completely wrong, that still doesn't prove that current alignment approaches have a good chance of working.

Another assumption you make is that AGI is close and, in particular, will come out of LLMs. E.g.:

Such international persuasion is even less plausible if we assume short, 3-10 year timelines. Public sentiment about AI varies widely across countries, and notably, China is among the most optimistic.

This is a case where you agree with most MIRI staff, but, e.g., Stuart Russell and Steven Byrnes are on record saying that we likely will not get AGI out of LLMs. If this is true, then RLHF done on LLMs is probably even less useful for alignment, and it also means the harsh verdict on the arguments in Superintelligence is unwarranted. Things could still play out a lot more like classical AI alignment thinking in the paradigm that will actually give us AGI.

And I'm also not ready to toss out the inner vs. outer paradigm just because there was one post criticizing it.

I increasingly realize just how emotionally inaccessible the concept of ethical offsetting is to most people. With regard to climate, the figure I remember from William MacAskill is his estimate that one dollar donated to the Clean Air Task Force averts 1 ton of CO2. If you take this seriously, then you basically don't need to bother with any efforts to reduce waste in your personal life. (The possible exception is your image, but only if you're a public figure.)

The problem you're talking about is actually being taken into account by "t".

If you intended it that way, then the formula is technically correct, but only because you've offloaded all the difficulty into defining this parameter. The value of t is now strongly dependent on the net proportion of well-being vs. suffering in the entire universe, which is extremely difficult to estimate and not something that people usually mean by tractability of a cause. (And in fact, it's also not what you talk about in this post in the section on tractability.)

The value we care about here is something like . If well-being and suffering are close together, this quantity becomes explosively larger, and so does the relative impact of improving WAW permanently relative to x-risk reduction. Since again, I don't think this is what anyone thinks of when they talk about tractability, I think it should be in the formula.

My main qualitative reaction to this is that the buckets "permanently improving life quality" and "reducing extinction risk" are unusual and might not be representative of what these fields generally do. Like, if you put it like the above, my intuition says that improving life quality is a lot better. But my (pretty gut-level) conclusion is the opposite, i.e., that long-term AI stuff is more important because it'll also have a greater impact on long-term happiness than WAW, which in most cases probably won't affect the long term at all.

I think this formula (under "simple formula")

is wrong even given the assumption of a net-positive future. For example, suppose both problems are equally tractable and there is a 50% chance of extinction. Then . But if the future is only a super tiny bit positive on net, then increasing WAW long-term has massive effects. Like if well-being vs. suffering is distributed 51%-49%, then increasing well-being by 1% doubles how good the future is.
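
The 51%-49% arithmetic can be checked with a quick sketch. It assumes one particular reading that isn't spelled out above: the goodness of the future is the well-being share minus the suffering share, and "increasing well-being by 1%" means one percentage point shifts from suffering to well-being. The function name `net_value` and the 90%-10% comparison case are illustrative only.

```python
from fractions import Fraction


def net_value(wellbeing_percent: int) -> Fraction:
    """Net goodness under the assumed model: well-being share minus suffering share.

    Shares are assumed to sum to 100%, so suffering = 100 - well-being.
    """
    wellbeing = Fraction(wellbeing_percent, 100)
    suffering = 1 - wellbeing
    return wellbeing - suffering


# Barely net-positive future: 51% well-being vs. 49% suffering.
near_zero_ratio = net_value(52) / net_value(51)

# Clearly net-positive future: 90% well-being vs. 10% suffering.
far_from_zero_ratio = net_value(91) / net_value(90)

print(near_zero_ratio)      # 2
print(far_from_zero_ratio)  # 41/40
```

Under these assumptions, the same one-point shift doubles the value of a 51%-49% future but improves a 90%-10% future by only 2.5%, which is why the net balance matters so much for the comparison.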

In general, I'm pretty sure the correct formula would have goodness of the future as a scalar, and that it would be the same formula whether the future is positive or not.

I don't entirely understand the other formula, but I don't believe it fixes the problem. Could be wrong.

EA people tend to be more concerned about the x-risk aspect of AI, which is distinct from the social problem of fakes or AI generated content.
