Nora Belrose

Head of Interpretability @ EleutherAI
261 karma · Working (0-5 years)


The goal realism section was an argument in the alternative. If you just agree with us that the indifference principle is invalid, then the counting argument fails, and it doesn't matter what you think about goal realism.

If you think that some form of indifference reasoning still works, in a way that saves the counting argument for scheming, then the most plausible view on which that's true is goal realism combined with Huemer's restricted indifference principle. We attack goal realism to try to close off that line of reasoning.

I think the title overstates the strength of the conclusion

This seems like an isolated demand for rigor to me. I think it's fine to say something is "no evidence" when, speaking pedantically, it's only a negligible amount of evidence.

Ultimately I think you've only rebutted one argument for scheming—the counting argument

I mean, we do in fact discuss the simplicity argument, although we don't go into as much depth.

the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don't scheme

Without a concrete proposal about what that might look like, I don't feel the need to address this possibility.

If future AIs are "as aligned as humans", then AIs will probably scheme frequently

I think future AIs will be much more aligned than humans, because we will have dramatically more control over them than over humans.

I don't think you need to believe in any strong version of goal realism in order to accept the claim that AIs will intuitively have "goals" that they robustly attempt to pursue.

We did not intend to deny that some AIs will be well-described as having goals.

So, I definitely don't have the Solomonoff prior in mind when I talk about simplicity. I'm actively doing research at the moment to better characterize the sense in which neural nets are biased toward "simple" functions, but I would be shocked if it has anything to do with Kolmogorov complexity.

Anticipating the argument that, since we're doing the training, we can shape the goals of the systems: this would certainly be reason for optimism if we had any idea what goals we would see emerge while training superintelligent systems, and had any way of actively steering those goals to our preferred ends. We don't have either, right now.

What does this even mean? I'm pretty skeptical of the realist attitude toward "goals" that seems to be presupposed in this statement. Goals are just somewhat useful fictions for predicting a system's behavior in some domains. But I think it's a leaky abstraction that will lead you astray if you take it too seriously or apply it outside the domain for which it was designed.

We clearly can steer AI's behavior really well in the training environment. The question is just whether this generalizes. So it becomes a question of deep learning generalization. I think our current evidence from LLMs strongly suggests they'll generalize pretty well to unseen domains. And as I said in the essay, I don't think the whole jailbreaking thing is any evidence for pessimism; it's exactly what you'd expect of aligned human mind uploads in the same situation.

The positive case is just super obvious: we're trying very hard to make these systems aligned, and almost all the data we're dumping into these systems is generated by humans and is therefore dripping with human values and concepts.

I also think we have strong evidence from ML research that ANN generalization is due to symmetries in the parameter-function map which seem generic enough that they would apply mutatis mutandis to human brains, which also have a singular parameter-function map (see e.g. here).

I do in fact think that evidence from evolution suggests that values are strongly contingent on the kinds of selection pressures which produced various species.

Not really sure what you're getting at here, or why this is supposed to help your side.

I'm not conditioning on the global governance mechanism; I assign nonzero probability mass to the "standard treaty" thing. But I think in fact you would very likely need global governance, so that is the main causal mechanism through which tyranny happens in my model.

And you've already agreed that it's implausible that these efforts would lead to tyranny, you think they will just fail.

I think that conditional on the efforts working, the chance of tyranny is quite high (ballpark 30-40%). I don't think they'll work, but if they do, it seems quite bad.

And since I think x-risk from technical AI alignment failure is in the 1-2% range, the risk of tyranny is the dominant effect of "actually enforced global AI pause" in my EV calculation, followed by the extra fast takeoff risks, and then followed by "maybe we get net positive alignment research."

I have now made a clarification at the very top of the post to make it 1000% clear that my opposition is disjunctive, because people repeatedly get confused / misunderstand me on this point.

Please stop saying that mind-space is an "enormously broad space." What does that even mean? How have you established a measure on mind-space that isn't totally arbitrary?

What if concepts and values are convergent when trained on similar data, just like we see convergent evolution in biology?

I think this post is best combined with my post. Together, these posts present a coherent, disjunctive set of arguments against pause.
