This post is for the Future of Life Institute's most recent AI alignment podcast episode, which I was a part of. I cover a lot of material on it that I think is likely to be relevant to other EAs here, including what I see as the biggest problems that need to be solved in AI safety and some of what I see as the most promising solutions.

I'm happy to answer any questions that people might have here if any come up. The full transcript of the podcast is also available at the above link in case anyone would rather read than listen.





This was a particularly informative podcast, and you helped me get a better understanding of inner alignment issues, which I really appreciate.

To check that I understand: the issue with inner alignment is that an agent gets optimized for a reward/cost function on a training distribution, and to do well it needs a world model good enough to determine that it is, or could be, undergoing training. If training then ends up creating an optimizer, it's much more likely that the optimizer's reward function is bad or a proxy, and if the optimizer is sufficiently intelligent, it'll reason that it should figure out what you want it to do and do that. This is because there are many different bad reward functions an inner optimizer can have but only one that you actually want it to have, and an optimizer with any of those bad reward functions will pretend to have the good one.

That said, it seems like the badly aligned agents would at least be optimizing for proxies of what you actually want, since early (dumber) agents with unrelated utility functions wouldn't do as well as alternative agents with approximately aligned utility functions.

Please correct me on any mistakes.

Also, because this depends on the agents being at least a little generally intelligent, I'm guessing there are no contemporary examples of such inner optimizers attempting deception.

Glad you enjoyed it!

I think what you're describing, a model with a pseudo-aligned objective pretending to have the correct objective, is a good description of deceptive alignment specifically. The inner alignment problem is a more general term, though: it encompasses any way in which a model might be running an optimization process for a different objective than the one it was trained on.

In terms of empirical examples, there definitely aren't good examples of deceptive alignment right now, for the reason you mentioned. Whether there are good empirical examples of inner alignment problems in general is more questionable. There are certainly lots of empirical examples of robustness/distributional shift problems, but because we don't really know whether our models are internally implementing optimization processes, it's hard to say whether we're actually seeing inner alignment failures. This post provides a description of the sort of experiment which I think would need to be done to definitively demonstrate an inner alignment failure (Rohin Shah at CHAI also has a similar proposal here).
