This post links to the Future of Life Institute's most recent AI alignment podcast episode, which I was a part of. I talk about a lot of things on it that I think are likely to be fairly relevant to other EAs here, including what I see as the biggest problems that need to be solved in AI safety, as well as some of what I see as the most promising solutions.
I'm happy to answer any questions that come up here. The full transcript of the podcast is also available at the above link, in case anyone would rather read than listen.
Glad you enjoyed it!
So, I think what you're describing, a model with a pseudo-aligned objective pretending to have the correct objective, is a good description of deceptive alignment specifically. The inner alignment problem is a more general term: it encompasses any way in which a model might end up running an optimization process for an objective other than the one it was trained on.
In terms of empirical examples, there definitely aren't good empirical examples of deceptive alignment right now, for the reason you mentioned. Whether there are good empirical examples of inner alignment problems in general is more questionable: there are certainly lots of empirical examples of robustness/distributional shift problems, but because we don't really know whether our models are internally implementing optimization processes, it's hard to say whether we're actually seeing inner alignment failures. This post describes the sort of experiment that I think would need to be done to definitively demonstrate an inner alignment failure (Rohin Shah at CHAI also has a similar proposal here).
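For concreteness, here's a minimal toy sketch (my own construction, not the experiment described in those posts) of the kind of robustness/distributional shift failure I mean: a classifier is trained while an "intended" feature and a "proxy" feature are perfectly correlated, then evaluated after the correlation is broken. The feature names and setup are purely illustrative assumptions.

```python
# Toy sketch: an "intended" feature and a "proxy" feature are perfectly
# correlated during training, then anti-correlated at test time.
# (Illustrative only; the feature names and setup are hypothetical.)

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, correlated=True):
    intended = rng.integers(0, 2, size=n)              # the feature the label actually follows
    proxy = intended if correlated else 1 - intended   # tracks the label only during training
    X = np.stack([intended, proxy], axis=1).astype(float)
    X += rng.normal(scale=0.1, size=X.shape)           # small observation noise
    return X, intended

X_train, y_train = make_data(1000, correlated=True)
X_test, y_test = make_data(1000, correlated=False)

model = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))              # near 1.0
print("test accuracy (proxy broken):", model.score(X_test, y_test))  # near chance
print("weights [intended, proxy]:", model.coef_[0])                  # weight split across both features
```

An experiment like this shows that a proxy was learned, but it says nothing about whether the model is internally running an optimization process over that proxy, which is the part that would be needed to call it an inner alignment failure rather than a garden-variety robustness failure.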