(Told this to Jenny in person, but posting for the benefit of others)
AI safety is a young, pre-paradigmatic area of research without a universally accepted mathematical formalism. So if you're after cool math, my suggestion is to learn the basics of one or two well-established fields that are mathematically mature and have a decent chance of being relevant to AI safety.
In particular, I think Learning Theory and Causality are areas with plenty of Aesthetic Math™.
Learning theory
Statistical learning theory is the mathematical study of inductive reasoning—how can we make generalizations from past observations to future observations? It's an entire mathematically rich field devoted to formalizing Occam's razor.
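To give a taste of the formalism, here is a minimal sketch of Occam's razor made quantitative: the standard realizable PAC bound for a finite hypothesis class (the conjunction example and all numbers are my own, chosen for illustration). Smaller hypothesis classes need fewer samples to generalize.

```python
import math

def pac_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Realizable PAC bound for a finite hypothesis class H: with this many
    i.i.d. examples, any hypothesis in H that is consistent with the data
    has true error <= epsilon, with probability >= 1 - delta."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# Example: boolean conjunctions over 20 variables (each variable appears
# positively, negatively, or not at all, so |H| = 3^20).
print(pac_sample_bound(3**20, epsilon=0.05, delta=0.01))  # 532 samples suffice
```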
Computational learning theory imposes the further restriction that learning algorithms be computationally efficient. It has rich connections to other parts of theoretical computer science (for example, there is a duality between computational learning theory and cryptography: positive results for one translate to negative results for the other!). And there are many fun problems with a combinatorial-puzzle flavor.
Most of learning theory assumes that observations are drawn i.i.d. from a distribution. Online Learning asks what happens if we eliminate this assumption. Incredibly, it can be shown that inductive reasoning can be successful even when observations are handcrafted by an adversary. The key is to measure success in relative rather than absolute terms: how did you perform in comparison to the best member of a pre-specified class of predictors? There are beautiful connections to convex analysis.
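For flavor, here is a minimal sketch of the classic Hedge (multiplicative weights) forecaster; the toy experts and outcome stream are my own. Even when the outcome sequence is chosen adversarially, its cumulative loss is provably close to that of the best expert in hindsight.

```python
import math

def hedge(experts, outcomes, eta=0.5):
    """Multiplicative-weights forecaster over a fixed class of experts.
    Regret vs. the best single expert is O(sqrt(T * ln(len(experts))))
    for a suitably tuned eta, with no distributional assumptions."""
    weights = [1.0] * len(experts)
    total_loss = 0.0
    for t, outcome in enumerate(outcomes):
        z = sum(weights)
        probs = [w / z for w in weights]
        losses = [int(expert(t) != outcome) for expert in experts]  # 0/1 loss
        total_loss += sum(p * l for p, l in zip(probs, losses))
        # Exponentially downweight the experts that erred on this round.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return total_loss

# Toy stream where the second expert is always right.
outcomes = [t % 2 for t in range(100)]
experts = [lambda t: 0, lambda t: t % 2, lambda t: 1]
print(hedge(experts, outcomes))  # small total loss, close to the best expert's 0
```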
Readings:
Causality
I don't know this area as well, but the material I have learned has been mathematically beautiful. In particular, I suggest learning about Judea Pearl's theory of causality, which has been very influential in computer science, statistics, and some of the natural and social sciences. (There are a few competing formalisms for causality, but Pearl's is the most mathematically beautiful as far as I can tell.) Pearl's theory generalizes the classical theory of probability to allow for reasoning about cause and effect, using a framework that involves manipulations of directed acyclic graphs.
Reading: Causality, by Pearl.
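To show the flavor of the formalism, here is a toy sketch (the DAG and all numbers are invented for illustration) of the central idea: on a confounded graph, Pearl's do-operator gives a different answer from ordinary conditioning, computed here via the back-door adjustment formula.

```python
# Toy DAG:  Z -> X,  Z -> Y,  X -> Y, all variables binary.
P_Z = {0: 0.7, 1: 0.3}                              # P(Z = z)
P_X1_given_Z = {0: 0.2, 1: 0.9}                     # P(X = 1 | Z = z)
P_Y1_given_XZ = {(0, 0): 0.1, (1, 0): 0.5,
                 (0, 1): 0.6, (1, 1): 0.9}          # P(Y = 1 | X = x, Z = z)

def p_x_given_z(x, z):
    return P_X1_given_Z[z] if x == 1 else 1 - P_X1_given_Z[z]

def p_y1_given_x(x):
    # Ordinary conditioning: P(Y=1 | X=x) = sum_z P(Y=1|x,z) P(z|x).
    num = sum(P_Y1_given_XZ[(x, z)] * p_x_given_z(x, z) * P_Z[z] for z in (0, 1))
    den = sum(p_x_given_z(x, z) * P_Z[z] for z in (0, 1))
    return num / den

def p_y1_given_do_x(x):
    # Intervention: cut the Z -> X edge and average over P(Z)
    # (back-door adjustment): P(Y=1 | do(X=x)) = sum_z P(Y=1|x,z) P(z).
    return sum(P_Y1_given_XZ[(x, z)] * P_Z[z] for z in (0, 1))

print(p_y1_given_x(1), p_y1_given_do_x(1))  # ~0.763 vs 0.62: they differ
```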
In outer alignment one can write down a correspondence between ML training schemes that learn from human feedback and complexity classes related to interactive proof schemes. If we model the human as a (choosable) polynomial time algorithm, then
1. Debate and amplification get to PSPACE, and more generally n-step debate gets to Σ_n^P, the n-th level of the polynomial hierarchy (a toy sketch of the debate-as-game correspondence follows this list).
2. Cross-examination gets to NEXP.
3. If one allows opaque pointers, there are schemes that go further: market making gets to R, the class of all decidable problems.
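To illustrate item 1, here is a toy sketch (my own simplification, not a formal result quoted from the literature) of why optimal debate reaches PSPACE: with two optimal players alternately committing bits and a polynomial-time judge ruling on the transcript, the value of the game is exactly a quantified boolean formula, and evaluating such formulas is PSPACE-complete.

```python
def debate_value(judge, n, bits=()):
    """Value of an n-step debate: players alternately commit bits, then a
    polynomial-time `judge` rules on the full transcript. The first player
    maximizes (an exists-quantifier), the second minimizes (a forall-
    quantifier), so the value is: exists x1 forall x2 ... judge(x1..xn)."""
    if len(bits) == n:
        return judge(bits)
    options = (debate_value(judge, n, bits + (b,)) for b in (0, 1))
    return max(options) if len(bits) % 2 == 0 else min(options)

# Toy judge: phi(x1, x2, x3) = (x1 or not x2) and (x2 or x3).
judge = lambda b: int((b[0] or not b[1]) and (b[1] or b[2]))
print(debate_value(judge, 3))  # 1, i.e. exists x1 forall x2 exists x3: phi
```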
Moreover, we informally have constraints on which schemes are practical based on properties of their complexity class analogues. In particular, interactive proof schemes are only interesting if they relativize: we also have IP = PSPACE, and thus a single prover gets to PSPACE given an arbitrary polynomial-time verifier, but w.r.t. a typical oracle O we have IP^O < PSPACE^O. My sense is that there are further obstacles to be found: my intuition is that "market making = R" isn't the right theorem once obstacles are taken into account, but I don't have a formalized model of this intuition.
The reason this type of intuition is useful is that humans are unreliable, and schemes that reach higher complexity class analogues should (all else equal) give the humans more help in noticing problems with ML systems.
I think there's quite a bit of useful work that can be done pushing this type of reasoning further, but (full disclosure) it isn't of the "solve a fully formalized problem" sort. Two examples:
1. As mentioned above, I find "market making = R" unlikely to be the right result. But this doesn't mean market making isn't an interesting scheme: there are connections between market making and Paul Christiano's learning the prior scheme. As previously formalized, market making misses a practical limitation on the available human data (the n-way assumption in that link), so there may be work to do to reformalize it into a more limited complexity class in a more useful way.
2. Two-player debate is only one of many possible schemes that use self-play to train systems; in particular, one could try to shift to n-player schemes to reduce playing-for-variance strategies, where a player who is behind goes for risky lies in order to possibly win (a toy illustration follows this list). But the "polynomial time judge" model can't capture this situation, as there is no variance when trying to convince a deterministic algorithm. As a result, there is a need for a more precise formalization that can pick up the difference between self-play schemes that are more or less robust to human error, possibly related to CRMDPs.
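As a toy illustration of that last point (the setup and numbers are mine, not a claim about any particular scheme): a debater who is behind loses for sure by playing honestly, so against a noisy judge any lie with a nonzero chance of slipping through is optimal, while against a deterministic judge the same gamble buys nothing.

```python
import random

def win_rate(behind, lie, judge_error, trials=100_000):
    """Monte Carlo win rate for one debater. Honest play: whoever is ahead
    on the merits wins. A lie wins only if the judge fails to catch it,
    which happens with probability judge_error."""
    wins = 0
    for _ in range(trials):
        wins += (random.random() < judge_error) if lie else (not behind)
    return wins / trials

for judge_error in (0.0, 0.2):  # 0.0 behaves like a deterministic judge
    honest = win_rate(behind=True, lie=False, judge_error=judge_error)
    risky = win_rate(behind=True, lie=True, judge_error=judge_error)
    print(f"judge_error={judge_error}: honest={honest:.2f}, risky lie={risky:.2f}")
# With a noisy judge, the behind player strictly prefers the risky lie; with a
# deterministic judge there is no variance to play for, which is exactly the
# incentive the polynomial-time-judge model fails to distinguish.
```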