I am a mathematics grad student. I think that working on AI safety research would be a valuable thing for me to do, if the research were something I felt intellectually motivated by. Unfortunately, whether I feel intellectually motivated by a problem has little to do with what is useful or important; it basically just depends on how cool/aesthetic/elegant the math involved is.

I've taken a semester of ML and read a handful (~5) of AI safety papers as part of a Zoom reading group, and thus far none of it appeals. This might be because nothing in AI research will be adequately appealing, but it might also be that I just haven't found the right topic yet. So to that end: what's the coolest math involved in AI safety research? What problems might I really like reading about or working on?

8 Answers

By Scott Garrabrant et al:

By John Wentworth

By myself:

These are the sort of thing I'm looking for! In that, on first glance, they're a lot of solid "maybe"s where mostly I've been finding "no"s. So that's encouraging -- thank you so much for the suggestions!

The second and third strike me as useful ideas and kind of conceptually cool, but not terribly math-y; rather than feeling like these are interesting math problems, the math feels almost like an afterthought. (I've read a little about corrigibility before, and had the same feeling then.) The first is the coolest, but also seems like the least practical -- doing math about weird simulation thought experiments is fun but I don't personally expect it to come to much use.

Thank you for sharing all of these! I sincerely appreciate the help collecting data about how existing AI work does or doesn't mesh with my particular sensibilities.

7 Owen Cotton-Barratt 21d
To me they feel like pre-formal math? Like the discussion of corrigibility gives me a tingly sense of "there's what on the surface looks like an interesting concept here, and now the math-y question is whether one can formulate definitions which capture that and give something worth exploring". (I definitely identify more with the "theory builder" of Gowers's two cultures [https://www.dpmms.cam.ac.uk/~wtg10/2cultures.pdf].)
3 Jenny K E 21d
Ah, that's a good way of putting it! I'm much more of a "problem solver."

Cool!

My opinionated takes for problem solvers:

(1) Over time we'll predictably move in the direction from "need theory builders" to "need problem solvers", so even if you look around now and can't find anything, it might be worth checking back every now and again.

(2) I'd look at ELK now for sure, as one of the best and further-in-this-direction things.

(3) Actually many things have at least some interesting problems to solve as you get deep enough. I expect curricula teaching ML very much not to do this, but if you have mastery of ML and are trying to achieve new things with it, many more interesting problems to solve come up. Unfortunately I don't know how to predict how much of the itch this will address for you ... maybe one question is how much satisfaction you find in solving problems outside of pure mathematics? (e.g. logic puzzles, but also things in other domains of life)

6 Jenny K E 20d
The point about checking back in every now and then is a good one; I had been thinking in more binary terms and it's helpful to be reminded that "not yet, maybe later" is also a possible answer to whether to do AI safety research. I like logic puzzles, and I like programming insofar as it's like logic puzzles. I'm not particularly interested in e.g. economics or physics or philosophy. My preferred type of problem is very clear-cut and abstract, in the sense of being solvable without reference to how the real world works. More "is there an algorithm with time complexity Y that solves math problem X" than "is there a way to formalize real-world problem X into a math problem for which one might design an algorithm." Unfortunately AI safety seems to be a lot of the latter!
3 Max_Daniel 21d
(Terry Tao's distinction between 'pre-rigorous', 'rigorous', and 'post-rigorous' maths [https://terrytao.wordpress.com/career-advice/theres-more-to-mathematics-than-rigour-and-proofs/] might also be relevant.)
2 Max_Daniel 21d
Maybe the notes on 'ascription universality' [https://ai-alignment.com/towards-formalizing-universality-409ab893a456] on ai-alignment.com are a better match for your sensibilities.

(Told this to Jenny in person, but posting for the benefit of others)

AI safety is a young, pre-paradigmatic area of research without a universally accepted mathematical formalism, so if you're after cool math, my suggestion is to learn the basics of one or two well-established fields that are mathematically mature and have a decent chance of being relevant to AI safety.

In particular, I think Learning Theory and Causality are areas with plenty of Aesthetic Math™.

Learning theory

Statistical learning theory is the mathematical study of inductive reasoning—how can we make generalizations from past observations to future observations? It's an entire mathematically rich field devoted to formalizing Occam's razor.
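
One standard way to make the Occam's razor reading concrete is the generalization bound for a finite hypothesis class. This is a textbook-style sketch rather than anything stated in the answer itself, and the symbols are assumptions introduced here: $\mathcal{H}$ is a finite hypothesis class, $m$ is the number of i.i.d. samples, losses lie in $[0,1]$, $R(h)$ is the true risk, and $\hat{R}(h)$ is the empirical risk. With probability at least $1-\delta$ over the sample:

```latex
R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2m}}
\qquad \text{for all } h \in \mathcal{H}.
```

The bound follows from Hoeffding's inequality plus a union bound over $\mathcal{H}$. The penalty term grows with $\ln|\mathcal{H}|$, so a hypothesis drawn from a smaller (simpler) class needs fewer samples before its training performance can be trusted.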

Computational learning theory imposes the further restriction that learning algorithms be computationally efficient. It has rich connections to other parts of theoretical computer science (for example, there is a duality between computational learning theory and cryptography—positive results for one translate to negative results for the other!) And there are many fun problems of a combinatorial puzzle flavor.

Most of learning theory assumes that observations are drawn i.i.d. from a distribution. Online Learning asks what happens if we eliminate this assumption. Incredibly, it can be shown that inductive reasoning can be successful even when observations are handcrafted by an adversary. The key is to measure success in relative rather than absolute terms: how did you perform in comparison to the best member of a pre-specified class of predictors? There are beautiful connections to convex analysis.
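
The relative-performance idea can be sketched with the Hedge (exponential weights) algorithm, one of the standard online learning methods. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the fixed loss sequence, and the parameter choices are all hypothetical, and losses are assumed to lie in [0, 1].

```python
import math

def hedge(loss_rounds, eta):
    """Run Hedge on a sequence of loss vectors with entries in [0, 1].

    Returns (expected cumulative loss of the learner,
             cumulative loss of the best single expert in hindsight).
    """
    n = len(loss_rounds[0])
    cum = [0.0] * n          # cumulative loss of each expert so far
    learner = 0.0
    for losses in loss_rounds:
        # Weights proportional to exp(-eta * cumulative loss);
        # subtracting the minimum only improves numerical stability.
        m = min(cum)
        w = [math.exp(-eta * (c - m)) for c in cum]
        z = sum(w)
        # Expected loss when an expert is sampled according to the weights.
        learner += sum(wi / z * li for wi, li in zip(w, losses))
        cum = [c + l for c, l in zip(cum, losses)]
    return learner, min(cum)

def fixed_losses(T, n):
    # Hypothetical non-i.i.d. sequence: expert i errs on every (i + 2)-th round.
    return [[1.0 if t % (i + 2) == 0 else 0.0 for i in range(n)]
            for t in range(T)]

T, n = 1000, 4
eta = math.sqrt(8 * math.log(n) / T)      # standard tuning for a known horizon T
learner_loss, best_loss = hedge(fixed_losses(T, n), eta)
regret = learner_loss - best_loss
bound = math.sqrt(T * math.log(n) / 2)    # worst-case regret guarantee for this eta
```

The guarantee `regret <= bound` holds for every loss sequence in [0, 1], not just this one, which is the sense in which no distributional assumption is needed. The bound grows like the square root of `T`, so the average per-round regret goes to zero.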

Readings:

Causality

I don't know this area as well, but the material I have learned has been mathematically beautiful. In particular, I suggest learning about Judea Pearl's theory of causality, which has been very influential in computer science, statistics, and some of the natural and social sciences. (There are a few competing formalisms for causality, but Pearl's is the most mathematically beautiful as far as I can tell.) Pearl's theory generalizes the classical theory of probability to allow for reasoning about cause and effect, using a framework that involves manipulations of directed acyclic graphs.
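
To give a flavor of how this differs from ordinary conditioning, here is a small sketch of a three-variable model with a confounder. The graph (Z -> X, Z -> Y, X -> Y), the probability tables, and the function names are all made up for illustration. It compares the observational quantity P(Y=1 | X=x) with the interventional quantity P(Y=1 | do(X=x)), the latter computed with Pearl's backdoor adjustment formula.

```python
from fractions import Fraction as F

# Hypothetical binary model with graph Z -> X, Z -> Y, X -> Y.
p_z = {0: F(1, 2), 1: F(1, 2)}                    # P(Z=z)
p_x1_given_z = {0: F(1, 5), 1: F(4, 5)}           # P(X=1 | Z=z)
p_y1_given_xz = {                                 # P(Y=1 | X=x, Z=z), keyed by (x, z)
    (0, 0): F(1, 10), (1, 0): F(3, 10),
    (0, 1): F(5, 10), (1, 1): F(7, 10),
}

def p_x_given_z(x, z):
    return p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]

def observational(x):
    """P(Y=1 | X=x): ordinary conditioning, which mixes in the effect of Z."""
    num = sum(p_z[z] * p_x_given_z(x, z) * p_y1_given_xz[(x, z)] for z in p_z)
    den = sum(p_z[z] * p_x_given_z(x, z) for z in p_z)
    return num / den

def interventional(x):
    """P(Y=1 | do(X=x)) via backdoor adjustment: sum_z P(z) P(Y=1 | x, z)."""
    return sum(p_z[z] * p_y1_given_xz[(x, z)] for z in p_z)

naive_effect = observational(1) - observational(0)
causal_effect = interventional(1) - interventional(0)
```

In this toy model the naive difference overstates the causal effect, because Z raises the probability of both X and Y. Intervening corresponds to deleting the arrow Z -> X from the graph, which is why the adjustment weights by P(z) rather than P(z | x).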

Reading: Causality, by Pearl.

Not sure what mathematically interests you, but you should probably check out Vanessa Kosoy's learning-theoretic research agenda (she is hiring mathematicians!). Also, the Topos Institute are doing many interesting things in AI safety and other things (I'm personally particularly interested in their compositionality/modeling work, which seems very cool to me).

A couple of unasked-for pieces of advice that may be relevant (would be for my past self who was sort of in a similar position):

  1. Sadly, we should often expect tradeoffs between impact and interest, since actually implementing innovations requires hard manual work. This is especially true in academic fields, where impactful but uninteresting work is more neglected. 
  2. Our interests change quite a bit over time, and it's usually hard to predict how they might change. That said, many people find things more interesting the more competent they feel at them and the more they care about the problem they are trying to solve or the product they intend to deliver. 

Your points (1) and (2) are ones I know all too well, though it was quite reasonable to point them out in case I didn't, and they may yet prove helpful to other readers of this post.

Regarding Vanessa Kosoy's work, I think I need to know more math to follow it (specifically learning theory, says Ben; for the benefit of those unlucky readers who are not married to him, he wrote his answer in more detail below). I did find myself enjoying reading what parts of the post I could follow, at least.

Regarding the Topos Institute, someone I trust has a low opinion of them; epistemic status secondhand and I don't know the details (though I intend to ask about it).

Thanks very much for the suggestions!

I did a pure maths undergrad and recently switched to doing mechanistic interpretability work - my day job isn't exactly doing maths, but I find it has a strong aesthetic appeal in a similar way. My job is not to train an ML model (with all the mess and frustration that involves), it's to take a model someone else has trained, and try to rigorously understand what is going on with it. I want to take some behaviour I know it's capable of and understand how it does that, and ideally try to decompile the operations it's running into something human understandable. And, fundamentally, a neural network is just a stack of matrix multiplications. So I'm trying to build tools and lenses for analysing this stack of matrices, and converting it into something understandable. Day-to-day, this looks like having ideas for experiments, writing code and running them, getting feedback and iterating, but I've found a handful of times where having good intuitions around linear algebra, or how gradients work, and spending some time working through algebra has been really useful and clarifying. 

If you're interested in learning more, Zoom In is a good overview of a particular agenda for mechanistic interpretability in vision models (which I personally find super inspiring!), and my team wrote a pretty mathsy paper giving a framework to break down and understand small, attention-only transformers (I expect the paper to only make sense after reading an overview of autoregressive transformers like this one). If you're interested in working on this, there are currently teams at Anthropic, Redwood Research, DeepMind and Conjecture doing work along these lines!

Thanks very much for the suggestions, I appreciate it a lot! Zoom In was a fun read -- not very math-y but pretty cool anyway. The Transformers paper also seems kind of fun. I'm not really sure whether it's math-y enough for me to be interested in it qua math...but in any event it was fun to read about, which is a good sign. I guess "degree of mathiness" is only one neuron of several neurons sending signals to the "coolness" layer, if I may misuse metaphors.

Some mathy AI safety pieces or other related material off the top of my head (in no particular order, and definitely not comprehensive nor weighted toward impact or influence):

You might be interested in this great intro sequence to embedded agency. There's also corrigibility and MIRI's other work on agent foundations.

Also, coherence arguments and consequentialist cognition.

AI safety is a young field; for most open problems we don't yet know of a way to crisply state them in a way that can be resolved mathematically. So if you enjoy taking messy questions and turning them into neat math you'll probably find much to work on.

ETA: oh and of course ELK.

In outer alignment one can write down a correspondence between ML training schemes that learn from human feedback and complexity classes related to interactive proof schemes.  If we model the human as a (choosable) polynomial time algorithm, then

1. Debate and amplification get to PSPACE, and more generally $n$-step debate gets to $\Sigma_n^{\mathsf{P}}$.
2. Cross-examination gets to NEXP.
3. If one allows opaque pointers, there are schemes that go further: market making gets to R.

Moreover, we informally have constraints on which schemes are practical based on properties of their complexity class analogues.  In particular, interactive proof schemes are only interesting if they relativize: we also have $\mathsf{IP} = \mathsf{PSPACE}$ and thus a single prover gets to PSPACE given an arbitrary polynomial time verifier, but w.r.t. a typical oracle $A$, $\mathsf{IP}^A \neq \mathsf{PSPACE}^A$.  My sense is there are further obstacles that can be found: my intuition is that "market making = R" isn't the right theorem once obstacles are taken into account, but I don't have a formalized model of this intuition.

The reason this type of intuition is useful is that humans are unreliable, and schemes that reach high complexity class analogues should (everything else equal) give more help to the humans in noticing problems with ML systems.

I think there's quite a bit of useful work that can be done pushing this type of reasoning further, but (full disclosure) it isn't of the "solve a fully formalized problem" sort.  Two examples:

1. As mentioned above, I find "market making = R" unlikely to be the right result.  But this doesn't mean that market making isn't an interesting scheme: there are connections between market making and Paul Christiano's learning the prior scheme.  As previously formalized, market making misses a practical limitation on the available human data (the -way assumption in that link), so there may be work to do to reformalize it into a more limited complexity class in a more useful way.

2. Two-player debate is only one of many possible schemes using self-play to train systems, and in particular one could try to shift to $n$-player schemes in order to reduce playing-for-variance strategies where a behind player goes for risky lies in order to possibly win.  But the "polynomial time judge" model can't model this situation, as there is no variance when trying to convince a deterministic algorithm.  As a result, there is a need for more precise formalization that can pick up the difference between self-play schemes that are more or less robust to human error, possibly related to CRMDPs.

9 comments

What are examples of things you find cool/aesthetic/elegant?

My favorite fields of math are abstract algebra, algebraic topology, graph theory, and computational complexity. The latter two are my current research fields. This may seem to contradict my claim of being a pure mathematician, but I think my natural approach to research is a pure mathematician's approach, and I have on many occasions jokingly lamented the fact that TCS is in the CS department, instead of in the math department where it belongs. (This joke is meant as a statement about my own preferences, not a claim about how the world should be.)

Some examples of specific topics I've found particularly fun to explain to people: the halting problem, P vs NP and the idea of poly-time reductions, Kempe's false proof of the four-color theorem, the basics of group theory.

Absolutely agree with everything you've said here! AI safety is by no means the only math-y impactful work.

Most of these don't quite feel like what I'm looking for, in that the math is being used to do something useful or valuable but the math itself isn't very pretty. "Racing to the Precipice" looks closest to being the kind of thing I enjoy.

Thank you for the suggestions!

Love this question! I too would identify as a hopelessly pure mathematician (I'm currently working on a master's thesis in category theory), and I too spent some time trying to relate my academic interests to AI safety. I didn't have much success; in particular, nothing ML-related ever appealed. I hope it works out better for you!

You might be interested in this paper on 'Backprop as Functor'.

(I'm personally not compelled by the safety case for such work, but YMMV, and I think I know at least a few people who are more optimistic.)

Question: would an impactful but not cool/popular/elegant topic interest you? What's your balance between coolness and impactfulness?

I am not intellectually motivated by things on the basis of their impactfulness. If I were, I wouldn't need to ask this question.

Elaborating on this, thanks to Spencer Becker-Kahn for prompting me to think about this more:

From the standpoint of my values and what I think is good, I'm an EA. But doing intellectual work, specifically, takes more than just my moral values. I can't work on problems I don't think are cool. I mean, I have, and I did, during undergrad, but it was a huge relief to be done with it after I finished my quals and I have zero desire to go back to it. It would be, at minimum, unsustainable for me to try to work on a problem where my main motivation for doing it is "it would be morally good for me to solve this." I struggle a bit with motivation at the best of times, or rather, on the best of problems. So, if I can find something in AI safety that I think is approximately as cool as what I'm currently doing, I'll do it, but the coolness is actually a requirement, because I won't be successful or happy otherwise. I'm not built for it (and I think most EAs aren't; fortunately some of them have different tastes than I do, as to what is or isn't cool).