Neel Nanda

What are the coolest topics in AI safety, to a hopelessly pure mathematician?

I did a pure maths undergrad and recently switched to doing mechanistic interpretability work - my day job isn't exactly doing maths, but I find it has a strong aesthetic appeal in a similar way. My job is not to train an ML model (with all the mess and frustration that involves); it's to take a model someone else has trained and try to rigorously understand what is going on inside it. I want to take some behaviour I know it's capable of, understand how it does that, and ideally decompile the operations it's running into something human-understandable. And, fundamentally, a neural network is just a stack of matrix multiplications. So I'm trying to build tools and lenses for analysing this stack of matrices and converting it into something understandable. Day-to-day, this looks like having ideas for experiments, writing code and running them, getting feedback and iterating. But there have been a handful of times where having good intuitions around linear algebra or how gradients work, and spending some time working through the algebra, has been really useful and clarifying.

If you're interested in learning more, Zoom In is a good overview of a particular agenda for mechanistic interpretability in vision models (which I personally find super inspiring!), and my team wrote a pretty mathsy paper giving a framework to break down and understand small, attention-only transformers (I expect the paper to only make sense after reading an overview of autoregressive transformers like this one). If you're interested in working on this, there are currently teams at Anthropic, Redwood Research, DeepMind and Conjecture doing work along these lines!

Can we agree on a better name than 'near-termist'? "Not-longermist"? "Not-full-longtermist"?

the reason the "longtermists working on AI risk" care about the total doom in 15 years is because it could cause extinction and preclude the possibility of a trillion-happy-sentient-beings in the long term. Not because it will be bad for people alive today.

As a personal example, I work on AI risk and care a lot about harm to people alive today! I can't speak for the rest of the field, but I think the argument for working on AI risk goes through if you just care about people alive today and hold beliefs which are common in the field - see this post I wrote on the topic, and a post by Scott Alexander on the same theme.

[Book rec] The War with the Newts as “EA fiction”

Thanks for the recommendation! I've just finished reading it and really enjoyed it. Note for future readers that the titular "war" only really happens towards the end of the book, and most of it is about setup and exploring the idea of introducing newts to society.

"Long-Termism" vs. "Existential Risk"

No worries, I'm excited to see more people saying this! (Though I did have some eerie deja vu when reading your post initially...)

I'd be curious if you have any easy-to-articulate feedback re why my post didn't feel like it was saying the same thing, or how to edit it to be better? 

(EDIT: I guess the easiest object-level fix is to edit in a link at the top to yours, and say that I consider you to be making substantially the same point...)

How I Formed My Own Views About AI Safety

Inside view feels deeply emotional and tied to how I feel the world to be; independent impression feels cold and abstract.

What is the new EA question?

How can we best allocate our limited resources to improve the world? Sub-question: Which resources are worth the effort to optimise the allocation of, and which are not, given that we all have limited time, effort and willpower?

I find this framing most helpful. In particular, for young people, the most valuable resource they have is their future labor. Initially, converting this to money and the money to donations was very effective, but now this is often outcompeted by working directly on high priority paths. But the underlying question remains. And I'd argue we often reach the point where optimising our use of money, as it manifests in frugality and thrift, is not worth the willpower and opportunity costs, given that there's a lot more money than vetting capacity or labor. (Implicit assumption: thrift has a cost and is the non-default option. This feels true for me but may not generalise.)

How I Formed My Own Views About AI Safety

The complaint that it's confusing jargon is fair. Though I do think the Tetlock sense + the phrase "inside view" captures something important - my inside view is what feels true to me, according to my personal best guess and internal impressions. Deferring doesn't feel true in the same way; it feels like I'm overriding my beliefs, not like how the world is.

This mostly comes under the motivation point - maybe, for motivation, inside views matter but independent impressions don't? And people differ on how they feel about the two?

How I Formed My Own Views About AI Safety

One thing I disagree with: the importance of forming inside views for community epistemic health. I think it's pretty important. E.g. I think that ~2 years ago, the arguments for the longterm importance of AGI safety were pretty underdeveloped; that since then lots more people have come out with their inside views about it; and that now the arguments are in much better shape.

I want to push back against this. The aggregate benefit may have been high, but when you divide it by all the people trying, I'm not convinced it's all that high.

Further, that's an overestimate - the actual question is more like 'if the people who are least enthusiastic about it stop trying to form inside views, how bad is that?'. And I'd both guess that impact is fairly heavy tailed, and that the people most willing to give up are the least likely to have a major positive impact.

I'm not confident in the above, but it's definitely not obvious

Simplify EA Pitches to "Holy Shit, X-Risk"

Fair point re tractability

What argument do you think works on people who already think they're working on important and neglected problems? I can't think of any argument that doesn't just boil down to one of those

The value of small donations from a longtermist perspective

Thanks for the post! I broadly agree with the arguments you give, though I think you understate the tensions between promoting earning to give vs direct work.

Personal example: I'm currently doing AI Safety work, and I expect it to be fairly impactful. But I came fairly close to going into finance, as it was a safe, stable path I was confident I'd enjoy. And part of this motivation was a fuzzy feeling that donations were still somewhat good. And this made it harder to internalise just how much higher the value from direct work was. Anecdotally, a lot of smart mathematicians I know are tempted by finance and have a similar problem. And in cases like this, I think that promoting longtermist donations is actively in tension with high impact career advice.
