Neel Nanda

3311 karma · Joined Nov 2019


Independent mechanistic interpretability researcher.


"You’ll need to get hands-on. The best ML and alignment research engages heavily with neural networks (with only a few exceptions). Even if you’re more theoretically-minded, you should plan to be interacting with models regularly, and gain the relevant coding skills. In particular, I see a lot of junior researchers who want to do “conceptual research”. But you should assume that such research is useless until it cashes out in writing code or proving theorems, and that you’ll need to do the cashing out yourself (with threat modeling being the main exception, since it forces a different type of concreteness). ..."

This seems strongly true to me

I agree re PhD skillsets (though I think some fraction of people gain a lot of high-value skills during a PhD, esp re research taste and agenda setting).

I think you're way overrating OpenAI though - in particular, Anthropic's early employees/founders include more than half of the GPT-3 first authors!! I think the company has become much more oriented around massive distributed LLM training runs in the last few years though, so maybe your inference that people would gain those skills is more reasonable now.

This seems fair; I'm mainly pushing back on this as a criticism of Redwood, and on the focus on the "Redwood has been overfunded" narrative. I agree that they probably consumed a bunch of grantmaker time, and am sympathetic to the idea that OpenPhil is making a bunch of mistakes here.

I'm curious which academics you have in mind as slam dunks?

  • I personally found MLAB extremely valuable. It was very well-designed and well-taught and was the best teaching/learning experience I've had by a fairly wide margin

Strong +1, I was really impressed with the quality of MLAB. I got a moderate amount out of doing it over the summer, and would have gotten much, much more if I had done it a year or two before. I think that kind of outreach is high value, though plausibly a distraction from the core mission.

Sorry for the long + rambly comment! I appreciate the pushback, and found clarifying my thoughts on this useful

I broadly agree that all of the funding ideas you point to seem decent. My biggest crux is that the counterfactual of not funding Redwood is not that one of those gets funded, and that the real constraints here are around logistical effort, grantmaker time, etc. I wrote a comment downthread with further thoughts on these points.

And that it is not Redwood's job to solve this - they're pursuing a theory of change that does not depend on these, and it seems very unreasonable to suggest that they should pursue one of these other uses of money instead, even if you think that the use of money is a great idea.

Re 1, concretely, I've been trying to help one of those professors get more funding for his lab, and think this is a high-impact use of money. But evaluating professors is hard, thinking through capabilities externalities is hard, figuring out a lab's room for more funding is hard, and it's harder to productively spend a ton of money in academia, eg >$1mn (eg, it's pretty hard to just hire a bunch of engineers, and interp doesn't really need a ton of compute). There are also dumb network problems: the academics don't know how to get funding, it's not very legible how to apply to OpenPhil, not everyone is comfortable taking EA money, etc (I would like these problems to be solved, to be clear). I don't think it's a matter of just having more money.

Poach experienced researchers who are executing well on interpretability but working on what (by Redwood's lights) are less important problems, and redirect them to more important problems. Not everyone would want to be "redirected", but there's a decent fraction of people who would love to work on more ambitious problems but are currently not incentivized to do so

I don't know anyone like this. If you do, please intro me! (I met someone vaguely in this category and helped them get an FTX grant at the start of November... but they only tangentially fit your description.) I'm pretty unconvinced there are many people like this out there who could be redirected to productively do what I consider good interp work - beyond just motivation and interest in doing independent-ish work, there are also significant considerations of research taste, having mentorship to do work I think is important, etc.

Make one-year seed grants of around $100k to 20 early-career researchers (PhD students, independent researchers) to work on interpretability, nudging them towards a list of problems viewed important by Redwood. Provide low-touch mentorship (e.g. once a month call). Scale up the grants and/or hire people from the projects that did well after the one-year trial.

Seems good, I'd be excited about this happening. I consider my MATS scholars to be vaguely in the spirit of this, and I've been very impressed with them. But, like, this is so not bottlenecked on money. It's a substantial program that would take effort to run, it's not clear to me that these people would do good work without mentorship (and a call once a month may not be sufficient), and it's not clear that this adds much value beyond existing independent researcher grants, etc. But I do think it's a decent idea - if anyone is interested in making this happen, please reach out!

However, the denominator is very large, so I still expect the majority of TAIS-relevant interpretability work to happen outside TAIS organizations

There's some work I think is cool, but it tends to be concentrated in a small handful of actually good labs (eg I like ROME and Emergent World Representations a lot). There's a bunch of work I think isn't great, but which sometimes has great gems in it. But honestly I think that well over a majority of impact-weighted TAIS work was done by the TAIS community (specifically, Chris Olah + collaborators' work is quite possibly a majority in my mind). I'd be interested in being pointed to work that you think is great that I'm missing - I personally find literature reviews pretty tedious, and think I underinvest in this kind of thing.

More broadly, my position is that engaging with academia is a theory of change, but one of many. It's a significant investment of time, some people are much better at it than others (eg, I personally just hate writing papers, and am much worse at it than at directly trying to do good research, or writing blog posts/educational materials/good tooling), it's hard to direct in targeted ways, it benefits a bunch from legible signalling and credentials, etc. I also think Redwood are more pessimistic on it than I am, and eg I am personally not convinced that trying to get grokking into ICLR was a good use of time and effort (though I hope it was!). I think Redwood are making a pretty reasonable bet here.

As a negative example here, I think Distill was a major investment of effort into influencing academia, including into doing better interp work, and it basically failed as far as I can tell (despite, to my eyes, Distill papers being notably higher quality and more interesting than conference papers).

I'm curious if you view this as being significantly more costly than I do, or the improvements to the project from peer-review as being less significant.

I want to distinguish two things - putting in the effort to make a write-up really good, and putting in the effort to eg get it accepted at ICLR/ICML/NeurIPS. I am pretty pro making write-ups really good (I personally am not very good at it and try to avoid it where possible, but this is personal taste, not a value judgement). Eg I really like Anthropic interp papers (though am biased) and think the effort put into presentation and clarity is pretty well spent. And part of submitting to a top conference is making things tightly and clearly phrased, having good figures, making them well presented, and having good evidence for your results.

IMO the biggest cost is shaping the results and narrative of your work to fit the kind of thing that reviewers look for, and think is good. I broadly think this just isn't that correlated with what good interp work looks like. I think this can be extremely expensive if you let it shape your research process, choice of projects, etc for "this would make a good publication". In cases like grokking, I did the research I wanted to do, and we then decided to go for a publication, which I think was basically fine, and much less costly. But it did involve significant reshaping and optimisation of the narrative (I am personally sad that progress measures got into the title lol).

Idk, these are complex questions, and there are people I respect who are way more or less pro academia + publishing than me. I am personally pretty biased against academia and publishing, and this affects my value judgements here.

Fwiw, my read is that a lot of "must have an ML PhD" requirements are gatekeeping nonsense. I think you learn useful skills doing a PhD in ML, and I think you learn some skills doing a non-ML PhD (but much less that's relevant, though physics PhDs are probably notably more relevant than maths). But also that eg academia can be pretty terrible for teaching you skills like ML engineering and software engineering, lots of work in academia is pretty irrelevant in the world of the bitter lesson, and lots of PhDs have terrible mentorship.

I care about people having skills, but think that a PhD is only an OK proxy for them, and would broadly respect the skills of someone who worked at one of the top AI labs for four years straight out of undergrad notably more than someone straight out of a PhD program

I particularly think that in interpretability, lots of standard ML experience isn't that helpful, and can actively teach bad research taste and focus on pretty unhelpful problems

(I do think that Redwood should prioritise "hiring people with ML experience" more, fwiw, though I hold this opinion much more strongly around their adversarial training work than their interp work)

There are other TAIS labs (academic and not) that we believe could absorb and spend considerably more funding than they currently receive.

My understanding is that, had Redwood not existed, OpenPhil would not have significantly increased their funding to these other places, and broadly has more money than they know what to do with (especially in the previous EA funding environment!). I don't know whether those other places have applied for grants, or why they aren't as funded as they could be, but this doesn't seem that related to me. And more broadly there are a bunch of constraints on grantmakers, like time to evaluate a grant, having enough context to competently evaluate it or external advisors with context who they trust, etc. Eg, I'm a bit hesitant about funding interpretability academics who I think will go full steam ahead on capabilities (I think it's often worth doing anyway, but it's not obvious to me, and the one time I recommended a grant here it did consume quite a lot of my time to evaluate the nuances).

And grantmaking is just really not an efficient market, and there are lots of good grants that don't happen for dumb reasons.

Concretely, it's plausible to me that taking the marginal $1 million given to Redwood and dividing it evenly among the other labs you mention would be good. But that doesn't feel like the right counterfactual here.

Neel Nanda, Tom Lieberum and others, mentored by Jacob Steinhardt

I'll clarify that in my personal case, I did the grokking work as an independent research project, and Jacob only became involved after I had done the core research; his mentorship was specifically about the process of distillation and writing up the results. (To be clear, his mentorship here was high value! But I think the paper benefited less from his mentorship than is implied by the reference class of having him as the final author.)

Answer by Neel Nanda · Apr 01, 2023

Proofreading a job application seems completely fine and socially normal to me, including for content. The thing that crosses a line, by my lights, is having someone (or GPT-4) write it for you.
