I lead the DeepMind mechanistic interpretability team
1 is very true. 2 I agree with apart from the word "main": it seems hard to label any factor as "the main" thing, and there's a bunch of complex reasoning about counterfactuals. E.g. if GDM stopped work, that wouldn't stop Meta, so is GDM working on capabilities actually the main thing?
I'm pretty unconvinced that not sharing results with frontier labs is tenable - leaving aside that these labs are often the best places to do certain kinds of safety work, if our work is to matter, we need the labs to use it! And you often get valuable feedback on the work by seeing it actually used in production. Having a bunch of safety people who work in secret and then unveil their safety plan at the last minute seems very unlikely to work to me
I personally think that "does this advance capabilities" is the wrong question to ask, and instead you should ask "how much does this advance capabilities relative to safety". Safer models are just more useful, and more profitable a lot of the time! E.g. I care a lot about avoiding deception. But honest models are just generally more useful to users (beyond white lies I guess). And I think it would be silly for no one to work on detecting or reducing deception. I think most good safety work will inherently advance capabilities in some sense, and this is a sign that it's actually doing anything real. I struggle to think of any work I think is both useful and doesn't advance capabilities at all.
Strong +1 to Richard, this seems a clearly incorrect moderation call and I encourage you to reverse it.
I'm personally very strongly opposed to killing people because they eat meat, and the general ethos behind that. I don't feel in the slightest offended or bothered by that post, it's just one in a string of hypothetical questions, and it clearly is not intended as a call to action or to encourage action.
If the EA Forum isn't somewhere where you can ask a perfectly legitimate hypothetical question like that, what are we even doing here?
I imagine you can get a lot of the value here more cheaply by reaching out to people in the field and asking them a bunch of questions about what signals do and do not impress them?
Doing internships etc is valuable to get the supervision to DO the impressive projects, of course.
EDIT: Speaking as someone who does hiring of interpretability researchers, I think there's a bunch of signals I look for and ones I don't care about, and sometimes people new to the field have very inaccurate guesses here.
Seems reasonable (tbh with that context I'm somewhat OK with the original ban), thanks for clarifying!