"People saying things that are mildly offensive but not worth risking an argument by calling out, and get tiring after repeated exposure" is just obviously a type of comment that exists, and is what most people mean when they say microaggression. Your paper debunking it alternates between much stricter definitions and claiming an absence of evidence for something that very clearly is going to be extremely hard to measure rigorously.

I'll edit the comment to note that you dispute it, but I stand by the comment. The AI system trained is only as safe as the mentor, so the system is only safe if the mentor knows what is safe. By "restrict", I meant for performance reasons, so that it's feasible to train and deploy in new environments.

Again, I like your work and would like to see more similar work from you and others. I am just disputing the way you summarized it in this post, because I think that portrayal makes its lack of splash in the alignment community a much stronger point against the community's epistemics than it deserves.

How does Rationalist Community Attention/Consensus compare? I'd like to mention a paper of mine published at the top AI theory conference which proves that when a certain parameter of a certain agent is set sufficiently high, the agent will not aim to kill everyone, while still achieving at least human-level intelligence. This follows from Corollary 14 and Corollary 6. I am quite sure most AI safety researchers would have confidently predicted no such theorems ever appearing in the academic literature. And yet there are no traces of any minds being blown. The associated Alignment Forum post only has 22 upvotes and one comment, and I bet you've never heard any of your EA friends discuss it. It hasn't appeared, to my knowledge, in any AI safety syllabuses. People don't seem to bother investigating or discussing whether their concerns with the proposal are surmountable. I'm reluctant to bring up this example since it has the air of a personal grievance, but I think the disinterest from the Rationality Community is erroneous enough that it calls for an autopsy. (To be clear, I'm not saying everyone should be hailing this as an answer to AI existential risk, only that it should definitely be of significant interest.)


I'm someone who has read your work (this paper and FGOIL, the latter of which I have included in a syllabus), and who would like to see more work in a similar vein, as well as more formalism in AI safety. I say this to establish my bona fides, the way you established your AI safety bona fides.

I don't think this paper is mind-blowing, and I would call it representative of one of the ways in which tailoring theoretical work for the peer-review process can go wrong. In particular, you don't show that "when a certain parameter of a certain agent is set sufficiently high, the agent will not aim to kill everyone", you show something more like "when you can design and implement an agent that acts and updates its beliefs in a certain way and can restrict the initial beliefs to a set containing the desired ones and incorporate a human into the process who has access to the ground truth of the universe, then you can set a parameter high enough that the agent will not aim to kill everyone" [edit: Michael disputes this last point, see his comment below and my response], which is not at all the same thing. The standard academic failure mode is to make a number of assumptions for tractability that severely lower the relevance of the results (and the more pernicious failure mode is to hide those assumptions).

You'd be right if you said that most AI safety people did not read the paper and come to that conclusion themselves, and even if you said that most weren't even aware of it. Very little of the community has the relevant background for it (and I would like to see a shift in that direction), especially the newcomers that are the targets of syllabi. All that said, I'm confident that you got enough qualified eyes on it that if you had shown what you said in your summary, it would have had an impact similar in scale to what you think is appropriate.

This comment is somewhat of a digression from the main post, but I am concerned that if someone took your comments about the paper at face value, they would come away with an overly negative perception of how the AI safety community engages with academic work.

An EA steelman example of a similar line of thinking is EAs who are incredibly against working for OpenAI or DeepMind at all, because it safety-washes and pushes capabilities anyway. The criticism here is that the way EA views problems means EA will only go towards solutions that are piecemeal rather than transformative. A lot of Marxists felt similarly about welfare reform, in that it quelled the political will for "transformative" change to capitalism.

For instance, they would say a lot of companies are pursuing RLHF in AI safety not because it's the correct way to go but because it's the easiest low-hanging fruit (even if it produces deceptive alignment).

I want to address this point not to argue against the animal activist's point, but rather because it is a bad analogy for that point. The argument against working for safety teams at capabilities orgs or on RLHF is not that they reduce x-risk to an "acceptable" level, causing orgs to give up on further reductions, but rather that they don't reduce x-risk.

Thanks for the communication, and especially giving percentages. Would you be able to either break it down by grants for individuals vs. grants to organizations, or note if the two groups were affected equally? While I appreciate knowing how high the bar has risen in general, I would be particularly interested in how high it has risen for the kinds of applications I might submit in the future.

What are your intuitions regarding length? What's the minimum time needed for a fellowship to be impactful, and at what length does it hit diminishing returns?

Seems worth asking in interviews "I'm concerned about advancing capabilities and shortening timelines, what actions is your organization taking to prevent that", with the caveat that you will be BSed.

Bonus: You can turn down roles explicitly because they're doing capabilities work, which if it becomes a pattern may incentivize them to change their plan.

This comment is object-level, perhaps nitpicky, and I quite like your post on a high level.

Saving a life via, say, malaria nets gets you two benefits:

1. The person saved doesn't die, meeting their preference for continuing to exist

2. The externalities of that person continuing to live, such as foregone grief by their family and community.

I don't think it's too controversial to say that the majority of the benefit from saving a life goes to the person whose life is saved, rather than the people who would be sad that they died. But the IDinsight survey only provides information about the latter.

Consider what would happen if beneficiary surveys found the opposite conclusion in future communities: that certain beneficiaries did not care at all about the death of children under the age of 9. It would be ridiculous and immoral to defer to that decision and not provide any life-saving aid to those children. The reason for this is that the community being surveyed is not the primary beneficiary of aid to their children; their children are, so the community's preferences make up a small fraction of the aid's value. But this also goes the other way: if the surveyed community overweights the lives of their children, that isn't a reason for major deferral, especially if stated preferences contradict revealed preferences, as they often do.

There are lots of advantages to being based in the Bay Area. It seems both easier and higher upside to solve the Berkeley real estate issue than to coordinate a move away from the Bay Area.

I love the idea of a Library of EA! It would be helpful to eventually augment it with auxiliary and meta-information, probably through crowdsourcing among EAs. Each book could also be associated with short and medium summaries of the key arguments and takeaways, and warnings about which sections were later disproven or controversial (or a warning that the whole thing is a partial story/misleading). There's also a lot of overlap and superseding within the books (especially within the rationality and epistemology section), so it would be good to say "If you've read X, you don't need to read Y". It would also be great to have a "Summary of Y for people who have already read X" that just covers the key information.

I do strongly feel that a smaller library would be better. While there are advantages to being comprehensive, a smaller library is better at directing people to the most important books. It is really valuable to say that someone should start with a particular book on a subject, rather than their uninformed choice from a list. Parsimony in recommendations, at least on a personal level, is also important for conveying the importance of the recommendations you do make. It somewhat feels like you weren't confident enough to cut a book that was recommended by some subgroup, even if there were better options available.

There's a Pareto principle at play here, where reading 20% of the books will provide 80% of the value, and a repeated Pareto principle where 4% provides 64% of the value. I think you could genuinely recommend four or five books from this list that provide two-thirds of the EA value of the entire list between them. My picks would be The Most Good You Can Do, The Precipice, Reasons and Persons, and Scout Mindset. Curious what others would pick.
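
For anyone who wants to check the arithmetic, here is a minimal sketch of the repeated Pareto calculation. It assumes the 80/20 split applies again within the top 20%, which is a rule of thumb rather than anything measured for this list, and the function name is just something I made up for illustration.

```python
def repeated_pareto(top_share=0.2, value_share=0.8, rounds=2):
    """Apply an assumed 80/20 split `rounds` times.

    Returns (fraction of books, fraction of total value).
    The 80/20 figures are a heuristic, not data about this library.
    """
    return top_share ** rounds, value_share ** rounds


books, value = repeated_pareto()
print(f"{books:.0%} of the books -> {value:.0%} of the value")
```

With two rounds this gives 4% of the books and 64% of the value, which is where the "roughly two-thirds" figure comes from.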

