1928 karma · Joined Jul 2016 · Working (0-5 years)


  • Completed the Introductory EA Virtual Program
  • Completed the Precipice Reading Group
  • Completed the AGI Safety Fundamentals Virtual Program
  • Attended an EA Global conference
  • Attended an EAGx conference
  • Attended more than three meetings with a local EA group
  • Received career coaching from 80,000 Hours



Great post! I've written a paper along similar lines for the SERI Conference in April 2023 here, titled "AI Alignment Is Not Enough to Make the Future Go Well." Here is the abstract:

AI alignment is commonly explained as aligning advanced AI systems with human values. Especially when combined with the idea that AI systems aim to optimize their world based on their goals, this has led to the belief that solving the problem of AI alignment will pave the way for an excellent future. However, this common definition of AI alignment is somewhat idealistic and misleading, as the majority of alignment research for cutting-edge systems is focused on aligning AI with task preferences (training AIs to solve user-provided tasks in a helpful manner), as well as reducing the risk that the AI would have the goal of causing catastrophe.

We can conceptualize three different targets of alignment: alignment to task preferences, human values, or idealized values.

Extrapolating from the deployment of advanced systems such as GPT-4 and from studying economic incentives, we can expect AIs aligned with task preferences to be the dominant form of aligned AIs by default.

Aligning AI to task preferences will not by itself solve major problems for the long-term future. Among other problems, these include moral progress, existential security, wild animal suffering, the well-being of digital minds, risks of catastrophic conflict, and optimizing for ideal values. Additional efforts are necessary to motivate society to have the capacity and will to solve these problems.

I don't necessarily think of humans as maximizing economic consumption, but I argue that power-seeking entities (e.g., some corporations or hegemonic governments using AIs) will have predominant influence, and these will not have altruistic goals to optimize for impartial value, by default.


Congrats on launching GWWC Local Groups! Community building infrastructure can be hard to set up, so I appreciate the work here.

It would be bad to create significant public pressure for a pause through advocacy, because this would cause relevant actors (particularly AGI labs) to spend their effort on looking good to the public, rather than doing what is actually good.

I think I can reasonably model the safety teams at AGI labs as genuinely trying to do good. But I don't know that the AGI labs as organizations are best modeled as trying to do good, rather than optimizing for objectives like outperforming competitors, attracting investment, and advancing exciting capabilities – subject to some safety-related concerns from leadership. That said, public pressure could manifest itself in a variety of ways, some of which might work toward more or less productive goals.

I agree that conditional pauses are better than unconditional pauses, due to pragmatic factors. But I worry about AGI labs specification gaming their way through dangerous-capability evaluations, using brittle band-aid fixes that don't meaningfully contribute to safety.

I think GiveWell shouldn’t be modeled as wanting to recommend organizations that save as many current lives as possible. I think a more accurate way to model them is “GiveWell recommends organizations that are [within the Overton Window]/[have very sound data to back impact estimates] that save as many current lives as possible.”

This is correct if you look at GiveWell's criteria for evaluating donation opportunities. GiveWell’s highly publicized claim “We search for the charities that save or improve lives the most per dollar” is somewhat misleading given that they only consider organizations with RCT-style evidence backing their effectiveness.

Upvoted. This is what longtermism is already doing (relying heavily on non-quantitative, non-objective evidence) and the approach can make sense for more standard local causes as well.

What do you think are the main reasons behind wanting to deploy your own model instead of using an API? Some reasons I can think of:

For anyone interested, the Center for AI Safety is offering up to $500,000 in prizes for benchmark ideas: SafeBench (mlsafety.org)

Just so I understand, are all four of these quotes arguing against preference utilitarianism?
