
People often make arguments against “trying hard” (working very hard, pushing yourself to the brink, being intensely goal-directed, and so on) by pointing to the risks of burnout or of losing some kind of wholesomeness[1].

But there’s another, very simple argument against it that I have not seen anyone fully make explicit[2], even though I think it’s very important. It goes like this:
 

We face a lot of uncertainty about the sign of our impact.

Therefore, we should be very vigilant about our epistemics to make sure that we are not having a negative impact in expectation.

But trying hard deeply distorts our epistemics - it makes us more prone to motivated reasoning about what we’re doing, and leaves us with less slack to reflect on it.

Therefore, all else being equal, we should try less hard.
 

Crucially, this argument applies much more strongly to people working in “longtermist areas” - a distinction other critiques of trying hard generally don't draw. For example, global health EAs whose terminal value is short-term welfare also face uncertainty about the impact of their actions - but much less (especially about the sign) than people trying to improve the long-term future. So this argument suggests that it’s especially dangerous for longtermists to try very hard.[3][4]

I’ll go through the steps in a little bit more detail.

Uncertainty

Much has been said in EA about cluelessness and crucial considerations, but I’ll highlight a few specific concerns that could make lots of current AI safety work[5] net negative (with no claim to novelty):

  • AI governance interventions are obviously high-variance: bad regulation can easily make things worse, many interventions could increase the risk of great power conflict, increased political polarization around AI could be really bad, more centralization of power increases authoritarianism risk, and so on. And technical work can have flow-through effects on these variables that outweigh its direct effects.[6]
  • Activist work can polarize people against the cause.[7]
  • Human takeover might be worse than AI takeover, and many AI safety interventions effectively attempt to make human takeover more likely relative to AI takeover.[8]
  • If powerful AI will be well-described as doing humanlike roleplaying, trying to control it could make it eventually dislike its “oppressors”, or make it less “mentally healthy” in some way. And even without that assumption, AI safety work could lead to an adversarial relationship with AI in other ways.
  • Future AIs may be moral patients themselves, which would substantially reduce the value of preventing human extinction, and increase the downside risk (including S-risk) of “AI control”-style interventions.
  • Useless work could contribute to “safety-washing” or a false sense of security.
  • There are cultural concerns around scale, professionalization and “mainstreaming”[9] - decreases in integrity and epistemic virtue could be very bad for achieving good outcomes.
  • Capabilities externalities could accelerate AI progress, which many think is bad - people have raised this worry about RLHF historically, and raise it about interpretability and evals nowadays. Most infamously, AI safety activity, to varying extents, contributed to the founding of all three of DeepMind, OpenAI and Anthropic.

These are very difficult to evaluate (and underargued here). My point is not that these are all valid worries - I’m skeptical of several of them. But I don’t think they can be neglected.

Epistemic distortion

In the past, I’ve felt a sense of being overwhelmed at all these considerations, and felt tempted to just avoid thinking about them - but that can’t be the answer. We have to take the uncertainty seriously. Even if I don’t currently have the capacity to go into some kind of deep reflection, I should attempt to make my actions as robust to the uncertainty as I can - for example, by making sure I can course-correct, and by keeping my epistemics in good shape.

Unfortunately, trying very hard conflicts with this - the harder we push toward a goal, the more we bend the evidence to justify it, and the less mental room we have left to step back and question what we're doing.[10]

And I think there’s something stronger, too - in an active inference framework, beliefs and desires are both just expectations about the world. Experientially, this rings true to me - the feeling of frustration at not getting what I want and the feeling of being caught off guard by something I hadn’t even been paying attention to are very similar. It seems deeply hard to distinguish between what we want and what we believe.

This is a bit more speculative, but sometimes I think people don’t fully absorb this point: It’s not psychological, it’s neurological. There’s a sense in which wanting anything distorts us away from pure self-supervised prediction of the world and compresses us internally into living in a specific hypothesis - a “gut-level” vision of the world, that gets upweighted on the level of our base perceptions. So we may not be able to fully adjust for it by only manipulating psychological factors, e.g. consciously trying to be more objective or less selfish.

Conclusion

To be clear, I have a lot of respect for people who try extremely hard - I wouldn’t be able to do it, and I’m often in awe of them.[11] I’m also not trying to make a statement about how strong the update from this should be (I don’t even know these spaces well enough to have a precise sense of how hard various groups of people are trying). Maybe arguments for trying even harder actually outweigh this on the current margins; I wouldn’t know.

But I have the sense that this simple consideration is underrated, and I hope this post can provide a reference point for it and make people take it into account in their personal deliberations.

  1. ^

    Also, some apparently seminal academic philosophy stuff that seems interesting: Moral Saints by Susan Wolf, and Bernard Williams’ work.

  2. ^

    The closest thing to it is probably the strain of thought around “What should you change in response to an ‘emergency’? And AI risk” and “Slack gives you space to notice/reflect on subtle things” - but that still seems centered on the MIRI-style mindset of (strawmanning here) “more AI safety is definitely good and we just need to think hard to find ‘true’ AI safety work” (which isn’t really my mindset). That is, the uncertainty about what to do comes more from their specific inside view that alignment is very hard, rather than from model-agnostic EA/philosophy-style cluelessness. So I think the argument in this post is a more general one that should be convincing to more people (nowadays, there are obviously a lot of non-MIRI-cluster people who are trying extremely hard on AI safety stuff, e.g. this post that I saw recently).

    There’s also the “Maximization is perilous” angle, but that’s more about naive optimization in general, and not about facing huge uncertainty specifically (e.g. it applies equally to global health EAs).

    Also, shoutout to “Slack matters more than any outcome” for a personally inspirational framing on related issues.

  3. ^

    Which is a little counterintuitive, because we usually see the opposite - longtermist EAs being more intense. That said, longtermists also have a bigger moral scope and often more urgency (vis-à-vis AI timelines), so they may reasonably trade off more sharply against other values and personal well-being.

  4. ^

    The same argument also works for virtue and general emotional health, but that’s out of scope for this post.

  5. ^

    Of course, AI safety interventions are extremely heterogeneous - but that just increases the extent to which individual decision-making is crucial (as opposed to deferring to people).

  6. ^

    Holden Karnofsky: “Most things that touch policy at all in any way will move us along that spectrum in one direction or another, so therefore have a high chance of being negative [...]

    And then most things that you can do in AI at all will have some impact on policy. Even just alignment research: policy will be shaped by what we’re seeing from alignment research, how tractable it looks, what the interventions look like.” (h/t Anthony DiGiovanni)

  7. ^

    Holden Karnofsky: “there’s also a lot of micro ways in which you could do harm. Just literally working in safety and being annoying, you might do net harm. You might just talk to the wrong person at the wrong time, get on their nerves. I’ve heard lots of stories of this. Just like, this person does great safety work, but they really annoyed this one person, and that might be the reason we all go extinct” (h/t Anthony DiGiovanni)

  8. ^

    Among other things.

  9. ^

    I associate these with people like Richard Ngo (and here) and Oliver Habryka.

  10. ^

    This is well-known in psychology. Also, Opus 4.8 wrote that sentence.

  11. ^

    I definitely don’t want to imply that if only it weren’t for this argument, I would be an extremely hard worker too, haha.

