434 · Claremont, CA, USA · Joined Sep 2021


I'm Aaron, I've done Uni group organizing at the Claremont Colleges for a bit. Current cause prioritization is AI Alignment.


I agree that persuasion frames are often a bad way to think about community building.

I also agree that community members should feel valuable, much in the way that I want everybody in the world to feel valued/loved.

I probably disagree about the implications, as they are affected by some other factors. One intuition that helps me is to think about the donors who donate toward community building efforts. I expect that these donors are mostly people who care about preventing kids from dying of malaria, and many donors also donate lots of money towards charities that can save a kid’s life for $5000. They are, I assume, donating toward community building efforts because they think these efforts are on average a better deal, costing less than $5000 for a life saved in expectation.

For mental health reasons, I don’t think people should generally hold themselves to this bar and be like “is my expected impact higher than where money spent on me would go otherwise?” But I think when you’re using other people’s altruistic money to community build, you should definitely be making trade-offs, crunching numbers, and otherwise aiming to maximize the impact from those dollars.

Furthermore, I would be extremely worried if I learned that community builders aren’t attempting to quantify their impact or think about these things carefully (noting that I have found it very difficult to quantify impact here). Community building is often indistinguishable (at least from the outside) from “spending money on ourselves” and I think it’s reasonable to have a super high bar for doing this in the name of altruism.

Noting again that I think it’s hard to balance mental health with the wacky, terrible state of the world where a few thousand dollars can save a life. Making a distinction between personal dollars and altruistic dollars can perhaps help folks preserve their mental health while thinking rigorously about how to help others the most. Interesting related ideas:

https://www.lesswrong.com/posts/3p3CYauiX8oLjmwRF/purchase-fuzzies-and-utilons-separately

https://forum.effectivealtruism.org/posts/zu28unKfTHoxRWpGn/you-have-more-than-one-goal-and-that-s-fine

Sorry about the name mistake. Thanks for the reply. I'm somewhat pessimistic about us two making progress on our disagreements here because it seems to me like we're very confused about basic concepts related to what we're talking about. But I will think about this and maybe give a more thorough answer later. 

Edit: corrected name, some typos and word clarity fixed

Overall I found this post hard to read and I spent far too long trying to understand it. I suspect the author is about as confused about key concepts as I am. David, thanks for writing this. I am glad to see writing on this topic and I think some of your points are gesturing in a useful and important direction. Below are some tentative thoughts about the arguments. For each core argument I first try to summarize your claim and then respond; hopefully this makes it clearer where we actually disagree vs. where I am misunderstanding.

High level: The author makes a claim that the risk of deception arising is <1%, but they don’t provide numbers elsewhere. They argue that 3 conditions must all be satisfied for deception but that none of them is likely. How likely each one is affects that 1% number. My evaluation of the arguments (below) is that for each of these conjunctive conditions my rough probabilities (where higher means deception more likely) are: (totally unsure, can’t reason about it) * (unsure but maybe low) * (high), yielding an unclear but probably >1% probability.

  • Key claims from post:
    • Why I expect an understanding of the base objective to happen before goal-directedness: “Models that are only pre-trained almost certainly don’t have consequentialist goals beyond the trivial next token prediction. Because a pre-trained model will already have high-level representations of key base goal concepts, all it will have to do to become aligned is to point them.” Roughly the argument is that pretraining on tons of data will give a good idea of the base objective but will not cause goal-directed behavior, and then we can just make the model do the base objective thing.
      • My take: It’s not obvious what the goals of pre-trained language models are or what the goals of RLHFed models are; plausibly they both have a goal like “minimize loss on the next token” but the RLHF one is doing that on a different distribution. I am generally confused about what it means for a language model to have goals. Overall I’m just so unsure about this that I can’t reasonably put a probability on models developing an understanding of the base objective before goal-directedness, but I wouldn’t confidently say this number is high or low. An example of the probability being high is if goal-directedness only emerges in response to RL (this seems unlikely); an example of the probability being low would be if models undergoing pre-training become goal-directed around predicting next tokens early on in training. Insofar as David thinks this probability is high, I do not understand why.
    • Why I expect an understanding of the base objective to happen significantly before optimizing across episodes/long-term goal horizons: You only get long-term goals via gradient descent finding them, but this is unlikely to happen because gradient descent operates on a hyper-local horizon. Training runs + oversight will be quite long periods, so even if gradient descent moves you to “slightly-long-term goals,” these won’t perform well.
      • My take: This argument makes the most sense to me, or at least I think we can reason about it more easily than the others. Pointing in the other direction, phase changes seem somewhat likely here; humans generally don’t care about outcomes in the world 100 or 1,000 years out, and then (sometimes) they get sold on longtermism and suddenly care about 10,000 years out. "On what time-span do I care about my goals" is plausibly a measure that will be discontinuous. Perhaps this looks like the goal “where am I minimizing loss” jumping horizons from “next token” to “this sentence/response” and perhaps “all of my performance ever” or “all of the performance of models similar to myself ever” or “all of the performance of models similar to myself in the multiverse”. I’m also unconfident about how likely this is, including still being confused about having goals or understanding base objectives, but I would not be surprised if the author turned out to be right that models understand the base objective before doing long-term goal optimization. Unfortunately I expect that competitive pressures will lead AI developers to want their AIs to pursue long-term objectives, and that might mess things up.
    • Why I expect understanding of the base objective to happen significantly before situational awareness: The deception-relevant situational awareness, which involves understanding one’s future gradient updates, is unlikely to be selected for by gradient descent. This is because, assuming the model has a goal acting only in the current training batch, reasoning about future gradient updates is unlikely to improve short-term reward by much. Models might also get situational awareness via human generated prompts, but this is unlikely because we won’t want them to have this awareness of their own gradient updates.
      • My take: I think that GPT-3 probably has the relevant situational awareness in that its world model understands gradient descent. With the current approach to LLMs it seems this just comes pre-loaded into the models. I am not saying GPT-3 is dangerous due to this information, I am saying that the factual information relevant to situational awareness gets baked into pre-training at some point by default, and this could very well be before the model has an understanding of its base objective.
      • David writes “For a model to have this property, it would need goal horizons that extended at least beyond the next gradient update.” I’m not sure I buy this, but let’s assume it for now. We care about how likely all 3 of these required-for-deception characteristics are conjunctively. If we’re saying that the 3rd requires the 2nd then we can just condition on the 2nd: Given that the model gets long-term planning before an understanding of the base objective, how likely is dangerous situational awareness (before understanding of base objective)? Seems pretty likely now. For the question “how likely are we to get deception-relevant situational awareness before base objective?” I’m probably like 90% conditioning on long-term goals and still pretty high without conditioning. Yet again I am confused by what understanding the base objective means here.
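
To make the arithmetic in my high-level estimate concrete, here is a minimal sketch. The ranges are invented placeholders for "totally unsure", "unsure but maybe low", and "high"; they are not numbers from David's post or careful estimates of mine, and the function name is hypothetical. Each factor is read as conditional on the previous ones (the chain rule), which is the conditioning move described above.

```python
from math import prod


def conjunctive_probability(conditionals):
    """Chain rule: P(A and B and C) = P(A) * P(B | A) * P(C | A, B)."""
    for p in conditionals:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return prod(conditionals)


# Hypothetical (low, high) ranges, purely illustrative.
before_goal_directedness = (0.10, 0.90)  # "totally unsure"
before_long_horizons = (0.05, 0.40)      # "unsure but maybe low"
before_situational_awareness = (0.80, 0.95)  # "high"

ranges = [
    before_goal_directedness,
    before_long_horizons,
    before_situational_awareness,
]

# Multiply the pessimistic ends together, then the optimistic ends.
low = conjunctive_probability([r[0] for r in ranges])
high = conjunctive_probability([r[1] for r in ranges])

print(f"deception probability between {low:.3f} and {high:.3f}")
```

With these placeholder ranges the product runs from well under 1% to well over it, which is the sense in which the overall number is unclear: one very uncertain factor dominates the result.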

FWIW I often vote on posts at the top without scrolling because I listened to the post via the Nonlinear podcast library or read it on a platform that wasn't logged in. Not all that important of a consideration, but worth being aware of. 

Here are my notes which might not be easier to understand, but they are shorter and capture the key ideas:

  • Uneasiness about chains of reasoning with imperfect concepts
    • Uneasy about conjunctiveness: It’s not clear how conjunctive AI doom is (AI doom being conjunctive would mean that Thing A and Thing B and Thing C all have to happen or be true in order for AI doom; this is opposed to being disjunctive where either A, or B, or C would be sufficient for AI Doom), and Nate Soares’s response to Carlsmith’s power-seeking AI report is not a silver bullet; there is social pressure in some places to just accept that Carlsmith’s report uses a biased methodology and to move on. But obviously there’s some element of conjunctiveness that has to be dealt with.
    • Don’t trust the concepts: a lot of the early AI Risk discussions came before Deep Learning. Some of the concepts should port over to near-term-likely AI systems, but not all of them (e.g., Alien values, Maximalist desire for world domination)
      • Uneasiness about in-the-limit reasoning: Many arguments go something like this: an arbitrarily intelligent AI will adopt instrumental power seeking tendencies and this will be very bad for humanity; progress is pushing toward that point, so that’s a big deal. Often this line of reasoning assumes we hit in-the-limit cases around or very soon after we hit greater than human intelligence; this may not be the case.
      • AGI, so what?: Thinking AGI will be transformative doesn’t mean maximally transformative. e.g., the Industrial Revolution was transformative but not maximally so, because people adapted to it
    • I don’t trust chains of reasoning with imperfect concepts: When your concepts are not very clearly defined/understood, it is quite difficult to accurately use them in complex chains of reasoning.
  • Uneasiness about selection effects at the level of arguments
    • “there is a small but intelligent community of people who have spent significant time producing some convincing arguments about AGI, but no community which has spent the same amount of effort looking for arguments against”
    • The people who don’t believe the initial arguments don’t engage with the community or with further arguments. If you look at the reference class “people who have engaged with this argument for more than 1 hour” and see that they all worry about AI risk, you might conclude that the argument is compelling. However, you are ignoring the major selection effects in who engages with the argument for an hour. Many other ideological groups have a similar dynamic: the class “people who have read the new testament” is full of people who believe in the Christian god, which might lead you to believe that the balance of evidence is in their favor — but of course, that class of people is highly selected for those who already believe in god or are receptive to such a belief.
    • “the strongest case for scepticism is unlikely to be promulgated. If you could pin folks bouncing off down to explain their scepticism, their arguments probably won't be that strong/have good rebuttals from the AI risk crowd. But if you could force them to spend years working on their arguments, maybe their case would be much more competitive with proponent SOTA”
    • Ideally we want to sum all the evidence for and all the evidence against and compare. What happens instead is skeptics come with 20 evidence and we shoot them down with 50 evidence for AI risk. In reality there could be 100 evidence against and only 50 evidence for, and we would not know this if we didn’t have really-well-informed skeptics or we weren’t summing their arguments over time.
    • “It is interesting that when people move to the Bay area, this is often very “helpful” for them in terms of updating towards higher AI risk. I think that this is a sign that a bunch of social fuckery is going on.”
      • “More specifically, I think that “if I isolate people from their normal context, they are more likely to agree with my idiosyncratic beliefs” is a mechanisms that works for many types of beliefs, not just true ones. And more generally, I think that “AI doom is near” and associated beliefs are a memeplex, and I am inclined to discount their specifics.”
  • Miscellanea
    • Difference between in-argument reasoning and all-things-considered reasoning: Often the gung-ho people don’t make this distinction.
    • Methodological uncertainty: forecasting is hard
    • Uncertainty about unknown unknowns: Most of the unknown unknowns seem likely to delay AGI, things like Covid and nuclear war
    • Updating on virtue: You can update based on how morally or epistemically virtuous somebody is. Historically, some of those pushing AI Risk were doing so not for the goal of truth seeking but for the goal of convincing people
    • Industry vs AI safety community: Those in industry seem to be influenced somewhat by AI Safety, so it is hard to isolate what they think
  • Conclusion
    • Main classes of things pointed out: Distrust of reasoning chains using fuzzy concepts, Distrust of selection effects at the level of arguments, Distrust of community dynamics
    • Now in a position where it may be hard to update based on other people’s object-level arguments
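
The conjunctive versus disjunctive distinction in the first group of notes matters a lot numerically. Here is a minimal sketch, assuming three steps A, B, and C that are independent and each 50% likely; both the independence assumption and the 50% figures are made up for illustration and do not come from the post or from Carlsmith's report.

```python
from math import prod


def p_conjunctive(step_probs):
    """Doom needs every step: P(A and B and C), assuming independent steps."""
    return prod(step_probs)


def p_disjunctive(step_probs):
    """Doom needs any one step: P(A or B or C), assuming independent steps."""
    return 1 - prod(1 - p for p in step_probs)


steps = [0.5, 0.5, 0.5]  # hypothetical per-step probabilities

conj = p_conjunctive(steps)
disj = p_disjunctive(steps)

print(f"conjunctive: {conj}, disjunctive: {disj}")
```

The same three inputs give very different headline numbers depending on which structure you assume, so an argument over how conjunctive the story is can matter as much as an argument over any single step.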

This evidence doesn't update me very much. 

I would prefer an EA Forum without your critical writing on it, because I think your critical writing has similar problems to this post...

I interpret this quote to be saying, "this style of criticism — which seems to lack a ToC and especially fails to engage with the cruxes its critics have, which feels much closer to shouting into the void than making progress on existing disagreements — is bad for the forum discourse by my lights. And it's fine for me to dissuade people from writing content which hurts discourse"

Buck's top-level comment is gesturing at a "How to productively criticize EA via a forum post, according to Buck", and I think it's noble to explain this to somebody even if you don't think their proposals are good. I think the discourse around the EA community and criticisms would be significantly better if everybody read Buck's top level comment, and I plan on making it the reference I send to people on the topic. 

Personally I disagree with many of the proposals in this post and I also wish the people writing it had a better ToC, especially one that helps make progress on the disagreement, e.g., by commissioning a research project to better understand a relevant consideration, or by steelmanning existing positions held by people like me, with the intent to identify the best arguments for both sides. 

I expect a project like this is not worth the cost. I imagine doing this well would require dozens of hours of interviews with people who are more senior in the EA movement, and I think many of those people’s time is often quite valuable.

Regarding the pros you mention:

  1. I’m not convinced that building more EA ethos/identity based around shared history is a good thing. I expect this would make it even harder to pivot to new things or treat EA as a question. It also wouldn’t be unifying for many folks (e.g. those who have been thinking about AI safety for a decade or who don’t buy longtermism). According to me, the bulk of people who call themselves EAs, like most groups, are too slow to update on new arguments and information, and I would expect that having a written and agreed-upon history would not help with this. Then again, my point might be made better if I could reference common historical cases of what I mean lol

  2. I don’t see how this helps build trust.

  3. I don’t see how having a written history makes the movement less likely to die. I also don’t know what it looks like for the EA movement to die or how bad this actually is; the EA movement is largely instrumental toward other things I care about: reducing suffering, increasing the chances of good stuff in the universe, my and my friends’ happiness to a lesser extent.

  4. This does seem like a value add to me, though the project I’m imagining only does a medium job at this given its goal is not “chronology of mistakes and missteps”. Maybe worth checking out https://www.openphilanthropy.org/research/some-case-studies-in-early-field-growth/

With ideas like this I sometimes ask myself “why hasn’t somebody done this yet”. Some reasons that come to mind: people are too busy doing other things they think are important; it might come across as self-aggrandizing; it’s unclear who is going to read it, and the ways I expect it to get read are weird and indoctrination-y (“welcome to the club, here’s a book about our history”, as opposed to “oh, you want to do lots of good, here are some ideas that might be useful”); it doesn’t directly improve the world and the indirect path to impact is shakier than other meta things.

I’m not saying this is necessarily a bad idea. But so far I don’t see strong reasons to do this over the many other things Open Phil/CEA/Kelsey Piper/interviewees could be doing.

I like this comment and think it answers the question at the right level of analysis.

To try and summarize it back: EA’s big assumption is that you should purchase utilons, rather than fuzzies, with charity. This is very different from how many people think about the world and their relationship to charity. To claim that somebody’s way of “doing good” is not as good as they think is often interpreted by them as an attack on their character and identity, thus met with emotional defensiveness and counterattack.

EA ideas aim to change how people act and think (and, for some, core parts of their identity); such pressure is by default met with resistance.
