All of David Johnston's Comments + Replies

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Independently of my other comment, I think you might be under-utilising your premises, in particular the "racing forward" assumption. Why would AI companies be racing to make the first AGI instead of the best AGI? The most likely answer seems to be: they expect large first-mover advantages that outweigh the benefits of taking more time to build a better system.

The fact that AI companies believe this is evidence that the first AGI past some capability threshold actually does have a particularly large impact. Furthermore, the fact that AI companies are chasing the first-mover advantage might also indicate that they're investing specifically in building the kind of AGI that captures this advantage.

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

I think this is a good piece, and I’m glad you wrote it. One disagreement I have is that I think the behaviour of a deep learning agent trained by HFDT is less predictable than this piece suggests. My guesses would be something like:

  1. 30%: the behaviour of HFDT-AGI is basically fine/instability is easy to manage
  2. 30%: the behaviour of HFDT-AGI is unstable in a hard-to-manage way and includes seriously bad but not catastrophic behaviours
  3. 30%: the behaviour of HFDT-AGI is unstable and includes catastrophic behaviours
  4. 10%: the behaviour of HFDT-AGI converges
...
Probability distributions of Cost-Effectiveness can be misleading

Yeah, I was mentally substituting "effect" for "good" and "cost" for "bad"

Probability distributions of Cost-Effectiveness can be misleading

Ok, so say you have a fixed budget. Then you want to maximise mean(total effect), which is equal to mean(budget/cost * unit effect)

... I agree.

Also, infinite expected values come from having some chance of doing the thing an infinite number of times, where the problem is clearly the assumption that the effect is equal to budget/cost * unit effect when this is actually true only in the limit of small numbers of additional interventions.

Also, Lorenzo's proposal is ok when cost and effect are independent, while the error he identifies is still an error in this case.
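
To make the fixed-budget point concrete, here is a minimal sketch in Python. The two-point distributions for cost and effect and the budget of 100 are made-up illustrative numbers, not anything from the original post, and cost and effect are assumed independent. It compares mean(budget/cost * unit effect) with the ratio-of-means shortcut budget * mean(effect) / mean(cost).

```python
from itertools import product

# Hypothetical discrete distributions as (value, probability) pairs.
COSTS = [(1.0, 0.5), (4.0, 0.5)]    # cost per unit of the intervention
EFFECTS = [(2.0, 0.5), (6.0, 0.5)]  # effect per unit of the intervention
BUDGET = 100.0                      # assumed fixed budget


def mean(dist):
    """Expected value of a discrete distribution."""
    return sum(value * prob for value, prob in dist)


def mean_total_effect(budget, costs, effects):
    """E[budget / cost * effect], assuming cost and effect are independent."""
    return sum(
        (budget / c) * e * pc * pe
        for (c, pc), (e, pe) in product(costs, effects)
    )


def ratio_of_means(budget, costs, effects):
    """The shortcut budget * E[effect] / E[cost]."""
    return budget * mean(effects) / mean(costs)


exact = mean_total_effect(BUDGET, COSTS, EFFECTS)
shortcut = ratio_of_means(BUDGET, COSTS, EFFECTS)
print(exact, shortcut)
```

With these numbers the two quantities differ even though cost and effect are independent, because E[1/cost] is not 1/E[cost]. The sketch also relies on total effect scaling linearly with budget/cost, which is exactly the assumption that stops holding when the intervention is repeated many times.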

2 · Vasco Grilo · 21d
The below is a reply to a previous version of the above comment. I do not think we want to maximise mean("effect" - "cost"). Note "effect" and "cost" have different units, so they cannot be combined in that way. "Effect" refers to the outcome, whereas "cost" corresponds to the amount of resources we have to spend. One might want to include "-cost" to account for the counterfactual, but this is supposed to be included in "effect" = "factual effect" - "counterfactual effect". We want to maximise mean("effect") for "cost" <= "maximum cost" (see this comment).
How would a language model become goal-directed?

Is the motivation for 3 mainly something like "predictive performance and consequentialist behaviour are correlated in many measures over very large sets of algorithms", or is there a more concrete story about how this behaviour emerges from current AI paradigms?

Here is my story, I'm not sure if this is what you are referring to (it sounds like it probably is). Any prediction algorithm faces many internal tradeoffs about e.g. what to spend time thinking about and what to store in memory to reference in the future. An algorithm which makes those choices well across many different inputs will tend to do better, and in the limit I expect it to be possible to do better more robustly by making some of those choices in a consequentialist way (i.e. predicting the consequences of different possible options) rather than having all of them baked in by gradient descent or produced by simpler heuristics. If systems with consequentialist reasoning are able to make better predictions, then gradient descent will tend to select them. Of course all these lines are blurry. But I think that systems that are "consequentialist" in this sense will eventually tend to exhibit the failure modes we are concerned about, including (eventually) deceptive alignment. I think making this story more concrete would involve specifying particular examples of consequentialist cognition, describing how they are implemented in a given neural network architecture, and describing the trajectory by which gradient descent learns them on a given dataset. I think specifying these details can be quite involved both because they are likely to involve literally billions of separate pieces of machinery functioning together, and because designing such mechanisms is difficult (which is why we delegate it to SGD). But I do think we can fill them in well enough to verify that this kind of thing can happen in principle (even if we can't fill them in in a way that is realistic, given that we can't design performant trillion parameter models by hand).
On Deference and Yudkowsky's AI Risk Estimates

I agree with many of the comments here that this is overall a bit unfair, and there are good reasons to take Yudkowsky seriously even if you don't automatically accept his self-expressed level of confidence.

My main criticism of Yudkowsky is that he has many innovative/somewhat compelling ideas, but even with many years and a research institution their evolution has been unsatisfying. Many of them are still imprecise, and some of those that are precise(ish) are not satisfactory (e.g. the orthogonality thesis, mesa-optimizers). Furthermore, he still doesn't seem very interested in improving this situation.

On Deference and Yudkowsky's AI Risk Estimates

FWIW I think "it was 20 years ago" is a good reason not to take these failed predictions too seriously, and "he has disavowed these predictions after seeing they were false" is a bad reason to take them unseriously.

Don't Over-Optimize Things

Lately, I tend to think of this as a distinction between the "proxy optimization" algorithm and the "optimality" of the actual plan. The algorithm: specify a proxy reward and a proxy set of plans, and search for the best one. You could call this "proxy optimization".

The results: whatever actually happens, and how good it actually is. There's not really a verb associated with this - you can't just make something as good as it can possibly be (not even "in expectation" - you can only optimize proxies in expectation!). But it still seems like there's a l...

Which possible AI impacts should receive the most additional attention?

Impact: AI causes the extinction of people in the next 1000 years.

Why is this a priority? Extinction events are very bad from the point of view of people who want the future to be big and utopian. The 1000-year time frame (I think) is long enough to accommodate most timelines for very advanced AI, but short enough that we don't have to worry about "a butterfly flaps its wings and 10 million years later everyone is dead" type scenarios. While it is speculative, it does not seem reasonable given what we know right now to assign this event vanishingly low pro...

Longtermist slogans that need to be retired

How about this:
 A) Take top N interventions ranked by putting all effort into far future effects
 B) Take top N interventions ranked by putting more effort into near than far future effects

(you can use whatever method you like to prioritise the interventions you investigate). Then for most measures of value, group (A) will have much higher expected value than group (B). Hence "most of the expected value is in the far future".

Your initial comment was about slogan2 ("What matters most about our actions is their very long term effects"). I continue to think that this is not a useful framing. Some of our actions have big and predictable long-term effects, and we should focus on those. But most of our actions don't have predictable long-term effects, so we shouldn't be making generic statements about the long-term effects of an arbitrary action. Re slogan1 ("Most expected value is in the far future"), it sounds like you're interpreting it as being about the marginal EV of an action. I agree that it's possible for the top long-term focused interventions to currently have a higher marginal EV than near-term focused interventions. But as these interventions are funded, I expect their marginal EV to decline (i.e. diminishing returns), possibly to a value lower than the marginal EV of near-term focused interventions.
Longtermist slogans that need to be retired

"What matters most about our actions is their very long term effects."

I think my takeaway from this slogan is: given limited evaluation capacity + some actions under consideration, a substantial proportion of this capacity should be devoted to thinking about long term effects.

It could be false: maybe it's easy to conclude that nothing important can be known about the long term effects. However, I don't think this has been demonstrated yet.

I would flip it around: we should seek out actions that have predictable long-term effects. So, instead of starting from the set of all possible actions and estimating the long-term effects for each one (an impossible task), we would start by restricting the action space to those with predictable long-term effects.
Replicating and extending the grabby aliens model

I haven't fully grokked this work yet, but I really appreciate the level of detail you've explained it in.

Replicating and extending the grabby aliens model

It seems plausible a significant fraction of ICs will choose to become GCs. Since matter and energy are likely to be instrumentally useful to most ICs, expanding to control as much volume as they can (thus becoming a GC) is likely to be desirable to many ICs with diverse aims.

Also, if an IC is a mixture of grabby and non-grabby elements, it will become a GC essentially immediately.

3 · Tristan Cook · 3mo
+1 I think Hanson et al. mention something like this too
Milan Griffes on EA blindspots

Now I wish there were numbers in the OP to make referencing easier

Edit: thanks

9 · Peter Wildeford · 5mo
Looks like the OP added numbers. Thanks OP!
[Cross-post] A nuclear war forecast is not a coin flip

You've gotten me interested in looking at total extinction risk as a follow up, are you interested in working together on it?

Sorry, I wouldn't have the time, since it's outside my focus at work, animal welfare, and I already have some other things I want to work on outside of my job.
On expected utility, part 2: Why it can be OK to predictably lose

From the title, I thought this was going to be a defense of being money pumped!

[Cross-post] A nuclear war forecast is not a coin flip

Do you know of work on this off the top of your head? I know Ord has his estimate of 6% extinction in the next 100 years, but I don't know of attempts to extrapolate this or other estimates.

I think for long timescales, we wouldn't want to use an exchangeable model, because the "underlying risk" isn't stationary

I'm not sure I've seen any models where the discrepancy would have been large. I think most models with discount rates I've seen in EA use fixed constant yearly discount rates like "coin flips" (and sometimes not applied like probabilities at all, just actual multipliers on value, which can be misleading if marginal returns to additional resources are decreasing), although may do sensitivity analysis to the discount rate with very low minimum discount rates, so the bounds could still be valid. Some examples: 1. [] 2. [] (from [] ) 3. [] But I guess if we're being really humble, shouldn't we assign some positive probability to our descendants lasting forever (no heat death, etc., and no just as good or better other civilization taking our place in our light cone if we go extinct), so the expected future is effectively infinite in duration? I don't think most models allow for this. (There are also other potential infinities, like from acausal influence in a spatially unbounded universe and the many worlds interpretation of quantum mechanics, so duration might not be the most important one.)
This doesn't change the substance of your point, but Ord estimates a one-in-six chance of an existential catastrophe this century. Concerning extrapolation of this particular estimate, I think it's much clearer here that this would be incorrect, since the bulk of the risk in Toby's breakdown comes from AI, which is a step risk rather than a state risk.
[Cross-post] A nuclear war forecast is not a coin flip
  1. If you think there's an exchangeable model underlying someone else's long-run prediction, I'm not sure of a good way to try to figure it out. Off the top of my head, you could do something like this:

     def model(a,b,conc_expert,expert_forecast):    
        # forecasted distribution over annual probability of nuclear war    
        prior_rate = numpyro.sample('rate',dist.Beta(a,b))
        with numpyro.plate('w',1000):        
            war = numpyro.sample('war',dist.Bernoulli(pri
...
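
As a separate illustration of why the "coin flip" and exchangeable readings of a forecast come apart over long horizons, here is a small standard-library sketch. It is not a completion of the truncated numpyro model above. It assumes a Beta(a, b) distribution over the annual probability of war, and the values a = 1, b = 99 (mean annual rate 1%) are hypothetical. Under that assumption, P(no war in n years) = E[(1 - r)^n], which equals the product of (b + i) / (a + b + i) for i = 0, ..., n - 1.

```python
def survival_coin_flip(p, years):
    """P(no war in `years` years) if every year is an independent flip with known rate p."""
    return (1.0 - p) ** years


def survival_exchangeable(a, b, years):
    """P(no war in `years` years) when the unknown annual rate is Beta(a, b) distributed.

    Uses E[(1 - r)^n] = prod_{i=0}^{n-1} (b + i) / (a + b + i).
    """
    prob = 1.0
    for i in range(years):
        prob *= (b + i) / (a + b + i)
    return prob


A, B = 1.0, 99.0  # hypothetical prior with mean annual rate A / (A + B) = 1%
YEARS = 100

fixed = survival_coin_flip(A / (A + B), YEARS)
mixed = survival_exchangeable(A, B, YEARS)
print(f"coin flip: {fixed:.3f}, exchangeable: {mixed:.3f}")
```

Both models agree on the one-year probability, but the exchangeable model gives a higher chance of surviving the full century, because long war-free runs are evidence that the underlying rate is low. Neither model addresses the non-stationarity concern raised above.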
Early-warning Forecasting Center: What it is, and why it'd be cool

However, I also expect them to be a large waste of resources (especially forecasting time), compared to idealized setups

Incidentally, I was just working on this post about efficiency of forecast markets vs consultancies.

I share your intuition that liquid/large scale markets may solve some of the inefficiency problems - forecasters would have a much better idea about what kinds of questions they can get paid for, specialise in particular kinds of questions, work in teams that can together answer broader ranges of questions and so forth. However, there's a k...

Potentially great ways forecasting can improve the longterm future

Are you at ~65% that marginal scientific acceleration is net negative, or is most of your weight on costs = benefits?

~65% that marginal scientific acceleration is net negative
On infinite ethics

"And suppose, per various respectable cosmologies, that the universe is filled with an infinite number of people very much like you"

I'm not familiar with these cosmologies: do they also say that the universe is filled with an equally large number of people quite like me except they make the opposite decision whenever considering a donation?

Bryan Caplan on EA groups

Someone should offer him a bet! Best EA vs best lib in a debate or something.

Idk what the debate formats are in America but I personally hated it because it isn't data-driven, you can't use research and stats to back your claims. Also slightly biases against epistemic humility.
Who is working on structured crowd forecasting?

Just to clarify: my understanding was that the MTAIR graph will eventually be extended with conditional probability estimates, so the whole model will define a probability distribution with conditional independences compatible with the underlying graph. This would make it a Bayesian Network in my eyes. However, it seems that we disagree on at least one thing here!

Is the Analytica approach more robust to missing arrows not corresponding to conditional independences than a Bayesian network? If so, I'd be curious to hear a simplified explanation for why this is so.

Analytica allows you to define algebraic or other relationships between nodes, which can be real-valued, and have more complex relationships - but it can't propagate evidence without explicit directional dependence. That allows more flexibility - the nodes don't need to be conditionally independent, for example, and can be indexed to different viewpoints. This also means that it can't easily be used for lots of things that we use BNs for, since the algorithms used are really limited.
Who is working on structured crowd forecasting?

Thanks. I'm not working on anything at the moment, just curious about what has been done in the area. Did you consider other approaches to mapping out key hypotheses and cruxes for MTAIR? Do you have an idea of what advantages and disadvantages you expect the big Bayesian network to have compared to other approaches? Have you found it to be better or worse in any particular ways?

A particular question I'm curious about: have you found the big Bayesian network approach is helpful in terms of decomposing the problem into sub-problems and efficiently allocating effort to subproblems?

We looked at the options, and chose Analytica largely because of a paucity of other good options, and my familiarity with it. Having spoken with them in the past, and then specifically for this project, I also think that the fact that the company which makes it is happy to be supportive of our work is a big potential advantage.

Why Not Large BNs?

  1. BNs are expensive to elicit. (You need $\text{output} \cdot \prod_i \text{input}_i$ values elicited per node, where $\text{input}_i$ and $\text{output}$ are the number of discrete levels of each.) They also have relatively little flexibility due to needing those discrete buckets. There are clever ways around this, but they are complicated, and outside the specific expertise of our group.
  2. BNs assume that every node is a single value, which may or may not make sense. Most software for BNs doesn't have great ways to do visual communication of clusters, and AFAIK, none have a way to leave parts undefined until later. You also need to strongly assert conditional independence, and if that assumption is broken, you need to redo large parts of the network.
  3. The way we actually did this shares most of the advantages in terms of decomposition and splitting subproblems, though removing duplication/overlap is still tricky.
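
To give a feel for the elicitation cost mentioned above, here is a minimal Python sketch of the count for a single node: the number of output levels times the product of the parents' level counts. The function name `cpt_entries` and the example level counts are hypothetical, chosen only for illustration.

```python
from math import prod


def cpt_entries(output_levels, input_levels):
    """Number of values to elicit for one node's conditional probability table.

    output_levels: number of discrete levels of the node itself.
    input_levels: list with the number of discrete levels of each parent.
    """
    return output_levels * prod(input_levels)


# A 3-level node with four 3-level parents (illustrative numbers).
four_parents = cpt_entries(3, [3, 3, 3, 3])
# Adding a fifth 3-level parent triples the count.
five_parents = cpt_entries(3, [3, 3, 3, 3, 3])
print(four_parents, five_parents)
```

The count grows multiplicatively with each extra parent, which is one way to see why a large, densely connected network is expensive to elicit by hand.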
Linch's Shortform

The world's first slightly superhuman AI might be only slightly superhuman at AI alignment. Thus if creating it was a suicidal act by the world's leading AI researchers, it might be suicidal in exactly the same way. On the other hand, if it has a good grasp of alignment then its creators might also have a good grasp of alignment.

In the first scenario (but not the second!), creating more capable but not fully aligned descendants seems like it must be a stable behaviour of intelligent agents, as by assumption

  1. behaviour of descendants is only weakly control
...
Simple comparison polling to create utility functions

I would be interested in this same concept but framed so as to compare personal utility instead of impersonal utility, because I feel like I'm trying to estimate other people's values for personal utility and aggregate them in order to get an idea of impersonal utility. It seems tricky, though:

 - How many {50} year old {friends/family members/strangers} would you save vs {5} year old {friends/family members/strangers}?

This seems straightforward, except maybe it's necessary to add "considering only your own benefit" if we want personal utilities that w...

What would you do if you had a lot of money/power/influence and you thought that AI timelines were very short?

One thing I'd want to do is to create an organisation that builds networks with as many AI research communities as possible, monitors AI research as comprehensively as possible and assesses the risk posed by different lines of research.

Some major challenges:

  • a lot of labs want to keep substantial parts of their work secret, even more so for e.g. military
  • encouraging sharing of more knowledge might inadvertently spread knowledge of how to do risky stuff
  • even knowing someone is doing something risky, it might be hard to get them to change
  • might be hard to see
...
Donating money, buying happiness: new meta-analyses comparing the cost-effectiveness of cash transfers and psychotherapy in terms of subjective well-being

I share this concern. I don't have much of a baseline on how much meta-analyses overstate effect sizes, but I suspect it is substantial.

One comparison I do know about: as of about 2018, the average effect size of unusually careful studies funded by the EEF was 0.08, while the mean of meta-analytic effect sizes overall was allegedly 0.40, suggesting that meta analysis in that field on aver...

Donating money, buying happiness: new meta-analyses comparing the cost-effectiveness of cash transfers and psychotherapy in terms of subjective well-being

I think the follow up is much more helpful, but I found the original helpful too. I think it may be possible to say the same content less rudely, but "I think strong minds research is poor" is still a useful comment to me.

I disagree. I should also say that the follow up looked very different when I commented on it; it was extensively edited after I had commented.
On the Universal Distribution

One thing to think about: in order to reason about "observations" using mathematical theory, we need to (and do) convert them into mathematical things. Probability theory can only address the mathematical things we get in the end.

Most schemes for doing this ignore a lot of important stuff. E.g. "measure my height in cm, write down the answer" is a procedure that produces a real number, but also one that is indifferent to almost every "observation" I might care about in the future.

(The quotes around observation are to indicate that I don't know if it's exa...

Why aren't you freaking out about OpenAI? At what point would you start?

I think this post and Yudkowsky's Twitter thread that started it are probably harmful to the cause of AI safety.

OpenAI is one of the top AI labs worldwide, and the difference between their cooperation and antagonism to the AI safety community means a lot for the overall project. Elon Musk might be one of the top private funders of AI research, so his cooperation is also important.

I think that both this post and the Twitter thread reduce the likelihood of cooperation without accomplishing enough in return. I think that the potential to do harm to potential ...

Thanks for the recommendation. I spent about an hour looking for contact info, but was only able to find 5 public addresses of ex-OpenAI employees involved in the recent exodus. I emailed them all, and provided an anonymous Google Form as well. I'll provide an update if I do hear back from anyone.

Ah okay, so it is about not antagonizing our new overlords, got it.

Unfortunately we may be unlikely to get a statement from a departed safety researcher beyond mine, at least currently.

How would you run the Petrov Day game?

It seems like the game would better approximate the game of mutually assured destruction if the two sides had unaligned aims somehow, and destroying the page could impede "their" ability to get in "our" way.

Maybe the site that gets more new registrations on Petrov day has the right to demand that the loser advertise something of their choice for 1 month after Petrov day. Preferably, make the competition something that will be close to 50/50 beforehand.

The two communities could try to negotiate an outcome acceptable to everyone or nuke the other to try to avoid having to trust them or do what they want.

3 · Ben Millwood · 10mo
Like Sanjay's answer, I think this is a correct diagnosis of a problem, but I think the advertising solution is worse than the problem.
  • A month of harm seems too long to me.
  • I can't think of anything we'd want to advertise on LW that we wouldn't already want to advertise on EAF, and we've chosen "no ads" in that case.
The motivated reasoning critique of effective altruism

Here's one possible way to distinguish the two: under the optimizer's curse + judgement stickiness scenario, retrospective evaluation should usually take a step towards the truth, though it could be a very small one if judgements are very sticky! Under motivated reasoning, retrospective evaluation should take a step towards the "desired truth" (or some combination of truth and desired truth, if the organisation wants both).

The motivated reasoning critique of effective altruism

I like this post. Some ideas inspired by it:

If "bias" is pervasive among EA organisations, the most direct implication of this seems to me to be that we shouldn't take judgements published by EA organisations at face value. That is, if we want to know what is true we should apply some kind of adjustment to their published judgements.

It might also be possible to reduce bias in EA organisations, but that depends on other propositions like how effective debiasing strategies actually are.

A question that arises is "what sort of adjustment should be applied?". T...

Thanks for your extensions! Worth pondering more. I think this is first-order correct (and what my post was trying to get at). Second-order, I think there's at least one important caveat (which I cut from my post) with just tallying the total number (or importance-weighted number) of errors towards versus away from the desired conclusion as a proxy for motivated reasoning. Namely, you can't easily differentiate "motivated reasoning" biases from the perfectly innocent traditional optimizer's curse. Suppose an organization is considering 20 possible interventions and does initial cost-effectiveness analyses for each of them. If they have a perfectly healthy and unbiased epistemic process, then the top 2 interventions that they've selected from that list would a) in expectation be better than the other 18 and b) in expectation have more errors slanted towards higher impact rather than lower impact. If they then implement the top 2 interventions and do an impact assessment 1 year later, then I think it's likely the original errors (not necessarily biases) from the initial assessment will carry through. External red-teamers will then discover that these errors are systematically biased upwards, but at least on first blush "naive optimizer's curse issues" look importantly different in form, mitigation measures, etc., from motivated reasoning concerns. I think it's likely that either formal Bayesian modeling or more qualitative assessments can allow us to differentiate the two hypotheses.
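
The 20-interventions example can be checked with a small simulation. This is only a sketch under assumed numbers: true values and estimation noise are both standard normal, the noise is unbiased, and the function name `selected_error` is made up for illustration. It measures the average (estimate - truth) among the top 2 interventions chosen by estimate.

```python
import random


def selected_error(rng, n_options=20, n_selected=2, noise_sd=1.0):
    """Mean (estimate - truth) among the options with the highest estimates."""
    true_values = [rng.gauss(0.0, 1.0) for _ in range(n_options)]
    # Each estimate is the truth plus unbiased noise: no motivated reasoning here.
    estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
    ranked = sorted(range(n_options), key=lambda i: estimates[i], reverse=True)
    top = ranked[:n_selected]
    return sum(estimates[i] - true_values[i] for i in top) / n_selected


rng = random.Random(0)  # fixed seed so the run is repeatable
TRIALS = 2000
mean_error = sum(selected_error(rng) for _ in range(TRIALS)) / TRIALS
print(f"average overestimate among the selected top 2: {mean_error:.2f}")
```

Even though every individual estimate is unbiased, the selected interventions are overestimated on average, which is the pattern a red-teamer would see without any motivated reasoning being involved.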