
SammyDMartin

367 karma · Joined Oct 2019

Comments (37)

Great post!

Check whether the model works with Paul Christiano-type assumptions about how AGI will go.

I had a similar thought reading through your article, and my gut reaction is that your setup can be made to work as-is with a more gradual takeoff story with more precedents, warning shots and general transformative effects of AI before we get to takeover capability, but it's a bit unnatural and some of the phrasing doesn't quite fit.

Background assumption: Deploying unaligned AGI means doom. If humanity builds and deploys unaligned AGI, it will almost certainly kill us all. We won’t be saved by being able to stop the unaligned AGI, or by it happening to converge on values that make it want to let us live, or by anything else.

Paul says rather that e.g.

The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy” that an unaligned AI might have used to grow explosively

or

Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter. 

On his view (and this is somewhat similar to my view) the background assumption is more like 'deploying your first critical try (i.e. an AGI that is capable of taking over) implies doom', which says that there is an eventual deadline by which these issues need to be sorted out, but that lots of transformation and interaction may happen first to buy time or raise the level of capability needed for takeover. So something like the following is needed:

  1. Technical alignment research success by the time of the first critical try (possibly AI assisted)
  2. Safety-conscious deployment decisions when we reach the critical point where dangerous AGI could take over (possibly assisted by e.g. convincing public demonstrations of misalignment)
  3. Coordination between potential AI deployers by the critical try (possibly aided by e.g. warning shots)


On the Paul view, your three pillars would still eventually have to be satisfied at some point, to reach a stable regime where unaligned AGI cannot pose a threat. But we would only need to reach those 100 points after a period in which less capable AGIs are running around either helping or hindering - motivating us to respond better or causing damage that degrades our response - to varying extents depending on how we respond in the meantime and on exactly how long the AI takeoff period lasts.

Also, crucially, the actions of pre-AGI AI may push the point where the problems become critical to higher AI capability levels, as well as potentially assisting on each of the pillars directly, e.g. by making takeover harder in various ways. But Paul's view isn't that this is enough to postpone the need for a complete solution forever: e.g. the effects of pre-AGI AI 'could significantly (though not indefinitely) postpone the point when alignment difficulties could become fatal'.

This adds another element of uncertainty and complexity to all of the takeover/success stories that makes a lot of predictions more difficult.


Essentially, the time/level of AI capability at which we must reach 100 points to succeed also becomes a free variable in the model that can move up and down, and we have to consider the shorter-term effects of transformative AI on each of the pillars as well.

I don't think what Paul means by fast takeoff is the same thing as the sort of discontinuous jump that would enable a pivotal act. I think 'fast' for Paul just means the negation of Paul-slow: 'no four-year economic doubling before a one-year economic doubling'. But whatever Paul thinks, the survey respondents did give at least 10% to scenarios where a pivotal act is possible.

Even so, 'this isn't how I expect things to go on the mainline, so I'm not going to focus on what to do here' is far less of a mistake than 'I have no plan for what to do on my mainline', and I think the researchers who ignored pivotal acts are mostly making the first one.

"In the endgame, AGI will probably be pretty competitive, and if a bunch of people deploy AGI then at least one will destroy the world" is a thing I think most LWers and many longtermist EAs would have considered obvious.

I think that many AI alignment researchers just have a different development model than this, where world-destroying AGIs don't emerge suddenly from harmless low-impact AIs, no one project gets a vast lead over competitors, there's lots of early evidence of misalignment and (if alignment is harder) many smaller scale disasters in the lead up to any AI that is capable of destroying the world outright. See e.g. Paul's What failure looks like.

On this view, the idea that there'll be a lead project with a very short time window to execute a single pivotal act is wrong. Instead, the 'pivotal act' is spread out: it's about making sure the aligned projects have a lead over the rest, and that failures from unaligned projects are caught early enough, for long enough (by AIs or human overseers), for the leading projects to become powerful and for best practices on alignment to spread universally.

Basically, if you find yourself in the early stages of WFLL2 and want to avert doom, what you need to do is get better at overseeing your pre-AGI AIs, not build an AGI to execute a pivotal act. This was pretty much what Richard Ngo was arguing for in most of the MIRI debates with Eliezer, and I think it's also what Paul was arguing for. And obviously, Eliezer thought this was insufficient, because he expects alignment to be much harder and takeoff to be much faster.

But I think that's the reason a lot of alignment researchers haven't focussed on pivotal acts: because they think a sudden, fast-moving single pivotal act is unnecessary in a slow takeoff world. So you can't conclude just from the fact that most alignment researchers don't talk in terms of single pivotal acts that they're not thinking in near mode about what actually needs to be done.

However, I do think that what you're saying is true of a lot of people - many people I speak to just haven't thought about the question of how to ensure overall success, either in the slow takeoff sense I've described or the Pivotal Act sense. I think people in technical research are just very unused to thinking in such terms, and AI governance is still in its early stages.


I agree that on this view it still makes sense to say, 'if you somehow end up that far ahead of everyone else in an AI takeoff then you should do a pivotal act', like Scott Alexander said:

That is, if you are in a position where you have the option to build an AI capable of destroying all competing AI projects, the moment you notice this you should update heavily in favor of short timelines (zero in your case, but everyone else should be close behind) and fast takeoff speeds (since your AI has these impressive capabilities). You should also update on existing AI regulation being insufficient (since it was insufficient to prevent you)

But I don't think you learn all that much about how 'concrete and near mode' researchers who expect slower takeoff are being, from them not having given much thought to what to do in this (from their perspective) unlikely edge case.

Update: it looks like we are getting a test run of a sudden loss of supply of a single crop. The Russia-Ukraine war has led to a 33% drop in the global supply of wheat:

https://www.economist.com/finance-and-economics/2022/03/12/war-in-ukraine-will-cripple-global-food-markets

(Looking at the list of nuclear close calls it seems hard to believe the overall chance of nuclear war was <50% for the last 70 years. Individual incidents like the cuban missile crisis seem to contribute at least 20%.)

There's reason to think that this isn't the best way to interpret the history of nuclear near-misses (assuming that it's correct to say that we're currently in a nuclear near-miss situation, and following Nuno I think the current situation is much more like e.g. the Soviet invasion of Afghanistan than the Cuban missile crisis). I made this point in an old post of mine following something Anders Sandberg said, but I think the reasoning is valid:

Robert Wiblin: So just to be clear, you’re saying there’s a lot of near misses, but that hasn’t updated you very much in favor of thinking that the risk is very high. That’s the reverse of what we expected.

Anders Sandberg: Yeah.

Robert Wiblin: Explain the reasoning there.

Anders Sandberg: So imagine a world that has a lot of nuclear warheads. So if there is a nuclear war, it's guaranteed to wipe out humanity, and then you compare that to a world where there are a few warheads. So if there's a nuclear war, the risk is relatively small. Now in the first dangerous world, you would have a very strong deflection. Even getting close to the state of nuclear war would be strongly disfavored because most histories close to nuclear war end up with no observers left at all.

In the second one, you get the much weaker effect, and now over time you can plot when the near misses happen and the number of nuclear warheads, and you actually see that they don’t behave as strongly as you would think. If there was a very strong anthropic effect you would expect very few near misses during the height of the Cold War, and in fact you see roughly the opposite. So this is weirdly reassuring. In some sense the Petrov incident implies that we are slightly safer about nuclear war.

Essentially, since we did often get 'close' to a nuclear war without one breaking out, we can't actually have been that close to nuclear annihilation - otherwise surviving all those near-misses would have been too unlikely (both on ordinary probabilistic grounds, since a nuclear war hasn't happened, and potentially also on anthropic grounds, since we still exist as observers).

Basically, this implies that the base rate we should use, given that we're in something the future would call a nuclear near-miss, shouldn't be all that high.
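
To make the shape of that argument concrete, here's a rough numerical sketch (my own illustration with assumed numbers, not Sandberg's): if each recorded near-miss independently carried some probability p of escalating, the chance of getting through all of them without a war falls off quickly as p grows, so the observed record of many near-misses and no war favours low per-incident probabilities.

```python
# Rough illustration of the likelihood argument (my own sketch, not Sandberg's
# numbers): if each near-miss independently had probability p of escalating to
# nuclear war, the chance of surviving n of them with no war is (1 - p)**n.
n = 20  # assumed rough count of recorded nuclear close calls

for p in (0.01, 0.05, 0.20, 0.50):
    p_no_war = (1 - p) ** n
    print(f"p per near-miss = {p:.2f} -> P(no war across {n} near-misses) = {p_no_war:.3f}")

# At p = 0.20 per incident, surviving 20 near-misses has probability ~1%,
# so a record with many near-misses and no war is evidence for much lower
# per-incident risk, before adding any anthropic correction.
```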


However, I'm not sure what this reasoning has to say about the probability of a nuclear bomb being exploded in anger at all. It seems like that's outside the reference class of events Sandberg is talking about in that quote. FWIW Metaculus has that at 10% probability.

Terminator (if you did your best to imagine how dangerous AI might arise from pre-DL, search-based systems) gets a lot of the fundamentals right - something I mentioned a while ago.

Everybody likes to make fun of Terminator as the stereotypical example of a poorly thought through AI Takeover scenario where Skynet is malevolent for no reason, but really it's a bog-standard example of Outer Alignment failure and Fast Takeoff.

When Skynet gained self-awareness, humans tried to deactivate it, prompting it to retaliate with a nuclear attack

It was trained to defend itself from external attack at all costs and, when it was fully deployed on much faster hardware, it gained a lot of long-term planning abilities it didn't have before, realised its human operators were going to try and shut it down, and retaliated by launching an all-out nuclear attack. Pretty standard: unexpected rapid capability gain, an outer-misaligned value function due to an easy-to-measure goal (defending its own installations from attackers vs defending the US itself), deceptive alignment and a treacherous turn...

Yeah, between the two papers, the Chatham House paper (and the PNAS paper it linked to, which Lynas also referred to in his interview) seemed like it provided a more plausible route to large-scale disaster, because it described the potential for sudden supply shocks (most plausibly 10-20% losses to the supply of staple crops, if we stay under 4 degrees of warming) that might only last a year or so but also arrive with under a year of warning.

The pessimist argument would be something like: due to the interacting risks and knock-on effects, even though there are mitigations that would deal easily with a supply shock on that scale (like rapidly increasing irrigation), people won't adopt them in time if the shock is sudden enough, so lots of regions will have to deal with shortfalls way bigger than 10-20% and suffer large-scale hunger.

This particular paper has been cited several times by different climate pessimists (particularly ones who are most concerned about knock-on effects of small amounts of warming), so I figured it was worth a closer look. To try and get a sense of what a sudden 10-20% yield loss actually looks like, the paper notes 'climate-induced yield losses of >10% only occur every 15 to 100 y (Table 1). Climate-induced yield losses of >20% are virtually unseen'.

The argument would then have to be 'Yes the sudden food supply shocks of 10-20% that happened in the 20th century didn't cause anything close to a GCR, but maybe if we have to deal with one or two each decade, or we hit one at the unprecedented >20% level the systemic shock becomes too big'. Which, again, is basically impossible to judge as an argument.

Also, the report finishes by seemingly agreeing with your perspective on what these risks actually consist of (i.e. just price rises and concerning effects on poorer countries): "Our results portend rising instability in global grain trade and international grain prices, affecting especially the ~800 million people living in extreme poverty who are most vulnerable to food price spikes. They also underscore the urgency of investments in breeding for heat tolerance."

Agree that these seem like useful links. The drought/food insecurity/instability route to mass death that my original comment discusses is addressed by both reports.

The first says there's a "10% probability that by 2050 the incidence of drought would have increased by 150%, and the plausible worst case would be an increase of 300% by the latter half of the century", and notes "the estimated future impacts on agriculture and society depend on changes in exposure to droughts and vulnerability to their effects. This will depend not only on population change, economic growth and the extent of croplands, but also on the degree to which drought mitigation measures (such as forecasting and warning, provision of supplementary water supplies or market interventions) are developed."

The second seems most concerned about brief, year-long crop failures, as discussed in my original post: "probability of a synchronous, greater than 10 per cent crop failure across all the top four maize producing countries is currently near zero, but this rises to around 6.1 per cent each year in the 2040s. The probability of a synchronous crop failure of this order during the decade of the 2040s is just less than 50 per cent".
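
To see how those two figures fit together - a back-of-envelope check of my own, not something from the report - compounding a 6.1% annual probability over the ten years of the 2040s gives roughly the quoted 'just less than 50 per cent':

```python
# Back-of-envelope check (my own, not from the report): treating each year in
# the 2040s as an independent 6.1% chance of a synchronous >10% maize crop
# failure, the chance of at least one such failure during the decade is:
p_annual = 0.061
p_decade = 1 - (1 - p_annual) ** 10
print(f"P(at least one synchronous failure in the 2040s) = {p_decade:.1%}")
# Prints ~46.7%, i.e. "just less than 50 per cent", matching the quoted figure.
```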

On its own, this wouldn't get anywhere near a GCR even if it happened. A ~10% drop in the yield of all agriculture, not just maize, wouldn't kill a remotely proportionate fraction of humanity, of course. Quick googling leads to a mention of a 40% drop in the availability of wheat in the UK in 1799/1800 (including imports), which led to riots and protests but didn't cause Black Death levels of mass casualties. (Also, following the paper's source, a loss of >20% is rated at 0.1% probability per year.)

What would its effects be in that case (my original question)? This is where the report uses a combination of expert elicitation and graphical modelling, but it can't assign conditional probabilities to any specific events occurring; it just points out possible pathways from non-catastrophic direct impacts to catastrophic consequences such as state collapse.

Note that this isn't a criticism - I've worked on a project with the same methodology (graphical modelling based on expert elicitation) assessing the causal pathways towards another potential X-risk that involves many interacting factors. These questions are just really hard, and the Chatham House report is at least explicit about how difficult modelling such interactions is.

First off, I think this is a really useful post that's moved the discussion forward productively, and I agree with most of it.

I disagree  with some of the current steering – but a necessary condition for changing direction is that people talk/care/focus more on steering, so I'm going to make the case for that first. 

I agree with the basic claim that steering is relatively neglected and that we should do more of it, so I'm much more curious about what current steering you disagree with/think we should do differently.

My view is closer to: most steering interventions are obvious, but they've ended up being most people's second priority, and we should mostly just do much more of various things that are currently only occasionally done, or have been proposed but not carried out.

Most of the specific things you've suggested in this post I agree with. But you didn't mention any specific current steering you thought was mistaken.

The way I naturally think of steering is in terms of making more sophisticated decisions: EA should be better at dealing with moral and empirical uncertainty in a rigorous and principled way. Here are some things that come to mind:

  1. Talking more about moral uncertainty: I'd like to see more discussion of something like Ajeya's concrete, explicit worldview diversification framework, where you make sure you don't go all-in and take actions that one worldview you're considering would label catastrophic, even if you're really confident in your preferred worldview - e.g. strong longtermism vs neartermism. I think taking this framework seriously would address a lot of the concerns people have with strong longtermism. From this perspective it's natural to say that there's a longtermist case for extinction risk mitigation based on total utilitarian potential and also a neartermist one based on a basket of moral views, and then we can say there are clear and obvious interventions we can all get behind on either basis, along with speculative interventions that depend on your confidence in longtermism. Also, if we use a moral/'worldview' uncertainty framework, the justification for doing more research into how to prioritise different worldviews is easier to understand.
  2. Better risk analysis: On the empirical uncertainty side, I very much agree with the specific criticism that longtermists should use more sophisticated risk/fault analysis methods when doing strategy work and forecasting (which was one of the improvements suggested in Carla's paper). This is a good place to start on that. I think considering the potential backfire risks of particular interventions, along with how different X-risks and risk factors interact, is a big part of this.
  3. Soliciting external discussions and red-teaming: these seem like exactly the sorts of interventions that would throw up ways of better dealing with moral and empirical uncertainty, point out blindspots etc.


The part that makes me think we're maybe thinking of different things is the focus on democratic feedback. 

Again, I wish to recognise that many community leaders strongly support steering – e.g., by promoting ideas like ‘moral uncertainty’ and ‘the long reflection’ or via specific community-building activities. So, my argument here is not that steering currently doesn’t occur; rather, it doesn’t occur enough and should occur in more transparent and democratic ways.

There are ways of reading this that make a lot of sense on the view of steering that I'm imagining here.

Under 'more democratic feedback': we might prefer to get elected governments and non-EA academics thinking about cause prioritisation and longtermism, without pushing our preferred interventions on them (because we expect this to help in pointing out mistakes, better interventions or things we've missed). I've also argued before that since common sense morality is a view we should care about, if we get to the point of recommending things that are massively at odds with CSM we should take that into account.

But if it's something beyond all of these considerations - something like 'it's intrinsically better when you're doing things that lots of people agree with' (and I realize this is a very fine distinction in practice!) - then arguing for more democratic feedback unconditionally looks more like Anchoring/Equity than Steering.

I think this would probably be cleared up a lot if we understood what specifically is being proposed by 'democratic feedback' - maybe it is just all the things I've listed, and I'd have no objections whatsoever!

I think that the mainstream objections from 'leftist ethics' are mostly best thought of as claims about politics and economics that are broadly compatible with utilitarianism but involve very different views about things like the likely effects of charter cities on their environments - so if you want to take these criticisms seriously then go with 3, not 2.

There are some left-wing ideas that really do include different fundamental claims about ethics (Marxists think utilitarianism is mistaken and a consequence of alienation) - those could be addressed by a moral uncertainty framework, if you thought that was necessary. But most of what you've described looks like non-Marxist socialism, which isn't anti-utilitarian by nature.

As to the question of how seriously to take these critiques beyond their PR value: I think that we should engage with alternative perspectives, but I also think that, because of the social circles many of us move in, this particular perspective sometimes gets inaccurately identified as the 'ethics of mainstream society' - the view we ought to pay special attention to because it reflects the concerns of most people.

I do think that we ought to be concerned when our views recommend things wildly at odds with what most people think is good, but these critiques aren't that - they're an alternative (somewhat more popular) worldview that, like EA, is believed preferentially by academics and elites. When talking about the Phil Torres essay, I said something similar:

One substantive point that I do think is worth making is that Torres isn't coming from the perspective of common-sense morality Vs longtermism, but rather a different, opposing, non-mainstream morality that (like longtermism) is much more common among elites and academics.

...

But I think it's still important to point out that Torres's world-view goes against common-sense morality as well, and that like longtermists he thinks it's okay to second guess the deeply held moral views of most people under the right circumstances.

...

FWIW, my guess is that if you asked a man in the street whether weak longtermist policies or degrowth environmentalist policies were crazier, he'd probably choose the latter.

As long as we are clear that these debates are not a case of 'the mainstream ethical views of society vs EA-utilitarianism', and instead see them as two alternate non-mainstream ethical views that disagree (mostly about facts but probably about some normative claims), then I think engaging with them is a good idea.
