Doing a PhD in Philosophy of AI. Working on conceptual AI Safety things
I guess that really depends on how deep this particular problem runs. If it makes most big companies very vulnerable since most employees use LLMs which are susceptible to prompt injections, I'd expect this to cause more chaos in the US than Russia's invasion of Ukraine. I think we're talking slightly past each other though, I wanted to make the point that the baseline (non-existential) chaos from agentic AI should be high since near term, non-agentic AI may already cause a lot of chaos. I was not comparing it to other causes of chaos; though I'm very uncertain about how these will compare.
I'm surprised btw that you don't expect a (sufficient) fire alarm solely on the basis of short timelines. To me, the relevant issue seems more 'how many more misaligned AIs with what level of capabilities will be deployed before takeoff'. Since a lot more models with higher capabilities got deployed recently, it doesn't change the picture for me. If anything, I expect non-existential disasters before takeoff more since the last few months since AI companies seem to just release every model & new feature they got. I'd also expect a slow takeoff of misaligned AI to raise the chances of a loud warning shot & the general public having a Covid-in-Feb-2020-wake-up-moment on the issue.
In the near term, I'd personally think of prompt injections by some malicious actor which cause security breaches in some big companies. Perhaps a lot of money lost, and perhaps important information leaked. I don't have expertise on this but I've seen some concern about it from security experts after the GPT plugins. Since that seems like it could cause a lot of instability even without agentic AI & it feels rather straightforward to me, I'd expect more chaos on 2.
Not a perfect answer but why the focus on pressure towards making a single negative decision?
Here are two (of many more) assassination attempts on Hitler. While not successful, they were probably nevertheless very high EV.
1) Georg Elser placed a time bomb at the Bürgerbräukeller in Munich, where Hitler was due to give a speech in 1939. Hitler's flight was canceled due to bad weather so he left early & before the bomb detonated. Elser was in response held as a prisoner for over five years and eventually executed by the Nazis at the Dachau concentration camp.
2) In 1944, Claus von Stauffenberg along with many other conspirators tried to kill Hitler in a bigger & rather well-known plot called Operation Valkyrie. In the plot, they aimed to overthrow the nazi government and (most likely) as quickly as possible make peace with the Western Allies. Due to several coincidences, Hitler was only weakly injured. Stauffenberg was killed quickly afterward.
As far as I understand it, orthogonality and instrumental convergence together actually make a case for AI being by default not aligned. The quote from Eliezer here goes a bit beyond the post. For the orthogonality thesis by itself, I agree with you & the main theses of the post. I would interpret "not aligned by default" as something like a random AI is probably not aligned. So I tend to disagree also when just considering these two points. This is also the way I originally understood Bostrom in Superintelligence.
However, I agree that this doesn’t tell you whether aligning AI is hard, which is another question. For this, we at least have the empirical evidence of a lot of smart people banging their heads against it for years without coming up with detailed general solutions that we feel confident about. I think this is some evidence for it being hard.
Thanks for this post! I think it’s especially helpful to tease apart theoretical possibility and actual relationships (such as correlation) between goals and intelligence (under different situations). As you say, the orthogonality thesis has almost no implications for the actual relations between them.
What’s important here is the actual probability of alignment. I think it would be very valuable to have at least a rough baseline/ default value since that is just the general starting point for predicting and forecasting. I’d love to know if there’s some work done on this (even if just order of magnitude estimations) or if someone can give it mathematical rigor. Most importantly, I think we need a prior for the size and dimensionality of the state space of possible values/goals. Together with a random starting point of an AI in that state space, we should be able to calculate a baseline value from which we then update based on arguments like the ones you mention as well as negative ones. If we get more elaborate in our modeling, we might include an uninformative/neutral prior for the dynamics of that state space based on other variables like intelligence which is then equally subject to updates from arguments and evidence.
Considering the fragility of human values/ the King Midas problem – that is in brief the difficulty of specifying human values completely & correctly while they likely make up a very tiny fraction in a huge state space of possible goals/values, I expect this baseline to be an extremely low value.
Turning to your points on updating from there (reasons to expect correlation), while they are interesting, I feel very uncertain about each one of them. At the same time instrumental convergence, inner alignment, Goodhart’s Law, and others seem to be (to different degrees) strong reasons to update in the opposite direction. I’d add here the empirical evidence thus far of smart people not succeeding in finding satisfactory solutions to the problem. Of course, you’re right that 1. is probably the strongest reason for correlation. Yet, since the failure modes that I’m most worried about involve accidental mistakes rather than bad intent, I’m not sure whether to update a lot based on this. I’m overall vastly uncertain about this point since AI developers’ impact seems to involve predicting how successful we will be in both finding and implementing acceptable solutions to the alignment problem.
I feel like the reasons you mention to expect correlation should be further caveated in light of a point Stuart Russel makes frequently: Even if there is some convergence towards human values in practice for a set of variables of importance to some approximation, it seems unlikely that this covers all the variables we care about & we should expect other unconstrained variables about which we might care to be set to extreme values since that's what optimizers tend to do.
One remark on “convergence” in the terminology section. You write
lower intelligence could be more predictive [of motivations], but that thesis doesn’t seem very relevant whether or not it’s true.
I disagree, you focus on the wrong case there (benevolent ostriches). If we knew lower intelligence to be more (and higher intelligence to be less) predictive of goals, it would be a substantial update from our baseline of how difficult alignment is for an AGI thus making alignment harder and progressively so in the development of more intelligent systems. That would be important information.
Finally, like some previous commenters, I also don't have the impression that the orthogonality thesis is (typically seen as) a standalone argument for the importance of AIS.
I think the more important answer to this question is that most of the virus genomes available online are either from viruses that are unlikely to take off as a pandemic and/or have probably limited expected harm even if they take off. This harm being rather limited might be because most of us have some immunity against the pathogen like for the 1918 Spanish flue as most ppl got the flue (other influenza infections) before or indeed because there are plenty of vaccines available that are ready to go (like for smallpox which additionally is much harder to manufacture from the genome).
I should note that I'm not saying anything new here. This is just from the interview. Esvelt addresses this exact question at 33 / 35 mins in (depending on where you listen to it). He seems to see the claim that there are already (many) pandemic-grade pathogens available online as a common harmful misperception.
What role does he expect increasing intelligence and agency of AIs to play for the difficulty of them reaching their goal, i.e. achieving alignment of ASI (in less than 4 years)?
What chance does he estimate for them achieving this? What is the most likely reason for the approach to fail?
How could one assess the odds that their approach works?
Doesn't one at a minimum need to know both the extent to which the alignment problem gets more difficult for increasingly intelligent & agentic AI, as well as the extent to which automated alignment research scales current safety efforts?