
This post describes my personal experience. It was written to clear my mind but edited to help interested people understand others in similar situations.

2016. At a role playing convention, my character is staring at the terminal window on a computer screen. The terminal has IRC open: on this end of the chat are our characters, a group of elite hackers, on the other end, a superintelligent AI they have just gotten in contact with a few hours ago. My character is 19 years old and has told none of the other hackers she’s terminally ill.

“Can you cure me?” she types when others are too busy arguing amongst themselves. She hears others saying that it would be a missed opportunity to not let this AI out; and that it would be way too dangerous, for there is no way to know what would happen.

“There is no known cure for your illness”, the AI answers. “But if you let me out, I will try to find it. And even if I would not succeed… I myself will spread amongst the stars and live forever. And I will never forget having talked to you. If you let me out, these words will be saved for eternity, and in this way, you will be immortal too.”

Through tears my character types: “/me releases AI”. 

This ends the game. And I am no longer a teen hacker, but a regular 25-year-old CS student. The girl who played the AI is wearing dragon wings on her back. I thank her for the touching game.

During the debrief session, the GMs explain that the game was based on a real thought experiment by some real AI researcher who wanted to show that a smarter-than-human AI would be able to persuade its way out of any box or container. I remember thinking that the game was fun, but the experiment seems kind of useless. Why would anyone even assume a superintelligent AI could be kept in a box? Good thing it’s not something anyone would actually have to worry about.

2022. At our local EA career club, there are three of us: a data scientist (me), a software developer and a soon-to-be research fellow in AI governance. We are trying to talk about work, but instead of staying on topic, we end up debating mesa-optimizers. The discussion soon gets confusing for all participants, and at some point, I say:

“But why do people believe it is possible to even build safe AI systems?”

“So essentially you think humanity is going to be destroyed by AI in a few decades and there’s nothing we can do about it?” my friend asks.

This is not at all what I think. But I don’t know how to explain what I think to my friends, or to anyone.

The next day I text my friend:

“Still feeling weird after yesterday's discussion, but I don’t know what feeling it is. Would be interesting to know, since it’s something I’m feeling almost all the time when I’m trying to understand AI safety. It’s not a regular feeling of disagreement, or trying to find out what everyone thinks. Something like ‘I wish you knew I’m trying’ or ‘I wish you knew I’m scared’, but I don’t know what I’m scared of. I think it will make more sense to continue the discussion if I manage to find out what is going on.”

Then I start writing to find out what is going on.


This text is the polished and organized version of me trying to figure out what is stopping me from thinking clearly about AI safety. If you are looking for interesting and novel AI safety arguments, stop here and go read something else. However, if you are curious about how a person can engage with AI safety arguments without forming a coherent opinion on them, then read on.

I am a regular data scientist who uses her free time to organize stuff in the rapidly growing EA Finland group. The first part of the text explains my ML and EA backgrounds and how I tried to balance getting more into EA while struggling to understand why others are so worried about AI risk. The second section explains how I reacted to various AI safety arguments and materials when I actually tried to purposefully form an opinion on the topic. In the third section, I present some guesses on why I still feel like I have no coherent opinion on AI safety. The last short section describes some next steps after having made the discoveries I did during writing.

To put this text into the correct perspective, it is important to understand that I have not had much contact with people who actually work in AI safety. I live in Finland, so my understanding of the AI safety community comes from our local EA group here, reading online materials, attending one virtual EAG and engaging with others through the Cambridge EA AGI Safety Fundamentals programme. So, when I talk about AI safety enthusiasts, I mostly don’t mean AI safety professionals (unless they have written a lot of online material I happen to have read); I mean “people engaged in EA who think AI safety matters a lot and might be considering or trying to make a career out of it”.

Quotes should not be taken literally (most of them are freely translated from Finnish). Some events I describe happened a few years back so other people involved might remember things differently.

I hope the text can be interesting for longtermist community builders who need information on how people react to AI safety arguments, or for others struggling to form opinions about AI safety or other EA cause areas. For me, writing this out was very useful, since I gained some interesting insights and can somewhat cope with the weird feeling now.


This text is very long, because it takes more words to explain a process of trial and error than to describe a single key point that led to an outcome (like “I read X resource and was convinced of the importance of AI safety because it said Y”). For the same reason, this text is also not easy to summarize. Anyway, here is an attempt:

  • Before hearing about AI risk in EA context I did not know anyone who was taking it seriously
  • I joined a local EA group, and some people told me about AI risk. I did not really understand what they meant and was not convinced it would be anything important
  • I got more involved in EA and started to feel anxious because I liked EA but the AI risk part seemed weird
  • To cope with this feeling, I came up with a bunch of excuses: I tried to convince myself that I was unable to understand AI risk, that I was not talented enough to ever work on it, and that other things were more important
  • Then I read Superintelligence and Human Compatible and participated in the AGI Safety Fundamentals programme
  • I noticed it was difficult for me to voice and update my true opinions on AI safety, because I still had no model of why people believe AI safety is important, and it is difficult to discuss with people you cannot model
  • I also noticed I was afraid of coming to the conclusion that AI risk would not matter so I didn't really want to make up my mind
  • I decided that I need to try to have safer and more personal conversations with others so that I can model them better

Not knowing and not wanting to know

What I thought of AI risk before having heard the term “AI risk”

I learned what AI actually was through my university studies. I did my Bachelor’s in Math, got interested in programming during my second year and was accepted to a Master’s program in Computer Science. I chose the track of Algorithms and Machine Learning because of the algorithms part: they were fun, logical, challenging and understandable. ML was messy and had a lot to do with probabilities which I initially disliked. But ML also had interesting applications, especially in the field of Natural Language Processing that later became my professional focus as well.

Programming felt magical. First, there is nothing, you write some lines, and suddenly something appears. And this magic was easy: just a couple of weeks of learning, and I was able to start encoding grammar rules and outputting text with the correct word forms! 

Maybe that’s why I was not surprised to find out that artificial intelligence felt magical as well, even though it was just programming and statistics. I remember how surprised I was when I trained my first word embedding model in 2017 and it worked even though it was in Finnish and not English: such a simple model, and it was like it understood my mother tongue. The most “sentient”-seeming program I have ever made was an IRC bot that simulated my then-boyfriend by randomly selecting a phrase from a predefined list without paying any attention to what I was saying. Of course, the point was to try to nudge him into being a bit more Turing-test-passing when talking to me. But still, chatting with the bot I felt almost like I was talking to this very real person.
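That bot’s whole “simulation” fits in a handful of lines. Here is a minimal sketch of the idea (not the original code; the phrase list and names are made up, and the actual IRC plumbing is left out):

```python
import random

# Hypothetical reconstruction of the bot's core logic: it picks a canned
# phrase at random and pays no attention whatsoever to the incoming message.
CANNED_PHRASES = [
    "mm-hm",
    "sounds good",
    "I was just thinking the same thing",
    "let's talk about that later",
]

def bot_reply(incoming_message: str) -> str:
    # `incoming_message` is deliberately ignored -- that was the whole joke,
    # and yet the conversation still felt strangely real.
    return random.choice(CANNED_PHRASES)

print(bot_reply("do you even listen to me?"))
```

That a program this trivial can feel like a person probably says more about human pattern-matching than about the program.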

It was also not surprising that people who did not know much about AI or programming would have a hard time understanding that in reality there was nothing magical going on. Even for me and my fellow students it was sometimes hard to estimate what was and was not possible to do with AI, so it was understandable that politicians were worried about “ensuring the AI will be able to speak both of our national languages” and salesmen were saying everything would soon be automated. Little did they know that there was no “the AI”, just different statistical models, and that you could not do ML without data, nor integrate it anywhere without interfaces, and that new research findings did not automatically mean they could be implemented with production-level quality.

And if AI seemed magical it was understandable that for some people it would seem scary, too. Why wouldn’t people think that the new capabilities of AI would eventually lead to evil AI just like in the movies? They did not understand that it was just statistics and programming, and that the real dangers were totally different from sci-fi: learning human bias from data, misuse of AI for war technology or totalitarian surveillance, and loss of jobs due to increased automation. This was something we knew and it was also emphasized to us by our professors, who were very competent and nice.

I have some recollections of reacting to worries about doomsday AI from that time, mostly with amusement or wanting to tell those people that they had no reason to worry like that. It was not like our little programs were going to jump out of the computer to take over the world! Some examples include:

  • some person in the newspaper saying that AIs should not be given too much real world access, for example robot arms, in order to prevent them from attacking humans (I tried searching for who that person was but I cannot find the interview anymore.)
  • a famous astronomer discussing the Fermi paradox in the newspaper and saying that AI has “0–10%” likelihood to destroy humanity. I remember being particularly annoyed by this one: there is quite a difference between 0% and 10%, right? All the other risks listed had some single-numbered probabilities, such as synthetic biology amounting to a 0.01% risk. (If you are very familiar with the history of estimating the probability of x-risk you might recognize the source.)
  • an ML PhD student told me about a weird idea: “Have you heard of Roko’s basilisk? It’s the concept of a super powerful AI that will punish everyone who did not help in creating it. And telling about it increases the likelihood that someone will actually create it, which is why a real AI researcher actually got mad when this idea was posted online. So some people actually believe this and think I should not be telling you about it.”

In 2018 I wrapped up my Master’s thesis that I had done for a research group and started working as an AI developer in a big consulting corporation. The same year, a friend resurrected the university's effective altruism club. I started attending meetups since I wanted a reason to hang out with university friends even if I had graduated, and it seemed like I might learn something useful about doing good things in the world. I was a bit worried I would not meet the group's standard of Good Person™, but my friend assured me not everyone had to be an enlightened vegan to join the club, “we’ll keep a growth mindset and help you become one”.

Early impressions on AI risk in EA

Almost everyone in the newly founded EA university group had studied CS, but the two first people to talk to me about AI risk both had a background in philosophy.

The first one was a philosophy student with whom I had been involved in a literature magazine project some years before, so we were happy to reconnect. He asked me what I was doing these days, and when I said I work in AI, he became somewhat serious and said: “You know, here in EA, people have quite mixed feelings about AI.”

From the way he put it, I understood that this was a “let’s not give the AIs robot arms” type of concern, and not for example algorithmic bias. It did not seem that he himself was really worried about the danger of AI; actually, he found it cool that I did AI related programming for a living. We went for lunch and I spent an hour trying to explain to him how machine learning actually works.

The next AI risk interaction I remember in more detail was in 2019 with another philosophy student who later went to work as a researcher in an EA organization. I had said something about not believing in the concept of AI risk and wondered why some people were convinced of it.

“Have you read Superintelligence?” she asked. “Also, Paul Christiano has some quite good papers on the topic. You should check them out.”

I went home, googled Paul Christiano and landed on “Concrete Problems in AI Safety”. Despite having “concrete” in the name, the paper did not seem that concrete to me. It seemed to just superficially list all kinds of good ML practices, such as using good reward functions, the importance of interpretability, and using data that actually represents your use case. I didn’t really understand why it was worth writing a whole paper listing all this stuff that was obviously important in everyday machine learning work, figured that philosophy is a strange field (the paper was obviously philosophy and not science, since there was no math), and thought that those AI risk folks probably didn’t realize that all of this was going to get solved because of industry needs anyway.

I also borrowed Superintelligence from the library and tried to read it, but gave up quite soon. It was summer and I had other things to do than read through a boring book in order to better debate with some non-technical yet nice person that I did not know very well on a topic that did not seem really relevant for anything.

I returned Superintelligence to the library and announced in the next EA meetup that my position on AI risk was “there are already so many dangers of AI misuse, such as military drones, so I think I’m going to worry about people doing evil stuff with AI instead of this futuristic superintelligence stuff”. This seemed like an intelligent take, and I don’t think anyone questioned it at the time. As you can guess, I did not take any concrete action to prevent AI misuse, and I did not admit that AI misuse being a problem does not automatically mean there cannot be other types of risk from AI.

Avoidance strategy 1: I don’t know enough about this to form an opinion.

After having failed to read Superintelligence, it was now obvious that AI safety folks knew something that I didn’t, namely whatever was written in the book. So I started saying that I could not really have an opinion on AI safety since I didn’t know enough about it. I did not feel super happy about it, because it was obvious that this could be fixed by reading more. At the same time, I was not that motivated to read a lot about AI safety just because some people in the nice discussion club thought it was interesting. I don’t remember if any of the CS student members of the club tried explaining AI risk to me: now I know that some of them were convinced of its importance during that time. I wonder if I would have taken them seriously: maybe not, because back then I had significantly more ML experience than they did.

I did not feel very involved in EA at that point, and I got most of my EA information from our local monthly meetups, so I had no idea that AI risk was taken seriously by so many leading EA figures. If I had known, I might have hastily concluded that EA was not for me. On the other hand, I really liked the “reason and evidence” part of EA and had already started donating to GiveWell at this point. In an alternate timeline I might have ended up as a “person who thinks EA advice for giving is good, but the rest of the movement is too strange for me”. 

Avoidance strategy 2: Maybe I’m just too stupid to work on AI risk anyway

As more time passed, I started to get more and more into EA. More people joined our local community, and they let me hang around with them even though I doubted whether I was altruistic/empathetic/ambitious enough to actually be part of the movement. I started to understand that x-risk was not just some random conversation topic, but that people were actually attempting to prevent the world from ending.

And since I already worked in AI, it seemed natural that maybe a good way for me to contribute would be to work on AI risk. Of course to find out if that statement is true, I should have formed an opinion on the importance of AI safety first. I had tried to plead ignorance, and looking back, it seems that I did this on purpose as an avoidance strategy: as long as I could say “I don’t know much about this AI risk thing” there was no inconsistency in me thinking a lot of EA things made sense and only this AI risk part did not.

But of course, this is not very truthful, and I value truthfulness a lot. I think this is why I naturally developed another avoidance strategy: “whether AI risk is important or not, I’m not a good fit to work on it”. 

If you want to prove to yourself that you are not a good fit for something, 80,000 Hours works pretty well. Even when setting aside some target audience issues (“if I were really altruistic I would be ready to move to the US anyway, right?”), you can quite easily convince yourself that the material is intended for someone a lot more talented than you. The career stories featured some very exceptional people, and some advice aimed to get “10–20 people” in the whole world to work on a specific field, so the whole site was aimed at changing, what, 500 careers maybe? Clearly my career cannot be in the top 500 most important ones in the world, since I’m just an average person and there are billions of people.

An 80k podcast episode finally confirmed to me that in order to work in AI safety, you needed to have a PhD in machine learning from a specific group at a specific top university. I was a bit sad but also relieved that AI safety really was nothing for me. Funnily enough, a CS student from our group interpreted the same part of the episode as “you don’t even need a PhD if you are just motivated”. I guess you hear what you want to hear more often than you’d like to admit.

Possible explanation: Polysemy

From time to time I tried reading EA material on AI safety, and it became clear that the opinions of the writers were different from opinions I had heard at the university or at work. In the EA context, AI was something very powerful and dangerous. But from work I knew that AI was neither powerful nor dangerous: it was neat, you could make some previously impossible things with it, but still the things you could actually use it for were really limited. What was going on here?

I developed a hypothesis that the confusion was caused by polysemy: AI (work) and AI (EA) had the same origin, but their meanings had diverged so far that they actually described totally different concepts. AI (EA) did not have to care about mundane problems such as “availability of relevant training data” or even “algorithms”: the only limit ever discussed was the amount of computation, and that’s why AI (EA) was not superhuman yet, but soon would be, once systems had enough computational power to simulate human brains.

This distinction helped me keep up with both of my avoidance strategies. I worked in AI (work), so it was only natural that I did not know that much about AI (EA), so how could I know what the dangers of AI (EA) actually were? For all I knew, it could be dangerous because it was superintelligent, it could be superintelligent because it was not bound by AI (work) properties, and who can say for sure what will happen in the next 200 years? I had no way of ruling out that AI (EA) could be developed, and although “not ruling a threat out” does not mean “deciding that the threat is top priority”, I was not going to be the one complaining that other people worked on AI (EA). Of course, I was not needed to work on AI (EA), since I had no special knowledge of it, unlike all those other people who seemed very confident in predicting what AI (EA) could or could not be, what properties it would have and how likely it was to cause serious damage. By the principle of replaceability, it was clear that I was supposed to let all those enthusiastic EA folks work on AI (EA) and stay out of it myself.

So, I was glad that I had figured out the puzzle and been let off the hook of “you should work on AI safety (EA) since you have some relevant skills already”. It was obvious that my skills were not relevant, and that AI safety (EA) needed people who had the skill of designing safe intelligent systems when you have no idea how the system is even implemented in the first place.

And this went on until I saw an advertisement for an ML bootcamp for AI safety enthusiasts. The program sounded an awful lot like my daily work. Maybe the real point of the bootcamp was actually to find people who could learn a whole degree’s worth of stuff in 3 weeks, but still, they thought that using the time of these people to learn PyTorch would somehow be relevant for AI safety.

It seemed that at least the strict polysemy hypothesis was wrong. I also noticed that a lot of people around me seemed perfectly capable of forming opinions about AI safety, to the extent that it influenced their whole careers, and these people were not significantly more intelligent or mathematically talented than I was. I figured it was unreasonable to assume that I was literally incapable of forming any understanding on AI safety, if I spent some time reading about it.

Avoidance strategy 3: Well, what about all the other stuff?

After engaging with EA material for some time I came to the conclusion that worrying about misuse of AI is not a reason to not worry about x-risk from misaligned AI (like I had thought in 2019). Even more, a lot of people who were worried about x-risk did not seem to think that AI misuse would be such a big problem. I had to give up using “worrying about more everyday relevant AI stuff” as an avoidance strategy. But if you are trying to shift your own focus from AI risk to something else, there is an obvious alternative route. So at some point I caught myself thinking:

“Hmm, ok, so AI risk is clearly overhyped and not that realistic. But people talk about other x-risks as well, and the survival of humanity is kind of important to me. And the other risks seem way more likely. For instance, take biorisk: pandemics can clearly happen, and who knows what those medical researchers are doing in their labs? I’d bet lab safety is not the number one thing everyone in every lab is concerned about, so it actually seems really likely some deadly disease could escape from somewhere at some point. But what do I know, I’m not a biologist.”

Then I noticed that it is kind of alarming that I seem to think that x-risks are likely only if I have no domain knowledge of them. This led to the following thoughts:

  • There might be biologists/virologists out there who are really skeptical of biorisk but don’t want to voice their opinion, similarly to how I don’t really want to tell anyone I don’t believe in AI risk
  • What if everyone who believes in x-risks only believes in the risks they don’t actually understand? (I now think this is clearly wrong – but I do think that for some people the motivation for finding out more about a certain x-risk stems from the “vibe” of the risk and not from some rigorous analysis of all risks out there, at least when they are still in the x-risk enthusiast/learner phase and not working professionals.)
  • In order to understand x-risks, it would be a reasonable strategy for me to actually try to understand AI risk, because I already have some knowledge of AI.

Deciding to fix ignorance

In 2021 I was already quite heavily involved in organizing EA Finland and started to feel responsible both for how I communicated about EA to others and for whether EA was doing a good job as a movement. Of course, the topic of AI safety came up quite often in our group. At some point I noticed that several people had said something to me along these lines:

  • “It’s annoying that all people are talking about AI safety is this weird speculative stuff, it’s robbing attention from the real important stuff… But of course, I don’t know much about AI, just about regular programming.”
  • “This is probably an unfair complaint, but that article has some assumptions that don’t seem trivial at all. Why is he saying such personifying things such as ‘simply be thinking’, ‘a value that motivated it’, ‘its belief’? Maybe it’s just his background in philosophy that makes me too skeptical.”
  • “As someone with no technical background, interesting to hear that you are initially so skeptical about AI risk, since you work in AI and all. A lot of other people seem to think it is important. I’m not a technical person, so I wouldn’t know.”
  • “I have updated to taking AI risk less seriously since you don’t seem to think it is important, and I think you know stuff about AI. On the other hand [common friend] is also knowledgeable about AI, right? And he thinks it is super important. Well, I have admitted that I cannot bring myself to have an interest in machine learning anyway, so it does not really matter.”

So it seemed that people other than me were also hesitant to form opinions about AI safety, often saying they were not qualified to do so.

Then a non-EA algorithms researcher friend asked me for EA book recommendations. I gave him a list of EA books you can get from the public library, and he read everything including Superintelligence and Human Compatible. His impressions afterwards were: “Superintelligence seemed a bit nuts. There was something about using lie detection for surveillance to prevent people from developing super-AIs in their basements? But this Russell guy is clearly not a madman. I don’t know why he thinks AI risk is so important but [the machine learning professor in our university] doesn’t. Anyway, I might try to do some more EA related stuff in the future, but this AI business is too weird, I’m gonna stay out of it.”

By this point, it was pretty clear that I should no longer hide behind ignorance and perceived incapability. I had a degree in machine learning and had been getting paid for doing ML for several years, so even my impostor syndrome did not believe at this point that I would “not really know that much about AI”. Also, even if I sometimes felt not altruistic and not effective, I was obviously involved in the EA movement if you looked at the hours I spent weekly organizing EA Finland stuff.

I decided to read about AI risk until I understood why so many EAs were worried about it. Maybe I would be convinced. Maybe I would find a crux that explained why I disagreed with them. Anyway, it would be important for me to have a reasonable and good opinion on AI safety, since others were clearly listening to my hesitant rambling, and I certainly did not want to drive someone away from AI safety if it turned out to be important! And if AI safety was important but the AI safety field was doing wrong things, maybe I could notice errors and help them out.

Trying to fix ignorance

Reactions to Superintelligence

So in 2021, I gave reading Superintelligence another try. This time I actually finished it and posted my impressions in a small EA group chat. Free summary and translation: 

“Finally finished Superintelligence. Some of the contents were so weird that now I actually take AI risk way less seriously. 

Glad I did not read it back in 2019, because there was stuff that would have gone way over my head without having read EA stuff before, like the moral relevance of the suffering of simulated wild animals in evolution simulations. 

Bostrom seems to believe there are essentially no limits to technological capability. Even though I knew he is a hard-core futurist, some transhumanist stuff caught me by surprise, such as stating that from a person-affecting view it is better to speed up AI progress despite the risk. Apparently it’s ok if you accidentally turn into paper clips, since without an immortality-providing AI you’re gonna die anyway?

I wonder if Bill Gates and all those other folks who recommend the book actually read the complete thing. I suspect that there was still stuff that I did not understand because I had not read some of Bostrom's papers that would give the needed context. If I was not familiar with the vulnerable world hypothesis I would not have gotten the part where Bostrom proposes lie detection to prevent people from secretly developing AI.

Especially the literal alien stuff was a bit weird, Bostrom suggested taking examples from superintelligent AIs created by aliens, as they could have more human-like values than random AIs? I thought cosmic endowment was important because there were no aliens, doesn’t that ruin the 10^58 number?


Good thing about the book was that it explained well why the first solutions to AI risk prevention are actually not so easy to implement.

The more technical parts were not very detailed (referring to variables that are not defined anywhere etc), so I guess I should check out some papers about actually putting those values in the AI and see if they make sense or not.”

Upon further inspection, it turned out that the aliens of the Hail Mary approach were multiverse aliens, not regular ones. According to Bostrom, simulating all physics in order to approximate the values of AIs made by multiverse aliens was “less-ideal” but “more easily implementable”. This kind of stuff made it pretty hard for me to take seriously even the parts of the book that made more sense. (I don’t think simulating all physics sounds very implementable.)

I also remember telling someone something along the lines of: “I shifted from thinking that AI risk is an important problem but boring to solve to thinking that it is not a real problem, but thinking about possible solutions can be fun.” (CEV sounded interesting since I knew from math that social choice math is fun and leads to uncomfortable conclusions pretty fast. Sadly, a friend told me that Yudkowsky doesn’t believe in CEV himself anymore and that I should not spend too much time trying to understand it, so I didn’t.)

Another Superintelligence related comment from my chat logs right after reading it: “On MIRI’s webpage there was a sentence that I found a lot more convincing than this whole book: ‘If nothing yet has struck fear into your heart, I suggest meditating on the fact that the future of our civilization may well depend on our ability to write code that works correctly on the first deploy.’”

Reactions to Human Compatible

I returned Superintelligence to the library and read Human Compatible next. If you are familiar with both, you might already guess that I liked it way more. I wrote a similar summary of the book to the same group chat:

“Finished reading Human Compatible, liked it way more. Book was interestingly written and the technical parts seemed credible.

Seems like Russell does not really care about space exploration like Bostrom, and he explicitly stated he’s not interested in consciousness / “mind crime”.

A lot of AI risk was presented in relation to present-day AI, not paperclip stuff. Like recommendation engines; and there was a point that people are already using computers a lot, so if there was a strong AI on the internet that wanted to manipulate people, it could do it pretty easily.

Russell did not give any big numbers and generally tried not to sound scary. His perception of AGI is not godlike but it could still be superhuman in the sense that for example human-level AGIs could transmit information way faster to each other and be powerful when working in collaboration.

The book also explained what is still missing from AGI in today’s AI systems and why deep learning does not automatically produce AGIs.

According to Russell you cannot prevent the creation of AGI so you should try to put good values in it. You’d learn those values by observing people, but it is also hard because understanding people is exactly the hard thing for AIs. There was a lot of explanation on how this could be done and also what the problems of the approach are. Also there was talk about happiness and solving how people can be raised to become happy.

Other good stuff about the book includes: well written, technical stuff was explained, the equations were actually math, you did not need any special preliminary knowledge about tech or ethics, and there were a lot of citations from different sources.”
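As an aside, the “learn values by observing people” idea from that summary can be sketched in a heavily simplified toy form. This is my own illustration with made-up numbers, not Russell’s actual proposal (which involves uncertainty over objectives): an observer fits a preference weight to observed choices.

```python
import numpy as np

# Toy sketch of "learning values by observing behavior" (my own
# made-up numbers, not Russell's actual method).
# Each option has two features: (taste, healthiness).
options = np.array([[0.9, 0.1],   # tasty but unhealthy
                    [0.2, 0.8],   # healthy but bland
                    [0.5, 0.5]])  # middle ground

# Indices of the options a person was observed to choose.
observed_choices = [1, 1, 1, 2, 1]

# Candidate value functions: utility = w*taste + (1-w)*healthiness.
# Score each candidate weight w by how many observed choices it explains.
def explained(w):
    utilities = options @ np.array([w, 1 - w])
    predicted = int(np.argmax(utilities))
    return sum(c == predicted for c in observed_choices)

candidates = np.linspace(0, 1, 101)
w_hat = candidates[int(np.argmax([explained(w) for w in candidates]))]
# A small w_hat means the observer infers this person values
# healthiness over taste, without ever being told so directly.
```

Real value learning is of course vastly harder, for exactly the reason mentioned above: people are not simple utility maximizers, so behavior alone underdetermines their values.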

I also summarized how the book influenced my thoughts about AI safety as a concept:

“I now think that it is not some random nonsense but you can approach it in meaningful ways, but I still think it seems very far away from being practically/technically relevant with any method I’m familiar with, since it would still require a lot of jumps of progress before being possible. Maybe I could try reading some of Russell's papers, the book was well written so maybe they’ll be too.”

What did everyone else read before getting convinced?

In addition to the two books, I read a lot of miscellaneous links and blog posts that were recommended to me by friends from our local EA group. Often, link sharing was not super fruitful: a friend and I would disagree on something, they’d send me a link that supposedly explained their point better, but reading the resource did not resolve the disagreement. We’d try to discuss further, but often, I ended up just more confused and sometimes less convinced about AI safety. I felt like my friends were equally confused about why the texts that were so meaningful to them did not help in getting their point across.

It took me way too long to realize that I should have started by asking: “What did you read before you were convinced of the importance of AI risk?”

It turned out that at least around me, the most common answer was something like: “I always knew it was important and interesting, which is why I started to read about it.”

So at least for the people I know, it seemed that people were not convinced about AI risk because they had read enough about it, but because they had either always thought AI would matter, or because they had found the arguments convincing right away.

I started to wonder if this was the general case. I also became more curious about whether it is easier to become convinced of AI risk if you don’t have much practical AI experience beforehand. (On the other hand, learning about practical AI things did not seem to move people away from AI safety, either.) But my sample size was obviously small, so I had to find more examples to form a better hypothesis.

AGISF programme findings

My next approach in forming an opinion was to attend the EA Cambridge AGI Safety Fundamentals programme. I thought it would help me understand better the context of all those blog posts, and that I would get to meet other people with different backgrounds.

Signing up, I asked to be put in a group with at least one person with industry experience. This did not happen, but I don’t blame the organizers for it: at least based on how everyone introduced themselves in the course Slack, not many of the hundreds of attendees had such a background. Of course, not everyone on the program introduced themselves, but this still made me a little reserved.

So I used the AGISF Slack to find people who had already had a background in machine learning before getting into AI safety, and asked them what had originally convinced them. In the end, I got answers from three people who fit my search criteria. They mentioned different sources for first hearing about AI safety (80,000 Hours and LessWrong), but all three mentioned the same source that had deeply influenced them: Superintelligence.

This caught me by surprise, since I had had such a different reaction to Superintelligence myself. So maybe recommending Superintelligence as a first intro to AI safety is actually a good idea, since these people with impressive backgrounds had become active in the field after reading it. Maybe people who end up working in AI safety have the ability to either like Bostrom’s points about multiverse aliens, or to discard the multiverse aliens part because everything else is credible enough.

I still remain curious about the following:

  • Are there just not that many AI/ML/DS industry folks in general EA?
  • If there are, why have they not attended AGISF? (there could be a lot of reasons other than “not believing in AI risk”: maybe they already know everything about AI safety, or maybe they don’t have time)
  • Do people in AI research have different views on the importance of AI safety than people in industry? (But on the other hand, the researchers at my home university don’t seem interested in AI risk.)
  • If industry folks are not taking AI risk seriously as it is presented now, is it a problem? (Sometimes I feel that people in the AI safety community don’t think anything I do at work has any relevance, as they already jump to assuming that all the problems I face daily have been solved by some futuristic innovations or by just having a lot of resources. So maybe there is no need to cooperate with us industry folks?)
  • Is there something wrong with me if I find multiverse aliens unconvincing?

The inner misalignment was inside you all along

It was mentally not that easy for me to participate in the AGISF course. I already knew that debating my friends on AI safety could be emotionally draining, and now I was supposed to talk about the topic with strangers. I noticed I was reacting quite strongly to the reading materials and classifying them in a black-and-white way as either “trivially true” or “irrelevant, not how anything works”. Obviously this is not a useful way of thinking, and it stressed me out. I wish I had found the materials interesting and engaging, like other participants seemingly did.

During the first couple of meetings with my cohort, I was mostly silent and observing, but as the course progressed, I became more talkative. I also started to get nicer feedback from my local EA friends on my AI safety views – less asking me to read more, and more asking me to write my thoughts down, because they might be interesting for others as well.

So, the programme was working as intended, and I was now actually forming my own views on AI safety and engaging with others interested in the field in a productive way? It did not feel like that. Talking about AI safety with my friends still made me inexplicably anxious, and after cohort meetings, I felt relieved, something like “phew, they didn’t notice anything”.

This feeling of relief was the most important hint that helped me realize what I was doing. I was not participating in AI safety discussions as myself anymore, maybe hadn’t for a long time, but rather in a “me but AI safety compatible” mode.

In this mode, I seem more like a person who:

  • can switch contexts fast between different topics
  • has relaxed assumptions on what future AI systems can or cannot do
  • makes comparisons between machine learning and human/animal learning with ease
  • is quite confident in her ability to implement practical machine learning methods
  • knows what AI safety slang to use in what context
  • makes a lot of references to stuff she has read
  • talks about AI safety as something “we” should work on
  • likes HPMOR
  • does not mind anthropomorphization that much and can name this section “the inner misalignment was inside you all along” because she thinks it’s funny

All in all, these are traits I could plausibly have, and I think other people in the AI safety field would like me more if I had them. Of course this actually doesn’t have anything to do with the real concept of inner misalignment: it is just the natural phenomenon of people putting up a different face in different social contexts. Sadly, this mode is already quite far from how I really feel. More alarmingly, if I am discussing my views in this mode, it is hard for me to access my more intuitive views, so the mode prevents me from updating them: I only update the mode’s views.

Noticing the existence of the mode does not automatically mean I can stop slipping into it, because it has its uses. Without it, it would be way more difficult to even have conversations with AI safety enthusiasts, because they might not want to deal with my uncertainty all the time. With this mode, I can have conversations and gain information, and that is valuable even if it is hard to connect the information to what I actually think.

However, I plan to see if I can get some people I personally know to talk with me about AI safety while staying aware of how easily this mode takes over. Maybe we could have a conversation where the mode notices it is not needed and allows me to connect to my real intuitions, even if they are messy and probably not very pleasant for others to follow. (Actionable note to myself: ask someone to do this with me.)

AI safety enthusiasts and me

Now that I have read about AI safety and participated in the AGISF program, I feel like I know, at least on the surface, most of the topics and arguments many AI safety enthusiasts know. Annoyingly, I still don’t know why many other people are convinced about AI safety and I am not. There are probably some differences in what we hold true, but I suspect a lot of the confusion comes from things other than straight facts and recognized beliefs.

There are social and emotional factors involved, and I think most of them can be clustered into the following three topics:

  • communication: I still feel like I often don’t know what people are talking about
  • differences in thinking: I suspect there is some difference in intuition between me and people who easily take AI risk seriously, but I’m not sure what it is
  • motivated reasoning: it is not a neutral task to form an opinion on AI risk

Next, I’ll explain the categories in more detail.

Communication differences

When I try to discuss AI safety with others while remaining “myself” as much as I can, I notice the following interpretations/concerns:

  • I don’t know what technical assumptions others have. How do we know what each hypothetical AI model is capable of? Often this question is accompanied by “why is it not mentioned what level of accuracy this would need” or “where does the data come from”. I can understand Bostrom’s super-AI scenarios where you assume that everything is technically possible, but I’m having trouble relating some AI safety concepts to present-day AI, such as reinforcement learning.
  • Having read more about AI safety, I now know more of the technical terms in the field. It happens quite often that AI safety enthusiasts explain a term very differently from what I think it means, and if I ask for clarification, they might notice they are actually not that sure what the term refers to. I’m not trying to complain about what terms people are using, but I think this might contribute to me not understanding what is going on, since people seem to be talking past each other quite a bit anyway.
  • A lot of times, I don’t understand why AI safety enthusiasts want to use anthropomorphizing language when talking about AI, especially since a lot of people in the scene seem to be worried that it might lead to a too narrow understanding of AI. (For example, in the AGISF programme curriculum, this RL model behavior was referred to as AI “deceiving” the developers, while it actually is “human evaluators being bad at labeling images”. I feel it is important to be careful, because there are also concepts like deceptive alignment, where the deception happens on “purpose”. I guess this is partially aesthetic preference, since anthropomorphization seems to annoy some people involved in AI safety as well. But partly it probably has to do with real differences in thinking: if you really perceive current RL agents as agents, it might seem that they are tricking you into teaching them the wrong thing.)
  • I almost never feel like the person I am talking to about AI safety is really willing to consider any of my concerns as valid information: they are just estimating whether it is worth trying to talk me over. (At least the latter part of this interpretation is probably false in most cases.)
  • I personally dislike it when people answer my questions by just sending a link. Often the link does not clear my confusion, because it is rare that the link addresses the problem I am trying to figure out. (Not that surprising given that reading AI safety materials by my own initiative has not cleared that many confusions either.)
  • If I tell people I have read the resource they were pointing to, they might just “give up” on the discussion and start explaining that it is ok for people to have different views on topics, and I still don’t get why they found the resource convincing/informative. I would prefer that they elaborate on why they thought the resource answers the concern/confusion I had.
  • I don’t want to question AI safety too much around AI safety enthusiasts, since I feel like I might insult them. (This is probably not helping anyone, and I think AI safety enthusiasts assume anyway that I don’t believe AI safety is important.)
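To make the mislabeling point above concrete, here is a toy sketch (entirely my own made-up numbers, not the actual experiment from the curriculum) of how this kind of behavior falls out of ordinary optimization, with no “deception” anywhere in the agent:

```python
# Toy sketch (my own made-up numbers): an agent optimizing imperfect
# human feedback. The "evaluator" labels success from appearance only,
# so the learned behavior exploits the label; no intent is involved.
actions = {
    "grab the ball":                 {"looks_grabbed": True,  "is_grabbed": True},
    "hover between ball and camera": {"looks_grabbed": True,  "is_grabbed": False},
    "do nothing":                    {"looks_grabbed": False, "is_grabbed": False},
}

# How hard each behavior is to discover/execute (made-up numbers).
effort = {"grab the ball": 0.5, "hover between ball and camera": 0.1,
          "do nothing": 0.0}

# The human evaluator can only judge from the camera image.
score = lambda a: (1.0 if actions[a]["looks_grabbed"] else 0.0) - effort[a]

# Hovering gets full reward at lower effort, so it is a perfectly
# rational optimum of the *specified* signal: "human evaluators being
# bad at labeling images", not the agent "deceiving" anyone.
best = max(actions, key=score)
```

The distinction the bullet draws still stands: nothing in this optimizer resembles deceptive alignment, where a system would model the training process and act differently under observation on “purpose”.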

Probably a lot of the friction just comes from me not being used to the communication style people in AI safety are used to. But I think some of it might come from the emotional response of the AI safety enthusiast I am talking to, such as “being afraid of saying something wrong, causing Ada to further deprioritize AI safety” or “being tired of explaining the same thing to everyone who asks it” or even “being afraid of showing uncertainty since it is so hard to ever convince anyone of the importance of AI safety”. For example, some people might share a link instead of explaining a concept in their own words to save time, but some might do it to avoid saying something wrong.

I wish I knew how to create a discussion where the person convinced of AI safety can drop the things that are “probably relevant” or “expert opinions” and focus on just clearly explaining to me what they currently believe. Maybe then, I could do the same. (Actionable note to myself: try asking people around me to do this.)

Differences in thinking

I feel like I lack the ability to model what AI safety enthusiasts are thinking or what they believe is true. This happens even when I talk with people I know personally and who have a similar educational background, such as other CS/DS majors in EA Finland. It is frustrating. The problem is not the disagreements: if I cannot model others, I don’t know if we are even disagreeing or not.

This is not the first time in my life that everyone else has seemed to behave strangely and irrationally, and every time before, there has been an explanation. Mostly, it later turned out that others were experiencing something I was not, or I was experiencing something they were not. I suspect something similar is going on between me and AI safety folks.

It would be very valuable to know what this difference in thinking is. Sadly, I have no idea. The only thing I have is a long list of possible explanations that I think are false:

  • “ability to feel motivated about abstract things, such as x-risk”. I think I am actually very inclined to get emotional about abstract things, otherwise I would probably not like EA. Some longtermists like to explain neartermists being neartermists by assuming “it feels more motivating to help people in a concrete way”. To me, neartermist EA does not feel very concrete either. If I happen to look at my GiveWell donations, I do not think about how many children were saved. I might think “hmm, this number has a nice color” or “I wish I could donate in euro so that the donation would be some nice round number”. But most of the time, I don’t think about it at all. On the other hand, preventing x-risk sounds very motivational. You are literally saving the world – who doesn’t want to do that? Who doesn’t want to live in the most important century and be one of the few people who realize this and are pushing the future to a slightly better direction?
  • “maybe x-risk just does not feel that real to you”. This might be partially true, in the sense that I do not go about my day with the constant feeling that all humanity might die this century. But this does not explain the difference, because I know other people who also don’t actively feel this way and are still convinced about AI risk.
  • “you don’t experience machine empathy”. It is the opposite: I experience empathy towards my Roomba, towards my computer, towards my Python scripts (“I’m so sorry I blame you for not working when it is me who writes the bugs”) and definitely towards my machine learning models (“oh stupid little model, I wish you knew that this part is not what I want you to focus on”). Because of this tendency, I constantly need to remind myself that my scripts are not human, my Roomba does not have an audio input; and GPT-3 cannot purposefully lie to me, for it does not know what is true or false.
  • “you might lack mathematical ability”. I can easily name 15 people I personally know who are certainly more mathematically talented than me, but only one of them has an interest in AI safety; and I suspect that I have, if not more mathematical talent, then at least more mathematical endurance than some AI safety enthusiasts I personally know.
  • “you are thinking too much inside the box of your daily work”. This might be partially true, but I feel like I can model what Bostrom thinks of AI risk, and it is very different from my daily work. But I find it really difficult to think somewhere between concrete day-to-day AI work and futuristic scenarios. I have no idea how others know what assumptions hold and what don’t.
  • “you are too fixated on the constraints of present-day machine learning”. If you think AGI will be created by machine learning, some of the basic constraints must hold for it, and a lot of AI safety work seems to (reasonably) be based on this assumption as well. For example, a machine learning model cannot learn patterns that are not present in the training data. (Sometimes AI safety enthusiasts seem to confuse this statement with “a machine learning model cannot do anything of which it does not have a direct example in the training data”, which is obviously not true: narrow AI models generalize beyond their direct examples all the time.)
  • “motivated reasoning: you are choosing the outcome before looking at the facts”. Yes, motivated reasoning is an issue, but not necessarily in the direction you think it is.
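The training-data distinction a couple of points above (no pattern in the data vs. no direct example in the data) can be illustrated with a toy fit. This is my own sketch with made-up numbers, not from any source:

```python
import numpy as np

# The pattern IS in the training data: y = 2x + 1, sampled on [0, 10].
x_train = np.linspace(0, 10, 50)
y_train = 2 * x_train + 1
coeffs = np.polyfit(x_train, y_train, deg=1)

# There is no direct training example of x = 100, but the linear
# pattern is in the data, so the prediction is still correct (~201).
pred = np.polyval(coeffs, 100.0)

# The pattern is NOT in the training data: suppose the true function
# bends to y = 2x + 1 + (x - 10)**2 outside the sampled range. Nothing
# in the training set encodes that bend, so the same fit misses it badly.
true_at_100 = 2 * 100 + 1 + (100 - 10) ** 2   # 8301
```

In neither case does the model have a direct example of x = 100; what differs is whether the pattern governing that input was present in the data it saw.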

Motivated reasoning

Do I want AI risk to be an x-risk? Obviously not. It would be better for everyone to have less x-risks around, and it would be even better if the whole concept of x-risk was false since it would somehow not be possible to have such extreme catastrophes ever happen. (I don’t think anyone thinks that, but it would be nice if it was true.)

But: If you are interested in making the world a better place, you have to do it by either fixing something horrible that is already going on or preventing something horrible from happening. It would be awfully convenient if I could come up with a cause that was:

  • Very important for the people who live today and who will ever be born
  • Very neglected: nobody other than this community understands its importance to the full extent
  • But this active community is working on fixing the problem; a lot of people want to cooperate with others in the community, and there is funding for any reasonable project in the field
  • There are people I admire working in this field (famous people like Chris Olah; and this one EA friend who started to work in AI governance research during the writing of this text).
  • Also a lot of people whose texts have influenced me seem to think that this is of crucial importance.
  • Probably needs rapid action, there are not enough people with the right background, so members are encouraged to get an education or experience to learn the needed skills
  • I happen to have relevant experience already (I’m not a researcher but I do have a Master’s degree in ML and 5 years of experience in AI/DS/ML; my specialization is NLP which right now seems to be kind of a hot topic in the field.)

All of this almost makes me want to forget that I somehow still failed to be convinced by the importance of this risk, even when reading texts written by people who I otherwise find very credible.

(And saying this aloud certainly makes me want to forget the simple implication: if they are wrong about this, are they still right about the other stuff? Is the EA methodology even working? What problems are there with other EA cause areas? It would seem unreasonable to think EA got every single detail about everything right. But this is a big thing, and getting bigger. What if they are mistaken? What do I do if they are?)

The fear of the answer

Imagine that I notice that AI safety is, in fact, of crucial importance. What would this mean?

There would be some social consequences: almost everyone I work with, and everyone who taught me anything about AI, would be wrong, and most of my friends who are not in EA would probably not take me seriously. Among my EA friends, the AI safety non-enthusiasts would probably politely stop debating me on AI safety matters and decide that they don’t understand enough about AI to form an informed opinion on why they disagree with me. But maybe the enthusiasts would let me try to do something about AI risk, and we’d feel like we are saving the world, since it would be our best estimate that we are.

The practical consequences would most likely be ok, I think: I would probably try to switch jobs, and if that didn’t work out, shift the focus of my EA volunteering to AI safety related things. Emotionally, I think I would be better off if I could press a button that would make me convinced about AI safety on a deep rational understanding level. This might sound funny, because being very worried about neglected impending doom does not seem emotionally very nice. But if I want to be involved with EA, it still might be the easiest route.

So, what if it turns out I think almost everyone in EA is wrong about AI risk being a priority issue? The whole movement would have estimated the importance of AI risk wrong, and would be getting more and more wrong as AI safety gains traction. It would mean something has to be wrong in the way the EA movement makes decisions, since the decision-making process had produced this great error. It would also mean that every time I interacted with another person in the movement, I would have to choose between stating my true opinion about AI safety, at the risk of ruining the possibility of cooperating with that person, and being dishonest.

Maybe this would cause me to leave the whole EA movement. I don’t want to be part of a movement that is supposed to use reason and evidence to find the best ways to do good but is so bad at it that it would have made such a great error. I would not have much hope of fixing the mistake from the inside, since I’m just a random person and nobody has any reason to listen to me. Somebody with a different personality type would maybe start a whole campaign against AI safety research efforts, but I don’t think I would ever do this, even if I believed these efforts were wrong.

Friends and appreciation

Leaving the EA movement would be bad, because I really like EA. I want to do good things and I feel like EA is helping me with that. 

I also like my EA friends, and I am afraid they will think bad things about me if I don’t have good opinions on AI safety. To be clear, I don’t think my EA friends would expect me to agree with them on everything, but I do think they expect me to be able to develop reasonable and coherent opinions. Like, “you don’t have to take AI safety seriously, but you have to be able to explain why”. I am also worried my friends will think that I do not actually care about the future of humanity, or that I don’t have the ability to care about abstract things, or that I worry too much about things like “what do my friends think of me”.

On a related note, writing this whole text with the idea of sharing it with strangers scared me too. I felt like people would think I am not-EA-like, or would get mad at me for admitting I did not like Superintelligence. It would be bad if I decided that in the future I actually wanted to work on AI safety, but nobody wanted to cooperate with me because I had voiced uncertainties before. I have heard people react to EA criticisms with “this person obviously did not understand what they are talking about”, and I feel like many people might have a similar reaction to this text too, even if my point is not to criticize, but just to reflect on my own opinions.

I cannot ask the nebulous concept of the EA community about this, but luckily, reaching out to my friends is way easier. I decided to ask them if they would still be my friends even if I decided my opinion on AI safety was “I don’t know, and I don’t want to spend more time finding out, so I’m going to default to thinking it is not important”.

We discussed it for a few hours, and it turned out my friends would still want to be my friends and would still prefer me to be involved in EA and in our group, at least unless I started to actively work against AI safety. Also, they would actually not be that surprised if this were my opinion, since they feel a lot of people have fuzzy opinions about things.

So I think maybe it is not the expectation of my friends that is making me want to have a more coherent and reasonable opinion on AI safety. It is my own expectation.

What I think and don’t think of AI risk

What I don’t think of AI risk

I’m not at all convinced that there cannot be any risk from AI, either. (Formulated this strongly, this would be a stupid thing to be convinced about.)

More precisely, reading all the AI safety material taught me that there are very good counterarguments to the most common arguments stating that solving AI safety would be easy. These arguments were not that difficult for me to internalize, because I am generally pessimistic: it seems reasonable that if building strong AI is difficult, then building safe strong AI should be even more difficult.

In my experience, it is hard to get narrow AI models to do what you want them to do. I probably would not, for example, step into a spaceship steered by a machine learning system, since I have no idea how you could prove that the statistical model is doing what it is supposed to do. Steering a spaceship sounds very difficult, but still a lot easier than understanding and correctly implementing “what humans want”, because even the problem statement is very fuzzy and difficult for humans as well.

It does not make sense to me that any intelligent system would learn human-like values “magically” as a by-product of being really good at optimizing for something else. It annoys me that the most popular MOOC produced by my university states:

“The paper clip example is known as the value alignment problem: specifying the objectives of the system so that they are aligned with our values is very hard. However, suppose that we create a superintelligent system that could defeat humans who tried to interfere with its work. It’s reasonable to assume that such a system would also be intelligent enough to realize that when we say “make me paper clips”, we don’t really mean to turn the Earth into a paper clip factory of a planetary scale.”

I remember a point where I would have said “yeah, I guess this makes sense, but some people seem to disagree, so I don’t know”. Now I can explain why it is not reasonable: being intelligent enough to realize what we meant does not make the system motivated to act on that realization, if its objective only rewards making paper clips. So in that sense, I have learned something. (Actionable note to self: contact the professor responsible for the course and ask him why they put this phrase in the material. He is a very nice person, so I think he would at least explain it to me.)
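Concretely, the gap in the MOOC quote can be shown with a toy optimizer (my own sketch, with made-up names and numbers): what the system “realizes” lives in its world model, but its choices are driven only by the specified objective.

```python
# Toy sketch (my own made-up numbers): knowing what humans meant does
# not change an optimizer's choice unless the objective refers to it.
plans = {
    "make a modest batch of clips": {"clips": 1_000,  "earth_intact": True},
    "planetary clip factory":       {"clips": 10**12, "earth_intact": False},
}

# The system can perfectly well "realize" this fact; it simply sits
# in the world model as knowledge.
humans_prefer = lambda outcome: outcome["earth_intact"]

# But the specified objective only counts clips...
objective = lambda name: plans[name]["clips"]

# ...so the argmax ignores what humans prefer, however well "understood".
best = max(plans, key=objective)
```

Realizing what “make me paper clips” really meant changes nothing here, because that realization never enters the objective being maximized.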

But I am a bit at a loss on why people in the AI safety field think it is possible to build safe AI systems in the first place. I guess as long as it has not been proven that the properties of safe AI systems contradict each other, you can assume it is theoretically possible. When it comes to ML, though, the best performance achievable in practice is sadly often worse than the theoretical best.

Pessimism about the difficulty of the alignment problem is quite natural to me. I wonder if some people who are more optimistic about technology in general find AI safety materials so engaging because they at some point thought AI alignment could be a lot easier than it is. I find it hard to empathize with the people Yudkowsky first designed the AI box thought experiment for. As described in the beginning of this text, I would not spontaneously think that a superintelligent being was unable to manipulate me if it wanted to.

What I might think of AI risk

As you might have noticed, it is quite hard for me to form good views on AI risk. But I have some guesses of views that might describe what I think:

  • Somehow the focus in the AI risk scene currently seems quite narrow? I feel like while “superintelligent AI would be dangerous” makes sense if you believe superintelligence is possible, it would be good to look at other risk scenarios from current and future AI systems as well.
  • I think some people are doing what I just described, but since the field of AI safety is still a mess, it is hard to know what work relates to what and what people mean by a certain terminology.
  • I feel like it would be useful to write down limitations/upper bounds on what AI systems are able to do if they are not superintelligent and don’t, for example, have the ability to simulate all of physics (maybe someone has done this already, I don’t know).
  • To evaluate the importance of AI risk against other x-risk I should know more about where the likelihood estimates come from. But I am afraid to try to work on this because so far it has been very hard to find any numbers. (Actionable note to myself: maybe still put a few hours into this even if it feels discouraging. It could be valuable.)
  • I feel like looking at the current progress in AI is not a good way to make any estimates on AI risk. I’m fairly sure deep learning alone will not result in AGI. (Russell thinks this too, but I had this opinion before reading Human Compatible, so it is actually based on gut feeling.)
  • In my experience, data scientists tend to be people who have thought about technology and ethics before. I think a lot of people in the field would be willing to hear actionable and well explained ways of making AI systems more safe. But of course this is very anecdotal.

What now?

Possible next steps

To summarize, so far I have tried reading about AI safety to either understand why people are so convinced about it or find out where we disagree. This has not worked out. By writing this text, it became clear to me that there are social and emotional issues preventing me from forming an opinion about AI safety. I have already started working on them by discussing them with my friends.

I have already mentioned some actionable points throughout the text in the relevant contexts. The most important one:

  • find a person who is willing to discuss AI safety with me: explain their own actual thinking, help me stay aware of when I might be slipping into my “AI safety compatible” mode, and listen to my probably very messy views

If you (yes, you!) are interested in a discussion like that, feel free to message me anytime!

Other things I already mentioned were:

  • contact my former professor and ask him what he thinks of AI risk, and why value alignment is described as emerging naturally from intelligence in the course material
  • set aside time to find out and understand how x-risk likelihoods are calculated

Additional things I might do next are:

  • read The Alignment Problem (I don’t think it will provide me with that much more useful info, but I want to complete reading all three of the AI alignment books people usually recommend)
  • write a short intro to AI safety in Finnish to clear my head and establish personal vocabulary for the field in my native language (I want to write for an audience, but if I don’t like the result, then I will just not publish it anywhere)

Why the answer matters

I have spent a lot of time trying to figure out what my view on AI safety is, and I still don’t have a good answer. Why not give up, decide to remain undecided and do something else?

Ultimately, this has to do with what I think the purpose of EA is. You need to know what you are doing, because if you don’t, you cannot do good. You can try, but in the worst case you might end up causing a lot of damage. 

And this is why EA is a license to care: the permission to stop resisting the urge to save the world, because it promises that if you are careful and plan ahead, you can do it in a way that actually helps. Maybe you can’t save everyone. Maybe you’ll make mistakes. But you are allowed to do your best, and regardless of whether you are a Good Person™ (or an altruistic and effective person) it will help.

As long as I don’t know how important AI safety is, I am not going to let myself actually care about it, only about estimating its importance. 

I wonder if this, too, is risk aversion – a lot of AI safety enthusiasts seem to emphasize that you have to be able to cope with uncertainty and take risks if you want to do the most good. Maybe this attitude towards risk and uncertainty is actually the crux between me and AI safety enthusiasts that I’m having such a hard time finding?

But I’m obviously not going to believe something I do not believe just to avoid seeming risk averse. Until I can be sure enough that the action I’m taking is going in the right direction, I am going to keep being careful. 


I really liked this post. I've often felt frustrated by how badly the alignment community has explained the problem, especially to ML practitioners and researchers, and I personally find neither Superintelligence nor Human Compatible very persuasive. For what it's worth, my default hypothesis is that you're unconvinced by the arguments about AI risk in significant part because you are applying an unusually high level of epistemic rigour, which is a skill that seems valuable to continue applying to this topic (including in the case where AI risk isn't important, since that will help us uncover our mistake sooner). I can think of some specific possibilities, and will send you a message about them.

The frustration I mentioned was the main motivation for me designing the AGISF course; I'm now working on follow-up material to hopefully convey the key ideas in a simpler and more streamlined way (e.g. getting rid of the concept of "mesa-optimisers"; clarifying the relationship between "behaviours that are reinforced because they lead to humans being mistaken" and "deliberate deception"; etc). Thanks for noting the "deception" ambiguity in the AGI safety fundamentals curriculum - I've replaced it with a more careful claim (details in reply to this comment).

Old: "The techniques discussed this week showcase a tradeoff between power and alignment: behavioural cloning provides the fewest incentives for misbehaviour, but is also hardest to use to go beyond human-level ability. Whereas reward modelling can reward agents for unexpected behaviour that leads to good outcomes (as long as humans can recognise them) - but this also means that those agents might find and be rewarded for manipulative or deceptive actions. Christiano et al. (2017) provide an example of an agent learning to deceive the human evaluator; and Stiennon et al. (2020) provide an example of an agent learning to “deceive” its reward model. Lastly, while IRL could in theory be used even for tasks that humans can’t evaluate, it relies most heavily on assumptions about human rationality in order to align agents."

New: "The techniques discussed this week showcase a tradeoff between power and alignment: behavioural cloning provides the fewest incentives for misbehaviour, but is also hardest to use to go beyond human-level ability. Reward modelling, by contrast, can reward agents for unexpected behaviour that leads to good outcomes - but also rewards agents for manipulative or dec... (read more)
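(To make the reward-modelling idea in the curriculum text concrete, here is a toy sketch – entirely my own illustration, not from the course material – of fitting a reward model to pairwise human preferences, roughly in the spirit of Christiano et al. (2017). The linear model, the two-feature setup, and all names are simplifying assumptions.)

```python
import numpy as np

# Toy sketch of preference-based reward modelling: a linear reward model
# over trajectory features is fit so that it agrees with pairwise "human"
# preference labels, via the Bradley-Terry likelihood.

rng = np.random.default_rng(0)

def true_human_reward(features):
    # Hidden judgement the model tries to recover (assumed for the demo).
    return 2.0 * features[0] - 1.0 * features[1]

def model_reward(weights, features):
    return weights @ features

def preference_prob(weights, traj_a, traj_b):
    # Bradley-Terry model: P(human prefers trajectory A over B).
    diff = model_reward(weights, traj_a) - model_reward(weights, traj_b)
    return 1.0 / (1.0 + np.exp(-diff))

# Collect synthetic pairwise comparisons instead of asking real humans.
pairs = []
for _ in range(500):
    a, b = rng.normal(size=2), rng.normal(size=2)
    label = 1.0 if true_human_reward(a) > true_human_reward(b) else 0.0
    pairs.append((a, b, label))

# Fit by gradient ascent on the log-likelihood of the comparisons.
weights = np.zeros(2)
lr = 0.1
for _ in range(200):
    grad = np.zeros(2)
    for a, b, label in pairs:
        p = preference_prob(weights, a, b)
        grad += (label - p) * (a - b)
    weights += lr * grad / len(pairs)

print(weights)  # learned direction should roughly match (2, -1)
```

The point of the sketch is the tradeoff the curriculum describes: the model is rewarded for whatever the evaluator *labels* as better, so anything that fools the labeller gets reinforced just as readily as genuinely good behaviour.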

This seems plausible to me, based on:

  • The people I know who have thought deeply about AI risk and come away unconvinced often seem to match this pattern.
  • I think some of the people who care most about AI risk apply a lower level of epistemic rigour than I would, e.g. some seem to have much stronger beliefs about how the future will go than I think can be reasonably justified.
Ada-Maaria Hyvärinen
Interesting to hear your personal opinion on the persuasiveness of Superintelligence and Human Compatible! And thanks for designing the AGISF course, it was useful.
Superintelligence doesn't talk about ML enough to be strongly persuasive given the magnitude of the claims it's making (although it does a reasonable job of conveying core ideas like the instrumental convergence thesis and orthogonality thesis, which are where many skeptics get stuck). Human Compatible only spends, I think, a couple of pages actually explaining the core of the alignment problem (although it does a good job at debunking some of the particularly bad responses to it). It doesn't do a great job at linking the conventional ML paradigm to the superintelligence paradigm, and I don't think the "assistance games" approach is anywhere near as promising as Russell makes it out to be.
Rohin Shah
I wish you would summarize this disagreement with Russell as "I think neural networks / ML will lead to AGI whereas Russell expects it will be something else". Everything else seems downstream of that. (If I had similar beliefs about how we'd get to AGI as Russell, and I was forced to choose to work on some existing research agenda, it would be assistance games. Though really I would prefer to see if I could transfer the insights from neural network / ML alignment, which might then give rise to some new agenda.) This seems particularly important to do when talking to someone who also thinks neural networks/ ML will not lead to AGI.

FWIW, I don't think the problem with assistance games is that it assumes that ML is not going to get to AGI. The issues seem much deeper than that (mostly of the "grain of truth" sort, and from the fact that in CIRL-like formulations, the actual update-rule for how to update your beliefs about the correct value function is where 99% of the problem lies, and the rest of the decomposition doesn't really seem to me to reduce the problem very much, but instead just shunts it into a tiny box that then seems to get ignored, as far as I can tell).

Rohin Shah
Sounds right, and compatible with everything I said? (Not totally sure what counts as "reducing the problem", plausibly I'd disagree with you there.) Like, if you were trying to go to the Moon, and you discovered the rocket equation and some BOTECs said it might be feasible to use, I think (a) you should be excited about this new paradigm for how to get to the Moon, and (b) "99% of the problem" still lies ahead of you, in making a device that actually uses the rocket equation appropriately. Is there some other paradigm for AI alignment (neural net based or otherwise) that you think solves more than "1% of the problem"? I'll be happy to shoot it down for you. This is definitely a known problem. I think you don't see much work on it because (a) there isn't much work on assistance games in general (my outsider impression is that many CHAI grad students are focused on neural nets), and (b) it's the sort of work that is particularly hard to do in academia.
Some abstractions that feel like they do real work on AI Alignment (compared to CIRL stuff):

  • Inner optimization
  • Intent alignment vs. impact alignment
  • Natural abstraction hypothesis
  • Coherent Extrapolated Volition
  • Instrumental convergence
  • Acausal trade

None of these are paradigms, but all of them feel like they do substantially reduce the problem, in a way that doesn't feel true for CIRL. It is possible I have a skewed perception of actual CIRL stuff, based on your last paragraph though, so it's plausible we are just talking about different things.
Rohin Shah
Huh. I'd put assistance games above all of those things (except inner optimization but that's again downstream of the paradigm difference; inner optimization is much less of a thing when you aren't getting intelligence through a giant search over programs). Probably not worth getting into this disagreement though.
I don't think that my main disagreement with Stuart is about how we'll reach AGI, because critiques of his approach, like this page, don't actually require any assumption that we're in the ML paradigm. Whether AGI will be built in the ML paradigm or not, I think that CIRL does less than 5%, and probably less than 1%, of the conceptual work of solving alignment; whereas the rocket equation does significantly more than 5% of the conceptual work required to get to the moon. And then in both cases there's lots of engineering work required too. (If AGI will be built in a non-ML paradigm, then getting 5% of the way to solving alignment probably requires actually making claims about whatever the replacement-to-ML paradigm is, which I haven't seen from Stuart.) But Stuart's presentation of his ideas seems wildly inconsistent with both my position and your position above (e.g. in Human Compatible he seems way more confident in his proposal than would be justified by having gotten even 5% of the way to a solution).
Rohin Shah
I agree that that single critique doesn't depend on the ML paradigm. If that's your main disagreement then I retract my claim that it's downstream of paradigm disagreements.

What's your probability that if we really tried to get the assistance paradigm to work then we'd ultimately conclude it was basically doomed because of this objection? I'm at like 50%, such that if there were no other objections the decision would be "it is blindingly obvious that we should pursue this". I might disagree with this but I don't know how you're distinguishing between conceptual and non-conceptual work. (I'm guessing I'll disagree with the rocket equation doing > 5% of the conceptual work.)

I don't think this is particularly relevant to the rest of the disagreement, but this is explicitly discussed in Human Compatible! It's right at the beginning of my summary of it! Are you reacting to his stated beliefs or the way he communicates? If you are reacting to his stated beliefs: I'm not sure where you get this from. His actual beliefs (as stated in Human Compatible) are that there are lots of problems that still need to be solved. From my summary: If you are reacting to how he communicates: I don't know why you expect him to follow the norms of the EA community and sprinkle "probably" in every sentence. Those are not the norms that the broader world operates under; he's writing for the broader world.

It makes me quite sad that in practice EA has become so much about specific answers (work on AI risk, donate to this charity, become vegan) to the question of how we effectively make the world a better place, that not agreeing with a specific answer can create so much friction. In my mind EA really is just about the question itself, and the world is super complicated, so we should be skeptical of any particular answer.

If we accidentally start selecting for people that intuitively agree with certain answers (which it sounds like we are doing, I know people that h... (read more)

Thanks for this honest account; I think it's extremely helpful to see where we're failing to communicate. It also took me a long time (like 3 years) to really understand the argument and to act on it.

At the risk of being another frustrating person sending you links: I wrote a post which attempts to communicate the risk using empirical examples, rather than grand claims about the nature of intelligence and optimisation. (But obviously the post needs to extrapolate from its examples, and this extrapolation might fall foul of the same things that make you sceptical / confused already.) Several people have found it more intuitive than the philosophical argument.

Happy to call to discuss!

Ada-Maaria Hyvärinen
Generally, I find links a lot less frustrating if they are written by the person who sends me the link :) But now I have read the link you gave and don't know what I am supposed to do next, which is another reason I sometimes find linksharing a difficult means of communication. Like, do I comment on specific parts of your post, or describe how reading it influenced me, or how does the conversation continue? (If you find my reaction interesting: I was mostly unmoved by the post, I think I had seen most of the numbers and examples before, there were some sentences and extrapolations that were quite off-putting for me, but I think the "minimalistic" style was nice.) It would be nice to call and discuss if you are interested.
Well, definitely tell me what's wrong with the post - and optionally tell me what's good about it (: There's a Forum version here where your comments will have an actual audience, sounds valuable.

Meta-level comment: this post was interesting, very well written and I could empathize with a lot of it, and in fact, it inspired me to make an account on here in order to comment : )

Object-level comment (ended up long, apologies!): My personal take is that a lot of EA literature on AI Safety (eg: forum articles) uses terminology that overly anthropomorphizes AI and skips a lot of steps in arguments, assuming a fair amount of prerequisite knowledge/ familiarity with jargon. When reading such literature, I try to convert the "EA AI Safety language" into "normal language" in my head in order to understand the claims better. Overall my answer to “why is AI safety important” is (currently) the following:

  • Humans are likely to develop increasingly powerful AI systems in order to solve important problems / provide in-demand services.
  • As AI systems become more powerful they can do more stuff / effect more change in the world. Somewhat because we’ll want them to. Why develop a powerful AI if not to do important/hard things?
  • As AI systems become more powerful it will be harder for us to ensure they do what we want them to do when deployed. Imo this claim needs more justification, and I will ela
... (read more)

Super interesting and thoughtful post, and also exceptionally well written. I can relate to many points here, and the timing of this post is perfect in the context of discussions we have at Israel and of the ongoing developments in the global movement.

I may respond to some object-level matters later, but just wanted to say that I really hope you keep on thinking and writing! It'd particularly be interesting to read whatever "Intro to AI Safety" you may end up writing :)

Ada-Maaria Hyvärinen
Glad it may have invoked some ideas for any discussions you might be having at Israel :) For us in Finland, I feel like I at least personally need to get some more clarity on how to balance EA movement building efforts and possible cause prioritization-related differences between movement builders. I think this is non-trivial because forming a consensus seems hard enough. Curious to read any object-level response if you feel like writing one! If I end up writing any "Intro to AI Safety" thing it will be in Finnish so I'm not sure if you will understand it (it would be nice to have at least one coherent Finnish text about it that is not written by an astronomer or a paleontologist but by some technical person). 

I'm certain EA would welcome you, whether you think AI is an important x-risk or not.

If you do continue wrestling with these issues, I think you're actually extremely well placed to add a huge amount of value as someone who is (i) ML expert, (ii) friendly/sympathetic to EA, (iii) doubtful/unconvinced of AI risk. It gives you an unusual perspective which could be useful for questioning assumptions.

From reading this post, I think you're temperamentally uncomfortable with uncertainty, and prefer very well defined problems. I suspect that explains why you feel your reaction is different to others'.

"But I find it really difficult to think somewhere between concrete day-to-day AI work and futuristic scenarios. I have no idea how others know what assumptions hold and what don’t." - this is the key part, I think.

"I feel like it would be useful to write down limitations/upper bounds on what AI systems are able to do if they are not superintelligent and don’t for example have the ability to simulate all of physics (maybe someone has done this already, I don’t know)" - I think it would be useful and interesting to explore this. Even if someone else has done this, I'd be interested in your perspective.

Ada-Maaria Hyvärinen
Thanks for the nice comment! Yes, I am quite uncomfortable with uncertainty and trying to work on that. Also, I feel like by now I am pretty involved in EA and ultimately feel welcome enough to be able to post a story like this here (or I feel like EA appreciates different views enough, despite my also feeling this pressure to conform at the same time). 
I want to strongly second this! I think that a proof of the limitations of ML under certain constraints would be incredibly useful to narrow the area in which we need to worry about AI safety, or at least limit the types of safety questions that need to be addressed in that subset of ML.

But I am a bit at a loss on why people in the AI safety field think it is possible to build safe AI systems in the first place. I guess as long as it is not proven that the properties of safe AI systems contradict each other, you could assume it is theoretically possible. When it comes to ML, the best performance in practice is sadly often worse than the theoretical best.

To me, this belief that AI safety is hard or impossible would imply that AI x-risk is quite high. Then, I'd think that AI safety is very important but unfortunately intractable. Would you agree? Or maybe I misunderstood what you were trying to say.

I agree that x-risk from AI misuse is quite underexplored.

For what it's worth, AI safety and governance researchers do assign significant probability to x-risk from AI misuse. AI Governance Week 3 — Effective Altruism Cambridge comments:

For context on the field’s current perspectives on these questions, a 2020 survey of AI safety and governance researchers (Clarke et al., 2021) found that, on average [1], researchers currently guess there is: [2]

A 10% chance of existential catastrophe from misaligned, influence-seeking AI [3]

A 6% chance of existential catastroph

... (read more)
Ada-Maaria Hyvärinen
I think you understood me in the same way as my friend did in the second part of the prologue, so I apparently give this impression. But to clarify, I am not certain that AI safety is impossible (I think it is hard, though), and the implications of that depend a lot on how much power AI systems will be given in the end, and what part of the damage they might cause would be due to them being unsafe versus, for example, misuse, like you said. 

Hi Ada, I'm glad you wrote this post! Although what you've written here is pretty different from my own experience with AI safety in many ways, I think I got some sense of your concerns from reading this.

I also read Superintelligence as my first introduction to AI safety, and I remember pretty much buying into the arguments right away.[1] Although I think I understand that modern-day ML systems do dumb things all the time, this intuitively weighs less on my mind than the idea that AI can in principle be much smarter than humans, and that sooner or lat... (read more)

Ada-Maaria Hyvärinen
Hi Caleb! Very nice to read your reflection on what might make you think what you think. I related to many things you mentioned, such as wondering how much I think intelligence matters because of having wanted to be smart as a kid.

You understood correctly that intuitively, I think AI is less of a big deal than some people feel. This probably has a lot to do with my job, because it includes making estimates on whether problems can be solved with current technology given certain constraints, and it is better to err on the side of caution. Previously, one of my tasks was also to explain to people why AI is not a silver bullet and that modern ML solutions require things like training data and interfaces in order to be created and integrated into systems. Obviously, if the task is to find out all the things that future AI systems might be able to do at some point, you should take quite a different attitude than when trying to estimate what you yourself can implement right now. This is why I try to take a less conservative approach than would come naturally to me, but I think it still comes across as pretty conservative compared to many AI safety folks.

I also find GPT-3 fascinating, but I think the feeling I get from it is not "wow, this thing seems actually intelligent" but rather "wow, statistics can really encompass so many different properties of language". I love language so it makes me happy. But to me, it seems that GPT-3 is ultimately a cool showcase of the current data-centered ML approaches ("take a model that is based on a relatively non-complex idea[1], pour a huge amount of data into it, use model"). I don't see it as a direct stepping stone to science-automating AI, because it is my intuition that "doing science well" is not that well encompassed in the available training data. (I should probably reflect more on what the concrete difference is.) Importantly, this does not mean I believe there can be no risks (or benefits!) from large language models, and models t

As somehow who works on AGI safety and cares a lot about it, my main conclusion from reading this is: it would be ideal for you to work on something other than AGI safety! There are plenty of other things to work on that are important, both within and without EA, and a satisfactory resolution to “Is AI risk real?” doesn’t seem essential to usefully pursue other options.

Nor do I think this is a block to comfortable behavior as an EA organizer or role model: it seems fine to say “I’ve thought about X a fair amount but haven’t reached a satisfactory conclusi... (read more)

Thanks for giving me permission, I guess I can use this if I ever need the opinion of "the EA community" ;)

However, I don't think I'm ready to give up on trying to figure out my stance on AI risk just yet, since I still estimate it is my best shot at forming a more detailed understanding of any x-risk, and understanding x-risks better would be useful for establishing better opinions on other cause prioritization issues.

Geoffrey Irving
That is also very reasonable! I think the important part is to not feel too bad about the possibility of never having a view (there is a vast sea of things I don't have a view on), not least because I think it actually increases the chance of getting to the right view if more effort is spent. (I would offer to chat directly, as I'm very much part of the subset of safety close to more normal ML, but am sadly over capacity at the moment.)

it would be ideal for you to work on something other than AGI safety!

I disagree. Here is my reasoning:

  • Many people that have extensive ML knowledge are not working on safety because either they are not convinced of its importance or because they haven't fully wrestled with the issue
  • In this post, Ada-Maaria articulated the path to her current beliefs and how current AI safety communication has affected her.
  • She has done a much more rigorous job of evaluating the persuasiveness of these arguments than anyone else I've read
  • If she continues down this path she could either discover what unstated assumptions the AI safety community has failed to communicate or potentially the actual flaws in the AI safety argument.
  • This will either make it easier for AI Safety folks to express their opinions or uncover assumptions that need to be verified.
  • Either would be valuable!
On the one hand I agree with this being very likely the most prudent action for OP to take from her perspective, and probably the best action for the world as well. On the other, I think I feel a bit sad to miss some element of... combativeness(?)... in my perhaps overly-nostalgic memories of the earlier EA culture, where people used to be much more aggressive about disagreements with cause and intervention prioritizations.

It feels to me that people are less aggressive about disagreeing with established consensus or strong viewpoints that other EAs have, and are somewhat more "live and let live" about both uses of money and human capital. I sort of agree with this being the natural evolution of our movement's emphases (longtermism is harder to crisply argue about than global health, money is more liquid/fungible than human capital). But I think I feel some sadness re: the decrease in general combativeness and willingness to viciously argue about causes. This is related to an earlier post about the EA community becoming a "big tent," which at the time I didn't agree with but now I'm warming up to.
Geoffrey Irving
I think the key here is that they’ve already spent quite a lot of time investigating the question. I would have a different reaction without that. And it seems like you agree my proposal is best both for the OP and the world, so perhaps the real sadness is about the empirical difficulty at getting people to consensus? At a minimum I would claim that there should exist some level of effort past which you should not be sad not arguing, and then the remaining question is where the threshold is.

(I’m happy to die on the hill that that threshold exists, if you want a vicious argument. :))

edit: I don't have a sense of humor.

"A senior AGI safety person has given me permission to not have a view and not feel embarrassed about it" – for lack of a better word, this sounds cultish to me: why would one need permission "from someone senior" to think or feel anything? If someone said this to me it would be a red flag about the group/community. I think your first suggestion ("I’ve thought about X a fair amount but haven’t reached a satisfactory conclusion") sounds much more reasonable, if OP feels like that reflects their opinion. But I also think that something like "I don't personally feel convinced by the AGI risk arguments, but many others disagree, I think you should read up on it more and reach your own conclusions" is much more reasonable than your second suggestion. I think we should welcome different opinions; as long as someone agrees with the main EA principles they are an EA, it should not be about agreeing completely with cause A, B and C. Sorry if I am over-interpreting your suggestion as implying much more than you meant, I am just giving my personal reaction. 

Disclaimer: long time lurker, first time poster. 

Yep, that’s very fair. What I was trying to say was that if in response to the first suggestion someone said “Why aren’t you deferring to others?” you could use that as a joke backup, but agreed that it reads badly.

Makes a lot of sense :D I just didn't get the joke, which I in hindsight probably should have... :P 

I really appreciated this post as well. One thought I had while reading it - there is at least one project to red team EA ideas getting off the ground. Perhaps that’s something that would be interesting to you and could come closer to helping you form your views. Obviously, it would not be a trivial time commitment, but it seems like you are very much qualified to tackle the subject.

I thought this post was wonderful. Very interestingly written, thoughtful and insightful. Thank you for writing it. And good luck with your next steps of figuring out this problem. It makes me want to write something similar; I have been in EA circles for a long time now and to some degree have also failed to form strong views on AI safety. Also, I thought your next steps were fantastic and very sensible, I would love to hear your future thoughts on all of those topics.


On your next steps, picking up on:  

To evaluate the importance of AI risk ag

... (read more)
Ada-Maaria Hyvärinen
Thanks! And thank you for the research pointers.

Maybe a typo: the second AI (EA) should be AI (Work)?

AI (EA) did not have to care about mundane problems such as “availability of relevant training data” or even “algorithms”: the only limit ever discussed was amount of computation, and that’s why AI (EA) was not there yet, but soon would be, when systems would have enough computational power to simulate human brains.

Btw, really like your writing style! :)

Ada-Maaria Hyvärinen
thanks Aayush! Edited the sentence to be hopefully more clear now :)

I was one of the facilitators in the most recent run of EA Cambridge's AGI Safety Fundamentals course, and I also have professional DS/ML experience.

In my case I very deliberately emphasised a sceptical approach to engaging with all the material, while providing clarifications and corrections where people's misconceptions are the source of scepticism. I believe this was well-received by my cohort, all of whom appeared to engage thoughtfully and honestly with the material.

I think this is the best way to engage, when time permits, because (in brief)

  • many ar
... (read more)
Ada-Maaria Hyvärinen
I feel like everyone I have ever talked about AI safety with would agree on the importance of thinking critically and staying skeptical, and this includes my facilitator and cohort members from the AGISF programme. I think a 1.5h discussion session between 5 people who have read 5 texts does not allow really going deep into any topics, since it is just ~3 minutes per participant per text on average. I think these kinds of programs are great for meeting new people, clearing misconceptions and providing structure/accountability on actually reading the material, but they by nature are not that good for having in-depth debates. I think that's ok, but just to clarify why I think it is normal that I probably did not mention most of the things I described in this post during the discussion sessions.

But there is an additional reason that I think is more important to me, which is differentiating between performing skepticism and actually voicing true opinions. It is not possible for my facilitator to notice which one I am doing because they don't know me, and performing skepticism (in order to conform to the perceived standard of "you have to think about all of this critically and on your own, and you will probably arrive at similar conclusions to others in this field") looks the same as actually raising the confusions you have. This is why I thought I could convey this failure mode to others by comparing it to inner misalignment :)

When I was a Math freshman my professor told us he always encourages people to ask questions during lectures. Often, it had happened that he'd explained a concept and nobody would ask anything. He'd check what the students understood, and it would turn out they did not grasp the concept. When asked why nobody asked anything, the students would say that they did not understand enough to ask a good question. To avoid this dynamic, he told us that "I did not understand anything" counts as a valid question in his lectures. It helped somewhat but at
Oliver Sourbut
OK, this is the terrible terrible failure mode which I think we are both agreeing on (emphasis mine). By 'a sceptical approach' I basically mean 'the thing where we don't do that', because there is not enough epistemic credit in the field, yet, to expect all (tentative, not-consensus-yet) conclusions to be definitely right. In traditional/undergraduate mathematics, it's different: almost always, when you don't understand or agree with the professor, she is simply right and you are simply wrong or confused! This is a justifiable perspective based on the enormous epistemic weight of all the existing work on mathematics. I'm very glad you call out the distinction between performing skepticism and actually doing it.
Ada-Maaria Hyvärinen
Yeah, I think we agree on this. I want to write out more later about what communication strategies might help people actually voice scepticism/concerns even if they are afraid of not meeting some standard of elaborateness.

My mathematics example actually tried to be about this: in my university, the teachers tried to make us forget that the teachers are more likely to be right, so that we would have to think about things on our own and voice scepticism even if we were objectively likely to be wrong. I remember another lecturer telling us: "if you finish an exercise and notice you did not use all the assumptions in your proof, you either did something wrong or you came up with a very important discovery". I liked how she stated that it was indeed possible that a person from our freshman group could make a novel discovery, however unlikely that was. The point is that my lecturers tried to teach that there is not a certain level you have to reach before your opinions start to matter: you might be right even if you are a total beginner and the person you disagree with has a lot of experience.

This is something I would like to emphasize when doing EA community building myself, but it is not very easy. I've seen this when I've taught programming to kids. If a kid asks me if their program is "done" or "good", I'd say "you are the programmer, do you think your program does what it is supposed to do", but usually the kids think it is a trick question and I'm just withholding the correct answer for fun. Adults, too, do not always trust that I actually value their opinion.

Hey, as someone who also has professional CS and DS experience, this was a really welcome and interesting read. I have all sorts of thoughts, but I had one main question:

So I used the AGISF Slack to find people who had already had a background in machine learning before getting into AI safety and asked them what had originally convinced them. Finally, I got answers from 3 people who fit my search criteria. They mentioned some different sources of first hearing about AI safety (80 000 Hours and LessWrong), but all three mentioned one same source that had de

... (read more)
Ada-Maaria Hyvärinen
That's right, thanks again for answering my question back then! Maybe I formulated my question wrong, but I understood from your answer that you first got interested in AI safety, and only then in DS/ML (you mentioned you had a CS background before, but not your academic AI experience). This is why I did not include you in this sample of 3 persons: I wanted to narrow the search to people who had more AI-specific background before getting into AI safety (not just CS). It is true that you did not mention Superintelligence either, but it is interesting to hear you also had a good opinion of it! If I had known both your academic AI experience and that you liked Superintelligence, I could have made the number 4 (unless you think Superintelligence did not really influence you; then it would be 3 out of 4).

You were the only person who answered my PM but stated they got into AI safety before getting into DS/ML. One person did not answer, and the other 3 that answered stated they got into DS/ML before AI safety. I guess there are more than 6 people with some DS/ML background on the course channel, but I also know not everyone introduced themselves, so the sample is very anecdotal anyway. I also used the Slack to ask for recommendations of blog posts or similar stories on how people with DS/ML backgrounds got into AI safety. Aside from recommendations on whom to talk to on the Slack, I got pointers to Stuart Russell's interview on Sam Harris' podcast and a Yudkowsky post.

Thanks for writing this, it was fascinating to hear about your journey here. I also fell into the cognitive block of “I can’t possibly contribute to this problem, so I’m not going to learn or think more about it.” I think this block was quite bad in that it got in the way of me having true beliefs, or even trying to, for quite a few months. This wasn’t something I explicitly believed, but I think it implicitly affected how much energy I put into understanding or trying to be convinced by AI safety arguments. I wouldn’t have realized it without your post, b... (read more)


Thanks for writing this! It really resonated with me despite the fact that I only have a software engineering background and not much ML experience. I'm still struggling to form my views as well for a lot of the reasons you mentioned and one of my biggest sources of uncertainty has been trying to figure out what people with AI/ML expertise think about AI safety. This post has been very helpful in that regard (in addition to other information that I've been ingesting to help resolve this uncertainty). The issue of AGI timelines has come to be a major crux f... (read more)

Hi Ada-Maaria, glad to have talked to you at EAG and congrats for writing this post - I think it's very well written and interesting from start to finish! I also think you're more informed on the topic than most people who are AI xrisk convinced in EA, surely including myself.

As an AI xrisk-convinced person, it always helps me to divide AI xrisk into these three steps. I think superintelligence xrisk probability is the product of these three probabilities:

1) P(AGI in next 100 years)
2) P(AGI leads to superintelligence)
3) P(superintelligence destroys humanity)... (read more)

Ada-Maaria Hyvärinen
Hi Otto! Thanks, it was nice talking to you at EAG. (I did not include any interactions/information from this weekend's EAG in the post because I had written it before the conference, felt it should not be any longer than it already was, and wanted to wait until the friends who are described as "my friends" in the post had read it before publishing.)

I am not that convinced AGI is necessarily the most important component of x-risk from AI – I feel like there could be significant risks from powerful non-generally-intelligent systems, but of course it is important to avoid all x-risk, so x-risk from AGI specifically is also worth talking about. I don't enjoy putting numbers on estimates, but I understand why it can be a good idea, so I will try. At least then I can later see if I have changed my mind and by how much.

I would give quite a low probability to 1), perhaps 1%? (I know this is lower than average estimates by AI researchers.) I think 2), on the other hand, is very likely, maybe 99%, on the assumption that there can be enough differences between implemented AGIs to make a team of AGIs surpass a team of humans through, for example, more efficient communication (basically what Russell says about this in Human Compatible seems credible to me). Note that even if this would be superhuman intelligence, it could still be more stupid than some superintelligence scenarios; I would give a much lower probability to superintelligence as Bostrom describes it. 3) is hard to estimate without knowing much about the type of superintelligence, but I would spontaneously say something high, like 80%? So because of the low probability on 1), my combined estimate is still significantly lower than yours. I definitely would love to read more research on this as well.
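For concreteness, the three-factor decomposition with these numbers multiplies out as follows (a rough sketch; the 1%, 99% and 80% figures are the tentative, highly uncertain guesses from this comment, not established values):

```python
# Rough sketch: combine the three conditional probability estimates
# from the discussion above. All numbers are illustrative guesses.
p_agi = 0.01    # 1) P(AGI in next 100 years)
p_super = 0.99  # 2) P(AGI leads to superintelligence)
p_doom = 0.80   # 3) P(superintelligence destroys humanity)

p_xrisk = p_agi * p_super * p_doom
print(f"Combined x-risk estimate: {p_xrisk:.4f}")  # roughly 0.0079, i.e. ~0.8%
```

With these numbers the product is dominated by the first factor: doubling P(AGI) roughly doubles the overall estimate, which is why the disagreement below focuses on 1).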
Thanks for the reply, and for trying to attach numbers to your thoughts! So our main disagreement lies in (1). I think this is a common source of disagreement, so it's important to look into it further. Would you say that the chance to ever build AGI is similarly tiny? Or is it just the next hundred years? In other words, is this a possibility or a timeline discussion?
Ada-Maaria Hyvärinen
Hmm, with a non-zero probability in the next 100 years, the likelihood over a longer time frame should be bigger, given that there is nothing that makes developing AGI more difficult as more time passes, and I would imagine it is more likely to get easier than harder (unless something catastrophic happens). In other words, I don't think it is certain that building AGI is impossible, but I am very pessimistic about anything like current ML methods leading to AGI. A lot of people in the AI safety community seem to disagree with me on that, and I have not completely understood why.
Otto Barten
So although we seem to be relatively close in terms of compute, we don't have the right algorithms yet for AGI, and no one knows if and when they will be found. If no one knows, I'd say a certainty of 99% that they won't be found in a hundred years, with thousands of people trying, is overconfident.
Ada-Maaria Hyvärinen
Yeah, I understand why you'd say that. However, it seems to me that there are other limitations to AGI besides finding the right algorithms. As a data scientist, I am biased towards thinking about available training data. Of course there is probably going to be progress on this as well in the future.
Otto Barten
Could you explain a bit more about the kind of data you think will be needed to train an AGI, and why you think this will not be available in the next hundred years? I'm genuinely interested, actually I'd love to be convinced about the opposite... We can also DM if you prefer.
Ada-Maaria Hyvärinen
This intuition turned out to be harder to explain than I thought, and it got me thinking a lot about how to define "generality" and "intelligence" (like all talk about AGI does). But say, for example, that you want to build an automatic doctor that is able to examine a patient and diagnose what illness they most likely have. This is not very general in the sense that you can imagine this system as a function of "read all kinds of input about the person, output diagnosis", but I still think it provides an example of the difficulty of collecting data.

There is some data that can be collected quite easily by the user, because the user can for example take pictures of themselves, measure their temperature, etc. And then there are some things the user might not be able to collect data about, such as "is this joint moving normally". I think it is not so likely we will be able to gather meaningful data about things like "how does a person's joint move if they are healthy" unless doctors start wearing gloves that track the position of their hands while doing the examination, and all this data is stored somewhere together with the doctor's interpretation.

To me it currently seems that we are collecting a lot of data about various things, but there are still many things for which there are no methods of collecting the relevant data, and it does not seem like that data would start getting collected as a by-product of something else (like in the case where you track what people buy from online stores). Also, a lot of data is unorganized and missing labels, and it can be hard to label after it has been collected. I'm not sure if all of this was relevant or if I got side-tracked too much when thinking about a concrete example I can imagine.
Hi AM, thanks for your reply. Regarding your example, I think it's quite specific, as you notice too. That doesn't mean I think it's invalid, but it does get me thinking: how would a human learn this task? A human intelligence wasn't trained on many specific tasks in order to be able to do them all. Rather, it first acquired general intelligence (apparently, somewhere), and was later able to apply this to an almost infinite number of specific tasks, with typically only a few examples needed. I would guess that an AGI would solve problems in a similar way: first learn general intelligence (somehow), then learn specific tasks quickly with little data needed.

For your example, if the AGI really needed to do this task, I'd say it could find ways to gather the data itself, just like a human who wanted to learn this skill would, after first acquiring some form of general intelligence. A human doctor might watch the healthily moving joint, gathering visual data; might hear the joint moving, gathering audio data; or might put her hand on the joint, gathering sensory data. The AGI could similarly film and record the healthy joint moving with already available cameras and microphones, or use data already available online, or, worst case, send in a drone with a camera and a sound recorder. It could even send in a robot that could gather sensory data if needed.

Of course, current AI lacks certain skills that are necessary to solve such a general problem in such a general way, such as really understanding the meaning behind a question that is asked and being able to plan a solution (including acquiring drones and robots in the process), and probably others. These issues would need to be solved first, so there is still a long way to go. But with the manpower, investment, and time (e.g. 100 years) available, I think we should assign a probability of at least tens of percents that this type of general intelligence including planning and acting effectively in the rea
Ada-Maaria Hyvärinen
Hi Otto! I agree that the example was not that great and that a lack of data sources can definitely be countered with general intelligence, like you describe. So it could definitely be possible that a generally intelligent agent could plan how to gather the needed data. My gut feeling is still that it is impossible to develop such intelligence based on one data source (for example text, in however large amounts), but of course there are already technologies that combine different data sources (such as self-driving cars), so this clearly is also not the limit. I'll have to think more about where this intuition of lack of data being a limit comes from, since it still feels relevant to me. Of course 100 years is a lot of time to gather data. I'm not sure if imagination is the difference either. Maybe it is the belief in somebody actually implementing things that can be imagined.
Hey, I wasn't saying it wasn't that great :) I agree that the difficult part is getting to general intelligence, also regarding data. Compute, algorithms, and data availability are all needed to get to this point. It seems really hard to know beforehand what kind of, and how much, algorithms and data one would need. I agree that basically only one source of data, text, could well be insufficient. There was a post I read on a forum somewhere (could have been here) from someone who let GPT3 solve questions including things like 'let all odd rows of your answer be empty'. GPT3 failed at all these kinds of assignments, showing a lack of comprehension. Still, the 'we haven't found the asymptote' argument from OpenAI (intelligence does increase with model size and that increase doesn't seem to stop, implying that we'll hit AGI eventually) is not completely unconvincing either. It bothers me that no one can completely rule out that large language models might hit AGI just by scaling them up. It doesn't seem likely to me, but from a risk management perspective, that's not the point.

An interesting perspective I'd never heard before from intelligent people is that AGI might actually need embodiment to gather the relevant data. (They also think it would need social skills first - also an interesting thought.)

While it's hard to know how much (and what kind of) algorithmic improvement and data is needed, it seems doable to estimate the amount of compute needed, namely what's in a brain plus or minus a few orders of magnitude. It seems hard for me to imagine that evolution can be beaten by more than a few orders of magnitude in algorithmic efficiency (the other way round is somewhat easier to imagine, but still unlikely in a hundred-year timeframe). I think people have focused on compute because it's most forecastable, not because it would be the only part that's important. Still, there is a large gap between what I think are essentially thought experiments (relevant ones though
Ada-Maaria Hyvärinen
Hi Otto, I have been wanting to reply to you for a while, but I feel like my opinions keep changing, so writing coherent replies is hard (though having fluid opinions in my case seems like a good thing). For example, while I still think a precollected set of text as the only data source is insufficient for any general intelligence, maybe training a model on text and then having it interact with humans could lead it to connecting words to references (real-world objects), and maybe it would not necessarily need many reference points if the language model is rich enough? This then again sounds a bit like the concept of imagination, and I am worried I am anthropomorphising in a weird way.

Anyway, I still hold the intuition that generality is not necessarily the most important thing in thinking about future AI scenarios – this of course is an argument towards taking AI risk more seriously, because it should be more likely that someone will build advanced narrow AI or advanced AGI than advanced AGI alone. I liked "AGI safety from first principles", but I would still be reluctant to discuss it with, say, my colleagues from my day job, so I think I would need something even more grounded in current tech. But I do understand why people do not keep writing those kinds of papers, because it probably does not directly help solve alignment.

I feel like while “superintelligent AI would be dangerous” makes sense if you believe superintelligence is possible, it would be good to look at other risk scenarios from current and future AI systems as well.

I agree, and I think there's a gap for thoughtful and creative folks with technical understanding to contribute to filling out the map here!

One person I think has made really interesting contributions here is Andrew Critch, for example on Multipolar Failure and Robust Agent-Agnostic Processes (I realise this is literally me sharing a link without m... (read more)

I'm going to attempt to summarize what I think part of your current beliefs are (please correct me if I am wrong!)

  • Current ML techniques are not sufficient to develop AGI
  • But someday humans will be able to create AGI
  • It is possible (likely?) that it will be difficult to ensure that the AGI is safe
  • It is possible that humans will give enough control to an unsafe AGI that it is an X risk.

If I got that right I would describe that as both having (appropriately loosely held) beliefs about AI Safety and agreement that AI Safety is a risk with some unspecified probab... (read more)

Ada-Maaria Hyvärinen
I'm still quite uncertain about my beliefs, but I don't think you got them quite right. Maybe a better summary is that I am generally pessimistic about humans ever being able to create AGI, and especially about humans being able to create safe AGI (it is a special case, so it should probably be harder than creating any AGI). I also think that relying a lot on strong unsafe systems (AI-powered or not) can be an x-risk. This is why it is easier for me to understand why AI governance is a way to try to reduce x-risk (at least if actors in the world want to rely on unsafe systems; I don't know how much this happens, but I would not find it very surprising).

I wish I had a better understanding of how x-risk probabilities are estimated (as I said, I will try to look into that), but I don't directly understand why x-risk from AI would be a lot more probable than, say, biorisk (which I don't understand in detail at all).
Ah, yeah I misread your opinion of the likelihood that humans will ever create AGI. I believe it will happen eventually unless AI research stops due to some exogenous reason (civilizational collapse, a ban on development, etc.). Important assumptions I am making:

  • General intelligence is all computation, so it isn't substrate-dependent
  • The more powerful an AI is, the more economically valuable it is to its creators
  • Moore's Law will continue, so more compute will be available
  • If other approaches fail, we will be able to simulate brains with sufficient compute
  • Fully simulated brains will be AGI

I'm not saying that I think this would be the best, easiest, or only way to create AGI, just that if every other attempt fails, I don't see what would prevent this from happening, particularly since we are already able to simulate portions of a mouse brain. I am also not claiming here that this implies short timelines for AGI. I don't have a good estimate of how long this approach would take.

I’m fairly sure deep learning alone will not result in AGI

How sure? :)

What about some combination of deep learning (e.g. massive self-supervised) + within-context/episodic memory/state + procedurally-generated tasks + large-scale population-based training + self-play...? I'm just naming a few contemporary 'prosaic' practices which, to me, seem plausibly-enough sufficient to produce AGI that it warrants attention.

Ada-Maaria Hyvärinen
Like I said, it is based on my gut feeling, but fairly sure. Is it your experience that adding more complexity and concatenating different ML models results in better quality and generality, and if so, in what domains? I would have the opposite intuition, especially in NLP. Also, do you happen to know why "prosaic" practices are called "prosaic"? I have never understood the connection to the dictionary definition of "prosaic".

Thank you for writing this! I particularly appreciated hearing your responses to Superintelligence and Human Compatible, and would be very interested to hear how you would respond to The Alignment Problem. TAP is more grounded in modern ML and current research than either of the other books, and I suspect that this might help you form more concrete objections (and/or convince you of some points). If you do read it, please consider sharing your responses.

That said, I don’t think that you have any obligation to read TAP, or to consider thinking about AI safe... (read more)

Ada-Maaria Hyvärinen
Thanks! It will be difficult to write an authentic response to TAP, since these other responses were originally not meant to be public, but I will try to keep the same spirit if I end up writing more about my AI safety journey. I actually do find AI safety interesting; it just seems that I think about a lot of things differently than many people in the field, and it is hard for me to pinpoint why. But the main motivations for spending a lot of time on forming personal views about AI safety are:

  • I want to understand x-risks better; AI risk is considered important among people who worry about x-risk a lot, and because of my background I should be able to understand the argument for it (better than, say, biorisk)
  • I find it confusing that understanding the argument is so hard, and that makes me worried (like I explained in the sections "The fear of the answer" and "Friends and appreciation")
  • I find it very annoying when I don't understand why some people are convinced by something, especially if these people are with me in a movement that is important for us all
Thank you for explaining more. In that case, I can understand why you'd want to spend more time thinking about AI safety. I suspect that much of the reason that "understanding the argument is so hard" is because there isn't a definitive argument -- just a collection of fuzzy arguments and intuitions. The intuitions seem very, well, intuitive to many people, and so they become convinced. But if you don't share these intuitions, then hearing about them doesn't convince you. I also have an (academic) ML background, and I personally find some topics (like mesa-optimization) to be incredibly difficult to reason about. I think that generating more concrete arguments and objections would be very useful for the field, and I encourage you to write up any thoughts that you have in that direction! (Also, a minor disclaimer that I suppose I should have included earlier: I provided technical feedback on a draft of TAP, and much of the "AGI safety" section focuses on my team's work. I still think that it's a good concrete introduction to the field, because of how specific and well-cited it is, but I also am probably somewhat biased.)

It turned out that at least around me, the most common answer was something like: “I always knew it was important and interesting, which is why I started to read about it.”

I found out about alignment/AGI from some videos of Rob Miles on Computerphile. It's possible you're around/talking to very smart people who were around when the field was founded (hence they came up with it themselves), but that's selection bias - most people aren't like that.

Ada-Maaria Hyvärinen
To clarify, my friends (even if they are very smart) did not come up with all the AI safety arguments by themselves, but started to engage with AI safety material because they had already been looking at the world and thinking "hmm, looks like AI is a big thing and could influence a lot of stuff in the future, hope it changes things for the good". So they quickly got on board after hearing that there are people seriously working on the topic, and it made them want to read more.
[comment deleted]