NinaR

Meta-level comment: this post was interesting, very well written and I could empathize with a lot of it, and in fact, it inspired me to make an account on here in order to comment : )

Object-level comment (ended up long, apologies!): My personal take is that a lot of EA literature on AI Safety (eg: forum articles) uses terminology that overly anthropomorphizes AI and skips a lot of steps in arguments, assuming a fair amount of prerequisite knowledge/ familiarity with jargon. When reading such literature, I try to convert the "EA AI Safety language" into "normal language" in my head in order to understand the claims better. Overall my answer to “why is AI safety important” is (currently) the following:

Humans are likely to develop increasingly powerful AI systems in order to solve important problems / provide in-demand services.
As AI systems become more powerful they can do more stuff / affect more change in the world. Somewhat because we’ll want them to. Why develop a powerful AI if not to do important/hard things?
As AI systems become more powerful it will be harder for us to ensure they do what we want them to do when deployed. Imo this claim needs more justification, and I will elaborate below.
Therefore, there is some probability that a powerful AI does things we don’t want. Hopefully, this follows from the above 3 premises.

AI Safety literature commonly refers to AI’s as optimizers. Terms like “mesa-optimizer” are used a lot as well. There is also a fair amount of anthropomorphizing language such as “the AI will want to”, “the AI will secretly plan to”, and “the AI will like/dislike”. By now, I am ok with parsing such statements, but it can be confusing/distracting. “This thing is an optimizer” is a useful phrase when trying to predict the thing’s behavior, but looking through the lens of “what kind of optimizer is this” isn’t always the clearest way to look at the problem. The same can be said for the anthropomorphizing language. When someone says something like “so how do we check whether an AI secretly wants to kill you”, they are trying to summarise a phenomenon succinctly without giving the details/context, however, they are incurring the cost of being less precise and clear (and sounding weird).

Here’s my attempt to explain point 3 (As AI systems become more powerful it will be harder for us to ensure they do what we want them to do when deployed) without using EA-isms.

The simplest possible description of an AI is an instance of a program that takes some inputs and produces some outputs.
Training an AI is the process of searching over different possible programs to find ones that “are good” by looking at which ones produce “good outputs for their inputs”. Another more general phrasing is searching over different possible programs to find ones that “are good” by using some kind of filter that can classify programs as good and bad.
“Is this a good output” or “is this program instance good” is easy to check when the problem is simple but harder to check when the problem is hard (some people call this the outer alignment problem). For example, “is the outputted animal image labeled correctly” is easy to check but “does the outputted drug treat the disease” is harder to check. It is plausible to me that we will want “is good” to encapsulate more and more human values as the problems become harder, which makes it *even* harder to define. For example what if the criteria is “does the outputted intervention reduce CO2 emissions without causing anything that humans find morally wrong”.
“good outputs for their inputs” / “the AI has all the properties we designed into the filter” is harder to check, the more complex the domain of possible inputs and outputs we are dealing with (some people call this the inner alignment problem). A maths analogy is function approximation. Say you have a really complicated function with a lot of local minima/maxima/wobbliness. If we are testing an approximation by checking various points in the function space and the approximated function space and seeing if they match up, the less smooth the function space the more points we need to test to see if they match up. With infinite training data, we could check every point in the function space and be certain that outputs are good for all inputs, however, this is impossible, therefore we cannot be certain, and there will always be some probability of a bad output for a particular input. If an AI is more powerful, the effect of one bad output could be really bad.

All this being said, I maintain a significant amount of skepticism around AI safety being the most important problem. I think nuclear risk and biorisk are very important and also more tractable. Based on my current model, I place a ~10% probability of AI-related existential risk in the next century, conditional on no other existential risk occurring beforehand. I think there is a much higher probability of non-existential risk, and I care about that too.

Currently, the most tractable way I can think of AI safety technical research is “let’s make ascertaining whether an AI output is good X% easier” and “let’s make it Y% easier to infer whether an AI is actually as good as it seems in training given limited training data and limited compute”. Research that increases X and Y would likely decrease AI risk. This is a hard problem, and many solutions may not be scalable to more powerful AI’s. This is likely why a lot of AI safety literature uses metaphors and language that seem far removed from current systems. I think this is an attempt to imagine a good map for the territory of future AI’s in order to come up with solutions that could work in the future.

I also have meta-level uncertainty around my takes and update in the direction of thinking that AI Safety is more important than I otherwise would because many intelligent people I respect think this. Because of this, I make decisions based on AI risk being higher than my internal model currently estimates. I also spend more time thinking about AI safety because I find it interesting and I have a somewhat suitable background (I like programming and ML). I do think a major factor in deciding whether to work on AI safety should be personal fit.

On being compromised

NinaR2y11

Is this trying to make a directional claim? Like people (in the EA community? in idealistic communities?) should on average be less afraid / more accepting of being morally compromised? (On first read, I assume no, it seems like just a descriptive post about the phenomenon).

FWIW, I think it's worth thinking about the 2 forms of "compromise" separately. (Being associated with something you end up finding morally bad / directly doing something you end up finding morally bad). I think it's easier and more worthwhile to focus on avoiding the latter, but overall I'm not sure whether I've found a strong tendency that people overdo either of these things.

"Agency" needs nuance

NinaR2y2

I don't think self-interest is relevant here if you believe that it is possible for an agent to have an altruistic goal.

Also, as with all words, "agentic" will have different meanings in different contexts, and my comment was based on its use when referring to people's behaviour/psychology which is not an exact science, therefore words are not being used in very precise scientific ways :)

NinaR2y3

Am not the author of this post, but I think EAs and rationalists have somewhat coopted the term "agentic" and infused it with a load of context and implicit assumptions about how an "agentic" person behaves, so that it no longer just means "person with agency". This meaning is transmitted via conversations with people in this social cluster as well as through books and educational sessions at camps/retreats etc.

Often, one of the implicit assumptions is that an "agentic" person is more rational and so pursues their goal more effectively, occasionally acting in socially weird ways if the net effect of their actions seems positive to them.

This was a great read! I relate to a lot of these thoughts - have swung back and forth on how much I want to lean into social norms / "guess culture" vs be a stereotypically rationalist-y person, and had a very similar experience at school. I think it's great you're thinking deeply and carefully about these issues. I've found that my attitude towards how to be "agentic" / behave in society has affected a lot of my major object-level decisions, both good and not so good.

How I failed to form views on AI safety

NinaR3y30

Effective Altruism Forum
EA Forum

Posts
1

Comments
6

NinaR

Posts 1

Comments6

Posts
1

Comments
6