Violet Hour's Quick takes

Violet Hour




Violet Hour · Jun 26 2023 · 26

Here's a dynamic that I've seen pop up more than once.

Person A says that an outcome they judge to be bad will occur with high probability, while making a claim of the form "but I don't want (e.g.) alignment to be doomed — it would be a huge relief if I'm wrong!"

It seems uncontroversial that Person A would like to be shown that they're wrong in a way that vindicates their initial forecast as ex ante reasonable.

It seems more controversial whether Person A would like to be shown that their prediction was wrong, in a way that also shows their initial prediction to have been ex ante unreasonable.

In my experience, it's much easier to acknowledge that you were wrong about some specific belief (or the probability of some outcome), than it is to step back and acknowledge that the reasoning process which led you to your initial statement was misfiring. Even pessimistic beliefs can be (in Ozzie’s language) "convenient beliefs" to hold.

If we identify ourselves with our ability to think carefully, coming to believe that there are errors in our reasoning process can hit us much more personally than updates about errors in our conclusions. Optimistic updates might be an update towards me thinking that my projects have been less worthwhile than I thought, that my local community is less effective than I thought, or that my background framework or worldview was in error. I think these updates can be especially painful for people who are more liable to identify with their ability to reason well, or identify with the unusual merits of their chosen community.

To clarify: I'm not claiming that people with more pessimistic conclusions are, in general, more likely to be making reasoning errors. Obviously there are plenty of incentives towards believing rosier conclusions. I'm simply claiming that: if someone arrives at a pessimistic conclusion based on faulty reasoning, then you shouldn't necessarily expect optimistic pushback to be uniformly welcomed, for all of the standard reasons that updates of the form "I could've done better on a task I care about" can be hard to accept.

titotal · Jun 27 2023 · 4

Also, sometimes the people saying this have their identity, reputation, or their career tied to the pessimistic belief. They might consciously prefer the optimistic outcome, but I reckon emotionally/subconsciously the bias would clearly be in the other direction.

Violet Hour · Jan 14 2023 · 24

Some brief, off-the-cuff sociological reflections on the Bostrom email:

1. EA will continue to be appealing to those with an ‘edgelord’ streak. It’s worth owning up to this, and considering how to communicate going forward in light of that fact.
2. I think some of the reaction to the importance of population-level averages is indicative of an unhealthy attitude towards population-level averages.
3. I also think the ‘epistemic integrity’ angle is important.

Each consideration is discussed below.

1.

I think basically everyone here agrees that white people shouldn’t be using (or mentioning) racial slurs. I think you should also avoid generics. I think that you very rarely gain anything, even from a purely epistemic point of view, from saying (for example): “men are more stupid than women”.

EA skews young, and does a lot of outreach on university campuses. I also think that EA will continue to be attractive to people who like to engage in the world via a certain kind of communication, and I think many people interested in EA are likely to be drawn to controversial topics. I think this is unavoidable. Given that it’s unavoidable, it’s worth being conscious of this, and directly tackling what’s to be gained (and lost) from certain provocative modes of communication, in what contexts.

The Flint water crisis affected an area that was majority African American, and we have strong evidence that lead poisoning affects IQ. Here’s one way of ‘provocatively communicating’ these facts:

The US water system is poisoning black people.

I don’t like this statement and think it is true.

Deliberately provocative communication probably does have its uses, but it’s a mode of communication that can be in tension with nuanced epistemics, as well as kindness. If I’m to get back in touch with my own old edgelord streak for a moment, I’d say that one (though obviously not the major) benefit of EA is the way it can transform ‘edgelord energy’ into something that can actually make the world better.

I think, as with SBF, there’s a cognitive cluster which draws people towards both EA, and towards certain sorts of actions most of us wish to reflectively disavow. I think it’s reasonable to say: “EA messaging will appeal (though obviously will not only appeal) disproportionately to a certain kind of person. We recognize the downsides in this, and here’s what we’re doing in light of that.”

2.

There are fewer women than men in computer science. Once a woman says “I’m interested in computer science”, this doesn’t give you any grounds to be like “oh, well, but maybe she’s lying, or maybe it’s a joke, or … ”

Why? You have evidence that means you don’t need to rely on such coarse data! You don’t need to rely on population-level averages! To the extent that you do, or continue to treat such population-averages as highly salient in the face of more relevant evidence, I think you should be criticized. I think you should be criticized because it’s a sign that your epistemics have been infected by pernicious stereotypes, which makes you worse at understanding the world, in addition to being more likely to cause harm when interacting in that world.

3.

You should be careful to believe true things, even when they’re inconvenient.

On the epistemic level, I actually think that we’re not in an inconvenient possible world wrt the ‘genetic influence on IQ’, partially because I think certain conceptual discussions of ‘heritability’ are confused, and partially because I think that it’s obviously reasonable to look at the (historically quite recent!) effects of slavery, and conclude “yeaah, I’m not sure I’d expect the data we have to look all that different, conditioned on the effects of racism causing basically all of the effects we currently see”.

But, fine, suppose I’m in an inconvenient possible world. I could be faced with data that I’d hate to see, and I’d want to maintain epistemic integrity.

One reason I personally found Bostrom’s email sad was that I sensed a missing mood. To support this, here’s an intuition pump that might be helpful: suppose you’re back in the early days of EA, working for $15k in the basement of an estate agent. You’ve sacrificed a lot to do something weird, sometimes you feel a bit on the defensive, and you worry that people aren’t treating you with the seriousness you deserve. Then someone comes along, says they’ve run some numbers, and tells you that EA is more racist than other cosmopolitan groups and, despite EA’s intention to do good, is actually far more harmful to the world than other comparable groups. Suppose further that we also ran surveys and IQ tests, and found that EA is also more stupid and unattractive than other groups. I wouldn’t say:

EA is harmful, racist, ugly, and stupid. I like this sentence and think it’s true.

Instead, I’d communicate the information, if I thought it was important, in a careful and nuanced way. If I saw someone make the unqualified statement quoted above, I wouldn’t personally wish to entrust that person with promoting my best interests, or with leading an institute directed towards the future of humanity.

I raise this example not because I wish to opine on contemporary Bostrom based on an email from twenty-six years ago, but because, while (like 𝕮𝖎𝖓𝖊𝖗𝖆) I’m glad that Bostrom didn’t distort his epistemics in the face of social pressure, I think it’s reasonable to think (like Habiba, apologies if this is an unfair paraphrase) that Bostrom didn’t take ownership of his previously missing mood, or communicate why his subsequent development leads him to now repudiate what he said.

I don’t want to be unnecessarily punitive towards people who do shitty things. That’s not kindness. But I also want to be part of a community that promotes genuinely altruistic standards, including a fair sense of penance. With that in mind, I think it’s healthy for people to say: “we accept that you don’t endorse your earlier remark (Bostrom originally apologized within 24 hours, after all), but we still think your apology misses something important, and we’re a community that wants people who are currently involved to meet certain standards.”

Violet Hour · Apr 24 2023 · 5

AI safety

A working attempt to sketch a simple three-premise argument for the claim: ‘TAI will result in human extinction’, and offer objections. Made mostly for my own benefit while working on another project, but I thought it might be useful to post here.

The structure of my preferred argument is similar to an earlier framing suggested by Katja Grace.

1. Goal-directed superhuman AI systems will be built (let’s say conditioned on TAI).
2. If goal-directed superhuman AI systems are built, their values will result in human extinction if realized.
3. If goal-directed superhuman AI systems are built, they’ll be able to realize their values, even if their values would result in human extinction if realized.

Thus: Humanity will go extinct.

I’ll offer some rough probabilities, but the probabilities I’m offering shouldn’t be taken seriously. I don’t think probabilities are the best way to adjudicate disputes of this kind, but I thought offering a more quantitative sense of my uncertainty (based on my immediate impressions) might be helpful in this case. For the (respective) premises, I might go for 98%, 7%, 83%, resulting in a ~6% chance of human extinction given TAI.
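As a quick check on the arithmetic, here is a minimal sketch that reproduces the headline figure. It assumes the three rough credences are meant to be read as a chain of conditional probabilities, so that the conjunction is simply their product; the variable names are illustrative only.

```python
# Rough credences for the three premises, as stated above.
# Assumption: each figure is conditional on the preceding premises holding,
# so the probability of the conclusion is their product.
p_built = 0.98        # Premise 1: goal-directed superhuman AI is built (given TAI)
p_bad_values = 0.07   # Premise 2: its values would result in extinction if realized
p_can_realize = 0.83  # Premise 3: it is able to realize those values

p_extinction = p_built * p_bad_values * p_can_realize
print(f"P(extinction | TAI) ~ {p_extinction:.1%}")  # about 5.7%, i.e. roughly 6%
```

Because the result is a straight product, it moves in proportion to Premise 2: doubling that credence to 14% would roughly double the final figure.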

Some more specific objections:

Obviously Premise 2 is doing a lot of the work here. I think that one of the main arguments for believing in Premise 2 is a view like Rob’s, which holds that current ML is on track to produce systems which are, “in the ways that matter”, more like ‘randomly sample (simplicity-weighted) plans’ than anything recognizably human. If future systems are sampling from simplicity-weighted plans to achieve arbitrary goals, then Premise 2 does start to look very plausible.
- This basically just seems like an extremely strong claim about the inductive biases of ML systems, and my (likely unsatisfying) response boils down to: (1) I don’t see any strong argument for believing it, and (2) I see some arguments for the alternative conclusion.
- I find myself really confused when trying to think about this debate. In a discussion of Rob’s post, Daniel Kokotajlo says: “IMO the burden of proof is firmly on the side of whoever wants to say that therefore things will probably be fine.”
- I think I just don’t get the intuition behind his argument (tagging @kokotajlod in case he wants to correct any misunderstandings). I don’t really like ‘burden of proof’ talk, but my instinct is to say “look, LLMs distill human cognition, much of this cognition implicitly contains plans, human-like value judgements, etc.” I start from a place where I currently believe “future systems have human-like inductive biases” will be a better predictive abstraction than “randomly sample from the space of simplicity-weighted plans”. And … I just don’t currently see the argument for rejecting my current view?
  - Perhaps there are near-term predictions which would help weigh on the dispute between the two hypotheses? I currently interpret the disagreement here as a disagreement about the relevant outcome space over which we should be uncertain, which feels hard to adjudicate. But, right now, I struggle to see the argument for the more doomy outcome space.
More on Premise 2: Paul Christiano offers various considerations which count against doom which appear to go through without having “solved alignment”. These considerations feel less forceful to me than the points in the bullet point above, but they still serve to make Premise 2 seem less likely.
- “Given how small the [resource costs of keeping humans around are], even tiny preferences one way or the other will dominate incidental effects from grabbing more resources”.
- “There are just a lot of plausible ways to care a little bit (one way or the other!) about a civilization that created you, that you've interacted with, and which was prima facie plausibly an important actor in the world”
- “Most humans and human societies would be willing to spend much more than 1 trillionth of their resources (= $100/year for all of humanity) for a ton of random different goals”
- Paul also mentions “decision-theoretic arguments for cooperation”, including a passing reference to ECL.
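The "1 trillionth = $100/year" figure in the third quote can be sanity-checked with one line of arithmetic. The sketch below assumes an annual world output of roughly $100 trillion; that round number is supplied here for illustration and is not stated in the quoted text.

```python
# Assumption: annual world output of roughly $100 trillion (a round,
# illustrative figure that does not appear in the quoted text).
world_output_usd_per_year = 100 * 10**12
one_trillionth_denominator = 10**12

# Integer arithmetic keeps the result exact.
cost_usd_per_year = world_output_usd_per_year // one_trillionth_denominator
print(cost_usd_per_year)  # 100
```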

I also think the story by Katja Grace below is plausible, in which superhuman AI systems are “goal-directed”, but don’t lead to human extinction.

AI systems proliferate, and have various goals. Some AI systems try to make money in the stock market. Some make movies. Some try to direct traffic optimally. Some try to make the Democratic party win an election. Some try to make Walmart maximally profitable. These systems have no perceptible desire to optimize the universe for forwarding these goals because they aren’t maximizing a general utility function, they are more ‘behaving like someone who is trying to make Walmart profitable’. They make strategic plans and think about their comparative advantage and forecast business dynamics, but they don’t build nanotechnology to manipulate everybody’s brains, because that’s not the kind of behavior pattern they were designed to follow. The world looks kind of like the current world, in that it is fairly non-obvious what any entity’s ‘utility function’ is. It often looks like AI systems are ‘trying’ to do things, but there’s no reason to think that they are enacting a rational and consistent plan, and they rarely do anything shocking or galaxy-brained.

Perhaps the story above is unlikely because the AI systems in Grace’s story would (in the absence of strong preventative efforts) be dangerous maximizers. I think that this is most plausible on something like Eliezer’s model of agency, and if my views change my best bet is that I’ll have updated towards his view.

I believe: as you develop gradually more capable agentic systems, there are dynamic pressures towards a certain kind of coherency. I don’t think that claim alone establishes the existence of dynamic pressures towards ‘dangerous maximizing cognition’.
I think that AGI cognition (like our own) may well involve schemas, like (say) being loyal, or virtuous. We don’t argmax(virtue). Rather, the virtue schema also applies to the process by which we search over plans.
- So I don’t see why ‘having superhuman AIs run Walmart’ necessarily leads to doom, because they might just be implementing schemas like “be a good business professional”, rather than “find the function f(.) which is most ‘business-professional-like’, then maximize f(.), regardless of whether any human would consider f(.) to represent anything corresponding to ‘good business professional’.”
  - Alex Turner has a related comment here.
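The contrast between argmax-style optimization and schema-guided plan search can be made concrete with a toy sketch. Everything in it is hypothetical: the plan names, the `profit` scores, and the `fits_schema` flag are invented for illustration and are not drawn from any real system or from the posts under discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    profit: float      # hypothetical proxy score f(.)
    fits_schema: bool  # hypothetical: would a human call this 'good business professional' conduct?


PLANS = [
    Plan("negotiate better supplier contracts", 3.0, True),
    Plan("open stores in underserved towns", 5.0, True),
    Plan("manipulate customers with nanotech", 9.0, False),
]


def argmax_planner(plans):
    """Maximize the proxy over every plan, with no constraint on the search."""
    return max(plans, key=lambda p: p.profit)


def schema_planner(plans):
    """Let the schema shape the search: only conforming plans are ever candidates."""
    candidates = [p for p in plans if p.fits_schema]
    return max(candidates, key=lambda p: p.profit)


print(argmax_planner(PLANS).name)  # manipulate customers with nanotech
print(schema_planner(PLANS).name)  # open stores in underserved towns
```

The sketch only illustrates the distinction being drawn. It says nothing about which kind of search real systems would end up implementing.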
On Premise 3: I feel unsatisfied, so far, by accounts of AI takeover scenarios. Admittedly, it seems kinda mad for me to say “I’m confident that an AI with greater cognitive power than all of humanity couldn’t kill us if it wanted to”, which is one reason that I’m only at ~⅙ chance that we’d survive in that situation.
- But I also don’t know how much my conclusion is swayed by a sense of wanting to avoid the hubris of “man, it would be Really Dumb if you said a highly capable AI couldn’t kill us if it wanted to, and then we end up dead”, rather than a more obviously virtuous form of cognition.
- A system being ‘cognitively efficient wrt humanity’ doesn’t automatically entail ‘whatever goals the system has – and whatever constraints the system might otherwise face – the cognitively efficient system gets what it wants’. The arguments attempting to move from ‘AI with superhuman cognitive abilities’ to ‘human extinction’ feel fuzzier than I’d like.

If superhuman systems don’t foom, we might have marginally superhuman systems, which can be thwarted before they kill literally everyone (while still doing a lot of damage). Constraints like ‘accessing the relevant physical infrastructure’ might dominate the gains from greater cognitive efficiency.
- I also feel pretty confused about how much actual real-world power would be afforded to AIs in light of their highly advanced cognition (a relevant recent discussion), which further brings down my confidence in Premise 3.
- I’m also assuming that: conditioned on an AI instrumentally desiring to kill all humans, deceptive alignment is likely. I haven’t read posts like this one which might challenge that assumption. If I came to believe that deceptive alignment was highly unlikely, this could lower the probability of either Premise 2 or Premise 3.

Finally, I sometimes feel confused by the concept of ‘capabilities’ as it’s used in discussions about AGI. From Jenner and Treutlein’s response to Grace’s counterarguments:

Assuming it is feasible, the question becomes: why will there be incentives to build increasingly capable AI systems? We think there is a straightforward argument that is essentially correct: some of the things we care about are very difficult to achieve, and we will want to build AI systems that can achieve them. At some point, the objectives we want AI systems to achieve will be more difficult than disempowering humanity, which is why we will build AI systems that are sufficiently capable to be dangerous if unaligned.

Maybe one thing I’m thinking here is that “more difficult” is hard to parse. The AI systems might be able to achieve some narrower outcome that we desire, without being “capable” of destroying humanity. I think this is compatible with having systems which are superhumanly capable of pursuing some broadly-scoped goals, without being capable of pursuing all broadly-scoped goals.

Psychologists talk of ‘g’ because performance is correlated across tasks we intuitively think of as cognitive, and also correlates with some important life outcomes. I don’t know how well the unidimensional notion of intelligence will transfer to advanced AI systems. The fact that some AIs perform decently on IQ tests without being good at much else is at least some weak evidence against the generality of the more unidimensional ‘intelligence’ concept.
- However, I agree that there’s a well-defined sense in which we can say that AIs are more cognitively capable than all of humanity combined. I also think that my earlier point about expecting future systems to exhibit human-like inductive biases makes the argument in the bullet point above substantially weaker.
  - I still remain uneasy about the extent to which a unidimensional notion of ‘capabilities’ can feed into claims about takeoffs and takeover scenarios, and I’m currently unclear on whether this makes a practical difference.

(Also, I’m no doubt missing a bunch of relevant information here. But this is probably true for most people, and I think it’s good for people to share objections even if they’re missing important details)

kokotajlod · Apr 25 2023 · 3

Nice analysis!

I think a main point of disagreement is that I don't think systems need to be "dangerous maximizers" in the sense you described in order to predictably disempower humanity and then kill everyone. Humans aren't dangerous maximizers, yet we've killed many species of animals, Neanderthals, and various other human groups (genocide, wars, oppression of populations by governments, etc.). Katja's scenario sounds plausible to me except for the part where somehow it all turns out OK in the end for humans. :)

Another, related point of disagreement:

“look, LLMs distill human cognition, much of this cognition implicitly contains plans, human-like value judgements, etc.” I start from a place where I currently believe “future systems have human-like inductive biases” will be a better predictive abstraction than “randomly sample from the space of simplicity-weighted plans”. And … I just don’t currently see the argument for rejecting my current view?

I actually agree that current and future systems will have human-like concepts, human-like inductive biases, etc. -- relative to the space of all possible minds at least. But their values will be sufficiently alien that humanity will be in deep trouble. (Analogy: Suppose we bred some octopi to be smarter and smarter, in an environment where they were e.g. trained with pavlovian conditioning + artificial selection to be really good at reading internet text and predicting it, and then eventually writing it also... They would indeed end up a lot more human-like than regular wild octopi. But boy would it be scary if they started getting generally smarter than humans and being integrated deeply into lots of important systems and humans started trusting them a lot etc.)

Violet Hour · Apr 26 2023 · 3

thnx! : )

Your analogy successfully motivates the “man, I’d really like more people to be thinking about the potentially looming Octopcracy” sentiment, and my intuitions here feel pretty similar to the AI case. I would expect the relevant systems (AIs, von-Neumann-Squidwards, etc) to inherit human-like properties wrt human cognition (including normative cognition, like plan search), and I’d assign a small-but-non-negligible chance to us ending up with extinction (or worse).

On maximizers: to me, the most plausible reason for believing that continued human survival would be unstable in Grace’s story either consists in the emergence of dangerous maximizers, or the emergence of related behaviors like rapacious influence-seeking (e.g., Part II of What Failure Looks Like). I agree that maximizers aren't necessary for human extinction, but it does seem like the most plausible route to ‘human extinction’ rather than ‘something else weird and potentially not great’.

kokotajlod · Apr 26 2023 · 2

Nice. Well, I guess we just have different intuitions then -- for me, the chance of extinction or worse in the Octopcracy case seems a lot bigger than "small but non-negligible" (though I also wouldn't put it as high as 99%).

Human groups struggle against each other for influence/power/control constantly; why wouldn't these octopi (or AIs) also seek influence? You don't need to be an expected utility maximizer to instrumentally converge; humans instrumentally converge all the time.

kokotajlod · Apr 25 2023 · 2

Oh also you might be interested in Joe Carlsmith's report on power-seeking AI, it has a relatively thorough discussion of the overall argument for risk.

titotal · Apr 25 2023 · 3

This is a good summary, thanks for writing it up!

RobertM · Apr 25 2023 · 2

I do agree with this, in principle:

A system being ‘cognitively efficient wrt humanity’ doesn’t automatically entail ‘whatever goals the system has – and whatever constraints the system might otherwise face – the cognitively efficient system gets what it wants’.

...though I don't think it buys us more than a couple points; I think people dramatically underestimate how high the ceiling is for humans and think that a reasonably smart human familiar with the right ideas would stand a decent chance at executing a takeover if placed into the position of an AI (assuming speedup of cognition, + whatever actuators current systems typically possess).

However, I think this is wrong:

LLMs distill human cognition

LLMs have whatever capabilities they have because those are the capabilities discovered by gradient descent which, given their architecture, improved their performance on the training task (next token prediction). This task is extremely unlike the tasks represented in the environment where human evolution occurred, and the kind of cognitive machinery which would make a system effective at next token prediction seems very different from whatever it is that humans do. (Humans are capable of next token prediction, but notably we are much worse at it than even GPT-3.)

Separately, the cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values (and/or the cognitive machinery that causes humans to develop values after birth), so if it turned out that LLMs did, somehow, share the bulk of their cognitive algorithms with humans, that would be a slight positive update for me, but not an overwhelming one, since I wouldn't expect an LLM to want anything remotely relevant to humans. (Most of the things that humans want are lossy proxies for things that improved IGF in the ancestral environment, many of which generalized extremely poorly out of distribution. What are the lossy proxies for minimizing prediction loss that a sufficiently-intelligent LLM would end up with? I don't know, but I don't see why they'd have anything to do with the very specific things that humans value.)

Violet Hour · Apr 25 2023 · 1

AI safety

Pushback appreciated! But I don’t think you show that “LLMs distill human cognition” is wrong. I agree that ‘next token prediction’ is very different to the tasks that humans faced in their ancestral environments; I just don’t see this as particularly strong evidence against the claim ‘LLMs distill human cognition’.

I initially stated that “LLMs distill human cognition” struck me as a more useful predictive abstraction than a view which claims that the trajectory of ML leads us to a scenario where future AIs are, “in the ways that matter”, doing something more like “randomly sampling from the space of simplicity-weighted plans”. My initial claim still seems right to me.

If you want to pursue the debate further, it might be worth talking about the degree to which you’re (un)convinced by Quintin Pope’s claims in this tweet thread. Admittedly, it sounds like you don’t view this issue as super cruxy for you:

“The cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values”

I don’t know the literature on moral psychology, but that claim doesn’t feel intuitive to me (possibly I’m misunderstanding what you mean by ‘human values’; I’m also interested in any relevant sources). Some thoughts/questions:

Does your position rule out the claim that “humans model other human beings using the same architecture that they use to model themselves”?
- To me, this seems like an instance where ‘value reasoning’ and ‘descriptive reasoning’ rely on similar cognitive resources. If LLMs inherit this human-like property (Quintin claims they do), would that update you towards optimism? If not, why not?
I take it that the notion of ‘intelligence’ we’re working with is related to planning. If future AI systems inherit human-like cognition wrt plan search, then I think this is a reason to expect that AI cognition will also inherit not-completely-alien-to-human values — even if there are, in some sense, distinct cognitive mechanisms undergirding ‘values’ and ‘non-values’ reasoning in humans.
- This is because the ‘search over plans’ process has both normative and descriptive components. I don’t think the claim about LLMs distilling human cognition constitutes anything like a guarantee that future LLMs will have values we’d really like, and nor is it a call for complacency about the emergence of misaligned goals. I just think it constitutes meaningful evidence against the human extinction claim.
  - As I write this, I’m starting to think that your claim about distinct cognitive mechanisms primarily seems like an argument for doom conditioned on ‘LLMs mostly don’t distill human cognition’, but doesn’t seem like an independent argument for doom conditioned on LLMs distilling human cognition. If LLMs distill the plan search component of human cognition, this feels like a meaningful update against doom. If LLMs mostly fail to distill the parts of human cognition involved in plan search, then cognitive convergence might happen because (e.g.) the Natural Abstraction Hypothesis is true, and 'human values' aren't a natural abstraction. In that case, it seems correct to say that cognitive convergence constitutes, at best, a small update against doom. (The cognitive convergence would occur due to structural properties of patterns in the world, rather than arising as the result of LLMs distilling more specifically human thought patterns related to values)
  - So I feel like ‘the degree to which we should expect future AIs to converge with human-like cognitive algorithms for plan search’ might be a crux for you?