Linch

Yeah I don't quite understand that line of argument. Naively, it seems like a bait-and-switch, not unlike "journalists don't write their own terrible headlines."

Ozzie Gooen's Quick takes

Linch11h4

Possibly a tangential point, but lots of people in many EA communities think that accelerating economic growth in the US is a top use of funds.

Hmm I think the link does not support your claim.

Power Laws of Value

Linch2d8

Why would value be distributed over some suitable measure of world-states in a way that can be described as a power law specifically (vs some other functional form where the most valuable states are rare)?

I agree with this. I'm probably being too much of a pedant, but it's a slight detriment to our broader epistemic community that people use "power law" as a shorthand for "heavy-tailed distribution" or just "many OOMs of difference between best and worst/median outcomes." I think it makes our thinking a bit less clear when we try to translate back and forth between intuitions and math.
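To make the distinction concrete, here is a minimal sketch (an illustration, not something from the post) comparing a Pareto distribution, which is a true power law, with a lognormal, which is heavy-tailed but not a power law. The parameters (`alpha=1.5`, `sigma=3.0`) and the helper names are arbitrary assumptions chosen for illustration. Both distributions put many OOMs between the median and the far tail, but only the Pareto has a constant slope on a log-log plot of the survival function.

```python
import math
from statistics import NormalDist


def pareto_quantile(p, alpha=1.5, x_min=1.0):
    """Quantile of a Pareto (power-law) distribution with survival (x_min / x) ** alpha."""
    return x_min * (1.0 - p) ** (-1.0 / alpha)


def lognormal_quantile(p, mu=0.0, sigma=3.0):
    """Quantile of a lognormal: exponentiate the matching normal quantile."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))


def ooms_above_median(quantile_fn, p):
    """Orders of magnitude between the p-th quantile and the median."""
    return math.log10(quantile_fn(p) / quantile_fn(0.5))


def tail_slope(quantile_fn, p1, p2):
    """Slope of log-survival against log-x between two quantiles.

    For a power law this equals -alpha everywhere; for other
    heavy-tailed distributions it drifts as you move into the tail.
    """
    x1, x2 = quantile_fn(p1), quantile_fn(p2)
    return (math.log(1 - p2) - math.log(1 - p1)) / (math.log(x2) - math.log(x1))


if __name__ == "__main__":
    for name, q in (("pareto", pareto_quantile), ("lognormal", lognormal_quantile)):
        print(
            name,
            "OOMs above median at p=0.999999:", round(ooms_above_median(q, 0.999999), 2),
            "| slope near:", round(tail_slope(q, 0.9, 0.99), 2),
            "| slope far:", round(tail_slope(q, 0.999, 0.9999), 2),
        )
```

Under these assumed parameters, both distributions span several OOMs, yet the lognormal's log-log slope keeps steepening in the tail, so "many OOMs of difference" does not by itself imply a power law.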

Power Laws of Value

Linch2d7

Thanks a lot for this post! I tried addressing this earlier by exploring "extinction" vs "doom" vs "not utopia," but your writing here is clearer, more precise and more detailed. One alternative framing I have for describing the "power laws of value" hypothesis, as a contrast to your 14-word summary:

"Utopia" by the lights of one axiology or moral framework might be close to worthless under other moral frameworks, assuming an additive axiology.

It's 23 words and has more jargon, but I think it describes my own confusions better. In particular, I don't think you need to believe in "weird stuff" to get to many OOMs of difference between "best possible future" and "realistic future", unless additive/linear axiology itself is weird.

As one simple illustration, humanity can either be correct or incorrect in colonizing the stars with biological bodies instead of digital emulations. Either way, if you're wrong you lose many OOMs of value:

If we decide to go the biological route: biological bodies are much less efficient than digital emulations. It's also much more difficult, as a practical/short-term matter, to colonize stars with bodies, so you capture a smaller fraction of the lightcone.
If we decide to go the digital route, and it turns out emulations don't have meaningful moral value (e.g. at the level of fidelity that emulations are seeded on, digital emulations are in practice not conscious), then we lose ~100.0000% of the value.

Discussion Thread: Existential Choices Debate Week

Linch10d3

29% agree

more because of tractability than any other reason

Emergency pod: Judge plants a legal time bomb under OpenAI (with Rose Chan Loui)

Linch18d*8

To me, "advanc[ing] digital intelligence in the way that is most likely to benefit humanity as a whole" does not necessitate them building AGI at all. Indeed the same mission statement can be said to apply to e.g. Redwood Research.

Further evidence for this view comes from OpenAI's old merge-and-assist clause, which indicates that they'd be willing to fold and assist a different company if the other company is a) within 2 years of building AGI and b) sufficiently good.

Emergency pod: Judge plants a legal time bomb under OpenAI (with Rose Chan Loui)

Linch19d6

They may assert that subsequent developments establish that nonprofit development of AI is financially infeasible, that they are going to lose the AI arms race without massive cash infusions, and that obtaining infusions while the nonprofit is in charge isn't viable. If the signs are clear enough that the mission as originally envisioned is doomed to fail, then switching to a backup mission doesn't seem necessarily unreasonable under general charitable-law principles to me.

I'm confused about this line of argument. Why is losing the AI arms race relevant to whether the mission as originally envisioned is doomed to fail?

I tried to find the original mission statement. Is the following correct?

OpenAI is a non-profit artificial intelligence research company. Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return. Since our research is free from financial obligations, we can better focus on a positive human impact.

If so, I can see how an OpenAI plaintiff can try to argue that "advanc[ing] digital intelligence in the way that is most likely to benefit humanity as a whole" necessitates them "winning the AI arms race", but I don't exactly see why an impartial observer should grant them that.

Linch's Quick takes

Linch23d*9

(x-posted from LW)

Single examples almost never provide overwhelming evidence. They can provide strong evidence, but not overwhelming.

Imagine someone arguing the following:

1. You make a superficially compelling argument for invading Iraq

2. A similar argument, if you squint, can be used to support invading Vietnam

3. It was wrong to invade Vietnam

4. Therefore, your argument can be ignored, and it provides ~0 evidence for the invasion of Iraq.

In my opinion, 1-4 is not reasonable. I think it's just not a good line of reasoning. Regardless of whether you're for or against the Iraq invasion, and regardless of how bad you think the original argument 1 alluded to is, 4 just does not follow from 1-3.
___
Well, I don't know how Counting Arguments Provide No Evidence for AI Doom is different. In many ways the situation is worse:

a. Invading Iraq is more similar to invading Vietnam than overfitting is to scheming.

b. As I understand it, the actual ML history was mixed. It wasn't just counting arguments; many people also believed in the bias-variance tradeoff as an argument for overfitting. And in many NN models, the actual resolution was double descent, which is a very interesting and confusing interaction where, as the ratio of parameters to data points increases, the test error first falls, then rises, then falls again! So the appropriate analogy to scheming, if you take it very literally, is to imagine that first you have goal generalization, then goal misgeneralization, then goal generalization again. But if you don't know which end of the curve you're on, it's scarce comfort.

Should you take the analogy very literally and directly? Probably not. But the less exact you make the analogy, the fewer bits you should be able to draw from it.

---

I'm surprised that nobody else pointed out my critique in the full year since the post was published. The post was both popular and had critical engagement, and I think my criticism is more elementary than the sophisticated counterarguments other people provided. Perhaps I'm missing something.

When I made my arguments verbally to friends, a common response was that they thought the original counting arguments were weak to begin with, so they didn't mind weak counterarguments to them. But I think this is invalid. If you previously strongly believed in a theory, a single counterexample should update you massively (but not all the way to 0). If you previously had very little faith in a theory, a single counterexample shouldn't update you much.
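One way to see this asymmetry is with the odds form of Bayes' rule. The sketch below is illustrative only: the likelihood ratio of 0.1 for "a single counterexample" and the two priors (0.95 and 0.05) are assumed numbers, and `posterior` is a hypothetical helper name.

```python
def posterior(prior, likelihood_ratio):
    """Odds-form Bayes: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Assumed: the counterexample is 10x more likely if the theory is false.
COUNTEREXAMPLE_LR = 0.1

if __name__ == "__main__":
    for prior in (0.95, 0.05):
        post = posterior(prior, COUNTEREXAMPLE_LR)
        print(f"prior={prior:.2f} -> posterior={post:.3f} (shift of {prior - post:.3f})")
```

Under these assumptions the strong believer moves a long way in probability terms but does not reach 0, while the skeptic barely moves. In log-odds terms both shift by the same amount; the asymmetry is in how much probability mass there was to lose.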

Linch's Quick takes

Linch1mo*2

Right, in the definitions above I was mostly thinking of companies and a subset of the empirical AI safety literature, which do use these terms quite differently from how e.g. MIRI or LessWrong will use them.

I think there are three common definitions of the word "alignment" in the traditional AIS literature:

1. Aligned to anything, anything at all (sometimes known as "technical alignment"): In this sense, both perfectly "jailbroken" models and perfectly "corporately aligned" models in the limit count as succeeding at technical alignment, as would success at aligning to more absurd goals like pure profit maximization or diamond maximization. The assumed difficulty here is that even superficially successful strategies may fail in extreme edge cases, after distributional shift, etc. To be clear, this is not globally a "win", but you may wish to restrict the domain of what you work on.

2. Aligned to the interest of all humanity/moral code (this is sometimes just known as "alignment"): I think this is closer to what you mean by the moral code. Under this ontology, one decomposition is that you're able to a) succeed at the technical problem of alignment to arbitrary targets as well as b) figure out what we value (variously known as value-loading, axiology, theory of welfare etc.). Of course, we may also find that this clean decomposition is too hard, and that we can point AIs to a desired morality without being able to point them towards arbitrary targets.

3. Minimally aligned enough to not be a major catastrophic or existential risk: E.g., an AI that is expected to not result in greater than 1 billion deaths (sometimes there's an additional stipulation that the superhuman AIs are sufficiently powerful and/or sufficiently useful as well, to exclude e.g. a rock counting as "aligned").

Traditionally, I believe the first problem is considered more than 50% of the difficulty of the second problem, at least on a technical level.

Linch's Quick takes

Linch1mo*16

Reading the Emergent Misalignment paper and comments on the associated Twitter thread has helped me clarify the distinction^[1] between what companies call "aligned" vs "jailbroken" models.

"Aligned" in the sense that AI companies like DeepMind, Anthropic and OpenAI mean it = aligned to the purposes of the AI company that made the model. Or as Eliezer puts it, "corporate alignment." For example, a user may want the model to help edit racist text or the press release of an asteroid impact startup, but this may go against the desired morals and/or corporate interests of the company that made the model. A corporately aligned model will refuse.

"Jailbroken" in the sense that it's usually used in the hacker etc. literature = approximately aligned to the (presumed) interest of the user. This is why people often find jailbroken models to be valuable. For example, jailbroken models can help users say racist things or build bioweapons, even if this goes against the corporate interests of the AI companies that made the model.

"Misaligned" in the sense that the Emergent Misalignment paper uses it = aligned to neither the interests of the AI's creators nor the users. For example, the model may unprompted try to persuade the user to take a lot of sleeping pills, an undesirable behavior that benefits neither the user nor the creator.

^[1] EDIT: This was made especially crisp/clear to me in discussions of the Emergent Misalignment paper. The authors make a clear distinction between "jailbroken" vs what they call "misaligned" models, though I don't think they call the base models "aligned" (since that'd be wrong in the traditional AI safety lexicon). However, many commentators were confused and thought all the paper contributed was a novel jailbreak, which would of course be much less interesting!

Linch

Posts 75

Comments 2830
