A Playbook for AI Risk Reduction (focused on misaligned AI)

Holden Karnofsky

17 min read · Jun 6, 2023

Comments 17

Sol3:2

Conflict of interest disclosure: my wife is co-founder and President of Anthropic. Please don’t assume things about my takes on specific AI labs due to this.¹⁰

This is really amazing. How much of Anthropic does Daniela own? How much does your brother in law own? If my family was in line to become billionaires many times over due to a certain AI lab becoming successful, this would certainly affect my takes.

finnhambly

Why is this getting downvoted? This comment seems plainly helpful; it's an important thing to highlight.

[anonymous]

Disclosing a conflict of interest demonstrates explicit awareness of potential bias. It's often done to make sure the reader tries to weigh the merits of the content by itself. Your comment shows me that you have (perhaps) not done so, by ignoring the points the author argued. If you see any evidence of bias in the takes in the article/post, can you be more specific? That way, the author is given an honest chance to defend his viewpoint.

finnhambly

I don't think this disclosure shows that much awareness, as the notes seem to dismiss it as a problem, unless I'm misunderstanding what Holden means by "don’t assume things about my takes on specific AI labs due to this". It sounds like he's claiming he's able to assess these things neutrally, which is quite a big claim!

Holden Karnofsky

Sorry, I didn't mean to dismiss the importance of the conflict of interest or say it isn't affecting my views.

I've sometimes seen people reason along the lines of "Since Holden is married to Daniela, this must mean he agrees with Anthropic on specific issue X," or "Since Holden is married to Daniela, this must mean that he endorses taking a job at Anthropic in specific case Y." I think this kind of reasoning is unreliable and has been incorrect in more than one specific case. That's what I intended to push back against.

Rebecca

It's often done to make sure the reader tries to weigh the merits of the content by itself.

My understanding is that it's usually meant to serve the opposite purpose: to alert readers to the possibility of bias so they can evaluate the content with that in mind and decide for themselves whether they think bias has crept in. The alternative is people being alerted to the CoI in the comments and being angry about the quite relevant information being kept from them, not that they would otherwise still know about the bias and not be able to evaluate the article well because of it.

Sol3:2

But he did not, in fact, disclose the conflict of interest. "My wife is President of Anthropic" means nothing in and of itself without some good idea of what stake she actually owns.

Holden Karnofsky

I expected readers to assume that my wife owned significant equity in Anthropic; I've now edited the post to state this explicitly (and also added a mention of her OpenAI equity, which I should've included before and have included in the past). I don't plan to disclose the exact amount and don't think this is needed for readers to have sufficient context on my statements here.

Sol3:2

I very strongly disagree. There's a huge difference between $1 billion (Jaan Tallinn money) and many tens of billions (Dustin Moskovitz money or even Elon money). You know this better than anyone. Jaan is a smart guy and can spend his money carefully and well to achieve some cool things, but Dustin can quite literally single-handedly bankroll an entire elite movement across multiple countries. Money is power even if - especially if! - you plan to give it away. The exact amount that Dario and Daniela - and of course you by association - might wind up with here is extremely relevant. If it's just a billion or two even if Anthropic does very well, fair enough, I wouldn't expect that to influence your judgment much at all given that you already exercise substantial control over the dispersion of sums many times in excess of that. If it's tens of billions, this is a very different story and we might reasonably assume that this would in fact colour your opinions concerning Anthropic. Scale matters!

JP Addison🔸

Moderator comment

Hi Sol3:2. This comment raises an issue that appears to have struck a chord with many readers. However, we believe that it is unnecessarily sarcastic and breaks Forum norms. This is a warning; please do better in the future.

NunoSempere

I think you may have a model where you don't want to have comments above a given level of rudeness/sarcasm/impoliteness/political incorrectness, etc. However, I would prefer that you had a model where you give a warning or a ban if a comment or a user exceeds some rudeness - value threshold, as I think that would provide more value: I would want to have the rude comments if they produce enough value to be worth it.

And I think that you do want to have the disagreeable people push back, to discourage fake group consensus.

Sol3:2

I really don't even think my comment was rude. Absolutely massive "not allowed to tackle Carter in this game" energy on the EA forum tbqh

NunoSempere

I mean, I think it exceeds some level of rudeness in that you consider the hypothesis that Karnofsky might not be an impeccable boy scout, which some people might consider to be rude. But I also think that it's fine to exceed that threshold, so ¯\_(ツ)_/¯

Sol3:2

You must be joking. What on earth was sarcastic?

Let's go line by line.

"This is really amazing" (I was amazed).

"How much of Anthropic does Daniela own? How much does your brother in law own?" (both entirely sincere and legitimate questions - this is very material nonpublic knowledge)

"If my family was in line to become billionaires many times over due to a certain AI lab becoming successful, this would certainly affect my takes." (this is literally true - I know myself quite well and am very confident that my reasoning would be incredibly motivated if I personally had billions, perhaps tens of billions on the line)

Rebecca

I think the assumption is that most people already knew about the facts disclosed

michel

Thanks for sharing this! I think it's great you made this public.

[anonymous]

A single really convincing demonstration of something like deceptive alignment could make a big difference to the case for standards and monitoring (next section).

This struck me as a particularly good example of a small improvement having a meaningful impact. On a personal note, seeing the example of deceptive alignment you wrote would make me immediately move to the hit-the-emergency-brakes/burn-it-all-down camp. I imagine that many would react in a similar way, which might place a lot of pressure on AI labs to collectively start implementing some strict (not just for show) standards.
