Nudge to seriously consider applying for 80,000 Hours personal advising if you haven't already: https://80000hours.org/speak-with-us/
My guess is they'd be able to help you think this through!
Ah thanks Greg! That's very helpful.
I certainly agree that the target is relatively small, in the space of all possible goals to instantiate.
But luckily we aren't picking at random: we're deliberately trying to aim for that target, which makes me much more optimistic about hitting it.
Another reason for optimism is that, yes, in some sense I see the AI as neutral (neither aligned nor misaligned) at the start of training. Actually, I would agree that it's misaligned at the start of training, but what's missing initially are the capabilities that make that misalignment dangerous. Put another way, it's acceptable for early systems to be misaligned, because they can't cause an existential catastrophe. It's only by the time a system could take power if it tried that it's essential it doesn't want to.
These two reasons make me much less sure that catastrophe is likely. It's still a very live possibility, but these reasons for optimism make me feel more like "unsure" than "confident of catastrophe".
Regarding whether they have the same practical implications, I guess I agree that if everyone had a 90% credence in catastrophe, that would be better than them having 50% credence or 10%.
Inasmuch as you're right that the major players have a 10% credence in catastrophe, we should push either to raise it or to advocate for more caution given the stakes.
My worry is that they don't actually have that 10% credence, despite maybe saying they do, and that coming across as more extreme might stop them from listening.
I think you might be right that if we can make the case for 90%, we should make it. But I worry we can't.
Ah I think I see the misunderstanding.
I thought you were invoking "Murphy's Law" as a principle that should generally be relied upon - I thought you were saying that, in general, a security mindset should be used.
But I think you're saying that in the specific case of AGI misalignment, there is a particular reason to apply a security mindset, or to expect Murphy's Law to hold.
Here are three things I think you might be trying to say:
I would agree with 1 - that once you have created sufficiently powerful misaligned AI systems, catastrophe is highly likely.
But I don't understand the reason to think that 2 and especially 3 are both true. That's why I'm not confident in catastrophe: I think it's plausible that we end up using a training method that creates aligned AI even though the way we chose that training method wasn't watertight.
You seem to think that it's very likely that we won't end up with a good enough training setup, but I don't understand why.
Looking forward to your reply!
Plans that involve increasing AI input into alignment research appear to rest on the assumption that they can be grounded by a sufficiently aligned AI at the start. But how does this not just result in an infinite, error-prone, regress? Such “getting the AI to do your alignment homework” approaches are not safe ways of avoiding doom.
On this point, the initial AIs needn't be actually aligned, I think. They could, for example, do useful alignment work that we can use even though they are "playing the training game"; they might want to take over, but don't have enough influence yet, so are just doing as we ask for now. (More)
(Clearly this is not a safe way of reliably avoiding catastrophe. But it's an example of a way that it's at least plausible we avoid catastrophe.)
"Applying a security mindset" means looking for ways that something could fail. I agree that this is a useful technique for preventing any failures from happening.
But I'm not sure that this is a sound principle to rely on when trying to work out how likely it is that something will go wrong. In general, Murphy’s Law is not correct. It's not true that "anything that can go wrong will go wrong".
I think this is part of the reason I'm sceptical of confident predictions of catastrophe, like your 90% - it's plausible to me that things could work out okay, and I don't see why Murphy's Law/security mindset should be invoked to assume they won't.
Given that it's much easier to argue that the risk of catastrophe is unacceptably high (say >10%), and this has the same practical implications, I'd suggest that you argue for that instead of 90%, which to me (and others) seems unjustified. I'm worried that if people think you're unjustifiably confident in catastrophe, they'll conclude that you're wrong and that they don't need to pay attention to the concerns you raise at all.
Even if it's true that it can be hard to agree or disagree with a post as a whole, I do get the impression that people sometimes feel like they disagree with posts as a whole, and so simply downvote the post.
Also, I suspect it is possible to disagree with a post as a whole. Many posts are structured like "argument 1, argument 2, argument 3, therefore conclusion". If you disagree with the conclusion, I think it's reasonable to say that that's disagreeing with the post as a whole. If you agree with the arguments and the conclusion, then you agree with the post as a whole. Otherwise, I agree it's more ambiguous, but people then have the option of not agree/disagree-voting.
What's the downside of allowing this post-level agree/disagree voting that the team is worried about? (I'm not saying there isn't one, just that I can't immediately see it.)