And you've already agreed that it's implausible that these efforts would lead to tyranny; you think they will just fail.
I think that conditional on the efforts working, the chance of tyranny is quite high (ballpark 30-40%). I don't think they'll work, but if they do, it seems quite bad.
And since I think x-risk from technical AI alignment failure is in the 1-2% range, the risk of tyranny is the dominant effect of "actually enforced global AI pause" in my EV calculation, followed by the extra fast takeoff risks, and then followed by "maybe we get net positive alignment research."
I have now made a clarification at the very top of the post to make it 1000% clear that my opposition is disjunctive, because people repeatedly get confused / misunderstand me on this point.
Please stop saying that mind-space is an "enormously broad space." What does that even mean? How have you established a measure on mind-space that isn't totally arbitrary?
What if concepts and values are convergent when trained on similar data, just like we see convergent evolution in biology?
I think this post is best combined with my post. Together, these posts present a coherent, disjunctive set of arguments against pause.
My opposition is disjunctive!
I think both that, if it's possible to stop the building of dangerously large models via international regulation, that would be bad because of tyranny risk, and that we very likely can't use international regulation to stop building these things, so any local pauses are not going to have their intended effects and will have a lot of unintended net-negative effects.
(Also, reread my piece - I call for action to regulate and stop larger and more dangerous models immediately as a prelude to a global moratorium. I didn't say "wait a while, then impose a pause for a while in a few places.")
This really sounds like you are committing the fallacy I was worried about earlier on. I just don't agree that you will actually get the global moratorium. I am fully aware of what your position is.
In my essay I don't assume that the pause would be immediate, because I did read your essay and I saw that you were proposing that we'd need some time to prepare and get multiple countries on board.
I don't see how a delay before a pause changes anything. I still think it's highly unlikely you're going to get sufficient international backing for the pause, so you will either end up doing a pause with an insufficiently large coalition, or you'll back down and do no pause at all.
Differentiability is a pretty big part of the white box argument.
The terabyte compiled executable binary is still white box in a minimal sense but it's going to take a lot of work to mould that thing into something that does what you want. You'll have to decompile it and do a lot of static analysis, and Rice's theorem gets in the way of the kinds of stuff you can prove about it. The code might be adversarially obfuscated, although literal black box obfuscation is provably impossible.
If instead of a terabyte of compiled code, you give me a trillion neural net weights, I can fine tune that network to do a lot of stuff. And if I'm worried about the base model being preserved underneath and doing nefarious things, I can generate synthetic data from the fine tuned model and train a fresh network from scratch on that (although to be fair that's pretty compute-intensive).
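To make the "fine-tune, then retrain a fresh network on synthetic data" idea concrete, here is a toy sketch. Everything in it is an illustrative assumption: the "network" is just a three-weight linear model, and the names `train`, `predict`, and `sample_inputs` are made up for this example. It only shows the shape of the procedure, not that it removes hidden behavior in a real model.

```python
import random

def predict(weights, x):
    """Output of a linear model for one input vector."""
    return sum(w * xi for w, xi in zip(weights, x))

def train(inputs, targets, weights, lr=0.05, steps=500):
    """Plain gradient descent on mean squared error."""
    w = list(weights)
    n = len(inputs)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(inputs, targets):
            err = predict(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def sample_inputs(rng, n, dim=3):
    """Random inputs; stands in for whatever prompts you would sample."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)

# Hypothetical base model: weights we were handed and did not choose.
base_weights = [rng.gauss(0.0, 1.0) for _ in range(3)]

# Step 1: fine-tune the base model toward the behavior we want.
target_weights = [1.0, -2.0, 0.5]  # assumed "desired behavior"
x_tune = sample_inputs(rng, 100)
y_tune = [predict(target_weights, x) for x in x_tune]
tuned_weights = train(x_tune, y_tune, base_weights)

# Step 2: generate synthetic data using only the fine-tuned model's outputs.
x_synth = sample_inputs(rng, 100)
y_synth = [predict(tuned_weights, x) for x in x_synth]

# Step 3: train a fresh model from scratch (zero init) on that data.
fresh_weights = train(x_synth, y_synth, [0.0, 0.0, 0.0])
```

The fresh model never sees the base weights, only input/output pairs from the fine-tuned model. Step 3 is also where the extra compute cost shows up: it is a full training run rather than a fine-tune.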
It's not obvious to me what alignment optimism has to do with the pause debate.
Sorry, I thought it would be fairly obvious how it's related. If you're optimistic about alignment then the expected benefits you might hope to get out of a pause (whether or not you actually do get those benefits) are commensurately smaller, so the unintended consequences should have more relative weight in your EV calculation.
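A rough way to see the arithmetic, as a sketch only: the 1.5% figure is the midpoint of the 1-2% range I gave elsewhere in this thread, and every other number (the 50% pessimist estimate, the 30% risk reduction, the 1% side-effect cost) is a made-up placeholder, not an estimate I'm defending. The function name `pause_net_value` is likewise just for illustration.

```python
def pause_net_value(p_misalignment, risk_reduction, side_effect_cost):
    """Net change in catastrophe probability averted by a pause.

    p_misalignment:   probability of catastrophe from alignment failure
    risk_reduction:   fraction of that risk the pause is assumed to remove
    side_effect_cost: added catastrophe probability from unintended
                      consequences (assumed independent of p_misalignment)
    """
    benefit = p_misalignment * risk_reduction
    return benefit - side_effect_cost

# Hypothetical placeholder values, identical for both views.
RISK_REDUCTION = 0.30
SIDE_EFFECT_COST = 0.01

optimist = pause_net_value(0.015, RISK_REDUCTION, SIDE_EFFECT_COST)
pessimist = pause_net_value(0.50, RISK_REDUCTION, SIDE_EFFECT_COST)

print(f"optimist net value:  {optimist:+.4f}")
print(f"pessimist net value: {pessimist:+.4f}")
```

With the same side-effect cost, the sign of the result flips depending on how likely you think alignment failure is, which is why the two debates are connected.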
To be clear, I think slowing down AI in general, as opposed to the moratorium proposal in particular, is a more reasonable position that's a bit harder to argue against. I do still think the overhang concerns apply in non-pause slowdowns but in a less acute manner.
It's essentially no cost to run a gradient-based optimizer on a neural network, and I think this is sufficient for good-enough alignment. I view the interpretability work I do at Eleuther as icing on the cake, allowing us to steer models even more effectively than we already can. Yes, it's not zero cost, but it's dramatically lower cost than it would be if we had to crack open a skull and do neurosurgery.
Also, if by "mechanistic interpretability" you mean "circuits," I'm honestly pretty pessimistic about the usefulness of that kind of research, and I think the really useful stuff is lower cost than circuits-based interp.
That if there was a pause, alignment research would magically revert back to what it was back in the MIRI days
The claim is more like, "the MIRI days are a cautionary tale about what may happen when alignment research isn't embedded inside a feedback loop with capabilities." I don't literally believe we would revert back to pure theoretical research during a pause, but I do think the research would get considerably lower quality.
However, I'm worried that your [white box] framing is confusing and will cause people to talk past each other.
Perhaps, but I think the current conventional wisdom that neural nets are "black box" is itself a confusing and bad framing and I'm trying to displace it.