Thanks for the thoughtful responses!
a guy who attempted a home-built nuclear reactor
Ha!
a licensing system around AGIs,
Well, I have in mind something more like banning the pursuit of a certain class of research goals.
the fact that evidence is fairly easy to come by and that intent with those is fairly easy to prove.
Hm. This provokes a further question:
Are there successful regulations that apply to activity that is purely mental (I mean, including speech, but not anything more kinetic) and also does not involve an intention to commit a crime? Doing research on how to build a nuke with the intention of blowing up a bunch of people would be purely mental, but involves an intention to commit a crime. Hitting a pedestrian with your car by accident doesn't involve an intention to commit a crime, but is obviously not purely mental.
To put it another way: is there some X and Y such that conspiring to do Y is illegal, even though doing Y is not in itself illegal, because doing Y would tend to cause X? I'm not sure this question even makes sense, but I ask it because a ban might want to say "no, you're not allowed to run a giant unbounded algorithmic search on this giant dataset, because our scientific advisors said that this might result in accidentally creating an AGI, and creating an AGI is illegal". Maybe that's just the same as "Y is illegal".
For example, is there something analogous to: "it's illegal to do theoretical research aimed at making a hyper-potent dopamine agonist, because even if that hypothetical drug isn't illegal right now, it would be extremely hazardous and would become illegal"?
Epistemic deference is just obviously parasitic, a sort of dreadful tragedy of the commons of the mind.
I don't think this is right. One has to defer quite a lot; most of what we do is, appropriately, deference in one way or another. The world is complicated, there's far more information than any one person can process, and our problems are high-context (that is, they require compressing and abstracting from a lot of information). Coordination matters too.
A blunt-force "just defer less" is therefore not viable. What does seem viable is a more detailed understanding of what goes wrong in specific cases of deference, which opens the possibility of deferring less precisely where deference is most harmful, and of alleviating those dangers.
One horse-sized duck AI. For one thing, the duck is the ultimate (route) optimization process: you can ride it on land, sea, or air. For another, capabilities scale very nonlinearly in size; the neigh of even 1000 duck-sized horse AIs does not compare to the quack of a single horse-sized duck AI. Most importantly, if you can safely do something with 100 opposite-sized AIs, you can safely do the same thing with one opposite-sized AI.
In all seriousness though, we don't generally think in terms of "proving the friendliness" of an AI system. When doing research, we might prove that certain proposals have flaws (for example, see (1)) as a way of eliminating bad ideas in the pursuit of good ideas. And given a realistic system, one could likely prove certain high-level statistical features (such as “this component of the system has an error rate that vanishes under thus-and-such assumptions”), though it’s not yet clear how useful those proofs would be. Overall, though, the main challenges in friendly AI seem to be ones of design rather than verification. In other words, the problem is to figure out what properties an aligned system should possess, rather than to figure out how to prove them; and then to design a system that satisfies those properties. What properties would constitute friendliness, and what assumptions about the world do they rely on? I expect that answering these questions in formal detail would get us most of the way towards a design for an aligned AI, even if the only guarantees we can give about the actual system are statistical guarantees.
(1) Soares, Nate, et al. "Corrigibility." Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.
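To make the statistical-guarantee point above a bit more concrete, here is a toy example of the flavor of statement I have in mind (my own illustration, not a claim about any actual system): suppose a component aggregates n independent subcomponents by majority vote, and each subcomponent is correct with probability p > 1/2. Then a Hoeffding-style bound gives

$$\Pr[\text{majority vote is wrong}] \le \exp\!\big(-2n\,(p - \tfrac{1}{2})^2\big),$$

which vanishes as n grows; the content of such a guarantee lives almost entirely in the thus-and-such assumptions (here, independence and p > 1/2).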
In my current view, MIRI’s main contributions are (1) producing research on highly capable aligned AI that academia and industry won’t produce by default; (2) helping steer academia and industry towards working on aligned AI; and (3) producing strategic knowledge of how to reduce existential risk from highly capable AI. I think (1) and (3) are MIRI’s current strong suits. This is not easy to verify without technical background and domain knowledge, but at least for my own thinking I’m impressed enough with these points to find MIRI very worthwhile to work with.
If (1) were not strong, and (2) were no stronger than it is currently, I would trust (3) somewhat less, and I would give up on MIRI. If (1) became difficult or impossible because (2) had been accomplished, i.e., if academia and/or industry were already doing all the important safety research, I’d see MIRI as much less crucial, unless it pivoted to the remaining neglected tasks in reducing existential risk from AI. If (2) looked too difficult (though there is already significant success, in part due to MIRI, FHI, and FLI), and (1) were not proceeding fast enough, and my “time until game-changing AI” estimates were short enough, then I’d probably do something different.
Scott Garrabrant’s logical induction framework feels to me like a large step forward. It provides a model of “good reasoning” about logical facts using bounded computational resources, and that model is already producing preliminary insights into decision theory. In particular, we can now write down models of agents that use logical inductors to model the world, and in some cases these agents learn to have sane beliefs about their own actions, other agents’ actions, and how those actions affect the world. This holds despite the usual obstacles to self-modeling.
Further, the self-trust result from the paper can be interpreted as saying that a logical inductor believes something like "If my future self is confident in the proposition A, then A is probably true". This seems like one of the insights the PPRHOL work was aiming at, namely, writing down a computable reasoning system that asserts a formal reflection principle about itself. Such a reflection principle must be weaker than full logical soundness; by Löb's theorem, a system that proved "If my future self proves A, then A is true" would be inconsistent. But as it turns out, the reflection principle becomes feasible if you replace "proves" with "assigns high probability to" and replace "true" with "probably true".
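To make the contrast concrete (this is my own schematic rendering, not a verbatim statement of either result): by Löb's theorem,

$$\text{if } T \supseteq \mathrm{PA} \text{ is recursively axiomatized and } T \vdash \mathrm{Prov}_T(\ulcorner A \urcorner) \rightarrow A \text{ for every sentence } A, \text{ then } T \text{ is inconsistent,}$$

whereas the probabilistic self-trust property is, roughly, of the form

$$\mathbb{P}_n\!\big(A \mid \mathbb{P}_{n+k}(A) > p\big) \gtrsim p,$$

i.e., conditional on its future self being confident in A, the logical inductor's current beliefs already assign A a correspondingly high probability.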
It is an active area of research to understand logical induction more deeply, and to apply it to decision-theoretic problems that require reflective properties. For example, the current framework uses "traders" that express their "beliefs" as strategies for making trades against the market prices (probabilities) output by a logical inductor; traders are then rewarded for buying shares in sentences that later market prices value highly. It would be nice to understand this process in Bayesian terms, e.g., with traders as hypotheses that output predictions about the market and whose probabilities are updated by conditionalization.
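As a toy illustration of the trading mechanism (a drastic simplification of my own in Python; the actual construction is much more involved, e.g. it takes a fixed point over all efficiently computable traders):

```python
# Toy sketch of traders betting against a market over sentences.
# A share in a sentence pays out that sentence's eventual value
# (1.0 if it is later proved, 0.0 if it is later refuted).

from dataclasses import dataclass, field

@dataclass
class Trader:
    belief: dict                                  # sentence -> subjective probability
    cash: float = 0.0
    shares: dict = field(default_factory=dict)    # sentence -> net shares held

    def trade(self, prices, margin=0.05):
        """Buy one share where the market looks too cheap, sell one where too dear."""
        for s, price in prices.items():
            b = self.belief.get(s, price)
            if b > price + margin:                # under-priced relative to belief
                self.cash -= price
                self.shares[s] = self.shares.get(s, 0) + 1
            elif b < price - margin:              # over-priced relative to belief
                self.cash += price
                self.shares[s] = self.shares.get(s, 0) - 1

def settle(trader, payoffs):
    """Value the trader's holdings against later prices / resolved truth values."""
    return trader.cash + sum(n * payoffs.get(s, 0.0)
                             for s, n in trader.shares.items())

# Day-n market prices for two sentences, and how they eventually resolve.
prices  = {"phi": 0.5, "psi": 0.5}
payoffs = {"phi": 1.0, "psi": 0.0}

optimist = Trader(belief={"phi": 0.9, "psi": 0.9})
realist  = Trader(belief={"phi": 0.9, "psi": 0.1})
for t in (optimist, realist):
    t.trade(prices)

print(settle(optimist, payoffs), settle(realist, payoffs))  # 0.0 1.0
# The logical induction criterion says, roughly, that no efficiently
# computable trader can extract unbounded profit from the market this way.
```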
Ah, yeah, that sounds close to what I'm imagining, thank you.