I agree that "defer less" is good advice for EAs, but that's because EAs are especially deferent, and also especially care about getting things right and might actually do something sane about it. I think part of doing something sane about it is to have a detailed model of deference.
Epistemic deference is just obviously parasitic, a sort of dreadful tragedy of the commons of the mind.
I don't think this is right. One has to defer quite a lot; what we do is, appropriately, mostly deferring, in one way or another. The world is so complicated, there's so much information to process, and our problems are high-context (that is, require compressing and abstracting from a lot of information). Also coordination is important. I think a blunt-force "just defer less" is therefore not viable. Instead, having a more detailed understanding of what's undesirable about specific cases of deference opens the possibility of more specifically deferring less when it's most undesirable, and alleviating those dangers.
The Logical Induction paper involved multiple iterations of writing and reviewing, during which we refined the notation, terminology, proof techniques, theorem statements, etc. We also had a number of others comment on various drafts, pointing out wrong or unclear parts.
One horse-sized duck AI. For one thing, the duck is the ultimate (route) optimization process: you can ride it on land, sea, or air. For another, capabilities scale very nonlinearly in size; the neigh of even 1000 duck-sized horse AIs does not compare to the quack of a single horse-sized duck AI. Most importantly, if you can safely do something with 100 opposite-sized AIs, you can safely do the same thing with one opposite-sized AI.
In all seriousness though, we don't generally think in terms of "proving the friendliness" of an AI system. When doing research, we might prove that certain proposals have flaws (for example, see (1)) as a way of eliminating bad ideas in the pursuit of good ideas. And given a realistic system, one could likely prove certain high-level statistical features (such as “this component of the system has an error rate that vanishes under thus-and-such assumptions”), though it’s not yet clear how useful those proofs would be. Overall, though, the main challenges in friendly AI seem to be ones of design rather than verification. In other words, the problem is to figure out what properties an aligned system should possess, rather than to figure out how to prove them; and then to design a system that satisfies those properties. What properties would constitute friendliness, and what assumptions about the world do they rely on? I expect that answering these questions in formal detail would get us most of the way towards a design for an aligned AI, even if the only guarantees we can give about the actual system are statistical guarantees.
(1) Soares, Nate, et al. "Corrigibility." Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.
In my current view, MIRI’s main contributions are (1) producing research on highly-capable aligned AI that won’t be produced by default by academia or industry; (2) helping steer academia and industry towards working on aligned AI; and (3) producing strategic knowledge of how to reduce existential risk from highly-capable AI. I think (1) and (3) are MIRI’s current strong suits. This is not easy to verify without technical background and domain knowledge, but at least for my own thinking I’m impressed enough with these points to find MIRI very worthwhile to work with.
If (1) were not strong, and (2) were no stronger than currently, I would trust (3) somewhat less, and I would give up on MIRI. If (1) became difficult or impossible because (2) was done, i.e. if academia and/or industry were already doing all the important safety research, I’d see MIRI as much less crucial, unless there was a pivot to remaining neglected tasks in reducing existential risk from AI. If (2) looked too difficult (though there is already significant success, in part due to MIRI, FHI, and FLI), and (1) were not proceeding fast enough, and my “time until game-changing AI” estimates were small enough, then I’d probably do something different.
Scott Garrabrant’s logical induction framework feels to me like a large step forward. It provides a model of “good reasoning” about logical facts using bounded computational resources, and that model is already producing preliminary insights into decision theory. In particular, we can now write down models of agents that use logical inductors to model the world---and in some cases these agents learn to have sane beliefs about their own actions, other agents’ actions, and how those actions affect the world. This, despite the usual obstacles to self-modeling.
Further, the self-trust result from the paper can be interpreted to say that a logical inductor believes something like “If my future self is confident in the proposition A, then A is probably true”. This seems like one of the insights that the PPRHOL work was aiming at, namely, writing down a computable reasoning system that asserts a formal reflection principle of itself. Such a reflection principle must be weaker than full logical soundness; a system that proved “If my future self proves A, then A is true” would be inconsistent. But as it turns out, the reflection principle is feasible if you replace “proves” with “assigns high probability to” and replace “true” with “probably true”.
It is an active area of research to understand logical induction more deeply, and to apply it to decision-theoretic problems that require reflective properties. For example, the current framework uses “traders” that express their “beliefs” in terms of strategies for making trades against the market prices (probabilities) output by a logical inductor. Then traders are rewarded based on buying shares in sentences that turn out to be highly valued by later market prices. It would be nice to understand this process in Bayesian terms, e.g., where traders are hypotheses that output predictions about the market and have their probabilities updated by Bayesian updating.