From time to time, someone makes the case for why transparency in reasoning is important. The latest conceptualization is Epistemic Legibility by Elizabeth, but the core concept is similar to reasoning transparency used by OpenPhil, and also has some similarity to A Sketch of Good Communication by Ben Pace.

I'd like to offer a gentle pushback. The tl;dr is in my comment on Ben's post, but it seems useful enough for a standalone post.
 

“How odd I can have all this inside me and to you it's just words.” ― David Foster Wallace


When and why reasoning legibility is hard

Say you demand transparent reasoning from AlphaGo. The algorithm has roughly two parts: tree search and a neural network. Tree search reasoning is naturally legible: the "argument" is simply a sequence of board states. In contrast, the neural network is mostly illegible - its output is a figurative "feeling" about how promising a position is, but that feeling depends on the aggregate experience of a huge number of games, and it is extremely difficult to explain transparently how a particular feeling depends on particular past experiences. So AlphaGo would be able to present part of its reasoning to you, but not the most important part.[1]

Human reasoning uses both: cognition similar to tree search (where the steps can be described, written down, and explained to someone else) and processes not amenable to introspection (which function essentially as a black box that produces a "feeling"). People sometimes call these latter signals “intuition”, “implicit knowledge”, “taste”, “S1 reasoning” and the like. Explicit reasoning often rides on top of this.

Extending the machine learning metaphor, the problem with human interpretability is that "mastery" in a field often consists precisely in having some well-trained black box neural network that performs fairly opaque background computations.

 

Bad things can happen when you demand explanations from black boxes

The second thesis is that it often makes sense to assume the mind runs distinct computational processes: one that actually makes decisions and reaches conclusions, and another that produces justifications and rationalizations.

In my experience, if you have good introspective access to your own reasoning, you may occasionally notice that a conclusion C depends mainly on some black box, but at the same time, you generated a plausible legible argument A for the same conclusion after you reached the conclusion C. 

If you try running, say, Double Crux over such situations, you'll notice that even if someone refutes the explicit reasoning A, you won't quite change the conclusion to ¬C. The legible argument A was not the real crux. Quite often, A is essentially fake (or low-weight), whereas the black box is hiding a reality-tracking model.

Stretching the AlphaGo metaphor a bit: AlphaGo could easily be modified to find a few specific game "rollouts" that appear to "explain" the mysterious signal from the neural network. Using tree search, it would produce a few specific examples of how such a position may evolve, selected to agree with the neural net's prediction. If AlphaGo showed them to you, it might convince you! But you would get a completely superficial understanding of why it evaluates the situation the way it does, or why it makes certain moves.
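
To make the cherry-picking concrete, here is a minimal toy sketch in Python. It is not how AlphaGo works: `black_box_value`, `sample_rollout`, and `cherry_picked_explanation` are hypothetical names invented for illustration, positions are just integers, and rollout outcomes are coin flips that are deliberately independent of the black box. The point is only that every rollout shown to the audience agrees with the black box, even though roughly half of the sampled ones did not.

```python
import random


def black_box_value(position: int) -> float:
    """Hypothetical stand-in for a value network: an opaque score in [0, 1]."""
    # The "feeling" is a fixed pseudo-random number derived from the position.
    return random.Random(position).random()


def sample_rollout(rng: random.Random, length: int = 5):
    """Toy rollout: a random move sequence plus a coin-flip outcome."""
    moves = [rng.randrange(10) for _ in range(length)]
    won = rng.random() < 0.5  # independent of the black box by construction
    return moves, won


def cherry_picked_explanation(position: int, n_samples: int = 200,
                              n_shown: int = 3, seed: int = 0):
    """Return only rollouts whose outcome matches the black-box verdict."""
    rng = random.Random(seed)
    predicted_win = black_box_value(position) >= 0.5
    rollouts = [sample_rollout(rng) for _ in range(n_samples)]
    agreeing = [r for r in rollouts if r[1] == predicted_win]
    # Fraction of all sampled rollouts that happened to agree.
    agree_fraction = len(agreeing) / n_samples
    return predicted_win, agreeing[:n_shown], agree_fraction


if __name__ == "__main__":
    verdict, shown, fraction = cherry_picked_explanation(position=42)
    print("black box says win:", verdict)
    for moves, won in shown:
        print("shown rollout:", moves, "won" if won else "lost")
    print("fraction of sampled rollouts that agreed:", fraction)
```

The displayed rollouts look like a unanimous, legible argument, but they carry no information about why the black box produced its verdict.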

 

Risks from the legibility norm

When you make a strong norm pushing for too straightforward "epistemic legibility", you risk several bad things:

First, you increase the pressure on the "justification generator" to mask various black boxes by generating arguments supporting their conclusions.

Second, you make individual people dumber. Imagine asking a Go grandmaster to transparently justify his moves to you, and to play the moves that are best justified - if he tries to play that way, he will become a much weaker player. A similar thing applies to AlphaGo - if you allocate computational resources in such a way that a much larger fraction is consumed by tree search at each position, and less of the neural network is used overall, you will get worse outputs.

Third, there's a risk that people get convinced based on bad arguments - because their "justification generator" generated a weak legible explanation, you managed to refute it, and they updated. The problem comes if this involves discarding the output of the neural network, which was much smarter than the reasoning they accepted.
 

What we can do about it

My personal impression is that society as a whole would benefit from more transparent reasoning on the margin. 

What I'm not convinced of, at all, is that trying to reason much more transparently is a good goal for aspiring rationalists, or that some naive (but memetically fit) norms around epistemic legibility should spread.

To me, it makes sense for some people to specialize in very transparent reasoning. On the other hand, it also makes sense for some people to mostly "try to be better at Go", because legibility has various hidden costs.

A version of transparency that seems more robustly good to me is the one that takes legibility to a meta level. It's perfectly fine to refer to various non-interpretable processes and structures, but we should ideally add a description of what data they are trained on (e.g. “I played at the national level”). At the same time, if such black-box models outperform legible reasoning, it should be considered fine and virtuous to use models which work. You should play to win, if you can.

Examples

An example of a common non-legible communication:

A: Can you explain why you feel that getting this person to implement a "Getting Things Done" system is not a good idea?

B: I don't know exactly, I feel it won't do him any good.

An example of how to make the same conversation worse by naive optimization for legibility:

A: Can you explain why you feel that getting this person to implement a "Getting Things Done" system is not a good idea?

B: I read a thread on Twitter yesterday where someone explained that research on similar motivational techniques does not replicate, and also another thread where someone referenced research that people who over-organize their lives are less creative.

A: Those studies are pretty weak though.

B: Ah I guess you’re right. 

An example of how to actually improve the same conversation by striving for legibility:

A: Can you explain why you feel that getting this person to implement a "Getting Things Done" system is not a good idea?

B: I guess I can't explain it transparently to you. My model of this person just tells me that there is a fairly high risk that teaching them GTD won't have good results. I think it's based on experience with a hundred people I've met on various courses who are trying to have a positive impact on the world. Also, when I had similar feelings in the past, it turned out they were predictive in more than half of the cases.
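
B's last two sentences amount to a small piece of bookkeeping: say what the black box was trained on, and report how often its past outputs turned out right. A minimal sketch of that bookkeeping in Python follows; `IntuitionRecord`, `track_record`, and `meta_legible_report` are hypothetical names, and the records are made-up placeholders rather than real data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class IntuitionRecord:
    """One past black-box prediction and whether it turned out right."""
    claim: str
    came_true: bool


def track_record(records: List[IntuitionRecord]) -> Tuple[int, int, Optional[float]]:
    """Return (hits, total, hit_rate); hit_rate is None with no records."""
    total = len(records)
    hits = sum(r.came_true for r in records)
    rate = hits / total if total else None
    return hits, total, rate


def meta_legible_report(trained_on: str, records: List[IntuitionRecord]) -> str:
    """Describe the black box by its training data and its track record."""
    hits, total, rate = track_record(records)
    if rate is None:
        return f"Black box trained on: {trained_on}. No track record yet."
    return (f"Black box trained on: {trained_on}. "
            f"Past hit rate: {hits}/{total} ({rate:.0%}).")


if __name__ == "__main__":
    # Placeholder log, purely illustrative.
    log = [
        IntuitionRecord("system A won't stick for person 1", True),
        IntuitionRecord("system A won't stick for person 2", False),
        IntuitionRecord("system B won't stick for person 3", True),
    ]
    print(meta_legible_report("about a hundred course participants", log))
```

The report does not open the black box; it only makes its provenance and reliability legible, which is the kind of transparency the third conversation illustrates.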


If you've always understood the terms "reasoning transparency" or "epistemic legibility" in the spirit of the third conversation, and your epistemology routinely involves steps like "I'm going to trust this black-box trained on lots of data a lot more than this transparent analysis based on published research", then you're probably safe.


How this looks in practice

In my view, it is pretty clear that some of the main cruxes of current disagreements about AI alignment are beyond the limits of legible reasoning. (The current limits, anyway.) 

In my view, some of these intuitions have roughly the "black-box" form explained above. If you try to understand the disagreements between e.g. Paul Christiano and Eliezer Yudkowsky, you often end up in a situation where the real difference is "taste", which influences how much weight they give to arguments, how good or bad they judge various future "board positions" to be, etc. Both Eliezer and Paul are extremely smart, have spent more than a decade thinking about AI safety and even more time on relevant topics such as ML or decision theory or epistemics.

A person new to AI safety evaluating their arguments is in roughly the same position as a Go novice trying to make sense of two Go grandmasters disagreeing about a board, with the further unfortunate feature that you can't just make them play against each other, because in some sense they are both playing for the same side.

This isn't a great position to be in. But in my view it's better to understand where you are rather than, for example, naively updating on a few cherry-picked rollouts.


Thanks to Gavin for help with writing this post.

  1. ^

     We can go even further if we note that the later AlphaZero policy network doesn’t use tree search when playing.


3 comments

I liked this post by Katja Grace on these themes.

Here is one way the world could be. By far the best opportunities for making the world better can be supported by philanthropic money. They last for years. They can be invested in a vast number of times. They can be justified using widely available information and widely verifiable judgments.

Here is another way the world could be. By far the best opportunities are one-off actions that must be done by small numbers of people in the right fleeting time and place. The information that would be needed to justify them is half a lifetime’s worth of observations, many of which would be impolite to publish. The judgments needed must be honed by the same.

These worlds illustrate opposite ends of a spectrum. The spectrum is something like, ‘how much doing good in the world is amenable to being a big, slow, public, official, respectable venture, versus a small, agile, private, informal, arespectable one’.

In either world you can do either. And maybe in the second world, you can’t actually get into those good spots, so the relevant intervention is something like trying to. (If the best intervention becomes something like slowly improving institutions so that better people end up in those places, then you are back in the first world). 

An interesting question is what factor of effectiveness you lose by pursuing strategies appropriate to world 1 versus those appropriate to world 2, in the real world. That is, how much better or worse is it to pursue the usual Effective Altruism strategies (GiveWell, AMF, Giving What We Can) relative to looking at the world relatively independently, trying to get into a good position, and making altruistic decisions.

I don’t have a good idea of where our world is in this spectrum. I am curious about whether people can offer evidence.

Enjoyed reading this post, and I think I agree with your assessment:

It is pretty clear that some of the main cruxes of current disagreements about AI alignment are beyond the limits of legible reasoning. (The current limits, anyway.) 

(In addition to the Christiano-Yudkowsky example you give, one could also point to the Hanson-Yudkowsky AI-Foom Debate of 2008.)

In addition to "Epistemic Legibility" and "A Sketch of Good Communication," which you mention, I'd recommend "Public beliefs vs. Private beliefs" (Tyre, 2022) to others who enjoyed this post - Tyre explores a somewhat related theme.

First, you increase the pressure on the "justification generator" to mask various black boxes by generating arguments supporting their conclusions.

.

Third, there's a risk that people get convinced based on bad arguments - because their "justification generator" generated a weak legible explanation, you managed to refute it, and they updated. The problem comes if this involves discarding the output of the neural network, which was much smarter than the reasoning they accepted.

On the other hand, if someone in EA is making decisions about high-stakes interventions while their judgement is being influenced by a subconscious optimization for things like status and power, I think it's probably beneficial to subject their "justification generator" to a lot of pressure (in the hope that that will cause them, and onlookers, to end up making the best decisions from an EA perspective).