
From time to time, someone makes the case for why transparency in reasoning is important. The latest conceptualization is Epistemic Legibility by Elizabeth, but the core concept is similar to reasoning transparency used by OpenPhil, and also has some similarity to A Sketch of Good Communication by Ben Pace.

I'd like to offer a gentle pushback. The tl;dr is in my comment on Ben's post, but it seems useful enough for a standalone post.
 

"How odd I can have all this inside me and to you it's just words." ― David Foster Wallace


When and why reasoning legibility is hard

Say you demand transparent reasoning from AlphaGo. The algorithm has roughly two parts: tree search and a neural network. Tree search reasoning is naturally legible: the "argument" is simply a sequence of board states. In contrast, the neural network is mostly illegible - its output is a figurative "feeling" about how promising a position is, but that feeling depends on the aggregate experience of a huge number of games, and it is extremely difficult to explain transparently how a particular feeling depends on particular past experiences. So AlphaGo would be able to present part of its reasoning to you, but not the most important part.[1]
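To make the two parts of the metaphor concrete, here is a minimal toy sketch (my own illustration in Python, not AlphaGo's actual implementation - `value_net`, `legal_moves`, and `apply_move` are made-up placeholders). The search returns an explicit line of play you could show to another person; the score at the leaves, however, still comes from an opaque learned evaluation.

```python
import random

def value_net(position):
    """Opaque part: a single number with no human-readable justification.
    In the real system this is a deep network trained on a huge number of
    games; here it is just a random placeholder."""
    return random.random()

def legal_moves(position):
    """Placeholder for the game rules."""
    return ["A", "B", "C"]

def apply_move(position, move):
    """Placeholder: the position reached after playing `move`."""
    return position + [move]

def search(position, depth):
    """Legible part: an explicit best line of play, plus its evaluation.
    The returned `line` is a concrete sequence of moves you can present
    to a person, but the leaf score still comes from the opaque value_net."""
    if depth == 0:
        return value_net(position), []
    best_score, best_line = float("-inf"), []
    for move in legal_moves(position):
        score, line = search(apply_move(position, move), depth - 1)
        if score > best_score:
            best_score, best_line = score, [move] + line
    return best_score, best_line

score, line = search([], depth=2)
print(f"Legible: the line {line}; illegible: the leaf score {score:.2f}")
```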

Human reasoning uses both: cognition similar to tree search (where the steps can be described, written down, and explained to someone else) and processes not amenable to introspection (which function essentially as a black box that produces a "feeling"). People sometimes call these latter signals “intuition”, “implicit knowledge”, “taste”, “S1 reasoning” and the like. Explicit reasoning often rides on top of this.

Extending the machine learning metaphor, the problem with human interpretability is that "mastery" in a field often consists precisely in having some well-trained black box neural network that performs fairly opaque background computations.

 

Bad things can happen when you demand explanations from black boxes

My second thesis is that it often makes sense to model the mind as running distinct computational processes: one that actually makes decisions and reaches conclusions, and another that produces justifications and rationalizations.

In my experience, if you have good introspective access to your own reasoning, you may occasionally notice that a conclusion C depends mainly on some black box - and that the plausible, legible argument A you offer for it was generated only after you had already reached C.

If you try running, say, Double Crux on such situations, you'll notice that even if someone refutes the explicit reasoning A, you won't quite change the conclusion to ¬C. The legible argument A was not the real crux. Quite often A is essentially fake (or low-weight), whereas the black box is hiding a reality-tracking model.

Stretching the AlphaGo metaphor a bit: AlphaGo could easily be modified to find a few specific game "rollouts" that appear to "explain" the mysterious signal from the neural network. Using tree search, it would produce a few specific examples of how the position might evolve, selected to agree with the neural net's prediction. If AlphaGo showed them to you, it might convince you! But you would get a completely superficial understanding of why it evaluates the situation the way it does, or why it makes certain moves.
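To caricature this in code (again my own sketch, not a real AlphaGo feature - all names here are made up): sample many rollouts, keep only the ones whose evaluation happens to agree with the network's verdict, and present those as the "reasoning". The selected lines are perfectly legible, but the verdict was never derived from them.

```python
import random

def value_net(position):
    """Opaque scorer standing in for the trained network."""
    return random.random()

def random_rollout(length=4):
    """A legible artifact: an explicit sequence of moves."""
    return [random.choice(["A", "B", "C", "D"]) for _ in range(length)]

def cherry_picked_explanation(position, n_samples=200, n_keep=3):
    verdict = value_net(position)  # the real driver of the conclusion
    rollouts = [random_rollout() for _ in range(n_samples)]
    # Keep only rollouts whose own (noisy) evaluation happens to agree with the verdict.
    agreeing = [r for r in rollouts
                if abs(value_net(position + r) - verdict) < 0.1]
    return verdict, agreeing[:n_keep]  # persuasive examples, but not the real reason

verdict, story = cherry_picked_explanation(["start"])
print(f"Verdict {verdict:.2f}, 'explained' by rollouts: {story}")
```

The point of the sketch: the legibility of the presented rollouts tells you nothing about whether they were causally upstream of the evaluation.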

 

Risks from the legibility norm

When you establish a strong norm pushing for overly straightforward "epistemic legibility", you risk several bad things:

First, you increase the pressure on the "justification generator" to mask various black boxes by generating arguments supporting their conclusions.

Second, you make individual people dumber. Imagine asking a Go grandmaster to transparently justify his moves to you, and to play the moves that are best justified - if he tries to play that way, he will become a much weaker player. A similar thing applies to AlphaGo - if you allocate computational resources in such a way that a much larger fraction is consumed by tree search at each position, and less of the neural network is used overall, you will get worse outputs.

Third, there's a risk that people get convinced based on bad arguments - because their "justification generator" generated a weak legible explanation, you managed to refute it, and they updated. The problem comes if this involves discarding the output of the neural network, which was much smarter than the reasoning they accepted.
 

What we can do about it

My personal impression is that society as a whole would benefit from more transparent reasoning on the margin. 

What I'm not convinced of, at all, is that trying to reason much more transparently is a good goal for aspiring rationalists, or that some naive (but memetically fit) norms around epistemic legibility should spread.

To me, it makes sense for some people to specialize in very transparent reasoning. On the other hand, it also makes sense for some people to mostly "try to be better at Go", because legibility has various hidden costs.

A version of transparency that seems more robustly good to me is the one that takes legibility to a meta level. It's perfectly fine to refer to various non-interpretable processes and structures, but we should ideally add a description of what data they are trained on (e.g. “I played at the national level”). At the same time, if such black-box models outperform legible reasoning, it should be considered fine and virtuous to use models which work. You should play to win, if you can.

Examples

An example of common non-legible communication:

A: Can you explain why you feel that getting this person to implement a "Getting Things Done" system is not a good idea?

B: I don't know exactly; I just feel it won't do him any good.

An example of how to make the same conversation worse by naively optimizing for legibility:

A: Can you explain why you feel that getting this person to implement a "Getting Things Done" system is not a good idea?

B: I read a thread on Twitter yesterday where someone explained that research on similar motivational techniques does not replicate, and also another thread where someone referenced research that people who over-organize their lives are less creative.

A: Those studies are pretty weak though.

B: Ah I guess you’re right. 

An example of how to actually improve the same conversation by striving for legibility:

A: Can you explain why you feel that getting this person to implement a "Getting Things Done" system is not a good idea?

B: I guess I can't explain it transparently to you. My model of this person just tells me that there is a fairly high risk that teaching them GTD won't have good results. I think it's based on experience with a hundred people I've met on various courses who are trying to have a positive impact on the world. Also, when I had similar feelings in the past, it turned out they were predictive in more than half of the cases.


If you've always understood the terms "reasoning transparency" or "epistemic legibility" in the spirit of the third conversation, and your epistemology routinely involves steps like "I'm going to trust this black box trained on lots of data a lot more than this transparent analysis based on published research", then you're probably safe.
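If you want the track-record move from the third conversation in an even more concrete form, a minimal sketch (with entirely hypothetical data of my own invention) is just a log of past gut feelings and their hit rate - the legible thing being the black box's calibration rather than its internals:

```python
# Hypothetical log of past gut feelings and how they turned out.
past_intuitions = [
    {"topic": "GTD for person X",        "felt_bad_idea": True, "was_bad_idea": True},
    {"topic": "new productivity course", "felt_bad_idea": True, "was_bad_idea": False},
    {"topic": "project pivot",           "felt_bad_idea": True, "was_bad_idea": True},
]

hits = sum(r["felt_bad_idea"] == r["was_bad_idea"] for r in past_intuitions)
hit_rate = hits / len(past_intuitions)

# Report the track record, not the (unavailable) internals of the feeling.
print(f"Similar feelings were right in {hit_rate:.0%} of {len(past_intuitions)} past cases.")
```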


How this looks in practice

In my view, it is pretty clear that some of the main cruxes of current disagreements about AI alignment are beyond the limits of legible reasoning. (The current limits, anyway.) 

In my view, some of these intuitions have roughly the "black-box" form explained above. If you try to understand the disagreements between e.g. Paul Christiano and Eliezer Yudkowsky, you often end up in a situation where the real difference is "taste", which influences how much weight they give to arguments, how they evaluate various future "board positions", etc. Both Eliezer and Paul are extremely smart, have spent more than a decade thinking about AI safety, and have spent even more time on relevant topics such as ML, decision theory, and epistemics.

A person new to AI safety evaluating their arguments is in roughly the same position as a Go novice trying to make sense of two grandmasters disagreeing about a board, with the further unfortunate feature that you can't just make them play against each other, because in some sense they are both playing for the same side.

This isn't a great position to be in. But in my view it's better to understand where you are rather than, for example, naively updating on a few cherry-picked rollouts.


Thanks to Gavin for help with writing this post.

1. We can go even further if we note that the later AlphaZero policy network doesn't use tree search when playing.

Comments



I liked this post by Katja Grace on these themes.

Here is one way the world could be. By far the best opportunities for making the world better can be supported by philanthropic money. They last for years. They can be invested in a vast number of times. They can be justified using widely available information and widely verifiable judgments.

Here is another way the world could be. By far the best opportunities are one-off actions that must be done by small numbers of people in the right fleeting time and place. The information that would be needed to justify them is half a lifetime’s worth of observations, many of which would be impolite to publish. The judgments needed must be honed by the same.

These worlds illustrate opposite ends of a spectrum. The spectrum is something like, ‘how much doing good in the world is amenable to being a big, slow, public, official, respectable venture, versus a small, agile, private, informal, arespectable one’.

In either world you can do either. And maybe in the second world, you can’t actually get into those good spots, so the relevant intervention is something like trying to. (If the best intervention becomes something like slowly improving institutions so that better people end up in those places, then you are back in the first world). 

An interesting question is what factor of effectiveness you lose by pursuing strategies appropriate to world 1 versus those appropriate to world 2, in the real world. That is, how much better or worse is it to pursue the usual Effective Altruism strategies (GiveWell, AMF, Giving What We Can) relative to looking at the world relatively independently, trying to get into a good position, and making altruistic decisions.

I don’t have a good idea of where our world is in this spectrum. I am curious about whether people can offer evidence.

I enjoyed reading this post, and I think I agree with your assessment:

It is pretty clear that some of the main cruxes of current disagreements about AI alignment are beyond the limits of legible reasoning. (The current limits, anyway.) 

(In addition to the Christiano-Yudkowsky example you give, one could also point to the Hanson-Yudkowsky AI-Foom Debate of 2008.)

In addition to "Epistemic Legibility" and "A Sketch of Good Communication," which you mention, I'd recommend "Public beliefs vs. Private beliefs" (Tyre, 2022) to others who enjoyed this post – Tyre explores a somewhat related theme.

Ofer

First, you increase the pressure on the "justification generator" to mask various black boxes by generating arguments supporting their conclusions.

[...]

Third, there's a risk that people get convinced based on bad arguments - because their "justification generator" generated a weak legible explanation, you managed to refute it, and they updated. The problem comes if this involves discarding the output of the neural network, which was much smarter than the reasoning they accepted.

On the other hand, if someone in EA is making decisions about high-stakes interventions while their judgement is being influenced by a subconscious optimization for things like status and power, I think it's probably beneficial to subject their "justification generator" to a lot of pressure (in the hope that that will cause them, and onlookers, to end up making the best decisions from an EA perspective).
