

I even explicitly said I am less familiar with BP as a debate format.

The fact that you are unfamiliar with the format, and yet are making a number of claims about it, is pretty much exactly my issue. Lack of familiarity is an anti-excuse for overconfidence.

The OP is about an event conducted in BP. Any future events will presumably also be conducted in BP. Information about other formats is only relevant to the extent that they provide information about BP. 

I can understand not realising how large the differences between formats are initially, and so assuming information from other formats has strong relevance at first, which is why I was sympathetic to your original comment, but a bunch of people have pointed this out by now.

I expect substantiated criticisms of BP as a truth-seeking device (of which there are many!) to look more like the stuff that Ben Pace is saying here, and less like the things you are writing. In brief, I think the actual biggest issues are:

  1. 15-minute prep makes for a better game but for very evidence-light arguments.
  2. Judges are explicitly not supposed to reward applause lights, but they are human, so sometimes they do.
  3. It's rarely a good idea to explicitly back down, even on an issue you are clearly losing. Instead you end up making a lot of 'even if' statements. I think Scott did a good job of explaining why that's not ideal in collaborative discussions (search for "I don’t like the “even if” framing.").

(1) isn't really a problem on the meta (read: relevant) level, since it's very obvious; mostly I think this ends up teaching the useful lesson 'you can prove roughly anything with ungrounded arguments'. (2) and (3) can inculcate actual bad habits, which I would worry about more if EA wasn't already stuffed full of those habits and if my personal experience didn't suggest that debaters are pretty good at dropping those habits outside of the debates themselves. Still, I think they are things reasonable people can worry about.

By contrast, criticisms I think mostly don't make sense:

  • Goodharting 
  • Anything to the effect of 'the speakers might end up believing what they are saying', especially at top levels. Like, these people were randomly assigned positions, have probably been assigned the roughly opposite position at some point, and are not idiots. 

Finally, even after a re-read and showing your comment to two other people seeking alternative interpretations, I think you did say the thing you claim not to have said. Perhaps you meant to say something else, in which case I'd suggest editing to say whatever you meant to say. I would suggest an edit myself, but in this case I don't know what it was you meant to say.

You did give some responses elsewhere, so a few thoughts on your responses:

But this is really far from the only way policy debate is broken. Indeed, a large fraction of policy debates end up not debating the topic at all, but end up being full of people debating the institution of debating in various ways, and making various arguments for why they should be declared the winner for instrumental reasons. This is also pretty common in other debate formats.

(Emphasis added). This seems like a classic case for 'what do you think you know, and how do you think you know it?'. 

Here's why I think I know the opposite: the standard in British Parliamentary judging is to judge based on the 'Ordinary Intelligent Voter', defined as follows:

In particular, judges are asked to conceive of themselves as if they were a hypothetical ‘ordinary intelligent voter’ (sometimes also termed ‘average reasonable person’ or ‘informed global citizen’). This hypothetical ordinary intelligent voter doesn’t have pre-formed views on the topic of the debate and isn’t convinced by sophistry, deception or logical fallacies. They are well informed about political and social affairs but lack specialist knowledge. They are open-minded and concerned to decide how to vote – they are thus willing to be convinced by the debaters who provide the most compelling case for or against a certain policy. They are intelligent to the point of being able to understand and assess contrasting arguments (including sophisticated arguments), that are presented to them; but they keep themselves constrained to the material presented unless it patently contradicts common knowledge or is otherwise wildly implausible.

This definition is basically designed to be hard to Goodhart. It's still easy for judging cultures to take effect and either reward or fail to punish unhelpful behaviour, and personally I would list 'speaking too fast' under this, but nothing in that definition is likely to lead to people 'debating the institution of debating'. So unsurprisingly, I saw vanishingly little of this. Scanning down recent WUDC finals, the only one where the speakers appear to come close to doing this is the one where the motion itself is "This house believes that University Debating has done more harm than good". Correspondingly, I see no cases where they end up 'not debating the topic at all'.

The debates I participated in in high-school had nobody talking fast. But it had people doing weird meta-debate, and had people repeatedly abusing terrible studies because you can basically never challenge the validity or methodology of a study, or had people make terrible rhetorical arguments, or intentionally obfuscate their arguments until they complete it in the last minute so the opposition would have no time to respond to it.

I mean, I'm sorry you had terrible judges or a terrible format I guess? I judged more high school debates than virtually anyone during my time at university, and these are not things I would have allowed to fly, because they are not things I consider persuasive to the Ordinary Intelligent Voter; the 'isn't convinced by sophistry, deception or logical fallacies' seems particularly relevant. 

On that note, I don't think it's a coincidence that a significant fraction of my comments on this forum are about challenging errors of math or logic. My rough impression is that other users often notice something is wrong, but struggle to identify it precisely, and so say nothing. It should be obvious why I'm keen on getting more people who are used to structuring their thoughts in such a way that they can explain the exact perceived error. Such exactness has benefits even when the perception is wrong and the original argument holds, because it's easier to refute the refutation. 

I might be wrong here, but I currently don't really believe that recruiting from the debate community is going to increase our cognitive diversity on almost any important dimension.

The Oxbridge debating community at least is pretty far to the right of the EA community, politically speaking. I consider this an important form of cognitive diversity, but YMMV. 


Overall, I'm left with the distinct impression that you've made up your mind on this based on a bad personal experience, and that nothing is likely to change that view. Which does happen sometimes when there isn't much in the way of empirical data (after all, there's sadly no easy way for me to disprove your claim that a large fraction of BP debates end up not debating the topic at all...), and isn't a bad reasoning process per se, but confidence in such views should necessarily be limited.

Thanks for this, pretty interesting analysis.

Every time I come across an old post in the EA forum I wonder if the karma score is low because people did not get any value from it or if people really liked it and it only got a lower score because fewer people were around to upvote it at that time.

The other thing going on here is that the karma system got an overhaul when forum 2.0 launched in late 2018, giving some users 2x voting power and also introducing strong upvotes. Before that, one vote was one karma. I don't remember exactly when the new system came in, but I'd guess this is the cause of the sharp rise on your graph around December 2018. AFAIK, old votes were never re-weighted, which is why if you go back through comments on old posts you'll see a lot of things with e.g. +13 karma and 13 total votes, a pattern I don't recall ever seeing since. 

Partly as a result, most of the karma old posts have will have been from people going back and upvoting them later once the new system was implemented, e.g. from memory my post from your list was around +10 for most of its life, and has drifted to its current +59 over the past couple of years.

This jumps out to me because I'm pretty sure that post was not a particularly high-engagement post even at the time it was written, but it's the second-highest 2015 post on your list. I think this is because it's been linked back to a fair amount and so can partially benefit from the karma inflation.

(None of which is meant to take away from the work you've done here, just providing some possibly-helpful context.)

I think these concerns are all pretty reasonable, but also strongly discordant with my personal experience, so I figured it would help third parties if I explained the key insights/skills I think I learned or were strongly reinforced by my debating experience. 

Three notable caveats on that experience:

  • I spent more time judging debates than I did speaking in them, which is moderately unusual. It's plausible to me that judging was much more useful.
  • It was 8-12 years ago, and my independent impression is that the top levels of the sport have degenerated somewhat since (e.g. I watched world-class debaters speak and while they spoke fast, I've never seen anything like the link Oli posted).
  • I approached debating with a mindset of 'this is an area I am naturally weak in and want to get better at', so it was always more likely to complement my natural quantitative approach to figuring things out, rather than replacing it.

(Edit: Since some other discussions on this thread are talking about various formats, I should also add that my experience is entirely inside British Parliamentary debate.)

All in all, I think it's very plausible Oli's experience was closer to a typical 2021 experience than mine. But mostly I'm just not sure; for one thing, I'd bet that the 'cram as many points in as possible' strategy is still much less prevalent at lower levels.

With that out of the way, here are things I picked up that I think are important and useful for truth-tracking, as opposed to persuasion.

  • Actually listening to the arguments that have been made, in a way that means I could repeat them back with at-least-comparable eloquence to the speaker. Put another way, I think debating made me much better at ideological Turing tests.
  • A healthy skepticism of the power of arguments and inner-sense-of-conviction as a truth-tracking device, particularly whenever you are talking to someone smarter and more charismatic than yourself, or whenever you've just done something like give a speech (or write a blog post/comment!) in favour of a particular conclusion, or whenever you are surrounded by a group of people who all think the same way. This is very closely related to Epistemic Learned Helplessness. It seems like Scott realised this by reading pseudohistory books, see below quote, but my parallel 'oh shit' moment was being thoroughly out-argued and convinced by much better debaters in favour of A, and then being equally out-argued by debaters in favour of not-A. Unlike Scott's experience, I think those people could argue circles around me on virtually every topic. Which just makes it even more obvious you need a better approach.
  • Being able to generate (some) strong arguments against things I strongly believe and being able to do it independently. It's pretty common for novice debaters who are highly committed socialists to be unable to come up with any arguments for free markets, or vice-versa. I often see similar patterns, including on that exact issue but also on many other issues, within EA groups. I think getting better at this is critical if we want to do more policy work. Closely related: Policy debates should not appear one-sided. I'm also reminded of Haidt's work on moral foundations and how liberals tend to ignore some of the foundations.
  • Identifying critical disagreements, areas that if they resolved one way would likely result in a win for one side, and if they resolved the other way would win for the other side. These are very close to, though not quite the same as, CFAR's concept of a double crux.

To state the hopefully-obvious, I doubt debating is the optimal way to learn any of this. If I was talking to an EA without debating experience who really wanted to pick up the things I picked up, I'd advise them to read and reflect on the above links, and probably a few other related links I didn't think of, rather than getting involved in competitive debating, partly for reasons Oli gives and partly for time reasons. I did it primarily because it was fun and the fact it happened to be (imo) useful was a bonus, not unlike the reasons I played Chess or strategy games. That and the fact that half those posts didn't even exist back in 2009. 

At the same time, if I want to learn things from a conversation with someone that I disagree with, and all I know is that I have the choice between talking to someone with or without debating experience, I'm going with the first person. Past experience has taught me that the conversation is likely to be more efficient, more focused on cruxes and falsifiable beliefs, and thus less frustrating.

And there are people who can argue circles around me. Maybe not on every topic, but on topics where they are experts and have spent their whole lives honing their arguments. When I was young I used to read pseudohistory books; Immanuel Velikovsky’s Ages in Chaos is a good example of the best this genre has to offer. I read it and it seemed so obviously correct, so perfect, that I could barely bring myself to bother to search out rebuttals.

And then I read the rebuttals, and they were so obviously correct, so devastating, that I couldn’t believe I had ever been so dumb as to believe Velikovsky.

And then I read the rebuttals to the rebuttals, and they were so obviously correct that I felt silly for ever doubting.

And so on for several more iterations, until the labyrinth of doubt seemed inescapable...

So taking a step back for a second, I think the primary point of collaborative written or spoken communication is to take the picture or conceptual map in my head and put it in your head, as accurately as possible. Use of any terms should, in my view, be assessed against whether those terms are likely to create the right picture in a reader's or listener's head. I appreciate this is a somewhat extreme position.

If every time you use the term heavy-tailed (and it's used a lot - a quick CTRL + F tells me it's in the OP 25 times) I have to guess from context whether you mean the mathematical or commonsense definition, it's more difficult to parse what you actually mean in any given sentence. If someone is reading and doesn't even know that those definitions substantially differ, they'll probably come away with bad conclusions.

This isn't a hypothetical corner case - I keep seeing people come to bad (or at least unsupported) conclusions in exactly this way, while thinking that their reasoning is mathematically sound and thus nigh-incontrovertible. To quote myself above:

The above, in my opinion, highlights the folly of ever thinking 'well, log-normal distributions are heavy-tailed, and this should be log-normal because things got multiplied together, so the top 1% must be at least a few percent of the overall value'.

If I noticed that use of terms like 'linear growth' or 'exponential growth' were similarly leading to bad conclusions, e.g. by being extrapolated too far beyond the range of data in the sample, I would be similarly opposed to their use. But I don't, so I'm not. 

If I noticed that engineers at firms I have worked for were obsessed with replacing exponential algorithms with polynomial algorithms because they are better in some limit case, but worse in the actual use cases, I would point this out and suggest they stop thinking in those terms. But this hasn't happened, so I haven't ever done so. 

I do notice that use of the term heavy-tailed (as a binary) in EA, especially with reference to the log-normal distribution, is causing people to make claims about how we should expect this to be 'a heavy-tailed distribution' and how important it therefore is to attract the top 1%, and you get the idea.

Still, a full taboo is unrealistic and was intended as an aside; closer to 'in my ideal world' or 'this is what I aim for in my own writing', rather than a practical suggestion to others. As I said, I think the actual suggestions made in this summary are good - replacing the question 'is this heavy-tailed or not' with 'how heavy-tailed is this' should do the trick - and I hope to see them become more widely adopted.

Briefly on this, I think my issue becomes clearer if you look at the full section.

If we agree that log-normal is more likely than normal, and log-normal distributions are heavy-tailed, then saying 'By contrast, [performance in these jobs] is thin-tailed' is just incorrect? Assuming you meant the mathematical senses of heavy-tailed and thin-tailed here, which I guess I'm not sure if you did.

This uncertainty and resulting inability to assess whether this section is true or false obviously loops back to why I would prefer not to use the term 'heavy-tailed' at all, which I will address in more detail in my reply to your other comment.

Ex-post performance appears ‘heavy-tailed’ in many relevant domains, but with very large differences in how heavy-tailed: the top 1% account for between 4% to over 80% of the total. For instance, we find ‘heavy-tailed’ distributions (e.g.  log-normal, power law) of scientific citations, startup valuations, income, and media sales. By contrast, a large meta-analysis reports ‘thin-tailed’ (Gaussian) distributions for ex-post performance in less complex jobs such as cook or mail carrier

Hi Max and Ben, a few related thoughts below. Many of these are mentioned in various places in the doc, so seem to have been understood, but nonetheless have implications for your summary and qualitative commentary, which I sometimes think misses the mark. 

  • Many distributions are heavy-tailed mathematically, but not in the common use of that term, which I think is closer to 'how concentrated is the thing into the top 0.1%/1%/etc.', and thus 'how important is it I find top performers' or 'how important is it to attract the top performers'. For example, you write the following:

What share of total output should we expect to come from the small fraction of people we’re most optimistic about (say, the top 1% or top 0.1%) – that is, how heavy-tailed is the distribution of ex-ante performance? 

  • Often, you can't derive this directly from the distribution's mathematical type. In particular, you cannot derive it from whether a distribution is heavy-tailed in the mathematical sense. 
  • Log-normal distributions are particularly common and are a particular offender here, because they tend to occur whenever lots of independent factors are multiplied together. But here is the approximate* fraction of value that comes from the top 1% in a few different log-normal distributions:
    EXP(N(0,0.0001)) -> 1.02%
    EXP(N(0,0.001)) -> 1.08%
    EXP(N(0,0.01)) -> 1.28%
    EXP(N(0,0.1)) -> 2.22%
    EXP(N(0,1)) -> 9.5%
  • For a real-world example, geometric Brownian motion is the most common model of stock prices, and produces a log-normal distribution of prices, but models based on GBM actually produce pretty thin tails in the commonsense use, which are in turn much thinner than the tails in real stock markets, as (in?)famously chronicled in Taleb's Black Swan among others. Since I'm a finance person who came of age right as that book was written, I'm particularly used to thinking of the log-normal distribution as 'the stupidly-thin-tailed one', and have a brief moment of confusion every time it is referred to as 'heavy-tailed'.
  • The above, in my opinion, highlights the folly of ever thinking 'well, log-normal distributions are heavy-tailed, and this should be log-normal because things got multiplied together, so the top 1% must be at least a few percent of the overall value'. Log-normal distributions with low variance are practically indistinguishable from normal distributions. In fact, as I understand it many oft-used examples of normal distributions, such as height and other biological properties, are actually believed to follow a log-normal distribution.
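
As a rough cross-check on the simulated figures in the list above, the top-share of a log-normal has a closed form: if X = exp(N(mu, sigma^2)), the share of the total held by the top fraction p is Φ(sigma − z), where z is the (1 − p) quantile of the standard normal. A minimal Python sketch, assuming the second argument of N(·,·) in the list is a variance (the function name is purely illustrative):

```python
from math import sqrt
from statistics import NormalDist


def lognormal_top_share(variance: float, top_fraction: float = 0.01) -> float:
    """Share of the total held by the top `top_fraction` of exp(N(0, variance)).

    Assumption: `variance` is the variance (not the standard deviation)
    of the underlying normal distribution.
    """
    std_normal = NormalDist()  # standard normal, mean 0 and sd 1
    sigma = sqrt(variance)
    z = std_normal.inv_cdf(1.0 - top_fraction)  # quantile cutting off the top tail
    return std_normal.cdf(sigma - z)


for variance in (0.0001, 0.001, 0.01, 0.1, 1.0):
    print(f"EXP(N(0,{variance})) -> {lognormal_top_share(variance):.2%}")
```

The closed-form values should land close to the simulated ones (for variance 1 it gives about 9.2% rather than 9.5%), so the qualitative point is unaffected: at low variance the top 1% hold barely more than 1% of the total.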


I'd guess we agree on the above, though if not I'd welcome a correction. But I'll go ahead and flag bits of your summary that look weird to me assuming we agree on the mathematical facts:

By contrast, a large meta-analysis reports ‘thin-tailed’ (Gaussian) distributions for ex-post performance in less complex jobs such as cook or mail carrier [1]: the top 1% account for 3-3.7% of the total.

I haven't read the meta-analysis, but I'd tentatively bet that much like biological properties these jobs actually follow log-normal distributions and they just couldn't tell (and weren't trying to tell) the difference. 

These figures illustrate that the difference between ‘thin-tailed’ and ‘heavy-tailed’ distributions can be modest in the range that matters in practice

I agree with the direction of this statement, but it's actually worse than that: depending on the tail of interest, "heavy-tailed distributions" can have thinner tails than "thin-tailed distributions"! For example, compare my numbers for the top 1% of various log-normal distributions to the right-hand-side of a standard N(0,1) normal distribution where we cut off negative values (~3.5% in top 1%).
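
The truncated-normal figure can be sanity-checked without simulation. Reading "cut off negative values" as the half-normal |Z| with Z ~ N(0,1) (an assumption about what was meant), the top-p share works out to exp(−q²/2), where q is the threshold with P(|Z| > q) = p. A small Python sketch, with an illustrative function name:

```python
from math import exp
from statistics import NormalDist


def half_normal_top_share(top_fraction: float = 0.01) -> float:
    """Share of the total held by the top `top_fraction` of |Z|, Z ~ N(0, 1)."""
    # Threshold q such that P(|Z| > q) = top_fraction.
    q = NormalDist().inv_cdf(1.0 - top_fraction / 2.0)
    # E[|Z|; |Z| > q] / E[|Z|] = phi(q) / phi(0) = exp(-q^2 / 2).
    return exp(-q * q / 2.0)


print(f"half-normal top 1% share: {half_normal_top_share():.2%}")
```

This comes out at roughly 3.6%, in line with the ~3.5% quoted and well above the 2.22% listed for EXP(N(0,0.1)).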


It's also somewhat common to see comments like this from 80k staff (This from Ben Todd elsewhere in this thread):

You can get heavy tailed outcomes if performance is the product of two normally distributed factors (e.g. intelligence x effort).

You indeed can, but like the log-normal distribution this will tend to have pretty thin tails in the common use of the term. For example, multiplying two N(100,225) distributions together, chosen because this is roughly the distribution of IQ, gets you a distribution where the top 1% account for 1.6% of the total. Looping back to my above thought, I'd also guess that performance on jobs like cook and mail-carrier looks very close to this, and empirically was observed to have similarly thin tails (aptitude x intelligence x effort might in fact be the right framing for these jobs).
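
For anyone wanting to reproduce the product-of-normals figure, here is a minimal simulation sketch in Python. It assumes N(100,225) means mean 100 and variance 225 (so standard deviation 15) and that the two factors are independent; the sample size, seed, and function name are arbitrary choices of this sketch, not taken from the original calculation.

```python
import random


def product_top_share(n: int = 200_000, mean: float = 100.0, sd: float = 15.0,
                      top_fraction: float = 0.01, seed: int = 0) -> float:
    """Simulated share of the total held by the top `top_fraction` of A * B,
    where A and B are independent N(mean, sd^2) draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    values = sorted(rng.gauss(mean, sd) * rng.gauss(mean, sd) for _ in range(n))
    k = int(n * top_fraction)  # number of samples in the top tail
    return sum(values[-k:]) / sum(values)


print(f"top 1% share of the product: {product_top_share():.2%}")
```

The result should sit in the neighbourhood of the 1.6% quoted above, with some sampling noise in the second decimal place.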


Ultimately, the recommendation I would give is much the same as the bottom line presented, which I was very happy to see. Indeed, I'm mostly grumbling because I want to discourage anything which treats heavy-tailed as a binary property**, as parts of the summary/commentary tend to, see above.

Some advice for how to work with these concepts in practice:

  • In practice, don’t treat ‘heavy-tailed’ as a binary property. Instead, ask how heavy the tails of some quantity of interest are, for instance by identifying the frequency of outliers you’re interested in (e.g. top 1%, top 0.1%, …) and comparing them to the median or looking at their share of the total. [2]
  • Carefully choose the underlying population and the metric for performance, in a way that’s tailored to the purpose of your analysis. In particular, be mindful of whether you’re looking at the full distribution or some tail (e.g. wealth of all citizens vs. wealth of billionaires).

*Approximate because I was lazy and just simulated 10000 values to get these and other quoted numbers. AFAIK the true values are not sufficiently different to affect the point I'm making. 

**If it were up to me, I'd taboo the term 'heavy-tailed' entirely, because having an oft-used term whose mathematical and commonsense notions differ is an obvious recipe for miscommunication in a STEM-heavy community like this one. 

I want to push back against a possible interpretation of this moderately strongly.

If the charity you are considering starting has a 40% chance of being 2x better than what is currently being done on the margin, and a 60% chance of doing nothing, I very likely want you to start it, naive 0.8x EV be damned. I could imagine wanting you to start it at much lower numbers than 0.8x, depending on the upside case. The key is to be able to monitor whether you are in the latter case, and stop if you are. Then you absorb a lot more money in the 40% case, and the actual EV becomes positive even if all the money comes from EAs.

If monitoring is basically impossible and your EV estimate is never going to get more refined, I think the case for not starting becomes clearer. I just think that's actually pretty rare?
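
To make the arithmetic concrete, here is a toy Python sketch. Everything beyond the 40% / 2x / 60% figures is an assumption of the sketch: that a pilot costing some fraction of the budget reveals perfectly which case you are in, and that on stopping, the remaining money goes to the existing marginal option at 1x. The function and parameter names are hypothetical.

```python
def ev_with_stopping(p_good: float, multiplier: float, pilot_fraction: float) -> float:
    """Expected value per unit of budget, relative to the 1x marginal alternative.

    Toy model: spend `pilot_fraction` of the budget before learning the outcome,
    then continue only in the good case and redirect the rest to the 1x
    alternative in the bad case (perfect monitoring assumed).
    """
    naive_ev = p_good * multiplier  # bad case is worth 0
    after_pilot = p_good * multiplier + (1.0 - p_good) * 1.0
    return pilot_fraction * naive_ev + (1.0 - pilot_fraction) * after_pilot


print(ev_with_stopping(0.4, 2.0, 1.0))  # no monitoring: the naive 0.8x
print(ev_with_stopping(0.4, 2.0, 0.1))  # learn after spending 10% of the budget
```

Under these assumptions the 10% pilot case comes out at about 1.34x, and setting `pilot_fraction` to 1 recovers the naive 0.8x, which is the 'monitoring is basically impossible' case.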

From the donor side in areas and at times where I've been active, I've generally been very happy to give 'risky' money to things where I trust the founders to monitor and stop or switch as appropriate, and much more conservative (usually just not giving) if I don't. I hope and somewhat expect other donors are willing to do the same, but if they aren't that seems like a serious failure of the funding landscape. 

I have a few thoughts here, but my most important one is that your (2), as phrased, is an argument in favour of outreach, not against it. If you update towards a much better way of doing good, and any significant fraction of the people you 'recruit' update with you, you presumably did much more good via recruitment than via direct work. 

Put another way, recruitment defers the question of how to do good into the future, and is therefore particularly valuable if we think our ideas are going to change/improve particularly fast. By contrast, recruitment (or deferring to the future in general) is less valuable when you 'have it all figured out'; you might just want to 'get on with it' at that point.


It might be easier to see with an illustrative example:

Let's say in the year 2015 you are choosing whether to work on cause P, or to recruit for the broader EA movement. Without thinking about the question of shifting cause preferences, you decide to recruit, because you think that one year of recruiting generates (e.g.) two years of counterfactual EA effort at your level of ability.

In the year 2020, looking back on this choice, you observe that you now work on cause Q, which you think is 10x more impactful than cause P. With frustration and disappointment, you also observe that a 'mere' 25% of the people you recruited moved with you to cause Q, and so your original estimate of two years actually became six months (actually more because P still counts for something in this example, but ignoring that for now).

This looks bad because six months < one year, but if you focus on impact rather than time spent then you realise that you are comparing one year of work on cause P, to six months of work on cause Q. Since cause Q is 10x better, your outreach 5x outperformed direct work on P, versus the 2x you thought it would originally.
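
The same arithmetic as a small Python sketch, so different numbers can be plugged in. The function and parameter names are made up for illustration; `stayed_fraction` covers recruits who keep working on cause P, and the default of 0 matches the simplification of ignoring them.

```python
def outreach_multiplier(years_generated: float, moved_fraction: float,
                        q_over_p: float, stayed_fraction: float = 0.0) -> float:
    """Value of one year of recruiting, in units of one year of direct work on P.

    years_generated: counterfactual EA-years produced per year of recruiting
    moved_fraction:  share of recruits who later move with you to cause Q
    q_over_p:        how many times more impactful Q is than P
    stayed_fraction: share of recruits still working on P (each worth 1x)
    """
    return years_generated * (moved_fraction * q_over_p + stayed_fraction)


print(outreach_multiplier(2, 0.25, 10))        # the example as simplified above
print(outreach_multiplier(2, 0.25, 10, 0.75))  # also counting those still on P
print(outreach_multiplier(2, 0.01, 10))        # 99% attrition
```

With the example's numbers this gives 5x when ignoring people who stayed on P, 6.5x when counting them, and about 0.2x under 99% attrition.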


You can certainly plug in numbers where the above equation will come out the other way - suppose you had 99% attrition - but I guess I think they are pretty implausible? If you still think your (2) holds, I'm curious what (ballpark) numbers you would use. 

+1. A short version of my thoughts here is that I’d be interested in changing the EA name if we can find a better alternative, because it does have some downsides, but this particular alternative seems worse from a strict persuasion perspective.

Most of the pushback I feel when talking to otherwise-promising people about EA is not really as much about content as it is about framing: it’s people feeling EA is too cold, too uncaring, too Spock-like, too thoughtless about the impact it might have on those causes deemed ineffective, too naive to realise the impact living this way will have on the people who dive into it. I think you can see this in many critiques.

(Obviously, this isn’t universal; some people embrace the Spock-like-mindset and the quantification. I do, to some extent, or I wouldn’t be here. But I’ve been steadily more convinced over the years that it’s a small minority.)

You can fight this by framing your ideas in warmer terms, but it does seem like starting at ‘Global Priorities community’ makes the battle more uphill. And I find losing this group sad, because I think the actual EA community is relatively warm, but first impressions are tough to overcome.

Low confidence on all of the above, would be happy to see data.
