Relative Impact of the First 10 EA Forum Prize Winners

So here are the mistakes pointed out in the comments:

EAF's hiring round had a high value of information, which I didn't incorporate, per Misha's comment
"Why we have over-rated Cool Earth" was more impactful than I thought, per Khorton's comment
I likely underestimated the possible negative impact of the 2017 donor lottery report, which was quite positive on ALLFED, per MichaelA's comment.

I think this (a ~30% mistake rate) is quite brutal, and still only a lower bound (because there might be other mistakes which commenters didn't point out). I'm pointing this out here because I want to reference this error rate in a forthcoming post.

There are a lot of things I like about this post, from small (e.g. the summary on top of it and the table at the end) to large (e.g. it's a good thing to do given a desire to understand how to quantify/estimate impact better).

Here are some things I am perplexed about or disagree with:

EAF hiring round estimate misses the enormous realized value of information. As far as I can see, EAF decided to move to London (partly) because of that.
- > We moved to London (Primrose Hill) to better attract and retain staff and collaborate with other researchers in London and Oxford.
- > Budget 2020: $994,000 (7.4 expected full-time equivalent employees). Our per-staff expenses have increased compared with 2019 because we do not have access to free office space anymore, and the cost of living in London is significantly higher than in Berlin.

The donor lottery evaluation seems to miss that $100K would have been donated otherwise.
Further, I would suggest another decomposition.
- Impact = impact of running the donor lottery as a tool (as opposed to donating without ~aggregation) + the counterfactual impact of particular grants (as opposed to ~expected grants) + misc. side-effects (like a grantmaker joining LTFF).
- I can understand why you added the first two terms. But it seems to me that
  - we can get a principled estimate about the first one based on arguments for donor lotteries (e.g. epistemic advantage coming from spending more time per dollar donated; and freed time of donors);
    - One can get more empirical and have a quick survey here.
  - estimating the second term is trickier because you need to make a guess about the impact of an average epistemically advantaged donation (as opposed to an average donation of 100K, which I think is missing from your estimate)
    - Both of these are doable because we saw how other donor lottery winners gave their money and how wealthy/invested donors give their money.
    - A good proxy for an impact of average donation might come from (a) EA survey donation data, (b) a quick survey of lottery participants. The latter seems superior because participating in an early donor lottery suggests a higher engagement with EA ideas &c.
- After thinking a bit longer the choice of decomposition depends on what you want to understand better. It seems like your choice is better if you want to empirically understand whether the donor lottery is valuable.

Another weird thing is to see the 2017 Donor Lottery Grant having x5..10 higher impact than 2018 AI Alignment Literature Review and Charity Comparison.
- I think it might come down to you not subtracting the counterfactual impact of donating 100K w/o a lottery from the donor lottery's impact estimate.
- The basic source of impact of the donor lottery and charity review comes from an epistemic advantage (someone dedicating more time to think/evaluate donations; people being better informed about the charities they are likely to donate to). Given how well received the literature review is it seems to be (quite likely) helpful to individual donors and given that it (according to your guess) impacted $100K..1M it should be kinda as impactful or more impactful than an abstract donor lottery.
  - And it's hard to see this particular donor lottery as overwhelmingly more impactful than an average one.

Another weird thing is to see the 2017 Donor Lottery Grant having x5..10 higher impact than 2018 AI Alignment Literature Review and Charity Comparison.

I see now, that is weird. Note that if I calculate the total impact of the 100k to $1M I think Larks moved, the impact of that would be 100mQ to 2Q (change the Shapley value fraction in the Guessstimate to 1), which is closer to the 500mQ to 4Q I estimated from the 2017 Donor Lottery. And the difference can be attributed to a) Investing in organizations which are starting up, b) the high cost of producing AI safety papers, coupled with cause neutrality, and c) further error.

Good point re: value of information
Re: "The donor lottery evaluation seems to miss that $100K would have been donated otherwise": I don't think it does. In the "total project impact" section, I clarify that "Note that in order to not double count impact, the impact has to be divided between the funding providers and the grantee (and possibly with the new hires as well)."

Thank you, Nuno!

Am I understanding correctly that the Shapley value multiplier (0.3 to 0.5) is responsible for preventing double counting?
- If so why don't you apply it to Positive status effects? The effect was also partially enabled by the funding providers (maybe less so).
- Huh! I am surprised that your Shapley value calculation is not explicit but is reasonable.
  - Let's limit ourselves to two players (= funding providers who are only capable of shallow evaluations, and grantmakers who are capable of in-depth evaluation but don't have their own funds). You get that your estimate of "0.3 to 0.5" implies that shallowly evaluated giving is as impactful as "0 to 0.4" of in-depth evaluated giving.
  - This x2.5..∞ multiplier is reasonable, but it doesn't feel quite right to put 10% on above ∞ :)
This makes me further confused about the gap between the donor lottery and the alignment review.
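
For readers following the arithmetic, here is a minimal sketch of how a "0.3 to 0.5" multiplier maps to "0 to 0.4" under the two-player setup described above. It assumes (as that setup does) that the evaluator alone produces no value and that funders alone produce a fraction `x` of the value of in-depth evaluated giving. The function names are illustrative and do not come from the post or its Guesstimate model.

```python
def evaluator_multiplier(x):
    """Evaluator's Shapley share of the total value in a two-player game.

    x = V(funders alone) / V(funders + evaluator), with V(evaluator alone) = 0.
    The evaluator's Shapley value is (V_total - V_funders) / 2,
    so its share of the total is (1 - x) / 2.
    """
    return (1.0 - x) / 2.0


def implied_x(multiplier):
    """Invert the relation above: x = 1 - 2 * multiplier."""
    return 1.0 - 2.0 * multiplier


for m in (0.3, 0.5):
    print(f"multiplier {m} -> implied x of about {implied_x(m):.2f}")
```

Under these assumptions a multiplier of 0.5 corresponds to `x = 0` and a multiplier of 0.3 corresponds to `x` of about 0.4, which is where the "x2.5..∞" range (that is, `1 / x`) comes from.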

You are understanding correctly that the Shapley value multiplier is responsible for preventing double-counting, but you're making a mistake when you say that it "implies that shallowly evaluated giving is as impactful as "0 to 0.4" of in-depth evaluated giving"; the latter doesn't follow.

In the two-player game, you have Value({}), Value({1}), Value({2}), Value({1,2}), and the Shapley value for player 1 (the funders) is ([Value({1}) - Value({})] + [Value({1,2}) - Value({2})])/2, and the value of player 2 (the donor lottery winner) is ([Value({2}) - Value({})] + [Value({1,2}) - Value({1})])/2.

In this case, I'm taking [Value({2}) - Value({})] to be ~0 for simplicity, so the value of player 2 is [Value({1,2}) - Value({1})]/2. Note that this is just the counterfactual value divided by a fraction.

If there were more players, it would be a little bit more complicated, but you'd end up with something similar to [Value({1,2,3}) - Value({1,3})]/3. Note again this is just the counterfactual value divided by a fraction.

But now, I don't know how many players there are, so I just consider [Value({The World}) - Value({The world without player 2})]/(some estimates of how many players there are).

And the Shapley value multiplier would be 1/(some estimates of how many players there are).

At no point am I assuming that "shallowly evaluated giving is as impactful as 0 to 0.4 of in-depth evaluated giving"; the thing that I'm doing is just allocating value so that the sum of the value of each player is equal to the total value.
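
To make the two-player formula above concrete, here is a minimal sketch that computes Shapley values by averaging each player's marginal contribution over all orderings in which players can join. The coalition values are made-up illustrative numbers (funders alone: 40, winner alone: 0, both together: 100) and are not taken from the post or the Guesstimate model; the function and player names are likewise hypothetical.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders.

    `value` maps a frozenset of players to the value of that coalition.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value[with_p] - value[coalition]
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical coalition values, for illustration only.
value = {
    frozenset(): 0.0,
    frozenset({"funders"}): 40.0,
    frozenset({"winner"}): 0.0,
    frozenset({"funders", "winner"}): 100.0,
}

sv = shapley_values(["funders", "winner"], value)
print(sv)  # {'funders': 70.0, 'winner': 30.0}
```

With these numbers the winner's share is the counterfactual value (100 - 40) divided by 2, and the two shares add up to the total value of 100, which is the "no double counting" property being discussed.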

Thank you for engaging!

First, "note that this [misha: Shapley value of evaluator] is just the counterfactual value divided by a fraction [misha: by two]." Right, this is exactly the same in my comment. I further divide by total impact to calculate the Shapley multiplier.
- Do you think we disagree?
- Why doesn't my conclusion follow?
Second, you conclude "And the Shapley value multiplier would be 1/(some estimates of how many players there are)", while your estimate is "0.3 to 0.5". There have been like 30 participants over two lotteries that year, so you should have ended up with something an order of magnitude less, like "3% to 10%".
- Am I missing something?
Third, for the model with more than two players, it's unclear to me who the players are. If these are funders + evaluators, you will indeed end up with $\frac{1}{N} \left(1 - \frac{V(\text{funders})}{V(\text{lottery})}\right)$ because
- Shapley multipliers should add up to $1$, and
- Shapley value of the funders is easy to calculate (any coalition without them lacks any impact).
- Please note that $V(\text{funders})$ is $V(\text{default}, \dots)$ from the comment above.
(Note that this model ignores that the beneficiary might win the lottery and no donations will be made.)

In the end,

I think that it is necessary to estimate X in "shallowly evaluated giving is as impactful as X times of in-depth evaluated giving", because if $X \approx 1$, the impact of the evaluator is close to nil.
- I might not understand how you model impact here, please, be more specific about the modeling setup and assumptions.
I don't think that you should split evaluators. Well, basically because you want to disentangle the impact of evaluation and funding provision and not to calculate Adam's personal impact.
- Like, take it to the extreme: it would be pretty absurd to say that an overwhelmingly successful donor lottery (e.g. one seeding a new ACE Top Charity in a yet unknown but highly tractable area of animal welfare, or discovering an AI alignment prodigy) had an impact less than an average comment because there were too many people (100K) each contributing a dollar to participate in it.

1) Yes, we agree.
2) No, we don't agree. I think that Adam did better than other potential donor lottery winners, and so his counterfactual value is higher, and thus his Shapley value is also higher. If all the other donors had been clones of Adam, I agree that you'd just divide by n. Thus, the claim "In every example here, this will be equivalent to calculating counterfactual value, and dividing by the number of necessary stakeholders" is in fact wrong, and I was implicitly doing both of the following in one step: a. calculating Shapley values with "evaluators" as one agent, and b. thinking of Adam's impact as a high proportion of the SV of the evaluator round.
3) The rest of our disagreements hinge on 2., and I agree that judging the evaluator step alone would make more sense.

Kirsten

On Sanjay's Cool Earth post, I have seen it frequently referenced. Founders Pledge came out with some climate change recommendations shortly after and I think people have been largely donating to those now instead.

I'll flag the narrow and lowish estimates about the Cool Earth as something I was most likely wrong about, then, thanks.

It’s hard to see how the writeup could have had a negative effect.

It seems plausible that people who gave to ALLFED, volunteered for ALLFED, worked for ALLFED, etc. due in part to Gleave's report would otherwise have done better things with their resources.

The report may also have led to EAs/global catastrophic risk researchers/longtermists talking about ALLFED more often and more positively, which could perhaps have a negative effect on perceptions of those communities, e.g. because:

Papers associated with them often present explicit quantitative models and estimates about very uncertain things (which some people are just averse to in general)
ALLFED and those models sometimes make claims that can seem intuitively fairly unlikely
- E.g., "AGI safety, alternative foods, and interventions for losing electricity/industry (and probably other interventions) likely save lives in the present generation more cost-effectively than GiveWell top charities."
  - That's a comment from Denkenberger rather than ALLFED as an institution, but ALLFED-related papers make similar claims
Those models do seem to have some noticeable issues
- (Though I'd personally say that this is to be expected with any models, and a great thing about models is that they often make it easier to identify and correct specific issues, and I personally still basically agree with the qualitative conclusions drawn from the models)
A big part of ALLFED's focus is making a catastrophe less bad if it does happen, which could seem callous to some people

I think it's unlikely that the donor lottery report would have those downsides to a substantial extent.

And I'm personally quite positive about ALLFED, David Denkenberger, and their work, and ALLFED is one of the three places I've donated the most to (along with GCRI and the Long-Term Future Fund).

I'm just disagreeing with the claim "It’s hard to see how the writeup could have had a negative effect." I basically think most longtermism-related things could plausibly have negative effects, since they operate on variables that we think might be important for the long-term future and we're really bad at predicting precisely how the effects play out. (But this doesn't mean we just try to "do nothing", of course! Something with some downside risk can still be very positive in expectation.)

I'm not sure how often my 80% confidence interval would include negative effects, nor whether it'd include them in the ALLFED case. So maybe this is just a nit-pick about your specific phrasing, and we'd agree on the substance of your model/estimate.

Yeah, I see what you're saying. Do you think that it is hard for the writeup to have a negative total effect?

When I made my comment, I think I kind-of had in mind "negative total effect", rather than "at least one negative effect, whether or not it's offset". But I don't think I'd explicitly thought about the distinction (which is a bit silly), and my comment doesn't make it totally clear what I meant, so it's a good question.

I think my 80% confidence interval probably wouldn't include an overall negative impact of the writeup. But I think my 95% confidence interval would.

Reasons why my 80% confidence interval probably wouldn't include an overall negative impact of the writeup, despite what I said in my previous comment:

I think we should have some degree of confidence that, if there's more public discussion by people with fairly good epistemics and good epistemic and discussion norms, that'll tend to update people towards more accurate beliefs.
- (Not every time, but more often than it does the opposite.)
- As such, I think we should start off skeptical of claims like "An EA Forum post that influenced people's beliefs and behaviours substantially influenced those things in a bad way, even though in theory someone else could've pointed that out convincingly and thus prevented that influence."
And then there's also the fact that Gleave later got a role on the LTFF, suggesting he's probably good at reasoning about these things.
And there's also my object-level positive impressions of ALLFED.

I have nothing to disagree about here :)

Overall thoughts

Thanks, I found this post interesting.

I don't know what I think about the reasonableness of these specific evaluations, about how useful this sort of evaluation approach is, or about whether I'd like to see more of this sort of thing in future and exactly what form it should take. (To be clear, I literally just mean "I don't know", rather than meaning "I think this all sucks, but I'm being polite.") But I think it's plausible that this or something like it would be very valuable and should be scaled up substantially, so I think exploring the idea at least a bit is definitely worthwhile in expectation.

I'd be interested to hear roughly how long this whole process took you (or how long it took minus writing the actual post, or something)? This seems relevant to how worthwhile and scalable this sort of thing is.

(Of course, the process may become much faster as the people doing it become more experienced, better tools or templates for it are built, etc. But it may also become slower if one aims for more rigour / less pulling things out of thin air. In any case, I think how long this early attempt took should give at least a rough idea.)

I also had a bunch of reactions that aren't especially important since they're focused on specific points about each evaluation, rather than on the basic methods and how this sort of analysis can be useful. I'll split them into separate comments.

Recently Nuño asked me to do similar (but shallower) forecasting for ~150 project ideas. It took me about 5 hours. I think I could have done the evaluation faster, but I left ~paragraph-long comments on up to ½ of the projects and sentence-long comments on most others; I haven't done any advanced modeling or guesstimating.

I'd be interested to hear roughly how long this whole process took you (or how long it took minus writing the actual post, or something)? This seems relevant to how worthwhile and scalable this sort of thing is.

Maybe an afternoon for the initial version, and then two weeks of occasional tweaks. Say 10h to 30h in total? I imagine that if one wanted to scale this, one could get it to 30 mins to an hour for each estimate.

I think that that seems promisingly fast to me, given that this was an early attempt and could probably be sped up (holding quality/rigour constant) by experience, tools, templates, etc. So that updates me a bit further towards enthusiasm about this general idea.

Ozzie Gooen

I'd also note that the larger goals are to scale in non-human ways. If we have a bunch of examples, we could:

1) Open this up to a prediction-market style setup, with a mix of volunteers and possibly inexpensive hires.
2) As we get samples, some people could use data analysis to make simple algorithms to estimate the value of many more documents.
3) We could later use ML and similar to scale this further.

So even if each item were rather time-costly right now, this might be an important step for later. If we can't even do this, with a lot of work, that would be a significant blocker.

https://www.lesswrong.com/posts/kMmNdHpQPcnJgnAQF/prediction-augmented-evaluation-systems

jimrandomh

I was somewhat confused by the scale using Categorizing Variants of Goodhart's Law as an example of a 100mQ paper, given that the LW post version of that paper won the 2018 AI Alignment Prize ($5k), which makes a pretty strong case for it being "a particularly valuable paper" (1Q, the next category up). I also think this scale significantly overvalues research agendas and popular books relative to papers. I don't think these aspects of the rubric wound up impacting the specific estimates made here, though.

JKM

I'm not sure on the exact valuation research agendas should get, but I would argue that well thought-through research agendas can be hugely beneficial in that they can reorient many researchers in high-impact directions, leading them to write papers on topics that are vastly more important than they might have otherwise chosen.

I would argue an 'ingenious' paper written on an unimportant topic isn't anywhere near as good as a 'pretty good' paper written on a hugely important topic.

Yes, the scale is under construction, and you're not the first person to mention that the specific research agenda mentioned is overvalued.

Some specific things I was confused about

1) The estimated mQARPs per employee per month seems to differ substantially between sections. Is this based on something like dividing the posts/papers the org produced by the org's total budget or number of FTE employees? (Your comment on ALLFED vs AI safety papers seems to indicate this? Note that I didn't look closely at the Guesstimate model.)
2) "I have reason to believe that the amount of money influenced was large, and Larks's writeup was further the only one available. My confidence interval is then $100,000 to $1M in moved funding". That seems surprisingly high, but I have no specific knowledge on this. Could you share your reason to believe that? (But no problem if the reasoning is based on private info or is just hard to communicate concisely and explicitly.)
3) "You can see my first guesstimate model here, but I think that this was too low because it only took into account the value of papers, so here is a second model, which takes into account that the value which does not come from papers is 2 to 20x the magnitude of the value which does come from papers. I’m still uncertain, so my final estimate is even more uncertain than the Guesstimate model."
- Could you give a sense of where you see the rest of that value is coming from? (I'm not saying I disagree with you.)
- Was that accounted for in the other evaluations too? My impression from your written description (without looking at the models) was that e.g. for ALLFED you estimated their impact as entirely coming from posts and papers?
4) Is there a reason you estimated the impact of the Giving Tuesday and EAGx things in terms of dollars moved, without converting that into mQARPs?
- Part of why this confuses me is that:
  - I'd guess that dollars moved is actually not the primary value of EAGx (though you can still convert the value into dollars-moved-equivalents if you want)
  - Meanwhile, I'd guess that dollars moved is the primary value of the animal welfare commitments post and maybe some other posts (though you can still convert the value into mQARPs if you want)
5) "Similarly, while EAGx Boston 2018 and the EA Giving Tuesday Donation Matching Initiative might have taken similar amounts of time to organize, by comparably capable people, I prefer the second. This is in large part because EAGx events are scalable, whereas Giving Tuesdays are not." Did you mean to say Giving Tuesday is scalable, whereas EAGx events are not?

1) In the case of ALLFED, this is based on my picturing one employee going about their month, and asking myself how surprising it would be if they couldn't produce 10 mQARPs of value per month, or how surprising it would be if they could produce 50 mQARPs per month. In the case of the AI safety organizations, this is based on estimating the value of each of the papers that Larks thinks are valuable enough to mention, and then estimating what fraction of the total value of an organization those are.
2) Private info
3) a) Building up researchers into more capable researchers, knowledge acquired that isn't published, information value of trying out dead ends, acquiring prestige, etc. b) I actually didn't estimate ALLFED's impact, I estimated the impact of the marginal hires, per 1.
4) Personal taste; it's possible that was the inferior choice. I found it easier to picture the dollars moved than the improvement in productivity. In hindsight, maybe improving retention would be another main benefit which I didn't consider.
5) I got that as a comment. The intuition here is that it would be really, really hard to find a project which moves as much money as Giving Tuesday and which you could do every day, every week, or every month. But if there are more than 52 local EA groups, an EAGx could be organized every week. If you think that EA is only doing projects at maximum efficiency (which it isn't), and knowing only that Giving Tuesdays are done once a year and EAGx are done more often, I'd expect one EAGx to be less valuable than one Giving Tuesday.
- Or, in other words, I'd expect there to be some tradeoff between quality and scalability.

Thanks for the clarifications :)

I actually didn't estimate ALLFED's impact, I estimated the impact of the marginal hires, per 1.

So did that estimate of the impact of marginal hires also account for how much those hires would contribute to "Building up researchers [themselves or others] into more capable researchers, knowledge acquired that isn't published, information value of trying out dead ends, acquiring prestige"?

Oof, no it didn't, good point.

How would something like this approach be used for decision-making?

You write:

We don’t normally estimate the value of small to medium-sized projects.
But we could!
If we could do this reliably and scalably, this might lead us to choose better projects

And:

In addition, one could also use these kinds of estimates to choose one’s own projects, or to recommend projects to others, and see how that fares.

But this post estimates the impact of already completed projects/writeups. So precisely this sort of method couldn't directly be used to choose what projects to do. Instead, I see at least two broad ways something like this method could be used as an input when choosing what projects to do:

1) When choosing what projects to do, look at estimates like these, and either explicitly reason about or form intuitions about what these estimates suggest about the impact of different projects one is considering
- One way to do this would be to create classifications for different types of projects, and then look up what has been the estimated impact per dollar of past projects in the same or similar classifications to each of the projects one is now choosing between
- I think there'd be many other specific ways to do this as well
2) When choosing what projects to do, explicitly make estimates like these for those specific future projects
- If that's the idea, then the sort of approach taken in this post could be seen as:
  - just a proof of concept, or
  - a way to calibrate one's intuitions/forecasts (with the hope being that there'll be transfer between calibration when estimating the impact of past projects and calibration when forecasting the impact of future projects), or
  - a way of getting reference classes / base rates / outside views
    - In that case, this second approach would sort-of incorporate the first approach suggested above as one part of it
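
As a rough illustration of the classification/lookup idea in the first approach above, here is a minimal sketch that groups past evaluations by project type and takes the median estimated impact per dollar as a base rate. Every category, impact figure, and cost below is hypothetical and chosen only to show the mechanics; none of them come from the post.

```python
from statistics import median

# Hypothetical past evaluations: (category, estimated impact, cost in dollars).
past_evaluations = [
    ("literature review", 50.0, 5_000),
    ("literature review", 20.0, 4_000),
    ("event", 30.0, 30_000),
    ("event", 90.0, 45_000),
]


def impact_per_dollar_by_category(evaluations):
    """Return the median estimated impact per dollar for each category."""
    by_category = {}
    for category, impact, cost in evaluations:
        by_category.setdefault(category, []).append(impact / cost)
    return {cat: median(ratios) for cat, ratios in by_category.items()}


base_rates = impact_per_dollar_by_category(past_evaluations)

# Look up the base rate for the category of a project under consideration.
print(base_rates["literature review"], base_rates["event"])
```

The median is used here only because it is less sensitive to a single outlying estimate than the mean; with real data one might prefer to keep the whole distribution rather than a point summary.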

Is one of these what you had in mind? Or both? Or something else?

Yeah, I think that the distinction between evaluation and forecasting is non-central. For example, these estimates can also be viewed as forecasts of what I would estimate if I spent 100x as much time on this, or as forecasts of what a really good system would output.

More to the point, if a project isn't completed I could just estimate the distribution of expected quality, and the expected impact given each degree of quality (or, do a simplified version of that).

That said, I was thinking more about 2., though having a classification/lookup scheme would also be a way to produce explicit estimates.
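
The estimate described above for an uncompleted project (a distribution over quality, plus an expected impact given each degree of quality) amounts to a probability-weighted sum. A minimal sketch, using entirely hypothetical quality levels, probabilities, and impact figures in arbitrary units:

```python
# Hypothetical probability that the finished project lands at each quality level.
quality_probs = {"poor": 0.5, "good": 0.4, "excellent": 0.1}

# Hypothetical expected impact (arbitrary units) given each quality level.
impact_given_quality = {"poor": 0.0, "good": 10.0, "excellent": 100.0}

# The probabilities must describe a complete distribution.
assert abs(sum(quality_probs.values()) - 1.0) < 1e-9

# Expected impact = sum over quality levels of P(quality) * E[impact | quality].
expected_impact = sum(
    prob * impact_given_quality[quality]
    for quality, prob in quality_probs.items()
)
print(expected_impact)
```

With these made-up numbers most of the expected impact comes from the unlikely "excellent" outcome, which is one reason the shape of the quality distribution matters and not just its most likely value.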

For example, these estimates can also be viewed as forecasts of what I would estimate if I spent 100x as much time on this, or as forecasts of what a really good system would output.

Agreed, but that's still different from forecasting the impact of a project that hasn't happened yet, and the difference intuitively seems like it might be meaningful for our purposes. I.e., it's not immediately obvious that methods and intuitions that work well for the sort of estimation/forecasting done in this post would also work well for forecasting the impact of a project that hasn't happened yet.

One could likewise say that it's not obvious that methods and intuitions that work well for forecasting how I'll do in job applications would also work well for forecasting GDP growth in developing countries. So I guess my point was more fundamentally about the potential significance of the domain being different, rather than whether the thing can be seen as a type of forecasting or not.

So it sounds like you're thinking that the sort of thing done in this post would be "a way to calibrate one's intuitions/forecasts (with the hope being that there'll be transfer between calibration when estimating the impact of past projects and calibration when forecasting the impact of future projects)"?

That does seem totally plausible to me; it just adds a step to the argument.

(I guess I'm also more generally interested in the question of how well forecasting accuracy and calibration transfers across domains - though at the same time I haven't made the effort to look into it at all...)

Yes, I expect the intuitions and methods for estimation to generalize/help a great deal with the forecasting step, though I agree that this might not be intuitively obvious. I understand that estimation and forecasting seem like different categories, but I don't expect that to be a significant hurdle in practice.

Specific reactions to the evaluation of Takeaways from EAF's Hiring Round

The way to estimate impact here would be something like: "Counterfactual impact of the best hire(s) in the organizations it influenced, as opposed to the impact of the hires who would otherwise have been chosen"

I think that's a substantial part of the impact, but that there may be other substantial parts too, such as:

Time saved by employees who have to design application processes (since it's usually easier to do things when one has a good writeup as guidance)
Causing orgs to hire sooner, since they're more confident they can do it well and without a huge time investment
Something along the lines of "health of the organisation"; if that post reduces the chance of making a hire which isn't a good fit, it reduces the chance of frictions and someone ending up being fired or quitting, which I imagine are negative for the culture of the organisation
Something along the lines of "health of the community"; I imagine a better hiring round will mean better applicant experiences, which could reduce rates of value drift or burnout or the like

But these are just quick thoughts, and I haven't run application rounds myself, and I think that those things overlap somewhat such that there's a risk of double-counting.

As an upper bound: Let's say it influenced 1 to 5 hiring rounds, and advice in the post allowed advice-takers to hire 0 to 3 people per organization who were 1 to 10% more effective, and who stayed with the organization for 0.5 to 3 years

FWIW, I think the upper bound of my 80% confidence interval would be above 10% more effective and 3 years staying at the org, and definitely above 1% more effective and 0.5 years staying there.

I'm also not sure how to interpret your upper bound itself having a range? (Caveat that I haven't looked at your Guesstimate model.)

I also think that one other effect perhaps worth modelling is that better hiring rounds might mean hires stay at the org for longer (since better and more fitting hires are chosen). This could either be modelled as more output by those employees, or as less cost/output-reduction by employees involved in later hiring rounds, or maybe both.

There are also cases in which an org just doesn't hire anyone at all at a given time if they don't find a good enough fit, and presumably better hiring rounds somewhat reduce the odds of that.
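
To see how the ranges in the quoted upper bound combine, here is a minimal sketch that simply multiplies the low endpoints together and the high endpoints together. The output unit ("additional employee-years of output") and the variable names are assumptions made for illustration; the Guesstimate model itself may combine the ranges differently, for example by treating them as confidence intervals of distributions rather than hard bounds.

```python
from math import prod

# Ranges quoted from the post's upper-bound estimate, as (low, high).
ranges = {
    "hiring_rounds_influenced": (1, 5),
    "improved_hires_per_org": (0, 3),
    "effectiveness_gain": (0.01, 0.10),  # 1% to 10% more effective
    "years_at_org": (0.5, 3),
}

# All quantities are non-negative, so the product of the low (high) endpoints
# is the smallest (largest) value the product can take.
low = prod(lo for lo, _ in ranges.values())
high = prod(hi for _, hi in ranges.values())

print(f"between {low:.2f} and {high:.2f} additional employee-years of output")
```

Endpoint multiplication gives the widest possible bracket (here from 0 to roughly 4.5). Sampling each factor from a distribution would concentrate most of the probability mass well inside that bracket, which is one way an "upper bound" can itself come out as a range.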

Ballpark: 0 to three excellent EA forum posts.

FWIW, intuitively, that seems like a pretty low upper bound for the value of improving other orgs' hiring rounds. I guess this is just for the reasons noted above. (And obviously it'd be better if I actually myself provided specific alternative numbers and models - I'm just taking the quicker option, sorry!)

but sometimes, like in the case of a hiring round, I have the intuition that one might want to penalize the hiring organization for the lost opportunity cost of applicants, even though that’s not what Shapley values recommends

Many of the other projects were stated to have an impact by increasing the funding certain organisations received, thereby helping them hire more people, thereby resulting in more useful output. So by that logic, shouldn't those projects also be penalised for the lost opportunity cost of applicants involved in the hiring rounds run by the orgs which received extra funding due to the project?

Or am I misunderstanding the reasoning or the modelling approach? (That's very possible; I didn't actually look at any of your Guesstimate models.)

I think that's a substantial part of the impact, but that there may be other substantial parts too, such as...

Yes, those seem like at least somewhat important pathways to impact that I've neglected, particularly the first two points. I imagine that could easily lead to a 2x to 3x error (but probably not to a 10x error)

To answer this specifically:

FWIW, I think the upper bound of my 80% confidence interval would be above 10% more effective and 3 years staying at the org, and definitely above 1% more effective and 0.5 years staying there.

Yeah, I disagree with this. I'd expect most interventions to have a small effect, and in particular I expect it to just be hard to change people's actions by writing words. My estimate would be much higher if I were thinking about the difference between a completely terrible hiring round and an excellent one, but I don't know that people start off all that terrible or that this particular post brings people up all that much.

That seems reasonable. I think my intuitions would still differ from yours, but I don't have that much reason to expect my intuitions are well-calibrated here, nor have I thought about this carefully and explicitly.