# All of Jsevillamol's Comments + Replies

Takeaways from our interviews of Spanish Civil Protection servants

(disclaimer: this is my opinion)

In short: Spanish civil protection would not, as of today, consider making plans to address specific GCRs.

There is this weird tension where they believe that resilience is very important, and that planning in advance is nearly useless for non-recurring risks.

The civil protection system is very geared towards response. Foresight, mitigation and prevention seldom happen. This means they are quite keen on improving their general response capacity but they have no patience for hypotheticals. So they would not consider specifi... (read more)

Takeaways from our interviews of Spanish Civil Protection servants

Basically the picture I got is: public health is delegated to the ministry of health, and in particular pandemics are seen as the business of the Centre of Sanitary Alerts (CCAES). The CCAES does have an early warning system, but we could not talk to them and we don't know how the systems reacted to COVID.

On the civil protection side, they basically did nothing until the government declared an... (read more)

Announcing my retirement

Echoing everyone else, thank you for all your hard work.

I do not exaggerate when I say you are the best forum moderator I have ever seen. I am really impressed with your availability, creativity and kindness. You have driven the culture of this website to a whole new level, and inspired me and I bet many others to write better content.

Good luck at OpenPhil!

Don’t wait – there’s plenty more need and opportunity today

I think that this kind of criticism is really useful, and I am glad it was written.

That being said, there is something about this post that really rubbed me the wrong way. This is a shame, because the topic is very pertinent and deserves an in-depth discussion - how do opportunities for funding today compare to opportunities for funding 5 or 10 years from now?

Let me try to give my best shot at a more thoughtful critique.

In the midst of a global pandemic that pushed 150 million people into extreme poverty while billionaires’ wealth grew by

"I assume that the crux here is that GiveDirectly believes that spending more money now would have a good publicity effect, that would promote philanthropy and raise the total amount of donations overall.
I would change my mind if this was the case, but I don't see this as obvious."

I'm not entirely sure what the answer is here either, but one thought I had today was "I should make a Facebook post for Thanksgiving/Christmas telling my friends why I think it's so important to donate to GiveWell - your marginal donation can save a life for \$3-5k! Ah, but... (read more)

I agree with most of your comment.

However, given that GiveWell wants to use a bar of 5-7x GiveDirectly, I think accounting for a study that at best will demonstrate that GiveDirectly is 2.6 times more effective than previously thought will not influence GiveWell’s decision to wait for better opportunities, since it still doesn’t meet the 5-7x GiveDirectly bar.

9 Khorton 6d: I agree that the criticism is very welcome and also in places felt a bit odd - I think your comment about the pandemic remark being both unrelated to GiveDirectly's arguments and very emotive captures why it doesn't feel quite right.
Can we influence the values of our descendants?

I do think so!

It's hard to contest that change across many dimensions has been accelerating.

And it would make sense that this accelerating change makes parental advice less applicable, and thus parents less influential overall.

When pooling forecasts, use the geometric mean of odds

The A,B,C example you came up with is certainly a strike against average log odds and in favor of average probs.

I have thought more about this. I now believe that this invariance property is not reasonable - aggregating outcomes is (surprisingly) not a natural operation in Bayesian reasoning. So I do not think this is a strike against log-odds pooling.

We need alternatives to Intro EA Fellowships

Why did you end up being turned off by EA?

It's hard to pinpoint but I think it's something along the lines of a) the messaging didn't match my perceived self-image ("I am not an altruist"), b) they seemed weirdly fanatical ("donating 10% of my money seems crazy weird") and c) I was not impressed with the people I interacted with (concretely the people from eg the rationality community seemed comparatively more thoughtful and to be working on cooler things).

I am unsure of whether I would have changed my mind had I interacted more with the community at ... (read more)

We need alternatives to Intro EA Fellowships

Retreats are awesome!

It was the MIRI Summer Fellows in 2015. For full disclosure it was not about EA, and I came off it being turned off by EA aesthetics. But it was where I first heard about the movement, and it was crucial for my involvement in the long term.

2 Miranda_Zhang 10d: Ah, interesting! Two questions:
• Why did you end up being turned off by EA?
• How did it end up being crucial for your long-term engagement?
We need alternatives to Intro EA Fellowships

One hour (maybe two) fellowship sessions isn’t long enough to get into “late night life-changing conversations” mode, which is important for big changes.

This to me is the main downside.

I got introduced to EA over a 3 week in-person summer program, and my experience is that 2~4 week in-person intensive programs have a good track record in getting people excited and engaged. Off the top of my head 1 out of 3 participants in the camps I've been involved in became counterfactually engaged, 1 out of 3 was engaged but would be anyway and 1 out of ... (read more)

2 ChanaMessinger 10d: Agree with value of late night conversations. Can discussions be held later, and over dinner, with an easy way to transition into just talking?
1 Miranda_Zhang 11d: Would love to hear about this 3 week program! From my skim (will read properly soon), that is the alternative I am most excited about. For example, my shift from planning a career in communications to community building is almost entirely attributable to a 3-day retreat I went on.
Persistence - A critical review [ABRIDGED]

Thank you!

These papers were ones that William MacAskill was considering citing in his forthcoming book. FF hired me to thoroughly check them.

There are definitely many other persistence papers I didn't cover!

Eg:

• Acemoglu et al, Colonial Origins
• Acemoglu et al, Reversal of Fortune
• Woodberry (2012). The Missionary Roots of Liberal Democracy
• All the papers cited in Kelly's Understanding Persistence

And many others.

Persistence - A critical review [ABRIDGED]

EDIT: Faatima Osman questioned whether it was fair to exclude respondents from Benin, Ghana and Nigeria in Nunn and Wantchekon's paper, given that Nigeria is the most populated country in Africa by far.

And in hindsight I think she is totally right - respondents from these countries are ~25% of the sample! I now believe that it's unfair to call these respondents outliers. Correspondingly, my trust in Nunn and Wantchekon's paper has gone up, since Kelly's critique was my main concern about it.

Can we influence the values of our descendants?

EDIT: Faatima Osman questioned whether it was fair to exclude respondents from Benin, Ghana and Nigeria in Nunn and Wantchekon's paper, given that Nigeria is the most populated country in Africa by far.

And in hindsight I think she is totally right - respondents from these countries are ~25% of the sample! I now believe that it's unfair to call these respondents outliers. Correspondingly, my trust in Nunn and Wantchekon's paper has gone up, since Kelly's critique was my main concern about it.

When pooling forecasts, use the geometric mean of odds

Thank you for your thoughtful reply. I think you raise interesting points, which move my confidence in my conclusions down.

[...] averaging log odds will always give more extreme pooled probabilities than averaging probabilities does

As in your post, averaging the probs effectively erases the information from extreme individual probabilities, so I think you will agree that averaging log odds is not merely a more extreme version of averaging probs.

I nonetheless think this is a very important issue - the difficulty of separating the extrem... (read more)
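
To make the contrast between the two pooling methods concrete, here is a minimal Python sketch comparing the arithmetic mean of probabilities with the average of log odds. The function names and the example forecasts are illustrative assumptions, not taken from the original discussion.

```python
import math

def mean_of_probs(probs):
    """Arithmetic mean of the individual probabilities."""
    return sum(probs) / len(probs)

def geo_mean_of_odds(probs):
    """Pool by averaging log odds, then convert back to a probability."""
    log_odds = [math.log(p / (1 - p)) for p in probs]
    pooled_odds = math.exp(sum(log_odds) / len(log_odds))
    return pooled_odds / (1 + pooled_odds)

# Hypothetical forecasts: one forecaster at 50%, the other increasingly extreme.
for extreme in (1e-2, 1e-4, 1e-6):
    forecasts = [0.5, extreme]
    print(extreme, mean_of_probs(forecasts), geo_mean_of_odds(forecasts))
```

In this toy case the arithmetic mean stays close to 0.25 however extreme the second forecast gets, while the pooled log odds keep moving with it.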

2 Jsevillamol 8d: I have thought more about this. I now believe that this invariance property is not reasonable - aggregating outcomes is (surprisingly) not a natural operation in Bayesian reasoning [https://www.lesswrong.com/posts/R28ppqby8zftndDAM/a-bayesian-aggregation-paradox]. So I do not think this is a strike against log-odds pooling.
3 AlexMennen 17d: I think I can make sense of this. If you believe there's some underlying exponential distribution on when some event will occur, but you don't know the annual probability, then an exponential distribution is not a good model for your beliefs about when the event will occur, because a weighted average of exponential distributions with different annual probabilities is not an exponential distribution. This is because if time has gone by without the event occurring, this is evidence in favor of hypotheses with a low annual probability, so an average of exponential distributions should have its annual probability decrease over time. An exponential distribution seems like the sort of probability distribution that I expect to be appropriate when the mechanism determining when the event occurs is well-understood, so different experts shouldn't disagree on what the annual probability is. If the true annual rate is unknown, then good experts should account for their uncertainty and not report an exponential distribution. Or, in the case where the experts are explicit models and you believe one of the models is roughly correct, then the experts would report exponential distributions, but the average of these distributions is not an exponential distribution, for good reason.
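
The claim that an average of exponential distributions has a decreasing event rate can be checked with a small Python sketch. The two rates and the equal weights are made-up values for illustration, and `mixture_hazard` is a hypothetical helper name.

```python
import math

def mixture_hazard(t, rates, weights):
    """Instantaneous event rate at time t for a mixture of exponentials,
    conditional on the event not having happened before t."""
    survival = sum(w * math.exp(-r * t) for r, w in zip(rates, weights))
    density = sum(w * r * math.exp(-r * t) for r, w in zip(rates, weights))
    return density / survival

# Hypothetical experts: annual rates of 0.5 and 0.01, weighted equally.
rates, weights = [0.5, 0.01], [0.5, 0.5]
for year in (0, 5, 10, 20):
    print(year, round(mixture_hazard(year, rates, weights), 4))
```

The rate starts at the weighted average of the two rates and drifts towards the lower one as time passes without the event, which a single exponential distribution cannot reproduce.
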
3 AlexMennen 17d: Right, the evidence about the experts comes from the new evidence that's being updated on, not the pooling procedure. Suppose we're pooling expert judgments, and we initially consider them all equally credible, so we use a symmetric pooling method. Then some evidence comes in. Our experts update on the evidence, and we also update on how credible each expert is, and pool their updated judgments together using an asymmetric pooling method, weighting experts by how well they anticipated evidence we've seen so far. This is clearest in the case where each expert is using some model, and we believe one of their models is correct but don't know which one (the case you already agreed arithmetic averages of probabilities are appropriate). If we were weighting them all equally, and then we get some evidence that expert 1 thought was twice as likely as expert 2, then now we should think that expert 1 is twice as likely to be the one with the correct model as expert 2 is, and take a weighted arithmetic mean of their new probabilities where we weight expert 1 twice as heavily as expert 2. When you do this, your pooled probabilities handle Bayesian updates correctly. My point was that, even outside of this particular situation, we should still be taking expert credibility into account in some way, and expert credibility should depend on how well the expert anticipated observed evidence. If two experts assign odds ratios $r_0$ and $s_0$ to some event before observing new evidence, and we pool these into the odds ratio $r_0^{1/2} s_0^{1/2}$, and then we receive some evidence causing the experts to update to $r_1$ and $s_1$, respectively, but expert r anticipated that evidence better than expert s did, then I'd think this should mean we would weight expert r more heavily, and pool their new odds ratios into $r_1^{2/3} s_1^{1/3}$, or something like that. But we won't handle Bayesian updates correctly if we do!
The external Bayesianity property of the mean log odds pooling method means that to handle Bayesian updates correctl
Jsevillamol's Shortform

On getting research collaborators

The 80/20 advice I would give is: be proactive in reaching out to other people and suggesting to them to work for an evening on a small project, like writing a post. Afterwards you both can decide if you are excited enough to work together on something bigger, like a paper.

For more in depth advice, here are some ways I've started collaborations in the past:

• Deconfusion sessions
I often invite other researchers for short sessions of 1-2 hours to focus on a topic, with the goal of cod
EA Forum Prize: Winners for May-July 2021

Thank you! I am quite honoured. And congratulations to the other winners!

New Data Visualisations of the EA Forum

I absolutely love the work you have done, thank you so much!

Why aren't you freaking out about OpenAI? At what point would you start?

He is listed on the website

> OpenAI is governed by the board of OpenAI Nonprofit, which consists of OpenAI LP employees Greg Brockman (Chairman & CTO), Ilya Sutskever (Chief Scientist), and Sam Altman (CEO), and non-employees Adam D’Angelo, Holden Karnofsky, Reid Hoffman, Shivon Zilis, Tasha McCauley, and Will Hurd.

It might not be up to date though

It can’t be up to date, since they recently announced that Helen Toner joined the board, and she’s not listed.

Why aren't you freaking out about OpenAI? At what point would you start?

Note that Eliezer Yudkowsky's argument in the opening link is that OpenAI's damage was done by fragmenting the AI Safety community on its launch.

This damage is done - and I am not sure it bears much relation to what OpenAI is trying to do going forward.

(I am not sure I agree with Eliezer on this one, but I lack details to tell if OpenAI's launch really was net negative)

My current best guess on how to aggregate forecasts

I found some relevant discussion in the EA Forum about extremizing in footnote 5 of this post.

The aggregation algorithm was elitist, meaning that it weighted more heavily forecasters with good track-records who had updated their forecasts more often. In these slides, Tetlock describes the elitism differently: He says it gives weight to higher-IQ, more open-minded forecasters. The extremizing step pushes the aggregated judgment closer to 1 or 0, to make it more confident. The degree to which they extremize depends on how diverse and sophisticated the pool o

[Creative Writing Contest] [Fiction] The Fey Deal

I liked this one a lot!

It was very easy to read and pulled me in. I felt compelled by the protagonist's inner turmoil, and how he makes his decision. The writing was clear and it flowed very well. This is something I will send some friends to introduce them to effective altruism.

The only part I didn't like was the ending. I like the intention of linking to GiveWell's page but it pulled me totally out of the fantasy. Also the friend felt a bit 2D. But these are minor quibbles.

Thank you for writing this!

2 Ada-Maaria Hyvärinen 2mo: Thanks for the feedback! Deciding how to end the story was definitely the hardest part in writing this. Pulling the reader out of the fantasy was a deliberate choice, but that does not mean it was necessarily the best one – I did some A/B testing on my proof reading audience but I have to admit my sample size was not that big. Glad you liked it in general anyway :)
4 Samuel Shadrach 2mo: Agreed, it would have been simpler if it was framed as the Fei taking your money and giving it to GiveWell. Perhaps the Fei wish to test human morals.
My current best guess on how to aggregate forecasts

Thanks for this post - I think this was a very useful conversation to have started (at least for my own work!), even if I'm less confident than you in some of these conclusions

Thank you for your kind words! To dismiss any impression of confidence, this represents my best guesses. I am also quite confused.

I've heard other people give good-sounding arguments for other conclusions

I'd be really curious if you can dig these up!

You later imply that you think [the geo mean of probs outperforming the geo mean of odds] is at least partly because of a specific

When pooling forecasts, use the geometric mean of odds

I get what you are saying, and I also harbor doubts about whether extremization is just pure hindsight bias or if there is something else to it.

Overall I still think it's probably justified in cases like Metaculus to extremize based on the extremization factor that would optimize the last 100 resolved questions, and I would expect the extremized geo mean with such a factor to outperform the unextremized geo mean in the next 100 binary questions to resolve (if pressed to put a number on it maybe ~70% confidence without thinking too much).

My reasoning here is... (read more)
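
As a rough sketch of what fitting the extremization factor on past questions could look like, the Python below grid-searches the factor with the best average log score on a made-up history. The helper names, the grid, and the toy data are all assumptions for illustration; this is not the script used in the discussion.

```python
import math

def extremize(p, factor):
    """Raise the odds of p to the given power and return a probability."""
    odds = (p / (1 - p)) ** factor
    return odds / (1 + odds)

def mean_log_loss(forecasts, outcomes, factor):
    """Average negative log score of the extremized forecasts (lower is better)."""
    total = 0.0
    for p, outcome in zip(forecasts, outcomes):
        q = extremize(p, factor)
        total -= math.log(q if outcome else 1 - q)
    return total / len(forecasts)

def best_factor(forecasts, outcomes, grid):
    """Grid-search the factor that would have scored best on past questions."""
    return min(grid, key=lambda d: mean_log_loss(forecasts, outcomes, d))

# Made-up history: pooled forecasts of 70% on questions that resolved yes 9 times out of 10.
past_forecasts = [0.7] * 10
past_outcomes = [1] * 9 + [0]
grid = [1 + 0.1 * k for k in range(21)]  # candidate factors from 1.0 to 3.0
print(best_factor(past_forecasts, past_outcomes, grid))
```

On this deliberately underconfident toy history the search picks a factor above 1. Whether a factor fit this way keeps helping on future questions is a separate, empirical matter.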

6 Simon_M 2mo: Looking at the rolling performance of your method (optimize on last 100 and use that to predict), median and geo mean odds, I find they have been ~indistinguishable over the last ~200 questions. If I look at the exact numbers, extremized_last_100 does win marginally, but looking at that chart I'd have a hard time saying "there's a 70% chance it wins over the next 100 questions". If you're interested in betting at 70% odds I'd be interested. No offense, but the academic literature can do one. Again, I don't find this very persuasive, given what I already knew about the history of Metaculus' underconfidence. I think extremizing might make sense if the other forecasts aren't public. (Since then the forecasts might be slightly more independent). When the other forecasts are public, I think extremizing makes less sense. This goes doubly so when the forecasts are coming from a betting market. I find this the most persuasive. I think it ultimately depends how you think people adjust for their past calibration. It's taken the community ~5 years to reduce its under-confidence, so maybe it'll take another 5 years. If people immediately update, I would expect this to be very unpredictable.
My current best guess on how to aggregate forecasts

Hmm good question.

For a quick foray into this we can see what would happen if we use as our estimate the mean of the max likelihood beta distribution implied by the sample of forecasts.

The log-likelihood to maximize is then

The Wikipedia article on the Beta distribution discusses this maximization problem in depth, pointing out that, although no closed form exists, if the shape parameters can be assumed to be not too small the max likelihood estimate can be approximated as ... (read more)
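
As a reference point, the standard Beta log-likelihood can be written as below. The notation is an assumption for illustration and not necessarily that of the original comment: $p_1, \dots, p_N$ are the individual forecasts, $\alpha$ and $\beta$ are the shape parameters, and $B$ is the Beta function.

```latex
\ell(\alpha, \beta)
  = (\alpha - 1) \sum_{i=1}^{N} \ln p_i
  + (\beta - 1) \sum_{i=1}^{N} \ln (1 - p_i)
  - N \ln B(\alpha, \beta)
```

Under this reading, the pooled estimate would be the mean of the fitted distribution, $\hat{\alpha} / (\hat{\alpha} + \hat{\beta})$.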

When pooling forecasts, use the geometric mean of odds

I was curious about why the extremized geo mean of odds didn't seem to beat other methods. Eric Neyman suggested trying a smaller extremization factor, so I did that.

I tried an extremizing factor of 1.5, and reused your script to score the performance on recent binary questions. The result is that the extremized prediction comes out on top.

This has restored my faith in extremization. In hindsight, recommending a fixed extremization factor was silly, since the correct extremization factor is going to depend on the predictors being aggregated and the... (read more)
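
For readers unfamiliar with the operation, here is a minimal Python sketch of an extremized geometric mean of odds. The function name and the example forecasts are assumptions for illustration; the factor of 1.5 is the one mentioned above.

```python
import math

def extremized_geo_mean_of_odds(probs, factor=1.0):
    """Geometric mean of odds, with the pooled odds raised to `factor`.

    factor=1.0 gives the plain geometric mean of odds; factor>1 pushes
    the pooled forecast away from 50%.
    """
    mean_log_odds = sum(math.log(p / (1 - p)) for p in probs) / len(probs)
    pooled_odds = math.exp(factor * mean_log_odds)
    return pooled_odds / (1 + pooled_odds)

# Hypothetical community forecasts for a single binary question.
forecasts = [0.7, 0.8, 0.9]
print(extremized_geo_mean_of_odds(forecasts))              # plain pooling
print(extremized_geo_mean_of_odds(forecasts, factor=1.5))  # extremized
```

Raising the pooled odds to a power is equivalent to multiplying the average log odds by that factor, so a single parameter controls how far the aggregate is pushed towards 0 or 1.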

9 Simon_M 2mo: I think this is the wrong way to look at this. Metaculus was way underconfident originally. (Prior to 2020, 22% using their metric). Recently it has been much better calibrated - (2020- now, 4% using their metric). Of course if they are underconfident then extremizing will improve the forecast, but the question is what is most predictive going forward. Given that before 2020 they were 22% underconfident, more recently 4% underconfident, it seems foolhardy to expect them to be underconfident going forward. I would NOT advocate extremizing the Metaculus community prediction going forward. More than this, you will ALWAYS be able to find an extremize parameter which will improve the forecasts unless they are perfectly calibrated. This will give you better predictions in hindsight but not better predictions going forward. If you have a reason to expect forecasts to be underconfident, by all means extremize them, but I think that's a strong claim which requires strong evidence.
When pooling forecasts, use the geometric mean of odds

I don't think I get your argument for why the approximation should not depend on the downstream task. Could you elaborate?

Your best approximation of the summary distribution is already "as good as it can get". You think we should be cautious and treat this probability as if it could be higher for precautionary reasons? Then I argue that you should treat it as higher, regardless of how you arrived at the estimate.

In the end this circles back to basic Bayesian / Utility theory - in the idealized framework your cr... (read more)

When pooling forecasts, use the geometric mean of odds

I think this is a good account of the institutional failure example, thank you!

Honoring Petrov Day on the EA Forum: 2021

I think it was an intentional false alarm, to better simulate Petrov's situation

2 MichaelStJules 2mo: They should have left it up longer if they wanted to test us with it, since it was gone when I reloaded the pages and the timer was never updated while it was up, even though each side would have an hour to retaliate (or it was supposed to give the impression that the hour was over, and it was already too late).

Similarly, I would like for comments I have minimised to stay minimised between visits (unless there is a new reply in thread)

When pooling forecasts, use the geometric mean of odds

The ideal system would [not] aggregate first into a single number [...] Instead, the ideal system would use the whole distribution of estimates

And I have concluded that the ideal aggregation procedure should compress all the information into a single prediction - our best guess for the actual distribution of the event.

Concretely, I think that in an idealized framework we should be treating the expert predictions as Bayesian evidence for the actual distribution of the event of interest... (read more)

6 Lukas_Finnveden 2mo: This seems roughly right to me. And in particular, I think this highlights the issue with the example of institutional failure. The problem with aggregating predictions to a single guess p of annual failure, and then using p to forecast, is that it assumes that the probability of failure in each year is independent from our perspective. But in fact, each year of no failure provides evidence that the risk of failure is low. And if the forecasters' estimates initially had a wide spread, then we're very sensitive to new information, and so we should update more on each passing year. This would lead to a high probability of failure in the first few years, but still a moderately high expected lifetime.
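
A small Python sketch of the institutional failure example, under made-up numbers: two equally credible experts give annual failure probabilities of 50% and 1%. All names and values are assumptions for illustration.

```python
def expected_lifetime(annual_p):
    """Expected number of years until failure for a constant annual failure probability."""
    return 1 / annual_p

def mixture_survival(years, annual_ps, weights):
    """Probability of surviving `years` years when unsure which expert is right."""
    return sum(w * (1 - p) ** years for p, w in zip(annual_ps, weights))

# Hypothetical experts: annual failure probabilities of 50% and 1%, equally credible.
annual_ps, weights = [0.5, 0.01], [0.5, 0.5]

# Option 1: pool into a single annual probability, then forecast with it.
pooled_p = sum(w * p for p, w in zip(annual_ps, weights))
lifetime_from_pooled_p = expected_lifetime(pooled_p)

# Option 2: keep both hypotheses and average the resulting lifetimes.
lifetime_from_mixture = sum(w * expected_lifetime(p) for p, w in zip(annual_ps, weights))

print(lifetime_from_pooled_p, lifetime_from_mixture)
print(mixture_survival(5, annual_ps, weights), (1 - pooled_p) ** 5)
```

Forecasting from the single pooled probability gives an expected lifetime of about 4 years, while keeping both hypotheses gives 51 years, even though the chance of failure in the first year is the same in both cases.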

I've been having some mixed feelings about some recent initiatives in the Forum.

These include things in the space of the creative fiction contest, posting humorous top level content and asking people to share memes.

I am having trouble articulating exactly what is causing my uneasiness. I think it's something along the lines of "I use the EA Forum to stay up to date on research, projects and considerations about Effective Altruism. Fun content distracts from that experience, and makes it harder for the work I publish in the Forum to be taken seriously"... (read more)

9 Aaron Gertler 2mo: Thanks for voicing these concerns! You've articulated a not-uncommon point of view on how the Forum ought to be used, and one that we try to incorporate into our work alongside many other points of view. I've heard some people express a desire for the Forum to look more like a peer-reviewed journal. I've heard even more express concerns in the opposite direction — that the site feels like it has a very high bar for engagement, and any content other than serious research seems suitable only for Facebook (many of those people are trying to use Facebook less or not at all [https://thezvi.wordpress.com/2017/04/22/against-facebook/]). Other people have told me that they just really enjoy creative writing, art, jokes, etc., and want the Forum to represent that side of EA culture. Sometimes, the creative work is a big part of what drew them to the movement in the first place. I think that examples like "The Fable of the Dragon-Tyrant [https://forum.effectivealtruism.org/posts/WHL3EhgjdEcEca3ae/the-fable-of-the-dragon-tyrant]" show that more "creative" EA content clearly has a place in the movement, and that we'd be better off with more stories of that quality. Hence, the writing contest. Just as not all research on the Forum is as strong as e.g. that of Rethink Priorities, not all stories will be cultural touchstones that stand the test of time. Still, I think the gems are worth having a lot of rougher content show up. The encouragement for people to share their work in public (rather than quietly submitting it through a form) is partly in response to feedback about the Forum's "high bar", and partly to encourage more representation for that side of EA culture. I want to encourage people to share their work and not worry as much about whether something "qualifies" to be here. (Some of the Forum's best posts have started with an author emailing me to say something like "I don't know if this is a good fit, but I figured I would check".
I don't know how many additional
2 Ines 2mo: I agree with you—I generally come to the forum looking for more thoughtful content, and there are already several EA Facebook groups for which at least the meme post would have been more appropriate. I think the writing contest is probably fine though.

Some discussion about profile pictures for the Forum here

When pooling forecasts, use the geometric mean of odds

Thank you! I learned too from the examples.

One question:

In particular, that the best approach for practical rationality involves calculating things out according to each of the probabilities and then aggregating from there (or something like that), rather than aggregating first.

I am confused about this part. I think I said exactly the opposite? You need to aggregate first, then calculate whatever you are interested in. Otherwise you lose information (because eg taking the expected value of the individual predictions loses information that was contain... (read more)

4 Toby_Ord 2mo: I think we are roughly in agreement on this, it is just hard to talk about. I think that compression of the set of expert estimates down to a single measure of central tendency (e.g. the arithmetic mean) loses information about the distribution that is needed to give the right answer in each of a variety of situations. So in this sense, we shouldn't aggregate first. The ideal system would neither aggregate first into a single number, nor use each estimate independently and then aggregate from there (I suggested doing so as a contrast to aggregation first, but agree that it is not ideal). Instead, the ideal system would use the whole distribution of estimates (perhaps transformed based on some underlying model about where expert judgments come from, such as assuming that numbers between the point estimates are also plausible) and then do some kind of EV calculation based on that. But this is so general an approach as to not offer much guidance, without further development.
When pooling forecasts, use the geometric mean of odds

I agree with the general point of "different situations will require different approaches".

From that common ground, I am interested in seeing whether we can tease out when it is appropriate to use one method against the other.

*disclaimer: low confidence from here onwards

I do not find the first example about value 0 vs value 500 entirely persuasive, though I see where you are coming from, and I think I can see when it might work.

The arithmetic mean of probabilities is entirely justified when aggregating predictions from ... (read more)

3 Toby_Ord 3mo: I agree with a lot of this. In particular, that the best approach for practical rationality involves calculating things out according to each of the probabilities and then aggregating from there (or something like that), rather than aggregating first. That was part of what I was trying to show with the institution example. And it was part of what I was getting at by suggesting that the problem is ill-posed — there are a number of different assumptions we are all making about what these probabilities are going to be used for and whether we can assume the experts are themselves careful reasoners etc. and this discussion has found various places where the best form of aggregation depends crucially on these kinds of matters. I've certainly learned quite a bit from the discussion. I think if you wanted to take things further, then teasing out how different combinations of assumptions lead to different aggregation methods would be a good next step.
When pooling forecasts, use the geometric mean of odds

Let's work this example through together! (but I will change the quantities to 10 and 20 for numerical stability reasons)

One thing we need to be careful with is not mixing the implied beliefs with the object level claims.

In this case, person A's claim that the value is  is more accurately a claim that the beliefs of person A can be summed up as some distribution over the positive numbers, eg a log normal with parameters  and  . So the density distribution of beliefs of A is ... (read more)

I think you mean bearish

Oops yes 🐻

You point out this highly skilled management/leadership/labor is not fungible

Yes, exactly.

I think what I am pointing towards is something like "if you are one such highly skilled editor, and your plan is to work on something like this part time delegating work to more junior people, then you are going to find yourself burnt out very soon. Managing a team of junior people / people who do not share your aesthetic sense to do highly skilled labor will be, at least for the first six months or so, much more work than if... (read more)

I am more bullish about this. I think for Distill to succeed it needs to have at least two full time editors committed to the mission.

Managing people is hard. Managing people, training them and making sure the vision of the project is preserved is insanely hard - a full time job for at least two people.

Plus the part Distill was bottlenecked on is very high skilled labour, which needed a special aesthetic sensitivity and commitment.

50 senior hours per draft sounds insane - but I do believe the Distill staff when they say it is needed.

This wraps back to why ... (read more)

2 Charles He 3mo: Hi, this is another great comment, thank you! Note: I think what was meant here was "bearish", not bullish. I think what you're saying is you're bearish or have a lower view of this intervention because the editor/founders have a rare combination of vision, aesthetic view and commitment. You point out this highly skilled management/leadership/labor is not fungible—we can't just hire 10 AI practitioners and 10 designers to equal the editors who may have left.

Create a journal of AI safety, and get prestigious people like Russell publishing in it.

Basically many people in academia are stuck chasing publications. Aligning that incentive seems important.

The problem is that journals are hard work, and require a very specific profile to push them forward.

Here is a post mortem of a previous attempt: https://distill.pub/2021/distill-hiatus/

From your comment, I just learned that Distill.pub is shutting down and this is sad.

The site was beautiful. The attention to detail, and attention to the reader and presentation were amazing.

Their mission seems relevant to AI safety and risk.

Relevant to the main post and the comment above, the issues with Distill.pub seem not to be structural/institutional/academic/social—but operational, related to resources and burnout.

This seems entirely fixable by money, maybe even a reasonable amount compared to other major interventions in the AI/longtermist sp... (read more)

2 PabloAMC 3mo

I think you can still publish in conferences, and I have seen that at least AAAI has the topic of safety and trustworthiness among their areas of interest. I would say then that this is not the main issue? Creating a good journal seems like a good thing to do, but I think it addresses a slightly different problem, "how to align researchers' incentives with publishing quality results", not necessarily getting them excited about AIS.
When pooling forecasts, use the geometric mean of odds

META: Do you think you could edit this comment to include...

1. The number of questions, and aggregated predictions per question?
2. The information on extremized geometric mean you computed below (I think it is not receiving as much attention due to being buried in the replies)?
3. Possibly a code snippet to reproduce the results?

import requests, json
import numpy as np
import pandas as pd

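def fetch_results_data():
    # Seed the loop with the first page; each API response points to the next page in "next".
    response = {"next":"https://www.metaculus.com/api2/questions/?limit=100&status=resolved"}

    results = []
    while response["next"] is not None:
        print(response["next"])
        # Fetch and parse the page before reading its results.
        response = json.loads(requests.get(response["next"]).text)
        results.append(response["results"])
    # Flatten the list of pages into a single list of questions.
    return sum(results,[])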

all_results = fetch_results_data()
binary_qns = [q for q in all_results if q['possibilities']['type'] == 'binary' and q['resolution'] in [0,1]]
bi
When pooling forecasts, use the geometric mean of odds

You are right and I should be more mindful of this.

I have reformulated the main equations using only commonly known symbols, moved the equations that were not critical for the text to a footnote and added plain language explanations to the rest.

(I hope it is okay that I stole your explanation of the geometric mean!)

3 Harrison D 3mo

Those equations are definitely less intimidating / I could understand them without an issue. (And that’s totally fine, I hope it’s helpful)
When pooling forecasts, use the geometric mean of odds

I mean in the past people were underconfident (so extremizing would make their predictions better). Since then they've stopped being underconfident.  My assumption is that this is because the average predictor is now more skilled or because more predictors improves the quality of the average.

Gotcha!

The bias isn't that more questions resolve positively than users expect.

Oh I see!

When pooling forecasts, use the geometric mean of odds

but also the average predictor improving their ability also fixed that underconfidence

What do you mean by this?

Metaculus has a known bias towards questions resolving positive

Oh I see!

It is very cool that this works.

One thing that confuses me - when you take the geometric mean of probabilities you end up with $p_{\text{pooled}} + (1-p)_{\text{pooled}} < 1$. So the pooled probability gets slightly nudged towards 0 in comparison to what you would get with the geometric mean of odds. Doesn't that mean that it should be less accurate, given the bias towards questions resolving ... (read more)

2 Simon_M 3mo

I mean in the past people were underconfident (so extremizing would make their predictions better). Since then they've stopped being underconfident. My assumption is that this is because the average predictor is now more skilled or because more predictors improves the quality of the average.

The bias isn't that more questions resolve positively than users expect. The bias is that users expect more questions to resolve positive than actually resolve positive. Shifting probabilities lower fixes this.

Basically lots of questions on Metaculus are "Will X happen?" where X is some interesting event people are talking about, but the base rate is perhaps low. People tend to overestimate the probability of X relative to what actually occurs.
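The nudge discussed in this thread can be checked numerically. Below is a minimal sketch, assuming made-up forecasts and hypothetical helper names (`pool_geo_mean_probs`, `pool_geo_mean_odds`) that are not from the original analysis. It only illustrates that the geometric mean of probabilities gives pooled "yes" and "no" values that sum to less than 1, and so sits below the geometric mean of odds.

```python
import math

def geo_mean(values):
    """Geometric mean of a list of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def pool_geo_mean_probs(ps):
    """Pool by taking the geometric mean of the raw probabilities."""
    return geo_mean(ps)

def pool_geo_mean_odds(ps):
    """Pool by taking the geometric mean of odds, then convert back to a probability."""
    pooled_odds = geo_mean([p / (1 - p) for p in ps])
    return pooled_odds / (1 + pooled_odds)

# Made-up forecasts for a single binary question (illustration only).
forecasts = [0.1, 0.5, 0.9]
flipped = [1 - p for p in forecasts]

p_yes = pool_geo_mean_probs(forecasts)
p_no = pool_geo_mean_probs(flipped)
p_odds = pool_geo_mean_odds(forecasts)

print("geo mean of probs, yes + no:", p_yes + p_no)  # below 1 when forecasters disagree
print("geo mean of probs:", p_yes)
print("geo mean of odds: ", p_odds)
```

Whether the lower value is more accurate is an empirical question about the question set; the sketch says nothing about that.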
When pooling forecasts, use the geometric mean of odds

(I note these scores are very different than in the first table; I assume these were meant to be the Brier scores instead?)

[This comment is no longer endorsed by its author]

2 Simon_M 3mo

Yes - copy and paste fail - now corrected
When pooling forecasts, use the geometric mean of odds

Thank you for the superb analysis!

This increases my confidence in the geo mean of the odds, and decreases my confidence in the extremization bit.

I find it very interesting that the extremized version was consistently below by a narrow margin. I wonder if this means that there is a subset of questions where it works well, and another where it underperforms.

One question / nitpick: what do you mean by geometric mean of the probabilities? If you just take the geometric mean of probabilities then you do not get a valid probability - the sum of the pooled ps and... (read more)

5 Simon_M 3mo

I think it's actually that historically the Metaculus community was underconfident (see track record [https://www.metaculus.com/questions/track-record/] here before 2020 vs after 2020). Extremizing fixes that underconfidence, but also the average predictor improving their ability also fixed that underconfidence.

Metaculus has a known bias towards questions resolving positive. Metaculus users have a known bias overestimating the probabilities of questions resolving positive. (Again - see the track record).

Taking a geometric median of the probabilities of the events happening will give a number between 0 and 1. (That is, a valid probability). It will be inconsistent with the estimate you'd get if you flipped the question HOWEVER Metaculus users also seem to be inconsistent in that way, so I thought it was a neat way to attempt to fix that bias. I should have made it more explicit, that's fair.

Edit: Updated for clarity based on comments below
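To make the extremizing point concrete, here is a rough sketch of what extremizing the geometric mean of odds does to a pooled forecast. The forecasts, the factor `d = 1.5` and the function names are all assumptions for illustration, not the values or code used in the analysis above. Extremizing here means multiplying the pooled log-odds by `d`, which pushes the pooled probability away from 0.5.

```python
import math

def extremized_geo_mean_odds(ps, d=1.0):
    """Geometric mean of odds with the pooled log-odds scaled by d (d=1 means no extremizing)."""
    log_odds = [math.log(p / (1 - p)) for p in ps]
    pooled_log_odds = d * sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-pooled_log_odds))

def brier(p, outcome):
    """Brier score for one binary question (lower is better)."""
    return (p - outcome) ** 2

# Made-up forecasts for one question (illustration only).
forecasts = [0.6, 0.7, 0.8]

plain = extremized_geo_mean_odds(forecasts, d=1.0)
extreme = extremized_geo_mean_odds(forecasts, d=1.5)  # d=1.5 is an assumed factor

print("pooled:", plain, "extremized:", extreme)
print("Brier if it resolves yes:", brier(plain, 1), brier(extreme, 1))
print("Brier if it resolves no: ", brier(plain, 0), brier(extreme, 0))
```

Extremizing helps the Brier score when the pooled forecast leans the right way and hurts when it leans the wrong way, so it pays off only if the pool is underconfident on average. That fits the observation that it stopped helping once the community's underconfidence went away.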
My first PhD year

Thank you!

The freedom for side projects is the best - though I should warn other people here that having a supportive supervisor who is okay with this is crucial.

I have definitely heard more than one horror story from colleagues who were constantly fighting their supervisors on the direction of their research, and felt they had little room for side projects.

Definitely one of my favorite examples, and one we're using now:

Still, that's exactly what makes this a good suggestion! If I'd forgotten to add the initiative, this would have been a critical reminder.

More EAs should consider “non-EA” jobs

But when I look at this chart, my main takeaway is that there’s a ton of money being spent on welfare, and that working to make sure that money is spent as efficiently as possible could have a huge impact.

I think this is basically true.

A while back I thought that it was false - in particular, I thought that public money was extremely tight, and that fighting to change a budget was an extremely political issue where one would face a lot of competition.

My experience collaborating with public organizations and hearing from public servants so far has be... (read more)

2 Sarah H 3mo

Thank you for sharing that! I like your idea about talking to people within these orgs--I know that my sense of how things work has been really changed by actually seeing some of this firsthand.

I think another element to consider is what level of government we're talking about. My sense is that the federal budget tends to be more politicized than many state and local-level budgets, and that with state and local budgets there's more room for a discussion of "what is actually needed here in the community" vs. it becoming a straightforward red/blue issue (at least here in the states). I wonder if this means that, at least in some instances, interventions related to state and local-level would be more tractable than national ones. I'm reminded of the Zurich ballot initiative [https://forum.effectivealtruism.org/posts/dTdSnbBB2g65b2Fb9/eaf-s-ballot-initiative-doubled-zurich-s-development-aid], for example.
[PR FAQ] Sharing readership data with Forum authors

Reading time is one of my favourite features from Medium, and has helped me understand which of my posts are most useful. This in turn has informed my decisions on what to focus on writing.

I expect to get similar benefits if a feature like this is implemented in the EA Forum.