All of nikos's Comments + Replies

Can't think of anything better than a t-test, but open for suggestions. 

If a forecaster is consistently off by like 10 percentage points - I think that is a difference that matters. But even in that extreme scenario where the (simulated) difference between two forecasters is in fact quite large, we have a hard time picking that up using standard significance tests.
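To make the power problem concrete, here is a minimal simulation sketch in the spirit of the analysis (the setup is my own assumption, not the post's actual code): forecaster B is shifted 10 percentage points from the truth, and a paired t-test on per-question Brier scores still often fails to reach significance at modest question counts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_questions = 100

# True probabilities per question and the realised binary outcomes
p_true = rng.uniform(0.1, 0.9, n_questions)
outcomes = rng.binomial(1, p_true)

# Forecaster A is calibrated; forecaster B is off by 10 percentage points
pred_a = p_true
pred_b = np.clip(p_true + 0.10, 0.01, 0.99)

# Per-question Brier scores (lower is better)
brier_a = (pred_a - outcomes) ** 2
brier_b = (pred_b - outcomes) ** 2

# Paired t-test on the per-question score differences
t_stat, p_value = stats.ttest_rel(brier_a, brier_b)
print(f"Brier A: {brier_a.mean():.3f}  Brier B: {brier_b.mean():.3f}  p = {p_value:.3f}")
```

In this toy setup the expected Brier gap from a constant 10-point miss is only about 0.01, while the per-question noise is an order of magnitude larger - one way to see why the test struggles.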

1
JoshuaBlake
6mo
Afraid I don't have good ideas here. Intuitively, I think there should be a way to take advantage of the fact that the outcomes are heavily structured. You have predictions on the same questions and they have a binary outcome. OTOH, if in 20% of cases the worse forecaster is better on average, that suggests that there is just a hard bound on how much we can get.

Interesting, thanks for sharing the paper. Yeah agree that using the Brier score / log score might change results and it would definitely be good to check that as well. 

In principle yes. In practice also usually yes, but the specifics depend on whether the average user who predicted on a question received a positive number of points. So if you predicted very late and your points are close to zero, but the mean number of points forecasters on that question received is positive, then you will end up with a negative update to your reputation score. 
Completely agree that a lot hinges on that reputation score. It seems to work decently for the Metaculus Prediction, but it would be good to see what results look like for a different metric of past performance. 
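A toy illustration of the mechanism as described here (not Metaculus's actual reputation formula, just the shape of it; the function and numbers are made up):

```python
# Toy illustration of relative scoring: the update is positive only if you
# beat the average forecaster on the question. NOT the real Metaculus formula.
def reputation_update(user_points: float, question_mean_points: float) -> float:
    return user_points - question_mean_points

# Predict very late: your points are near zero, but the crowd's mean is
# positive, so your reputation still takes a hit.
print(reputation_update(user_points=0.5, question_mean_points=12.0))  # -11.5
```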

Not sure how to quantify that (open for ideas). But intuitively I agree with you and would suspect it's at least a sizable part.

3
Charles Dillon
1y
Suggestion: pre-commit to a ranking method for forecasters. Chuck out questions which go to <5%/>95% within a week. Take the pairs (question, time) with 10n+ updates within the last m days for some n,m, and no overlap (for questions with overlap pick the time which maximises number of predictions). Take the n best forecasters per your ranking method in the sample and compare them to the full sample and the "without them" sample.
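For concreteness, here is what the first two filtering steps of that suggestion could look like in pandas; all column names (question_id, timestamp, prediction, open_time) are hypothetical:

```python
import pandas as pd

# Hypothetical prediction log with columns: question_id, user_id,
# timestamp, prediction (0-1), open_time (question open date).
def drop_extreme_questions(preds: pd.DataFrame) -> pd.DataFrame:
    """Chuck out questions that go to <5% or >95% within a week of opening."""
    first_week = preds[preds["timestamp"] <= preds["open_time"] + pd.Timedelta(days=7)]
    extremes = first_week.groupby("question_id")["prediction"].agg(["min", "max"])
    bad = extremes[(extremes["min"] < 0.05) | (extremes["max"] > 0.95)].index
    return preds[~preds["question_id"].isin(bad)]

def busy_questions(preds: pd.DataFrame, end: pd.Timestamp,
                   m_days: int, min_updates: int) -> pd.Index:
    """Question ids with at least `min_updates` predictions in the last m days."""
    window = preds[(preds["timestamp"] > end - pd.Timedelta(days=m_days))
                   & (preds["timestamp"] <= end)]
    counts = window.groupby("question_id").size()
    return counts[counts >= min_updates].index
```

The remaining step would be to score the aggregate of the pre-committed top-n forecasters against the full sample and the leave-them-out sample on the surviving (question, time) pairs.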

Yeah, definitely. The title was a bit tongue-in-cheek (it's a movie quote)

And is the code to the MetaculusBot public somewhere? :) 

It should be possible to fully automate the bot and just run a CRON job that regularly checks the Metaculus API for new questions, right? 
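Something like the following, run every half hour from crontab, might be all that's needed. The endpoint and response fields are assumptions based on the public Metaculus API and may have changed, so check the current API docs:

```python
import requests

# Assumed endpoint and response shape; verify against the current API docs.
API_URL = "https://www.metaculus.com/api2/questions/"

def fetch_new_questions(since_id: int) -> list[dict]:
    """Return questions with an id greater than the last one we've seen."""
    resp = requests.get(API_URL, params={"order_by": "-created_time", "limit": 20})
    resp.raise_for_status()
    return [q for q in resp.json().get("results", []) if q.get("id", 0) > since_id]

if __name__ == "__main__":
    # A cron entry like `*/30 * * * * python poll_metaculus.py` covers the
    # "regularly checks" part; persisting since_id is left out for brevity.
    for q in fetch_new_questions(since_id=0):
        print(q["id"], q.get("title"))
```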

I slightly tend towards yes, but that's mere intuition. As someone on Twitter put it, "Metaculus has a more hardcore user base, because it's less fun" - I find it plausible that the Metaculus user base and the Manifold user base differ. But higher trading volume, I think, would have helped. 

For this particular analysis I'm not sure correcting for the number of forecasters would really be possible in a sound way. It would be great to get the MetaculusBot more active again to collect more data. 

Is it possible to get rid of the question mode for this post?

For Metaculus there are lots of ways to drive engagement: prioritise making the platform easier to use, increase cash prizes, community building and outreach etc. 

But as mentioned in the article the problem in practice is that the bootstrap answer is probably misleading, as increasing the number of forecasters likely changes forecaster composition. 
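For reference, a minimal version of the kind of forecaster bootstrap being discussed (median aggregation and the toy data are my assumptions, not the article's actual code); the caveat above applies, since resampling the existing pool cannot capture changes in forecaster composition:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_brier(preds: np.ndarray, outcome: int, k: int,
                    n_boot: int = 1000) -> float:
    """Mean Brier score of the median of k forecasters resampled with replacement."""
    scores = []
    for _ in range(n_boot):
        sample = rng.choice(preds, size=k, replace=True)
        scores.append((np.median(sample) - outcome) ** 2)
    return float(np.mean(scores))

# Toy data: 50 noisy forecasters on a question that resolved "yes"
preds = np.clip(rng.normal(0.7, 0.15, size=50), 0.01, 0.99)
for k in (1, 5, 10, 30):
    print(k, round(bootstrap_brier(preds, outcome=1, k=k), 4))
```

On this toy data the aggregate's score improves as k grows, which is the pattern the bootstrap is meant to quantify.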

However, one specific example where the analysis might be actually applicable is when you're thinking about how many Pro Forecasters you hire for a job. 

In principle yes, you'll just always have the problem that people are predicting at different time points. If the best and the 2nd-best forecaster predict weeks or months apart, then that changes results. 

Ah snap! I forgot to remove that paragraph... I did subsampling initially, then switched to bootstrapping. Results remained virtually unchanged. Thanks for pointing that out, will update the text. 

Hi Simon, I'm working on a follow-up to this post that uses individual-level data. Could you please give some detail on how you "sampled" k predictors? As in, did you have access to individual data and could actually do the sampling? I'm not entirely sure what the x-axis in your plot means and what the difference between ">N predictors" and "k predictors" is. Thank you!

2
Simon_M
1y
iirc, there is access to the histogram, which tells you how many people predicted each %age. I then sampled k predictors from that distribution. "k predictors" is the number of samples I was looking at; ">N predictors" was the total number of people who predicted on a given question.
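A small sketch of that procedure (the histogram here is simulated; the real one would come from the question page):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in for the real histogram: number of predictors at each
# percentage from 1 to 99 on a given question.
bins = np.arange(1, 100)
counts = rng.poisson(3.0, size=bins.size)

def sample_k_predictors(k: int) -> np.ndarray:
    """Draw k individual predictions from the histogram distribution."""
    probs = counts / counts.sum()
    return rng.choice(bins, size=k, p=probs) / 100

print(sample_k_predictors(5))
```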

I acknowledge that transparency is complex, that there are trade-offs and that it isn't clear what the correct amount of transparency is. I also acknowledge that it is normal that regular grants are published with a delay. So I'm not making a general claim or demand that everything needs to be public (I even explicitly say that). What I say is that 
a) I'm in favour of valuing transparency highly by default. 
b) I feel in this specific case more communication would have helped.

My intuition is that Open Phil overall is quite transparent. I was less s... (read more)

Yeah, I think we basically agree on all of the points here, and I apologize that my characterization of your claim was, in fact, uncharitable.

nikos
1y

I think your criticism of bikeshedding somewhat misses the point people are raising. Of course the amount of money spent on WA is tiny compared to other things. The reason it's worth talking about is that it tells you something about EA culture and how EA operates. 

This is in large part a discussion about what culture the movement should have, what EA wants to be and how it wants to communicate to the world. The reason you care about how someone builds a bike shed is because that carries information about what kind of person they are, how trustwor... (read more)

7
Jan_Kulveit
1y
I will try to paraphrase, please correct me if I'm wrong about this: the argument is, this particular bikeshed is important because it provides important evidence about how EA works, how trustworthy the people are, or what the levels of transparency are. I think this is a fair argument. At the same time I don't think it works in this case, because while I think EA has important issues, this purchase does not really illuminate them. Specifically, object level facts about this bikeshed

* do not provide that much evidence, beyond basic facts like "people involved in this have access to money"
* the things they tell you are mostly boring
* they provide some weak positive evidence about the people involved being sane and reasonable
* it is unclear how much evidence provided by this generalizes to nuclear reactors

Object level, you don't need precise numbers and long spreadsheets to roughly evaluate it. As I gestured to, in late 2021, the "x-risk-reduction" area had billions of dollars committed to it, less than a thousand people working on it, and good experience with progress made on in-person events. Given the ~low millions pound effective cost of the purchase and the marginal costs of time and money, it seems like a sensible decision. In my view this conclusion does not strongly depend on priors about EA, but you can reach it by doing a quick calculation and a few Google searches.

Things about the process seem mostly boring. How it went seems:
1. some people thought an events venue near Oxford is a sensible, even if uncertain, bet
2. they searched for venues
3. selected a candidate
4. got funding
5. EVF decided to fiscally sponsor this
6. the venue was bought
7. this was not announced with a fanfare
8. boring things like reconstructing some things started?

(Disclosure about step 2: I had seen the list of candidate venues, and actually visited one other place on the list. The process was in my view competent and sensible, for example in the asp... (read more)

Agreed. Effective Altruism embodies a set of values. I agree with these values. I was incredibly worried that CEA/EVF was making a big decision (15 million remains a large amount! It's millions of bednets!) that didn't embody these values. This is why I made the "Why did CEA purchase Wytham Abbey?" post. We shouldn't put too much weight on PR, spin and appearance. But we should care a lot about not losing track of what EA is all about. How EVF went about purchasing Wytham Abbey might translate to how they spend money in other areas as well, including high-... (read more)

2
[comment deleted]
1y

$1000 per person per day just for the place seems pretty expensive for a 30 person conference... 

I'm confused why you wouldn't feel concerned about EA potentially wasting 15M pounds (talking about your hypothetical example, not the real purchase). I feel that would mean that EA is not living up to its own standards of using evidence and reasoning to help others in the best possible way. 

Since EA isn't optimizing the goal "flip houses to make a profit", I expect us to often be willing to pay more for properties than we'd expect to sell them for. Paying 2x is surprising, but it doesn't shock me if that sort of thing is worth it for some reason I'm not currently tracking.

MIRI recently spent a year scouring tens of thousands of properties in the US, trying to find a single one that met conditions like "has enough room to fit a few dozen people", "it's legal to modify the buildings or construct a new one on the land if we want to", and "near b... (read more)

Currently there are 4 people (including me) working on the project. I focus on coordination, the other three are professional forecasters and focus on the data collection. At the moment we're aiming for wide feedback from anyone who would be interested in certain base rates, but we're not actively crowd-sourcing the collection process. 

Do you mean have the table be in a long-format rather than a wide-format? 
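In case it helps, a minimal example of the distinction (table contents are made up):

```python
import pandas as pd

# Wide format: one row per event, one column per year
wide = pd.DataFrame({
    "event": ["coup", "sovereign default"],
    "2020": [12, 5],
    "2021": [9, 7],
})

# Long format: one row per (event, year) observation
long = wide.melt(id_vars="event", var_name="year", value_name="count")
print(long)
```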

Also, there is now a bot (@effective_jobs) that retweets high-impact job offers that may be relevant to the EA community.

Ah yes, I had seen the @ealtruist account - but as far as I can tell someone did that manually (it doesn't look like a bot) and then stopped. 

We could also merge the two - use the old account with this code or something. In principle I'm open to that.