Hi Mauricio! More details are in the post linked at the top: https://forum.effectivealtruism.org/posts/N6cXCLDPKzoGiuDET/yale-ea-virtual-fellowship-retrospective-summer-2020#Selectiveness
I agree: I do not think I would say that "we have evidence that there is not a strong relation". But I do feel comfortable saying that we do not have evidence of any relation at all.
The confidence intervals are extremely wide, given our small sample sizes (each cohort lists the 95% interval, then the 75% interval):
Spring 2019: -0.75 to 0.5 (95%) and -0.55 to 0.16 (75%)
Fall 2019: -0.37 to 0.69 and -0.19 to 0.43
Spring 2020: -0.67 to 0.66 and -0.37 to 0.37
Summer 2020: -0.60 to 0.51 and -0.38 to 0.26
The upper ends are very high, so there is certainly a possibility that our interview scoring process is actually good. But of the four observed correlations, two are negative and two are positive, and the highest positive one is only 0.10.
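To give a sense of why intervals this wide are exactly what you'd expect at these sample sizes, here is a quick illustrative sketch (not our actual analysis code) of the standard Fisher z-transform confidence interval for a Pearson correlation; the cohort size of 15 is a hypothetical round number, not one of our actual cohort sizes:

```python
import math

def fisher_ci(r, n, z=1.96):
    """Approximate confidence interval for a Pearson correlation
    via the Fisher z-transform. z=1.96 gives a ~95% interval;
    z=1.15 gives a ~75% interval."""
    zr = math.atanh(r)             # map r to an approximately normal scale
    se = 1 / math.sqrt(n - 3)      # standard error on that scale
    lo, hi = zr - z * se, zr + z * se
    return math.tanh(lo), math.tanh(hi)   # map back to the correlation scale

# Even the largest correlation we observed (0.10), in a hypothetical
# cohort of 15, is consistent with anything from a moderate negative
# to a strong positive relationship.
lo, hi = fisher_ci(0.10, 15)
print(round(lo, 2), round(hi, 2))
```

The 1/sqrt(n - 3) term is the whole story here: with cohorts this small, no plausible observed correlation could have produced a narrow interval.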
To somebody who has never been to San Francisco in the summer, it seems reasonable to expect it to rain. It's cloudy, it's dark, and it's humid. You might even bring an umbrella! But, after four days, you've noticed that it hasn't rained on any of them, despite continuing to be gloomy. You also notice that almost nobody else is carrying an umbrella; many of those who are carrying one are only doing so because you told them you were! In this situation, it seems unlikely that you would need to see historical weather charts to conclude that the cloudy weather probably doesn't imply what you thought it did.
This is analogous to our situation. We thought our interview scores would be helpful. But it's been several years, and we haven't seen any evidence that they have been. It's costly to use this process, and we would like to see some benefit if we are going to use it. We have not seen that benefit in any of our four cohorts. So, it makes sense to leave the umbrella at home, for now.
Broadly, I agree with your points. You're right that we don't care about the relationship in the subpopulation, but rather about the relationship in the broader population. However, there are a couple of things I think are important to note here:
In general, we believe that in order to use a selection method based on subjective interview rankings -- which are very time-consuming and open us up to the possibility of implicit bias -- we need to have some degree of evidence that our selection method actually works. After two years, we have found none using the best available data.
That being said -- this fall, we ended up admitting everyone who we interviewed. Once we know more about how engaged these fellows end up being, we can follow up with an analysis that is truly of the entire population.
There are definitely a lot of selection effects prior to us making our selection. I think what we are trying to say is that our selections based on interview scores were not very helpful. Perhaps they would be helpful if our system worked very differently (for instance, if we just interviewed anyone who put down their email). But it seems like, with the selection effects we had (applicants have to make an effort to fill out the application, do a small amount of background reading, and schedule and show up to an interview), we arrived at a place where our interview scoring system didn't do a good job of further narrowing down the applicants.
We definitely do not mean to say that other groups definitively shouldn't be selective, or even shouldn't be selective using our criteria. We just don't have the evidence to suggest that our criteria were particularly helpful in our case, so we can't really recommend it for others.
Yes, this is definitely a concern, for some cohorts more than others. Here is the number of people we interviewed in each cohort:
So for Fall 2018 and Summer 2020, I think the case can be made that the range restriction effects might be high (given that we admitted only ~15 fellows). For the Spring fellowships, we admitted the majority of applicants, and thus there should be more variation in the predictor variable.
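For readers unfamiliar with range restriction: selecting on the predictor shrinks its variance among admits, which mechanically attenuates the correlation you can observe in the admitted group. A small simulation makes the effect concrete (illustrative only; the "true" correlation of 0.5 and the 30% admit rate are made-up numbers, not estimates from our data):

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation, no dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

random.seed(0)
rho = 0.5       # hypothetical "true" score-engagement correlation
n = 10_000

# Bivariate normal draws: interview score x, outcome y correlated at rho.
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1) for x in xs]

full_r = pearson(xs, ys)

# Admit only the top 30% by score, then recompute the correlation among
# admits: the restricted range shrinks the observable correlation.
cut = sorted(xs)[int(0.7 * n)]
admitted = [(x, y) for x, y in zip(xs, ys) if x >= cut]
restricted_r = pearson([x for x, _ in admitted], [y for _, y in admitted])

print(f"full sample r = {full_r:.2f}, admitted-only r = {restricted_r:.2f}")
```

This is also why the all-admitted cohort mentioned above will give cleaner evidence: with no selection on the predictor, there is no restriction to attenuate whatever relationship exists.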