The EA group at the London School of Economics (LSE) ran an introductory fellowship last term. We received more applicants than we expected, so we did not have enough facilitators to admit all applicants nor to conduct traditional, one-on-one interviews in time. So, we used a selective application process with collective, group discussion-style interviews to select fellows.
Following Jessica McCurdy and Thomas Woodside’s post on Yale’s results, we analysed whether the scores we gave applicants predicted whether they engaged with our EA group and how productive fellows were in discussions.
We found that our scores did predict overall performance, but not extremely well.
However, LSE’s results do not strictly differ from Yale’s. A simple bivariate regression of engagement on total application scores shows no correlation in LSE’s case either (although the same regression for discussion quality does show one). As I show below, this is because of the way we aggregated the different scores we gave, not because none of the scores were predictive.
In future cases, if LSE has to be selective, one new suggestion would be to significantly change the weights we give to different admissions criteria and ditch most others.
- Measuring and motivating variables
- Regression results
- LSE’s interview process
- Total application scores did predict overall performance, but not extremely well.
- Our way of aggregating scores was not optimal. E.g., using total application scores to predict overall performance gets an R² of about 0.35. But when letting OLS fit the data by breaking up application scores into their components, we get R² = 0.69.
- Most of the attributes we were selecting for in the written application had no predictive power for how well fellows performed overall. The only significant variables were interview scores and prior understanding of EA, and only the former led to better discussion scores.
- Being admitted is associated with a 35% greater chance of engaging with the EA society, but the effect is probably not causal – controlling for application scores, the effect is null.
2/6. Measuring and motivating variables
Engagement: Because we ran our first fellowship last term, we can currently only measure engagement with LSE’s EA Society during the fellowship; not after. In the regression of engagement on admission, we use a dummy variable that takes the value 1 as long as the person attended a social, attended a society event, attended the Student Summit, applied to the In Depth Fellowship, or joined the committee (and 0 otherwise). In the main regression, we used a continuous variable where larger values indicate more engagement. All other variables below are continuous.
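As a minimal sketch of how the two engagement measures could be coded: the `EngagementRecord` fields and the unweighted count in `engagement_count` are assumptions for illustration. The post does not say how the continuous measure was constructed, only that larger values indicate more engagement.

```python
from dataclasses import dataclass


@dataclass
class EngagementRecord:
    """Hypothetical record of one person's activity during the fellowship."""
    socials_attended: int = 0
    events_attended: int = 0
    attended_student_summit: bool = False
    applied_in_depth_fellowship: bool = False
    joined_committee: bool = False


def engagement_dummy(r: EngagementRecord) -> int:
    """Return 1 if the person did at least one listed activity, else 0."""
    return int(
        r.socials_attended > 0
        or r.events_attended > 0
        or r.attended_student_summit
        or r.applied_in_depth_fellowship
        or r.joined_committee
    )


def engagement_count(r: EngagementRecord) -> int:
    """Illustrative continuous measure: an unweighted count of activities.

    This weighting is an assumption, not the measure used in the analysis.
    """
    return (
        r.socials_attended
        + r.events_attended
        + int(r.attended_student_summit)
        + int(r.applied_in_depth_fellowship)
        + int(r.joined_committee)
    )
```

The dummy treats someone who attended one social the same as someone who joined the committee, which is why a continuous measure is more informative for the main regression.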
We chose to measure this because it seems like a reasonable proxy for impact (see the Yale post).
Discussion: We measured this as an average of the facilitators’ judgements for each fellow they were familiar with. It aggregates the quantity and quality of fellows' comments during discussions, weighting the latter significantly higher.
We chose this as an outcome worth aiming at for two reasons. First, being productive in discussions is a good proxy for in-depth understanding of the material, which is a goal of the fellowship. And second, better discussions improve the experience of other fellows, which should lead them to remember the fellowship more positively and recommend it to others in the future.
Interview component: This is measured as an average of the interviewers’ judgements for each applicant. The format of the interviews is explained in section 6/6.
Written component: This is a weighted average of readers’ judgements on a variety of criteria (below), on the basis of a set of answers to a questionnaire and a CV/resume. The questionnaire asked about applicants' motivation to apply to the fellowship as well as evidence of altruism and critical reflection.
Altruism: We looked for evidence of genuine motivation to improve the world (e.g., interning with an NGO). We thought this would be associated with more lasting interest in EA.
Prospects: We looked for evidence of impressive academic and/or career promise in the applicants’ CVs (e.g., very high grades or impressive extra-curriculars). We thought this would be associated with future potential impact.
Open-mindedness: We looked for evidence of applicants’ willingness to change their minds in face of new evidence (by recounting an instance where they did so and why). We thought this would be important to foster positive epistemic norms in discussions.
Commitment: We tried to gauge how motivated applicants were to commit to the fellowship, so that we selected fellows that would do the readings, attend regularly, etc.
EA understanding: We looked for evidence of prior familiarity with EA; not as a stand-alone benefit, but to check whether the applicant knows what they are applying for.
Composite scores: No model in the regression table includes all the variables. That is because it would lead to perfect collinearity. ‘Total application’ is a linear function of ‘Interview component’ and ‘Written component’; while ‘Written component’ is a linear function of ‘Altruism’, ‘Prospects’, ‘Open-mindedness’, ‘Commitment’, and ‘EA understanding’. They were each given different weights in composite scores.
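The collinearity point can be checked numerically. The sketch below uses synthetic scores and hypothetical equal weights (the actual weights are not stated here). It shows that adding the composite alongside its components does not add any independent information to the design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30

# Synthetic stand-ins for the two component scores.
interview = rng.normal(size=n)
written = rng.normal(size=n)

# Hypothetical weights; the composite is an exact linear function of the parts.
w_interview, w_written = 0.5, 0.5
total = w_interview * interview + w_written * written

# Design matrix with intercept, composite, and both components.
X_all = np.column_stack([np.ones(n), total, interview, written])
# Design matrix with intercept and components only.
X_ok = np.column_stack([np.ones(n), interview, written])

rank_all = np.linalg.matrix_rank(X_all)  # 4 columns, but one is redundant
rank_ok = np.linalg.matrix_rank(X_ok)    # full column rank
```

Because `X_all` has four columns but no more rank than `X_ok`, OLS cannot assign unique coefficients to all of them, which is why each model includes either a composite or its components, never both.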
3/6. Regression results
Most of the results are summarised in this table.
The OLS regressions use raw scores, not rankings, and all variables are standardised (normalised). So, each coefficient denotes how many standard deviations (SD) the outcome increases by when the regressor of interest increases by one SD, holding all other included variables constant. E.g., in model (5), increasing interview scores by one SD is associated with a 0.74 SD increase in discussion scores, keeping written scores fixed.
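To make the interpretation of standardised coefficients concrete, here is a minimal sketch on synthetic data. The helper names `standardise` and `standardised_ols` are illustrative, and this is not the code used for the regression table.

```python
import numpy as np


def standardise(x):
    """Rescale a variable to mean 0 and sample standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)


def standardised_ols(y, *regressors):
    """OLS of standardised y on standardised regressors.

    No intercept is needed because every standardised variable has mean 0.
    Returns one coefficient per regressor, in units of SD of y per SD of x.
    """
    z_y = standardise(y)
    Z = np.column_stack([standardise(x) for x in regressors])
    beta, *_ = np.linalg.lstsq(Z, z_y, rcond=None)
    return beta


# Synthetic example: one regressor.
rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = 0.7 * x + rng.normal(size=40)

beta_single = standardised_ols(y, x)
# Changing the units of x or y leaves the standardised coefficient unchanged.
beta_rescaled = standardised_ols(10 * y + 3, 5 * x)
```

With a single regressor, the standardised coefficient equals the Pearson correlation between the two variables, which is a useful sanity check when reading the bivariate models.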
Main potential issues:
- Measurement error. Some of the variables are not very well-defined. Although, as long as we preserve ordinality, we should not expect results to be too absurd. If there were classical measurement error in the dependent variables, we might be getting false negatives due to attenuation bias. If measurement error is non-classical (we can think of plausible stories for this such as unconscious bias), results would be biased in some direction. However, this analysis is just evaluating whether our admission process works. What therefore matters is whether our judgement of applicants’ qualities predicts outcomes; not whether the applicants’ true qualities do. So, measurement error would mostly be a flaw in our admissions system, rather than this analysis.
- ‘Collider bias’ might be an issue, as David Bernard suggested under the Yale post. Three things to note here. First, it does not apply to our variables for application score, admission, or engagement (since n = N for these). Second, as in Yale’s case, we did not reject the vast majority of applicants; only about 60%. Third, our analysis also used ‘discussion productivity’ as an outcome, which did predictably vary across admitted fellows.
- Sample size. We do not have much data at all. We still get statistically significant results (and significance tests account for n), but this remains an important caveat.
Do applicants’ scores predict their performance after admission?
Yes, but only through some of the criteria and for certain outcomes.
I look at each outcome in turn.
The total application score that we used to rank and thereby select our final fellows did predict our measure of overall performance. Model (7) shows that a one-SD increase in the application score was associated with about a 0.6 SD increase in overall performance, significant at the 10% level.
Breaking this up into interview and composite written scores (model 8), we see that only interview scores had a significant effect – written scores had none, in the way we chose to aggregate them. But when we break up the written score into its components in model (9), we see that prior understanding of EA did significantly predict overall performance, but less so than interview scores. Model (9) explains almost 70% of the variation in outcomes.
Interestingly, our measures of altruism, prospects, open-mindedness, and commitment appear completely useless in predicting our definition of overall performance.
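Part of the jump in R² when moving from the total score to its components is mechanical. A regression on a fixed-weight composite is a restricted version of the regression on the components, so relaxing the restriction can never lower R². The sketch below illustrates this with synthetic data and hypothetical equal weights; the variable names mirror the criteria above but the numbers are invented.

```python
import numpy as np


def r_squared(y, X):
    """R-squared of an OLS regression of y on X, with an intercept added."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    centred = y - y.mean()
    return 1 - (resid @ resid) / (centred @ centred)


rng = np.random.default_rng(1)
n = 25  # small sample, as in the fellowship data

# Synthetic criteria: only two of the three actually drive the outcome.
interview = rng.normal(size=n)
ea_understanding = rng.normal(size=n)
altruism = rng.normal(size=n)
performance = 0.8 * interview + 0.4 * ea_understanding + rng.normal(scale=0.5, size=n)

# Hypothetical equal-weight composite.
total = (interview + ea_understanding + altruism) / 3

r2_total = r_squared(performance, total)
r2_components = r_squared(
    performance, np.column_stack([interview, ea_understanding, altruism])
)
```

How much of the gain reflects better weights rather than extra flexibility is an empirical question, and with a small sample some of the improvement should be expected to come from overfitting.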
The coefficients stay rather similar in the three models for discussion. A one-SD increase in total application scores in model (4) predicted a 0.72 SD increase in discussion scores, significant at 5%. When separating this into interview and writing components in model (5), only the interview mattered. The interview coefficient of 0.74 in (6) was the same as in (5), and even breaking up the writing score into its components showed no significant effect of any of the writing criteria.
Only interview scores significantly predicted discussion performance. But they could only explain a bit over half of the variation.
These results were more surprising.
Our total application scores had no predictive power. But when splitting up the interview and writing components, both had large and significant associations, but in opposite directions. High interview scores predicted more engagement with our EA group, but high writing scores predicted less engagement. Both of these effects are of about one SD.
Model (3) shows what is driving this: our measures of open-mindedness and commitment. It is unclear why this is. One story for open-mindedness could be that open-minded applicants are less likely to go all-in on EA socials and events and prefer to read widely. And a story for commitment could be that those most committed to the fellowship spent more time reading the extra readings and thus had less time for non-fellowship engagement. But this is speculative. It is completely plausible that odd results like these are due to measurement error or small samples.
Altruism and prior understanding of EA, on the other hand, have strong, positive associations with engagement, likely because these people already felt motivated and had attended events or otherwise engaged with our group and just continued to do so.
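The pattern in the engagement regressions, where the total score shows no association while its components show large associations of opposite sign, can be reproduced in a stylised example. The data below are constructed by hand so that the two components are uncorrelated and have equal variance; they are not the fellowship data, and the equal weights are an assumption.

```python
import numpy as np

# Hand-built component scores: mean zero, equal variance, uncorrelated.
interview = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
written = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)

# Stylised outcome: rises with interview score, falls with written score.
engagement = interview - written

# Hypothetical equal-weight composite.
total = 0.5 * interview + 0.5 * written


def slopes(y, *columns):
    """OLS slopes (intercept dropped) of y on the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]


slope_total = slopes(engagement, total)[0]
slope_interview, slope_written = slopes(engagement, interview, written)
```

Here the slope on the total is zero even though each component predicts the outcome perfectly in combination, because the composite adds together two effects that cancel.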
Beyond the main regression, we ran one for engagement of all applicants (not just fellows) on admission to the fellowship. We found that being admitted is associated with a 35% greater chance of engaging with our EA group. This is significant at 5% but can only explain about one-tenth of the variation. This, however, becomes insignificant (p = 0.76) when controlling for interview and written application performance.
The effect of admission on engagement is thus unlikely to be causal. This seems to imply that the fellowship itself did not increase engagement, but it is unclear whether this is true more broadly. Our data is only from the duration of the fellowship; it does not consider what the fellows and unsuccessful applicants did or will do with our EA group after the fellowship. This is an important difference compared to Yale’s more comprehensive measure of engagement. And importantly, our measure does not consider whether applicants ended up changing career plans, doing more independent EA reading, or any other important proxies for impact that we could not directly observe. I would expect that this is a major benefit of the fellowship, but I have only seen anecdotal evidence of this.
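One way to see why the admission coefficient can vanish once scores are controlled for is a stylised example in which engagement depends only on the application score and admission is just a cut-off on that score. Everything below is hypothetical: the score range, the cut-off of 6, and the noise-free linear engagement propensity (the actual analysis used a 0/1 engagement dummy).

```python
import numpy as np


def ols(y, *columns):
    """OLS with an intercept; returns (intercept, coefficient per column)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Hypothetical application scores from 0 to 10 in steps of 0.5.
score = np.linspace(0, 10, 21)

# Stylised engagement propensity that depends on the score only.
engagement = 0.1 + 0.05 * score

# Admission is a deterministic cut-off on the score (hypothetical threshold).
admitted = (score >= 6).astype(float)

# Admission alone picks up the score difference between the two groups.
coef_alone = ols(engagement, admitted)[1]

# Once the score is in the model, admission has nothing left to explain.
coef_controlled = ols(engagement, admitted, score)[1]
```

In this toy setup admission has no effect by construction, yet the uncontrolled regression attributes a sizeable gap to it. Real data are noisier, so a null controlled coefficient is suggestive rather than conclusive.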
It seems that our admissions process can be greatly improved. Keeping the assessment format as it is, these are my two suggestions:
(1) Use the writing component only as a very lenient benchmark for considering applicants. There seems to be no benefit in scoring applicants on the many criteria we chose, so only very unusual things in the writing component should be weighted in the total score for applicants. This would likely simplify and shorten the application.
It is true that the ‘altruism’ and ‘EA understanding’ variables were useful predictors of engagement, but the second regression shows no evidence that the fellowship itself further increased people’s engagement with our group within that timeframe.
(2) Weight the interview far higher than the written component, maybe close to 100%.
6/6. LSE’s interview process
Since LSE EA got more applicants than expected, we tried a workshop structure for the interviews rather than 1-to-1s. We think it worked well for us, both prediction-wise and logistically. Here is how it worked.
About 4-6 applicants in a Zoom breakout room are asked to discuss a question, problem, or reading. There is one facilitator listening and taking notes on how the applicants discuss and contribute. The facilitator rarely speaks.
60–75-minute workshops spent discussing 3 topics.
- Introduction: After pleasantries, applicants were told that we wanted to see how they engage with EA material in a discussion format together with other students. We aimed to reward careful analysis, coherent methodology, and good epistemics.
- 10 mins: Applicants were asked to read an article, e.g., “Practical ethics given moral uncertainty”.
- 15 mins: Applicants very briefly introduced themselves and then discussed the article together.
- 10 mins: Break.
- 15 mins: Applicants were asked to work together to come to a collective answer to a Fermi estimate-question, e.g., "How many piano tuners are there in London?" The breakout room groups were brought back into the main room, one representative from each room would announce their answer, and then a facilitator would announce the true number (this part was just a gimmick).
- 10 mins: Applicants were asked to discuss how they would make a difficult or intangible trade-off, e.g., "would it be best to remove 100 metric tonnes of carbon from the atmosphere or to save a child’s life from malaria?"
1. Figure out how many facilitators there are whose judgement you would trust for selecting fellows. Determine how many applicants can be interviewed at the same time (but in different rooms) if each facilitator listens to 4-6 applicants at a time. If not everyone can be interviewed at the same time, run the workshop over more than one day. Allow applicants to choose which day they are available to join in (but cap the number for capacity).
2. Run a Zoom session with all the applicants and facilitators for that day. Use the main room to announce things like what the next question, reading, or problem will be, when there will be breaks, etc.
3. Split the applicants randomly into evenly sized breakout rooms for the discussions, each with one facilitator.
4. During the discussion, have all the facilitators take notes about applicants in their group.
5. Repeat steps 2-4 for each discussion.
6. Facilitators compare notes on individuals after the workshop and give them scores.
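The capacity and room-splitting steps above can be sketched as follows. The function names, the default room size of 5, and the example numbers are illustrative assumptions, not a description of the tooling we used.

```python
import math
import random


def sessions_needed(n_applicants: int, n_facilitators: int, room_size: int = 5) -> int:
    """Number of workshop sessions (e.g. days) needed to see every applicant.

    Each facilitator listens to one room of `room_size` applicants per session.
    """
    per_session = n_facilitators * room_size
    return math.ceil(n_applicants / per_session)


def split_into_rooms(applicants, n_rooms: int, seed=None):
    """Randomly split applicants into n_rooms groups of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(applicants)
    rng.shuffle(shuffled)
    # Dealing round-robin keeps room sizes within one of each other.
    return [shuffled[i::n_rooms] for i in range(n_rooms)]


# Example: 50 applicants, 4 trusted facilitators, rooms of 5.
n_sessions = sessions_needed(50, 4, room_size=5)

# Example: 22 applicants on one day, one room per facilitator.
rooms = split_into_rooms(range(22), n_rooms=4, seed=0)
```

Re-running `split_into_rooms` with a different seed for each discussion reshuffles the groups, which is how each facilitator ends up seeing most applicants over the course of a workshop.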
- Some applicants might find this kind of assessment more competitive than genuinely discussion-based and thus behave differently (although we did not observe this).
- We could not ask applicants questions specific to them (although the results above do not suggest that this would be important).
- We saved a lot of time (important for smaller groups).
- We could judge more accurately how applicants engaged with the type of material they would see during the fellowship.
- Applicants seemed more comfortable since they were not the only people being judged.
- All the facilitators get to see almost all the applicants, which makes it easier to compare them.
- The interviews turned out to have predictive power for fellows’ overall performance. I presume this is because the interviews were quite similar to the discussion format.
I think we tend to confuse 'lack of strong statistical significance' with 'no predictive power'.
A small amount of evidence can substantially improve our decision-making...
... even if we cannot conclude that 'data with a correlation this large or larger would be very unlikely to be generated (p<0.05) if there were no correlation in the true population'.
We, very reasonably, substantially update our beliefs and guide our decisions based on small amounts of data. See, e.g., the 'Bayes rule' chapter of Algorithms to Live By.
I believe that for optimization problems and decision-making problems we should use a different approach both to design and to assessing results... relative to when we are trying to measure and test for scientific purposes.
This relates to 'reinforcement learning' and to 'exploration sampling'.
We need to make a decision in one direction or another, and we need to consider costs and benefits of collecting and using these measures. I believe we should be taking a Bayesian approach, updating our belief distribution,
... and considering the value of the information generated (in industry, the 'lift', 'profit curve' etc) in terms of how it improves our decision-making.
Note: I am exploring these ideas and hoping to learn, share and communicate more. Maybe others in this forum have more expertise in 'reinforcement learning' etc.
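As a rough illustration of the kind of updating described above, here is a minimal normal-normal conjugate update for a single standardised coefficient. All numbers are hypothetical: a sceptical prior centred on zero, and an estimate of 0.6 with an assumed standard error of 0.3 (the post does not report standard errors). It is a sketch of the idea, not a recommended model.

```python
from statistics import NormalDist


def normal_update(prior_mean, prior_sd, estimate, std_error):
    """Posterior mean and SD for a normal prior and a normal likelihood."""
    prior_precision = 1 / prior_sd**2
    data_precision = 1 / std_error**2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    return post_mean, post_var**0.5


# Hypothetical inputs: sceptical prior, noisy positive estimate.
post_mean, post_sd = normal_update(
    prior_mean=0.0, prior_sd=0.5, estimate=0.6, std_error=0.3
)

# Posterior probability that the true coefficient is positive.
prob_positive = 1 - NormalDist(post_mean, post_sd).cdf(0.0)
```

Under these assumed inputs the posterior is pulled towards zero relative to the raw estimate but still puts most of its mass on a positive effect, which is the sense in which a result short of conventional significance can still inform a decision.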
Thanks for writing this!
This is very reasonable; 'no predictive power' is a simplification.
Purely academically, I am sure a well-reasoned Bayesian approach would get us closer to the truth. But I think the conclusions drawn still make sense for three reasons.
Thanks for sharing the results and thanks, in particular, for including the results for the particular measures, rather than just the composite score.
Taking the results at face value, it seems like this could be explained by your measures systematically measuring something other than what you take them to be measuring (e.g. the problem is construct validity). For example, perhaps your measures of "open-mindedness" or "commitment" actually just tracked people's inclination to acquiesce to social pressure, or something associated with it. Of course, I don't know how you actually measured open-mindedness or commitment, so my speculation isn't based on having any particular reason to think your measures were bad.
Of course, not taking the results at face value, it could just be idiosyncrasies of what you note was a small sample. It could be interesting to see plots of the relationship between some of the variables, to help get a sense of whether some of the effects could be driven by outliers etc.
Thanks for the comment!
I think it's completely plausible that these two measures were systematically measuring something other than what we took them to be measuring. The confusing thing is what they were in fact measuring and why these traits had negative effects.
(The way we judged open-mindedness, for example, was by asking applicants to write down an instance where they changed their minds in response to evidence.)
But I do think the most likely case is the small sample.