Takeaways from EAF's Hiring Round

bystefan.torges 9mo19th Nov 201819 comments

90


We (the Effective Altruism Foundation) recently ran a hiring round for two positions, an Operations Analyst and a Research Analyst, for which purpose we overhauled our application process. Since we’re happy with the outcome ourselves and received positive feedback from applicants about how we handled the hiring round, we want to share our experience. This post might be useful for other organizations and (future) applicants in the community.

Summary

  • The most important channels for finding promising candidates were our own networks and the EA community.
  • Encouraging specific individuals to apply was crucial.
  • Having an application period of only two weeks is possible, but not ideal..
  • We’ve become more convinced that a general mental ability test, work tests relevant for the role, a trial (week), and reference checks should be important parts of our application process.
  • The application form was more useful than a cover letter as a first filter. In the future, we will likely use a very brief form consisting of three to five questions.
  • It’s less clear to us that interviews added a lot of value, but that might have been due to the way we conducted them.
  • It’s hard to say whether blinding submissions and using quantitative measures in fact led to a fairer process. They definitely increased the focus and efficiency of our discussions. Given that it didn’t cost us much extra time, we will likely continue this practice in the future.

Recommendations for applicants

  • If you’re uncertain about applying, approach the organization and find out more.
  • When in doubt, apply! We had to convince two of the four candidates who made it to the trial to apply in the first place.
  • Not all effective altruism organizations are alike. Familiarize yourself as much as possible with the mission, priorities, and general practices of the organization you’re applying to.
  • Research positions: Practice your skills by researching relevant topics independently. Publish your write-ups in a relevant channel (e.g. on the EA forum or your personal blog). Collect as much radically honest feedback as possible and try to incorporate it.
  • Operations positions: Read the 80,000 Hours post on operations roles. Take the lead on one or more projects (e.g. website, event). Collect feedback and try to incorporate it.

Deciding whether to hire

When it comes to hiring for an early-stage nonprofit like ours, common startup advice likely applies.[1] This is also backed up to some extent by our own experience:

  • Hire slowly, and only when you absolutely need to.
  • If you don’t find a great candidate, don’t hire.
  • Compromising on these points will mean achieving less due to the added costs in management capacity.

In this particular case, we realized that our philanthropy strategy lacked a grantmaking research team, which implied a fairly clear need for an additional full-time equivalent. We also systematically analyzed the amount of important operations tasks that we weren’t getting done, and concluded that we have about 1.5 full time equivalents worth of extra work in operations.

Goals of the application process

Find the most suited candidate(s) for each role.

  • Make sure that people who are plausibly among the best candidates actually apply.
  • Make sure that your filters test for the right characteristics.

Make sure that applicants have a good experience and can take away useful lessons.

  • Minimize unnecessary time investment while learning as much as possible about each other.
  • Communicate details about the application process as clearly and early as possible.
  • Compensate candidates who invest a lot of time, irrespective of whether they get the job or not.
  • Give feedback to rejected candidates if they ask.
  • Write this post.

The application process in brief

Our application process consisted of four stages. These were, in chronological order:

  1. initial application (application form, general mental ability (GMA) test, CV);
  2. work test(s);
  3. two 45 minute interviews;
  4. trial week and reference checks.

Outcome

Within a two-week application period, we received a total of 66 applications. Six weeks later, we offered positions to two candidates and offered an extended trial to another candidate. The application process left us with a very good understanding of the candidates’ strengths and weaknesses and a conviction that they’re both excellent fits for their respective roles. Throughout the process, we stuck to a very aggressive timetable and were able to move quickly, while still gathering a lot of useful information.

The entire process took about two weeks of preparation and about two months to run (including the two week application period). We invested about 400 hours, mainly across three staff members, which amounts to about 30% of their time during that period.[2] This was a good use of our time, both in expectation and in retrospect. If your team has less time available to run an application process, we would probably recommend doing fewer interviews.

Table: Overview of the different application stages

We did not expect to filter so strongly during the first stage, but the median candidate turned out to be a worse fit than we expected. The best applications were better than we expected, however. The fact that the gender ratio stayed fairly constant throughout the stages is some evidence that we were successful in curbing potential gender biases (see “Equal opportunity” section). However, it’s hard to say since we don’t have a control group. Since the time cost was limited, we will likely keep these measures in place. In the future, we’d like to reduce the time candidates have to spend on the initial application in order to encourage more candidates to apply and avoid unnecessarily taking up applicants’ time (also see section “First stage: Initial application”).

How we came up with the application process

Scientific literature

We mainly relied on “The Validity and Utility of Selection Methods in Personnel Psychology: Practical and Theoretical Implications of 100 Years” by Schmidt (specifically, the 2016 update to his 1998 meta-analysis on this issue). We also considered Handbook of Principles of Organizational Behavior: Indispensable Knowledge for Evidence-Based Management by Edwin Locke. Both seemed to agree that general mental ability (GMA) tests were the best predictors of job performance. Indicators of conscientiousness (either integrity tests or conscientiousness measures) also fared well, as did interviews (either structured or unstructured). Interestingly and surprisingly, work tests were the best predictor in the 1998 meta-analysis, but did far worse in the 2016 one. Overall, this led us to include a GMA test in the process, increased our confidence in conducting interviews, decreased our confidence in work tests, and led us to try to measure conscientiousness (in a way that was not easily gameable).

Other EA organizations

We also considered how other EA organizations handled this challenge, particularly Open Phil and FHI, as they seem most similar to us in some relevant ways. From what we knew, they relied strongly on work tests and a trial (in the case of Open Phil), which updated us toward including a work test and running some form of trial.

Our own reasoning

Thinking about our particular situation and past experiences, as well as common sense heuristics, made a few things stand out:

  • Having a thorough understanding of our mission is crucial for prioritizing correctly, making trade-offs, and coming up with useful ideas. We thought it was important that candidates display some of this already.
  • Although work tests are not among the best instruments in the meta analysis, whether candidates excel at tasks they would take on if we hired them is likely predictive of their future performance in that role. The difficulty seems to be in making work tests representative of future tasks. Still, this gave us further reason to include a work test fairly early and also to do an in-person trial week.
  • For small teams, culture fit also struck us as particularly important. You have to get along in order to perform well and be as efficient as possible. Small teams in particular have limited slack for tolerating friction between team members, as we’ve also experienced in the past. This consideration spoke in favor of interviews and an in-person trial.

Equal opportunity

We wanted to make the process as fair as possible to all applicants. Research suggests that certain groups in particular, e.g. women and ethnic minorities, are less likely to apply and are more likely to be filtered out for irrelevant reasons. We tried to correct for potential biases by

  • encouraging specific individuals to apply,
  • making sure the job ad was inclusive,
  • defining criteria and good answers in advance,
  • blinding submissions wherever possible [3], and
  • introducing quantitative measures wherever possible.

We thought quantitative measures were particularly relevant for squishy components like “culture fit” which might introduce biases late in the process when blinding is no longer possible.

There is another reason for such measures in a small community like effective altruism. It was very likely that we would evaluate people who we already knew and had formed opinions on—both positive and negative—that would affect our judgment in ways that would be very hard to account for.

The application process in detail

Writing the job ad

When drafting the two job ads (Operations Analyst, Research Analyst) we tried to incorporate standard advice for ensuring a diverse applicant pool. We ran the writing through a “gender decoder” (textio.com, gender-decoder), cut down the job requirements to what we considered the minimum, and added an equal opportunity statement. For better planning, increased accountability, and as a commitment device, we made the entire process transparent and gave as detailed a timeline as possible.

What we learned [4]

  • We would have liked to extend the application period from 2 to 8 weeks. This was not possible due to other deadlines at the organizational level.
  • We should have reduced barriers to approach us for people who were uncertain whether to apply. In the future we might add an explicit encouragement to the job ad or host an AMA session in some form during the application period.
  • After some initial feedback we added an expected salary range to the job ads and will do so from the beginning in the future.
  • We should have communicated the required time investment for the work tests more clearly in the job ad, which, as it turns out, contained some ambiguous formulations which implied that the work test would take four days to complete instead of two hours.
  • We have become less convinced of generic equal opportunity statements.

Spreading the word

We shared the job ads mostly within the EA community and a few other facebook groups, e.g. scholarship recipients in Germany. To make sure the best candidates applied, we reached out to individuals we thought might be a particularly good fit and encouraged them to apply. For that, we relied on our own personal networks, as well as the networks of CEA and 80,000 Hours, both of which were a great help. We specifically tried to identify people from groups currently underrepresented at EAF.

What we learned

Table: Share of channels through which applicants learned about the positions, across stages of the application process

The most important channels for finding promising candidates was reaching out to people in our own network, CEA’s and 80,000 Hours’ networks, and the online EA community.

First stage: Initial application

Initially, we asked candidates for either role to (1) submit their CV, (2) fill out an application form, and (3) complete a general mental ability (GMA) test [5]. The purpose of this stage was to test their basic mission understanding and basic general competence.

The application form consisted of 13 long-form questions (with a limit of 1,000 characters per question). Three of these asked for understanding of and identification with our mission, while ten were so-called “behavioral questions”, i.e. questions about past behavior and performance (see Appendix B for all questions). We also collected all necessary logistical info. To some extent, this form was a substitute for a standardized interview to reduce our time investment.

The GMA test was Raven’s Advanced Progressive Matrices (administered by Pearson Clinical). We picked it because it’s fairly well studied and understood, has a high g-loading (.80), only takes about 40 minutes to complete, and contains no components that could introduce linguistic or cultural biases. It does have the drawback that it only measures one form of intelligence; however, we deemed this acceptable for our purposes, given the high g-loading and short test time.

Evaluation. Two people from our team scored each form answer from 0 to 5 and each CV from -1 to +1. To eliminate potential biases, we blinded the submissions, scored question by question instead of candidate by candidate, and randomized the candidate ordering after each question. We then aggregated the scores from all three components (CV, form, test) after converting them to a standard score. Based on the resulting distribution and looking at edge cases in more detail, we decided which of the applicants to advance to the next stage. We only unblinded the data after having made an initial decision. After unblinding, we only allowed ourselves to make adjustments upwards, i.e. only allow more (rather than fewer) candidates to advance to the next stage. In the end, we didn’t make any changes after unblinding.

What we learned

  • The initial application form was too long. We had originally planned a shorter version for the initial stage and a longer one for a later stage. However, we decided to combine them because we faced some very tight deadlines at the organizational level. Given the circumstances, we probably made the right call, but would change it if we had more time.
  • The application form and the GMA test seemed to measure independent aspects of a candidate (see Appendix A). We will likely continue to use some version of both in the future.
  • The GMA test seemed to be a better predictor of good performance in later stages than the application form. The form seemed particularly noisy for research analysts. However, the sample size is fairly low and there could be threshold effects (see Appendix A).
  • The most useful questions from the form were the following (see Appendix A):
    • Question 1: What is your current plan for improving the world?
    • Question 12: What do you think is the likelihood that at least one minister/secretary or head of state will mention the term “effective altruism” in at least one public statement over the next five years? Please give an estimate without consulting outside sources. How did you arrive at the estimate?
    • Question 13: Please describe a time when you changed something about your behavior as a result of feedback that you received from others.
  • Minor issues:
    • We should blind all material ourselves instead of asking applicants to do so.
    • We should tell applicants that the GMA test becomes progressively more difficult.

Second stage: Work test(s)

The work test(s) were designed to test the specific skills required for each role. Candidates who completed this stage received a monetary compensation.

Operations Analyst. Candidates had to complete an assignment with seven subtasks within 120 minutes. In order to validate the test, uncover potential problems, and set a benchmark, one team member in a similar role completed the test in advance. We would have liked to test it more, but this did not turn out not to be a problem as far as we can tell.

Research Analyst. Candidates had to complete two assignments; one was timed to 120 minutes, the other one was untimed, but we suggested how much to write and how much time to spend on each task. We decided to have a timed and an untimed test because we saw benefits and drawbacks for both and wanted to gather more information. For the untimed assignment, we tried to make sure that additional time invested would have strongly diminishing returns after two hours by reducing the total amount of work required. We asked two external people to test both assignments in advance.

Evaluation. The candidates’ potential supervisor evaluated each submission (blinded to the candidate’s identity) by scoring subtasks horizontally. Somebody else evaluated each submission as a whole so that we could spot potential flaws in the evaluation. We calculated a weighted sum of the scores from stages 1 and 2, which informed our decision who to advance.

What we learned

  • The work test(s) gave us a lot of useful information above and beyond what we had already gathered during the first stage. A cost-effective way to gain (part of) this information sooner could be to ask future applicants for a work/research sample (from a previous project) as part of the initial application.
  • We’re leaning toward only including timed tests in the future, as we did not see much evidence of terrible submissions as a result of time pressure—which was our main concern with timed tests, especially for the research analyst position.
  • We realized that the task load only allowed for satisficing (as opposed to excellence). However, it’s also valuable in retrospect to test candidates’ ability to optimize, so we’re contemplating reducing the task load on one of the tests.
  • In some cases, we should have put more effort into clarifying expectations, i.e. requirements for a very good submission. This was difficult to anticipate and more testing might have improved the work tests in this regard.

Third stage: Interviews

We wanted to use the interviews to resolve any uncertainties about the candidates we had at that point, to refine our assessment to what extent they identified with our mission, to learn more about their fit for the specific role, and to get a first sense of their fit with our team. We decided to have two 45 minute structured interviews per candidate back to back: one focused on EAF as an organization, one focused on the specific role.

Each interviewer scored the candidate across multiple criteria, and we calculated an aggregate score. Afterwards we discussed all remaining candidates extensively before making a decision. Given the tight schedule it was not possible to make sure that everybody on the hiring committee had the exact same information, as would be best practice. However, to resolve some uncertainties we had people listen to some of the interview recordings (which we had made with consent of the candidates and deleted right after).

What we learned

  • In retrospect we think the interviews did not give us a lot of additional information. We had already covered basic behavioral questions and mission understanding in the application form and tested for role fit with the work test. This might be less of a concern if we only use a trimmed down version of the application form.
  • However, we’re also contemplating changing the format of the interview entirely to test components which other formats cannot capture well, such as independent thinking, productive communication style, and real-time reasoning.

Fourth stage: Trial week

We invited all candidates who made it past the interviews to a trial week in our Berlin offices (one person completed the trial from a remote location [6]). We wanted to use that time to test what it was like to actually work with them, get more data on their work output, test their fit with the entire team, and also allow them to get to know us better. They worked with their potential supervisor on two research projects (Research Analyst) or five operations projects (Operations Analyst). We also scheduled individual meetings with each staff member throughout the week. During the same time, we contacted references to learn more about the candidates.

We made a final decision after extensive deliberations in the following week and discussing each candidate in a lot of detail. We also asked all staff members about their impressions and thoughts, and calculated a final score which corroborated a qualitative assessment.

What we learned

  • Overall, the trial week was very useful and we accomplished what we had planned.
  • The reference checks were very helpful and we’re considering conducting them earlier in the process. (see Appendix C for our reference call checklist)
  • Scheduling 1-on-1 meetings with all team members was a good idea for an organization of our size. In the future, we’ll try to batch them on day 2 and 3 to get everybody acquainted more quickly.
  • Some tasks were underspecified, and we should have invested more time to clarify context and expectations. We could have spotted some of these problems earlier if we had organized even closer supervision in some cases. This proved particularly difficult for the remote trial.
  • We’re also considering a longer trial to better represent actual work conditions. This is a difficult trade-off both for us as an organization as well as for the applicant. One solution could be to ask each candidate whether they want to complete a one week or a four week trial.

Appendix A. Shallow data analysis

We ran correlations of the scores from the first three stages which you can access here. You can find the questions that we used in the form on the far right of the spreadsheet or in Appendix B. After the first stage, we had to split scores between applicants for the Operations Analyst position and the Research Analyst position. Due to the small sample size, we’re not confident in any conclusions and don’t discuss any correlations beyond the second stage. We should also note that we cannot tell how well any of the measures correlate with actual work performance.

First stage

  • A high score on questions 1, 2, 12, and 3 each predict the total form score fairly well (correlations of .77, .69, .69, and .62). (Note that questions 1, 2, and 3 were weighted slightly more than the other questions in the total form score.) No question was negatively correlated with the form score, which is a good sign. (Note that they also contribute to the total form score which likely overestimates the correlation.)
  • Question 12 was not only a good a good predictor of the form score, but also had the highest correlation out of all questions with both GMA score (.43) and CV score (.39).
  • The correlation between the form score and the GMA score was fairly modest with .28. which indicates that these indeed measure different aspects. The same holds for the correlations between form score and CV score (.33) as well as GMA score and CV score (.29).
  • Question 12 was also highly correlated with the total score of the first stage (.73), even though it was just one out of 13 questions making up the form score (and was even among the lower weight questions). Other highly correlated questions were again 1 (.63), 2 (.49), and 13 (.49).
  • Question 8 was negatively correlated with both GMA score (-.34) and total score (-.14).

Second stage

Research Analyst:

  • Doing well during the first stage did not correlate with a high score in the work test (-.06). This is good on one hand because we did intend to measure different things. On the other hand, we wanted to select for excellent performance on the job in the first stage and this didn’t show in the work test at all. We’re not sure if this is picking up on something real since the sample size is low and there could be threshold effects, but we would have hoped for a slightly higher correlation all things considered.
  • We observed somewhat strong positive correlations with the work test for questions 5 (.36) and 11 (.39) as well as the GMA score (.39). Negative correlations held for questions 2 (-.53), 6 (-.74), 8 (-.42), and the total form score (-.51).

Operations Analyst:

  • For this role, the total score from the first stage was well correlated with the score on the work test (.57).
  • Not only was the GMA correlated (.58) as was the case for the research analysts, but also the form score (.59) and almost all subquestions. Notable negative correlations only held for question 4 (-.57) and question 7 (-.33).

Appendix B. Application form

  • Question 1: What is your current plan for improving the world?
  • Question 2: If you were given one million dollars, what would you do with them? Why?
  • Question 3: What excites you about our mission?
  • Question 4: Briefly describe an important accomplishment you achieved (mainly) on your own, and one you achieved together with a team. For the latter, what was your role?
  • Question 5: Briefly describe a time when you had to finish a task you did not enjoy. How did you manage the situation? What was the result?
  • Question 6: Briefly give two examples where you deliberately improved yourself.
  • Question 7: Can you describe an instance when you deliberately took the initiative to improve the team/organization you were working with?
  • Question 8: There are times when we work without close supervision or support to get the job done. Please tell us about a time when you found yourself in such a situation and how things turned out.
  • Question 9: Please tell us about a time when you successfully devised a clever solution to a difficult problem or set of problems, or otherwise “hacked” your way to success.
  • Question 10: While we try to make sure that workload and stress are kept at sustainable levels, there are times when having to perform under pressure at work is unavoidable. How do you cope with stressful situations?
  • Question 11: What was a high-stakes choice you faced recently? How did you arrive at a decision?
  • Question 12: What do you think is the likelihood that at least one minister/secretary or head of state will mention the term “effective altruism” in at least one public statement over the next five years? Please give an estimate without consulting outside sources. How did you arrive at the estimate?
  • Question 13: Please describe a time when you changed something about your behavior as a result of feedback that you received from others.

Appendix C. Template for reference checks

  • Introduce yourself.
  • Give some background on your call.
    • [candidate] applied for [position] with us. Currently in [stage of the application process].
    • [description of the types of work they’d be doing in the role]
    • We will treat all information confidentially.
  • In what context did you work with the person?
    • What was the concrete project/task?
  • What were their biggest strengths?
  • What were their main areas for improvement back then? Was there anything they struggled with?
  • How would you rate their overall performance in that job on a 1-10 scale? Why?
  • Would you say they are among the top 5% of people who you ever worked with at your organization?
  • If you were me, would you have any concerns about hiring them?
  • Is there anything else you would like to add?
  • If I have further questions that come up during the trial week, may I get in touch again?

Endnotes

[1] Kerry Vaughan pointed out to us afterwards that there are also reasons to think the common startup advice might not fully apply to EA organizations: 1) The typical startup might be more funding-constrained than EAF, and 2) EAs have more of a shared culture and shared knowledge than typical startup employees and, therefore, might require less senior management time for onboarding new staff. This might mean that if we can hire people who already have a very good sense of what needs to be done at the organization, it could be good to hire even when there isn’t a strong need.

[2] We would expect this to be a lot less for future hiring rounds (about 250 hours) since this also includes initial research, the design of the process, and putting in place all the necessary tools.

[3] Julia Wise pointed out to us afterwards that there is evidence that blinding may lead to fewer women being hired. Given that our main goal was making the process as fair as possible to all candidates, this did not update us against using blinding in the future. EDIT: After talking and thinking more about this issue, we now think this does provide us with some reason not to blind in the future.

[4] Points in these sections are based on a qualitative internal evaluation, the data analysis in Appendix A, direct feedback from applicants, and an anonymous feedback form we included in the job ad and sent to all candidates after they had been rejected.

[5] We checked with a lawyer if this was within our rights as an employer, and we concluded that it was within German law. We don’t have an opinion whether this would have been permissible in another jurisdiction. For what it’s worth, the relevant standard in the US for deciding whether administering a particular pre-employment test is legal seems to be “job-relatedness”.

[6] We already had a good sense of their team fit and organized Skype calls with some of the team members who were less familiar with them. Otherwise we would likely not have allowed for a remote trial.

90