
This post collects some hacks — cheap things that work well — and common pitfalls I’ve seen in my experience hiring people for CEA. 

Hacks

  • Sharing customised-generic feedback
    • Rejected candidates often, very reasonably, desire feedback. Sometimes you don’t have capacity to tailor feedback to each candidate, particularly at earlier stages of the process. If you have some brief criteria describing what stronger applicants or stronger trial-task submissions should look like, and if that’s borne out in your decisions about who to progress, I suggest writing out a quick description of the abilities, traits, or competencies the successful candidates tended to demonstrate. This might be as quick as “candidates who progressed to the next stage tended to demonstrate strong attention to detail in the trial task, a clear and direct writing style, and professional experience in operations.” This shouldn’t take more than a few minutes to generate, and my impression is that it’s a significant improvement for candidates over a fully generic response.
  • Consider borrowing assessment materials
    • Not sure how to test a trait for a given role? Other aligned organisations might have already created evaluation materials tailored to the competencies you want to evaluate. If so, that organisation might let you use their trial task for your recruitment round.
      • Ideally, you can do this in a way that’s pretty win-win for both orgs (e.g. Org A borrows a trial task from Org B. Org A then asks their candidates to agree that, should they ever apply to Org B, Org A will send over the results of the assessment). 
      • I have done this in the past and it worked out well.
  • Beta test your trial tasks! 
    • I’m a huge proponent of beta testing new evaluation materials. Testing your materials before sending them to candidates can save you a world of frustration down the road by helping you catch unclear instructions, inappropriate time limits, and a whole host of other pitfalls.

Mistakes

Taken from our internal hiring resources, here are some mistakes we’ve made in the past with our evaluation materials:

  • Trial tasks or tests that are laborious to grade
    • Some types of work tests take a long time to grade effectively. Possible issues include a large amount to read, multiple sources or links to check for information, or a complicated or difficult-to-apply rubric. Every extra minute of grading is multiplied by the number of submissions. The ideal work sample test is quick and clear to grade.
    • Possible solutions:
      • Think backwards from grading when you create the task.
      • Where appropriate, be willing to sacrifice some assessment accuracy for grading speed.
      • Beta test!
  • Tasks that require multiple interactions from the grader
    • Some versions of trial tasks we used in the past had a candidate submit something to which the grader had to respond before the candidate could complete the next step. This turned out to be inefficient and frustrating. 
    • Solution: avoid this, particularly at early stages.
  • Too broad
    • Some work tests look for generalist ability but waste the opportunity to test a job-specific skill. The more you can make the task specific to the role, the more information you get. If fast, clear email drafting is critical, test that instead of generically testing communication skill. 
  • Too hard / too easy
    • If you don’t feel like anyone is giving you a reasonable performance on your task, you may have made it too hard. 
      • A common driver of this failure mode is assuming context the candidate won’t have, or underrating the advantage conferred by context your staff possess but (most of) your candidates don’t.
    • Ceiling effects are perhaps a larger problem. If everyone is doing well, you won’t be able to sort applicants by performance.
    • Solution: beta test, ideally with people outside your org
  • Not timed (with an enforcing tool)
    • Letting people self-time may be tempting, but this makes results harder to interpret. If someone has done well, you don’t want to have to spend time wondering whether it’s because they spent more time than other applicants, or more than they said they spent. Some people will also forget (or “forget”) about their time limit and go over, or fail to record the time they spent. If people spend hugely different amounts of time, you can find yourself comparing apples and oranges.
    • Possible solutions:
      • Especially for early-stage tasks, I recommend using software like ClassMarker.
      • Alternatively, you can use e.g. Google Docs and spot-check results for time infractions.
  • Overburdened task
    • Trying to make a task measure too many things at once can create noisy data that is hard to interpret and grade. This trades off against making sure you’re getting evidence for all key role criteria, but in my opinion it’s often a worse mistake to create a task that tries to do too many things at once and therefore doesn’t accomplish any of them well. If you have an assessment that tests a single thing clearly and efficiently, you can give that assessment first and then test other key criteria later in the round.
  • Results are opaque to other staff
    • In the past, some work sample tests spat out results only the hiring manager knew how to interpret. If only one person can understand the results, other stakeholders have to defer rather than being able to independently assess the candidate’s performance. This can be particularly frustrating if people disagree about how strong a particular candidate is. Also, if the sole capable interpreter becomes capacity constrained, your round is now bottlenecked. 
      • Sometimes skills may be so specialized that somewhat opaque results are a correct tradeoff.
  • Confusing task
    • Maybe everyone, or a subgroup of applicants, misunderstands the task in the same way. Or they answer in a different way from what you were looking for, or in different ways from other groups. Either way, it becomes hard to compare across answers.
    • It may be tempting to make “figure out what I want from you” a key part of what you’re testing for, but I recommend against this unless that’s a vital skill for success in the role: weakness on that “figuring out” trait then causes near-complete failure, whatever other skills the candidate may have to offer.
    • Solution: again, beta test.
  • Evaluates candidates for the wrong role
    • Having a poorly scoped role vision can lead to this failure mode. If you’ve designed a role with too many constraints (hiring for an imaginary person!) or focused on a few aspects of the role to the exclusion of other important ones, the work sample test may similarly target the wrong traits. I propose that the antidote to this failure mode is to spend significant time drawing up a role vision and pressure testing it. If people different from your imagined ideal could perform excellently in this role, the work sample test (along with the rest of the process) should make it possible for less prototypical candidates to shine, too. More on that here.
  • Lack of clarity on what it measures / mistargeted tasks
    • Some tasks might test well for some key role qualities but miss other important aspects of doing the job well (that could have been added to the task). If a role needs traits A, B, and C, but the work sample test only evaluates trait A, then people who would not perform well on the job will pass the work sample test. I think this is fine and often unavoidable for early-stage assessment, but you should be aware of the ways in which your assessment is incomplete. More on that in the second section here.
      • After drafting your evaluation materials, you may also want to revisit your list of key competencies to see if there’s anything missing that you can easily add in. 
    • One specific failure mode here is making success on the task totally reliant on a single trait (assuming that trait is not uncontroversially role-critical). For example, even if you’d love to find a candidate who speaks excellent Italian, if it’s possible to succeed in the role without that competency, don’t give the work sample test in Italian.
  • Including pet features
    • Beware of including any features that will bias you without adding information. If you include a reference to a movie you like and some candidates notice it and some miss the reference, can you be confident that that isn’t going to bias you towards the noticers, who are clearly awesome because they enjoy the same things you do? This is a special case of making sure you’re testing for (and evaluating on the basis of) the features you truly care about, and ideally those features alone.
  • Privileging similarity to self
    • In any type of evaluation, we as evaluators are likely to be biased towards people like ourselves. With work sample tests, there’s a temptation to make tests that look for a bunch of the virtues you most care about, which may be virtues you yourself possess.
    • Proposed solution: Read your work sample test and ask yourself, “Does this sound like a test for being me-like?” If so, be suspicious.  
