
The Institute for Replication is an attempt to replicate social-science papers at scale to see which results actually hold up under scrutiny. In principle, replication is a very resource‑efficient way to gather additional evidence: you can replicate a high‑profile finding in a tiny fraction of the time it took to produce the original paper.

I’m personally totally on board with this. I love the idea. It’s also the kind of thing that naturally attracts philanthropic funding. As far as I can tell, the Institute for Replication has received grants from Coefficient Giving and possibly other philanthropic funders in the past.

I think structured programs that generate more replications are a fruitful way forward—but the Replication Games didn’t work out for me, and could perhaps be massively improved upon. I would’ve been more productive just replicating on my own at home.

My best guess is that this outcome wasn’t inevitable; it was mostly driven by organizational issues (especially around team formation and follow‑through) that might be fixable.

How the Games work

Replication Games are essentially replication hackathons. You apply to participate, and organizers place participants into small teams based on practical constraints—things like methods experience, preferred software (R/Stata/Python), and the kind of empirical work people feel comfortable doing. Teams are typically assigned a paper from a curated list where replication seems feasible (e.g., the paper is influential and the underlying data/code are available). Sometimes participants suggest papers themselves, but the event usually works best when papers are screened in advance for tractability.

Before the in‑person event, teams do prep work: reading the paper, checking the available materials, setting up the computing environment, and (ideally) sketching out a plan (which robustness checks could be done, how to divide tasks). Then the team meets in person for an intensive work session—often a full day—focused on getting as far as possible and, not least, having fun: reproducing the headline results, documenting where code breaks, and running sensible robustness checks or alternative experiments where appropriate.

Because it’s rarely possible to finish everything during the event itself, there’s usually a post‑game phase where the team cleans up code, writes up a standardized replication report (what was attempted, what worked, what didn’t, why), and submits it to the organizers. The Institute for Replication coordinates the overall process, and there’s typically a local organizer (not from the Institute, but from the hosting university or institution) handling the venue, food, and the practicalities.

Why the Replication Games did not work out (for me)

In my experience, the work achieved wasn’t worth the effort that was put in. If I could go back, I would choose a different approach—and I think that’s mostly due to organizational hiccups that could potentially be eliminated.

The main issue at the Replication Games seems to be group dynamics. Everyone knows the free‑rider problem from group projects; the Replication Games aren’t unique here. But the format aggravates the problem: What usually deters free‑riding is continued dependence on each other—repeated interaction, reputational consequences, and clear accountability. The Games often have the opposite structure: you’re thrown together with a bunch of strangers (great for meeting people; poor for incentives) who you might never work with again, for what is essentially unpaid work, with benefits (if any) shared across many. Even motivated people—me included—can drift into “I’ll do my part, but I’m not going to carry this whole thing” mode.

This is a selective, highly personal take. That said, conversations with other replicators (including people from other groups) suggested my feelings weren’t rare. At dinner, people from other groups reported similar issues; someone even said, “I’m really not looking forward to finishing it.”

My experience

I went into the Games hoping for networking, learning, results, and fun. And to be clear: the overall concept is well designed to deliver those things, and communication about what to expect was generally clear.

First game: I was placed in a group that intended to work with a proprietary software package I had zero experience with and zero intention of learning (SPSS). On top of that, as the game approached, my teammates weren’t moving toward a decision on what paper to replicate or how we’d do it. It quickly became clear that this wouldn’t work for me, and my best option was to pull out at the last minute and not attend. I incurred real costs, and I didn’t replicate anything. This also made me wonder whether the matching problem—replicators to groups, groups to papers—is so complex that it might benefit from giving replicators more choice up front.

Second game: Of my four team members, one showed up extremely well prepared; one showed up completely unprepared; one was added on short notice (so couldn’t prepare); and one didn’t show up. Still, we got to work. I was in a genuine state of flow: the room was vibrant with scientific bliss, I met new people, and I felt like I was contributing to an awesome scientific endeavor. I worked with full concentration for half a day. Lunch and the local organization were awesome!

Then the flow got punctured: an Institute member came over and told us there’d been a mistake during the paper assignment. Another group—sitting a few metres away from us—had accidentally been assigned the same paper. They’d noticed the error a couple of hours earlier, but hoped we’d take different approaches so the outputs could be merged into one larger replication. Unfortunately, both groups tweaked the experiment in exactly the same way, so a lot of what we’d done was suddenly redundant. 

Because my experience with the research management of the Replication Games has been so negative, I no longer trust the process and likely won’t attend again, even though I would really enjoy the work.

Tentative suggestions

If the key problem is group dynamics, then fixing it means changing incentives and accountability—not just tweaking logistics. One approach could be to create stronger “repeated interaction” or future‑opportunity effects. For example: an enticing, perhaps prestigious, replication event that’s only open to teams who reliably completed high‑quality replications. If your replication is poor (as judged by reviewers) or simply unfinished, you don’t get invited. That said, this needs care: you don’t want to create incentives to distort replication results: replicators might try to “break” a high‑profile paper by doing something unnecessarily adversarial in the analysis rather than aiming for a fair test.

But—unless I’ve misunderstood—Replication Games are explicitly meant not to recur in the same places. That makes it unlikely you’ll run into the same people again, which worsens the limited‑repeated‑interaction problem. For me, getting together with others who are passionate about replication was a key motivation to attend. Of course, I could replicate alone, or with people I already know—but then the question becomes: why attend a replication game at all? At that point, I might as well just start on my own.

Our team noted that the Replication Games included minimal follow‑up—not even a “thank you for participating, here are the next steps” email. I don’t know whether the Institute for Replication systematically gathers feedback, or even encourages all participants to submit reports, but I received nothing; that seems like low‑hanging fruit.

I guess that replicating empirical science papers is hard, and it’s a learning process for everyone involved. Happy replicating!
