Epistemic status: I think this is a statistical “fact”, but I feel a bit cautious since so few people seem to take advantage of it.
It may not always be optimal for cost or statistical power to have equal-sized treatment/control groups in a study. When your intervention is quite expensive relative to data collection, you can maximise statistical power or save costs by using a larger control group and smaller treatment group. The optimal ratio of control sample to treatment sample is just the square root of the cost per treatment participant divided by the square root of the cost per control participant.
Why larger control groups seem better
Studies generally have equal numbers of treatment and control participants. This makes intuitive sense: a study with 500 treatment and 500 control will be more powerful than a study with 499 treatment and 501 control, for example. This is due to the diminishing power returns to increasing your sample size: the extra person removed from one arm hurts your power more than the extra person added to the other arm increases it.
But what if your intervention is expensive relative to data collection? Perhaps you are studying a $720 cash transfer and it costs $80 to complete each survey, for a total cost of $800 per treatment participant ($720 + $80) and $80 per control. Now, for the same cost as 500 treatment and 500 control, you could have 499 treatment and 510 control, or 450 treatment and 1000 control: up to a point, the loss in precision from the smaller treatment is more than offset by the 10x larger increase in your control group, resulting in a more powerful study overall. In other words: when your treatment is expensive, it is generally more powerful to have a larger control group, because it's just so much cheaper to add control participants.
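The arithmetic above can be checked with a minimal sketch. It assumes a simple difference in means with equal outcome variance in both arms, so precision depends only on sqrt(1/n_treatment + 1/n_control). The function names are illustrative, not from any library.

```python
from math import sqrt

TREATMENT_COST = 800  # $720 cash transfer + $80 survey
CONTROL_COST = 80     # survey only


def total_cost(n_treat, n_control):
    """Total study cost in dollars for a given allocation."""
    return n_treat * TREATMENT_COST + n_control * CONTROL_COST


def se_factor(n_treat, n_control):
    """Standard error of the difference in means, in units of the outcome SD.

    Assumes equal outcome variance in both arms (an assumption of this sketch).
    """
    return sqrt(1 / n_treat + 1 / n_control)


# The three allocations mentioned in the text: same budget, different precision.
for n_t, n_c in [(500, 500), (499, 510), (450, 1000)]:
    print(n_t, n_c, total_cost(n_t, n_c), round(se_factor(n_t, n_c), 4))
```

All three allocations cost the same, and the standard error falls as participants are shifted from the treatment arm to the control arm. A smaller standard error means a more powerful study.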
How much larger? The exact ratio of treatment:control that optimises statistical power is surprisingly simple: it’s just the ratio of the square roots of the costs of adding to each arm, i.e. sqrt(control_cost) : sqrt(treatment_cost) (see Appendix for justification). For example, if adding an extra treatment participant costs 16x as much as adding a control participant, you should optimally have sqrt(16/1) = 4x as many control as treatment.
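For readers who want the reasoning behind the square-root rule, here is one way to sketch it. This is not necessarily the derivation the author has in mind. It assumes a simple difference in means, equal outcome variance σ² in both arms, and a fixed budget B with constant per-participant costs. The symbols n_T, n_C, c_T, c_C are notation introduced for this sketch.

```latex
\begin{aligned}
&\text{Minimise } V(n_T, n_C) = \sigma^2\left(\frac{1}{n_T} + \frac{1}{n_C}\right)
\quad \text{subject to } c_T n_T + c_C n_C = B. \\
&\text{Lagrangian: } \mathcal{L} = \sigma^2\left(\frac{1}{n_T} + \frac{1}{n_C}\right)
+ \lambda\,(c_T n_T + c_C n_C - B). \\
&\frac{\partial \mathcal{L}}{\partial n_T} = -\frac{\sigma^2}{n_T^2} + \lambda c_T = 0
\;\Rightarrow\; n_T^2 = \frac{\sigma^2}{\lambda c_T}, \qquad
\frac{\partial \mathcal{L}}{\partial n_C} = -\frac{\sigma^2}{n_C^2} + \lambda c_C = 0
\;\Rightarrow\; n_C^2 = \frac{\sigma^2}{\lambda c_C}. \\
&\text{Dividing the two: } \frac{n_C}{n_T} = \sqrt{\frac{c_T}{c_C}}.
\end{aligned}
```

With c_T = 16 c_C this gives n_C / n_T = 4, matching the example above.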
Quantifying the benefits
With this approach, you either get free extra power for the same money or save money without losing power. For example, let’s look at the hypothetical cash transfer study above with treatment participants costing $800 and control participants $80. The optimal ratio of control to treatment is then sqrt(800/80) ≈ 3.2 : 1, resulting in either:
Saving money without losing power: the study is currently powered to measure an effect of 0.175 SD and, with 500 treatment and control, costs $440,000. With a 3.2 : 1 ratio (*types furiously in Stata*) you could achieve the same power with a sample of 337 treatment and 1079 control, which would cost $356,000: saving you a cool $84k without any loss of statistical power.
Getting extra power for the same budget: alternatively, if you still want to spend the full $440k, you could then afford 416 treatment and 1,331 control, cutting your detectable effect from 0.175 SD to 0.155 SD at no extra cost.
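The two scenarios above can be roughly reproduced without Stata. The sketch below uses a normal approximation and assumes 80% power and a two-sided 5% significance level, neither of which is stated in the post. The helper names are hypothetical. Because it uses the unrounded ratio sqrt(10) rather than 3.2, the outputs should land close to the figures quoted above without matching them exactly.

```python
from math import sqrt
from statistics import NormalDist


def z_total(alpha=0.05, power=0.80):
    """Sum of the critical values for a two-sided test at the given power."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def optimal_ratio(treatment_cost, control_cost):
    """Control:treatment ratio from the square-root rule."""
    return sqrt(treatment_cost / control_cost)


def mde(n_treat, n_control, alpha=0.05, power=0.80):
    """Minimum detectable effect in SD units (equal variances assumed)."""
    return z_total(alpha, power) * sqrt(1 / n_treat + 1 / n_control)


def sample_for_mde(target_mde, ratio, alpha=0.05, power=0.80):
    """Unrounded arm sizes that achieve target_mde at a given control:treatment ratio."""
    n_treat = z_total(alpha, power) ** 2 * (1 + 1 / ratio) / target_mde ** 2
    return n_treat, ratio * n_treat


def sample_for_budget(budget, treatment_cost, control_cost, ratio):
    """Unrounded arm sizes that exhaust the budget at a given control:treatment ratio."""
    n_treat = budget / (treatment_cost + ratio * control_cost)
    return n_treat, ratio * n_treat


ratio = optimal_ratio(800, 80)

# Scenario 1: same detectable effect (0.175 SD), lower cost.
n_t, n_c = sample_for_mde(0.175, ratio)
print("same power:", round(n_t), round(n_c), round(n_t * 800 + n_c * 80))

# Scenario 2: same $440k budget, smaller detectable effect.
n_t, n_c = sample_for_budget(440_000, 800, 80, ratio)
print("same budget:", round(n_t), round(n_c), round(mde(n_t, n_c), 3))
```

In practice you would round the arm sizes up and rerun the calculation in your usual power software, since exact results depend on the test and any small-sample corrections it applies.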
Ethics: there may be ethical reasons for not wanting a larger control group, for example in a medical trial where you would be denying potentially life-saving treatments to sick patients. Even outside of medicine, control participants’ time is important and you may wish to avoid “wasting” it on participating in your study (although you could use some of the savings to compensate control participants, if that won’t mess with your study).
Necessarily limited samples: obviously if there is a practical limit to increasing your control group size, such as only being able to operate in a limited geography, this may not be an option.
Natural skepticism? This isn’t a common technique, so you might just trust that the market for ideas is efficient and that if this really were a thing you would have heard about it from somewhere else by now. It kind of blows my mind that this isn’t done more often, which makes me both want to tell people about it and be skeptical of it. We used this approach for a pretty large RCT I worked on in Tanzania, and no one complained.
If your treatment is quite expensive relative to data collection costs, consider using a larger control group, with a control:treatment ratio of sqrt(treatment_cost/control_cost), and enjoy that spare money or additional statistical power.
I am not claiming to have discovered this myself. I first read this equation in Running Randomized Evaluations and was able to derive the same result myself here.
I believe this holds for cluster RCTs. Just remember that the increased control sample here would come in the form of additional control clusters, rather than larger clusters.
If you are doing power calculations in Stata and want to factor in different treatment/control group sizes, you just add ratio(X) to the sampsi command, where “X” is the treatment/control ratio. For a cluster RCT using clustersampsi you... need to do something involving harmonic means, I forget exactly, but poke me on the Forum and I'll happily dig through some old code.
The idea makes a lot of sense, but my guess is that the circumstance where the cost is driven by the intervention itself isn’t that common: In the context of charities, we’re thinking about applying RCTs to test whether an intervention works. Generally the intervention is happening anyway. The cost of RCTs then doesn’t come from applying the intervention to the treatment group - it comes from establishing the experimental conditions where you have a randomised group of participants and the ability to collect data on them.
Hey Aidan-- that's a good point. I think it will probably apply to different extents for different cases, but probably not to all cases. Some scenarios I can imagine:
Overall, I think cases 2/3/4 benefit from the cheaper study. Scenario 1 seems more like what you have in mind and is a good point, I just think there will be enough scenarios where the cheaper trial is useful, and in those cases the charity might consider this treatment/control optimisation.
Thanks Rory - I think your general idea is good, and in some cases could be a good option!
I could be wrong, but from my experience working in the development world these 4 scenarios aren't really how RCTs generally happen. Usually there will be a partnership with an RCT-running NGO (like IPA) or a university department (J-PAL at MIT) where the partner organisation pays for and organises everything.
Sometimes scenario 4 could happen as part of a grant application.
This doesn't change the existence of a budget constraint, though. The partner organization, especially a grant funder like JPAL/IPA, will grant you a certain amount of their resources to use. I don't see why you wouldn't want to optimize the use of their resources.
100% the original post stands, in any scenario we would want to optimise use of resources. I don't think JPAL/IPA is generally a funder though - they do the research themselves so they are the ones to convince ;).
Ah, that's helpful data. My experience in RCTs mostly comes from One Acre Fund, where we ran lots of RCTs internally on experimental programs, or just A/B tests, but that might not be very typical!
Would be super interested to see the results of some of these RCTs / AB tests. Were any of them published apart from the Lime SMS study? We're looking for great examples of learning orgs that do this and some studies from 1AF would be a great motivator/example.
Great suggestion, particularly as you say for trials with a super expensive treatment relative to control.
In defense of current practice, I'd like to add that a major difficulty when running medical trials for new therapeutics is simply recruiting patients to the trial. Many patients enroll in the trial with the aim of getting the experimental treatment, so it's a lot easier to recruit people when your trial has a 50% or 75% chance of assignment to the therapeutic arm.
Some other important strategies that are currently hot:
Platform trials: One giant trial that has one control arm and maybe three to four treatment arms. Hard to do as it requires a lot of people to work together but amazing when you pull them off (e.g. we did many of these for COVID)
Use of historical or shared control data: Why recruit as many controls if you can integrate existing data in a statistically principled, unbiased way (easier said than done of course).
This is a really helpful post - thank you! It does blow my mind slightly that this isn't more broadly practiced, if the argument holds, but I think it holds!
I don't know enough about the market for academic papers, but I wonder if you'd be interested in writing this up for a more academic audience? You could look at some set of recent RCTs and estimate the potential savings (or, more ambitiously, the increase in power and associated improvement in detecting results)
Given that the argument is statistical rather than practical in any way that is specific to economics or development, do you know if this happens in biomedicine? Many trials involve pitting newer, more expensive interventions against an existing standard of care.
Thanks Chris, that's a cool idea. I will give it a go (in a few days, I have an EAG to recover from...)
One thing I should note is that other comments on this post are suggesting this is well known and applied, which doesn't knock the idea but would reduce the value of doing more promotion. Conversely, my super quick, low-N look into cash RCTs (in my reply below to David Reinstein) suggests it is not so common. Since the approach you suggest would partly involve listing a bunch of RCTs and their treatment/control sizes (so we can see whether they are cost-optimised), it could also serve as a nice check of just how often this adjustment is/isn't applied in RCTs
For bio, that's way outside of my field, I defer to Joshua's comment here on limited participant numbers, which makes sense. Though in a situation like early COVID vaccine trials, where perhaps you had limited treatment doses and potentially lots of willing volunteers, perhaps it would be more applicable? I guess pharma companies are heavily incentivised to optimise trial costs tho, if they don't do it there'll be a reason!
Often recruiting is the bottleneck in biomedicine so you want to maximise the power for a given number of participants
You’re completely correct! However, it’s worth noting this is standard practice (when the treatment makes up most of the cost, which it usually doesn’t). Most statisticians will be able to tell you about this.
So I think I have two comments:
Actually, maybe I should clarify this. This is standard practice when you hire a decent statistician. We've known this since like... the 1940s, maybe?
But a lot of organizations and clinical trials don't do this because they don't consult with a statistician early enough. I've had people come to me and say "hey, here's a pile of data, can you calculate a p-value?" too many times to count. Yes, I calculated a p-value, it's like 0.06, and if you'd come to me at the start of the experiment we could've avoided the million-dollar boondoggle that you just created.
I assumed more people were aware of this. I'm using it in a trial we're about to start. But as others have said, in many trials the treatment is not particularly more costly than the control. It is probably a factor, though, in detailed interventions in poverty and health in poor countries. Have you looked into how many studies in development economics and GH&D with costly interventions do this?
As a quick data point I just checked the 6 RCTs GiveDirectly list on their website. I figure cash is pretty expensive so it's the kind of intervention where this makes sense.
It looks like most cash studies, certainly with just 1 treatment arm, aren't optimising for cost:
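[Table: study titles and arm sizes. Only fragments are recoverable, in order: “... AGAINST CASH: EVIDENCE FROM RWANDA”; “... farming communities in Uganda”; “USAID Workforce Readiness Program”; “203 cash + NGO”; “... experimental evidence from Kenya”; “80 shortterm UBI”; “71 lump sum”.]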
Suggests either 1) there's some value in sharing this idea more or 2) there's a good reason these economists aren't making this adjustment. Someone on Twitter suggested "problems caused by unbalanced samples and heteroskedasticity" but that was beyond my poor epidemiologist's understanding and they didn't clarify further.
The “problems caused by unbalanced samples” doesn’t seem coherent to me; I’m not sure what they are talking about.
If the underlying variance is different between the treatment and the control group:
Unbalanced samples are not a problem per se. You can run into a problem of representation/generalization for the smaller sample but this argument is independent of balancing and only has to do with small sample sizes.
@david_reinstein made an excellent point about heteroscedasticity / variance. To factor this into your original post: You want to optimize the cost-effectiveness of the precision of your group-level difference score. This is achieved by minimizing the standard error (SE) of that difference, which combines the SEs of the two group-level estimates. Each of those is just the standard deviation (SD) divided by the square root of the respective number of observations. So your term would expand to:
Control-to-treat-ratio = sqrt(treatment_cost/control_cost) * control_SD/treatment_SD.
The problem, in practice, is that you usually know the costs a priori but not the SDs. If variances are not equal, however, I would agree with @david_reinstein that the treatment group will more likely show greater variance on your outcome variable (if control group has more variance, I would rather reconsider the choice of the outcome variable).
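The expanded ratio above can be sketched in a few lines. Since the SDs are usually unknown in advance, the values passed in would be guesses, for example from pilot data or earlier studies. The function names and the 1.5x SD figure are illustrative assumptions, not taken from the thread.

```python
from math import sqrt


def control_to_treatment_ratio(treatment_cost, control_cost,
                               treatment_sd=1.0, control_sd=1.0):
    """Cost-optimal control:treatment ratio, allowing unequal outcome SDs."""
    return sqrt(treatment_cost / control_cost) * (control_sd / treatment_sd)


def variance_of_difference(n_treat, n_control, treatment_sd, control_sd):
    """Sampling variance of the difference in means between the two arms."""
    return treatment_sd ** 2 / n_treat + control_sd ** 2 / n_control


def allocate(budget, treatment_cost, control_cost, ratio):
    """Unrounded arm sizes that spend the whole budget at a given ratio."""
    n_treat = budget / (treatment_cost + ratio * control_cost)
    return n_treat, ratio * n_treat


# Equal SDs recover the original square-root rule.
print(round(control_to_treatment_ratio(800, 80), 2))

# Hypothetical case: treatment outcomes are 1.5x as variable as control outcomes,
# which pulls the optimal ratio back towards a more balanced design.
print(round(control_to_treatment_ratio(800, 80, treatment_sd=1.5, control_sd=1.0), 2))
```

If the treatment arm is noisier, as suggested above, the optimal control group is smaller than the pure cost-based rule implies, because extra treatment observations buy more precision.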
If you want to read more about the concept of precision and its relation to statistical power (also cf. the paper that @Karthik Tadepalli cited), we just put together a preprint here that is supposed to double as a teaching resource: https://doi.org/10.31234/osf.io/m8c4k (introduction and discussion will suffice since the middle part focusses on biological/neuroscientific measurements that have vastly different properties than, e.g., questionnaire data).
Here is the glossary that is mentioned in the paper: https://osf.io/2wjc4
And here is the associated Twitter post with some digest about the most important insights: https://twitter.com/bioDGPs_DGPA/status/1616014732254756865
You seem to assume that there's a linear relationship between the intervention and the effect. This might be the case for cash transfers but it's not the case for many other interventions.
If you give someone half of a bednet, they are not 50% as protected.
When it comes to medical treatments it might be that certain side effects only appear at a given dose and as a result you have to do your clinical trial for the dose that you actually want to put into the pill that you sell.
Hi Christian-- agreed but my argument here is really for fewer treatment participants, not smaller treatment doses
Great argument. My guess for why this isn't common based on a little experience is that the decision is usually sequential. First you calculate a sample size based on power requirements, and then you fundraise for that budget (and usually the grantmaker asks for your power calculations, so it does have to be sequential). This doesn't inherently prevent you from factoring intervention cost into the power calculations, but it does mean the budget constraint is not salient.
I wouldn't be too surprised ex ante if there are inefficiencies in how we do randomization. This is an area with quite active research, such as this 2022 paper which proposes a really basic shift in randomization procedures and yet shows its power benefits.
I'm confused why the process being sequential is a reason that this isn't occurring. Suppose someone was writing an RCT grant proposal and knew in advance how expensive the treatment was compared to the control. They find the optimal ratio of treatment to control, based on the post above. Then, they ask for however much money they need to get a certain amount of power (which would be less money than they would have needed to ask for without doing this).
Or alternatively, run the sample size calculation as you suggest. Convert that into a $ figure, then use the information in the post above to get more power for that same amount of money and show the grant-maker the second version of one's power calculations.
I'm surprised you retracted the comment because I agree with it and I'm not 100% sure what I meant. It is still a salience issue but I don't think the sequential process really matters for that
To explain why I retracted: I re-read your original post and noticed that you were talking about salience, and I think you're probably right that this isn't a very salient aspect of the process. At first, I thought you were saying something like 'the steps occur sequentially, so the suggestion of the post can't be implemented' which seems wrong. But 'the steps occur sequentially, so it might not occur to someone to back-track in their thinking and revise the result they got in the first step afterwards' seems probably right, although I have no idea how big of an explanation that is compared to other reasons the OP's suggestion isn't very common.