In January 2026, GiveWell published a detailed account of their experiment using AI to red-team their charitable intervention research. It's worth reading. They were honest about the results: roughly 15–30% of AI critiques were useful, alongside persistent hallucinations, lost context, and unreliable quantitative estimates. They flagged multi-agent workflows as a future possibility but haven't pursued them.
I think their diagnosis was wrong. The limitations they experienced aren't ChatGPT's fault — they're a consequence of prompt-based, single-pass, monolithic-context architecture. The model is fine. The pipeline isn't.
So I built a different one.
## What I built
A six-stage multi-agent pipeline using only GiveWell's own public materials — their intervention reports, published AI outputs, and cost-effectiveness spreadsheets. No privileged access. The improvement, if there is one, has to come from methodology alone.
The stages: Decomposer → Investigators (one per scoped thread) → Verifier → Quantifier → Adversarial Pair → Synthesizer.
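As a rough sketch, the stages chain as a series of generate-then-filter steps. The function names and data shapes below are my illustrative stand-ins, not the pipeline's actual API:

```python
from dataclasses import dataclass

# Illustrative sketch of the six-stage flow. Names and data shapes are
# hypothetical stand-ins, not the pipeline's actual code.

@dataclass
class Critique:
    claim: str
    citation: str

def decompose(materials):
    # Decomposer: split the source materials into scoped threads,
    # each of which would get its own CONTEXT.md
    return [f"thread:{m}" for m in materials]

def investigate(thread):
    # Investigator: one agent per thread, generating candidate critiques
    return Critique(claim=f"critique of {thread}", citation="report §2.1")

def verify(critique):
    # Verifier: independently re-check every citation and factual claim
    # before anything reaches a human
    return bool(critique.citation)

def run_pipeline(materials):
    threads = decompose(materials)
    candidates = [investigate(t) for t in threads]
    verified = [c for c in candidates if verify(c)]
    # Quantifier, Adversarial Pair, and Synthesizer follow the same
    # filter-then-transform pattern; omitted here for brevity.
    return verified
```

The point of the shape, not the stubs: generation and verification are separate calls with separate contexts, so no single agent both invents a claim and grades it.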
Three design decisions did most of the work:
Scoped context per agent. No agent gets the whole filing cabinet. Each Investigator gets a CONTEXT.md defining what's in scope, what data GiveWell uses, what adjustments are already made, and what not to re-examine. This eliminates the lost-context failure mode GiveWell identified.
Verification as a first-class stage. Every citation and factual claim is independently checked by a separate Verifier agent before reaching a human. Hypothesis generation and evidence retrieval are deliberately separated — this is where hallucinations die.
Quantitative grounding via code execution. The Quantifier runs programmatically against GiveWell's actual CEA spreadsheet. No ungrounded "could reduce cost-effectiveness by 15–25%" without showing which parameter moves and by how much.
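To make the third decision concrete, here is a toy sensitivity check in the spirit of the Quantifier stage. The model and parameter names are hypothetical stand-ins, not GiveWell's actual spreadsheet structure:

```python
# Toy CEA sensitivity check: perturb one parameter and report the resulting
# cost-effectiveness range. All numbers and parameter names are invented
# for illustration; the real Quantifier runs against GiveWell's spreadsheet.

def cost_per_death_averted(params: dict) -> float:
    deaths_averted = (
        params["population"]
        * params["baseline_mortality"]
        * params["mortality_reduction"]
        * params["adherence"]
    )
    return params["program_cost"] / deaths_averted

baseline = {
    "population": 100_000,
    "baseline_mortality": 0.005,
    "mortality_reduction": 0.25,
    "adherence": 0.60,
    "program_cost": 250_000,
}

def sensitivity(params, name, low, high):
    # Recompute the bottom line at a low, central, and high value
    # of a single named parameter.
    results = []
    for value in (low, params[name], high):
        perturbed = dict(params, **{name: value})
        results.append(cost_per_death_averted(perturbed))
    return results  # [pessimistic, central, optimistic]

# e.g. an "adherence decay" critique: what if adherence is 0.40, not 0.60?
pessimistic, central, optimistic = sensitivity(baseline, "adherence", 0.40, 0.75)
```

This is the difference between "could reduce cost-effectiveness by 15–25%" and "moving `adherence` from 0.60 to 0.40 raises cost per death averted from X to Y": the critique names the parameter and the recomputation shows the delta.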
## Phase 1 results: water chlorination
I chose water chlorination first because it's where GiveWell's AI output had hallucinated citations — a concrete baseline to beat.
| Metric | GiveWell baseline | Phase 1 result |
|---|---|---|
| Signal rate | ~15–30% | ~90% (28 of 31 critiques) |
| Hallucination rate | Multiple per run | Zero |
| Novel findings | 1–2 | 4 critical, 3 moderate |
| Quantitative specificity | Ungrounded estimates | Parameter-linked sensitivity ranges |
A note on the signal rate: 30 of 31 critiques passed the Verifier, and 28 of 30 survived adversarial review. I want to be transparent that a ~90% pass rate may indicate the filters are too permissive rather than the Investigators being unusually precise — likely some of both. I'm reporting it honestly rather than as a clean win.
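For concreteness, the headline figure compounds the two sequential filters:

```python
# Arithmetic behind the ~90% signal rate: two filters applied in sequence.
total_critiques = 31
after_verifier = 30      # 30/31 passed independent citation checking
after_adversarial = 28   # 28/30 then survived adversarial review
signal_rate = after_adversarial / total_critiques  # 28/31
```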
The 4 critical findings — Cryptosporidium resistance in chlorinated water, age-specific vulnerability patterns, adherence decay over time, and seasonal transmission gaps — are all connected to specific CEA parameters and survived both verification and adversarial challenge. GiveWell's AI output identified the Cryptosporidium issue but without a verified citation or parameter linkage.
Full write-up, architecture spec, side-by-side comparison with GiveWell's published output, and all seven agent prompts are at tsondo.com/blog/give-well-red-team.
## Two versions for different audiences
If you work at GiveWell or a similar research organization: there's a manual version — sequential prompts designed to run in a Claude Project with no engineering required.
If you're a developer: there's a Python pipeline with the full automated version, including the spreadsheet sensitivity analysis module.
Both are open source. The total API cost for Phase 1 was ~$30.
## What I'd like
Direct engagement from anyone at GiveWell, or others who've worked on AI evaluation pipelines in research contexts. Phases 2 (ITNs) and 3 (SMC) are in progress.
If the methodology is wrong, I want to know. If it's useful, I'd rather GiveWell use it than have it sit in a repo.
Reach me at todd@tsondo.com or @tsondo.com on Bluesky.

I am somewhat concerned about data contamination here: are you sure that the original GiveWell write-up has at no point leaked into your model's analysis? I.e., was any of GiveWell's analysis online before the August 2025 knowledge cutoff for GPT, or did your agents look at the GiveWell report as part of their research?