
In January 2026, GiveWell published a detailed account of their experiment using AI to red-team their charitable intervention research. It's worth reading. They were honest about the results: roughly 15–30% of AI critiques were useful, with persistent hallucination, lost context, and unreliable quantitative estimates. They flagged multi-agent workflows as a future possibility but haven't pursued them.

I think their diagnosis was wrong. The limitations they experienced aren't ChatGPT's fault — they're a consequence of prompt-based, single-pass, monolithic-context architecture. The model is fine. The pipeline isn't.

So I built a different one.

What I built

A six-stage multi-agent pipeline using only GiveWell's own public materials — their intervention reports, published AI outputs, and cost-effectiveness spreadsheets. No privileged access. The improvement, if there is one, has to come from methodology alone.

The stages: Decomposer → Investigators (one per scoped thread) → Verifier → Quantifier → Adversarial Pair → Synthesizer.
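The stage ordering can be sketched as a simple filter-and-transform chain. This is a toy skeleton, not the actual implementation: each function here is a deterministic stand-in for what is, in the real pipeline, an LLM agent call.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    thread: str
    claim: str
    verified: bool = False
    quantified: bool = False

# Toy stand-ins for the agents; real versions call a model API.
def decompose(report: str) -> list[str]:
    return [s.strip() for s in report.split(";") if s.strip()]

def investigate(thread: str) -> list[Critique]:
    return [Critique(thread=thread, claim=f"possible gap in {thread}")]

def verify(c: Critique) -> bool:
    c.verified = True          # real Verifier re-checks every citation
    return c.verified

def quantify(c: Critique) -> Critique:
    c.quantified = True        # real Quantifier runs against the CEA spreadsheet
    return c

def adversarial_review(c: Critique) -> bool:
    return True                # real stage is a critic/defender pair

def run_pipeline(report: str) -> list[Critique]:
    threads = decompose(report)                               # Decomposer
    critiques = [c for t in threads for c in investigate(t)]  # Investigators
    critiques = [c for c in critiques if verify(c)]           # Verifier
    critiques = [quantify(c) for c in critiques]              # Quantifier
    return [c for c in critiques if adversarial_review(c)]    # Adversarial Pair
```

The Synthesizer then turns whatever survives into the final write-up.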

Three design decisions did most of the work:

Scoped context per agent. No agent gets the whole filing cabinet. Each Investigator gets a CONTEXT.md defining what's in scope, what data GiveWell uses, what adjustments are already made, and what not to re-examine. This eliminates the lost-context failure mode GiveWell identified.
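A CONTEXT.md like this can be generated mechanically. The section names and parameter names below are illustrative, not the actual template from the repo:

```python
CONTEXT_TEMPLATE = """\
# CONTEXT.md - Investigator scope: {scope}

## In scope
{in_scope}

## Data GiveWell already uses
{data}

## Adjustments already made in the CEA
{adjustments}

## Out of scope (do not re-examine)
{out_of_scope}
"""

def _bullets(items):
    return "\n".join(f"- {item}" for item in items)

def build_context(scope, in_scope, data, adjustments, out_of_scope):
    """Render the scoped brief a single Investigator receives;
    everything outside this file is invisible to that agent."""
    return CONTEXT_TEMPLATE.format(
        scope=scope,
        in_scope=_bullets(in_scope),
        data=_bullets(data),
        adjustments=_bullets(adjustments),
        out_of_scope=_bullets(out_of_scope),
    )
```

The "out of scope" section matters as much as the "in scope" one: it's what stops an Investigator from relitigating adjustments GiveWell has already made.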

Verification as a first-class stage. Every citation and factual claim is independently checked by a separate Verifier agent before reaching a human. Hypothesis generation and evidence retrieval are deliberately separated — this is where hallucinations die.
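The core move is mechanical: the Verifier never sees the Investigator's reasoning, only the cited source text, so a hallucinated source or misquote fails automatically. A minimal sketch of that check (the real Verifier is itself an agent, but the grounding principle is the same):

```python
def verify_citation(citation: dict, corpus: dict[str, str]) -> bool:
    """citation = {"source": doc_id, "quote": the exact text relied on}.
    A claim passes only if its quoted evidence appears verbatim in the
    named source document."""
    doc = corpus.get(citation["source"])
    if doc is None:
        return False                                  # cited document doesn't exist
    return citation["quote"].lower() in doc.lower()   # quote must appear verbatim
```

Anything that fails this gate never reaches the Quantifier, let alone a human reader.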

Quantitative grounding via code execution. The Quantifier runs programmatically against GiveWell's actual CEA spreadsheet. No ungrounded "could reduce cost-effectiveness by 15–25%" without showing which parameter moves and by how much.
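In sketch form, the sensitivity check is one-parameter-at-a-time against the published model. The CEA function and parameter names below are invented, simplified stand-ins, not GiveWell's actual spreadsheet structure:

```python
def sensitivity(params: dict, model, name: str, low: float, high: float) -> dict:
    """Relative change in the model output when one parameter is shifted,
    holding every other published input fixed."""
    base = model(params)
    return {
        label: (model({**params, name: value}) - base) / base
        for label, value in (("low", low), ("high", high))
    }

# Hypothetical toy stand-in for one line of the CEA: deaths averted
# per dollar, linear in adherence. Parameter names are illustrative.
def toy_cea(p: dict) -> float:
    return p["coverage"] * p["adherence"] * p["mortality_effect"] / p["cost_per_person"]
```

So a critique ships as "adherence falling from 0.8 to 0.6 cuts modeled cost-effectiveness by 25%," not as a free-floating percentage.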

Phase 1 results: water chlorination

I chose water chlorination first because it's where GiveWell's AI output had hallucinated citations — a concrete baseline to beat.

| Metric | GiveWell baseline | Phase 1 result |
|---|---|---|
| Signal rate | ~15–30% | ~90% (28 of 31 critiques) |
| Hallucination rate | Multiple per run | Zero |
| Novel findings | 1–2 | 4 critical, 3 moderate |
| Quantitative specificity | Ungrounded estimates | Parameter-linked sensitivity ranges |

A note on the signal rate: 30 of 31 critiques passed the Verifier, and 28 of 30 survived adversarial review. I want to be transparent that a ~90% pass rate may indicate the filters are too permissive rather than the Investigators being unusually precise — likely some of both. I'm reporting it honestly rather than as a clean win.
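For concreteness, the headline rate is just the survivors over the original pool:

```python
total = 31                 # critiques produced by the Investigators
after_verifier = 30        # 1 failed independent citation checks
after_adversarial = 28     # 2 more fell to adversarial review

signal_rate = after_adversarial / total
assert round(signal_rate * 100, 1) == 90.3
```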

The 4 critical findings — Cryptosporidium resistance in chlorinated water, age-specific vulnerability patterns, adherence decay over time, and seasonal transmission gaps — are all connected to specific CEA parameters and survived both verification and adversarial challenge. GiveWell's AI output identified the Cryptosporidium issue but without a verified citation or parameter linkage.

Full write-up, architecture spec, side-by-side comparison with GiveWell's published output, and all seven agent prompts are at tsondo.com/blog/give-well-red-team.

Two versions for different audiences

If you work at GiveWell or a similar research organization: there's a manual version — sequential prompts designed to run in a Claude Project with no engineering required.

If you're a developer: there's a Python pipeline with the full automated version, including the spreadsheet sensitivity analysis module.

Both are open source. The total API cost for Phase 1 was ~$30.

What I'd like

Direct engagement from anyone at GiveWell, or others who've worked on AI evaluation pipelines in research contexts. Phases 2 (ITNs) and 3 (SMC) are in progress.

If the methodology is wrong, I want to know. If it's useful, I'd rather GiveWell use it than have it sit in a repo.

Reach me at todd@tsondo.com or @tsondo.com on BlueSky.


Comments

I am somewhat concerned about data contamination here: are you sure that the original GiveWell write-up has at no point leaked into your model's analysis? I.e., was any of GiveWell's analysis online before the August 2025 knowledge cutoff for GPT, or did your agents look at the GiveWell report as part of their research?
