Tsondo
AI architect · 3 karma · Working (15+ years)

Comments (4)

Good to hear! All of my work is on GitHub. Please have a look at the results. If my pipeline found something that yours didn't, it might be worth integrating the methodology.

I'd be very happy to discuss with you at your convenience. I'm on Central European Time (Italy). I also sent you an email via research@GiveWell.org; Hannah says she will pass it on to you.
 

If you want help converting it to a database, let me know. It looks like a weekend project. We could also develop a front end for easier data entry, if you want. I'd be happy to assist.

When I ran my multi-agent pipeline on the data, I had to write unique parsing rules for each data set because of inconsistencies in the spreadsheets. One suggestion going forward would be to standardize how the data is collected and stored to make it more machine-readable. But you're right: Claude Code can sift through it and sort it out, with the right prompting, at the right stage. It's just one more context to manage.
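To give a flavor of what those per-dataset rules looked like in spirit, here's a minimal sketch. The dataset keys, column names, and percentage conventions below are hypothetical, not the actual spreadsheet layouts:

```python
# Hypothetical registry mapping each dataset to its own parsing rules,
# compensating for inconsistencies across the source spreadsheets.
PARSE_RULES = {
    "water_chlorination": {"value_col": "Estimate", "pct_as_fraction": True},
    "itn": {"value_col": "Central value", "pct_as_fraction": False},
    "smc": {"value_col": "Best guess", "pct_as_fraction": False},
}

def parse_cell(dataset: str, raw: str) -> float:
    """Normalize one spreadsheet cell to a float using that dataset's rules."""
    rules = PARSE_RULES[dataset]
    value = float(raw.strip().rstrip("%"))
    # Some sheets write "85%" where others write 0.85; normalize per dataset.
    if raw.strip().endswith("%") and rules["pct_as_fraction"]:
        value /= 100.0
    return value
```

A standardized format would make this registry unnecessary, which is the point of the suggestion above.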

Hi — I took you up on the invitation to try an alternative AI red teaming approach.

I built a multi-agent pipeline (decomposition → investigation → verification → quantification → adversarial testing → synthesis) and ran it against all three interventions where you published detailed AI output: water chlorination, ITNs, and SMC.
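For concreteness, the stage sequence above can be sketched as a simple state-threading loop. Everything below is illustrative stand-in logic, not the actual prompts or agents:

```python
# Minimal sketch of the six-stage pipeline: each stage takes the accumulated
# state dict and returns an updated copy. In the real pipeline each stage
# wraps an LLM agent with its own prompt; these bodies are placeholders.
def decompose(s):   return {**s, "claims": ["cost per unit is fixed"]}
def investigate(s): return {**s, "hypotheses": [f"critique of: {c}" for c in s["claims"]]}
def verify(s):      return {**s, "verified": list(s["hypotheses"])}
def quantify(s):    return {**s, "quantified": list(s["verified"])}
def adversarial(s): return {**s, "surviving": list(s["quantified"])}
def synthesize(s):  return {**s, "report": f"{len(s['surviving'])} surviving critique(s)"}

STAGES = [decompose, investigate, verify, quantify, adversarial, synthesize]

def run_pipeline(cea: dict) -> dict:
    """Thread the CEA state through each stage in order."""
    state = dict(cea)
    for stage in STAGES:
        state = stage(state)
    return state

result = run_pipeline({"model": "example CEA"})
```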

Results across three runs:

  • Signal rates: 84% (water), 100% (ITNs), 82% (SMC) — vs your reported ~15-30%
  • Zero hallucinated citations (the key architectural change: Investigators generate hypotheses without citing evidence, then a separate Verifier searches for real evidence)
  • Each surviving critique includes parameter mappings to specific CEA spreadsheet cells with computed sensitivity ranges
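The zero-hallucination result comes from the Investigator/Verifier split mentioned above. A minimal sketch of that separation, with a stubbed-out search standing in for the real evidence lookup:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Critique:
    text: str
    evidence: list = field(default_factory=list)  # filled in only by the Verifier

def investigator(claim: str) -> Critique:
    # Investigator: states a hypothesis but is never asked for citations,
    # so there is nothing for it to hallucinate.
    return Critique(text=f"'{claim}' may be overstated")

def verifier(critique: Critique, search: Callable[[str], list]) -> Optional[Critique]:
    # Verifier: looks for real evidence via an external search; critiques
    # with no supporting hits are dropped rather than decorated with fakes.
    hits = search(critique.text)
    if not hits:
        return None
    critique.evidence = hits
    return critique

# Stand-in for a real literature/web search tool.
def fake_search(query: str) -> list:
    return ["Doe et al. 2021"] if "overstated" in query else []
```

The design choice is that citations can only enter the system through the search step, never from an agent's generation.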

The most interesting findings cut across interventions: three structural patterns appeared independently in all three analyses. All three CEAs model dynamic phenomena with static parameters (adherence decay, resistance evolution, efficacy degradation). All three collapse meaningful within-category variation into single aggregate parameters. And the two malaria interventions both lack mechanisms to capture biological adaptation by the target organism.
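A toy example of the static-vs-dynamic point, with entirely made-up numbers: holding adherence constant over a multi-year period overstates the average relative to a model where it decays.

```python
import math

YEARS = 3
initial_adherence = 0.80   # hypothetical year-0 adherence
decay_rate = 0.25          # hypothetical annual decay constant

# Static CEA-style treatment: one number for the whole period.
static_avg = initial_adherence

# Dynamic treatment: exponential decay, averaged over the modeled years.
dynamic_avg = sum(
    initial_adherence * math.exp(-decay_rate * t) for t in range(YEARS)
) / YEARS
```

With these illustrative values, the dynamic average lands noticeably below the static one, which is the kind of gap the sensitivity ranges above are meant to quantify.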

I wrote up the full results here: tsondo.com/blog/three-interventions-same-structural-patterns/

Phase 1 write-up (methodology explanation): tsondo.com/blog/give-well-red-team/

The full pipeline, prompts, and results are open source: github.com/tsondo/givewell_redteam

Your post mentions you covered six grantmaking areas total. The other three — CMAM, syphilis, and malaria vaccines — could be run through the pipeline as well. It doesn't strictly require your AI output to function; that feeds into novelty filtering and baseline comparison, but the critiques themselves are generated independently. I've reached out separately about this.

Happy to discuss methodology, and happy to hear where you think the pipeline's findings miss the mark — several of the cross-intervention patterns may reflect deliberate modeling choices rather than oversights, and I'd be interested to know which.