In January 2026, GiveWell published a detailed account of their experiment using AI to red-team their charitable intervention research. It's worth reading. They were honest about the results: roughly 15–30% of AI critiques were useful, alongside persistent hallucinations, lost context, and unreliable quantitative estimates. They flagged multi-agent workflows as a future possibility but haven't pursued them.
I think their diagnosis was wrong. The limitations they experienced aren't ChatGPT's fault — they're a consequence of prompt-based, single-pass, monolithic-context architecture. The model is fine. The pipeline isn't.
So I built a different one.
## What I built
A six-stage multi-agent pipeline using only GiveWell's own public materials — their intervention reports, published AI outputs, and cost-effectiveness spreadsheets. No privileged access. The improvement, if there is one, has to come from methodology alone.
The stages: Decomposer → Investigators (one per scoped thread) → Verifier → Quantifier → Adversarial Pair → Synthesizer.
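As a rough sketch, the stages chain as a series of generate-then-filter steps. The function names and data shapes below are my illustrative stand-ins, not the pipeline's actual API:

```python
from dataclasses import dataclass

# Illustrative sketch of the six-stage flow. Names and data shapes are
# hypothetical stand-ins, not the pipeline's actual code.

@dataclass
class Critique:
    claim: str
    citation: str

def decompose(materials):
    # Decomposer: split the source materials into scoped threads,
    # each of which would get its own CONTEXT.md
    return [f"thread:{m}" for m in materials]

def investigate(thread):
    # Investigator: one agent per thread, generating candidate critiques
    return Critique(claim=f"critique of {thread}", citation="report §2.1")

def verify(critique):
    # Verifier: independently re-check every citation and factual claim
    # before anything reaches a human
    return bool(critique.citation)

def run_pipeline(materials):
    threads = decompose(materials)
    candidates = [investigate(t) for t in threads]
    verified = [c for c in candidates if verify(c)]
    # Quantifier, Adversarial Pair, and Synthesizer follow the same
    # filter-then-transform pattern; omitted here for brevity.
    return verified
```

The point of the shape, not the stubs: generation and verification are separate calls with separate contexts, so no single agent both invents a claim and grades it.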
Three design decisions did most of the work:
Scoped context per agent. No agent gets the whole filing cabinet. Each Investigator gets a CONTEXT.md defining what's in scope, what data GiveWell uses, what adjustments are already made, and what not to re-examine. This eliminates the lost-context failure mode GiveWell identified.
Verification as a first-class stage. Every citation and factual claim is independently checked by a separate Verifier agent before reaching a human. Hypothesis generation and evidence retrieval are deliberately separated — this is where hallucinations die.
Quantitative grounding via code execution. The Quantifier runs programmatically against GiveWell's actual CEA spreadsheet. No ungrounded "could reduce cost-effectiveness by 15–25%" without showing which parameter moves and by how much.
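To make the third decision concrete, here is a toy sensitivity check in the spirit of the Quantifier stage. The model and parameter names are hypothetical stand-ins, not GiveWell's actual spreadsheet structure:

```python
# Toy CEA sensitivity check: perturb one parameter and report the resulting
# cost-effectiveness range. All numbers and parameter names are invented
# for illustration; the real Quantifier runs against GiveWell's spreadsheet.

def cost_per_death_averted(params: dict) -> float:
    deaths_averted = (
        params["population"]
        * params["baseline_mortality"]
        * params["mortality_reduction"]
        * params["adherence"]
    )
    return params["program_cost"] / deaths_averted

baseline = {
    "population": 100_000,
    "baseline_mortality": 0.005,
    "mortality_reduction": 0.25,
    "adherence": 0.60,
    "program_cost": 250_000,
}

def sensitivity(params, name, low, high):
    # Recompute the bottom line at a low, central, and high value
    # of a single named parameter.
    results = []
    for value in (low, params[name], high):
        perturbed = dict(params, **{name: value})
        results.append(cost_per_death_averted(perturbed))
    return results  # [pessimistic, central, optimistic]

# e.g. an "adherence decay" critique: what if adherence is 0.40, not 0.60?
pessimistic, central, optimistic = sensitivity(baseline, "adherence", 0.40, 0.75)
```

This is the difference between "could reduce cost-effectiveness by 15–25%" and "moving `adherence` from 0.60 to 0.40 raises cost per death averted from X to Y": the critique names the parameter and the recomputation shows the delta.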
## Phase 1 results: water chlorination
I chose water chlorination first because it's where GiveWell's AI output had hallucinated citations — a concrete baseline to beat.
| Metric | GiveWell baseline | Phase 1 result |
|---|---|---|
| Signal rate | ~15–30% | ~90% (28 of 31 critiques) |
| Hallucination rate | Multiple per run | Zero |
| Novel findings | 1–2 | 4 critical, 3 moderate |
| Quantitative specificity | Ungrounded estimates | Parameter-linked sensitivity ranges |
A note on the signal rate: 30 of 31 critiques passed the Verifier, and 28 of 30 survived adversarial review. I want to be transparent that a ~90% pass rate may indicate the filters are too permissive rather than the Investigators being unusually precise — likely some of both. I'm reporting it honestly rather than as a clean win.
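For concreteness, the headline figure compounds the two sequential filters:

```python
# Arithmetic behind the ~90% signal rate: two filters applied in sequence.
total_critiques = 31
after_verifier = 30      # 30/31 passed independent citation checking
after_adversarial = 28   # 28/30 then survived adversarial review
signal_rate = after_adversarial / total_critiques  # 28/31
```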
The 4 critical findings — Cryptosporidium resistance in chlorinated water, age-specific vulnerability patterns, adherence decay over time, and seasonal transmission gaps — are all connected to specific CEA parameters and survived both verification and adversarial challenge. GiveWell's AI output identified the Cryptosporidium issue but without a verified citation or parameter linkage.
Full write-up, architecture spec, side-by-side comparison with GiveWell's published output, and all seven agent prompts are at tsondo.com/blog/give-well-red-team.
## Two versions for different audiences
If you work at GiveWell or a similar research organization: there's a manual version — sequential prompts designed to run in a Claude Project with no engineering required.
If you're a developer: there's a Python pipeline with the full automated version, including the spreadsheet sensitivity analysis module.
Both are open source. The total API cost for Phase 1 was ~$30.
## What I'd like
Direct engagement from anyone at GiveWell, or others who've worked on AI evaluation pipelines in research contexts. Phases 2 (ITNs) and 3 (SMC) are in progress.
If the methodology is wrong, I want to know. If it's useful, I'd rather GiveWell use it than have it sit in a repo.
Reach me at todd@tsondo.com or @tsondo.com on Bluesky.

I am somewhat concerned about data contamination here: are you sure that the original GiveWell write-up has at no point leaked into your model's analysis? I.e., was any of GiveWell's analysis online before the August 2025 knowledge cutoff for GPT, or did your agents look at the GiveWell report as part of their research?