Results from the AI testing hackathon

Esben Kran; Apart Research; HaydnBelfield

Results from the AI testing hackathon

Esben Kran,

Comments 4

Sorted by

New & upvoted

Matt Putz

Minor nitpick:

I would've found it more helpful to see Haydn's and Esben's judgments listed separately.

HaydnBelfield

We came up with our rankings seperately, but when we compared it turned out we agreed on the top 4 + honourable mention. We then worked on the texts together.

Matt Putz

That's helpful, thanks!

Jay Bailey🔸

Great stuff! Thanks for running this!

Minor point: The Discovering Latent Knowledge Github appears empty.

Also, regarding the data poisoning benchmark, This Is Fine, I'm curious if this is actually a good benchmark for resistance to data poisoning. The actual thing we seem to be measuring here is speed of transfer learning, and declaring that slower is better. While slower speed of learning does increase resistance to data poisoning, it also seems bad for everything else we might want our AI to do. To me, this is basically a fine-tuning benchmark that we've then inverted. (After all, if a neural network always outputted the number 42 no matter what, it would score the maximum on TIF - there is no sequence of wrong prompts that can cause it to elicit the number 38 instead, because it is incapable of learning anything. Nevertheless, this is not where we want LLM's to go in the future.)

A better benchmark would probably be to take data poisoned examples and real fine-tuning, fine-tune the model on each, and compare how much it learns in both cases. With current capabilities, it might not be possible to score above baseline on this benchmark since I don't know if we actually have ways for the model to filter out data poisoned examples - nevertheless, this would bring awareness to the problem and actually measure what we want to measure more accurately.

Comments

More from the author

Curated and popular this week

Counting animals: Stable population size is not equivalent to priority level

abrahamrowe, mal_graham🔸·1w ago·Curated 6d ago·16m read

AI Use Note: Main body text entirely human written. Claude (Opus 4.8) helped develop models of animal life histories in the appendix. Cross-posted from Good Structures. Executive Summary * Animal advocates sometimes make claims like “there are X of this animal...

114

Spiro: an update 2.5 years on and a fundraising ask for expansion

Habiba Banu·1w ago·6m read

Summary Back in November 2023 I posted here to launch Spiro and raise our first $198k. Two and a half years later this is an update and a fundraiser for the next step. The short version: we've now reached over-5,900 people with TB preventive medicine, including over 3,000 children under five years old. Our early results have held up well an...

How (not) to fundraise from Anthropic staff

Jack Lewars·6d ago·7m read

Adapted from my Substack, Funding Anthropalypse. Short version: if you want a share of the coming Anthropic and OpenAI windfall - the $37bn+ that could be in play next year - the way in is to become 'legibly excellent', so the evaluators and donors that frontier lab staff already trust point them to yo...

Recent opportunities to take action

Starting an EA group @ SUNY Binghamton

micahzarin·20h ago·1m read

Marginal Victories: career advising and opportunities for U.S. democracy preservation & political work

Annika Burman 🔸·1d ago·2m read

I'm stepping down as Hive's Executive Director, and we're hiring my successor

SofiaBalderson, Hive·2d ago·3m read

Results from the AI testing hackathon

Results from the AI testing hackathon

Discovering Latent Knowledge in Language Models Without Supervision - extensions and testing

Investigating Training Dynamics via Token Loss Trajectories

Counting Letters, Chaining Premises & Solving Equations: Exploring Inverse Scaling Problems with GPT-3

Trojan detection and implementation on transformers

Honorable mention: The “This is Fine(-tuning)” benchmark

The Alignment Jam