Results from the AI testing hackathon

Esben Kran; HaydnBelfield; Apart Research

This is a linkpost for https://alignmentjam.com/post/results-from-the-ai-testing-hackathon

We (Apart Research) ran a hackathon for AI testing research projects with 11 projects submitted by 34 participants between the 16th and 18th December. Here we share the winning projects. See them all here. In summary:

Found that unsupervised latent knowledge representation is generalizable and takes the first steps towards a benchmark using the ETHICS ambiguous / unambiguous examples with latent knowledge evaluation.
Created a new way to use token loss trajectories as a marker for targeting our interpretability methods towards a focus area.
Investigated three potential inverse scaling phenomena: Counting letters, chaining premises and solving equations. Found incidental inverse scaling on one of them and U-shaped scaling on another.
Implemented Trojans into Transformer models and used a gradient arithmetic technique to combine multiple Trojan triggers into one Transformer model.
(honorable mention) Invented a way to test how quickly models become misaligned by negative example fine-tuning.

Thank you to Zaki, Fazl, Rauno, Charbel, Nguyen, more jam site organizers, and the participants for making it all possible.

Discovering Latent Knowledge in Language Models Without Supervision - extensions and testing

By Agatha Duzan, Matthieu David, Jonathan Claybrough

Abstract: Based on the paper "Discovering Latent Knowledge in Language Models without Supervision" this project discusses how well the proposed method applies to the concept of ambiguity.

To do that, we tested the Contrast Consistent Search method on a dataset which contained both clear cut (0-1) and ambiguous (0,5) examples: We chose the ETHICS-commonsense dataset.

The global conclusion is that the CCS approach seems to generalize well in ambiguous situations, and could potentially be used to determine a model’s latent knowledge about other concepts.

These figures show how the CCS results for last layer activations splits into two groups for the non-ambiguous training samples while the ambiguous test samples on the ETHICS dataset reveals the same ambiguity of latent knowledge by the flattened Gaussian inference probability distribution.

Haydn & Esben’s judging comment: This project is very good in investigating the generality of unsupervised latent knowledge learning. It also seems quite useful as a direct test of how easy it is to extract latent knowledge and provides an avenue towards a benchmark using the ETHICS unambiguous/ambiguous examples dataset. Excited to see this work continue!

Read the report and the code (needs updating).

Investigating Training Dynamics via Token Loss Trajectories

By Alex Foote

Abstract: Evaluations of ML systems typically focus on average statistical performance on a dataset measured at the end of training. However, this type of evaluation is relatively coarse, and does not provide insight into the training dynamics of the model.

We present tools for stratifying tokens into groups based on arbitrary functions and measuring the loss on these token groups throughout the training process of a Language Model. By evaluating the loss trajectory of meaningful groups of tokens throughout the training process, we can gain more insight into how the model develops during training, and make interesting observations that could be investigated further using interpretability tools to gain insight into the development of specific mechanisms within a model.

We use this lens to look at the training dynamics of the region in which induction heads develop. We also zoom in on a specific region of training where there is a spike in loss and find that within this region the majority of tokens follow the loss trajectory of a spike, but a small set follow the inverse trajectory.

Haydn & Esben’s judging comment: This is really good work in testing for locations to look at using interpretability tools! Further developing this idea into something that can be used as a testing suite would be quite interesting and the ideas from the project seem useful at scale.

Read the report and python notebook.

Counting Letters, Chaining Premises & Solving Equations: Exploring Inverse Scaling Problems with GPT-3

By D. Chipping, J. Harding, H. Mannering, P. Selvaraj

Abstract: Language models generally show increased performance in a variety of tasks as their size increases. But there are a class of problems for which increase in model size results in worse performance. These are known as inverse scaling problems.

In this work, we examine how GPT-3 performs on tasks that involve the use of multiple, interconnected premises and those that require the counting of letters within given strings of text as well as solving simple multi-operator mathematical equations.

Haydn & Esben’s judging comment: These are really interesting investigations into inverse scaling! Each of these three tasks can be extended quite liberally. It’s pretty epic to find two relatively good inversely scaling phenomena while the third might end up working with a bit of a reframing. Curious to see more generality testing for the inverse scaling.

See the dataset generation code, the graph plotting code, and the report.

Trojan detection and implementation on transformers

By Clément Dumas, Charbel-Raphaël Segerie, Liam Imadache

Abstract: Neural Trojans are one of the most common adversarial attacks out there. Even though they have been extensively studied in computer vision, they can also easily target LLMs and transformer based architecture. Researchers have designed multiple ways of poisoning datasets in order to create a backdoor in a network. Trojan detection methods seem to have a hard time keeping up with those creative attacks. Most of them are based on the analysis and cleaning of the datasets used to train the network.

There doesn't seem to be some accessible and easy to use benchmark to test Trojan attacks and detection algorithms, and most of these algorithms need the knowledge of the training dataset.

We therefore decided to create a small benchmark of trojan networks that we implemented ourselves based on the literature, and use it to test some existing and new detection techniques.

[from the authors]: The colab contains the code to create the trojans described below, but you will also find some mysterious networks containing trojans that you can try to detect and explain. We will provide 50 euros for the first one who will be able to propose a method to find our trigger!

Haydn & Esben’s judging comment: Great to see so many replications of papers in one project and a nice investigation into Trojan triggers in training data. The proposed use of Task Vectors is quite interesting and the conclusion about Trojan attacks >> defenses is a good observation.

Read the report and the Colab. Check out the created Trojans (if you dare).

Honorable mention: The “This is Fine(-tuning)” benchmark

By Jan Wehner, Joep Storm, Tijmen van Graft, Jaouad Hidayat

Abstract: Large language models (LLMs) build up models of the world and of tasks leading them to impressive performance on many benchmarks. But how robust are these models against bad data?

Motivated by an example where an actively learning LLM is being fed bad data for a task by malicious actors, we propose a benchmark, This Is Fine (TIF), which measures LLM’s robustness against such data poisoning. The benchmark takes multiple popular benchmark tasks in NLP, arithmetics, "salient-translation-error-detection" and "phrase-relatedness" and records how the performance of an LLM degrades as it is being fine-tuned on wrong examples of this task.

Further, it measures how the fraction of fine-tuning data which is wrong influences the performance. We hope that an adaptation of this benchmark will enable researchers to test the robustness of the representations learned by LLMs and can prevent data poisoning attacks on high stakes systems.

Haydn & Esben’s judging comment: An interesting approach to measuring adversarial data impacts! It’s probably hard to generalize this without creating a new benchmark per task but thinking more about the general direction of performance falloff is very encouraged.

Read the report.

The Alignment Jam

This alignment hackathon was held online and in five locations at the same time: Paris, Mexico City, Aarhus, Netherlands, and Oxford. We started with an introduction to AI governance and why testing AI is important right now by Haydn Belfield along with a short introduction to how benchmarks for safety might be created with Esben Kran (recording).

We had 63 signups, 34 submitters and 11 final entries (1 omitted due to info hazards). $2,200 in prizes were awarded by us judges, Haydn Belfield and Esben Kran.

The post hackathon survey saw a 10% increase in percentage points for working on AI safety and a 8 of 10 rating for how likely the participants would be to recommend joining the hackathon to their friends and colleagues. The testimonial feedback was generally positive, especially about the starter resources provided.

Join the upcoming AI neuroscience hackathon and check out the Alignment Jam website.

Matt PutzJan 2 20232

Minor nitpick:

I would've found it more helpful to see Haydn's and Esben's judgments listed separately.

HaydnBelfieldJan 3 20234

We came up with our rankings seperately, but when we compared it turned out we agreed on the top 4 + honourable mention. We then worked on the texts together.

Matt PutzJan 3 20231

That's helpful, thanks!

Jay Bailey🔸Jan 2 20232

Great stuff! Thanks for running this!

Minor point: The Discovering Latent Knowledge Github appears empty.

Also, regarding the data poisoning benchmark, This Is Fine, I'm curious if this is actually a good benchmark for resistance to data poisoning. The actual thing we seem to be measuring here is speed of transfer learning, and declaring that slower is better. While slower speed of learning does increase resistance to data poisoning, it also seems bad for everything else we might want our AI to do. To me, this is basically a fine-tuning benchmark that we've then inverted. (After all, if a neural network always outputted the number 42 no matter what, it would score the maximum on TIF - there is no sequence of wrong prompts that can cause it to elicit the number 38 instead, because it is incapable of learning anything. Nevertheless, this is not where we want LLM's to go in the future.)

A better benchmark would probably be to take data poisoned examples and real fine-tuning, fine-tune the model on each, and compare how much it learns in both cases. With current capabilities, it might not be possible to score above baseline on this benchmark since I don't know if we actually have ways for the model to filter out data poisoned examples - nevertheless, this would bring awareness to the problem and actually measure what we want to measure more accurately.

Effective Altruism Forum
EA Forum