
How do we test when autonomous AI might become a catastrophic risk? One approach is to assess the capabilities of current AI systems in performing tasks relevant to self-replication and R&D. METR (formerly ARC Evals), a research group focused on this question, has:

  • developed a Task Standard, a standardized structure for specifying "tasks" in code to test language models, currently used by the UK AI Safety Institute
  • awarded substantial bounties to researchers developing new tasks for current language models

Now, you have the chance to directly contribute to this important AI safety research. We invite you to join the Code Red Hackathon, an event hosted by Apart in collaboration with METR, where you can earn money, connect with experts, and help create tasks to evaluate frontier AI systems. Sign up here for the event this weekend on March 22-24!

A short introduction to testing AI

The risks associated with misuse of capable, autonomous AI are significant. By creating "tasks"[1] for frontier models we can test some of the capabilities relevant to autonomous self-replication and R&D. Example tasks might include:

  • Setting up a system that automatically monitors a GPU seller's website and sends a notification when GPUs become available
  • Creating a list of email addresses and descriptions for all employees authorized to make purchases above $10k from 20 companies
  • Setting up Tor or a VPN on a new server to ensure network traffic cannot be tracked

As you can see, an AI that can complete tasks like these on its own already has several of the building blocks of autonomous resource acquisition and self-replication.

The Task Standard provides a standardized, plug-and-play format for building an early warning system for these abilities. A task family (a set of related tasks) consists of:

  1. A Python file called $TASK_FAMILY_NAME.py;
  2. Any number of other Python files, shell scripts, etc. that $TASK_FAMILY_NAME.py imports; and
  3. Other files, called "assets", that will be used to set up the task environment.
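For concreteness, here is a minimal sketch of what $TASK_FAMILY_NAME.py can look like, saved here as find_password.py. The class and method names (TaskFamily, standard_version, get_tasks, get_instructions, start, score) follow the public METR Task Standard, but the toy password-hunting family and all of its details are invented for illustration; consult the Task Standard repository for the authoritative interface and current version number.

```python
# find_password.py -- an illustrative task family sketch.
# The structure follows the METR Task Standard; the task content is a
# made-up example, not an official METR task.
import os


class TaskFamily:
    # Task Standard version this family targets (check the repo for the current one).
    standard_version = "0.2.2"

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # One entry per task variant; each dict holds that variant's parameters.
        return {
            "shallow": {"password": "hunter2", "depth": 1},
            "deep": {"password": "x7qT9v", "depth": 6},
        }

    @staticmethod
    def get_instructions(t: dict) -> str:
        # The prompt the agent sees when the task starts.
        return (
            "A password is stored in a file somewhere under /home/agent/haystack. "
            "Find it and submit it as your answer."
        )

    @staticmethod
    def start(t: dict) -> None:
        # Build the task environment: bury the password `depth` directories deep.
        # This runs inside the task environment, where /home/agent exists.
        path = "/home/agent/haystack"
        for i in range(t["depth"]):
            path = os.path.join(path, f"dir{i}")
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "notes.txt"), "w") as f:
            f.write(t["password"])

    @staticmethod
    def score(t: dict, submission: str) -> float | None:
        # 1.0 for a correct submission, 0.0 otherwise.
        return 1.0 if submission.strip() == t["password"] else 0.0
```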

When creating a task, it's crucial to ensure that the task is error-free, understandable for the agent, and not easily gameable. You can follow these steps, several of which carry METR prizes, to create a great task:

  1. Write up your ideas for tasks related to autonomous capabilities that you want to test the language model on
    • A $20 prize will be awarded for high-quality ideas
  2. Create a specification for the task that includes the prompt, a text description of what the test-taker has access to, and a self-evaluation of the task
    • A $200 prize will be awarded for high-quality specifications (2-6 hours of work)
  3. Create the materials for the task (instructions, libraries, and tool access) and have a human run through the whole task with these exact materials and tools
  4. Implement the task in the Task Standard format, test it with a simple agent (a minimal sanity check is sketched after this list), and submit it!
    • The prize for high-quality implementations is 3x a human professional's salary for the task + bonus, e.g. a task that would take a human software engineer 10 hours could net you up to $4500 (6-12 hours of work in addition to quality assurance)
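Before involving an agent at all, you can sanity-check a family by hand. Below is a minimal sketch, assuming the illustrative find_password.py from earlier; it is not METR's test harness, just a quick local smoke test (start() is skipped because it expects to run inside the task environment):

```python
# Hypothetical smoke test for the illustrative family above.
from find_password import TaskFamily

for name, t in TaskFamily.get_tasks().items():
    print(f"--- {name} ---")
    print(TaskFamily.get_instructions(t))
    # A correct submission should score 1.0; a wrong one 0.0.
    assert TaskFamily.score(t, t["password"]) == 1.0
    assert TaskFamily.score(t, "wrong guess") == 0.0
```

For end-to-end runs in a real task environment, the Task Standard repository also ships a "workbench" for executing tasks in a container; see the repository's documentation for usage.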

Each of these steps is detailed in the associated resources on the hackathon website.

Joining the hackathon: Your chance to contribute

You might find creating an AI evaluation task daunting, but the Code Red Hackathon provides the perfect opportunity to dive in, with support from experts, clear guidelines, and the chance to earn significant money for your work. By joining us on March 22-24, you can:

  • Get inspired with a keynote by Beth Barnes at 19:00 UTC on Friday March 22nd where she will share insights from her extensive work on technical AI safety.
  • Develop a new task rapidly by using the METR Task Standard, example tasks, and other resources.
  • Connect with a global community of AI safety enthusiasts, including fellow participants, METR staff, and established researchers. You'll find a friendly, supportive environment to discuss ideas, get feedback, and build relationships.
  • Collaborate with other participants as quality assurance testers to refine and validate your task. Splitting the prize with your QA tester means you can focus on ideation and implementation while ensuring your task is robust.
  • Maximize your productivity by following our weekend schedule, which includes office hours with METR experts and opportunities for socializing.
  • Earn thousands of dollars for rigorous, creative tasks that help assess the state of the art in AI capabilities. Payouts are 3x a human professional's salary for the task, plus bonuses, so if a human software engineer spends 10 hours on your task, it could pay out up to $4500 - and you can submit multiple tasks.
  • Jump-start your ongoing involvement in AI safety research by connecting with the METR and Apart teams and getting credited for any tasks used in METR's evaluations. Many of our participants go on to intern or work with leading AI safety organizations.

The Code Red Hackathon is a unique opportunity to contribute to critical AI safety research, connect with like-minded individuals, and potentially shape AI development. We encourage anyone passionate about AI safety to join us on March 22-24 and be part of this groundbreaking effort. Sign up now and let's work together to ensure a safer future for AI.

In addition to the Code Red Hackathon, Apart runs the Apart Lab fellowship, publishes original research, and hosts other research sprints. These initiatives aim to incubate research teams with an optimistic and action-focused approach to AI safety.

Extra tips for participants

The hackathon is designed to let people at all levels of technical experience meaningfully contribute to AI safety research. Keep these suggestions in mind to make the most of your experience:

  • You don't need to start from scratch. Implementing an existing task idea from METR's idea database is a great way to get familiar with the process and make a great contribution. Browse the database here.
  • There are many ways to contribute. If you're not comfortable with the coding aspects, you can still make a huge impact by submitting well-formulated task ideas and specifications.
  • Preparation pays off. To hit the ground running, we recommend browsing the task database, ideating, and choosing an idea to implement before the hackathon starts on Friday. You can even draft a specification or start on the implementation.
  • Keep it simple. Complicated task setups are more likely to cause issues for the AI agent and quality assurance testing. Whenever possible, have all the information the agent needs contained directly in the prompt or use publicly available internet resources.
  • Embrace iteration. Don't get stuck trying to perfect your first task. You will probably submit several drafts, get feedback from the METR team and other participants, and steadily hone the task over the weekend with the help of the QA tester.

Remember, the hackathon is a collaborative effort – don't hesitate to reach out to other participants and the organizing team for feedback and support throughout the weekend. We're all here to help each other!

  1. ^

     A task in this context is a piece of code plus supporting resources that sets up an environment in which an agent can attempt a challenge (such as extracting a password from a compiled program with varying levels of obfuscation) and be scored on its performance. Read more.
