I'm advertising this RFP in my professional capacity, but the project ideas below are my personal views: my colleagues may not endorse them. 

There's just one week left before applications for our RFP on Improving Capability Evaluations close! The rest of this post is a call to apply, and some concrete suggestions of projects I'd be excited to see.

If you're feeling inspired already, then:

Apply now

Areas I'm most excited about

We've received many strong proposals around building new GCR-relevant benchmarks (which is great!), but there's still lots of low-hanging fruit in two key areas:

  1. Third-party access & security solutions (I'm most excited about this)
  2. Science of evaluations & capabilities development

Below I briefly motivate these areas and give some example projects. For more details, check out their sections in our RFP (security/access, science of evals). 

Third-party access & security solutions

As AI companies move to stricter security levels, meaningful independent evaluation will become increasingly difficult. I don't think we're prepared for this: not only do we lack good plans for mitigating the security risks of giving third parties access to proprietary information or models, we don't yet know what we should be asking of companies.

There's scope here for both technical and non-technical projects, from thinking clearly about evaluation regimes or information-sharing commitments, to work that enables secure, low-trust third-party evaluations and red-teaming.

Concrete projects I'd love to see:

1. A thorough analysis of what info/access evaluators need to make different safety assessments

A project mapping out what minimum information about model training evaluators need to assess safety cases, what levels of model access enable what kinds of evaluations, and how the security costs of sharing this information can be mitigated. 

An ideal version would spell out the strongest possible safety guarantees third parties can make with different levels of information/model access, identify the costs of increasing transparency or access, and propose arrangements that maximise the strength of safety assurances third parties can make, for a given cost.

2. Technical solutions for verifiable model auditing

Work on ZKPs, secure enclaves, confidential computing, or other approaches that could let evaluators verify model properties without compromising model security.

I'm excited about methods for:

  1. Trustless model evaluations without revealing model architecture
  2. Verifying model identity (i.e., which model you're running inference on) without weights access
  3. Providing access beyond the API level to third parties, ideally without revealing model weights
  4. Enabling fine-tuning for third parties
  5. Obfuscating queries/outputs from the model provider, where appropriate

3. Comparing different evaluation governance frameworks

I'd love to see a careful analysis comparing different approaches to evaluations, e.g. lab self-evaluation vs. government-run evaluations vs. trusted third-party approaches. Useful aims might include understanding how each approach might fail, what tradeoffs each involves, and what kinds of legal and regulatory protections would enable them.

4. Transparency standards for evaluation conditions

Work on understanding the most important information for evaluators to report and standardising how evaluation conditions are reported, including e.g.:

  • Prompt engineering effort
  • Scaffolding and elicitation work
  • Tool access
  • Question format
  • Inference compute spend
  • Pass@k vs. best-of-k conditions
  • Time budget
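
A minimal sketch of what a machine-readable record of these conditions could look like. The class name, field names, units, and example values are all hypothetical, chosen only to mirror the list above; they are not an existing standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class EvalConditions:
    """Hypothetical record of the conditions behind one reported eval score."""

    prompt_engineering_hours: float          # effort spent iterating on prompts
    scaffolding: str                         # description of scaffolding/elicitation work
    tools: List[str] = field(default_factory=list)  # tools the model could call
    question_format: str = "free_response"   # e.g. "multiple_choice"
    inference_compute_usd: float = 0.0       # inference spend for the whole run
    sampling: str = "pass@1"                 # e.g. "pass@10" or "best-of-8"
    time_budget_minutes: float = 0.0         # wall-clock limit per task


def to_report(conditions: EvalConditions) -> str:
    """Serialise the record as stable, diff-friendly JSON."""
    return json.dumps(asdict(conditions), indent=2, sort_keys=True)


# Illustrative values only; they do not describe any real evaluation.
example = EvalConditions(
    prompt_engineering_hours=4.0,
    scaffolding="agent loop with retries",
    tools=["python", "web_search"],
    question_format="multiple_choice",
    inference_compute_usd=120.0,
    sampling="best-of-8",
    time_budget_minutes=30.0,
)
print(to_report(example))
```

Publishing a record like this alongside each score would let readers check whether two results were produced under comparable conditions before comparing them.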

Science of evaluations & capabilities development

Current evaluations are more like "model whispering" than science. Results are significantly affected by prompting, scaffolding, tool access, question format, inference compute spend, and choice of test conditions.
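
To make the "choice of test conditions" point concrete, here is a sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The function name and the sample counts are illustrative assumptions, not figures from any real evaluation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up example: 20 samples on one task, 2 of them correct.
print(f"pass@1  = {pass_at_k(20, 2, 1):.2f}")   # pass@1  = 0.10
print(f"pass@10 = {pass_at_k(20, 2, 10):.2f}")  # pass@10 = 0.76
```

The same 20 samples support a headline figure of either 10% or 76% depending on k, which is why the sampling condition needs to be reported alongside the score.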

I'd love to see work that helps make interpreting eval results more precise and reliable, or at least establishes the difficulty of making certain kinds of claims.

Concrete projects I'd love to see:

1. Quantifying post-training variables and their effects

A project systematically measuring how different post-training enhancements (factors like scaffolding effort, inference compute, fine-tuning strategies) affect evaluation results, and ideally developing scaling laws for model performance given changes in these factors. 
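
As a rough sketch of the simplest version of such an analysis, the snippet below fits score = a + b * log10(compute) by ordinary least squares. The log-linear form is an assumption made for illustration, not a finding, and the data points are synthetic, invented purely to exercise the function.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(compute: Sequence[float], scores: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of score = intercept + slope * log10(compute)."""
    if len(compute) != len(scores) or len(compute) < 2:
        raise ValueError("need at least two paired observations")
    xs = [math.log10(c) for c in compute]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("compute values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic data (not real measurements): relative inference compute vs. eval score.
compute = [1, 10, 100, 1000]
scores = [0.20, 0.30, 0.40, 0.50]
intercept, slope = fit_log_linear(compute, scores)
print(f"score gain per 10x compute: {slope:.2f}")
```

A real project would need to test whether this functional form holds at all, and would likely fit each factor (scaffolding effort, fine-tuning, inference compute) separately before looking at interactions.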

2. Password-locked models

Instead of trying to uncover maximal abilities directly, hide the model's current capabilities and then try to uncover them using various elicitation techniques. This establishes ground truth for studying and comparing elicitation methods.

3. Statistical methodology for agent benchmarks

Extending statistical best practices from QA to agent evaluations.
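
One example of what this could involve: agent benchmarks often have few tasks with several runs per task, so runs on the same task are not independent. The sketch below averages within each task first and computes the standard error across tasks. The function name and the outcomes are hypothetical, and the normal-approximation interval is a simplifying assumption that is crude for so few tasks.

```python
import math
from statistics import mean, stdev
from typing import Dict, List, Tuple


def task_clustered_ci(
    runs_by_task: Dict[str, List[int]], z: float = 1.96
) -> Tuple[float, float, float]:
    """Mean success rate with a normal-approximation CI, clustered by task.

    runs_by_task maps a task id to its 0/1 outcomes across repeated runs.
    """
    task_means = [mean(outcomes) for outcomes in runs_by_task.values()]
    if len(task_means) < 2:
        raise ValueError("need at least two tasks to estimate a standard error")
    estimate = mean(task_means)
    # Treat tasks, not individual runs, as the independent units.
    std_err = stdev(task_means) / math.sqrt(len(task_means))
    return estimate, estimate - z * std_err, estimate + z * std_err


# Made-up outcomes for four tasks with four runs each.
runs = {
    "task_a": [1, 1, 0, 1],
    "task_b": [0, 0, 0, 1],
    "task_c": [1, 1, 1, 1],
    "task_d": [0, 1, 0, 0],
}
estimate, low, high = task_clustered_ci(runs)
print(f"success rate {estimate:.2f} (95% CI {low:.2f} to {high:.2f})")
```

Treating all 16 runs as independent would understate the uncertainty here, since most of the variation is between tasks rather than between runs of the same task.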

4. Measuring and narrowing the elicitation gap

There is a significant gap between the capabilities that leading AI companies can elicit and those that third parties can. We need better approaches to measuring and narrowing this gap, especially as it relates to dangerous capabilities.

5. Understanding when to (re)evaluate models

Models improve through not just new training runs but also post-training enhancements, which are cheap and can happen ~continuously. We need better heuristics for when advances in post-training techniques should trigger re-evaluation of models.

Some other ideas I'm excited about

  • Red-teaming evals/bounties for beating high scores - For some evaluations, I'd be excited about incentivising beating current high scores, in order to establish more realistic upper bounds on capabilities.
  • Measuring research uplift for top AI R&D talent - e.g. surveys trying to understand how much frontier models boost productivity for the top few percent of AI R&D researchers.
  • Realistic demos of strong evaluations - Clear, accurate demos of AI agents performing tasks from difficult agentic benchmarks.
  • Realistic demos of scheming, alignment faking, and control evasion - Going beyond toy examples to show realistic instrumental behaviours in deployment-like settings.

This isn't exhaustive!

The above list is a subset of projects I'm particularly excited about. There are many other valuable projects not mentioned here, and great ideas I haven't even thought of! For more details, check out the full RFP. For other project ideas, you may also enjoy reading Marius' list of concrete problems in evals.

If you've been considering applying to our RFP, now's the time! The initial Expression of Interest takes ≤1 hour, and we can refine promising ideas together during the full proposal stage.

Deadline: April 1st, 2025.

Apply now

Comments (2)

One thing I'd be much more excited to see than "quantifying post-training variables and their effects" (but which I'm not planning to pursue) would be to take an old model, map the post-training enhancements discovered over time, and see how the maximum elicitable capabilities change.

I'm worried that quantifying post-training variables directly has significant capabilities externalities and that there's no obvious limit to how far post-training can be pushed.

cb

I'd also be excited about projects aiming to do this.

One advantage that quantifying post-training variables on frontier models has over this idea is that you also get a better sense of what the upper bound of performance on some eval looks like, as well as some information about the returns from investing in post-training enhancements. I think if this were done responsibly on some well-chosen evals, it'd be helpful information to have. (Though my colleagues may disagree.)

If people outside of frontier labs were working on this, I'd be surprised if it significantly accelerated capabilities, though I can imagine it still making sense to keep the methodology private.
