Hide table of contents

158

TL;DR

  1. We are a new AI evals research organization called Apollo Research based in London. 
  2. We think that strategic AI deception – where a model outwardly seems aligned but is in fact misaligned – is a crucial step in many major catastrophic AI risk scenarios and that detecting deception in real-world models is the most important and tractable step to addressing this problem.
  3. Our agenda is split into interpretability and behavioral evals:
    1. On the interpretability side, we are currently working on two main research bets toward characterizing neural network cognition. We are also interested in benchmarking interpretability, e.g. testing whether given interpretability tools can meet specific requirements or solve specific challenges.
    2. On the behavioral evals side, we are conceptually breaking down ‘deception’ into measurable components in order to build a detailed evaluation suite using prompt- and finetuning-based tests. 
  4. As an evals research org, we intend to use our research insights and tools directly on frontier models by serving as an external auditor of AGI labs, thus reducing the chance that deceptively misaligned AIs are developed and deployed. 
  5. We also intend to engage with AI governance efforts, e.g. by working with policymakers and providing technical expertise to aid the drafting of auditing regulations.
  6. We have starter funding but estimate a $1.4M funding gap in our first year. We estimate that the maximal amount we could effectively use is $4-6M $7-10M* in addition to current funding levels (reach out if you are interested in donating). We are currently fiscally sponsored by Rethink Priorities. 
  7. Our starting team consists of 8 researchers and engineers with strong backgrounds in technical alignment research. 
  8. We are interested in collaborating with both technical and governance researchers. Feel free to reach out at info@apolloresearch.ai.
  9. We intend to hire once our funding gap is closed. If you’d like to stay informed about opportunities, you can fill out our expression of interest form.

*Updated June 4th after re-adjusting our hiring trajectory

Research Agenda

We believe that AI deception – where a model outwardly seems aligned but is in fact misaligned and conceals this fact from human oversight – is a crucial component of many catastrophic risk scenarios from AI (see here for more). We also think that detecting/measuring deception is causally upstream of many potential solutions. For example, having good detection tools enables higher quality and safer feedback loops for empirical alignment approaches, enables us to point to concrete failure modes for lawmakers and the wider public, and provides evidence to AGI labs whether the models they are developing or deploying are deceptively misaligned.

Ultimately, we aim to develop a holistic and far-ranging suite of deception evals that includes behavioral tests, fine-tuning, and interpretability-based approaches. Unfortunately, we think that interpretability is not yet at the stage where it can be used effectively on state-of-the-art models. Therefore, we have split the agenda into an interpretability research arm and a behavioral evals arm. We aim to eventually combine interpretability and behavioral evals into a comprehensive model evaluation suite.

On the interpretability side, we are currently working on a new unsupervised approach and continuing work on an existing approach to attack the problem of superposition. Early experiments have shown promising results, but it is too early to tell if the techniques work robustly or are scalable to larger models. Our main priority, for now, is to scale up the experiments and ‘fail fast’ so we can either double down or cut our losses. Furthermore, we are interested in benchmarking interpretability techniques by testing whether given tools meet specific requirements (e.g. relationships found by the tool successfully predict causal interventions on those variables) or solve specific challenges such as discovering backdoors and reverse engineering known algorithms encoded in network weights.

On the model evaluations side, we want to build a large and robust eval suite to test models for deceptive capabilities. Concretely, we intend to break down deception into its component concepts and capabilities. We will then design a large range of experiments and evaluations to measure both the component concepts as well as deception holistically. We aim to start running eval experiments and set up pilot projects with labs as soon as possible to get early empirical feedback on our approach.

Plans beyond technical research

As an evals research org, we intend to put our research into practice by engaging directly in auditing and governance efforts. This means we aim to work with AGI labs to reduce the chance that they develop or deploy deceptively misaligned models. The details of this transition depend a lot on our research progress and our level of access to frontier models. We expect that sufficiently capable models will be able to fool all behavioral evaluations and thus some degree of ‘white box’ access will prove necessary. We aim to work with labs and regulators to build technical and institutional frameworks wherein labs can securely provide sufficient access without undue risk to intellectual property. 

On the governance side, we want to use our technical expertise in auditing, model evaluations, and interpretability to inform the public and lawmakers. We are interested in demonstrating the capacity of models for dangerous capabilities and the feasibility of using evaluation and auditing techniques to detect them. We think that showcasing dangerous capabilities in controlled settings makes it easier for the ML community, lawmakers, and the wider public to understand the concerns of the AI safety community. We emphasize that we will only demonstrate such capabilities if it can be done safely in controlled settings. Showcasing the feasibility of using model evaluations or auditing techniques to prevent potential harms increases the ability of lawmakers to create adequate regulation. 

We want to collaborate with independent researchers, technical alignment organizations, AI governance organizations, and the wider ML community. If you are (potentially) interested in collaborating with us, please reach out

Theory of change

We aim to achieve a positive impact on multiple levels:

  1. Direct impact through research: If our research agenda works out, we will further the state of the art in interpretability and model evaluations. These results could then be used and extended by academics and other labs. We can have this impact even if we never get any auditing access to state-of-the-art models. We carefully consider how to mitigate potential downside risks from our research by controlling which research we publish. We plan to release a document on our policy and processes related to this soon.
  2. Direct impact through auditing: Assuming we are granted some level of access to state-of-the-art models of various AGI labs, we could help them determine if their model is, or could be, strategically deceptive and thus reduce the chance of developing and deploying deceptive models. If, after developing state-of-the-art interpretability tools and behavioral evals and using them to audit potentially dangerous models, we find that our tools are insufficient for the task, we commit to using our knowledge and position to make the inadequacy of current evaluations widely known and to argue for the prevention of potentially dangerous models from being developed and deployed.
  3. Indirect impact through demonstrations: We hope that demonstrating the capacity of models for dangerous capabilities shifts the burden of proof from the AI safety community to the AGI labs. Currently, the AI safety community has the implicit burden of showing that models are dangerous. We would like to move toward a world where the burden is on AGI labs to show why their models are not dangerous (similar to medicine or aviation). Additionally, demonstrations of deception or other forms of misalignment ‘in the wild’ can provide an empirical test bed for practical alignment research and also be used to inform policymakers and the public of the potential dangers of frontier models. 
  4. Indirect impact through governance work: We intend to contribute technical expertise to AI governance where we can. This could include the creation of guidelines for model evaluations, conceptual clarifications of how AIs could be deceptive, suggestions for technical legislation, and more.

We do not think that our approach alone could yield safe AGI. Our work primarily aims to detect deceptive unaligned AI systems and prevent them from being developed and deployed. The technical alignment problem still needs to be solved. The best case for strong auditing and evaluation methods is that it can convert a ‘one-shot’ alignment problem into a many-shot problem where it becomes feasible to iterate on technical alignment methods in an environment of relative safety. 

Status

We have received sufficient starter funding to get us off the ground. However, we estimate that we have a $1.4M funding gap for the first year of operations and could effectively use an additional $7-10M* in total funding. If you are interested in funding us, please reach out. We are happy to address any questions and concerns. We currently pay lower than competitive salaries but intend to increase them as we grow to attract and retain talent.

We are currently fiscally sponsored by Rethink Priorities but intend to spin out after 6-12 months. The exact legal structure is not yet determined, and we are considering both fully non-profit models as well as limited for-profit entities such as public benefit corporations. Whether we will attempt the limited for-profit route depends on the availability of philanthropic funding and whether we think there is a monetizable product that increases safety. Potential routes to monetization would be for-profit auditing or red-teaming services and interpretability tooling, but we are wary of the potentially misaligned incentives of this path. In an optimal world, we would be fully funded by philanthropic or public sources to ensure maximal alignment between financial incentives and safety. 

Our starting members include:

  • Marius Hobbhahn (Director/CEO)
  • Beren Millidge (left on good terms to pursue a different opportunity)
  • Lee Sharkey (Research/Strategy Lead, VP)
  • Chris Akin (COO)
  • Lucius Bushnaq (Research scientist)
  • Dan Braun (Lead engineer)
  • Mikita Balesni (Research scientist)
  • Jérémy Scheurer (Research scientist, joining in a few months)

FAQ

How is our approach different from ARC evals?

There are a couple of technical and strategic differences:

  1. At least early on, we will focus primarily on deception and its prerequisites, while ARC evals is investigating a large range of capabilities including the ability of models to replicate themselves, seek power, acquire resources, and more.
  2. We intend to use a wide range of approaches to detect potentially dangerous model properties right from the start, including behavioral tests, fine-tuning, and interpretability. To the best of our knowledge, ARC evals intends to use these tools eventually but is currently mostly focused on behavioral tools. 
  3. We intend to perform fundamental scientific research in interpretability in addition to developing a suite of behavioral evaluation tools. We think it is important that audits ultimately include evaluations of both external behavior and internal cognition. This seems necessary to make strong statements about cognitive strategies such as deception.

We think our ‘narrow and deep’ approach and ARC’s ‘broad and less deep’ approach are complementary strategies. Even if we had no distinguishing features from ARC Evals other than being a different team, we still would deem it net positive to have multiple somewhat uncorrelated evaluation teams. 

When will we start hiring?

We are starting with an unusually large team. We expect this to work well because many of us have worked together previously, and we all agree on this fairly concrete agenda. However, we still think it is wise to take a few months to consolidate before growing further. 

We think our agenda is primarily bottlenecked by engineering and hands-on research capacity rather than conceptual questions. Furthermore, we think we have the management capacity to onboard additional people. We are thus heavily bottlenecked by funding at the moment and it is unclear when and how many people we can hire in the near future. If this bottleneck is resolved we plan to start hiring soon. 

We have an expression of interest form for potential applicants. You can add your name and we will inform you when we open a hiring round. We might also reach out individually to researchers who are a great fit for collaborations.

Do we ever plan to be a for-profit organization?

This depends on a lot of factors and we have not made any final decisions. In the case where we take a constrained for-profit route, we would legally ensure that we are not obligated to maximize profit and carefully select the donors and investors we work with to make sure they share our AI safety goals and understand our mission. We are currently unsure whether the mission of reducing catastrophic risks from AI can be fully compatible with a for-profit setup. We think offering auditing or red-teaming services or providing interpretability tools are candidates for monetizable strategies that align with reducing catastrophic risks but trying to maximize profits from these strategies introduces obvious perverse incentives which we need to think carefully about how to mitigate.

Isn’t this research dangerous?

Some people have argued that behavioral evals that investigate dangerous capabilities could be a cause of risk in itself, e.g. that we accidentally create a dangerous deceptive model through our efforts or that we create a public blueprint for others to create one. We think this is a plausible concern. We have two main considerations.

  1. There are trade-offs between risk and information. Right now, we think that state-of-the-art models are not sufficiently capable to cause the serious risks that we are worried about, but we expect this to change in the future. The information we get from model evaluations should shed light on upcoming dangerous capabilities. Therefore the tradeoff currently seems worth making. But that might change in future and we will reassess our stance on this regularly. 
  2. We can simulate potentially dangerous capabilities in controlled settings. By default, this means that we are very careful about which experiments we run and the environments they are run in. Furthermore, many dangerous capabilities are hard to test for directly. Therefore, our plan is to split them up into narrower capabilities that are less dangerous. For example, rather than testing if a model is capable of deception, we can test if it has goals, instrumental reasoning, and situational awareness. This is analogous to pointing out that someone has enough Uranium for an atomic bomb but isn’t capable of building it. However, in case we think it is the best strategy after taking the risks involved into account, we would consider doing end-to-end evaluations, e.g. directly testing for deception in carefully controlled environments.

We are also aware that good interpretability research might eventually run the risk of improving capabilities. We have thought a considerable amount about this in the past and are making concrete plans to mitigate the risks. Overall, however, we think that current interpretability research is strongly net positive for safety in expectation. 


 

Comments5


Sorted by Click to highlight new comments since:
[anonymous]7
1
0

Congratulations on launching!

On the governance side, one question I'd be excited to see Apollo (and ARC evals & any other similar groups) think/write about is: what happens after a dangerous capability eval goes off? 

Of course, the actual answer will be shaped by the particular climate/culture/zeitgeist/policy window/lab factors that are impossible to fully predict in advance.

But my impression is that this question is relatively neglected, and I wouldn't be surprised if sharp newcomers were able to meaningfully improve the community's thinking on this. 

Congratulations, I’m happy to see this launching!

How did you decide to be a not-for-profit? I imagine that the evals/audit work will likely be very lucrative at some point?

As stated in the post itself (section "Status"), we are not yet decided about this and are considering both non-profit and public-benefit-type for-profit style organizations. 

This is great to see and the backgrounds of your team members look impressive. I really hope someone will step in to fund this.

Curated and popular this week
 ·  · 5m read
 · 
[Cross-posted from my Substack here] If you spend time with people trying to change the world, you’ll come to an interesting conundrum: Various advocacy groups reference previous successful social movements as to why their chosen strategy is the most important one. Yet, these groups often follow wildly different strategies from each other to achieve social change. So, which one of them is right? The answer is all of them and none of them. This is because many people use research and historical movements to justify their pre-existing beliefs about how social change happens. Simply, you can find a case study to fit most plausible theories of how social change happens. For example, the groups might say: * Repeated nonviolent disruption is the key to social change, citing the Freedom Riders from the civil rights Movement or Act Up! from the gay rights movement. * Technological progress is what drives improvements in the human condition if you consider the development of the contraceptive pill funded by Katharine McCormick. * Organising and base-building is how change happens, as inspired by Ella Baker, the NAACP or Cesar Chavez from the United Workers Movement. * Insider advocacy is the real secret of social movements – look no further than how influential the Leadership Conference on Civil Rights was in passing the Civil Rights Acts of 1960 & 1964. * Democratic participation is the backbone of social change – just look at how Ireland lifted a ban on abortion via a Citizen’s Assembly. * And so on… To paint this picture, we can see this in action below: Source: Just Stop Oil which focuses on…civil resistance and disruption Source: The Civic Power Fund which focuses on… local organising What do we take away from all this? In my mind, a few key things: 1. Many different approaches have worked in changing the world so we should be humble and not assume we are doing The Most Important Thing 2. The case studies we focus on are likely confirmation bias, where
 ·  · 2m read
 · 
I speak to many entrepreneurial people trying to do a large amount of good by starting a nonprofit organisation. I think this is often an error for four main reasons. 1. Scalability 2. Capital counterfactuals 3. Standards 4. Learning potential 5. Earning to give potential These arguments are most applicable to starting high-growth organisations, such as startups.[1] Scalability There is a lot of capital available for startups, and established mechanisms exist to continue raising funds if the ROI appears high. It seems extremely difficult to operate a nonprofit with a budget of more than $30M per year (e.g., with approximately 150 people), but this is not particularly unusual for for-profit organisations. Capital Counterfactuals I generally believe that value-aligned funders are spending their money reasonably well, while for-profit investors are spending theirs extremely poorly (on altruistic grounds). If you can redirect that funding towards high-altruism value work, you could potentially create a much larger delta between your use of funding and the counterfactual of someone else receiving those funds. You also won’t be reliant on constantly convincing donors to give you money, once you’re generating revenue. Standards Nonprofits have significantly weaker feedback mechanisms compared to for-profits. They are often difficult to evaluate and lack a natural kill function. Few people are going to complain that you provided bad service when it didn’t cost them anything. Most nonprofits are not very ambitious, despite having large moral ambitions. It’s challenging to find talented people willing to accept a substantial pay cut to work with you. For-profits are considerably more likely to create something that people actually want. Learning Potential Most people should be trying to put themselves in a better position to do useful work later on. People often report learning a great deal from working at high-growth companies, building interesting connection
 ·  · 31m read
 · 
James Özden and Sam Glover at Social Change Lab wrote a literature review on protest outcomes[1] as part of a broader investigation[2] on protest effectiveness. The report covers multiple lines of evidence and addresses many relevant questions, but does not say much about the methodological quality of the research. So that's what I'm going to do today. I reviewed the evidence on protest outcomes, focusing only on the highest-quality research, to answer two questions: 1. Do protests work? 2. Are Social Change Lab's conclusions consistent with the highest-quality evidence? Here's what I found: Do protests work? Highly likely (credence: 90%) in certain contexts, although it's unclear how well the results generalize. [More] Are Social Change Lab's conclusions consistent with the highest-quality evidence? Yes—the report's core claims are well-supported, although it overstates the strength of some of the evidence. [More] Cross-posted from my website. Introduction This article serves two purposes: First, it analyzes the evidence on protest outcomes. Second, it critically reviews the Social Change Lab literature review. Social Change Lab is not the only group that has reviewed protest effectiveness. I was able to find four literature reviews: 1. Animal Charity Evaluators (2018), Protest Intervention Report. 2. Orazani et al. (2021), Social movement strategy (nonviolent vs. violent) and the garnering of third-party support: A meta-analysis. 3. Social Change Lab – Ozden & Glover (2022), Literature Review: Protest Outcomes. 4. Shuman et al. (2024), When Are Social Protests Effective? The Animal Charity Evaluators review did not include many studies, and did not cite any natural experiments (only one had been published as of 2018). Orazani et al. (2021)[3] is a nice meta-analysis—it finds that when you show people news articles about nonviolent protests, they are more likely to express support for the protesters' cause. But what people say in a lab setting mig