TL;DR: You can apply here.

About Apollo

Apollo Research is a London-based AI safety organization. We focus on high-risk failure modes, particularly deceptive alignment, and intend to audit frontier AI models. Our primary objective is to minimize catastrophic risks associated with advanced AI systems that may exhibit deceptive behavior, where misaligned models appear aligned in order to pursue their own objectives. Our approach involves conducting fundamental research on interpretability and behavioral model evaluations, which we then use to audit real-world models. Ultimately, our goal is to leverage interpretability tools for model evaluations, as we believe that examining model internals in combination with behavioral evaluations offers stronger safety assurances compared to behavioral evaluations alone.

Culture: At Apollo, we aim for a culture that emphasizes truth-seeking; being goal-oriented; giving and receiving constructive feedback; and being friendly and helpful. If you’re interested in more details about what it’s like working at Apollo, you can find more information here.

Context & Agenda

We believe we have converged on a good technical agenda for evals and interpretability. We are therefore not currently bottlenecked by ideas and could readily scale and parallelize existing research efforts with more people. We are confident that additional hires would help both agendas progress faster and that we can integrate them into our existing teams.

On the interpretability side, we’re pursuing a new approach to mechanistic interpretability that we’re not yet publicly discussing due to potential infohazard concerns. However, we expect the day-to-day work of most scientists and engineers to be comparable to existing public interpretability projects such as sparse coding, indirect object identification, causal scrubbing, toy models of superposition, or transformer circuits, as well as converting research insights into robust tools that can scale to very large models.

For the interpretability team, we aim to make 1-3 offers depending on funding and fit.

To evaluate models, we employ a variety of methods. First, we intend to evaluate model behavior using basic prompting techniques and agentic scaffolding, similar to AutoGPT. Second, we aim to fine-tune models to study their generalization capabilities and elicit their dangerous potential within a safe, controlled environment (we have several security policies in place to mitigate potential risks). At a high level, our current approach to evaluating deceptive alignment consists of breaking down the capabilities it requires and tracking how these scale with increasingly capable models. Some of these capabilities include situational awareness, stable non-myopic preferences, and particular kinds of generalization. In addition, we plan to build useful demos of precursor behaviors for further study.

For the evals team, we aim to make 2-4 offers depending on funding and fit.

We’re aiming for start dates between September and November 2023 but are happy to consider individual circumstances.

About the team

The evals efforts are currently spearheaded by Mikita Balesni (Evals Researcher) and Jérémy Scheurer (Evals Researcher) with guidance and advice from Marius Hobbhahn (CEO). 

The interpretability efforts are spearheaded by Lucius Bushnaq (Interpretability Researcher) and Dan Braun (Lead Engineer) with guidance and advice from Lee Sharkey (CSO).

As of recently, we have a small policy team with Clíodhna Ní Ghuidhir as a full-time hire and one part-time position (tba) to support our technical work by helping build an adequate AI auditing ecosystem. 

Leadership consists of Marius Hobbhahn, Lee Sharkey and Chris Akin (COO). 

Our hierarchies are relatively flat and we’re happy to give new employees responsibilities and the ability to shape the organization.

 

In case you have questions, feel free to reach out at info@apolloresearch.ai 

Comments (1)

So exciting! In my personal opinion, evals and mech interpretability seem like the most tractable parts of the AI Safety ecosystem right now, so I'm very happy to see talented people work on this. 
