jacquesthibs

AI Safety Researcher @ Independent Researcher
1166 karma · Working (6-15 years) · London, UK
jacquesthibodeau.com/about/

Bio

I work primarily on AI Alignment. My main direction at the moment is to accelerate alignment work via language models and interpretability.

Comments
86

We're doing a hackathon with Apart Research on the 26th. I created a list of problem statements for people to brainstorm off of.

Pro-active insight extraction from new research

Reading papers can take a long time and is often not worthwhile. As a result, researchers might read too many papers or almost none. However, there are still valuable nuggets in papers and posts. The issue is finding them. So, how might we design an AI research assistant that proactively looks at new papers (and old) and shares valuable information with researchers in a naturally consumable way? Part of this work involves presenting individual researchers with what they would personally find valuable without overwhelming them with things they are less interested in.

How can we improve the LLM experience for researchers?

Many alignment researchers will use language models much less than they would like to because they don't know how to prompt the models, it takes time to create a valuable prompt, the model doesn't have enough context for their project, the model is not up-to-date on the latest techniques, etc. How might we make LLMs more useful for researchers by relieving them of those bottlenecks?

Simple experiments can be done quickly, but turning them into a full project can take a lot of time

One key bottleneck for alignment research is transitioning from an initial 24-hour simple experiment in a notebook to a set of complete experiments tested with different models, datasets, interventions, etc. How can we help researchers move through that second research phase much faster?

How might we use AI agents to automate alignment research?

As AI agents become more capable, we can use them to automate parts of alignment research. The paper "A Multimodal Automated Interpretability Agent" serves as an initial attempt at this. How might we use AI agents to help either speed up alignment research or unlock paths that were previously inaccessible?

How can we nudge researchers toward better objectives (agendas or short experiments) for their research?

Even if we make researchers highly efficient, it means nothing if they are not working on the right things. Choosing the right objectives (projects and next steps) through time can be the difference between 0x, 1x, and 100x+. How can we ensure that researchers are working on the most valuable things?

What can be done to accelerate implementation and iteration speed?

Implementation and iteration speed on the most informative experiments matter greatly. How can we nudge researchers to gain the most bits of information in the shortest time? This involves helping them work on the right agendas/projects and helping them break down their projects in ways that help them make progress faster (and avoid ending up tunnel-visioned on the wrong project for months/years).

How can we connect all of the ideas in the field?

How can we integrate the open questions/projects in the field (with their critiques) in a way that helps researchers come up with well-grounded research directions faster? How can we aid them in choosing better directions and adjusting throughout their research? This kind of work may eventually be a precursor to guiding AI agents to help us develop better ideas for alignment research.

As an update to the Alignment Research Assistant I'm building, here is a set of shovel-ready tasks I would like people to contribute to (please DM if you'd like to contribute!):

Core Features

1. Set up the Continue extension for research: https://www.continue.dev/

  • Design prompts in Continue that are suitable for a variety of alignment research tasks and make it easy to switch between these prompts
  • Figure out how to scaffold LLMs with Continue (instead of just prompting one LLM with additional context)
    • Can include agents, search, and more
  • Test out models to quickly help with paper-writing

2. Data sourcing and management

  • Integrate with the Alignment Research Dataset (pulling from either the SQL database or Pinecone vector database): https://github.com/StampyAI/alignment-research-dataset 
  • Integrate with other apps (Google Docs, Obsidian, Roam Research, Twitter, LessWrong)
  • Make it easy to view and edit long prompts for project context

3. Extract answers to questions across multiple papers/posts (feeds into Continue)

  • Develop high-quality chunking and scaffolding techniques
  • Implement multi-step interaction between researcher and LLM
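As a starting point for the chunking task above, here is a minimal sketch of overlapping word-based chunking. It is only an illustration, not the project's implementation: the function name `chunk_text` and the parameters `chunk_size` and `overlap` are hypothetical, and a real pipeline would likely chunk by tokens or by document structure (sections, paragraphs) instead of raw word counts.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this window has reached the end of the text,
        # so we do not emit a trailing chunk that is pure overlap.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, chunk_size=4, overlap=2):
        print(chunk)
```

The overlap is there so that a sentence split across a chunk boundary still appears whole in at least one chunk, which tends to matter when an LLM is asked to answer questions from individual chunks.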

4. Design Autoprompts for alignment research

  • Creates lengthy, high-quality prompts for researchers that get better responses from LLMs

5. Simulated Paper Reviewer

  • Fine-tune or prompt LLM to behave like an academic reviewer
  • Use OpenReview data for training

6. Jargon and Prerequisite Explainer

  • Design a sidebar feature to extract and explain important jargon
  • Could maybe integrate with some interface similar to https://delve.a9.io/ 

7. Set up automated "suggestion-LLM"

  • An LLM periodically looks through the project you are working on and tries to suggest *actually useful* things in the side-chat. It will be a delicate balance to make sure not to share too much and cause a loss of focus. This could be customized for each researcher, with an option to give automated suggestions only after a research session.

8. Figure out if we can get a usable browser inside of VSCode (tried quickly with the Edge extension but couldn't sign into the Claude chat website)

  • Could make use of new features other companies build (like Anthropic's Artifacts feature), but inside of VSCode to prevent context-switching to an actual browser

9. "Alignment Research Codebase" integration (can add as Continue backend)

  • Create an easily insertable set of repeatable code that researchers can quickly add to their project or LLM context
  • This includes code for Multi-GPU stuff, best practices for codebase, and more
  • Should make it easy to populate a new codebase
  • Pro-actively gives suggestions to improve the code
  • Generally makes common code implementation much faster

Specialized tooling (outside of VSCode)

Bulk fast content extraction

  • Create an extension to extract content from multiple tabs or papers
  • Simplify the process of feeding content to the VSCode backend for future use

Personalized Research Newsletter

  • Create a tool that extracts relevant information for researchers (papers, posts, other sources)
  • Generate personalized newsletters based on individual interests (open questions and research they care about)
  • Send proactive notifications in VSCode and by email

Discord Bot for Project Proposals

  • Suggest relevant papers/posts/repos based on project proposals
  • Integrate with Apart Research Hackathons

I've created a private discord server to discuss this work. If you'd like to contribute to this project (or might want to in the future if you see a feature you'd like to contribute to) or if you are an alignment/governance researcher who would like to be a beta user so we can iterate faster, please DM me for a link!

Yes, I’ve talked to them a few times in the last 2 years!

Hey everyone, my name is Jacques, I'm an independent technical alignment researcher (primarily focused on evaluations, interpretability, and scalable oversight). I'm now focusing more of my attention on building an Alignment Research Assistant. I'm looking for people who would like to contribute to the project. This project will be private unless I say otherwise.

Side note: I helped build the Alignment Research Dataset ~2 years ago. It has been used at OpenAI (by someone on the alignment team), (as far as I know) at Anthropic for evals, and is now used as the backend for Stampy.ai.

If you are interested in potentially helping out (or know someone who might be!), send me a DM with a bit of your background and why you'd like to help out. To keep things focused, I may or may not accept.

I have written up the vision and core features for the project here. I expect to see it evolve in terms of features, but the vision will likely remain the same. I'm currently working on some of the features and have delegated some tasks to others (tasks are in a private GitHub project board).

I'm also collaborating with different groups. For now, the focus is to build core features that can be used individually but will eventually work together into the core product. In 2-3 months, I want to get it to a place where I know whether this is useful for other researchers and if we should apply for additional funding to turn it into a serious project.

GPT-2 was trained in 2019 with an estimated 4e21 FLOP, and GPT-4 was trained in 2023 with an estimated 8e24 to 4e25 FLOP.

Correction: GPT-2 was trained in 2018 but partially released in February 2019. Similarly, GPT-4 was trained in 2022 but released in 2023.

For instance (and to their credit), OpenAI has already committed 20% of their compute secured to date to solving the problem of aligning superintelligent AI systems.

lol


I'm currently trying to think of project/startup ideas in the space of d/acc. If anyone would like to discuss ideas on how to do this kind of work outside of AGI labs, send me a DM.

Note that Entrepreneurship First will be running a cohort of new founders focused on d/acc for AI.

I shared the following as a bio for EAG Bay Area 2024. I'm sharing it here in case it reaches someone who wants to chat or collaborate.

Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.

CURRENT WORK

  • Collaborating with Quintin Pope on our Supervising AIs Improving AIs agenda (making automated AI science safe and controllable). The current project involves a new method allowing unsupervised model behaviour evaluations. Our agenda.
  • I'm a research lead in the AI Safety Camp for a project on stable reflectivity (testing models for metacognitive capabilities that impact future training/alignment).
  • Accelerating Alignment: augmenting alignment researchers using AI systems. A relevant talk I gave. Relevant survey post.
  • Other research that currently interests me: multi-polar AI worlds (and how that impacts post-deployment model behaviour), understanding-based interpretability, improving evals, designing safer training setups, interpretable architectures, and limits of current approaches (what would a new paradigm that addresses these limitations look like?).
  • Used to focus more on model editing, rethinking interpretability, causal scrubbing, etc.

TOPICS TO CHAT ABOUT

  • How do you expect AGI/ASI to actually develop (so we can align our research accordingly)? Will scale plateau? I'd like to get feedback on some of my thoughts on this.
  • How can we connect the dots between different approaches? For example, connecting the dots between Influence Functions, Evaluations, Probes (detecting truthful direction), Function/Task Vectors, and Representation Engineering to see if they can work together to give us a better picture than the sum of their parts.
  • Debate over which agenda actually contributes to solving the core AI x-risk problems.
  • What if the pendulum swings in the other direction, and we never get the benefits of safe AGI? Is open source really as bad as people make it out to be?
  • How can we make something like the d/acc vision (by Vitalik Buterin) happen?
  • How can we design a system that leverages AI to speed up progress on alignment? What would you value the most?
  • What kinds of orgs are missing in the space?

POTENTIAL COLLABORATIONS

  • Examples of projects I'd be interested in: extending either the Weak-to-Strong Generalization paper or the Sleeper Agents paper, understanding the impacts of synthetic data on LLM training, working on ELK-like research for LLMs, experiments on influence functions (studying the base model and its SFT, RLHF, iterative training counterparts; I heard that Anthropic is releasing code for this "soon") or studying the interpolation/extrapolation distinction in LLMs.
  • I’m also interested in talking to grantmakers for feedback on some projects I’d like to get funding for.
  • I'm slowly working on a guide to practical research productivity for alignment researchers, aimed at low-hanging fruit that can quickly improve productivity in the field. I'd like feedback from people with solid track records and from productivity coaches.

TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH

  • Strong math background. Can understand Influence Functions well enough to extend the work.
  • Strong machine learning engineering background. Can run ML experiments and fine-tuning runs with ease. Can effectively create data pipelines.
  • Strong application development background. I have various project ideas that could speed up alignment researchers; I'd be able to execute them much faster if I had someone to help me build my ideas fast.

Another data point: I got my start in alignment through the AISC. I had just left my job, so I spent 4 months skilling up and working hard on my AISC project. I started hanging out on EleutherAI because my mentors spent a lot of time there. This led me to do AGISF in parallel.

After those 4 months, I attended MATS 2.0 and 2.1. I've been doing independent research for ~1 year and have about 8.5 more months of funding left.

More information about the alleged manipulative behaviour of Sam Altman

Source
