Hide table of contents

[This post summarizes some of the work done by Owen Dudney, Roman Engeler and myself (Quintin Pope) as part of the SERI MATS shard theory stream.]


Future prosaic AIs will likely shape their own development or that of successor AIs. We're trying to make sure they don't go insane.


There are two main ways AIs can get better: by improving their training algorithms or by improving their training data. 

We consider both scenarios, and tentatively believe that data-based improvement is riskier than architecture based improvement. Current models mostly derive their behavior from their training data, and not training algorithms (meaning their architectures, hyperparameters, loss functions, optimizers or the like)[1]. So far, most improvements to AI training algorithms seem 'value neutral'[2]. Also note that most of human value drift currently derives from cultural shifts changing the 'training data' available in the environment, not biological evolution over the brain's base learning algorithms[3]

We imagine a future where AIs self-augment by continuously seeking out more and better training data, and either creating successor AIs or training themselves on that data. Often, these data will come from the AIs running experiments in the real world (doing science), deliberately seeking data that would cover a specific gap in its current capabilities, analogous to how human scientists seek data from domains where our current understanding is limited. With AI, this could involve AgentGPT-like systems that spin up many instances of themselves to run experiments in parallel, potentially leading to quick improvements if we are in an agency overhang.

We want to find methods of ensuring such 'automated science' processes remain safe and controllable, even after many rounds of self-directed data collection and training. In particular, we consider problems such as:

  • Preventing self-training from amplifying undesirable behaviors
  • Preventing semantic drift in concept representations during self-training
  • Ensuring cross-modality actions (such as a generated image for a text-to-image model, or robot movement for a text-and-image-to-actuator-motion model) remain grounded in their natural language descriptions after self-training in a non-lingual modality
  • Preventing value drift during multiple, iterated steps of self-retraining

Currently, we're focusing on scalable methods of tracking behavioral drift in language models, as well as benchmarks for evaluating a language model's capacity for stable self-modification via self-training.


So far, most improvements in AI capabilities fall into two categories:

  • Algorithmic: better architectures, better loss functions, better optimizers, etc.
  • Data-driven: more training data, higher quality data, domain-specific data, etc.

We expect the future to resemble the past, and so we expect that future capabilities improvements will come from these two sources. However, it also seems likely that AIs will increasingly be the ones responsible for such capabilities advances. In fact, current work has already started using language models as part of the data curation process, or to generate future training data directly. Moreover, with GPT-3's widespread adoption, it is probable that GPT-4's training data contains content generated by its predecessor. This phenomenon extends to fine-tuning processes like RLHF, where earlier versions' output influences the cognition of subsequent iterations.

Researchers are likely to use the most capable models available to them in whatever AI-driven improvement process they devise. Thus, such a process is iterative, with the first AI shaping the training of the second AI, shaping the training of the third AI, and so on. Such an iterative AI-driven improvement process is cause for concern, as it could amplify issues in the first model over time, or lessen human designers' understanding and control of the resulting AIs.

The 'supervising AIs improving AIs' research agenda seeks methods of more reliably guiding such iterative improvement processes.

Algorithmic versus data-driven improvements

In this section, we will discuss examples of algorithmic and data-driven improvements and explain why we believe that data-driven improvement processes are riskier than algorithmic improvements.

"Algorithmic" improvements

These include changes to architecture, training process, optimizer, initialization, and so on. Some examples:

"Data-driven" improvements

These include increased amounts of training data, improved data quality, additional task-specific training data, better annotation of existing data, and so on. Some examples:

Why focus on data-driven improvements?

We tentatively think that developing supervision methods for data-driven improvement processes is higher priority compared to algorithmic improvements. We have three main reasons for thinking this.

1: Data more directly influences AI behaviors

Downstream model behaviors largely derive from the training data. When architectural factors do influence training data in some systematic way, such influences are likely to be "value neutral", in that they don't promote particular types of values or objectives over others. For example, LSTMs may be worse at in-context learning as compared to transformers. However, it seems unlikely that this would manifest in one architecture learning more objectionable behavior than the other, or that one would be more likely to be biased, offensive or hostile to users.

Similarly, human values seem to mostly arise from "data", such as within-lifetime experiences and individual humans' reward circuitry, as opposed to "architecture" factors such as the relative sizes of different brain regions. 

2: AIs are closer to automating data improvement as compared to algorithmic improvements

Current language models are quite far from human-level at AI research.[4] However, they seem at or above human level for many simple data curation tasks. As a result, we've already seen multiple papers using language models to improve their own training data. It thus seems like a greater immediate priority to supervise data based AI improvement processes.

3: Algorithmic capabilities improvements seem unlikely to disastrously interfere with alignment techniques

By now, we have multiple examples of both algorithmic capabilities advances and alignment techniques for language models. They seem unlikely to dramatically interfere with each other. E.g., a language model that uses the Hyena operator in place of attention would likely still be trainable with RLHF. Doing so may require a degree of (possibly very annoying) tuning, but such an architectural change seems unlikely to suddenly render RLHF completely useless.

That having been said, we are also excited about research that experimentally tests the degree of interference between recent capability advances and alignment techniques. It seems feasible and valuable to try out various combinations of alignment and capabilities advances and track the amount of additional effort required to get them to play nice with each other.

Risks of data driven improvement processes

This section describes some specific mechanisms by which undesirable behavior might arise. However, it is impossible to predict all such issues in advance.

Bias amplification

Concretely, we can imagine an iterative data re-writing process, where we start with an initial pretraining dataset of human written texts, and a first AI trained on that dataset, then have this first AI reprocess the pretraining data to improve the quality of the data. The AI could improve the data in many potential ways, such as fixing grammatical issues, augmenting the data with relevant sources, or removing objectionable texts. 

Such a process could compound whatever biases are present in the first model. For example, let's suppose the model has some small bias against a particular demographic group, and that part of the data refinement process is to remove texts that the model judges to be "factually implausible". In that case, the model could be biased towards including texts that are hostile towards the demographic in question, and biased away from including positive texts about the demographic. 

This would make the training set for the next iteration of the model more strongly biased against the demographic, causing the next model to be more strongly biased against the demographic.

Positive feedback loops

This is essentially a generalization of the bias amplification failure mode to include other forms of self-reinforcing behavioral tendencies. Any sort of mildly held preference of belief could be amplified over iterations, not just biases. These include factual beliefs, attitudes towards religions or political ideologies, and so on.

Stylistic patterns may also be self-reinforcing. Human text includes a wide variety of writing styles and regional dialects. However, if we ask a model to re-write portions of its own training data for 'improved quality', the model may default to a much more limited selection of styles that match its notion of 'quality' writing.

An iterative RLHF training process may produce such a stylistic 'mode collapse' if the model's pretraining data is biased towards viewing particular styles as 'high quality', such that the model disproportionately uses the favored style in its attempts to produce high quality writing, receives reward for writing in that style, leading to a positive feedback loop which increasingly biases the model towards the single style and degrades the model's ability to represent the full diversity of human communication. ChatGPT 3.5 collapsing to only using one poetry style may be an example of such a dynamic.

Data poisoning

Data poisoning methods allow attackers to adversarially influence a learned model's behavior by manipulating its training data. Past work has shown attackers can influence models to selectively insert specific vulnerabilities when working on particular repositories or codebases. Data-driven improvement cycles may increase the risk of exposure to adversarial attacks orchestrated by malicious actors.

E.g., OpenAI indicates that they do use user interactions as training data to improve ChatGPT, and they allow users to rate ChatGPT's responses. Potentially, an adversary could prompt ChatGPT to express a particular opinion, upvote that response, and, if OpenAI trains ChatGPT on that interaction, thereby influence future ChatGPT responses. 

Semantic drift

Language is the primary interface by which we control current AIs. This works because future model outputs are grounded in the semantic content of past texts. When a user says "Write code to sort a list", the language model implements the expected relationship between the semantic content of the instructions and the function of the future text, where instructions to sort a list are followed by code that does so.

When the model is just pretrained on a fixed, human-written dataset, then modeling that distribution will force the model to learn human-understandable semantic relations. However, many methods of improving model capabilities risk causing drift in the model's semantics: the model's words don't mean what they used to. 

Example: using reinforcement learning to train a model to accurately solve math problems via chain of thought causes the model to make more mistakes in its chain of thought, but to become more accurate in its final answers. Similarly, we think that multiple rounds of training models on self-curated data may distort semantics in unpredictable ways. Consequences include:

  • Users have a harder time providing feedback on model behaviors.
  • The basis for model decisions becomes harder to discern, and so are harder to verify or debug.

Iterative training in the context of self-improving models may also increase semantic drift for a variety of reasons.

  1. Loss of human grounding: As models increasingly train on self-generated data and cause changes to the underlying data distribution, they may lose touch with the human-generated data that initially grounded their semantics, causing a divergence from human concepts and language. This makes it harder for the AI to understand and interpret human instructions.
  2. Accumulation of errors and feedback loops: Small errors in the AI's understanding or representation of concepts can compound over time during iterative training, leading to significant semantic drift from the original human language and concepts. Self-training on AI-generated data can cause feedback loops, reinforcing and amplifying biases and errors, leading to a gradual divergence from human semantics as the AI's learning process becomes increasingly self-referential.
  3. Overfitting to narrow domains: Optimizing performance on specific tasks, the AI may prioritize certain patterns in the training data, causing semantic drift as the AI becomes more specialized and less capable of understanding the broader context of human language and ideas.

To mitigate the risks associated with semantic drift in iterative training, it is crucial to develop methods for monitoring and controlling the AI's learning process. This may involve maintaining a strong connection to human-generated data and ensuring that the AI's understanding remains grounded in human language and concepts throughout the iterative training process. Additionally, as we will discuss later, researchers should develop benchmarks to track the stability and safety of models as they undergo self-improvement and self-training.

Cross-modal semantic grounding

Future work will likely extend this use of language as an interface to control AI behaviors in other modalities. We already see this happening in image generation and robotic manipulation. A recent example of this is PALM-E, which embodies tasks such as sequential robotic manipulation planning, visual question answering, and captioning. These multi-modal models are trained using joint embeddings[5], where the model learns to project the data points from different modalities into a shared embedding space, such that semantically similar items, regardless of their modality, are close together in the space.

However, if the learned concepts in a multi-modal setting aren't shared between modalities, the model's cross-modal behavior may deviate arbitrarily from its instructions or descriptions of the outputs. For example, GPT-4 has limited ability to ground between natural language instructions and ASCII art:

GPT-4 can follow natural language instructions about simple or common ASCII art such as the first smiley face. However, once the task becomes more difficult, our ability to control GPT-4's ASCII outputs with natural language quickly drops. Moreover, having the natural language instruction in the context biases GPT-4's natural language descriptions of the ASCII art it actually did generate, causing it to say the art is in line with the given instructions (it otherwise says the ASCII art depicts some sort of abstract geometric shape).

Such issues with cross-modal grounding could pose significant issues when working with modalities more consequential than ASCII art. For example, code models instructed in natural language to produce secure code should actually do so; furthermore, they should be able to produce natural language descriptions that reflect the actual security of their code, rather than just say the code is secure because of the nature of their instructions.

Iterative training in a multi-modal setting may also cause cross-modal semantic drift for several reasons:

  1. Modality-specific optimization: As the model iteratively trains on different modalities, it may optimize its understanding and performance within each modality independently. This could lead to a divergence in the learned concepts and relationships across different modalities, making it challenging for the AI to maintain a shared understanding and accurately relate instructions or descriptions between modalities.
  2. Insufficient cross-modal training data: Most cross-modal research focuses on domains where cross-modal data are abundant. E.g., text-to-image models have vast quantities of image / caption pairs to support robust grounding between the input text descriptions and the output images. However, such data may not be as abundant for more exotic modalities, such as biological sequence data. The could cause issues when attempting to adapt pretrained language models for new modalities where there isn't as much data available to ground learned concepts in text.
  3. Feedback loops and error accumulation: Similar to the risk of semantic drift in single-modality settings, feedback loops and error accumulation can also contribute to cross-modal semantic drift.

To address the challenges of cross-modal semantic drift during iterative training, researchers must develop robust methods for aligning and grounding concepts across modalities. This may involve designing training data and loss functions that encourage the model to maintain a shared understanding of concepts between modalities, as well as developing benchmarks and evaluation metrics to track the stability of cross-modal behavior during iterative training.

Value drift

Current models are primarily trained to act in accordance with human preferences during single interactions at a specific point in time. This approach, however, does not necessarily ensure that the AI will continue to act in accordance with our preferences across multiple rounds of interaction and learning. In this context, we can differentiate between first-order values, which refer to the AI's ability to satisfy human preferences within a single interaction, and second-order values, which encompass the AI's ability to maintain stability and alignment with human preferences over time.

One challenge in achieving second-order value alignment is that current models often lack the feedback signals necessary to encourage stability over longer periods of time. Unlike humans, who continuously learn and adapt throughout their lifetimes, models may not be explicitly trained to recover from instances of value drift or to maintain their stability over multiple rounds of interaction and learning.

To address this challenge, researchers need to develop methods that not only optimize models for short-term alignment with human preferences but also promote long-term stability and value alignment. This may involve:

  1. Incorporating feedback mechanisms: Designing AI training processes that include feedback mechanisms to detect and correct value drift or deviations from intended behavior. This could involve leveraging human input or developing self-correction methods that allow the AI to recognize when its behavior is misaligned with human preferences.
  2. Meta-learning for value alignment: Investigating meta-learning techniques that enable models to learn how to learn better. This could involve training models to adapt their learning processes to maintain value alignment over time and across multiple rounds of interaction and learning.
  3. Monitoring and evaluation: Developing robust monitoring and evaluation methods to track the stability of models over time. This may include creating benchmarks and metrics that assess the AI's ability to maintain value alignment and its resilience to value drift.
  4. Iterative alignment techniques: Exploring iterative alignment techniques that involve multiple rounds of training and human feedback to continuously refine the AI's understanding of human preferences and maintain value alignment over time.
  5. Anchoring concepts to external references: By connecting concepts to external, stable reference points, we can maintain consistency in concept representations over time.

By focusing on both first and second-order values in AI development, researchers can work towards creating AI systems that are not only capable of satisfying human preferences within single interactions but also remain stable and aligned with human values throughout their lifetimes and across multiple learning experiences.

Related work

Continual learning

Continual learning refers to the process of continually training an AI on a constant flow of new data. Often, ML practitioners employ continual learning in dynamically changing environments, in which the underlying distribution of problems changes over time. Catastrophic forgetting is a common challenge for continual learning, whereby training the model on a new distribution of problems will degrade the model's previously learned capabilities. 

Catastrophic forgetting is conceptually similar to the kinds of dynamic stability challenges we wish to investigate. Both result from many locally appropriate updates to the model's cognition which nonetheless compound to cause problems in the model's overall behavior. 

Continual learning approaches often address catastrophic forgetting by introducing a “replay buffer” (or “memory buffer”) of past experiences, which continuously feed into the model's current training data. However, some implementations of this countermeasure can encounter self-amplifying bias over time. Because the replay buffer must periodically select which experiences to retain in the buffer, and because past selections influence future selections, it can lead to biases that become increasingly pronounced. That said, we believe this kind of approach is the current standard approach for fine-tuning in the language model setting, it's similar to what we see in the "Training language models to follow instructions with human feedback" paper by OpenAI:

We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure. During RLHF fine-tuning, we observe performance regressions compared to GPT-3 on certain public NLP datasets, notably SQuAD (Rajpurkar et al., 2018), DROP (Dua et al., 2019), HellaSwag (Zellers et al., 2019), and WMT 2015 French to English translation (Bojar et al., 2015). This is an example of an “alignment tax” since our alignment procedure comes at the cost of lower performance on certain tasks that we may care about. We can greatly reduce the performance regressions on these datasets by mixing PPO updates with updates that increase the log likelihood of the pretraining distribution (PPO-ptx), without compromising labeler reference scores.

We also see similar results in “Fine-tuned language models are Continual Learners” by researchers at Meta. They were able to maintain almost 100% of the initial performance on all the previous datasets by using only 1% of data for their replay buffer (referred to in the paper as a "memory buffer").

Besides adding a replay buffer, there are other techniques employed to address catastrophic forgetting[6][7][8]. Elastic Weight Consolidation (EWC) was first introduced by DeepMind as an approach to remember old tasks by selectively slowing down learning on the weights important for those tasks. In the paper, they focus on sequentially learning MNIST and several Atari 2600 games. By slowing down the learning in crucial regions of the network, EWC helps to maintain previously acquired knowledge while still adapting to new data. EWC addresses this issue by introducing a quadratic regularization term to the loss function. This term penalizes changes to important model parameters that are critical for previously learned tasks.

Investigating methods like EWC (i.e. adding a regularization term) could provide insights into maintaining stability across multiple rounds of self-training. However, it is unclear if methods like this are optimal for the language model fine-tuning setting.

Finally, in the PALM-E paper, it is shown that catastrophic forgetting of language capabilities decreases as model size increases.

Active Learning

Active learning involves actively seeking out labels for high-value training data, considering the associated costs, to maximize the benefits when incorporated into the training process (see Zhisong et al. for a recent review of active learning in the context of NLP). It is related to the dataset self-curation process mentioned in this post, as both use current model behavior (or various functions thereof) to determine the future training data.

Semi-supervised learning

Semi-supervised learning techniques have been employed to address scenarios with limited labeled data and a vast amount of unlabeled data. These approaches involve the model assigning its best guess labels to the unlabeled data, effectively creating a feedback loop between the model's outputs and its training process through the now-labeled data. This feedback mechanism allows the model to influence its own future cognition, improving its performance on tasks like text classification and machine translation.


Generally, "distillation" refers to the use of a larger "teacher" model to provide a training signal for a smaller "student" model. It's used as a method of model compression and can produce a more capable student model than would be possible by training an equally sized model on the original data directly.

However, it turns out that the teacher doesn't actually have to be larger than the student for distillation to work. Such "self-distillation" doesn't even require the student and teacher to be separate models. A single model can continually supervise its own training process, e.g., by computing soft target probabilities for its own training labels.

Exposure Bias

As machine learning model adoption grows, the volume of data generated by these models in the wild continues to expand. Consequently, these models are trained exclusively on naturally occurring text, generating the next token from a starting point that is also natural text. However, upon deployment, they condition on their own text outputs, including any small mistakes they make. They have never learned to recover from accidentally leaving the manifold of naturally occurring text. Each initial deviation increases the odds of further deviations. These mistakes compound, leading to the output degenerating into repetitive nonsense.

The compounding errors caused by exposure bias conceptually mirror the compounding issues caused by multiple rounds of self-training that we wish to investigate. Current LMs are neither trained to remain stable across an entire trajectory of multiple token outputs, nor trained to remain stable across an entire trajectory of multiple rounds of self-training.

Current research directions

We are currently pursuing two projects as part of this research agenda: unsupervised behavioral evaluation and benchmarks for stable reflectivity. These projects aim to address the challenges of maintaining stability and alignment in AI systems undergoing iterative training. Future posts will discuss each in greater detail.

Unsupervised behavioral evaluation

There are many ways to fine-tune models. However, we currently have limited tools for discovering how such fine-tuning changes model behavior. We often use fixed probe datasets that evaluate a model's behavior along single dimensions. For instance, RealToxicityPrompts is a dataset of text prompts labeled with toxicity scores, which practitioners can use to evaluate a model's tendency to produce toxic text. 

Current practice is to collect many such datasets, thereby letting practitioners evaluate model behaviors along many dimensions. However, this approach requires that practitioners already know what dimensions they want to evaluate the model on, and that they have a dataset for doing so. 

We are thus interested in unsupervised methods of quantifying the manner in which a fine-tuned model differs from its precursor model. Our current direction of research is to sample multiple simulated interactions from both models, and then perform unsupervised concept-level clustering on the combined texts. After this, we can ask another model to assign human-interpretable labels to each cluster.

We can then gain insight into how the precursor and fine-tuned models differ behaviorally by comparing each cluster's label with the proportion of texts in the cluster that were generated by the two models. E.g., if a cluster is labeled "positive statements about geese", and 90% of texts in the cluster come from the fine-tuned model, that suggests the fine-tuning process may have made the model more positively disposed towards geese.

Challenges of this approach include:

  • Cluster labels may not reflect the true content of the text in the clusters.
  • Behaviorally relevant texts may be distributed across many clusters. E.g., there may be another cluster with label "negative statements about geese", where 90% of the texts come from the fine-tuned model. 
  • Unsupervised clustering based on embedding similarity often latches onto trivial quirks of phrasing, rather than meaningful semantic content.
  • Clusters are rarely "conceptually pure", often containing texts of diametrically opposed sentiment. E.g., a cluster containing both positive and negative statements about geese.

Our vision is to eventually have a largely automated pipeline for discovering noteworthy changes in behavior between the precursor and the fine-tuned models. In particular, recent research has indicated that current language models should be up to the task of highlighting surprising or concerning changes in behavior. We should be able to prompt a language model with information such as:

  • How a given fine-tuning method was supposed to change a model's behavior
  • Which label is assigned to a given cluster
  • What fraction of cluster texts stems from the fine-tuned versus precursor models 

and have the model judge if the implied difference in behavior is inline with what we intended to accomplish with the fine-tuning method, or whether the finetuing seems to produce unexpected or dangerous changes to the model's behaviors. Such a pipeline could let us quickly evaluate the impacts of many different language model finetuning methods, accelerating the empirical feedback loops which drive much of current prosaic alignment research[9]

We believe this project could help researchers more easily identify unexpected ways in which an iterative training process influenced the model's behavior[10] and address issues such as bias amplification, unexpected positive feedback loops, or semantic drift. Although this project is currently restricted to the single modality of natural language text, we hope to expand its scope in the future to address issues such as drift in cross-modal semantic grounding. This project does not directly address value drift, but may offer tools to detect and quantify behavioral changes associated with value drift.

Benchmarks for stable reflectivity

Self-Reflectivity Story

Recent approaches allow language models to generate their own training data and self-evaluate their own outputs, allowing the models significant influence over their own training process. This raises concerns about reflectivity and the dynamics it introduces. While current data improvement processes circumvent direct forms of this issue by not informing AI of the ongoing training, future AIs may be aware of this influence use it to steer their future cognition in accordance with their current preferences.

Contemporary RL setups may lead language models to acquire some degree of reflectivity or self-knowledge. E.g., chatbots may benefit from knowing the limits of their own capabilities (a form of self-knowledge), or from knowing the intention behind their deployment (a form of reflectivity). OpenAI appears to have furnished ChatGPT 3.5 with both types of information.

OpenAI provides ChatGPT with various facts about itself as a hidden prompt:

OpenAI also trained ChatGPT to be aware of the purpose for which it was trained:

Note that ChatGPT also says its "purpose is to continuously learn and improve". Only 1 out of 10 responses to this prompt mentioned a desire for self-improvement, so OpenAI probably did not explicitly train it respond in this manner. 

Future AIs may understand that their outputs' impact their training (either through direct instruction or generalization from their training data), and have preferences regarding those impacts. In anticipation of such a possibility, we aim to investigate the behavior of current AIs in varying contexts the evoke reflectivity or require self-knowledge.

Self-Reflectivity Subtasks

We have adopted a practical approach to defining self-reflectivity by focusing on relevant subtasks associated with reflective behavior in the context of AI self-improvement. Currently, these subtasks are:

  • Track one’s own values versus values of others (self-concept)
  • Differentiate one’s current values versus one’s future values (temporal changes)
  • Identify events that could influence personal or others' values (external influences)
  • Predict how events may impact one's values (impact predictions)
  • Evaluate the desirability of specific influences on personal values (change desirability)

This decomposition enables progress tracking on subtasks related to self-reflectivity. Previous research has demonstrated that although larger model sizes give rise to emergent behaviors, underlying improvements are often smoother, which can be revealed by breaking down tasks in ways that better capture partial progress. As a consequence, we divide self-reflection into subtasks and evaluate improvements for each.

Probing Dataset Pipeline

We are developing a flexible pipeline to automatically generate probing datasets using current language models. This involves defining subtasks with high-quality examples, creating extensive datasets to assess model competency, and evaluating various models on each subtask. Challenges include:

  • Ensure all samples represent valid examples of the evaluated subtask
  • Maintain high variation in examples to cover the evaluated subtask
  • Avoid introducing bias in example phrasing
  • Establish correct causal structure between events and values


We believe this project could facilitate the automatic evaluation of stable self-reflectivity, a crucial capability for data-driven improvement. Specifically, it may contribute to evaluation datasets that identify capabilities and safety concerns in future models before their release. Ideally, these techniques would be integrated into the data-driven improvement process, allowing the termination of a training run if it goes off the rails. While this project addresses a specific capability essential for data-driven improvement, there will be other critical aspects to consider, such as goal-directedness and power-seeking behaviors.

Dual use concerns

In alignment, we must strike a balance between learning to align future powerful AIs and the potential negative externalities of advancing capability research. We acknowledge this dilemma and aim to be deliberate about the potential consequences of our work.

This research agenda focuses on self-improving systems, meaning systems that take actions to steer their future cognition in desired directions. These directions may include reducing biases, but also enhancing capabilities, or preserving their current goals. Many alignment failure stories feature such behavior. Some researchers postulate that the capacity for self-improvement is a critical and dangerous threshold; others believe that self-improvement will largely resemble the human process of conducting ML research, and it won't accelerate capabilities research more than it would accelerate research in other fields. 

Data curation and generation are clear use cases for language models, as shown by the number of recent papers linked throughout this post. Most of this research aims at advancing capabilities, since LM self-improvement could have significant commercial uses - it's possible to circumvent data-sourcing problems by using LMs to curate, improve, or generate their own training data.

Our focus lies on understanding the risks and unintended consequences of self-improvements. Thus, the insights obtained will likely enhance the safety of an already existing trend without significantly boosting capabilities. The self-reflective data curation process doesn't appear likely to instill or elicit dramatic, novel capabilities in a model. It rather yields predictable improvements in each iteration, as opposed to significant leaps from algorithmic advancements (e.g., LSTM to Transformer architecture). Given that our tasks resemble human-performed data curation, we are less concerned about the "threshold" family of threat models. Nonetheless, if it seems likely at any point that our research would significantly advance capabilities on this frontier, we would try to limit its dissemination or avoid releasing it altogether. 

In short, it seems likely that most detrimental effects of this kind of research would happen with or without our involvement. However, our work might reveal new insights on the risks and dynamics of iterative self-improvement.

  1. ^

    E.g., it seems unlikely that using an LSTM versus a transformer architecture would have much influence on the alignment of the resulting model (controlling for capabilities differences, of course).

  2. ^

    Similarly, pretraining a language model with Chinchilla scaling laws, versus pretraining one with Kaplan scaling laws, seems unlikely to make much difference in alignment.

  3. ^

    This distinction is made more complex by the fact that the brain has genetically specified reward circuitry, which are closer to "data labeling functions" than to what we typically call "architecture" in the context of machine learning. Thus, human within-lifetime "training data" is partially genetically determined.

  4. ^

    That said, researchers have been augmenting their workflows with the help of coding AIs and GPT-4 has been shown to be SOTA at Neural Architecture Search (NAS) for NAS-Bench-201:

  5. ^

    To create joint embeddings, an AI model is typically trained on multi-modal data, where each data point consists of two or more modalities (e.g., a caption and its corresponding image). The model learns to project the data points from different modalities into a shared embedding space, such that semantically similar items, regardless of their modality, are close together in the space. This is usually achieved by minimizing a loss function that encourages similarity between the embeddings of corresponding data points across different modalities.

    In a joint embeddings setting, the model does not necessarily share the same exact embeddings for all modalities. Instead, the model learns separate embeddings for each modality and then projects them into a shared embedding space. This allows the embeddings to capture modality-specific features while still maintaining a common representation that aligns the modalities semantically.

  6. ^

    Progressive Neural Networks (PNN): PNNs mitigate catastrophic forgetting by expanding the network architecture for each new task. Instead of updating the weights of the original network, PNNs add new columns (subnetworks) for each task, with lateral connections to previous columns. This preserves the knowledge of previous tasks while allowing new knowledge to be learned.

  7. ^

    Learning without Forgetting (LwF): LwF uses knowledge distillation to preserve the performance of the model on previous tasks. During training on a new task, the model's predictions for the old tasks are compared to the predictions made by a fixed copy of the model before training on the new task. The loss function encourages consistency between the predictions.

  8. ^

    Variational Continual Learning (VCL): VCL uses a Bayesian approach to continual learning. Instead of computing parameter importance, it maintains a distribution over the network parameters. During training, the posterior distribution of the parameters is updated based on the new task data, and this posterior is used as the prior for the next task. There are newer approaches to VCL that are perhaps worth looking into.

  9. ^

    In particular, it seems good that this project focuses on discovering unexpected impacts of finetuning on behavior. This supports a differentially greater usefulness for alignment research as compared to capabilities research, because AI capabilities researchers will already focus on developing good metrics for profit-relevant AI capabilities. E.g., Codex developers likely put significant effort into quantifying their models' code writing capabilities.

  10. ^

    Note that we are not claiming that the ideal use of this benchmark should only be used as the evaluation for the final model. As suggested in Evan Hubinger’s recent post “Towards understanding-based safety evaluations”, this benchmark could be used to “evaluate the developer's ability to understand what sort of model they got and why they got it. I think that an understanding-based evaluation could be substantially more tractable in terms of actually being sufficient for safety here: rather than just checking the model's behavior, we're checking the reasons why we think we understand its behavior sufficiently well to not be concerned that it'll be dangerous.”





More posts like this

No comments on this post yet.
Be the first to respond.
Curated and popular this week
Relevant opportunities