
Summary: This post documents research by SatisfIA, an ongoing project on non-maximizing, "aspiration-based" designs for AI agents that fulfill goals specified by constraints ("aspirations") rather than maximizing an objective function​​. We aim to contribute to AI safety by exploring design approaches and their software implementations that we believe might be promising, but neglected or novel. Our approach is roughly related to, but largely complementary to, concepts like Decision Transformers, Active Inference, Quantilization and Satisficing (or soft optimization in general).

It is divided into two parts: an informal introduction and a formal introduction.

First, the informal introduction describes the motivation for the research, our working hypotheses, and our theoretical framework. It does not contain results and can safely be skipped if you want to get directly into the actual research.

Second, the formal introduction presents the simplest form of the type of aspiration-based algorithms we study. We do this for a simple form of aspiration-type goals: making the expectation of some variable equal to some given target value.[1]

At the end of the post we give an overview of the project's status (we're looking for collaborators!) and provide a small glossary of terms.

This post is the summary for an emerging sequence on LessWrong and the Alignment Forum, authored by us and other members of the SatisfIA project team. If you wish to learn more about our approach you can read the full sequence there.
 

🄸🄽🄵🄾🅁🄼🄰🄻 🄸🄽🅃🅁🄾🄳🅄🄲🅃🄸🄾🄽


Motivation

We share a general concern regarding the trajectory of Artificial General Intelligence (AGI) development, particularly the risks associated with creating AGI agents designed to maximize objective functions. We have two main concerns:

AGI development might be inevitable

The development of AGI seems inevitable due to its immense potential economic, military, and scientific value. This inevitability is driven by:

  • Economic and military incentives: The prospect of gaining significant advantages in efficiency, capability, and strategic power makes the pursuit of AGI highly attractive to both private and state actors​​.
  • Scientific curiosity: The inherent human drive to understand and replicate intelligence propels the advancement of AGI research.

It might be impossible to implement an objective function the maximization of which would be safe

The conventional view on A(G)I agents (see, e.g., Wikipedia) is that they should aim to maximize some function of the state or trajectory of the world, often called a "utility function", sometimes also called a "welfare function". This view tacitly assumes that some such objective function exists whose maximization would make the AGI behave in a morally adequate way. However, this assumption faces several significant challenges:

  • Moral ambiguity: The notion that a universally acceptable, safe utility function exists is highly speculative. Given the philosophical debates surrounding moral cognitivism and moral realism and similar debates in welfare economics, it is possible that there are no universally agreeable moral truths, casting doubt on the existence of a utility function that encapsulates all relevant ethical considerations​​.
  • Historical track-record: Humanity's long-standing struggle to define and agree upon universal values or ethical standards raises skepticism about our capacity to discover or construct a comprehensive utility function that safely governs AGI behavior (Outer Alignment)​​ in time.
  • Formal specification and Tractability: Even if a theoretically safe and comprehensive utility function could be conceptualized, the challenges of formalizing such a function into a computable and tractable form are immense. This includes the difficulty of accurately capturing complex human values and ethical considerations in a format that can be interpreted and tractably evaluated by an AGI​​.
  • Inner Alignment and Verification: Even if such a tractable formal specification of the utility function were written down, it might not be possible to ensure that the AGI actually follows it, because the specification might be extremely complicated and it might be computationally intractable to verify that the agent's behavior complies with it. The latter issue is also related to Explainability.

Given these concerns, the implication is clear: AI safety research should spend considerably more effort on identifying and developing AGI designs that do not rely on the maximization of an objective function. Given our impression that currently not enough researchers pursue this, we chose to work on it, complementing existing work on what some people call "mild" or "soft optimization", such as Quantilization or Bayesian utility meta-modeling. In contrast to the latter approaches, which are still based on the notion of a "(proxy) utility function", we explore the apparently mostly neglected design alternative that avoids the very concept of a utility function altogether.[2] The closest existing aspiration-based design we know of is Decision Transformer, which is inherently a learning approach. To complement this, we focus on a planning approach. 
 

Working hypotheses

We use the following working hypotheses during the project, which we ask the reader to adopt as hypothetical premises while reading our text (you don't need to believe them, just assume them to see what might follow from them).

1: We should not allow the maximization of any function of the state or trajectory of the world

Following the motivation described above, our primary hypothesis posits that an AGI must not be allowed to aim to maximize any form of objective function that evaluates the state or trajectory of the world. (This does not necessarily rule out that the agent employs some form of optimization as part of its decision making, as long as the objective being optimized is not purely a function of the state or trajectory of the world. For example, we might allow some form of constrained maximization of entropy or minimization of free energy or the like, which are functions of a probabilistic policy rather than of the state of the world.)

2: The core decision algorithm must be hard-coded

The 2nd hypothesis is that to keep an AGI from aiming to maximize some utility function, the AGI agent must use a decision algorithm to pick actions or plans on the basis of the available information, and that decision algorithm must be hard-coded and cannot be allowed to emerge as a byproduct of some form of learning or training process. This premise seeks to ensure that the foundational principles guiding the agent's decision making are known in advance and have verifiable (ideally even provable) properties. This is similar to the "safe by design" paradigm and implies a modular design where decision making and knowledge representation are kept separate. In particular, it rules out monolithic architectures (like using only a single transformer or other large neural network that represents a policy, and a corresponding learning algorithm).

3: We should focus on model-based planning first and only consider learning later

Although in reality, an AGI agent will never have a perfect model of the world and hence also needs some learning algorithm(s) to improve its imperfect understanding of the world on the basis of incoming data, our 3rd hypothesis is that for the design of the decision algorithm, it is helpful to hypothetically assume in the beginning that the agent already possesses a fixed, sufficiently good probabilistic world model that can predict the consequences of possible courses of action, so that the decision algorithm can choose actions or plans on the basis of these predictions. The rationale is that this "model-based planning" framework is simpler and mathematically more convenient to work with and allows us to address the fundamental design issues that arise even without learning, before addressing additional issues related to learning. (see also Project history)

4: There are useful generic, abstract safety criteria largely unrelated to concrete human values

We hypothesize the existence of generic and abstract safety criteria that can enhance AGI safety in a broad range of scenarios. These criteria focus on structural and behavioral aspects of the agent's interaction with the world, largely independent of the specific semantics or contextual details of actions and states and mostly unrelated to concrete human values. Examples include the level of randomization in decision-making, the degree of change introduced into the environment, the novelty of behaviors, and the agent's capacity to drastically alter world states. Such criteria are envisaged as broadly enhancing safety, very roughly analogous to guidelines such as caution, modesty, patience, neutrality, awareness, and humility, without assuming a one-to-one correspondence to the latter.

5: One can and should provide certain guarantees on performance and behavior

We assume it is possible to offer concrete guarantees concerning certain aspects of the AGI's performance and behavior. By adhering to the outlined hypotheses and safety criteria, we aim to develop AGI systems that exhibit behavior that is in some essential respects predictable and controlled, reducing risk while fulfilling their intended functions within safety constraints.

6: Algorithms should be designed with tractability in mind, ideally employing Bellman-style recursive formulas

Because the number of possible policies grows exponentially fast with the number of possible states, any algorithm that requires scanning a considerable portion of the policy space (like, e.g., "naive" quantilization over full policies would) soon becomes intractable in complex environments. In optimal control theory, this problem is solved by exploiting certain mathematical features of expected values that allow decisions to be made sequentially and the relevant quantities (V- and Q-values) to be computed in an efficient recursive way using the Hamilton–Jacobi–Bellman equation. Based on our preliminary results, we hypothesize that a very similar recursive approach is possible in our aspiration-based framework, keeping our algorithms tractable in complex environments.
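As a minimal, self-contained sketch of what such a recursion can look like (a hypothetical toy model and hypothetical names, not the project's actual code), the following computes the smallest and largest expected Total achievable from each state of a small finite-horizon MDP by a backward, Bellman-style recursion. Quantities of this kind are closely related to the "feasibility sets" appearing in the high-level algorithm structure below, and they can be computed without ever enumerating whole policies:

```python
from functools import lru_cache

# Hypothetical toy model: transitions[state][action] is a list of
# (probability, next_state, delta) triples; HORIZON is the episode length.
transitions = {
    "s0": {"a": [(1.0, "s1", 1.0)], "b": [(0.5, "s1", 0.0), (0.5, "s2", 2.0)]},
    "s1": {"a": [(1.0, "s2", 3.0)]},
    "s2": {"a": [(1.0, "s2", 0.0)]},
}
HORIZON = 3

@lru_cache(maxsize=None)
def V(state, t, which):
    """Smallest (which=min) or largest (which=max) expected Total achievable from state at time t."""
    if t == HORIZON:
        return 0.0
    return which(Q(state, a, t, which) for a in transitions[state])

def Q(state, action, t, which):
    """Same quantity, conditional on taking the given action first (Bellman-style recursion)."""
    return sum(p * (delta + V(nxt, t + 1, which))
               for p, nxt, delta in transitions[state][action])

# The interval [V_min, V_max] bounds which expected Totals are still feasible from "s0".
print(V("s0", 0, min), V("s0", 0, max))   # 2.5 4.0
```

Because the recursion memoizes values per state and time step, its cost grows only polynomially with the number of states and actions, in contrast to the exponentially large policy space.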

Theoretical framework

Agent-environment interface. Our theoretical framework and methodological tools draw inspiration from established paradigms in artificial intelligence (AI), aiming to construct a robust understanding of how agents interact with their environment and make decisions. At the core of our approach is the standard agent–environment interface, a concept that we later elaborate through standard models like (Partially Observable) Markov Decision Processes. For now, we consider agents as entirely separate from the environment — an abstraction that simplifies early discussions, though we acknowledge that the study of agents that are part of their environment is an essential extension for future exploration.

Simple modular design. To be able to analyse decision-making more clearly, we assume a modular architecture that divides the agent into two main components: the world model and the decision algorithm. The world model represents the agent's understanding and representation of its environment. It encompasses everything the agent knows or predicts about the environment, including how it believes the environment responds to its actions. The decision algorithm, on the other hand, is the mechanism through which the agent selects actions and plans based on its world model and set goals. It evaluates potential actions and their predicted outcomes to choose courses of action that appear safe and aligned with its aspirations.
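As a minimal illustration of this modular split (all interface names below are hypothetical, not the project's actual API), the decision algorithm only ever interacts with the world model through a narrow, fixed interface and never depends on how that model is represented or was obtained:

```python
from typing import Any, Iterable, Protocol, Tuple

class WorldModel(Protocol):
    """Knowledge representation: everything the agent believes about its environment."""
    def possible_actions(self, state: Any) -> Iterable[Any]: ...
    def transition_distribution(self, state: Any, action: Any) -> Iterable[Tuple[float, Any, float]]:
        """Predicted (probability, successor_state, delta) triples for a transition."""

class DecisionAlgorithm:
    """Hard-coded decision making: picks actions based on the world model and an aspiration."""
    def __init__(self, world_model: WorldModel, aspiration: float):
        self.model = world_model
        self.aspiration = aspiration

    def choose_action(self, state: Any) -> Any:
        raise NotImplementedError  # to be filled in by a concrete aspiration-based algorithm
```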

Studying a modular setup can also be justified by the fact that some leading figures in AI advocate modular designs, even though current LLM-based systems are not yet modular in this way.

Information theory. For formalizing generic safety criteria, we mostly employ an information-theoretic approach, using the mathematically very convenient central concept of Shannon entropy (unconditional, conditional, mutual, directed, etc.) and derived concepts (e.g., channel capacity). This also allows us to compare and combine our findings with approaches based on the free energy principle and active inference, where goals are formulated as desired probability distributions of observations.
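As a tiny concrete example (hypothetical code, not the project's implementation), the simplest such quantity is the Shannon entropy of the agent's action distribution in a given state, which could be used, e.g., to quantify the level of randomization in decision-making mentioned under hypothesis 4:

```python
import math

def shannon_entropy(action_probs):
    """Entropy in bits of a finite action distribution given as {action: probability}."""
    return -sum(p * math.log2(p) for p in action_probs.values() if p > 0)

print(shannon_entropy({"left": 0.5, "right": 0.5}))   # 1.0 bit: fully randomized
print(shannon_entropy({"left": 0.9, "right": 0.1}))   # about 0.47 bits: mostly deterministic
```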

High-level structure of studied decision algorithms

The decision algorithms we currently study have the following high-level structure:

  1. The agent accepts tasks given as aspirations, i.e., as constraints on the probability distribution of certain task-relevant evaluation metrics. The aspiration will be updated over time in response to achievements, good or bad luck, and other observations.
  2. Whenever the agent must decide what the next action is, it performs the following steps:
    1. First, it uses its world model to estimate what evolutions of the evaluation metrics are still possible from the current state on, and after taking each possible action in the current state. This defines the state's and all possible actions' "feasibility sets".
    2. It then compares each possible action's feasibility set with its current aspiration and decides whether it would have to adjust its aspiration if it were to take that action, and how to adjust its aspiration in that case. This defines all possible actions' "action-aspirations".
    3. It also evaluates all possible actions using one or more safety and performance criteria and picks a combination of one or more candidate actions that appear rather safe.
    4. It then randomizes between these candidate actions in a certain way.
    5. Based on the next observation, it updates its beliefs about the current state.
    6. Finally, in particular when the environment is partially unpredictable and the agent has had some form of good or bad luck, it potentially adjusts its aspiration again ("aspiration propagation").

Depending on the type of aspiration, the details can be designed so that the algorithm comes with certain guarantees about the fulfillment of the goal. 
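
To make this structure concrete, here is a small, self-contained sketch of such a loop on a toy model that is deterministic and fully observed (so steps 5 and 6 become trivial). Everything in it is hypothetical and deliberately naive: the feasibility computation, the clipping rule for action-aspirations, the "safety" criterion, and the aspiration propagation are mere placeholders that only show how the steps fit together, not how they should actually be designed:

```python
import random

# Hypothetical toy model: transitions[state][action] = (next_state, delta).
transitions = {
    "start": {"slow": ("mid", 1.0), "fast": ("mid", 3.0)},
    "mid":   {"slow": ("end", 1.0), "fast": ("end", 3.0)},
    "end":   {},
}

def feasibility_interval(state, action=None):
    """Step 1 (placeholder): range of Totals still achievable from this state (or after this action)."""
    if not transitions[state]:
        return (0.0, 0.0)
    items = transitions[state].items() if action is None else [(action, transitions[state][action])]
    lows, highs = [], []
    for _, (nxt, delta) in items:
        lo, hi = feasibility_interval(nxt)
        lows.append(delta + lo)
        highs.append(delta + hi)
    return (min(lows), max(highs))

def action_aspiration(aspiration, interval):
    """Step 2 (placeholder): clip the current aspiration into the action's feasible interval."""
    lo, hi = interval
    return min(max(aspiration, lo), hi)

def decision_step(state, aspiration):
    actions = list(transitions[state])
    aspirations = {a: action_aspiration(aspiration, feasibility_interval(state, a)) for a in actions}
    # Step 3 (naive placeholder criterion): keep the actions whose action-aspiration
    # deviates least from the current aspiration.
    best = min(abs(asp - aspiration) for asp in aspirations.values())
    candidates = [a for a in actions if abs(aspirations[a] - aspiration) == best]
    action = random.choice(candidates)            # step 4: randomize among candidates
    return action, aspirations[action]

# Steps 5 and 6 are trivial here: the state is fully observed, and the aspiration
# is propagated simply by subtracting the Delta that was just received.
state, aspiration, total = "start", 4.0, 0.0
while transitions[state]:
    action, aspiration = decision_step(state, aspiration)
    next_state, delta = transitions[state][action]
    total += delta
    aspiration -= delta
    state = next_state
print("Total achieved:", total, "for an initial aspiration of 4.0")
```

In this particular toy model the loop always ends with a Total of exactly 4, matching the initial aspiration, but that is a property of the example rather than a general guarantee of this naive sketch; the guarantees mentioned above require the more careful constructions presented in the sequence.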
 

🄵🄾🅁🄼🄰🄻   🄸🄽🅃🅁🄾🄳🅄🄲🅃🄸🄾🄽


Assumptions

In line with the working hypotheses, we assume more specifically the following in this post:

  • The agent is a general-purpose AI system that is given a potentially long sequence of tasks, one by one, which it does not know in advance. Most aspects of what we discuss focus on the current task only, but some aspects relate to the fact that there will be further, unknown tasks later (e.g., the question of how much power the agent shall aim to retain at the end of the task).
  • It possesses an overall world model that represents a good enough general understanding of how the world works.
  • Whenever the agent is given a task, an episode begins and its overall world model provides it with a (potentially much simpler) task-specific world model that represents everything that is relevant for the time period until the agent gets a different task or is deactivated, and that can be used to predict the potentially stochastic consequences of taking certain actions in certain world states.
  • That task-specific world model has the form of a (fully observed) Markov Decision Process (MDP) that, however, does not contain a reward function but instead contains what we call an evaluation function related to the task (see the bullet point after next).
  • As a consequence of a state transition, i.e., of taking a certain action $a$ in a certain state $s$ and finding itself in a certain successor state $s'$, a certain task-relevant evaluation metric changes by some amount. Importantly, we do not assume that the evaluation metric inherently encodes things of which more is better. E.g., the evaluation metric could be global mean temperature, client's body mass, x coordinate of the agent's right thumb, etc. We call the step-wise change in the evaluation metric the received Delta in that time step, denoted $\Delta$. We call its cumulative sum over all time steps of the episode the Total, denoted $\tau$. Formally, Delta and Total play a similar role for our aspiration-based approach as the concepts of "reward" and "return" play for maximization-based approaches. The crucial difference is that our agent is not tasked to maximize Total (since the evaluation metric does not have the interpretation of "more is better") but to aim for some specific value of the Total.
  • The evaluation function contained in the MDP specifies the expected value of $\Delta$ for all possible transitions: