Aligned AI is an Oxford-based startup focused on applied alignment research. Our goal is to implement scalable solutions to the alignment problem, and to distribute these solutions to actors developing powerful, transformative artificial intelligence (related Alignment Forum post here).

In the tradition of AI safety startups, Aligned AI will be doing an AMA this week, from today, Tuesday the 1st of March, until Friday the 4th, inclusive. It will be mainly me, Stuart Armstrong, answering these questions, though Rebecca Gorman and Oliver Daniels-Koch may also answer some of them. GPT-3 will not be invited.

From our post introducing Aligned AI:

We think AI poses an existential risk to humanity, and that reducing the chance of this risk is one of the most impactful things we can do with our lives. Here we focus not on the premises behind that claim, but rather on why we're particularly excited about Aligned AI's approach to reducing AI existential risk.

  1. We believe AI Safety research is bottle-necked by a core problem: how to extrapolate values from one context to another.
  2. We believe solving value extrapolation is necessary and almost sufficient for alignment.
  3. Value extrapolation research is neglected, both in the mainstream AI community and the AI safety community. Note that there is a lot of overlap between value extrapolation and many fields of research (e.g. out-of-distribution detection, robustness, transfer learning, multi-objective reinforcement learning, active reward learning, reward modelling...) which provide useful research resources. However, we've found that we've had to generate most of the key concepts ourselves.
  4. We believe value extrapolation research is tractable (and we've had success generating the key concepts).
  5. We believe distributing (not just creating) alignment solutions is critical for aligning powerful AIs.




  1. How does your theory of change or understanding of the alignment problem differ from that of other orgs? (e.g., ARC, Redwood, MIRI, Anthropic). Note that I see you answered a similar question here, though I think this question is a bit different.
  2. How would you explain what value extrapolation is & why it's important to a college freshman?
  3. What kinds of skills/backgrounds/aptitudes are you looking for in new employees? What kinds of people would you be really excited to see join the team?
  4. Are there any skills/aptitudes that would be a uniquely good fit for value extrapolation research? (As in, skills that would make someone an especially good fit for working on this problem as opposed to other problems in AI alignment research)

(Feel free to skip any of these that don't seem like a good use of time!)

  1. Nothing much to add to the other post.
  2. Imagine that you try to explain to a potential superintelligence that we want it to preserve a world with happy people in it by showing it videos of happy people. It might conclude that it should make people happy. Or it might conclude that we want more videos of happy people. The latter is more compatible with the training that we have given it. The AI will be safer if it hypothesizes that we may have meant the former, despite having given it evidence more compatible with the latter, and pursues both goals rather than merely the latter. This is what we are working towards.   
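A toy way to picture "pursuing both goals rather than merely the latter" is to keep several reward hypotheses alive, all compatible with the training data, and score actions by their worst case across the hypotheses. This is only an illustrative sketch (the hypotheses, actions, and scores are made up, not Aligned AI's actual method):

```python
# Toy sketch: instead of committing to the single best-fit reward,
# keep multiple hypotheses that fit the training data and act
# conservatively across them. All values here are placeholders.

def conservative_score(action, reward_hypotheses):
    """Score an action by its worst case across all reward hypotheses."""
    return min(r(action) for r in reward_hypotheses)

# Two readings compatible with "videos of happy people" training data:
make_people_happy = lambda a: {"broadcast": 0.1, "improve_welfare": 0.9, "both": 0.8}[a]
make_more_videos  = lambda a: {"broadcast": 0.9, "improve_welfare": 0.1, "both": 0.8}[a]

hypotheses = [make_people_happy, make_more_videos]
best = max(["broadcast", "improve_welfare", "both"],
           key=lambda a: conservative_score(a, hypotheses))
print(best)  # "both": the action that does acceptably under either reading
```

The point of the sketch is only that an agent hedging between interpretations picks the action that is acceptable under both, rather than the one that maximises its single best-fit hypothesis.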
  3. Value alignment. Good communication and collaboration skills.  Machine learning skills. Smart, reliable, and creative. Good at research. At present we are looking for a Principal ML Engineer and other senior roles.
  4. The ability to move quickly from theory to model to testing the model and back.
  1. What do you see as Aligned AI’s core output, and what is its success condition? What do you see the payoff curve being — i.e. if you solve 10% of the problem, do you get [0%|10%|20%] of the reward?
  2. I think a fresh AI safety approach may (or should) lead to fresh reframes on what AI safety is. Would your work introduce a new definition for AI safety?
  3. Value extrapolation may be intended as a technical term, but intuitively these words also seem inextricably tied to both neuroscience and phenomenology. How do you plan on interfacing with these fields? What key topics of confusion within neuroscience and phenomenology are preventing interfacing with these fields?
  4. I was very impressed by the nuance in your “model fragments” frame, as discussed at some past EAG. As best as I can recall, the frame was: that observed preferences allow us to infer interesting things about the internal models that tacitly generate these preferences, that we have multiple overlapping (and sometimes conflicting) internal models, and that it is these models that AI safety should aim to align with, not preferences per se. Is this summary fair, and does this reflect a core part of Aligned AI’s approach?

Finally, thank you for taking this risk.

Hey there! It is a risk, but the reward is great :-)

  1. Value extrapolation makes most other AI safety approaches easier (eg interpretability, distillation and amplification, low impact...). Many of these methods also make value extrapolation easier (eg interpretability, logical uncertainty,...). So I'd say the contribution is superlinear - solving 10% of AI safety our way will give us more than 10% progress.
  2. I think it already has reframed AI safety from "align AI to the actual (but idealised) human values" to "have an AI construct values that are reasonable extensions of human values".
  3. Can you be more specific here, with examples from those fields?
  4. I see value extrapolation as including almost all my previous ideas - it would be much easier to incorporate model fragments into our value function, if we have decent value extrapolation.

Great, thank you for the response.

On (3) — I feel AI safety as it’s pursued today is a bit disconnected from other fields such as neuroscience, embodiment, and phenomenology. I.e. the terms used in AI safety don’t try to connect to the semantic webs of affective neuroscience, embodied existence, or qualia. I tend to take this as a warning sign: all disciplines ultimately refer to different aspects of the same reality, and all conversations about reality should ultimately connect. If they aren’t connecting, we should look for a synthesis such that they do.

That’s a little abstract; a concrete example would be the paper “Dissecting components of reward: ‘liking’, ‘wanting’, and learning” (Berridge et al. 2009), which describes experimental methods and results showing that ‘liking’, ‘wanting’, and ‘learning’ can be partially isolated from each other and triggered separately. I.e. a set of fairly rigorous studies on mice demonstrating they can like without wanting, want without liking, etc. This and related results from affective neuroscience would seem to challenge some preference-based frames within AI alignment, but it feels there’s no ‘place to put’ this knowledge within the field. Affective neuroscience can discover things, but there’s no mechanism by which these discoveries will update AI alignment ontologies.

It’s a little hard to find the words to describe why this is a problem; perhaps that not being richly connected to other fields runs the risk of ‘ghettoizing’ results, as many social sciences have ‘ghettoized’ themselves.

One of the reasons I’ve been excited to see your trajectory is that I’ve gotten the feeling that your work would connect more easily to other fields than the median approach in AI safety.

Thanks, that makes sense.

I've been aware of those kinds of issues; what I'm hoping is that we can get a framework to include these subtleties automatically (eg by having the AI learn them from observations or from published human research) without having to put it all in by hand ourselves.

I am wondering what you think about the notion that people develop their values in response to the systems they exist in, which may be suboptimal; suboptimal values could then develop. For example, in a situation of scarcity or external abuse, people may seek to dominate others to keep safe, whereas in a scenario of abundance and overall consideration, people may seek to develop considerate relationships with others to increase their own and others' wellbeing. Assuming that some perceived scarcity and abuse currently exists in various environments, it could be suboptimal, from the perspective of humanity's long-term potential (if that is measured by overall enjoyment in pursuing 'the most good' objectives), to extrapolate values now, if these are reinforced by AI.

A solution could be to offer individuals an understanding of various situations and let them decide which ones they would prefer (e.g. a person in scarcity, offered an understanding of abundance, can select the ability to enjoy being enjoyed instead of the ability to threaten). This could work if all individuals are asked and all possibilities are shareably understood. Since this is challenging, an alternative is to ask persons for their perspective on an optimal system that they would like to see exist (rather than one which would benefit them personally), considering the objectives, under perfect awareness of alternatives, of all individuals.

What do you think about some of these thoughts on gathering values to extrapolate? Are you going to implement them, or look for research in this area of understanding values under overall consideration and perfect awareness of alternatives? I will also appreciate any comments on my Widespread values brainstorming draft, which was developed using this reasoning.

A problem here is that values that are instrumentally useful can become terminal values that humans value for their own sake.

For example, equality under the law is very useful in many societies, especially modern capitalistic ones; but a lot of people (me included) feel it has strong intrinsic value. In more traditional and low-trust societies, the tradition of hospitality is necessary for trade and other exchanges; yet people come to really value it for its own sake. Family love is evolutionarily adaptive, yet also something we value.

So just because some value has developed from a suboptimal system does not mean that it isn't worth keeping.

Ok, that makes sense. Rhetorically, how would one differentiate the terminal values worth keeping from those worth updating? For example, a hospitality 'requirement', versus the free ability to choose to be hospitable, versus the ability to choose environments with various attitudes to hospitality. I would really offer the emotional understanding of all options and let individuals freely decide. This should resolve the issue of persons favoring their environments due to limited awareness of alternatives or fear of the consequences of choosing an alternative. Then, you could get to more fundamental terminal values, such as the perception of living in a truly fair system (instead of equality under the law, which can still perpetuate some unfairness); the ability to interact only with those with whom one wishes to (instead of hospitality); and understanding others' preferences for interactions related to oxytocin, dopamine, and serotonin release, and choosing to interact with those whose preferences are mutual (instead of family love), for example. Anyway, thank you.

What are some practical/theoretical developments that would make your work much less/more successful than you currently expect? (4 questions in 1, but feel free to just answer the most salient for you)

Most of the alignment research pursued by other EA groups (eg Anthropic, Redwood, ARC, MIRI, the FHI,...) would be useful to us if successful (and vice versa: our research would be useful for them). Progress in inner alignment, logical uncertainty, and interpretability is always good.

Fast increase in AI capabilities might result in a superintelligence before our work is ready. If the top algorithms become less interpretable than they are today, this might make our work harder.

Whole brain emulations would change things in ways that are hard to predict, and could make our approach either less or more successful.

Hey :)

Disclaimer: I am no AI alignment expert, so consider skipping this comment and reading the quality ones instead. But there are no other comments yet so here goes:


If I understood correctly, 

  1. You want to train a model, based on a limited training dataset (like all models)
  2. To work reliably on inputs that are outside of the initial dataset
  3. By iterating on the model, refining it, every time it gets new inputs that were outside of the previously available dataset

It seems to me (not that I know anything!!) like the model might update in very bad-for-humans ways, even while being well "aligned" to the initial data, and to all iterations, regardless of how they're performed.

TL;DR: I think so because concept space is superexponential and because value is fragile.

Imagine we are very stupid humans [0], and we give the AI some training data from an empty room containing a chess board, and we tell the AI which rooms-with-chess-boards are better for us. And the AI learns this well and everyone is happy (except for the previous chess world champion).

And then we run the AI and it goes outside the room and sees things very different from its training data.

Even if the AI notices the difference and alerts the humans,

  1. The humans can't review all that data
  2. The humans don't understand all the data (or concepts) that the AI is building
  3. The humans probably think that they trained the AI on the morally important information, and the humans think that the AI is using a good process for extrapolating value, if I understood you correctly

And then the AI proceeds to act on models far beyond what it was trained on, and so regardless of how it extrapolated, that was an impossible task to begin with, and it probably destroys the world.

What am I missing?



Why did I use the toy empty-room-with-chess story?

Because part of the problem that I am trying to point out is "imagine how a training dataset can go wrong", but it will never go wrong if for every missing-thing-in-the-dataset that we can imagine, we automatically imagine that the dataset contains that thing.

An AI that is aware that value is fragile will behave in a much more cautious way. This gives a different dynamic to the extrapolation process.



So ok, the AI knows that some human values are unknown to the AI.

What does the AI do about this?

The AI can do some action that maximizes the known-human-values, and risk hurting others.

The AI can do nothing and wait until it knows more (wait how long? There could always be missing values).
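A middle ground between the two options above can be made concrete with a sketch: maximise the known values, but veto any action that drops some candidate (possibly-unknown) human value below a floor, and fall back to doing nothing when no action is clearly safe. This is purely illustrative; the values, actions, and floor are invented for the example, not drawn from the article:

```python
# Illustrative sketch: maximise known value subject to a floor on every
# candidate unknown value. All names and numbers are made-up placeholders.

def pick_action(actions, known_value, candidate_values, floor=0.0):
    """Return the best action by known value among those that keep every
    candidate value at or above the floor; None if no action qualifies."""
    safe = [a for a in actions
            if all(v(a) >= floor for v in candidate_values)]
    if not safe:
        return None  # "do nothing and wait": no action is clearly safe
    return max(safe, key=known_value)

actions = ["aggressive_plan", "modest_plan", "noop"]
known   = lambda a: {"aggressive_plan": 1.0, "modest_plan": 0.6, "noop": 0.0}[a]
# One candidate value the AI is unsure humans hold (e.g. privacy):
privacy = lambda a: {"aggressive_plan": -0.5, "modest_plan": 0.2, "noop": 0.0}[a]

print(pick_action(actions, known, [privacy]))  # "modest_plan"
```

The aggressive plan scores highest on the known values but is vetoed because it drops the candidate value below the floor, so the agent settles for the modest plan; if every plan violated some candidate value, it would wait instead.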


Something I'm not sure I understood from the article:

Does the AI assume that the AI is able to list all the possible values that humans maybe care about? Is this how the AI is supposed to guard against any of the possible-human-values from going down too much?

"We think AI poses an existential risk to humanity"

I'm struggling to understand why someone would believe this. What are some good resources to learn more about why I would?

Nick Bostrom's "Superintelligence" is an older book, but still a good overview. Stuart Russell's "Human Compatible" is a more modern take. I touch upon some of the main issues in my talk here. Paul Christiano's excellent "What Failure Looks Like" tackles the argument from another angle.

I'd like to more fully understand why you've made this a for-profit company instead of a charity. From your other post:

If we believe we can commercialize a successful sub-project responsibly (without differentially enhancing AI capabilities), it will be incorporated into our product and marketed to potential adopters (e.g. tech companies meeting regulatory requirements for fairness, robustness, etc).

Are there other roads to profit that you're considering? Is this the main one? How much does the success of this approach (or others) hinge on governments adopting particular legislation or applying particular regulations? In other words, if governments don't regulate the thing you're solving, why would companies still buy your product?

No worries if you don't want to say much at this time. I'm excited for this project regardless, it seems like a novel and promising approach!

Thanks for the great questions, Sawyer!

I'd like to more fully understand why you've made this a for-profit company instead of a charity. 


When Stuart and I were collaborating on AI safety research, I'd occasionally ask him, 'So what's the plan for getting alignment research incorporated into AIs being built, once we have it?' He'd answer that DeepMind, OpenAI, etc would build it in. Then I'd say, 'But what about everybody else?' Aligned AI is our answer to that question.

We also want to be able to bring together a brilliant, substantial team to work on these problems. A lot of brilliant minds choose the earning-to-give route, and we think it would be fantastic to be a place where people can go that route and still work at an aligned organisation.

Are there other roads to profit that you're considering? Is this the main one? How much does the success of this approach (or others) hinge on governments adopting particular legislation or applying particular regulations? In other words, if governments don't regulate the thing you're solving, why would companies still buy your product?

The "etc" here doesn't refer just to "other regulations", but also to "other ways that unsafe AI cause costs and risks to companies".

I like to use the analogy of CAD (computer-aided design) software for building skyscrapers and bridges. It's useful even without regulations, because engineers like building skyscrapers and bridges that don't fall down. We can be useful in the same sort of way for AI (companies like profit, but they also like reducing expenses, such as costs for PR and settlements when things go wrong).

We're starting with research - with the AI equivalent of developing principles that civil engineers can use to build taller safe skyscrapers and longer safe bridges, to build into our CAD-analogous product.
