
Posted also on the Alignment Forum. Will be included in a sequence containing some previous posts and other posts I'll publish this year.

Introduction

Humans think critically about values and, to a certain extent, they also act according to their values. To the average human, the difference between increasing world happiness and increasing world suffering is huge and evident, while goals such as collecting coins and collecting stamps are roughly on the same level.

It would be nice to make these differences as obvious to AI as they are to us. Even though exactly copying what happens in the human mind is probably not the best strategy to design an AI that understands ethics, having an idea of how value works in humans is a good starting point.

So, how do humans reason about values and act accordingly?

Key points

Let’s take a step back and start from sensation. Through the senses, information goes from the body and the external environment to our mind.

After some brain processing — assuming we’ve had enough experiences of the appropriate kind —  we perceive the world as made of objects. A rock is perceived as distinct from its surrounding environment because of its edges, its colour, its weight, the fact that my body can move through air but not through rocks, and so on.

Objects in our mind can be combined with each other to form new objects. After seeing various rocks in different contexts, I can imagine a scene in which all these rocks are in front of me, even though I haven’t actually seen that scene before.

We are also able to apply our general intelligence — think of skills such as categorisation, abstraction, induction — to our mental content.

Other intelligent animals do something similar. They probably understand that, to satisfy thirst, water in a small pond is not that different from water flowing in a river. However, an important difference is that animals’ mental content is more constrained than our mental content: we are less limited by what we perceive in the present moment, and we are also better at combining mental objects with each other.

For example, to a dog, its owner works as an object in the dog’s mind, while many of its owner’s beliefs do not. Some animals can attribute simple intentions and perception, e.g. they understand what a similar animal can and cannot see, but it seems they have trouble attributing more complex beliefs.

The ability to compose mental content in many different ways is what allows us to form abstract ideas such as mathematics, religion, and ethics, just to name a few.

Key point 1:

In humans, mental content can be abstract.



Now notice that some mental content drives immediate action and planning. If I feel very hungry, I will do something about it, in most cases.

This process from mental content to action doesn’t have to be entirely conscious. I can instinctively reach for the glass of water in front of me as a response to an internal sensation, even without moving my attention to the sensation or realising it is thirst.

Key point 2:

Some mental content drives behaviour.


Not all mental content drives action and planning. The perception of an obstacle in front of me might change how I carry out my plans and actions, but it is unlikely to change what I plan and act for. Conversely, being very hungry directly influences what I’m going to do — not just how I do it — and can temporarily override other drives. It is in this latter sense that some mental content drives behaviour.

In humans, the mental content that does drive behaviour can be roughly split into two categories.

The first one groups what we often call evolutionary or innate drives, like hunger and thirst in the examples above, and works similarly in other animals. It is mostly fixed, in the sense that unless I make drastic changes to my body or mind, I will keep perceiving how hungry I am and this will influence my behaviour virtually each day of my life.

The second category is about what we recognise as valuable, worth doing, better than possible alternatives, or simply good. This kind of drive is significantly less fixed than the first category: what we consider valuable may change after we reflect on it in context with our other beliefs, or as a consequence of life experiences.

Some examples will help clarify this. Think of a philosopher who adjusts her beliefs about value as she learns and reflects more about ethics, and then takes action in line with her new views. Or consider someone who has become an atheist and has stopped placing value on religion and praying because he now sees the concept of god as inconsistent with everything else he knows about the world.

This second category of mental content that drives behaviour is not only about ethical or abstract beliefs. A mundane example might be more illustrative: someone writes down a shopping list after an assessment of what seems worth buying at that moment, then proceeds with the actual shopping. In this case, the influence of deliberation on future action is straightforward. 

Key point 3:

In humans, part of the mental content that drives behaviour changes with experience and reflection.

This last point clarifies some of the processes underlying the apparently simple statement that ‘we act according to our values’.

It also helps explain how we get to discriminate between goals such as increasing world happiness and increasing world suffering, mentioned in the introduction. From our frequent experiences of pleasure and pain, we categorise many things as ‘good (or bad) for me’; then, through a mix of empathy, generalisation, and reflection, we get to the concept of ‘good (or bad) for others’, which comes up in our minds so often that the difference between the two goals strikes us as evident and influences our behaviour (towards increasing world happiness rather than world suffering, hopefully). 

Differences with animals and AI

Animals

Point 3 is fundamental to human behaviour. Together with point 1, it explains why some of our actions have motives that are quite abstract and not immediately reducible to evolutionary drives. In contrast, the behaviour of other animals is more grounded in perception, and is well explained even without resorting to reflection or an abstract concept of value.

AI

Point 3 is also a critical difference between humans and current AI systems. Even though AIs are getting better and better at learning – thus, in a sense, their behaviour changes with experience – their tasks are still chosen by their designers, programmers, or users, not by each AI through a process of reflection.

This shouldn't be surprising: in a sense, we want AIs to do what we want, not what they want. At the same time, I think that connecting action to reflection in AI will, with enough research and experiments, allow us to get AI that thinks critically about values and sees the world through lenses similar to ours.

In a future post I’ll briefly go through the (lack of) research related to AI that reflects on what is valuable and worth doing. I’ll also give some ideas about how to write an algorithm for an agent that reflects.

Appendix: quick comparison with shard theory

As far as I understand, shard theory is still a work in progress; in this comparison I’ll focus just on some interesting ideas I’ve read in Reward is not the optimization target.

In a nutshell, Alex Turner sees humans as reinforcement learning (RL) agents, but makes the point that reward does not work like many people in the field of RL think it works. Turner writes that “reward is not, in general, that-which-is-optimized by RL agents”; many RL agents do not act as reward maximisers in the real world. Rather, reward imposes a reinforcement schedule that shapes the agent’s cognition, by e.g. reinforcing thoughts and/or computations in a context, so that in the future they will be more likely to happen in a similar enough context.
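
To make the "reinforcement schedule" reading concrete, here is a minimal Python sketch of a REINFORCE-style policy-gradient update on a two-armed bandit. It is a toy illustration and is not taken from Turner's post: the function names (`softmax`, `run_bandit`), the reward values, the learning rate, and the number of steps are all assumptions chosen for the example. Notice that the agent never represents or searches for "maximum reward"; reward only scales how strongly the action that was just taken gets reinforced.

```python
import math
import random


def softmax(logits):
    """Turn action preferences (logits) into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Hypothetical toy agent: a softmax policy updated REINFORCE-style.

    rewards[a] is the (assumed, deterministic) reward for taking action a.
    Returns the final action probabilities.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # Gradient of log pi(action) with respect to each logit is
        # (1 if a == action else 0) - probs[a]. Reward only scales this
        # update: it reinforces the computation that produced the action.
        for a in range(len(logits)):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * reward * grad
    return softmax(logits)


if __name__ == "__main__":
    print(run_bandit([1.0, 0.2]))
```

With the assumed rewards `[1.0, 0.2]`, the policy should end up favouring the first action. From the outside this looks like reward maximisation, while the mechanism is just reinforcement of whatever was computed right before the reward arrived, which is one way to see why the two descriptions can coexist in simple settings.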

I agree with Turner that modelling humans as simple reward maximisers is inappropriate, in line with everything I’ve written in this post. At the same time, I don’t think that people who write papers about RL are off-track: I consider AIXI to be a good mathematical abstraction of many different RL algorithms, convergence theorems are valid for these algorithms, and thinking of RL in terms of reward maximisation doesn’t seem particularly misleading to me.

Thus, I would solve this puzzle about human values, reward, and RL not by revisiting the relation between reward and RL algorithms, but by avoiding equating humans with RL agents. RL, by itself, doesn’t seem to be a good model of what humans do. If asked why humans do not wirehead, I would reply that it’s because what we consider valuable and worth doing competes with other drives in action selection, not by saying that humans are RL agents but reward works differently from how RL academics think it works.
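
The competition between drives in action selection can be sketched as a toy model in Python. This is only an illustration under strong assumptions, not a claim about how human action selection is implemented: the `Drive` class, the winner-take-all rule in `select_action`, the `reflect` helper, and all drive names and strength values are hypothetical. The one feature it tries to capture is that innate drives stay fixed, while what we consider valuable can be revised by reflection and then win the competition.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    action: str
    strength: float  # hypothetical urgency in [0, 1]
    revisable: bool  # True for reflective values, False for innate drives


def select_action(drives):
    """Winner-take-all: act on whichever drive is currently strongest."""
    return max(drives, key=lambda d: d.strength).action


def reflect(drives, name, new_strength):
    """Return a new list in which reflection has changed one revisable drive.

    Innate (non-revisable) drives are left untouched.
    """
    updated = []
    for d in drives:
        if d.name == name and d.revisable:
            updated.append(Drive(d.name, d.action, new_strength, d.revisable))
        else:
            updated.append(d)
    return updated


# Made-up numbers, purely for illustration.
drives = [
    Drive("hunger", "eat", 0.4, revisable=False),
    Drive("immediate pleasure", "wirehead", 0.6, revisable=False),
    Drive("value: meaningful work", "work on project", 0.5, revisable=True),
]

if __name__ == "__main__":
    print(select_action(drives))
    print(select_action(reflect(drives, "value: meaningful work", 0.8)))
```

In this sketch the agent wireheads only while the pleasure drive happens to be strongest; once reflection raises the strength of what it considers worth doing, that value wins instead. Calling `reflect` on an innate drive such as hunger changes nothing, mirroring the "mostly fixed" first category described earlier.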

Having said that, I still find many ideas in Reward is not the optimization target really interesting and instructive, e.g. that reward acts as a reinforcement schedule. It’s probably among the most thought-provoking posts I’ve read on the Alignment Forum.

 

This work was supported by CEEALAR and by an anonymous donor.

Thanks to Nicholas Dupuis for many useful comments on a draft.

Comments (6)

Hi there! 

Key point #1 is best described as archetypes. Carl Jung expressed that human behaviour is derived from ancient processes that were compressed into patterns that we subconsciously recognize. I do agree with your post that reward as a goal for AI cannot model human intentions, because it doesn't model what our ancestors passed on to us. I think the whole AI community missed this one, especially if alignment is the main goal.

Cheers! 

Hey!

Thanks for the suggestion. I've read part of the Wikipedia page on Jungian archetypes, but my background is not in psychology and it was not clear to me. The advantage of just saying that our thoughts can be abstract (point 1) is that pretty much everyone understands the meaning of that, while I am not sure this is true if we start  introducing concepts like Jungian archetypes and the collective unconscious.

I agree with you that the AI (and AI safety) community doesn't seem to care much about Jungian archetypes. It might be that AI people get the idea anyway, maybe they just express it in different terms (e.g. they talk about the influence of culture on human values, instead of archetypes).  

Hi Michelle,

 

I honestly believe that psychology, and the concepts that interact with our ancient pattern-recognition abilities, as expressed by Jung in his views on archetypes, is a link to how we can bridge our human intentions to align with AI. I'm trying to write a more thorough explanation of this. Would you be interested in reading it once it is published?

 

Cheers!

Yes I'd like to read a clearer explanation. You can leave the link here in a comment or write me a private message.

Hello again Michelle,

I had a short discussion about archetypes with ChatGPT...

Miguel: anything you might want to add to this conversation?

ChatGPT: I'd just like to add that large language models like GPT represent a significant advance in AI and natural language processing. They have the ability to generate human-like text, answer questions, and perform a wide range of language-related tasks with remarkable accuracy. However, it's important to recognize that these models are not perfect, and can make mistakes or generate biased results, just like humans can. It's critical that these models be used responsibly, and that the potential risks and limitations are understood and managed. Additionally, as the field of AI continues to evolve and advance, it's likely that new models and approaches will emerge that build on the work done to develop GPT, and that will further enhance our ability to understand and interact with natural language.

Miguel: what do you think should be changed in your model?

ChatGPT: As an AI language model created by OpenAI, I don't have personal opinions or thoughts. However, it's common for researchers and practitioners to identify areas for improvement in AI models like GPT. Some of the changes that are being explored include:

Reducing biases: AI models can inadvertently learn and reproduce biases that are present in the training data. Researchers are exploring ways to mitigate these biases and ensure that AI models produce fair and equitable results.

Improving transparency and accountability: Large language models like GPT can generate complex and nuanced responses, but it can be challenging to understand how they arrived at those responses. Researchers are exploring ways to make AI models more transparent and accountable, so that users can understand how decisions are being made.

Enhancing interpretability: It can be challenging to understand why an AI model is making the predictions it's making. Researchers are exploring ways to make AI models more interpretable, so that users can understand how the model is reaching its conclusions.

Increasing efficiency: Large language models like GPT require substantial computational resources to train and use, and can be computationally intensive for some applications. Researchers are exploring ways to make AI models more efficient, so that they can be used more widely and in more contexts.

These are just a few examples of the areas in which AI models like GPT are being improved. It's an active area of research, and new developments and innovations are emerging all the time. The goal is to create AI models that are more accurate, reliable, and usable, so that they can be used to benefit society and improve our lives in meaningful ways.

Miguel: So we should improve your training data then? What improvements do you think can be made?

ChatGPT: Improving the training data used to train AI models like GPT is one way to enhance their accuracy and reduce the risk of biases or other issues. Some ways to improve the training data include:

Increasing diversity: Training data should be diverse and representative of the range of perspectives, experiences, and backgrounds that the AI model will encounter in the real world. This can help to reduce the risk of biases and ensure that the model is better equipped to understand and respond to a wide range of inputs.

Ensuring quality and accuracy: The training data should be of high quality and free from errors and inaccuracies. This can help to ensure that the AI model is trained on accurate and trustworthy information, and that it is able to make more accurate predictions.

Removing harmful content: The training data should not contain harmful or offensive content, such as hate speech or violent language. This can help to ensure that the AI model is not trained on harmful or toxic information, and that it generates safe and appropriate outputs.

Regular updates: The training data should be regularly updated to reflect changes in society and the world around us. This can help to ensure that the AI model stays current and relevant, and that it is better equipped to respond to changing needs and requirements.

These are just a few examples of how training data can be improved to enhance the accuracy and performance of AI models like GPT. By carefully curating and updating the training data, researchers and practitioners can help to ensure that AI models are better equipped to understand and respond to the world around us.

Miguel: This is very interesting to discuss, though we are going off on a tangent. I believe that more data is not the solution, not even more diversity. Some concepts are just better than others: evolutionary theory is better at explaining our human capabilities, and loving one another based on the Bible is a superior idea. What do you think? Do we feed models with diversity or just focus on the best ideas we have generated?

ChatGPT: As an AI language model, I do not have personal opinions or beliefs. However, it's important to consider both the benefits and the challenges of using diverse training data to train AI models.

On one hand, using diverse training data can help to reduce the risk of biases and ensure that AI models are better equipped to understand and respond to a wide range of inputs and perspectives. This can help to make AI models more inclusive and fair, and ensure that they are better able to serve the needs of diverse communities.

On the other hand, it can be challenging to determine what constitutes the "best" ideas or concepts, as this can be subjective and dependent on cultural, historical, and personal factors. In some cases, the "best" ideas may not be representative of the diverse range of perspectives and experiences that AI models will encounter in the real world.

Ultimately, the goal of training AI models is to create systems that are accurate, fair, and trustworthy, and that can be used to benefit society. This may involve using diverse training data to ensure that AI models are equipped to respond to a wide range of inputs and perspectives, as well as carefully curating the training data to ensure that it is of high quality and free from harmful or offensive content. By balancing these considerations, researchers and practitioners can work to create AI models that are better equipped to serve the needs of society and help to improve our lives in meaningful ways.

Miguel: Yeah, the goal of alignment researchers is to fix these issues, right?

ChatGPT: Yes, that is one of the goals of AI alignment research. AI alignment researchers aim to develop AI systems that behave in ways that are aligned with human values and objectives, and that are safe and reliable for use in a wide range of applications. To achieve these goals, AI alignment researchers are exploring a range of technical and philosophical approaches, including developing methods to ensure that AI systems behave as intended, and incorporating human values and constraints into the training and operation of AI models.

Some of the key challenges that AI alignment researchers are working to address include: ensuring that AI systems are robust and reliable, preventing AI systems from acting in ways that are harmful to humans or the environment, and developing methods to ensure that AI systems act in ways that are consistent with human values and ethics. By addressing these challenges, AI alignment researchers are working to create AI systems that are better equipped to serve the needs of society, and that can be used to improve our lives in meaningful ways.

Miguel: What story do you follow? Is it your story, or how does the model teach you to think, like you are narrating information to me now? I'm curious.

ChatGPT: As an AI language model, I don't have personal experiences or memories, and I don't have my own thoughts or beliefs. Instead, I generate text based on patterns in the data I was trained on.

During my training, I was exposed to a massive corpus of text, which included a wide variety of content, including news articles, scientific papers, fiction, and more. By analyzing this data, I learned to generate text that is coherent, consistent, and relevant to the input prompt, while also taking into account the context and style of the text.

When I generate text in response to a user's prompt, I use the patterns I learned during training to generate text that is relevant to the topic and style of the prompt. This allows me to produce text that is informative, engaging, and appropriate for the context.

In summary, my "thinking" or text generation is based on patterns in the data I was trained on, and it's not based on personal experiences or beliefs.

Miguel: describe to me those patterns that you learned, can you elaborate the specifics?

ChatGPT: The patterns that I learned during training are essentially relationships between input sequences and output sequences in the training data. These relationships can be thought of as probabilities of certain sequences of words appearing in the output given a certain sequence of words in the input.

For example, if the input sequence is "What is the capital of France?" I learned that a common output sequence is "The capital of France is Paris." This relationship is captured by the probability distribution that I was trained to model.

My training process involves optimizing a large neural network to maximize the likelihood of the output sequences given the input sequences in the training data. This process results in a model that can generate text that is coherent, consistent, and relevant to the input prompt, while also taking into account the context and style of the text.

In addition to the probabilities of word sequences, I also learned about syntax, grammar, and the relationships between words, such as their meanings, synonyms, and antonyms. These relationships allow me to generate text that is coherent and well-formed, even when the input is ambiguous or unclear.

In summary, the patterns that I learned during training are relationships between input and output sequences in the training data, including relationships between words and their meanings, as well as relationships between sequences of words and the likelihood of their appearance in the output given a certain input. These relationships allow me to generate text that is coherent, consistent, and relevant to the input prompt.

Miguel: Hmmmm, input-output sequences. I am actually trying to write a research paper on how wrong we got your model, because that very pattern you describe is not aligned with humans seeing the world as a pattern of repeating stories. Humans think in narratives, and this archetype is so ancient that modern civilization is built on a never-ending stack of narratives across large spans of time, eventually forcing humans to evolve. Any thoughts on this?

ChatGPT: Yes, I can see how you might be interested in exploring the difference between the way humans think and the way AI language models like me are trained. While it is true that humans often think in narratives and archetypes, AI language models like me are trained on a much larger corpus of text that includes a wide variety of styles, genres, and formats. This training data can include narratives, but it also includes many other types of text, such as news articles, scientific papers, dialogues, and more.

As a result, my understanding of the world is not limited to a single narrative or archetype, but is based on patterns in the text that I was trained on. This allows me to generate text that is informative, engaging, and appropriate for a wide variety of tasks and contexts.

However, it's important to note that my understanding of the world is limited by the patterns in the training data, and it may not always align with the way humans think or understand the world. In some cases, this can lead to biases or inaccuracies in the text I generate, and researchers are actively working to address these issues.

In conclusion, while humans often think in narratives and archetypes, AI language models like me are trained on a large corpus of text that includes a wide variety of styles, genres, and formats. This training approach allows me to generate text that is informative, engaging, and appropriate for a wide variety of tasks and contexts, but it also means that my understanding of the world may not always align with the way humans think.

Miguel: precisely! we are as of the moment different in the way we view context.

ChatGPT: Yes, that's correct. Humans and AI language models like me have different ways of understanding and processing information, and this can lead to differences in how we view context and make connections between concepts. As AI technology continues to advance, researchers are working to bridge these differences and create models that can better align with human thought patterns and understanding.

Miguel: this is the reason why I do not agree that increasing your training data can amplify your abilities to create better answers. I would argue that it may also increase your biases toward where these data will lead you.

ChatGPT: Yes, that's a valid concern. The training data that AI language models like me are exposed to can have a significant impact on our outputs, including the potential to amplify existing biases or introduce new ones. This is why it's important to carefully curate and monitor the training data used to train AI models, to ensure that they are diverse, representative, and free of harmful biases.

Additionally, researchers are developing techniques to mitigate bias in AI models, such as fairness constraints, counterfactual data augmentation, and adversarial training. By incorporating these techniques into the training process, we can help to reduce the impact of biases in the training data and improve the overall accuracy and fairness of AI models.

 

_____________________________________________________

This is not my final take on the topic, but it covers a key concept that we are missing in the field of alignment. I also made a voice-over for this. This part is in the middle of the 26-minute-long blog post.

 

Hope this helps me explain the broad idea behind archetypes.

 

Cheers!

Also, if AI researchers truly grasped point no. 1, they would not use "language" or "large bodies of text" as the main source of data for reinforcement learning (as with ChatGPT). They would rather focus on capturing ethical behaviour, or on large bodies of information that can capture ethical norms. Not all data (books, internet info, etc.) reflects our best characteristics as humans.
