
Overview (TL;DR):

Shard Theory is a new approach to understanding the formation of human values, which aims to help solve the problem of how to align advanced AI systems with human values (the ‘AI alignment problem’). Shard Theory has provoked a lot of interest and discussion on LessWrong, AI Alignment Forum, and EA Forum in recent months. However, Shard Theory incorporates a relatively Blank Slate view about the origins of human values that is empirically inconsistent with many studies in behavior genetics indicating that many human values show heritable genetic variation across individuals. I’ll focus in this essay on the empirical claims of Shard Theory, the behavior genetic evidence that challenges those claims, and the implications for developing more accurate models of human values for AI alignment.

Introduction: Shard Theory as a falsifiable theory of human values

The goal of the ‘AI alignment’ field is to help future Artificial Intelligence systems become better aligned with human values. Thus, to achieve AI alignment, we might need a good theory of human values. A new approach called “Shard Theory” aims to develop such a theory of how humans develop values. 

My goal in this essay is to assess whether Shard Theory offers an empirically accurate model of human value formation, given what we know from behavior genetics about the heritability of human values. The stakes here are high. If Shard Theory becomes influential in guiding further alignment research, but if its model of human values is not accurate, then Shard Theory may not help improve AI safety. 

These kinds of empirical problems are not limited to Shard Theory. Many proposals that I’ve seen for AI ‘alignment with human values’ seem to ignore most of the research on human values in the behavioral and social sciences. I’ve tried to challenge this empirical neglect of value research in four previous essays for EA Forum, on the heterogeneity of value types in human individuals, the diversity of values across individuals, the importance of body/corporeal values, and the importance of religious values.

Note that this essay is a rough draft of some preliminary thoughts, and I welcome any feedback, comments, criticisms, and elaborations. In future essays I plan to critique Shard Theory from the perspectives of several other fields, such as evolutionary biology, animal behavior research, behaviorist learning theory, and evolutionary psychology.

Background on Shard Theory

Shard Theory has been developed mostly by Quintin Pope (a computer science Ph.D. student at Oregon State University) and Alex Turner (a post-doctoral researcher at the Center for Human-Compatible AI at UC Berkeley). Over the last few months, they posted a series of essays about Shard Theory on LessWrong.com, including the main essay, ‘The shard theory of human values’ (dated Sept 3, 2022), plus auxiliary essays such as: ‘Human values & biases are inaccessible to the genome’ (July 7, 2022), ‘Humans provide an untapped wealth of evidence about alignment’ (July 13, 2022), ‘Reward is not the optimization target’ (July 24, 2022), and ‘Evolution is a bad analogy for AGI: Inner alignment’ (Aug 13, 2022). [This is not a complete list of their Shard Theory writings; it’s just the set that seems most relevant to the critiques I’ll make in this essay.] Also, David Udell published a useful summary: ‘Shard Theory: An overview’ (Aug 10, 2022).

There’s a lot to like about Shard Theory. It takes seriously the potentially catastrophic risks from AI. It understands that ‘AI alignment with human values’ requires some fairly well-developed notions about where human values come from, what they’re for, and how they work. It is intellectually ambitious, and tries to integrate reinforcement learning, self-supervised predictive learning, decision theory, developmental psychology, and cognitive biases. It seeks to build some common ground between human intelligence and artificial intelligence, at the level of how complex cognitive systems develop accurate world models and useful values. It tries to be explicit about its empirical commitments and theoretical assumptions. It is open about being a work-in-progress rather than a complete, comprehensive, or empirically validated theory. It has already provoked much discussion and debate.

Even if my critiques of Shard Theory are correct, and some of its key evolutionary, genetic, and psychological assumptions are wrong, that isn’t necessarily fatal to the whole Shard Theory project. I imagine some form of Shard Theory 2.0 could be developed that updates its assumptions in the light of these critiques, and that still makes some progress in developing a more accurate model of human values that is useful for AI alignment.

Shard Theory as a Blank Slate theory

However, Shard Theory includes a model of human values that is not consistent with what behavioral scientists have learned about the origins and nature of values over the last 170 years of research in psychology, biology, animal behavior, neurogenetics, behavior genetics, and other fields.

The key problem is that Shard Theory re-invents a relatively ‘Blank Slate’ theory of human values. Note that no Blank Slate theory posits that the mind is 100% blank. Every Blank Slate theory that’s even marginally credible accepts that there are at least a few ‘innate instincts’ and some ‘hardwired reward circuitry’. Blank Slate theories generally accept that human brains have at least a few ‘innate reinforcers’ that can act as a scaffold for the socio-cultural learning of everything else. For example, even the most radical Blank Slate theorists would generally agree that sugar consumption is reinforcing because we evolved taste receptors for sweetness. 

The existence of a few innate reinforcement circuits was accepted even by the most radical Behaviorists of the 1920s through 1960s, and by the most ‘social constructivist’ researchers in the social sciences and humanities from the 1960s onwards. Blank Slate theorists just try to minimize the role of evolution and genetics in shaping human psychology, and strongly favor Nurture over Nature in explaining both psychological commonalities across sentient beings, and psychological differences across species, sexes, ages, and individuals. Historically, Blank Slate theories were motivated not so much by empirical evidence, as by progressive political ideologies about the equality and perfectibility of humans. (See the 2002 book The Blank Slate by Steven Pinker, and the 2000 book Defenders of the Truth by Ullica Segerstrale.)

Shard Theory seems to follow in that tradition – although I suspect that it’s not so much due to political ideology, as to a quest for theoretical simplicity, and for not having to pay too much attention to the behavioral sciences in chasing AI alignment.

At the beginning of their main statement of Shard Theory, in their TL;DR, Pope and Turner include this bold statement: “Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry.” 

Then they make three explicit neuroscientific assumptions. I’ll focus on Assumption 1 of Shard Theory: “Most of the circuits in the brain are learned from scratch, in the sense of being mostly randomly initialized and not mostly genetically hard-coded.”

This assumption is motivated by the argument that ‘human values and biases are inaccessible to the genome’. For example, Alex Turner argues “it seems intractable for the genome to scan a human brain and back out the “death” abstraction, which probably will not form at a predictable neural address. Therefore, we infer that the genome can’t directly make us afraid of death by e.g. specifying circuitry which detects when we think about death and then makes us afraid. In turn, this implies that there are a lot of values and biases which the genome cannot hardcode.”

This Shard Theory argument seems to reflect a fundamental misunderstanding of how evolution shapes genomes to produce phenotypic traits and complex adaptations. The genome never needs to ‘scan’ an adaptation and figure out how to reverse-engineer it back into genes. The genetic variants simply build a slightly new phenotypic variant of an adaptation, and if it works better than existing variants, then the genes that built it will tend to propagate through the population. The flow of design information is always from genes to phenotypes, even if the flow of selection pressures is back from phenotypes to genes. This one-way flow of information from DNA to RNA to proteins to adaptations has been called the ‘Central Dogma of molecular biology’, and it still holds largely true (the recent hype about epigenetics notwithstanding). 
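The logic of this paragraph can be illustrated with a toy simulation. This is a minimal sketch, not a model of any real genome: the names `develop` and `evolve`, the additive genotype-to-phenotype map, the target value, and all parameter values are hypothetical choices made for illustration. Design information flows only from genotype to phenotype through `develop`; selection then acts on phenotypes, and nothing in the code ever 'scans' a phenotype back into genes.

```python
import random


def develop(genotype):
    """Genes -> phenotype. This is the only direction design information flows."""
    return sum(genotype)


def evolve(target=10.0, pop_size=200, n_genes=20, generations=100, seed=0):
    """Toy truncation selection; returns mean phenotypic error per generation."""
    rng = random.Random(seed)
    # Start with random genotypes whose phenotypes are far from the target.
    pop = [[rng.gauss(0, 0.1) for _ in range(n_genes)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Selection "sees" only phenotypes: rank by distance from the target.
        pop.sort(key=lambda g: abs(develop(g) - target))
        history.append(sum(abs(develop(g) - target) for g in pop) / pop_size)
        parents = pop[: pop_size // 2]
        # Each surviving genotype leaves two offspring with small random mutations.
        pop = [
            [allele + rng.gauss(0, 0.05) for allele in parent]
            for parent in parents
            for _ in range(2)
        ]
    return history


history = evolve()
print(round(history[0], 2), round(history[-1], 2))
```

The mean distance from the target shrinks across generations even though no step ever reads a phenotype and writes it back into a genotype; random variants are simply retained or discarded according to how well the phenotypes they build happen to work.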

Shard Theory implies that biology has no mechanism to ‘scan’ the design of fully-mature, complex adaptations back into the genome, and therefore there’s no way for the genome to code for fully-mature, complex adaptations. If we take that argument at face value, then there’s no mechanism for the genome to ‘scan’ the design of a human spine, heart, hormone, antibody, cochlea, or retina, and there would be no way for evolution or genes to influence the design of the human body, physiology, or sensory organs. Evolution would grind to a halt – not just at the level of human values, but at the level of all complex adaptations in all species that have ever evolved. 

As we will see, this idea that ‘human values and biases are inaccessible to the genome’ is empirically incorrect.

A behavior genetic critique of Shard Theory

In future essays, I plan to address the ways that Shard Theory, as presently conceived, is inconsistent with findings from several other research areas: (1) evolutionary biology models of how complex adaptations evolve, (2) animal behavior models of how nervous systems evolved to act in alignment with fitness interests, (3) behaviorist learning models of how reinforcement learning and reward systems operate in animals and humans, and (4) evolutionary psychology models of human motivations, emotions, preferences, morals, mental disorders, and personality traits.

For now, I want to focus on some conflicts between Shard Theory and behavior genetics research. As mentioned above, Shard Theory adopts a relatively ‘Blank Slate’ view of human values, positing that we inherit only a few simple, crude values related to midbrain reward circuitry, which are presumably universal across humans, and all other values are scaffolded and constructed on top of those.

However, behavior genetics research over the last several decades has shown that most human values that differ across people, and that can be measured reliably – including some quite abstract values associated with political, religious, and moral ideology – are moderately heritable. Moreover, many of these values show relatively little influence from ‘shared family environment’, which includes all of the opportunities and experiences shared by children growing up in the same household and culture. This means that genetic variants influence the formation of human values, that genetic differences between people explain a significant proportion of the differences in their adult values, and that family environment explains a lot less about differences in human values than we might have thought. This research is based on convergent findings using diverse methods such as twin studies, adoption studies, extended twin family designs, complex segregation analysis, and genome-wide association studies (GWAS). All of these behavior genetic observations are inconsistent with Shard Theory, particularly its Assumption 1.

Behavior genetics was launched in 1869 when Sir Francis Galton published his book Hereditary Genius, which proposed some empirical methods for studying the inheritance of high levels of human intelligence. A decade earlier, Galton’s cousin Charles Darwin had published the theory of evolution by natural selection, which focused on the interplay of heritable genetic variance and evolutionary selection pressures. Galton was interested in how scientists might analyze heritable genetic variance in human mental traits such as intelligence, personality, and altruism. He understood that Nature and Nurture interact in very complicated ways to produce species-typical human universals. However, he also understood that it was an open question how much variation in Nature versus variation in Nurture contributed to individual differences in each trait.

Note that behavior genetics was always about explaining the factors that influence statistical variation in quantitative traits, not about explaining the causal, mechanistic development of traits. This point is often misunderstood by modern critics of behavior genetics who claim ‘every trait is an inextricable combination of Nature and Nurture, so there’s no point in trying to partition their influence.’ The mapping from genotype (the whole set of genes in an organism) to phenotype (the whole set of body, brain, and behavioral traits in an organism) is, indeed, extremely complicated and remains poorly understood. However, behavior genetics doesn’t need to understand the whole mapping; it can trace how genetic variants influence phenotypic trait variants using empirical methods such as twin, adoption, and GWAS studies. 

In modern behavior genetics, the influence of genetic variants on traits is indexed by a metric called heritability, which can range from 0 (meaning genetic variants have no influence on individual differences in a phenotypic trait) to 1 (meaning genetic variants explain 100% of individual differences in a phenotypic trait). So-called ‘narrow-sense heritability’ includes only additive genetic effects due to the average effects of alleles; additive genetic effects are most important for predicting responses to evolutionary selection pressures – whether in the wild or in artificial selective breeding of domesticated species. ‘Broad-sense heritability’ includes additive effects plus dominant and epistatic genetic effects. For most behavioral traits, additive effects are by far the most important, so broad-sense heritability is usually only a little higher than narrow-sense heritability. 
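In the standard notation of quantitative genetics (the symbols below are conventional, not taken from the Shard Theory essays), total phenotypic variance $V_P$ is partitioned into additive ($V_A$), dominance ($V_D$), epistatic ($V_I$), and environmental ($V_E$) components. Assuming no gene-environment covariance or interaction, the two definitions above can be written as:

```latex
V_P = V_A + V_D + V_I + V_E, \qquad
h^2 = \frac{V_A}{V_P}, \qquad
H^2 = \frac{V_A + V_D + V_I}{V_P}
```

Since $V_D$ and $V_I$ cannot be negative, $H^2 \ge h^2$, and the two are close whenever non-additive variance is small.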

The most important result from behavior genetics is that all human behavioral traits that differ across people, and that can be measured reliably, are heritable to some degree – and often to a surprisingly high degree. This is sometimes called the First Law of Behavior Genetics – not because it’s some kind of natural law that all behavioral traits must be heritable, but because the last 150 years of research has found no replicable exceptions to this empirical generalization. Some behavioral traits such as general intelligence show very high heritability – over 0.70 – in adults, which is about as heritable as human height. (For a good recent introduction to the ‘Top 10 replicated findings from behavior genetics’, see this paper.)

Does anybody really believe that values are heritable?

To people who accept a Blank Slate view of human nature, it might seem obvious that human values, preferences, motivations, and moral judgments are instilled by family, culture, media, and institutions – and the idea that genes could influence values might sound absurd. Conversely, to people familiar with behavior genetics, who know that all psychological traits are somewhat heritable, it might seem obvious that human values, like other psychological traits, will be somewhat heritable. It’s unclear what proportion of people lean towards the Blank Slate view of human values, versus the ‘hereditarian’ view that values can be heritable.

As a reality check, I ran this Twitter poll on Oct 17, 2022, with the results shown in this screenshot:

I was surprised that so many people took a slightly or strongly hereditarian view of values. Maybe the idea isn’t as crazy as it might seem at first glance. However, this poll merely illustrates that there is real variation in people’s views about this. It should not be taken too seriously as data, because it is just one informal question on social media, answered by a highly non-random sample. Only about 1.4% of my followers (1,749 out of 124,600) responded to this poll (which is a fairly normal response rate). My typical follower is an American male who’s politically centrist, conservative, or libertarian, and probably has a somewhat more hereditarian view of human nature than average. The poll’s main relevance here is in showing that a lot of people (not just me) believe that values can be heritable.

Human traits in general are heritable

A 2015 meta-analysis of human twin studies analyzed 17,804 traits from 2,748 papers including over 14 million twin pairs. These included mostly behavioral traits (e.g. psychiatric conditions, cognitive abilities, activities, social interactions, social values), and physiological traits (e.g. metabolic, neurological, cardiovascular, and endocrine traits). Across all traits, average heritability was .49, and shared family environment (e.g. parenting, upbringing, local culture) typically had negligible effects on the traits. For 69% of traits, heritability seemed due solely to additive genetic variation, with no influence of dominance or epistatic genetic variation. 
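To make the link between twin correlations and these variance components concrete, here is a minimal sketch of Falconer's classic approximation, which compares trait correlations in identical (MZ) and fraternal (DZ) twin pairs. It assumes purely additive genetic effects, equal environments for both twin types, and no assortative mating. The function name and the example correlations are hypothetical, and the sketch is not meant to reproduce the meta-analysis's own procedure.

```python
def falconer_estimates(r_mz, r_dz):
    """Rough variance components from twin correlations (Falconer's formula)."""
    h2 = 2.0 * (r_mz - r_dz)  # additive genetic share of trait variance
    c2 = 2.0 * r_dz - r_mz    # shared family environment share
    e2 = 1.0 - r_mz           # non-shared environment plus measurement error
    return {"h2": h2, "c2": c2, "e2": e2}


# Hypothetical correlations, chosen only for illustration.
estimates = falconer_estimates(r_mz=0.6, r_dz=0.3)
print(estimates)
```

When the DZ correlation is about half the MZ correlation, as in this made-up example, the shared-environment estimate comes out near zero, which matches the pattern of negligible shared-environment effects described above.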

Heritability of human traits is generally caused by many genes that each have very small, roughly additive effects, rather than by a few genes that have big effects (see this review). To predict individual values for a given trait, molecular behavior genetics studies therefore aggregate the effects of thousands of DNA variants into a polygenic score. So each trait is influenced by many genes, but each gene also influences many traits (this is called pleiotropy). The result is a complex genetic architecture that maps from many genetic variants onto many phenotypic traits, and this can be explored using multivariate behavior genetics methods that track genetic correlations between traits. (Elucidating the genetic architecture of human values would be enormously useful for AI alignment, in my opinion.)
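As a rough illustration of the aggregation step, a polygenic score is usually computed as a weighted sum: each variant's effect-allele count (0, 1, or 2) is multiplied by an effect size estimated in a GWAS, and the products are added up. The sketch below assumes purely additive effects; the function name, allele counts, and effect sizes are invented for illustration and do not come from any real study.

```python
def polygenic_score(allele_counts, effect_sizes):
    """Additive polygenic score: sum of (effect-allele count * per-allele effect)."""
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("need exactly one effect size per variant")
    for count in allele_counts:
        if count not in (0, 1, 2):
            raise ValueError("allele counts must be 0, 1, or 2")
    return sum(c * b for c, b in zip(allele_counts, effect_sizes))


# Hypothetical genotype at five variants, with hypothetical per-allele effects.
counts = [0, 1, 2, 1, 0]
effects = [0.02, -0.01, 0.03, 0.005, -0.04]
score = polygenic_score(counts, effects)
print(score)
```

Real scores sum over thousands of variants, and each individual weight is tiny, which is why large GWAS samples are needed before the aggregate predicts anything useful.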

Human values are heritable

The key point here, in relation to Shard Theory, is that ‘all human behavioral traits’ being heritable includes ‘all human values that differ across people’. Over the last few decades, behavior geneticists have expanded their focus from studying classic traits, such as general intelligence and mental disorders, to explicitly studying the heritability of human values, and values-adjacent traits. So far, behavior geneticists have found mild to moderate heritability for a wide range of values-related traits, including the following:

In addition, the Big Five personality traits are moderately heritable (about 40%) according to this 2015 meta-analysis of 134 studies.  Each personality trait is centered around some latent values that represent how rewarding and reinforcing various kinds of experiences are. For example, people higher in Extraversion value social interaction and energetic activity more, people higher in Openness value new experiences and creative exploration more, people higher in Agreeableness value friendliness and compassion more, people higher in Conscientiousness value efficiency and organization more, and people higher in Neuroticism value safety and risk-aversion more. Each of these personality traits is heritable, so these values are also heritable. In fact, personality traits might be central to the genetic architecture of human values.

Moreover, common mental disorders, which are all heritable, can be viewed as embodying different values. Depression reflects low reward sensitivity and disengagement from normally reinforcing behaviors. Anxiety disorders reflect heightened risk-aversion, loss aversion, and hyper-sensitivity to threatening stimuli; these concerns can be quite specific (e.g. social anxiety disorder vs. specific phobias vs. panic disorder). The negative symptoms of schizophrenia reflect reduced reward-sensitivity to social interaction (asociality), speech (alogia), pleasure (anhedonia), and motivation (avolition). The ‘Dark Triad’ personality traits (Machiavellianism, Narcissism, Psychopathy) reflect a higher value placed on personal status-seeking and short-term mating, and a lower value placed on other people’s suffering. A 2010 review paper showed that heritabilities of psychiatric ‘diseases’ (such as schizophrenia or depression) that were assumed to develop ‘involuntarily’ are about the same as heritabilities of ‘behavioral disorders’ (such as drug addiction or anorexia) that were assumed to reflect individual choices and values.

Specific drug dependencies and addictions are all heritable, reflecting the differential rewards that psychoactive chemicals have in different brains. Genetic influences have been especially well-studied in alcoholism, cannabis use, opiate addiction, cocaine addiction, and nicotine addiction. Other kinds of ‘behavioral disorders’ also show heritability, including gambling, compulsive Internet use, and sugar addiction – and each reflects a genetic modulation of the relevant reward/reinforcement systems that govern responses to these experiences.

Heritability for behavioral traits tends to increase, not decrease, during lifespan development

Shard Theory implies that genes shape human brains mostly before birth, setting up the basic limbic reinforcement system, and then Nurture takes over, such that heritability should decrease from birth to adulthood. This is exactly the opposite of what we typically see in longitudinal behavior genetic studies that compare heritabilities across different ages. Often, heritabilities for behavioral traits increase rather than decrease as people mature from birth to adulthood. For example, the heritability of general intelligence increases gradually from early childhood through young adulthood, and genes, rather than shared family environment, explain most of the continuity in intelligence across ages. A 2013 meta-analysis confirmed increasing heritability of intelligence between ages 6 months and 18 years. A 2014 review observed that heritability of intelligence is about 20% in infancy, but about 80% in adulthood. This increased heritability with age has been called ‘the Wilson Effect’ (after its discoverer Ronald Wilson), and it is typically accompanied by a decrease in the effect of shared family environment.

Increasing heritability with age is not restricted to intelligence. This study found increasing heritability of prosocial behavior in children from ages 2 through 7, and decreasing effects of shared family environment. Personality traits show relatively stable genetic influences across age, with small increases in genetic stability offsetting small decreases in heritability, according to this meta-analysis of 24 studies including 21,057 sibling pairs. A frequent finding in longitudinal behavior genetics is that the stability of traits across life is better explained by the stability of genes across life, than by the persistence of early experiences, shared family environment effects, or contextually reinforced values. 

More generally, note that heritability does not just influence ‘innate traits’ that are present at birth. Heritability also influences traits that emerge with key developmental milestones such as social-cognitive maturation in middle childhood, sexual maturation in adolescence, political and religious maturation in young adulthood, and parenting behaviors after reproduction. Consider some of the findings in the previous section, which are revealed only after individuals reach certain life stages. The heritabilities of mate preferences, sexual orientation, orgasm rate, and sexual jealousy are not typically manifest until puberty, so they are not ‘innate’ in the sense of ‘present at birth’. The heritability of voter behavior is not manifest until people are old enough to vote. The heritability of investment biases is not manifest until people acquire their own money to invest. The heritability of parenting behaviors is not manifest until people have kids of their own. It seems difficult to reconcile the heritability of so many late-developing values with the Shard Theory assumption that genes influence only a few crude, simple, reinforcement systems that are present at birth.

Human Connectome Project studies show that genetic influences on brain structure are not restricted to ‘subcortical hardwiring’

Shard Theory seems to view genetic influences on human values as being restricted mostly to the subcortical limbic system. Recall that Assumption 1 of Shard Theory was that “The cortex is basically (locally) randomly initialized.” Recent studies in neurogenetics show that this is not accurate. Genetically informative studies in the Human Connectome Project show pervasive heritability in neural structure and function across all brain areas, not just limbic areas. A recent review shows that genetic influences are quite strong for global white-matter microstructure and anatomical connectivity between brain regions; these effects pervade the entire neocortex, not just the limbic system. Note that these results based on brain imaging include not just the classic twin design, but also genome-wide association studies, and studies of gene expression using transcriptional data. Another study showed that genes, rather than shared family environment, played a more important role in shaping connectivity patterns among 39 cortical regions. Genetic influences on the brain’s connectome are often modulated by age and sex – in contrast to Shard Theory’s implicit model that all humans, of all ages, and both sexes, share the same subcortical hardwiring. Another study showed high heritability for how the brain’s connectome transitions across states through time – in contrast to Shard Theory’s claim that genes mostly determine the static ‘hardwiring’ of the brain.

It should not be surprising that genetic variants influence all areas of the human brain, and the values that they embody. Analysis of the Allen Human Brain Atlas, a map of gene expression patterns throughout the human brain, shows that over 80% of genes are expressed in at least one of 190 brain structures studied. Neurogenetics research is making rapid progress on characterizing the gene regulatory network that governs human brain development – including neocortex. This is also helping genome-wide association studies to discover and analyze the millions of quantitative trait loci (minor genetic variants) that influence individual differences in brain development. Integration of the Human Connectome Project and the Allen Human Brain Atlas reveals pervasive heritability for myelination patterns in human neocortex – which directly contradicts Shard Theory’s Assumption 1 that “Most of the circuits in the brain are learned from scratch, in the sense of being mostly randomly initialized and not mostly genetically hard-coded.” 

Behavioral traits and values are also heritable in non-human animals 

A 2019 meta-analysis examined 476 heritability estimates in 101 publications across many species, and across 11 types of behavioral traits – including activity, aggression, boldness, communication, exploration, foraging, mating, migration, parenting, social interaction, and other behaviors. Overall average heritability of behavior was 0.24. (This may sound low, but remember that empirical heritability estimates are limited by the measurement accuracy for traits, and many behavioral traits in animals can be measured with only modest reliability and validity.) Crucially, heritability was positive for every type of behavioral trait, was similar for domestic and wild species, was similar for field and lab measures of behavior, and was just as high for vertebrates as for invertebrates. Also, average heritability of behavioral traits was just as high as average heritability of physiological traits (e.g. blood pressure, hormone levels) and life history traits (e.g. age at sexual maturation, life span), and was only a bit lower than the heritability for morphological traits (e.g. height, limb length).
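The parenthetical point about measurement accuracy can be made concrete. If measurement error simply adds noise to a trait's observed variance, then observed heritability is roughly the error-free heritability multiplied by the measure's reliability (the share of observed variance that is not measurement error). The sketch below applies that standard attenuation logic. The function name and the reliability value of 0.6 are hypothetical, and the corrected figure is an illustration, not a result from the meta-analysis.

```python
def disattenuated_heritability(h2_observed, reliability):
    """Estimate error-free heritability, assuming purely additive measurement noise."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    # Observed h2 is approximately true h2 * reliability, so divide to undo it.
    return min(1.0, h2_observed / reliability)


# 0.24 is the average reported in the text; 0.6 is a hypothetical reliability.
corrected = disattenuated_heritability(0.24, 0.6)
print(round(corrected, 2))
```

Under these assumptions, a modestly reliable behavioral measure could hide a noticeably higher underlying heritability, which is why the raw average should be read as a lower bound rather than a precise estimate.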

Note that most of these behavioral traits in animals involve ‘values’, broadly construed as reinforcement or reward systems that shape the development of adaptive behavior. For example, ‘activity’ reflects how rewarding it is to move around a lot; ‘aggression’ reflects how rewarding it is to attack others; ‘boldness’ reflects how rewarding it is to track and investigate dangerous predators; ‘exploratory behavior’ reflects how rewarding it is to investigate novel environments; ‘foraging’ reflects how rewarding it is to find, handle, and consume food; ‘mating’ reflects how rewarding it is to do mate search, courtship, and copulation; ‘parental effort’ reflects how rewarding it is to take care of offspring; and ‘social behavior’ reflects how rewarding it is to groom others or to hang around in groups.

In other words, every type of value that can vary across individual animals, and that can be reliably measured by animal behavior researchers, seems to show positive heritability, and heritability of values is just as high in animals with complex central nervous systems (vertebrates) as in animals with simpler nervous systems (invertebrates).

So what if human values are heritable?

You might be thinking, OK, all this behavior genetics stuff is fine, and it challenges a naïve Blank Slate model of human nature, but what difference does it really make for Shard Theory, or for AI alignment in general? 

Well, Shard Theory certainly thinks it matters. Assumption 1 in Shard Theory is presented as foundational to the whole project (although I’m not sure it really is). Shard Theory repeatedly talks about human values being built up from just a few, crude, simple, innate, species-typical reinforcement systems centered in the midbrain (in contrast to the rich set of many, evolved, adaptive, domain-specific psychological adaptations posited by evolutionary psychology). Shard Theory seems to allow no role for genes influencing value formation after birth, even at crucial life stages such as middle childhood, sexual maturation, and parenting. More generally, Shard Theory seems to underplay the genetic and phenotypic diversity of human values across individuals, and seems to imply that humans have only a few basic reinforcement systems in common, and that all divergence of values across individuals reflects differences in family, socialization, cultural, and media exposure.

Thus, I think that Shard Theory has some good insights and some promise as a research paradigm, but I think it needs some updating in terms of its model of human evolution, genetics, development, neuroscience, psychology, and values. 

Why does the heritability of human values matter for AI alignment?

Apart from Shard Theory, why does it matter for AI alignment if human values are heritable? Well, I think it might matter in several ways. 

First, polygenic scores for value prediction. In the near future, human scientists and AI systems will be able to predict the values of an individual, to some degree, just from their genotype. As GWAS research discovers thousands of new genetic loci that influence particular human values, it will become possible to develop polygenic scores that predict someone’s values given their complete sequenced genome – even without knowing anything else about them. Polygenic scores to predict intelligence are already improving at a rapid rate. Polygenic value prediction would require large sample sizes of sequenced genomes linked to individuals’ preferences and values (whether self-reported or inferred behaviorally from digital records), but it is entirely possible given current behavior genetics methods. As the cost of whole-genome sequencing falls below $1,000, and the medical benefits of sequencing rise, we can expect hundreds of millions of people to get genotyped in the next decade or two. AI systems could request free access to individual genomic data as part of standard terms and conditions, or could offer discounts to users willing to share their genomic data in order to improve the accuracy of their recommendation engines and interaction styles. We should expect that advanced AI systems will typically have access to the complete genomes of the people they interact with most often – and will be able to use polygenic scores to translate those genomes into predicted value profiles.

Second, familial aggregation of values. Heritability means that values of one individual can be predicted somewhat by the values of their close genetic relatives. For example, learning about the values of one identical twin might be highly predictive of the values of the other identical twin – even if they were separated at birth and raised in different families and cultures. This means that an AI system trying to understand the values of one individual could start from the known values of their parents, siblings, and other genetic relatives, as a sort of maximum-likelihood familial Bayesian prior. An AI system could also take into account developmental behavior genetic findings and life-stage effects – for example, an individual’s values at age 40 after they have kids might be more similar in some ways to those of their own parents at age 40, than to themselves as they were at age 20. 
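As a rough illustration of such a familial prior, the sketch below uses the standard quantitative-genetics result that the regression of offspring on the midparent value has a slope equal to the (narrow-sense) heritability. It assumes purely additive genetic effects, no shared-environment effects, and trait scores measured on a common scale; the function name and the numbers are hypothetical.

```python
def expected_offspring_value(parent_a: float, parent_b: float,
                             population_mean: float, heritability: float) -> float:
    """Prior mean for an individual's trait score, given both parents' scores."""
    if not 0.0 <= heritability <= 1.0:
        raise ValueError("heritability must lie between 0 and 1")
    midparent = (parent_a + parent_b) / 2.0
    # Regress the midparent deviation toward the population mean by h^2.
    return population_mean + heritability * (midparent - population_mean)


# Hypothetical standardized scores: parents at +1.0 and +2.0, heritability 0.5.
prior_mean = expected_offspring_value(1.0, 2.0, population_mean=0.0, heritability=0.5)
print(prior_mean)  # 0.0 + 0.5 * 1.5 = 0.75
```

An AI system would treat this only as a starting point, then update it as it observes the individual's own stated and revealed values.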

Third, the genetic architecture of values. For a given individual, their values in one domain can sometimes be predicted by values in other domains. Values are not orthogonal to each other; they are shaped by genetic correlations across values. As behavior genetics researchers develop a more complete genetic architecture of values, AI systems could potentially use this to infer a person’s unknown values from their known values. For example, their consumer preferences might predict their political values, or their sexual values might predict their religious values.

Fourth, the authenticity of values. Given information about an individual’s genome, the values of their close family members, and the genetic architecture of values, an AI system could infer a fairly complete expected profile of values for that individual, at each expected life-stage. What if the AI discovers that there’s a big mismatch between an individual’s ‘genetic prior’ (their values as predicted from genomic and family information), and their current stated or revealed values? That might be evidence that the individual has heroically overcome their genetic programming through education, enlightenment, and self-actualization. Or it might be evidence that the individual has been manipulated by a lifetime of indoctrination, mis-education, and propaganda that has alienated them from their instinctive preferences and morals. The heritability of values raises profound questions about the authenticity of human values in our credentialist, careerist, consumerist, media-obsessed civilization. When AI systems are trying to align with our values, but our heritable values don’t align with our current stated cultural values (e.g. this month’s fashionable virtue signals), which should the AI weigh most heavily?


If we’re serious about AI alignment with human values, we need to get more serious about integrating empirical evidence about the origins, nature, and variety of human values. One recent attempt to ground AI alignment in human values – Shard Theory – has some merits and some interesting potential. However, this potential is undermined by Shard Theory’s empirical commitments to a fairly Blank Slate view of human value formation. That view is inconsistent with a large volume of research in behavior genetics on the heritability of many human values. By taking genetic influences on human values more seriously, we might be able to improve Shard Theory and other approaches to AI safety, and we might identify new issues in AI alignment such as polygenic scores for value prediction, familial aggregation of values, and the genetic architecture of values. Finally, a hereditarian perspective raises the thorny issue of which of our values are most authentic and most worthy of being aligned with AI systems – the ones our genes are nudging us towards, the ones our parents taught us, the ones that society indoctrinates us into, or the ones that we ‘freely choose’ (whatever that means). 


Appendix 1: Epistemic status of my arguments

I’m moderately confident that some key assumptions of Shard Theory as currently presented are not empirically consistent with the findings of behavior genetics, but I have very low confidence about whether or not Shard Theory can be updated to become consistent, and I have no idea yet what that update would look like.

As a newbie AI alignment researcher, I’ve probably made some errors in my understanding of the more AI-oriented elements of Shard Theory. I worked a fair amount on neural networks, genetic algorithms, autonomous agents, and machine learning from the late 1980s through the mid-1990s, but I’m still getting up to date with more recent work on deep learning, reinforcement learning, and technical alignment research. 

As an evolutionary psychology professor, I’m moderately familiar with behavior genetics methods and findings, and I’ve published several papers using behavior genetics methods. I’ve been thinking about behavior genetics issues since the late 1990s, especially in relation to human intelligence. I taught a course on behavior genetics in 2004 (syllabus here). I did a sabbatical in 2006 at the Genetic Epidemiology Center at QIMR in Brisbane, Australia, run by Nick Martin. We published two behavior genetics studies, one in 2011 on the heritability of female orgasm rates, and one in 2012 on the heritability of talking and texting on smartphones. I did a 2007 meta-analysis of brain imaging data to estimate the coefficient of additive genetic variance in brain size. I also published a couple of papers in 2008 on genetic admixture studies, such as this. However, I’m not a full-time behavior genetics researcher, and I’m not actively involved in the large international genetics consortia that dominate current behavior genetics studies.

Overall, I’m highly confident in the key lessons of behavior genetics (e.g. all psychological traits are heritable, including many values; shared family environment has surprisingly small effects on many traits). I’m moderately confident in the results from meta-analyses and large-scale international consortia studies. I’m less confident in specific heritability estimates from individual papers that haven’t yet been replicated. 


Some behavioral traits such as general intelligence show very high heritability – over 0.70 – in adults, which is about as heritable as human height.

I'm very confused about what numbers such as this mean in practice, since the most natural interpretation ("70% of the trait is genetically determined") is wrong, but there aren't very many clear explanations of what the correct interpretation is. When I tried asking this on LW, the top-voted answer was that it's a number that's mostly useful if you're doing animal breeding, but probably not useful for much else.

You mention a lot of heritability numbers, could you clarify what it is that we're intended to infer from them? (It seems to me that the main thing we can infer from heritability numbers is that if a trait has heritability above zero, then there's some genetic influence on it, but since you mention some traits having "very high" heritability, I presume that you find there to be some other information too.)

Hi Kaj,

There's a lot of politically motivated misinformation about heritability, mostly so people feel comfortable ignoring and dismissing the results of behavior genetics.

IMHO, if one knows that the heritability of a psychological trait is fairly high, and the long-term effect of shared family environment is fairly low, this has big practical implications in a number of real-life domains:

  1. Mate choice. It's really important to pay attention to highly heritable traits when choosing a mate to have kids with, because their genes will have a big impact on their kids, and their parenting won't (within the range of parenting observable in one's population).
  2. Parenting. If a trait is highly heritable, but parenting doesn't make much difference, then parents can relax, enjoy their kids, and not try to push them, hothouse them, or try to shape them into someone they're not. Bryan Caplan, Steven Pinker, and others have emphasized that behavior genetics is very liberating for anxious parents. Conversely, if a kid gets a serious mental disorder that's highly heritable (e.g. schizophrenia, bipolar), then parents should feel less guilt that they messed up or caused the disorder (vs. 1950s theories that mothers caused schizophrenia).
  3. Education. If intelligence is highly heritable, and conscientiousness is moderately heritable, within a population with a wide range of educational opportunities, then education won't do very much to boost intelligence or conscientiousness; this has implications for parents trying to decide whether to spend an extra few hundred thousand dollars on private school (vs public school)
  4. Embryo selection. As polygenic scores get more accurate in predicting traits, parents will be able to choose among fertilized embryos to influence their offspring traits. (This will be much less effective than choosing a good mate, for the next couple of decades, but it will become important eventually).

Those are just a few examples; behavior geneticists have discussed many others.

I recommend Making Sense of Heritability by Neven Sesardić.

Yes, that's a good book on this issue.

My understanding is that the technical translation is: 70% of the variance in that trait is attributable to genes, given the time and place of the studied population. 

For example, 70% of the variance in intelligence is attributable to genes, given a white American population, living in non-abusive homes, from the 1960s to the 1990s. (The specifics are just to provide a concrete example.)

The farther one gets from the originally studied population, the less one can extrapolate exact findings. And vice versa. 
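The variance-ratio reading above can be illustrated with a small simulation. This is only a sketch under a deliberately simple assumption: each person's trait is the sum of an independent genetic component and an independent environmental component, with no gene-environment correlation or interaction. The function name and all numbers are illustrative, not estimates from any study.

```python
import random
import statistics


def simulated_heritability(genetic_sd: float, environment_sd: float,
                           n: int = 20000, seed: int = 0) -> float:
    """Share of trait variance attributable to the genetic component."""
    rng = random.Random(seed)
    genetic = [rng.gauss(0.0, genetic_sd) for _ in range(n)]
    environment = [rng.gauss(0.0, environment_sd) for _ in range(n)]
    # Additive model: phenotype = genetic component + environmental component.
    phenotype = [g + e for g, e in zip(genetic, environment)]
    return statistics.pvariance(genetic) / statistics.pvariance(phenotype)


# Population A: genetic variance 0.7, environmental variance 0.3.
h2_a = simulated_heritability(genetic_sd=0.7 ** 0.5, environment_sd=0.3 ** 0.5)

# Population B: same genetic variance, but more varied environments (variance 0.7).
h2_b = simulated_heritability(genetic_sd=0.7 ** 0.5, environment_sd=0.7 ** 0.5)

# Up to sampling noise, h2_a is close to 0.7 and h2_b is close to 0.5.
print(round(h2_a, 2), round(h2_b, 2))
```

The genetic variance is the same in both simulated populations, yet the heritability differs, because heritability is a ratio that depends on how much environments vary in the population studied. That is why an estimate does not transfer cleanly to a different time and place.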

Update: there's some interesting discussion about my post over on LessWrong, including some replies from the developers of Shard Theory, Quintin Pope and Alex Turner. See the comments after the post there.

(My post is also up on AI Alignment Forum here, but there are no comments there yet, as of Oct 25). 

PS the developers of Shard Theory, Quintin Pope and Alex Turner, have just offered (Oct 25, 2022) some comments and replies on my post, over at LessWrong here (scroll down to comments).

PS, Gary Marcus at NYU makes some related points about Blank Slate psychology being embraced a bit too uncritically by certain strands of thinking in AI research and AI safety.

His essay '5 myths about learning and innateness'

His essay 'The new science of alt intelligence'

His 2017 debate with AI researcher Yann LeCun 'Does AI need more innate machinery'

I don't agree with Gary Marcus about everything, but I think his views are worth a bit more attention from AI alignment thinkers.
