The Importance of AI Alignment, explained in 5 points

Daniel_Eth

16 min read · Feb 11, 2023

Comments 4

[anonymous]

Excellent breakdown, thanks

Dušan D. Nešić (Dushan)

If you, the person reading this, are interested in helping with AI safety but your background is in the complex sciences (evolutionary biology, sociology, economics, law, neurology...) rather than in technical AI safety, do reach out to us at PIBBSS.ai. You will be late for this year's applications, but we may have other ways of cooperating, and if you sign up to our mailing list you will be among the first to find out when we have new programs.

Seb Stent

One underserved aspect of this debate is bias within large language models (LLMs) behind current AI/ML implementations.

Africa, for example, is very poorly represented in these data sets. Black skin is poorly served by wearables and their ability to determine health characteristics for their wearers, which worsens health outcomes rather than improving them.

African languages make up a tiny percentage of language data, despite a billion humans represented within these language groups. This technology offers the best outcomes for those in limited literacy areas, yet is focused on the markets with the least need for this.

African faces are underrepresented in CV data sets, which increases fraud and prevents the effective use of biometrics in developing markets, where their benefits could be greatest.

alx

While many accurate and valid points are made here, the overarching flaw of this approach to AI alignment is evident in the very first paragraph. Perhaps it is meta and semantic, but I believe we must make more effort to define the nature and attributes of the 'Advanced' AI/AGI that we are referring to when talking about AI alignment. The statistical inference used in transformer-encoder models that simply generate an appropriate next word is far from advanced. Such models might be seen as having a form of linguistic understanding, but they remain in a completely different league from brain-inspired cognitive architectures that could conceivably become self-aware.

Many distinctions are critical when evaluating, framing and categorizing AI. I would argue the primary distinction will soon become that of the elusive C-word: Consciousness. If we are talking about embodied Human-Equivalent self-aware conscious AGI (HE-AGI), I think it would be unwise and frankly immoral to jump to the concept of control and forced compliance as a framework for alignment.

Clearly limits should be placed on AI's capacity to interact with and engage the physical real world (including its own hardware), just as limits are placed on our own ability to act in our world. However, we should be thinking about alignment in terms of genuine social alignment, empathy, respect and instilling foundational conceptions of morality. It seems clear to me that Conscious Human Equivalent AGI by definition deserves all the innate rights, respect, civic responsibilities and moral consideration of an adolescent human... and eventually (likely) those of an adult.

This framework is still in progress but presents an ordinal taxonomy to better frame and categorize AGI: http://www.pathwai.org. Feedback would be welcomed!

It should be noted that some researchers view the concept of “general intelligence” as flawed and consider the term “AGI” to be either a misnomer at best or confused at worst. Nevertheless, in this piece we are concerned with the capabilities of AI systems, not whether such systems should be referred to as “generally intelligent,” so disagreement over the coherency of the term “AGI” doesn’t affect the arguments in this piece.

In this second scenario, different AGIs might specialize in a similar manner to how human workers specialize in the economy today.

A future paradigm could, for instance, be based on future discoveries in neuroscience.

The brain is a physical object, and its mechanisms of operation must therefore obey the laws of physics. In theory, these mechanisms could be described in a manner that a computer could replicate.

As of today's date: February 10, 2023.

E.g., in January 2020, back when conventional wisdom was that COVID would not become a huge deal, Metaculus instead predicted >100,000 people would eventually become infected with the disease.

E.g., Metaculus predicted a breakthrough in the computational biology technique of protein structure prediction, before DeepMind’s AI AlphaFold astounded scientists with its performance in this task.

Other examples where AI has recently made large strides include: conversing with humans via text, speech recognition, speech synthesis, music generation, language translation, driving vehicles, summarizing books, answering high school- or college-level essay questions, creative storytelling, writing computer code, scientific advancement, mathematical advancement, hardware advancement, mastering classic board games and video games, mastering multiplayer strategy games, doing any one task from a large number of unrelated tasks and switching flexibly between these tasks based on context, using robotics to interact with the world in a flexible manner, integrating cognitive subsystems via an “inner monologue,” etc.

Technically, this description is a slight simplification; GPT-3 was actually programmed to learn to predict the next “token” from a sequence of text, where a “token” would generally correspond to either a word or a portion of a word.

Depending on whether we extrapolate linearly or using an “S-curve,” most such tasks are implied to reach near-perfect performance with ~10²⁸ to ~10³¹ computer operations of training. Assuming a $100M project, an extrapolated 2.5-year doubling time in the price-performance of GPUs (computer chips commonly used in AI), and a current GPU computational cost of ~10¹⁷ operations/$, such performance would be expected to be reached in 25 to 50 years. Note this extrapolation is highly uncertain; for instance, high performance on these metrics may not in actuality imply advanced AI (implying this estimate is an underestimate) or algorithmic progress may reduce necessary computing power (implying it’s an overestimate).
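The arithmetic behind the 25-to-50-year figure can be sketched as follows. This is only an illustration built from the numbers quoted in the footnote ($100M budget, ~10¹⁷ operations/$ today, a 2.5-year doubling time); the function name and its parameter names are hypothetical, and the steady exponential trend is an assumption of the extrapolation rather than an established fact.

```python
import math


def years_until_affordable(
    training_ops,
    budget_usd=100e6,          # assumed project budget ($100M, from the footnote)
    ops_per_usd_today=1e17,    # assumed current GPU price-performance
    doubling_time_years=2.5,   # assumed doubling time of price-performance
):
    """Years until the budget buys `training_ops`, under steady exponential growth."""
    affordable_today = budget_usd * ops_per_usd_today  # ~1e25 operations
    if training_ops <= affordable_today:
        return 0.0
    # Number of price-performance doublings needed to close the gap.
    doublings_needed = math.log2(training_ops / affordable_today)
    return doublings_needed * doubling_time_years


low = years_until_affordable(1e28)   # lower compute estimate
high = years_until_affordable(1e31)  # upper compute estimate
print(f"{low:.1f} to {high:.1f} years")  # prints "24.9 to 49.8 years"
```

The result is quite sensitive to the doubling time: under the same assumptions, a 2-year doubling time would shorten the range to roughly 20 to 40 years.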

The most powerful supercomputers today likely already have enough computing power to surpass that of the human brain. However, an arguably more important factor is the amount of computing power necessary to train an AI of this size (the amount of computing power necessary to train large AI systems typically far exceeds the computing power necessary to run such systems). One extensive report used a few different angles of attack to estimate the amount of computing power needed to train an AI system that was as powerful as the human brain, and this report concluded that such computing power would likely become economically available within the next few decades (with a median estimate of 2052).

This problem is known as “specification gaming” or “outer misalignment.”

E.g., “maximize profits,” if interpreted literally and outside a human lens, may yield all sorts of extreme psychopathic and illegal behavior that would deeply harm others for the most marginal gain in profit.

The general phenomenon at play here (sometimes referred to as “Goodhart’s law”) has many examples. In one classic-but-possibly-fictitious example, the British Empire put a bounty on cobras within colonial India (to try to reduce the cobra population), but some locals responded by breeding cobras to kill in order to collect the bounty, thus eventually leading to a large increase in the cobra population.

Similarly, attempts to train AI systems to not mislead their overseers (by punishing these systems for behavior that the overseer deems to be misleading) might instead train these systems to simply become better at deception so they don't get caught (for instance, only sweeping a mess under the rug when the overseer isn’t looking).

This problem is known as “goal misgeneralization” or “inner misalignment.”

Note the true story is somewhat more complicated, as evolution “trained” individuals to also support the survival and reproduction of their relatives.

As one simple example, we don’t want a video-game-playing AI to hack into its console to give itself a high score once it learns how to accomplish this feat.

For instance, understanding what we really mean when we use imprecise language.

At least insofar as AI can be said to “understand” anything.

The logic here is the AI may reason that if it defected in training, the overseer would simply provide negative feedback (which would adjust its internal processes) until it stopped defecting. Under such a scenario, the AI would be unlikely to be deployed in the world with its current goals, so it would presumably not achieve these goals. Thus, the AI may choose to instead forgo defecting in training so it might be deployed with its current goals.

It’s common for cutting-edge AI capabilities to move relatively quickly from matching human abilities in a domain to far surpassing human abilities in that domain (see: chess, Jeopardy!, and Go for high-profile examples). Alternatively, even if it takes a while for advanced AI capabilities to progress to far surpassing human abilities in the relevant domains, the worries sketched out below may still occur in a more drawn-out fashion.

In the same way that AI systems can now outcompete humans in chess and Go.

Such AI systems might guard against being shut down by using their social-persuasion or cyber-operation abilities. As just one example, these systems might initially pretend to be aligned with the interests of humans who had the ability to shut them off, while clandestinely hacking into various data centers to distribute copies of themselves across the internet.

Note that for many animals, the problem is not due to idiosyncrasies of human nature, but instead simply due to human interests steamrolling animal interests where interests collide (e.g., competing for land).

For instance, Toby Ord, a leading existential risk researcher at Oxford, estimates that “unaligned AI” is by far the most likely source of existential risk over the next 100 years – greater than all other risks combined.

AI governance also encompasses several other areas. For instance, it includes work geared towards ensuring advanced AI isn’t misused by bad actors who intentionally direct such systems towards undesirable goals. Such misuse may, in an extreme scenario, also constitute an existential risk (if it enables the permanent “locking-in” of an undesirable future order). Note that this outcome would be conceptually distinct from the alignment failure modes described in this piece (which are “accidents” rather than “intentional misuse”), so such misuse cases are not covered in this piece.

Feedback on outward behavior may be inadequate for training AI systems away from deception, because a deceptive system will generally behave outwardly in a manner designed not to appear deceptive.

Interestingly, these same systems are reasonably good at evaluating their own previous claims – that is, if they are asked to evaluate how likely a previous claim they made is to be accurate, they tend to give substantially higher probability of accuracy for claims that are in fact accurate compared to those that are inaccurate.

Honest AI may therefore make false claims if it had learned inaccurate information, but it would not generally make false claims on an issue where it had learned accurate information and assimilated this information into its “knowledge” of the world. (Note that researchers disagree about whether current AI systems should or should not be said to have “knowledge” in the sense that the word is commonly used, even setting aside the thorny issue of precisely defining the word “knowledge.”)

Note that the latter paper defines alignment research differently than I have – by my definition, most of the research avenues in that paper would be considered technical AI alignment research, even ones the paper does not classify within the section on “alignment.”

The more that various organizations feel they are in a competitive race towards advanced AI, the more pressure there may be for at least some of these organizations to cut corners to win the race.

The survey described “AI safety research” as having significant overlap with what I’m calling “technical AI alignment research.”

1 – Advanced AI is possible

2 – Advanced AI might not be that far away

Many AI experts and generalist forecasters think it’s likely advanced AI will be developed within the next few decades:

Extrapolating AI capabilities plausibly suggests advanced AI within a few decades:

3 – Advanced AI might be difficult to direct

Current AI systems are often accidentally misdirected:

Advanced AI systems may similarly technically satisfy specifications in ways that violate what we actually want (i.e., the “King Midas” problem):

Training AI on human feedback may help address the above specification problems, but this introduces its own problems, including incentivizing misleading behavior:

Additionally, advanced AI systems may come to pursue proxies for goals that work well in training, but these proxies may break down during deployment:

To be clear, the above worries don’t imply that advanced AI wouldn’t be able to “understand” what we really wanted, but instead that this understanding wouldn’t necessarily translate to the AI systems acting in accordance with our wants:

It’s possible advanced AI will be built before we solve the above problems, or even without anyone really understanding the systems that are built:

4 – Poorly-directed advanced AI could be catastrophic for humanity

As mentioned above, poorly-directed advanced AI systems may curtail humanity’s ability to course correct:

From there, the world could develop in unexpected and undesirable ways, with no recourse:

While the above worries may sound extreme, they are not particularly fringe among relevant experts who have examined the issue (though there is considerable disagreement among experts and not all share these concerns):

5 – There are steps we can take now to reduce the danger

Some technical AI alignment research involves working with current AI systems to direct them towards desired goals, with the hope that insights transfer to advanced AI:

Other technical AI alignment research involves more theoretical or abstract work:

Understanding the inner workings of current black-box AI systems:

Developing methods for ensuring the honesty or truthfulness of AI systems:

On the nontechnical side, several areas of AI governance are relevant for reducing misalignment risks from advanced AI, including work to: