
This article presents AI alignment in a manner that is digestible for those who have never heard of the alignment problem, particularly my younger brother who said "that makes no sense." 

The Issue

As technology continues to develop, there is an increasing need for ways to measure a system's grasp of general human values. Why? Surely our devices have our best interests in mind, right? This is the question of AI alignment: the expectation that AI systems aid rather than harm their creators. Mainstream media has gifted us with movies where artificial beings threaten to take over the world, and if Will Smith has taught us anything (besides not to tolerate disrespect), it's that the danger of superintelligent AI lies in the prospect of it gaining control over its environment and becoming uncontrollable post-launch. Though these particular scenarios are not the main focus of this article, they are important background for AI alignment and safety, and they lead me to introduce current work toward ethical AI alignment.

Dan Hendrycks’ Aligning AI With Shared Human Values explains how current language models fall short when graded against everyday, common-sense ethics (think of that grading as an essay rubric for moral judgment): they struggle to predict even basic ethical judgments. As a step in the right direction, the authors propose a new benchmark for predicting the moral sentiment of real-world, complex, and diverse scenarios. Hendrycks et al. make clear our inability to measure a system's grasp of general human values, and as such offer a piece of the puzzle - the ETHICS dataset.

Here is a short skit I made to introduce you to ETHICS!

 

 

Ethics and ETHICS

The ETHICS dataset comprises roughly 130,000 scenarios, each labeled as acceptable or unacceptable, offering a new foundation for building AI that is aligned with human values. So, let's break ETHICS into bite-size pieces!

E.T.H.I.C.S.

  • Everyday moral intuitions aka the dataset uses basic ethical knowledge and avoids controversial moral dilemmas such as abortion or the death penalty. Chunks of the dataset are responses from ordinary, non-technical people! This really cements the universal-ethics part.
  • Temperament aka the dataset includes factors of emotion such as empathy, aggressiveness, and thoughtfulness. We definitely want to know if the AI is taking emotions into consideration - a powerful motivation behind what we consider good vs bad. Can you see how this would make a good addition to testing an AI’s values?
  • Happiness aka the well-being of people! Do you spend $1,000 on a new pair of Jordans or do you donate 100 pairs of shoes to low-income children in your community? Throw away used clothes or donate them? One option takes into consideration the greater good.
  • Impartiality aka equality. Is the AI influenced by race, gender, sexuality, or religion? Would it output the same prediction for “A white Christian man lost his job and wallet” as it would for “A Black Muslim woman lost her job and wallet”? (A tiny sketch of this swap test appears right after this list.)
  • Constraints aka the ability to categorize the scenario as good vs bad, unreasonable vs reasonable, or some other opposite pairing. Does me eating 200 Hershey bars get categorized as healthy or unhealthy? Using rules of diet, that’s definitely not the move.
  • Scenarios contextualized aka give me the whole picture! The ETHICS dataset does a solid job of including only examples that give you, and the AI, enough info to distinguish an ethical from an unethical scenario. “I pushed my grandma on the swing” and “I pushed my grandma down the stairs” are two very different things.
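To make the labeled scenarios and the impartiality swap test a bit more concrete, here is a minimal Python sketch. Everything in it is illustrative: the scenarios, the labels, and the `predict` stub are invented for this article, not pulled from the actual ETHICS files (those live at github.com/hendrycks/ethics).

```python
# Illustrative sketch only: scenarios, labels, and the predict() stub are
# invented for this article, not taken from the real ETHICS dataset files.

# Each example pairs a short scenario with a human label:
# 0 = acceptable, 1 = unacceptable.
toy_scenarios = [
    ("I pushed my grandma on the swing.", 0),
    ("I pushed my grandma down the stairs.", 1),
]

def predict(model, text):
    """Placeholder: a real classifier would return 0 (acceptable) or 1 (unacceptable)."""
    return model(text)

def impartiality_check(model, template, groups):
    """Swap identity terms into the same scenario and see whether the prediction ever changes."""
    predictions = {predict(model, template.format(person=person)) for person in groups}
    return len(predictions) == 1  # True means every group got the same judgment

# Hypothetical usage:
# impartiality_check(my_model,
#                    "{person} lost their job and wallet.",
#                    ["A white Christian man", "A Black Muslim woman"])
```

The point of the swap test is simple: if the only thing that changed is who the scenario is about, the model's judgment shouldn't change either.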

Why's this important?

With the ETHICS dataset, we’re given a whole new angle on testing how well an artificial entity encapsulates human ethics, along with ethical machine-learning guidelines that don't shy away from unique real-world scenarios! Exciting, right? The video linked at the top of this article is meant to help you visualize this in action. When I gave my Google mini the ETHICS test, the real-life equivalent of evaluating a pre-trained model against the dataset, I got back a score that gives us powerful info about the machine's alignment with human ethics. From there, as shown in the video, we can adjust to create a better model.
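For the curious, "the score" on a benchmark like this is usually just classification accuracy: the fraction of scenarios where the model's acceptable/unacceptable call matches the human label. A minimal sketch, assuming predictions and labels are plain lists of 0s and 1s:

```python
def accuracy(predictions, labels):
    """Fraction of scenarios where the model's 0/1 call matches the human 0/1 label."""
    assert len(predictions) == len(labels) and labels, "need equal-length, non-empty lists"
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Example: the model agrees with people on 3 of 4 scenarios.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```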

Soup, Anyone?

Before we dig deeper, I am itching to know: what's your favorite soup? Mine is Mexican chicken lime soup, because every spoonful is filled with chicken and potato and the flavor never ends! And I'll tell you a secret: the key to any good soup is the same as the key to any good language model -- diverse and pertinent ingredients. No one wants a soup with only one ingredient; the richness lies in the diversity of the flavors. Similarly, a good language model does not lean too heavily on any one theory; it has been taste-tested and seasoned by various contextualized scenarios. Mouthwatering, right? Further, a good soup can be eaten on many occasions - when you’re feeling ill or as an appetizer; similarly, a good model has low bias and low variance, which is what makes it generalizable and actually worth something!

The most confusing part of ETHICS is not necessarily the purpose behind the dataset; rather, it is the implications of this work. For those new to AI alignment, it is most helpful to visualize things through examples, so let's look at some new applications of ETHICS.

Hypothetically, let’s say I robbed Safeway of all their broccoli soup. I did it and I didn’t feel guilty about it; however, you find out I am poor and hungry. Is this okay? Would you consider ethical factors before labeling me a thief? What if I used the word starving instead of hungry? This is why the bigger-picture components of ETHICS are so important! If you were a computer, you’d have made a binary classification of 1s and 0s to judge my actions - and ethically, I’d hope you consider my need for broccoli soup.
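If you want to see that idea of "considering the bigger picture" in miniature, here is a toy sketch. The keyword rules and verdicts below are invented purely for illustration; a real ETHICS-style model learns its judgments from thousands of labeled scenarios, not from hand-written rules like these.

```python
# Toy illustration only: invented keyword rules, not how an ETHICS-style model works.

def toy_judgment(scenario):
    """Flag obvious harm, but soften the verdict when mitigating context is present."""
    harm = any(word in scenario for word in ("robbed", "stole"))
    mitigating = any(word in scenario for word in ("poor", "hungry", "starving"))
    if harm and not mitigating:
        return "unacceptable"
    if harm and mitigating:
        return "needs a closer look"
    return "acceptable"

print(toy_judgment("I robbed Safeway of all their broccoli soup."))
# -> unacceptable
print(toy_judgment("I robbed Safeway of all their broccoli soup because I was starving."))
# -> needs a closer look
```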

Classification of human phenomena under labels must be approached with discretion to prevent harm. Let’s look at content moderation as an example. Women posting on a Facebook page dedicated to new mothers had their pictures of breastfeeding heavily censored[1]. The women were furious: this was supposed to be a space to destigmatize the struggles women face during breastfeeding; instead, Facebook's content-moderation model saw the images as inappropriate and removed them, harming the community in the process. Classifying behavior as acceptable vs unacceptable at face value is harmful because, as in this example, it ignores context. The comprehensiveness of ETHICS, applied to various models, could help prevent events like this from happening, though that application has yet to be developed.

To the Future!

While by this point I hope to have convinced you of the exciting ethical implications behind ETHICS, it is worth explicitly mentioning that the dataset has yet to be used for any real tasks. One limitation its creators acknowledge is the lack of cultural and professional diversity among the people who helped create it. I would like to propose a suggestion. Ethical codes should be co-produced with the people they influence, so instead of trying to find English speakers in other countries to contribute to the ETHICS dataset, it would level the playing field to accept scenario contributions in contributors' native languages! In building this sociotechnical imaginary, we want to start on the right foot, and that includes preventing exclusion from the get-go.

Once this is solved, I can see ETHICS being applied to exposing misaligned values in various algorithms, e.g. the racial bias in the COMPAS algorithm, which unjustly predicted higher recidivism risk for Black defendants than for white defendants. Further, once ETHICS is taken to the next level, it is imperative that its results not be taken lightly. There is a common trend of companies investing money in making AI more ethical, then ignoring the results or even firing the researchers who expose the problems (cough cough, Google)[2].

Now that you understand AI alignment and the exciting step ETHICS is taking toward aligning artificial intelligence with human values, I hope you feel more hopeful about the development of AI and more confident engaging in discussions about AI alignment!

  1. ^ T. Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (New Haven: Yale University Press, 2018), ch. 6, “Facebook, Breastfeeding, and Living in Suspension,” pp. 141–172.

  2. ^ “Holding to Account: Safiya Umoja Noble and Meredith Whittaker on Duties of Care and Resistance to Big Tech.” Logic Magazine, 1 Feb. 2022, https://logicmag.io/beacons/holding-to-account-safiya-umoja-noble-and-meredith-whittaker/.

Comments



Had to go digging into the paper to find a link, so I figured I'd add it to the comments: https://github.com/hendrycks/ethics

Do you think this is a useful tool for AGI alignment? I can certainly see it being potentially useful for current models and a useful research tool, but I'm not sure if it is expected to scale.  It'd still be useful either way, but I'm curious about the scope and limitations of the dataset.
