Sean🔸

8 karma · Joined · Pursuing a graduate degree (e.g. Master's) · Seeking work · Macon, GA, USA

Participation (1)

  • Received career coaching from 80,000 Hours

Posts (1)

Comments (7)

Thank you for the links. The concerning scenario I imagine is an AI performing something like reflective equilibrium and coming away with something singular and overly reductive, biting bullets we'd rather it didn't, all for the sake of coherence. I don't think current LLM systems are doing this, but greater coherence seems generally useful, so I expect AI companies to seek it. I'll read these and see whether something like this is addressed.

You're talking about outer-alignment failure, but I'm concerned about inner-alignment failure. These are different problems: outer-alignment failure is like a tricky genie misinterpreting your wish, while inner-alignment failure involves the AI developing its own unexpected goals.

RLHF doesn't optimize for "human preference" in general. It optimizes for specific reward signals derived from limited human feedback in controlled settings. Whatever that process fails to capture leaves room for proxy goals: objectives that work fine in the training environment but fail to generalize to new situations. Generalization might happen by chance, but it becomes less likely as complexity increases.
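
A toy sketch of that failure mode (my own illustration, not something from the linked discussion; all the names are made up): a policy selected purely on a proxy signal looks perfect in training, where the proxy and the intended goal coincide, then degrades as soon as new options break that correlation.

```python
import random

random.seed(0)

def intended_goal(action, situation):
    # What we actually want: the action genuinely helps in this situation.
    return action == situation["helpful_action"]

def proxy_reward(action, situation):
    # What the training signal measures: the action *looks* helpful to a rater.
    return action in situation["looks_helpful"]

# Training situations: looking helpful and being helpful coincide.
train = [{"helpful_action": "A", "looks_helpful": {"A"}} for _ in range(100)]

# New situations: an extra option "B" looks helpful to raters but isn't.
deploy = [{"helpful_action": "A", "looks_helpful": {"A", "B"}} for _ in range(100)]

def proxy_maximizing_policy(situation):
    # A policy selected purely for proxy reward is indifferent between
    # actions that score equally well on the proxy.
    best = [a for a in ("A", "B") if proxy_reward(a, situation)]
    return random.choice(best) if best else "A"

for name, situations in [("training", train), ("deployment", deploy)]:
    score = sum(intended_goal(proxy_maximizing_policy(s), s) for s in situations)
    print(name, score / len(situations))  # training ~1.0, deployment ~0.5
```

The proxy score stays at 1.0 in both settings; only the thing we actually cared about drops.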

An AI getting perfect human approval during training doesn't solve the inner-alignment problem if circumstances change significantly - like when the AI gains more control over its environment than it had during training.

We've already seen this pattern with humans and evolution. Humans became "misaligned" with evolution's goal of reproduction because we were optimized for proxy rewards (pleasure/pain) rather than for reproduction directly. When we gained more environmental control through technology, these proxy rewards led to unexpected outcomes: we invented contraception, developed preferences for junk food, and came to seek thrilling but dangerous experiences - all contrary to evolution's original "goal" of maximizing reproduction.

This isn't a naive or outdated concern. It's a case of a simplified example being misunderstood as the actual concern.

It's worth clarifying that Yudkowsky's squiggle maximizer has nothing to do with actual paperclips you can pick up with your hands.

Many people interpreted this as being about an AI that was specifically given the instruction to manufacture paperclips, and took the intended lesson to be one of outer-alignment failure, i.e. humans failed to give the AI the correct goal. Yudkowsky has since stated that the originally intended lesson was one of inner-alignment failure: the humans gave the AI some other goal, but the AI's internal processes converged on a goal that seems completely arbitrary from the human perspective.

The concern is about an AI manipulating atoms into an indefinitely repeating mass-energy efficient pattern, optimized along a (seemingly arbitrary) narrow dimension of reward.

Why might an AI do something unexpected like this? For reasons analogous to why a rational person will guess blue on every draw in the card-guessing experiment described in Lawful Uncertainty, even though some of the cards are red. The post demonstrates that even in a random environment, the optimal strategy is to follow a determinate pattern rather than to match the perceived probabilities of the environment. Similarly, an AI will optimize toward whatever actually maximizes its reward function, not toward what appears reasonable or balanced to humans.
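
To make the numbers concrete, here's a quick simulation (my own sketch, assuming the 70% blue / 30% red split used in the experiment the post describes):

```python
# Compare "always guess the majority color" against probability matching
# on a random sequence of cards that is 70% blue, 30% red.
import random

random.seed(0)
N = 100_000
cards = ["blue" if random.random() < 0.7 else "red" for _ in range(N)]

# Strategy 1: always guess blue (the determinate strategy).
always_blue = sum(card == "blue" for card in cards) / N

# Strategy 2: probability matching -- guess blue 70% of the time, red 30%.
matching = sum(
    card == ("blue" if random.random() < 0.7 else "red") for card in cards
) / N

print(f"always guess blue:    {always_blue:.3f}")  # ~0.70
print(f"probability matching: {matching:.3f}")     # ~0.58 (0.7*0.7 + 0.3*0.3)
```

Matching the environment's randomness costs roughly 12 percentage points of accuracy; the "unreasonable-looking" fixed policy is simply what maximizes the score.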

This problem isn't prevented by RLHF or by an AI having a sufficiently nuanced understanding of what humans want. A model can demonstrate perfect comprehension of human values in its outputs while its internal optimization processes still converge toward something else entirely.

The apparent human-like reasoning we see in current LLMs doesn't guarantee their internal optimization targets match what we infer from their outputs.

I've had similar considerations. Manifund has projects you can fund directly, some of which are about interpretability. Though without specialized knowledge, I find it difficult to trust my judgement over that of people whose job it is to research and think strategically about marginal impact.

No one knows how things will shake out in the end, but trade wars don't feel conducive to coordination.

Index cards are good at externalizing, organizing, and engaging with thoughts.

They're small enough to focus a thought without the pressure to fill space. Cards can hold multiple thoughts, or just one. This is what my ADHD brain has found useful and low-friction:

Organization

  • Reorganize, discard, or add cards fast
  • Arrange on desk/floor to see connections
  • Stack to show categories, priorities, or dependencies
  • Create visual hierarchies by physically overlapping cards (completely or partially)
  • Remove unneeded cards from sight (out of mind)

Enhancements

  • Color-code with pencils or cardstock (use paper guillotine for custom sizes)
  • Hole-punch and add binder rings to keep ordered (hang from thumbtacks, carry in pocket)
  • Print labels for highly-legible permanent information (useful for habits or workflows)
  • Add stickers for fun or as indicators: I have one stack where the number of stickers indicates pomodoros completed, and I flip to the next card on the ring after each session
  • Add fabric with textile glue for tactile elements (polyester ribbon works well)

Warnings

  • Paperclips can get stuck together or fall off
  • Sticky tabs can also fall off

Large numbers are abstract. I experimented with different ways to feel these scales more concretely and found a personally effective approach using division and per-second counting.

The Against Malaria Foundation has protected 611,336,286 people with insecticide-treated nets.

  1. Divide by the number of seconds in a week (604,800), giving approximately 1,000 people per second (arithmetic sketched after this list)
  2. Count aloud: "1 one thousand, 2 one thousand, 3 one thousand..."
  3. Imagine doing it every second for a week
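
The arithmetic, as a tiny script (my own sketch, just to make the division checkable):

```python
# People protected per second, if the AMF total were counted over one week.
people_protected = 611_336_286
seconds_per_week = 7 * 24 * 60 * 60         # 604,800
print(people_protected / seconds_per_week)  # ~1,011 people per second
```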

Let's try a larger number: Toby Ord calculates our "affectable universe" as having at least 10²¹ stars.

  1. Divide by Earth's projected peak population (10 billion), yielding 100 billion stars per person
  2. Divide by the number of seconds in a century (about 3.16 billion), giving approximately 31 stars per person per second (see the sketch after this list)
  3. Count aloud: "31 Mississippi, 62 Mississippi, 93 Mississippi..."
  4. Imagine doing it every second for a century
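
And the same check for the stars figure (again my own sketch; I'm using 365.25-day years for the century):

```python
# Stars per person per second: 10^21 stars shared among 10 billion people,
# each counting every second for a century.
stars = 10**21
people = 10**10                                     # projected peak population
seconds_per_century = 100 * 365.25 * 24 * 60 * 60   # ~3.16 billion
stars_per_person = stars / people                   # 100 billion
print(stars_per_person / seconds_per_century)       # ~31.7 stars per person per second
```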

Counting by 31s disrupts the familiar rhythm of adding single digits. This disruptive "Mississippi counting" works for per-second quotients other than 31, too.

The full effect comes from simultaneously holding the count, what each increment represents, and the full timespan in mind.

I'm interested in learning what techniques others use to feel large numbers.