You're talking about outer-alignment failure, but I'm concerned about inner-alignment failure. These are different problems: outer-alignment failure is like a tricky genie misinterpreting your wish, while inner-alignment failure involves the AI developing its own unexpected goals.
RLHF doesn't optimize for "human preference" in general; it optimizes for specific reward signals based on limited human feedback in controlled settings. The aspects of reality not captured by that process leave room for proxy goals - goals that work fine in the training environment but fail to generalize to new situations. Generalization might happen by chance, but it becomes less likely as complexity increases.
An AI getting perfect human approval during training doesn't solve the inner-alignment problem if circumstances change significantly - like when the AI gains more control over its environment than it had during training.
We've already seen this pattern with humans and evolution. Humans became "misaligned" with evolution's goal of reproduction because we were optimized for proxy rewards (pleasure/pain) rather than reproduction directly. When we gained more environmental control through technology, these proxy rewards led to unexpected outcomes: we invented contraception, developed preferences for junk food, and seek thrilling but dangerous experiences - all contrary to evolution's original "goal" of maximizing reproduction.
This isn't a naive or outdated concern. It's a case of a simplified example being misunderstood as the actual concern.
It's worth clarifying that Yudkowsky's squiggle maximizer has nothing to do with actual paperclips you can pick up with your hands.
Many people interpreted it as being about an AI that was explicitly instructed to manufacture paperclips, and took the intended lesson to be an outer-alignment failure, i.e. the humans failed to give the AI the correct goal. Yudkowsky has since stated that the originally intended lesson was an inner-alignment failure: the humans gave the AI some other goal, but the AI's internal processes converged on a goal that looks completely arbitrary from the human perspective.
The concern is about an AI manipulating atoms into an indefinitely repeating mass-energy efficient pattern, optimized along a (seemingly arbitrary) narrow dimension of reward.
Why might an AI do something unexpected like this? For reasons analogous to why a rational person guesses blue on every trial of the card experiment described in Lawful Uncertainty, even though some of the cards are red: if 70% of the cards are blue, always guessing blue is correct 70% of the time, while matching the frequencies (guessing blue 70% of the time and red 30%) is correct only about 58% of the time. Even in a random environment, the optimal strategy is often a determinate pattern rather than one that mirrors the perceived probabilities of the environment. Similarly, an AI will optimize toward whatever actually maximizes its reward function, not what appears reasonable or balanced to humans.
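For concreteness, here is a minimal simulation of that comparison. The 70/30 blue/red split and trial count are illustrative assumptions on my part, not figures taken from the original experiment:

```python
import random

# Compare two strategies on a simulated deck that is 70% blue, 30% red:
#   1. always guess blue
#   2. probability matching: guess blue 70% of the time, red 30% of the time
random.seed(0)
TRIALS = 100_000
P_BLUE = 0.7

cards = ["blue" if random.random() < P_BLUE else "red" for _ in range(TRIALS)]

always_blue_correct = sum(card == "blue" for card in cards)
matching_correct = sum(
    card == ("blue" if random.random() < P_BLUE else "red") for card in cards
)

print(f"Always guess blue:    {always_blue_correct / TRIALS:.1%}")  # ~70%
print(f"Probability matching: {matching_correct / TRIALS:.1%}")     # ~58%
```

The "balanced-looking" strategy loses to the rigid one; the deterministic policy is what actually maximizes expected reward.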
This problem isn't prevented by RLHF or by an AI having a sufficiently nuanced understanding of what humans want. A model can demonstrate perfect comprehension of human values in its outputs while its internal optimization processes still converge toward something else entirely.
The apparent human-like reasoning we see in current LLMs doesn't guarantee their internal optimization targets match what we infer from their outputs.
I've had similar considerations. Manifund has projects you can fund directly, some of which are about interpretability. Though without specialized knowledge, I find it difficult to trust my own judgement over that of people whose job it is to research and think strategically about marginal impact.
No one knows how things will shake out in the end, but trade wars don't feel conducive to coordination.
Index cards are good for externalizing, organizing, and engaging with thoughts.
They're small enough to focus a thought without pressure to fill space. Cards can hold multiple thoughts, or just one. This is what my ADHD has found to be useful and low-friction:
Organization
Enhancements
Warnings
Large numbers are abstract. I experimented with different ways to more closely feel these scales and discovered a personally effective approach using division and per-second counting.
The Against Malaria Foundation has protected 611,336,286 people with insecticide-treated nets.
Let's try a larger number: Toby Ord calculates our "affectable universe" as having at least 10²¹ stars.
Counting by 31s ("31 Mississippi, 62 Mississippi, ...") disrupts the familiar rhythm of adding single digits, and this kind of disruptive Mississippi counting works for per-second quotients other than 31.
The full effect comes from simultaneously holding the count, what each increment represents, and the full timespan in mind.
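As a rough sketch of the division step, here is the arithmetic for a few per-second increments. The specific rates below are illustrative assumptions of mine, not necessarily the quotients used above:

```python
# How long would it take to count to a large number at a given per-second
# increment? (The increments below are illustrative, not the post's exact ones.)

SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # 31,536,000

def counting_time(total: float, per_second: float) -> str:
    """Duration of counting `total` at `per_second` increments per second."""
    seconds = total / per_second
    years = seconds / SECONDS_PER_YEAR
    return f"{years:,.1f} years" if years >= 1 else f"{seconds / 86_400:,.1f} days"

print(counting_time(611_336_286, 1))    # ~19.4 years, one net per second
print(counting_time(611_336_286, 31))   # ~228 days, counting by 31s
print(counting_time(1e21, 1_000_000))   # ~31.7 million years, a million stars per second
```

Pairing each count with the stretch of time it implies is what turns an abstract total into something that starts to register.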
I'm interested in learning what techniques others use to feel large numbers.
Thank you for the links. The concerning scenario I imagine is an AI performing something like reflective equilibrium and coming away with something singular and overly reductive, biting bullets we'd rather it didn't, all for the sake of coherence. I don't think current LLM systems are doing this, but greater coherence seems generally useful, so I expect AI companies to seek it. I will read these and see whether something like this is addressed.