
1.

At its peak, the space race consumed 4.4% of the US federal budget.

This was surprising to me, until a friend pointed out that landing rockets on specific parts of the moon requires very similar technology to landing rockets in Soviet cities.[1]

I wonder how much more enthusiastic the scientists working on Apollo were, with the convenient motivating story of “I’m working towards a great scientific endeavor” vs “I’m working to make sure we can kill millions if we want to”.

2.

The field of alignment seems to be increasingly dominated by interpretability (and obedience[2]).

This was surprising to me[3], until a friend pointed out that partially opening the black box of NNs is the kind of technology that would help scaling labs find new unhobblings, by letting them notice ways in which the internals of their models are inefficient and giving them better tools to evaluate capabilities advances.[4]

I wonder how much more enthusiastic the alignment researchers working on interpretability and obedience are, with the motivating story of “I’m working on pure alignment research to save the world” vs “I’m building tools and knowledge which scaling labs will repurpose to build better products, shortening timelines to existentially threatening systems”.[5]

3.

You can’t rely on the organizational systems around you to be pointed in the right direction, and there are obvious reasons for commercial incentives to channel your idealistic energy towards types of safety work which are dual-use or even primarily capabilities-enabling. For similar reasons, many of the training programs prepare people for the kinds of jobs which come with large salaries and prestige, as a flawed proxy for moving the needle on x-risk.

If you’re genuinely trying to avert AI doom, please take the time to form inside views away from memetic environments[6] which are likely to have been heavily influenced by commercial pressures. Then back-chain from a theory of change in which the world is more often saved by your actions, rather than going with the current and picking a job with safety in its title as a way to try to do your part.

  1. ^

    Space Race - Wikipedia:

    It had its origins in the ballistic missile-based nuclear arms race between the two nations following World War II and had its peak with the more particular Moon Race to land on the Moon between the US moonshot and Soviet moonshot programs. The technological advantage demonstrated by spaceflight achievement was seen as necessary for national security and became part of the symbolism and ideology of the time.

  2. ^

    Andrew Critch:

    I hate that people think AI obedience techniques slow down the industry rather than speeding it up. ChatGPT could never have scaled to 100 million users so fast if it wasn't helpful at all.

     

    Making AI serve humans right now is highly profit-aligned and accelerant.

     

    Of course, later when robots could be deployed to sustain an entirely non-human economy of producers and consumers, there will be many ways to profit — as measured in money, materials, compute, energy, intelligence, or all of the above — without serving any humans. But today, getting AI to do what humans want is the fastest way to grow the industry.

  3. ^

    These paradigms do not seem to be addressing the most fatal filter in our future: strongly coherent goal-directed agents forming with superhuman intelligence. These will predictably undergo a sharp left turn, where the soft/fuzzy alignment techniques which worked at lower power levels fail simultaneously as the system reaches high enough competence to reflect on itself, its capabilities, and the guardrails we built.

    Interpretability work could plausibly help with weakly aligned, weakly superintelligent systems that do our alignment homework for the much more capable systems to come. But the effort going into this direction seems highly disproportionate to how promising it is: it is not backed by plans to pivot to using these systems for the quite different style of alignment research that's needed, and it generally lacks research closure to avert capabilities externalities.

  4. ^

     From the team that broke the quadratic attention bottleneck:

    Simpler sub-quadratic designs such as Hyena, informed by a set of simple guiding principles and evaluation on mechanistic interpretability benchmarks, may form the basis for efficient large models.

  5. ^

    Ask yourself: “Who will cite my work?”, not “Can I think of a story where my work is used for good things?”

    There is work in these fields which might be good for x-risk, but you need to figure out whether what you're doing is actually in that category for it to be good for the world.

  6. ^

    Humans are natural mimics: we copy the people who have visible signals of doing well, because those are the memes which are likely to be good for our genes, and genes direct where we go looking for memes.

    Wealth, high confidence that they’re doing something useful, being part of a growing coalition: all great signs of good memes, and all much more possessed by people in the interpretability/obedience kind of alignment than by the old-school “this is hard and we don’t know what we’re doing, but it’s going to involve a lot of careful philosophy and math” crowd.

    Unfortunately, this memetic selection is not particularly adaptive for trying to solve alignment.
