
Epistemic status: Written in one 2-hour session for a deadline. Probably ill-conceptualised in some way I can't quite make out.

Broader impacts: Could underwrite unfair cynicism. Has been read by a couple of careful alignment people who didn't hate it.


I propose a premortem. The familiar sense of ‘dual-use’ technology (when a civilian technology has military implications) receives a gratifying amount of EA, popular, and government attention. But consider a different sense: AI alignment (AIA) or AI governance work which actually increases existential risk. This tragic line of inquiry has some basic theory under the names ‘differential progress’, ‘accidental harm’, and a generalised sense of ‘dual-use’.

Some missing extensions:

  • There is almost no public evaluation of the downside risks of particular AIA agendas, projects, or organisations. (Some evaluations exist in private, but I, a relative insider, have access to only one: a private survey by a major org. I understand why the contents might not be public, but the existence of such documents seems important to publicise.)
  • In the (permanent) absence of concrete feedback on these projects, we are trading in products whose quality neither producer nor consumer knows. We should model this. Tools from economics could help us reason about our situation (see Methods).
  • As David Krueger noted some years ago, there is little serious public thought about how much AI capabilities work it is wise for young alignment researchers to do for, e.g., career capital or research training.

(There’s a trivial sense that mediocre projects increase existential risk: they represent an opportunity cost, by nominally taking resources from good projects.[1] I instead mean the nontrivial sense that the work could actively increase risk.)

Example: Reward learning

Some work in ML safety will enable the deployment of new systems. Ben Garfinkel gives the example of a robot cleaner:

Let’s say you’re trying to develop a robotic system that can clean a house as well as a human house-cleaner can... This is essentially an alignment problem... until we actually develop these techniques, probably we’re not in a position to develop anything that even really looks like it’s trying to clean a house, or anything that anyone would ever really want to deploy in the real world.

He sees this as positive: it implies massive economic incentives to do some alignment, and a block on capabilities until this is done. But it could also be a liability, if the alignment of weak systems is correspondingly weak, and if mid-term safety work feeds a capabilities feedback loop with ever greater amplification. (That is, successful deployment means profit, which means reinvestment and induced investment in AI capabilities.)
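A toy version of that loop, entirely my own and with made-up numbers: once deployment is possible, reinvested profit compounds on top of baseline capability growth, so the same underlying research climate yields a much steeper capabilities trajectory.

```python
# Toy reinvestment loop (my illustration, not Garfinkel's model): if alignment
# work makes weak systems deployable, some deployment profit is reinvested in
# capabilities, compounding relative to a world where nothing is deployed.

def capability_after(steps: int, reinvest_fraction: float,
                     base_growth: float = 0.05) -> float:
    """Capability index after `steps` periods of exogenous growth plus
    reinvested deployment profit (all parameters are made up)."""
    capability = 1.0
    for _ in range(steps):
        profit = 0.1 * capability          # deployment revenue scales with capability
        capability *= 1.0 + base_growth + reinvest_fraction * profit
    return capability

print("no deployment reinvestment:  ", round(capability_after(10, 0.0), 2))
print("with deployment reinvestment:", round(capability_after(10, 0.5), 2))
```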

More generally, human-modelling approaches to alignment risk improving a system's capability to deceive its operators, and invite beyond-catastrophic 'alignment near-misses' (i.e. s-risks).

 

Methods

1. Private audits plus canaries. 

Interview members of AIA projects under an NDA, or with the interviewee anonymous to me. The resulting public writeup then reports just 5 bits of information about each project: 1) whether the organisation has a process for managing accidental harm, 2) whether this process has been vetted by any independent party, 3) whether any project has in fact been curtailed as a result, 4) whether any such curtailment was due to potentially dangerous capabilities, and 5) whether we are persuaded that the work is net-positive. Refusal to engage is also noted. This process has problems (e.g. positivity bias from employees, or the limits of the audit team's credibility) but seems the best we can do with private endeavours, short of soliciting whistleblowers. Audit the auditors too, why not.

[EDIT: I learn that Allan Dafoe has a very similar idea, not to mention the verifiability mega-paper I never got around to.]
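To make the reporting format concrete, here is a minimal sketch of what each published canary entry might contain, assuming a flat per-organisation record covering the 5 bits above; the structure and field names are my own invention, not an existing standard.

```python
# A purely illustrative schema for the canary report described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CanaryReport:
    org: str                                    # named or anonymised, by agreement
    engaged: bool                               # refusal to engage is itself reported
    has_harm_process: Optional[bool] = None     # 1) process for managing accidental harm
    independently_vetted: Optional[bool] = None  # 2) vetted by an independent party
    project_curtailed: Optional[bool] = None    # 3) any project curtailed as a result
    curtailed_for_capabilities: Optional[bool] = None  # 4) curtailed due to dangerous capabilities
    judged_net_positive: Optional[bool] = None  # 5) auditor persuaded the work is net-positive

# Example entry: an org that declined to be interviewed.
print(CanaryReport(org="Org A (anonymised)", engaged=False))
```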

2. Quasi-economic model

We want to model the AIA ecosystem as itself a weakly aligned optimiser. One obvious route is microeconomic: asymmetric information and the unobservable quality of research outputs, and the associated perils of Goodharting and adverse selection. The other end would be a macroeconomic or political-economy model of AI governance: phenomena like regulatory capture, eminent domain for intellectual property, and ethics-washing as a model for the co-option of alignment resources. The output would be an adapted model offering some qualitative insights despite the (vast) parameter uncertainty, à la Aschenbrenner (2020).
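To give a flavour of the microeconomic route, here is a toy simulation, entirely my own and with made-up parameters: funders rank projects by a noisy, gameable proxy for quality, and as the gameable component gains weight, the average true quality of the funded portfolio falls (adverse selection via Goodharting).

```python
# Toy adverse-selection model of an alignment-research "market". Not calibrated
# to anything; the effort trade-off and noise levels are assumptions.
import numpy as np

rng = np.random.default_rng(0)
N_PROJECTS, N_FUNDED = 2000, 200

# Each project splits a fixed effort budget between substance and legible,
# gameable outputs (a crude trade-off, purely for illustration).
substance = rng.uniform(0.0, 1.0, N_PROJECTS)
signal_inflation = 1.0 - substance
true_quality = substance + rng.normal(0.0, 0.2, N_PROJECTS)  # unobservable net impact

def funded_quality(goodhart_weight: float) -> float:
    """Mean true quality of the portfolio a funder picks when ranking
    projects by a noisy proxy that also rewards signal inflation."""
    proxy = (true_quality
             + rng.normal(0.0, 0.5, N_PROJECTS)
             + goodhart_weight * signal_inflation)
    funded = np.argsort(proxy)[-N_FUNDED:]
    return float(true_quality[funded].mean())

for w in (0.0, 0.5, 2.0):
    print(f"goodhart_weight={w}: mean true quality of funded = {funded_quality(w):.2f}")
```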


3. Case studies. 

What other risks are currently met with false security and safety theatre? What leads to ineffective regulation? (Highly contentious: which fields have been thus hollowed out?) Even successful regulation with real teeth frequently lapses. If this goes well, a public choice model of AI safety could be developed.
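To gesture at what such a public choice model might look like, here is a deliberately crude toy of regulatory capture, entirely my own and with made-up payoffs: a regulator weighs the public benefit of safety stringency against transfers from the regulated industry, and with linear payoffs a small increase in the capture weight flips the outcome from full regulation to none.

```python
# Toy regulatory-capture model: all payoffs and weights are assumptions.
def chosen_stringency(capture_weight: float,
                      public_benefit: float = 1.0,
                      industry_cost: float = 2.0) -> float:
    """Stringency in [0, 1] maximising the regulator's weighted payoff."""
    def payoff(s: float) -> float:
        return ((1 - capture_weight) * public_benefit * s
                - capture_weight * industry_cost * s)
    grid = [i / 100 for i in range(101)]
    return max(grid, key=payoff)

for w in (0.2, 0.3, 0.4):
    print(f"capture weight {w}: chosen stringency = {chosen_stringency(w)}")
```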
 

Risks

Outputs from this project are likely to be partially private, because unsubstantiated critique could indirectly malign perfectly good and productive people, and could have PR and talent-pipeline effects. It is also not clear that the macroeconomic model would yield genuinely new insights rather than merely formalising my existing intuitions.

The canary approach to reporting on private organisations relies on my judgment and credibility. This could be helped by pairing me with someone who has more of either. Similarly, it is entirely possible that all of the above has been done before and merely not reported, leaving my marginal impact near zero. In that case, at least my canaries would prevent further waste.

Cynicism seems as great a risk as soft-pedalling.[2] But I'm a member of the AIA community, likely suffer from social desirability bias, and so am liable to soft-pedal as a result. Social desirability bias is notable even in purely professional settings, and will be worse when a field is also a tight-knit social group. It would be productive to invite relative outsiders to vet my work. (My work on governance organisations may be less biased, for that reason.)

 

  1. ^

    It's not clear whether this bites in the current funding regime tbh.

  2. ^

    Consider the cynicism elsewhere: https://forum.effectivealtruism.org/posts/kageSSDLSMpuwkPKK/response-to-recent-criticisms-of-longtermism-1

Comments

UwU

https://reducing-suffering.org/near-miss/

Just gonna boost this excellent piece by Tomasik. I think partial alignment/near-misses causing s-risk is potentially an enormous concern. This is more true the shorter timelines are and thus the more likely people are to try using "hail mary" risky alignment techniques. Also more true for less principled/Agent Foundations-type alignment directions.

Can someone provide a more realistic example of partial alignment causing s-risk than SignFlip or MisconfiguredMinds? I don't see either of these as something that you'd be reasonably likely to get by say, only doing 95% of the alignment research necessary rather than 110%.

Brian Tomasik wrote something similar about the risks of slightly misaligned artificial intelligence, although it is focused on suffering risks specifically rather than on existential risks in general.

I want a word which covers {x-risk, s-risk}, "Existential or worse".

"x-risk" covers "x-risk or worse" right?

Yes, I'd say so.

I guess that might raise the question of whether there is a term specifically for x-risks that aren't s-risks. My sense is that people often use the term "x-risk" for that concept as well, but in some contexts one might want another term to distinguish the two concepts.

I always thought s-risks are a subset of x-risks, e.g. that's how CLR framed it here:

https://longtermrisk.org/s-risks-talk-eag-boston-2017/ 

Basic argument seems to be: Permanent astronomical hell is also curtailment of humanity's potential, one that is very high in the dimensions of scope (astronomical) and intensity (involves hellish levels of suffering).

Good framing, but I'm surprised they went for it since it partially obscures S behind its larger more popular brother X.

One explanation might be that historically there seemed to have been somewhat of a divide between people worrying about s-risks and x-risks (which were ~ suffering-focused  and ~ classic utilitarians), and this framing might've helped getting more cooperation started.

"At least existential"

Gotta be one word or bust
