Many EAs see AI Alignment research efforts as an important route to mitigating x-risk from AGI. However, others are concerned that alignment research overall increases x-risk by accelerating AGI timelines. I think Michael Nielsen explains this well:

"Practical alignment work makes today's AI systems far more attractive to customers, far more usable as a platform for building other systems, far more profitable as a target for investors, and far more palatable to governments. The net result is that practical alignment work is accelerationist."

 

I think we can distinguish between two broad subtypes of alignment efforts: 1) "implementing guardrails" and 2) "improving goal-directedness". 

I would categorise approaches such as running AI models on custom chips, emergency shutdown mechanisms, red teaming, risk assessments, dangerous capability evaluations, and safety incident reporting as "implementing guardrails". This can be thought of as getting AI systems not to do the worst thing possible.

I would categorise approaches such as RLHF and reward shaping as "improving goal-directedness". This could also be thought of as getting AI systems to do the best thing possible.

I think "implementing guardrails" has much weaker acceleration effects than "improving goal-directedness". An AI system which can be shutdown and does not show dangerous capabilities, is still not very useful if it can't be directed towards the specific goal of the user.

 

So I think people who are worried that AI Alignment efforts might be net-negative because of acceleration effects should consider prioritising "implementing guardrails" approaches to AI Alignment.

Comments (2)

Both approaches are important components of a comprehensive AI safety strategy. With that said, I think that improving goal-directedness (as you've defined it here) is likely to yield more fruitful long-term results for AI safety because: 

  1. A sufficiently advanced AGI (what is often labeled ASI, above human level) could outsmart any guardrails implemented by humans, given enough time and compute power.
  2. Guardrails seem (as you mentioned) to be specifically an approach dedicated to stopping an unaligned AI from causing damage. They do not actually get us closer to an aligned AI. If our goal is alignment, why should the primary focus be on an activity that doesn't get us any closer to aligning an AI?

Thanks for your comment!

I think a sufficiently intelligent ASI is just as likely to outsmart human goal-directedness efforts as it is to outsmart guardrails.

I think number 2 is a good point.

There are many people who actively want to create an aligned ASI as soon as possible to reap its benefits, for whom my suggestion is not useful.

But there are others who primarily want to prevent the creation of a misaligned ASI, and are willing to forgo the creation of an ASI if necessary.

There are also others who want to create an aligned ASI, but are willing to considerably delay this to improve the chances that the ASI is aligned.

I think my suggestion is mainly useful for these second and third groups.
