Guardrails vs Goal-directedness in AI Alignment

freedomandutility

1 min read · Dec 30, 2023

Comments 2

Hayven Frienby

Both approaches are important components of a comprehensive AI safety strategy. With that said, I think that improving goal-directedness (as you've defined it here) is likely to yield more fruitful long-term results for AI safety because:

1. A sufficiently advanced AGI (what is often labeled ASI, above human level) could outsmart any guardrails implemented by humans, given enough time and compute power.
2. Guardrails seem (as you mentioned) to be an approach dedicated specifically to stopping an unaligned AI from causing damage. They do not actually get us closer to an aligned AI. If our goal is alignment, why should the primary focus be on an activity that doesn't get us any closer to aligning an AI?

freedomandutility

Thanks for your comment!

I think a sufficiently intelligent ASI is as likely to outsmart human goal-directedness efforts as it is to outsmart guardrails.

I think number 2 is a good point.

There are many people who actively want to create an aligned ASI as soon as possible to reap its benefits, for whom my suggestion is not useful.

But there are others who primarily want to prevent the creation of a misaligned ASI, and are willing to forgo the creation of an ASI if necessary.

There are also others who want to create an aligned ASI, but are willing to considerably delay this to improve the chances that the ASI is aligned.

I think my suggestion is mainly useful for these second and third groups.
