Accidentally teaching AI models to deceive us (Ajeya Cotra on The 80,000 Hours Podcast)

80000_Hours; Ajeya; Luisa_Rodriguez


Comments 2


Denis

Great article!

The analogy to the economy at the end is wonderful. A lot of us don't realise how badly the economy works. But it's easy to see by just thinking about AI and what's happening right now. People are speculating that AI might one day do as much as 50% of the work now done by humans. A naive outsider might expect us to be celebrating in the streets and introducing a 3-day work-week for everyone. But instead, because our economy works the way it does, with almost all of most people's income directly tied to their "jobs", the reaction is mostly fear that it will eliminate jobs and leave people without any income.

I'm guessing that the vast majority of people would love to move to a condition (which AI could enable) where everyone works only 50% as much but we keep the same benefits. But there is no realistic way to get there with our economy, at least not quickly. Even if we know what we want to achieve, we just cannot overcome all the barriers and Nash equilibria and individual interests. We understand the principles of each different part of the economy, but the whole picture is just far too complex for anyone to understand or for us, even with total collaboration, to manipulate effectively.

I'm sure that if we were trying to design the economy from scratch, we would not want to create a system in which a hedge-fund manager can earn 1000 times as much as a teacher, for example. But that's what we have created. If we cannot control the incentives for humans within a system that we fundamentally understand, how well can we control the incentives for an AI system working in ways that we don't understand?

It's worrying. And yet AI can do so much good in so many ways for so many people that we have to find the right way forward.

Adebayo Mubarak

I think what matters here is having a kill switch, or some set of parameters like [if <situation> occurs, kill], or some other limit on the purview of what a particular model can undertake. If we keep churning out models trained in a general way, there is a high probability of one running riot one day. Limiting what they can do unfortunately undermines the reason we deploy AI in the first place, but as it stands now, we need something urgent to keep this existential risk at bay. Or perhaps it's our paranoia running riot... Perhaps not.


