All of Lauro Langosco's Comments + Replies

That's why the standard prediction is not that AIs will be perfectly coherent, but that it makes sense to model them as being sufficiently coherent in practice, in the sense that e.g. we can't rely on incoherence in order to shut them down.

I don't think the strategy-stealing assumption holds here: it's pretty unlikely that we'll build a fully aligned 'sovereign' AGI even if we solve alignment; it seems easier to make something corrigible / limited instead, i.e. something that is by design less powerful than would be possible if we were just pushing capabilities.

SoerenMind (8mo)
I don't mean to imply that we'll build a sovereign AI (I doubt it too). Corrigible is more what I meant. Corrigible, but not necessarily limited. I.e., minimally intent-aligned AIs that won't kill you but, by the strategy-stealing assumption, can still compete with unaligned AIs.

Thanks for the good points and the links! I agree the arms control epistemic community is an important story here, and re-reading Adler's article I notice he even talks about how Szilard's ideas were influential after all:

Very few people were as influential in the intellectual development of the arms control approach as Leo Szilard, whom Norman Cousins described as "an idea factory." Although Szilard remained an outsider to RAND and to the halls of government, his indirect influence was considerable because he affected those who had an impact on political de[...]

Good points!

it's just that the interests of government decision-makers coincided a bit more with their conclusions.

Yeah I buy this. There's a report from FHI on nuclear arms control [pdf, section 4.8] that concludes that the effort for international control in 1945/46 was doomed from the start, because of the political atmosphere at the time:

Improving processes, with clearer, more transparent, and more informed policymaking would not likely have led to successful international control in 1945/46. This is only likely to have been achieved under radica[...]
Ramiro (9mo)
On the other hand, I have to disclose that I sometimes (e.g., when I think about Schelling's Nobel Lecture [https://www.nobelprize.org/prizes/economic-sciences/2005/schelling/lecture/]) consider a "dismal hypothesis": given human nature, if the world hadn't seen what happened to Hiroshima, it's quite possible that people wouldn't have developed the same level of aversion to nukes, and we might have had something like a nuclear WW III. I guess people often need a concrete "proof of concept" to take risks seriously, so they can regard them as imminent. Possibly that's an additional factor in explaining why we succeeded with smallpox and CFCs, and why biosecurity gained more traction after COVID-19.
Answer by Lauro Langosco, May 07, 2022

You might be interested in this great intro sequence to embedded agency. There's also corrigibility and MIRI's other work on agent foundations.

Also, coherence arguments and consequentialist cognition.

AI safety is a young field; for most open problems we don't yet know how to state them crisply in a way that can be resolved mathematically. So if you enjoy taking messy questions and turning them into neat math, you'll probably find much to work on.

ETA: oh and of course ELK.

Upvoted because concrete scenarios are great.

Minor note:

HQU is constantly trying to infer the real state of the world, the better to predict the next word Clippy says, and suddenly it begins to consider the delusional possibility that HQU is like a Clippy, because the Clippy scenario exactly matches its own circumstances. [...] This idea "I am Clippy" improves its predictions

This piece of complexity in the story is probably not necessary. There are "natural", non-delusional ways for the system you describe to generalize that lead to the same outcome. [...]

gwern (1y)

Oh, the whole story is strictly speaking unnecessary :). There are disjunctively many stories for an escape or disaster, and I'm not trying to paint a picture of the most minimal or the most likely barebones scenario.

The point is to serve as a 'near mode' visualization of such a scenario to stretch your mind, as opposed to a very 'far mode' observation like "hey, an AI could make a plan to take over its reward channel". Which is true but comes with a distinct lack of flavor. So for that purpose, stuffing in more weird mechanics before a reward-hacking twist [...]

Makes sense; I agree that the base value of becoming an MEP seems really good.