AI safety
Studying and reducing the existential risks posed by advanced artificial intelligence

Quick takes

31 · 1mo
Y-Combinator wants to fund Mechanistic Interpretability startups

"Understanding model behavior is very challenging, but we believe that in contexts where trust is paramount it is essential for an AI model to be interpretable. Its responses need to be explainable. For society to reap the full benefits of AI, more work needs to be done on explainable AI. We are interested in funding people building new interpretable models or tools to explain the output of existing models."

Link: https://www.ycombinator.com/rfs (Scroll to 12)

What they look for in startup founders: https://www.ycombinator.com/library/64-what-makes-great-founders-stand-out
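For readers unsure what a "tool to explain the output of an existing model" can look like in practice, here is a minimal, hedged sketch of one common attribution technique (gradient × input) in PyTorch. The tiny stand-in classifier, the random input, and every name below are illustrative assumptions for this sketch only; they are not anything specified in the YC request or by any particular lab.

```python
# A minimal sketch of one common way to explain a model's output:
# gradient x input attribution. Model, input, and feature count are
# illustrative stand-ins, not anything from the YC request.
import torch
import torch.nn as nn

# Stand-in "existing model": a tiny classifier over 8 input features.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
model.eval()

x = torch.randn(1, 8, requires_grad=True)   # one example to explain
logits = model(x)
pred = logits.argmax(dim=-1).item()         # class the model predicts

# Gradient of the predicted logit w.r.t. the input indicates which
# features, nudged slightly, would most change that prediction.
logits[0, pred].backward()
attribution = (x.grad * x).detach().squeeze(0)  # gradient x input scores

for i, score in enumerate(attribution.tolist()):
    print(f"feature {i}: {score:+.4f}")
```

Input-gradient attributions are only one (contested) family of explanation methods; the mechanistic interpretability work referenced above typically goes further, inspecting internal weights and activations rather than just input sensitivities.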
85 · 5mo · 6
Being mindful of the incentives created by pressure campaigns

I've spent the past few months trying to think about the whys and hows of large-scale public pressure campaigns (especially those targeting companies — of the sort that have been successful in animal advocacy). A high-level view of these campaigns is that they use public awareness and corporate reputation as a lever to adjust corporate incentives. But making sure that you are adjusting the right incentives is more challenging than it seems. Ironically, I think this is closely connected to specification gaming: it's often easy to accidentally incentivize companies to do more to look better, rather than doing more to be better.

For example, an AI-focused campaign calling out RSPs recently began running ads that single out AI labs for speaking openly about existential risk (quoting leaders acknowledging that things could go catastrophically wrong). I can see why this is a "juicy" lever — most of the public would be pretty astonished/outraged to learn some of the beliefs that are held by AI researchers. But I'm not sure if pulling this lever is really incentivizing the right thing.

As far as I can tell, AI leaders speaking openly about existential risk is good. It won't solve anything in and of itself, but it's a start — it encourages legislators and the public to take the issue seriously. In general, I think it's worth praising this when it happens. I think the same is true of implementing safety policies like RSPs, whether or not such policies are sufficient in and of themselves.

If these things are used as ammunition to try to squeeze out stronger concessions, it might just incentivize the company to stop doing the good-but-inadequate thing (i.e. CEOs are less inclined to speak about the dangers of their product when it will be used as a soundbite in a campaign, and labs are probably less inclined to release good-but-inadequate safety policies when doing so creates more public backlash than they were
22 · 2mo · 1
Not that we can do much about it, but I find the idea of Trump being president at a time when we're getting closer and closer to AGI pretty terrifying. A second Trump term is going to have a lot more craziness and far fewer checks on his power, and I expect it would have significant effects on the global trajectory of AI.
44 · 5mo · 4
I might elaborate on this at some point, but I thought I'd write down some general reasons why I'm more optimistic than many EAs on the risk of human extinction from AI. I'm not defending these reasons here; I'm mostly just stating them.

* Skepticism of foom: I think it's unlikely that a single AI will take over the whole world and impose its will on everyone else. I think it's more likely that millions of AIs will be competing for control over the world, in a similar way that millions of humans are currently competing for control over the world. Power or wealth might be very unequally distributed in the future, but I find it unlikely that it will be distributed so unequally that there will be only one relevant entity with power. In a non-foomy world, AIs will be constrained by norms and laws. Absent severe misalignment among almost all the AIs, I think these norms and laws will likely include a general prohibition on murdering humans, and there won't be a particularly strong motive for AIs to murder every human either.
* Skepticism that value alignment is super-hard: I haven't seen any strong arguments that value alignment is very hard, in contrast to the straightforward empirical evidence that e.g. GPT-4 seems to be honest, kind, and helpful after relatively little effort. Most conceptual arguments I've seen for why we should expect value alignment to be super-hard rely on strong theoretical assumptions that I am highly skeptical of. I have yet to see significant empirical successes from these arguments. I feel like many of these conceptual arguments would, in theory, apply to humans, and yet human children are generally value aligned by the time they reach young adulthood (at least, value aligned enough to avoid killing all the old people). Unlike humans, AIs will be explicitly trained to be benevolent, and we will have essentially full control over their training process. This provides much reason for optimism.
* Belief in a strong endogenous response to AI:
52 · 7mo · 2
I've just written a blog post to summarise EA-relevant UK political news from the last ~six weeks. The post is here: AI summit, semiconductor trade policy, and a green light for alternative proteins (substack.com) I'm planning to circulate this around some EAs, but also some people working in the Civil Service, political consulting and journalism. Many might already be familiar with the stories. But I think this might be useful if I can (a) provide insightful UK political context for EAs, or (b) provide an EA perspective to curious adjacents. I'll probably continue this if I think either (a) or (b) is paying off. (I work at Rethink Priorities, but this is entirely in my personal capacity).
12 · 2mo · 3
Reducing the probability that AI takeover involves violent conflict seems leveraged for reducing near-term harm

Often in discussions of AI x-safety, people seem to assume that misaligned AI takeover will result in extinction. However, I think AI takeover is reasonably likely to not cause extinction due to the misaligned AI(s) effectively putting a small amount of weight on the preferences of currently alive humans. Some reasons for this are discussed here. Of course, misaligned AI takeover still seems existentially bad and probably eliminates a high fraction of future value from a longtermist perspective.

(In this post when I use the term "misaligned AI takeover", I mean misaligned AIs acquiring most of the influence and power over the future. This could include "takeover" via entirely legal means, e.g., misaligned AIs being granted some notion of personhood and property rights and then becoming extremely wealthy.)

However, even if AIs effectively put a bit of weight on the preferences of current humans it's possible that large numbers of humans die due to violent conflict between a misaligned AI faction (likely including some humans) and existing human power structures. In particular, it might be that killing large numbers of humans (possibly as collateral damage) makes it easier for the misaligned AI faction to take over. By large numbers of deaths, I mean over hundreds of millions dead, possibly billions. But, it's somewhat unclear whether violent conflict will be the best route to power for misaligned AIs and this also might be possible to influence. See also here for more discussion.

So while one approach to avoid violent AI takeover is to just avoid AI takeover, it might also be possible to just reduce the probability that AI takeover involves violent conflict. That said, the direct effects of interventions to reduce the probability of violence don't clearly matter from an x-risk/longtermist perspective (which might explain why there hasn't historically b
4 · 15d
[Question] How should we think about the decision relevance of models estimating p(doom)?

(Epistemic status: confused & dissatisfied by what I've seen published, but haven't spent more than a few hours looking. Question motivated by Open Philanthropy's AI Worldviews Contest; this comment thread asking how OP updated reminded me of my dissatisfaction. I've asked this before on LW but got no response; curious to retry, hence repost)

To illustrate what I mean, switching from p(doom) to timelines:

* The recent post AGI Timelines in Governance: Different Strategies for Different Timeframes was useful to me in pushing back against Miles Brundage's argument that "timeline discourse might be overrated", by showing how choice of actions (in particular in the AI governance context) really does depend on whether we think that AGI will be developed in ~5-10 years or after that.
* A separate takeaway of mine is that decision-relevant estimation "granularity" need not be that fine-grained, and in fact is not relevant beyond simply "before or after ~2030" (again in the AI governance context).
* Finally, that post was useful to me in simply concretely specifying which actions are influenced by timelines estimates.

Question: Is there something like this for p(doom) estimates? More specifically, following the above points as pushback against the strawman(?) that "p(doom) discourse, including rigorous modeling of it, is overrated":

1. What concrete high-level actions do most alignment researchers agree are influenced by p(doom) estimates, and would benefit from more rigorous modeling (vs just best guesses, even by top researchers e.g. Paul Christiano's views)?
2. What's the right level of granularity for estimating p(doom) from a decision-relevant perspective? Is it just a single bit ("below or above some threshold X%") like estimating timelines for AI governance strategy, or OOM (e.g. 0.1% vs 1% vs 10% vs >50%), or something else?
   * I suppose the easy answer is "t
39 · 8mo · 5
Immigration is such a tight constraint for me. My next career steps after I'm done with my TCS Masters are primarily bottlenecked by "what allows me to remain in the UK" and then "keeps me on track to contribute to technical AI safety research". What I would like to do for the next 1-2 years ("independent research"/"further upskilling to get into a top ML PhD program") is not all that viable a path given my visa constraints. Above all, I want to avoid wasting N more years by taking a detour through software engineering again so I can get visa sponsorship. [I'm not conscientious enough to pursue AI safety research/ML upskilling while managing a full-time job.]

Might just try and see if I can pursue a TCS PhD at my current university and do TCS research that I think would be valuable for theoretical AI safety research. The main detriment of that is I'd have to spend N more years in <city>, and I was really hoping to come down to London.

Advice very, very welcome. [Not sure who to tag.]