All of SamClarke's Comments + Replies

What is most confusing to you about AI stuff?

On the margin, should donors prioritize AI safety above other existential risks and broad longtermist interventions?

To the extent that this question overlaps with Mauricio's question 1.2 (i.e. a bunch of people seem to argue for "AI stuff is important" but believe / act as if "AI stuff is overwhelmingly important"--what are the arguments for the latter view?), you might find his answer helpful.

other x-risks and longtermist areas seem rather unexplored and neglected, like s-risks

Only a partial answer, but worth noting that I think the most plausible... (read more)

What is most confusing to you about AI stuff?

Is "intelligence" ... really enough to make an AI system more powerful than humans (individuals, groups, or all of humanity combined)?

Some discussion of this question here: https://www.alignmentforum.org/posts/eGihD5jnD6LFzgDZA/agi-safety-from-first-principles-control

What is most confusing to you about AI stuff?

Do we need to decide on a moral principle(s) first? How would it be possible to develop beneficial AI without first 'solving' ethics/morality?

Good question! The answer is no: 'solving' ethics/morality is something we probably need to do eventually, but we could first solve a narrower, simpler form of AI alignment, and then use those aligned systems to help us solve ethics/morality and the other trickier problems (like the control problem for more general, capable systems). This is more or less what is discussed in ambitious vs narrow value learnin... (read more)

What is most confusing to you about AI stuff?

A timely post: https://forum.effectivealtruism.org/posts/DDDyTvuZxoKStm92M/ai-safety-needs-great-engineers

(The focus is software engineering rather than development, but it should still be informative.)

Why AI alignment could be hard with modern deep learning

I love this post and also expect it to be something that I point people towards in the future!

I was wondering about what kind of alignment failure - i.e. outer or inner alignment - you had in mind when describing sycophant models (for schemer models, it's obviously an inner alignment failure).

It seems you could get sycophant models via inner alignment failure, because you could train them on a sensible, well-specified objective function, and yet the model learns to pursue human approval anyway (because "pursuing human approval" turned out to be more easily... (read more)

2Ajeya1mo I was imagining Sycophants as an outer alignment failure, assuming the model is trained with naive RL from human feedback.
General vs specific arguments for the longtermist importance of shaping AI development

(Apologies for my very slow reply.)

I feel like something has gone wrong in this conversation; you have tricked Bob into working on learning from human feedback, rather than convincing him to do so.

I agree with this. If people become convinced to work on AI stuff by specific argument X, then they should definitely go and try to fix X, not something else (e.g. what other people tell them needs doing in AI safety/governance).

I think when I said I wanted a more general argument to be the "default", I meant something very general, that doesn't clearl... (read more)

3rohinmshah1mo I agree with that, and that's what I meant by this statement above:
General vs specific arguments for the longtermist importance of shaping AI development

The problem with general arguments is that they tell you very little about how to solve the problem

Agreed!

If I were producing key EA content/fellowships/etc, I would be primarily interested in getting people to solve the problem

I think this is true for some kinds of content/fellowships/etc, but not all. For those targeted at people who aren't already convinced that AI safety/governance should be prioritised (which is probably the majority), it seems more important to present them with the strongest arguments for caring about AI safety/governance in ... (read more)

8rohinmshah2mo Imagine Alice, an existing AI safety researcher, having such a conversation with Bob, who doesn't currently care about AI safety:

Alice: AGI is decently likely to be built in the next century, and if it is it will have a huge impact on the world, so it's really important to deal with it now.

Bob: Huh, okay. It does seem like it's pretty important to make sure that AGI doesn't discriminate against people of color. And we better make sure that AGI isn't used in the military, or all nations will be forced to do so thanks to Moloch, and wars will be way more devastating.

Alice: Great, I'm glad you agree.

Bob: Okay, so what can we do to shape AGI?

Alice: Well, we should ensure that AGIs don't pursue goals that weren't the ones we intended, so you work on learning from human feedback.

Bob: <works on those topics>

I feel like something has gone wrong in this conversation; you have tricked Bob into working on learning from human feedback, rather than convincing him to do so. Based on how you convinced him to care, he should really be (say) advocating against the use of AI in the military. (This feels very similar to a motte and bailey, where the motte is "AI will be a huge deal so we should influence it" and the bailey is "you should be working on learning from human feedback".) I think it's more accurate to say that you've answered "why should I care about X?" and "if I do care about Y, what should I do?", without noticing that X and Y are actually different.
Lessons learned running the Survey on AI existential risk scenarios

Thanks for the detailed reply, all of this makes sense!

I added a caveat to the final section mentioning your disagreements with some of the points in the "Other small lessons about survey design" section.

We're Redwood Research, we do applied alignment research, AMA

What might be an example of a "much better weird, theory-motivated alignment research" project, as mentioned in your intro doc? (It might be hard to say at this point, but perhaps you could point to something in that direction?)

6Buck2mo I think the best examples would be if we tried to practically implement various schemes that seem theoretically doable and potentially helpful, but quite complicated to do in practice. For example, imitative generalization [https://www.lesswrong.com/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1] or the two-head proposal here [https://www.alignmentforum.org/posts/gEw8ig38mCGjia7dj/answering-questions-honestly-instead-of-predicting-human#Defender]. I can imagine that it might be quite hard to get industry labs to put in the work of getting imitative generalization to work in practice, and so doing that work (which labs could perhaps then adopt) might have a lot of impact.
We're Redwood Research, we do applied alignment research, AMA

How crucial a role do you expect x-risk-motivated AI alignment will play in making things go well? What are the main factors you expect will influence this? (e.g. the occurrence of medium-scale alignment failures as warning shots)

3Buck2mo We could operationalize this as “How does P(doom) vary as a function of the total amount of quality-adjusted x-risk-motivated AI alignment output?” (A related question is “Of the quality-adjusted AI alignment research, how much will be motivated by x-risk concerns?” This second question feels less well defined.)

I’m pretty unsure here. Today, my guess is like 25% chance of x-risk from AI this century, and maybe I imagine that being 15% if we doubled the quantity of quality-adjusted x-risk-motivated AI alignment output, and 35% if we halved that quantity. But I don’t have explicit models here and just made these second two numbers up right now; I wouldn’t be surprised to hear that they moved noticeably after two hours of thought. I guess that one thing you might learn from these numbers is that I think that x-risk-motivated AI alignment output is really important.

I definitely think that AI x-risk seems lower in worlds where we expect medium-scale alignment failure warning shots. I don’t know whether I think that x-risk-motivated alignment research seems less important in those worlds or not--even if everyone thinks that AI is potentially dangerous, we have to have scalable solutions to alignment problems, and I don’t see a reliable route that takes us directly from “people are concerned” to “people solve the problem”.

I think the main factor that affects the importance of x-risk-motivated alignment research is whether it turns out that most of the alignment problem occurs in miniature in sub-AGI systems. If so, much more of the work required for aligning AGI will be done by people who aren’t thinking about how to reduce x-risk.
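(For concreteness, here's a toy sketch of the operationalisation above: P(doom) as a function of a multiplier on the total quantity of quality-adjusted x-risk-motivated alignment output, using only the three point estimates Buck gives. The log-scale interpolation and everything else in the snippet are illustrative assumptions, not his model.)

```python
import numpy as np

# Buck's rough point estimates: P(doom from AI this century) at half the
# current quantity of quality-adjusted x-risk-motivated alignment output,
# at the current quantity, and at double the current quantity.
multipliers = np.array([0.5, 1.0, 2.0])
p_doom = np.array([0.35, 0.25, 0.15])

def p_doom_at(multiplier: float) -> float:
    """Interpolate P(doom) on a log scale of the output multiplier (an assumption)."""
    return float(np.interp(np.log(multiplier), np.log(multipliers), p_doom))

print(p_doom_at(1.5))  # ~0.19 under this toy interpolation
```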
We're Redwood Research, we do applied alignment research, AMA

Some questions that aren't super related to Redwood/applied ML AI safety, so feel free to ignore if not your priority:

  1. Assuming that it's taking too long to solve the technical alignment problem, what might be some of our other best interventions to reduce x-risk from AI? E.g., regulation, institutions for fostering cooperation and coordination between AI labs, public pressure on AI labs/other actors to slow deployment, ...

  2. If we solve the technical alignment problem in time, what do you think are the other major sources of AI-related x-risk that remain? How likely do you think these are, compared to x-risk from not solving the technical alignment problem in time?

So one thing to note is that I think that there are varying degrees of solving the technical alignment problem. In particular, you’ve solved the alignment problem more if you’ve made it really convenient for labs to use the alignment techniques you know about. If next week some theory people told me “hey we think we’ve solved the alignment problem, you just need to use IDA, imitative generalization, and this new crazy thing we just invented”, then I’d think that the main focus of the applied alignment community should be trying to apply these alignment tec... (read more)

How to succeed as an early-stage researcher: the “lean startup” approach

The 'lean startup' approach reminds me of Jacob Steinhardt's post about his approach to research, of which the key takeaways are:

  • When working on a research project, you should basically either be in "de-risking mode" (determining if the project is promising as quickly as possible) OR "execution mode" (assuming the project is promising and trying to do it quickly). This probably looks like trying to do an MVP version of the project quickly, and then iterating on that if it's promising.
  • If a project doesn't work out, ask why. That way you:
    • avoid trying si
... (read more)
How to succeed as an early-stage researcher: the “lean startup” approach

Write out at least 10 project ideas and ask somebody more senior to rank the best few

For bonus points, try to understand how they did the ranking. That way, you can start building up a model of how senior researchers think about evaluating project ideas, and refining your own research taste explicitly.

How to succeed as an early-stage researcher: the “lean startup” approach

Thanks for writing this, I found it helpful and really clearly written!

One reaction: if you're testing research as a career (rather than having committed and now aiming to maximise your chances of success), your goal isn't exactly to succeed as an early stage researcher. It might be that trying your best to succeed is approximately the best way to test your fit - but it seems like there are a few differences:

  • "Going where there's supervision" might be especially important, since a supervisor who comes to know you very well is a big and reliable source of
... (read more)
SamClarke's Shortform

Incidentally, that 80k episode and some from Clearer Thinking are the exact examples I had in mind!

Maybe one could promote specific podcast episodes of this type, see if people found them useful in that way, and if so then encourage those podcasts to have more such eps or a new such podcast to start?

As a step towards this, and in case anyone else finds it independently useful, here are the episodes of Clearer Thinking that I recall finding helpful for my mental health (along with the issues they helped with).

  • #11 Comfort Languages and Nuanced Thinking (f
... (read more)
SamClarke's Shortform

An effective mental health intervention, for me, is listening to a podcast which ideally (1) discusses the thing I'm struggling with and (2) has EA, Rationality or both in the background. I gain both in-the-moment relief, and new hypotheses to test or tools to try.

Especially since it would be scalable, this makes me think that creating an EA mental health podcast would be an intervention worth testing - I wonder if anyone is considering this?

In the meantime, I'm on the look out for good mental health podcasts in general.

3MichaelA2mo This does sound like an interesting idea. And my impression is that many people found the recent mental health related 80k episode [https://80000hours.org/podcast/episodes/depression-anxiety-imposter-syndrome/] very useful (or at least found that it "spoke to them"). Maybe many episodes of Clearer Thinking could also help fill this role? Maybe one could promote specific podcast episodes of this type, see if people found them useful in that way, and if so then encourage those podcasts to have more such eps or a new such podcast to start? Though starting a podcast is pretty low-cost, so it'd be quite reasonable to just try it without doing that sort of research first.
Survey on AI existential risk scenarios

Thanks Rob, interesting question. Here are the correlation coefficients between pairs of scenarios (sorted from max to min):

So it looks like there are only weak correlations between some scenarios.

It's worth bearing in mind that we asked respondents not to give an estimate for any scenarios they'd thought about for less than 1 hour. The correlations could be stronger if we didn't have this requirement.
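(For readers wondering what the analysis looks like mechanically, here's a minimal sketch of how such pairwise correlations could be computed, under my own assumptions: respondent-level data in a table with one column per scenario, blanks where a respondent skipped a scenario, and Pearson correlations taken pair-by-pair. The column names and numbers are hypothetical, not the survey data.)

```python
import numpy as np
import pandas as pd

# Hypothetical responses: one row per respondent, one column per scenario,
# values are probability estimates. Respondents who had thought about a
# scenario for less than an hour left it blank (NaN).
responses = pd.DataFrame({
    "scenario_a": [0.10, 0.30, np.nan, 0.05],
    "scenario_b": [0.15, 0.25, 0.20, np.nan],
    "scenario_c": [0.02, np.nan, 0.10, 0.05],
})

# Pairwise Pearson correlations; NaNs are dropped pair-by-pair, so a
# respondent only contributes to pairs of scenarios they answered.
corr = responses.corr(method="pearson")

# Keep each pair once (upper triangle), then sort from max to min.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
print(corr.where(mask).stack().sort_values(ascending=False))
```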

"Existential risk from AI" survey results

I helped run the other survey mentioned, so I'll jump in here with the relevant results and my explanation for the difference. The full results will be coming out this week.

Results

We asked participants to estimate the probability of an existential catastrophe due to AI (see definitions below). We got:

  • mean: 0.23
  • median: 0.1

Our question isn't directly comparable with Rob's, because we don't condition on the catastrophe being "as a result of humanity not doing enough technical AI safety research" or "as a result of AI systems not doing/optimizing what the peo... (read more)

4RobBensinger6mo Excited to have the full results of your survey released soon! :) I read a few paragraphs of it when you sent me a copy, though I haven't read the full paper.

Your "probability of an existential catastrophe due to AI" got mean 0.23 and median 0.1. Notably, this includes misuse risk along with accident risk, so it's especially striking that it's lower than my survey's Q2, "[risk from] AI systems not doing/optimizing what the people deploying them wanted/intended", which got mean ~0.401 and median 0.3. Looking at different subgroups' answers to Q2:

  • MIRI: mean 0.8, median 0.7.
  • OpenAI: mean ~0.207, median 0.26. (A group that wasn't in your survey.)
  • No affiliation specified: mean ~0.446, median 0.35. (Might or might not include MIRI people.)
  • All respondents other than 'MIRI' and 'no affiliation specified': mean 0.278, median 0.26.

Even the latter group is surprisingly high. A priori, I'd have expected that MIRI on its own would matter less than 'the overall (non-MIRI) target populations are very different for the two surveys':

  • My survey was sent to FHI, MIRI, DeepMind, CHAI, Open Phil, OpenAI, and 'recent OpenAI'.
  • Your survey was sent to four of those groups (FHI, MIRI, CHAI, Open Phil), subtracting OpenAI, 'recent OpenAI', and DeepMind. Yours was also sent to CSER, Mila, Partnership on AI, CSET, CLR, FLI, AI Impacts, GCRI, and various independent researchers recommended by these groups. So your survey has fewer AI researchers, more small groups, and more groups that don't have AGI/TAI as their top focus.
  • You attempted to restrict your survey to people "who have taken time to form their own views about existential risk from AI", whereas I attempted to restrict to anyone "who researches long-term AI topics, or who has done a lot of past work on such topics". So I'd naively expect my population to include more people who (e.g.) work on AI alignment but haven't thought a bunch about risk forecasting; and I'd n
Is there evidence that recommender systems are changing users' preferences?

I'm curious what you think would count as a current ML model 'intentionally' doing something? It's not clear to me that any currently deployed ML models can be said to have goals.

To give a bit more context on what I'm confused about: the model that gets deployed is the one that does best at minimising the loss function during training. Isn't Russell's claim that a good strategy for minimising the loss function is to change users' preferences? Then, whether or not the model is 'intentionally' radicalising people is beside the point.

(I find talk about the goals of AI systems pretty confusing, so I could easily be misunderstanding, or wrong about something)

5reallyeli6mo Yeah, I agree this is unclear. But, staying away from the word 'intention' entirely, I think we can & should still ask: what is the best explanation for why this model is the one that minimizes the loss function during training? Does that explanation involve this argument about changing user preferences, or not?

One concrete experiment that could feed into this: if it were the case that feeding users extreme political content did not cause their views to become more predictable, would training select a model that didn't feed people as much extreme political content? I'd guess training would select the same model anyway, because extreme political content gets clicks in the short-term too. (But I might be wrong.)
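(To make the "no intention needed" point concrete, here's a toy simulation, entirely my own construction rather than anything from the thread or a real recommender system: a selection procedure that just keeps whichever fixed policy gets the most clicks ends up with the preference-shifting policy, even though nothing in the process represents an intention to change anyone's views.)

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(policy: str, steps: int = 1000) -> float:
    """Average click-through rate for a fixed recommendation policy (toy model)."""
    preference = 0.2  # hidden appetite for extreme content
    clicks = 0
    for _ in range(steps):
        if policy == "extreme":
            # Extreme content gets some clicks immediately, and also drifts the
            # user's preference, making future clicks on it more likely.
            p_click = 0.3 + 0.6 * preference
            preference = min(1.0, preference + 0.002)
        else:  # "moderate"
            p_click = 0.4
        clicks += rng.random() < p_click
    return clicks / steps

# "Training" here is just selecting the policy that minimises loss
# (maximises clicks); it picks the preference-shifting policy anyway.
results = {p: simulate(p) for p in ("extreme", "moderate")}
print(results, "selected:", max(results, key=results.get))
```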
What is the likelihood that civilizational collapse would directly lead to human extinction (within decades)?

FYI, broken link here:

I think my views on this are pretty similar to those Beckstead expresses here

3MichaelA9mo Oh, good catch, thanks! I accidentally linked to the title rather than the url [https://blog.givewell.org/2015/08/13/the-long-term-significance-of-reducing-global-catastrophic-risks/]. Now fixed.
My personal cruxes for working on AI safety

On crux 4: I agree with your argument that good alignment solutions will be put to use, in worlds where AI risk comes from AGI being an unbounded maximiser. I'm less certain that they would be in worlds where AI risk comes from structural loss of control leading to influence-seeking agents (the world still gets better in Part I of the story, so I'm uncertain whether there would be sufficient incentive for corporations to use AIs aligned with complex values rather than AIs aligned with profit maximisation).

Do you have any thoughts on this or know if anyone has written about it?

I'm Buck Shlegeris, I do research and outreach at MIRI, AMA

Thanks for the reply! Could you give examples of:

a) two agendas that seem to be "reflecting" the same underlying problem despite appearing very different superficially?

b) a "deep prior" that you think some agenda is (partially) based on, and how you would go about working out how deep it is?

Sure

a)

For example, CAIS and something like the classical "superintelligence in a box" picture disagree a lot on the surface level. However, if you look deeper, you will find many similar problems. A simple-to-explain example: the problem of manipulating the operator - which has (in my view) some "hard core" involving both math and philosophy, where you want the AI to somehow communicate with humans in a way which at the same time allows a) the human to learn from the AI if the AI knows something about the world b) the operator's values are ... (read more)

I'm Buck Shlegeris, I do research and outreach at MIRI, AMA

My sense of the current general landscape of AI Safety is: various groups of people pursuing quite different research agendas, and not very many explicit and written-up arguments for why these groups think their agenda is a priority (a notable exception is Paul's argument for working on prosaic alignment). Does this sound right? If so, why has this dynamic emerged and should we be concerned about it? If not, then I'm curious about why I developed this picture.

I think the picture is somewhat correct, and, perhaps surprisingly, we should not be too concerned about the dynamic.

My model for this is:

1) there are some hard and somewhat nebulous problems "in the world"

2) people try to formalize them using various intuitions/framings/kinds of math; also using some "very deep priors"

3) the resulting agendas look at the surface level extremely different, and create the impression you have

but actually

4) if you understand multiple agendas deeply enough, you get a sense of

  • how they are sometimes "reflecting" t
... (read more)

I think your sense is correct. I think that plenty of people have short docs on why their approach is good; I think basically no-one has long docs engaging thoroughly with the criticisms of their paths (I don't think Paul's published arguments defending his perspective count as complete; Paul has arguments that I hear him make in person that I haven't seen written up.)

My guess is that it's developed because various groups decided that it was pretty unlikely that they were going to be able to convince other groups of their work, and so t... (read more)

Not Buck, but one possibility is that people pursuing different high-level agendas have different intuitions about what's valuable, and those kinds of disagreements are relatively difficult to resolve, and the best way to resolve them is to gather more "object-level" data.

Maybe people have already spent a fair amount of time having in-person discussions trying to resolve their disagreements, and haven't made progress, and this discourages them from writing up their thoughts because they think it won't be a good use of time. However, this line of reasoning

... (read more)
I'm Buck Shlegeris, I do research and outreach at MIRI, AMA

What do you think are the biggest mistakes that the AI Safety community is currently making?

I'm Buck Shlegeris, I do research and outreach at MIRI, AMA

Paul Christiano is a lot more optimistic than MIRI about whether we could align a Prosaic AGI. In a relatively recent interview with AI Impacts he said he thinks "probably most of the disagreement" about this lies in the question of "can this problem [alignment] just be solved on paper in advance" (Paul thinks there's "at least a third chance" of this, but suggests MIRI's estimate is much lower). Do you have a sense of why MIRI and Paul disagree so much on this estimate?

I think Paul is probably right about the causes of the disagreement between him and many researchers, and the summary of his beliefs in the AI Impacts interview you linked matches my impression of his beliefs about this.