SamClarke


Comments

What is most confusing to you about AI stuff?

On the margin, should donors prioritize AI safety above other existential risks and broad longtermist interventions?

To the extent that this question overlaps with Mauricio's question 1.2 (i.e. a bunch of people seem to argue for "AI stuff is important" but believe / act as if "AI stuff is overwhelmingly important" -- what are the arguments for the latter view?), you might find his answer helpful.

other x-risks and longtermist areas seem rather unexplored and neglected, like s-risks

Only a partial answer, but it's worth noting that I think the most plausible source of s-risk is messing up on AI stuff.

What is most confusing to you about AI stuff?

Is "intelligence" ... really enough to make an AI system more powerful than humans (individuals, groups, or all of humanity combined)?

Some discussion of this question here: https://www.alignmentforum.org/posts/eGihD5jnD6LFzgDZA/agi-safety-from-first-principles-control

What is most confusing to you about AI stuff?

Do we need to decide on a moral principle(s) first? How would it be possible to develop beneficial AI without first 'solving' ethics/morality?

Good question! The answer is no: 'solving' ethics/morality is something we probably need to do eventually, but we don't need to do it first. We could first solve a narrower, simpler form of AI alignment, and use those aligned systems to help us solve ethics/morality and the other trickier problems (like the control problem for more general, capable systems). This is more or less what is discussed in ambitious vs narrow value learning. Narrow value learning is one narrower, simpler form of AI alignment. There are others, discussed here under the heading "Alternative solutions".

What is most confusing to you about AI stuff?

A timely post: https://forum.effectivealtruism.org/posts/DDDyTvuZxoKStm92M/ai-safety-needs-great-engineers

(The focus is software engineering rather than development, but it should still be informative.)

Why AI alignment could be hard with modern deep learning

I love this post and also expect it to be something that I point people towards in the future!

I was wondering what kind of alignment failure - i.e. outer or inner alignment - you had in mind when describing sycophant models (for schemer models, it's obviously an inner alignment failure).

It seems you could get sycophant models via inner alignment failure: you could train a model on a sensible, well-specified objective function, and yet it learns to pursue human approval anyway (because "pursuing human approval" turned out to be more easily discovered by SGD).

It also seems you could get sycophant models via outer alignment failure: e.g. a model trained using naive reward modelling (which would be an obviously terrible objective) seems very likely to end up pursuing approval from the humans whose feedback is used to train the reward model.
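
To make the naive reward modelling setup concrete, here's a minimal, hypothetical sketch (the class and function names are my own, purely for illustration, and aren't taken from any particular codebase): the policy's only training signal is the reward model's prediction of human approval, which is why the resulting policy ends up pursuing approval itself.

```python
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Predicts a scalar human-approval score from an output embedding."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, output_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(output_embedding)


def train_reward_model(reward_model, outputs, approval_scores, steps=200):
    """Fit the reward model to human approval labels -- the 'naive' objective."""
    opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
    for _ in range(steps):
        pred = reward_model(outputs).squeeze(-1)
        loss = nn.functional.mse_loss(pred, approval_scores)
        opt.zero_grad()
        loss.backward()
        opt.step()


def policy_loss(policy_output_embeddings, reward_model):
    """The policy's only training signal is *predicted approval*, so maximising
    reward and maximising (modelled) human approval are the same thing here."""
    return -reward_model(policy_output_embeddings).mean()
```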

Does this seem right to you, and if so, which kind of alignment failure did you have in mind?

(Paul has written most explicitly about what a world full of advanced sycophants looks like/how it leads to existential catastrophe, and his stories are about outer alignment, so I'd be especially curious if you disagreed with that.)

General vs specific arguments for the longtermist importance of shaping AI development

(Apologies for my very slow reply.)

I feel like something has gone wrong in this conversation; you have tricked Bob into working on learning from human feedback, rather than convincing him to do so.

I agree with this. If people become convinced to work on AI stuff by specific argument X, then they should definitely go and try to fix X, not something else (e.g. what other people tell them needs doing in AI safety/governance).

I think when I said I wanted a more general argument to be the "default", I meant something very general that doesn't clearly imply any particular intervention - like the one in the most important century series, or the "AI is a big deal" argument (I especially like Max Daniel's version of this).

Then, it's very important to think clearly about what will actually go wrong, and how to actually fix that. But I think it's fine to do this once you're already convinced, by some general argument, that you should work on AI.

I'd be really curious if you still disagree with this?

General vs specific arguments for the longtermist importance of shaping AI development

The problem with general arguments is that they tell you very little about how to solve the problem

Agreed!

If I were producing key EA content/fellowships/etc, I would be primarily interested in getting people to solve the problem

I think this is true for some kinds of content/fellowships/etc, but not all. For those targeted at people who aren't already convinced that AI safety/governance should be prioritised (which is probably the majority), it seems more important to present them with the strongest arguments for caring about AI safety/governance in the first place. This suggests presenting more general arguments.

Then, I agree that you want to get people to help solve the problem, which requires talking about specific failure modes. But I think that doing this prematurely can lead people to dismiss the case for shaping AI development for bad reasons.

Another way of saying this: for AI-related EA content/fellowships/etc, it seems worth separating motivation ("why should I care?") from action ("if I do care, what should I do?"). This would get you the best of both worlds: people are presented with the strongest arguments, allowing them to make an informed decision about how much AI stuff should be prioritised, and are then given the chance to start exploring specific ways to solve the problem.

I think this maybe applies to longtermism in general. We don't yet have that many great ideas of what to do if longtermism is true, and I think that people sometimes (incorrectly) dismiss longtermism for this reason.

Lessons learned running the Survey on AI existential risk scenarios

Thanks for the detailed reply; all of this makes sense!

I added a caveat to the final section, mentioning your disagreements with some of the points in the "Other small lessons about survey design" section.

We're Redwood Research, we do applied alignment research, AMA

What might be an example of a "much better weird, theory-motivated alignment research" project, as mentioned in your intro doc? (It might be hard to say at this point, but perhaps you could point to something in that direction?)

We're Redwood Research, we do applied alignment research, AMA

How crucial a role do you expect x-risk-motivated AI alignment work to play in making things go well? What are the main factors you expect will influence this? (e.g. the occurrence of medium-scale alignment failures as warning shots)
