Full-time independent deconfusion researcher in AI Alignment. (Also PhD in the theory of distributed computing.)

If you're interested in some of the research ideas you see in my posts, know that I keep private docs with the most compressed version of my deconfusion ideas while I'm getting feedback on them. I can give you access if you PM me!

A list of topics I'm currently doing deconfusion on:

  • Goal-directedness for discussing AI Risk
  • Myopic Decision Theories for dealing with deception (with Evan Hubinger)
  • Universality for many alignment ideas of Paul Christiano
  • Deconfusion itself to get better at it
  • Models of Language Models to clarify the alignment issues surrounding them.




The academic contribution to AI safety seems large

Hmm, I think I worded my point badly in the comment above. What I mean isn't that formal methods will never be useful, just that they're not really useful yet, and will require more pure AI safety research to become useful.

The general reason is that all formal methods try to show that a program follows a specification on a model of computation. Right now, a lot of the work on formal methods applied to AI focuses on adapting known formal methods to the specific programs (say Neural Networks) and the right model of computation (in what contexts do you use these programs, how can you abstract their execution to make it simpler). But one point they fail to address is the question of the specification.

Note that when I say specification, I mean a formal specification. In practice, it's usually a modal logic formula, in LTL for example. And here we get to the crux of my argument: nobody knows the specification for almost all the AI properties we care about. Nobody knows the specification for "Recognizing kittens" or "Correctly answering a question in English". And even for safety questions, we don't yet have a specification of "doesn't manipulate us" or "is aligned". That's the work that still needs to be done, and that's what people like Paul Christiano and Evan Hubinger, among others, are doing. But until we have such specifications, formal methods will not be really useful to either AI capability or AI safety.
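
To make "formal specification" concrete, here is a textbook-style LTL response property, written as a LaTeX sketch. The propositions request and grant are hypothetical names chosen only for illustration; they are not taken from any AI system or from the work discussed here.

```latex
% G reads "always", F reads "eventually".
% Informally: every request is eventually followed by a grant.
\[
  \mathbf{G}\,\bigl(\mathit{request} \rightarrow \mathbf{F}\,\mathit{grant}\bigr)
\]
```

A formula like this only makes sense because request and grant have a precise meaning on the model of computation. For "is aligned", we don't know what would play that role.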

Lastly, I want to point out that working on formal methods for AI is also a means to get money and prestige. I'm not going to go full Hanson and say that's the only reason, but it's still a part of the international situation. I have examples of people getting AI-related funding in France for a project that is really, really useless for AI.

The academic contribution to AI safety seems large

This post annoyed me. Which is a good thing! It means that you hit where it hurts, and you forced me to reconsider my arguments. I also had to update (a bit) toward your position, because I realized that my "counter-arguments" weren't that strong.

Still, here they are:

  • I agree with the remark that much work will have both capability and safety consequences. But instead of seeing that as an argument to laud the safety aspect of capability-relevant work, I want to look for the differential technical progress. What makes me think that EA safety is more relevant than mainstream AI to safety questions is that for almost all EA safety, the differential progress is in favor of safety, while for most research in mainstream/academic AI, the differential progress seems either neutral or in favor of capabilities. (I'd be very interested in counterexamples, on both sides.)
  • Echoing what Buck wrote, I think you might overestimate the value of research that has potential consequences for safety but is not about it. And thus I do think there's significant value gained by focusing on safety problems specifically.
  • As for Formal Methods, they aren't even useful for AI capabilities, let alone for AI safety. I want to write a post about that at some point, but when you're unable to specify what you want, Formal Methods cannot save your ass.

With all that being said, I'm glad you wrote this post and I think I'll revisit it and think more about it.

Is it suffering or involuntary suffering that's bad, and when is it (involuntary) suffering?

Since many other answers treat the more general ideas, I want to focus on the "voluntary" sadness of reading/watching/listening to sad stories. I was curious about this myself, because I noticed that reading only "positive" and "joyous" stories eventually feels empty.

The answer seems to be that sad elements in a story bring more depth than the fun/joyous ones. In that sense, sadness in stories acts as a signal of depth, but also as a way to access some deeper part of our emotions and internal life.

I'm reminded of Mark Manson's quote from this article:

If I ask you, “What do you want out of life?” and you say something like, “I want to be happy and have a great family and a job I like,” it’s so ubiquitous that it doesn’t even mean anything.
A more interesting question, a question that perhaps you’ve never considered before, is what pain do you want in your life? What are you willing to struggle for? Because that seems to be a greater determinant of how our lives turn out.

Maybe sadness and pain just tell us more about others and ourselves, and that's what we find so enthralling.

Causal diagrams of the paths to existential catastrophe

Thanks for that very in-depth answer!

I was indeed thinking about 3., even if 1. and 2. are also important. And I get that the main value of these diagrams is to force an explicit and as formal as possible statement to be made.

I guess my question was more about this: given two different causal diagrams for the same risk (made by different researchers, for example), do you have an idea of how to compare them? Like finding the first difference along the causal path, or other means of comparison. This seems important because even with clean descriptions of our views, we can still talk past each other if we cannot see where the difference truly lies.

Causal diagrams of the paths to existential catastrophe

Great post! I feel these diagrams will be really useful for clarifying the possible interventions and parts of the existential risks.

Do you think they'll also serve for comparing different positions on a specific existential risk, like the trajectories in this post? Or do you envision the diagram for a specific risk as a summary of all causal pathways to this risk?

Cortés, Pizarro, and Afonso as Precedents for Takeover

What about diseases? I admit I know little about this period of history, but the accounts I read (for example in Guns, Germs and Steel) place the advantage in the spread of diseases to the Americas.

Basically, because the Americas lacked many big domesticated mammals, they could not have cities like European ones with cattle everywhere. The conditions of living in these big cities caused the spread of diseases. And when going to the Americas, the conquistadors took these diseases with them to a population which had never experienced them, causing most of the deaths of the early conquests.

(This is the picture from the few sources I've read. So it might be wrong or inaccurate, but if it is, I am very curious as to why.)

Effects of anti-aging research on the long-term future

Also interested. I had not thought about it before, but since the old generation dying is one way scientific and intellectual changes come to be completely accepted, that would probably have a big impact on our intellectual landscape and culture.

My personal cruxes for working on AI safety

I'm curious about the article, but the link points to nothing. ^^

Michelle Graham: How evolution can help us understand wild animal welfare

Thanks a lot for this presentation and the corresponding transcript. I am quite new to thinking about animal welfare at all, and even more so about wild animal welfare, but I felt this presentation was easy to follow even from this point of view (my half-decent knowledge of evolution might have helped).

I like the clarification of evolution, and more specifically, of the fact that natural selection selects away options with bad fitness or bad relative fitness, instead of optimizing fitness to the maximum. That's a common issue when using theoretical computer science for modeling natural systems: instead of looking for the best algorithms for our classical measures (like time or space), we need to take into account the specifics of evolution (some forms of simplicity in the algorithms for example) and not necessarily optimize completely.

On the level of details and nitpicks, I have a few comments:

  • I'm not sure I understand differential reproduction correctly. Is it the fact that (in your example) blue bears have more offspring? Or that these offspring use the advantage of being blue to have even more offspring, which changes the proportion? Or both?
  • For the line representation, I think you wanted to define the line of all humans to be an inch long. Because without this, or a length for one individual, I cannot make sense of the comparison between the line of humans and the line of ants.
  • There is a red cross left in the background of the second "Assumptions of the Argument" slide, the one just after the example of rats and elephants.
  • I had never heard of exaptations! I am curious about references to the literature in general, and maybe also about the specific example you gave about feathers.
  • The hypothesis in the last question that less cognitive power entails more pain, because the signal needs to be stronger in order to be treated and registered... that's a fascinating idea. Horrible, of course, but I never thought about it that way. And that would be a counterweight to the "moral weight" argument about the relative value of different species.

Finally, on the specific topic of intervention for improving welfare, I have one worry: what of cases where two species have mutually exclusive needs? Something like a meat-eater species and the species it eats. In these cases, I feel like evolution left us with some sort of zero-sum game, and there might be necessary welfare tradeoffs because of it.
