A list of good heuristics that the case for AI X-risk fails

Presumably there are two categories of heuristics here: ones that point at actual difficulties in discerning the ground truth, and ones that are irrelevant or stem from a misunderstanding. It seems bad to me that this list implicitly casts the heuristics as being in the latter category, and that rather than linking to an explanation of why each is irrelevant or a misunderstanding, it does something closer to mocking the concern.

For example, I would decompose the "It's not empirically testable" heuristic into two different components. The first is something like "it's way easier to do good work when you have tight feedback loops, and a project that relates to a single-shot opportunity without a clear theory simply cannot have tight feedback loops." This was the primary reason I stayed away from AGI safety for years, and it still seems to me like a major challenge to research work here. [I was eventually convinced that it was worth putting up with this challenge, however.]

The second is something like "only trust claims that have been empirically verified", which runs into serious problems with situations where the claims are about the future, or where running the test is ruinously expensive. A claim that 'putting lamb's blood on your door tonight will cause your child to be spared' is one that you have to act on (or not) before you get to observe whether it was effective, and so whether this heuristic helps depends on whether it's possible to have any edge, ahead of time, on figuring out which such claims are accurate.

I'm Buck Shlegeris, I do research and outreach at MIRI, AMA
I certainly don't think agents "should" try to achieve outcomes that are impossible from the problem specification itself.

I think you need to make a clearer distinction here between "outcomes that don't exist in the universe's dynamics" (like taking both boxes and receiving $1,001,000) and "outcomes that can't exist in my branch" (like there not being a bomb in the unlucky case). Because if you're operating just in the branch you find yourself in, many outcomes whose probability an FDT agent is trying to affect are impossible from the problem specification (once you include observations).

And, to be clear, I do think agents "should" try to achieve outcomes that are impossible from the problem specification including observations, if certain criteria are met, in a way that basically lines up with FDT, just like agents "should" try to achieve outcomes that are already known to have happened from the problem specification including observations.

As an example, if you're in Parfit's Hitchhiker, you should pay once you reach town, even though reaching town has probability 1 in the cases where you're deciding whether or not to pay; the reason is that your disposition to pay was necessary for reaching town to have had probability 1.
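To make this concrete, here's a toy payoff model of Parfit's Hitchhiker with a perfect predictor. The specific utility numbers are my own illustrative assumptions, not part of the problem as usually stated:

```python
# Toy model of Parfit's Hitchhiker with a perfect predictor.
# Utility numbers are illustrative assumptions: dying in the desert = -1_000_000,
# paying once in town = -1_000, reaching town without paying = 0.

DIE, PAY, FREE_RIDE = -1_000_000, -1_000, 0

def outcome(policy):
    """The driver predicts your in-town decision and only drives you to
    town if they predict you'll pay. With a perfect predictor, the
    prediction just is your policy."""
    if policy == "pay":
        return PAY        # driven to town, you pay
    else:
        return DIE        # left in the desert

# Evaluating whole policies (the FDT-style move): pick the policy whose
# outcome is best.
best = max(["pay", "refuse"], key=outcome)
assert best == "pay"

# Conditioning only on the observation "I'm already in town": paying
# (-1_000) looks worse than free-riding (0), but agents who reason that
# way never make it to town in the first place, because the free-ride
# outcome is unreachable given the predictor.
```

The point the toy model illustrates: once you include the observation "I reached town", not paying looks strictly better, but that outcome is impossible from the problem specification including observations.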

I'm Buck Shlegeris, I do research and outreach at MIRI, AMA

Oh, an additional detail that I think was part of that conversation: there's really only one way to have a '0-error' state in a hierarchical controls framework, but there are potentially many consonant energy distributions that are dissonant with each other. Whether that's true, and whether each is individually positive-valence, will be interesting to find out.

(If I had to guess, I would guess the different mutually-dissonant, internally-consonant distributions correspond to things like 'moods', in a way that means they're not really valence but are somewhat close, and also that they exist. The thing that seems vaguely in this style is differing brain waves during different cycles of sleep, but I don't know whether those have clear waking analogs, or what they look like in the CSHW picture.)

I'm Buck Shlegeris, I do research and outreach at MIRI, AMA

FWIW I agree with Buck's criticisms of the Symmetry Theory of Valence (both content and meta) and also think that some other ideas QRI are interested in are interesting. Our conversation on the road trip was (I think) my introduction to Connectome Specific Harmonic Waves (CSHW), for example, and that seemed promising to think about.

I vaguely recall us managing to operationalize a disagreement, let me see if I can reconstruct it:

- A 'multiple drive' system, like PCT's hierarchical control system, has an easy time explaining independent desires and different flavors of discomfort. (If one both has a 'hunger' control system and a 'thirst' control system, one can easily track whether one is hungry, thirsty, both, or neither.) A 'single drive' system, like expected utility theories more generally, has a somewhat more difficult time explaining independent desires and different flavors of discomfort, since you only have the one 'utilon' number.
- But this is mostly because we're looking at different parts of the system as the 'value'. If I have a vector of 'control errors', I get the nice multidimensional property. If I have a utility function that's a function of a vector, the gradient of that function will be a vector that gives me the same nice multidimensional property.
- CSHW gives us a way to turn the brain into a graph, and then the graph activations into energies in different harmonics. Then we can look at an energy distribution and figure out how consonant or dissonant it is. This gives us the potentially nice property that 'gradients are easy', where if 'perfect harmony' (= all consonant energy) corresponds to the '0 error' case in PCT, being hungry will look like missing some consonant energy or having some dissonant energy.
- Here we get the observational predictions: for PCT, 'hunger' and 'thirst' and whatever other drives just need to be wire voltages somewhere, but for QRI's theory as I understand it, they need to be harmonic energies with particular numerical properties (such that they are consonant or dissonant as expected, to make STV work out).
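The point that a scalar utility's gradient recovers the same multidimensional structure as a vector of control errors can be made concrete with a toy sketch. The drives, setpoints, and quadratic utility below are illustrative assumptions of mine, not PCT's or QRI's actual model:

```python
# Illustrative sketch (not PCT's or QRI's actual model): a single scalar
# "utility" computed from per-drive control errors still carries
# per-drive information in its gradient.

def control_errors(state):
    # hypothetical drives: error = setpoint - current level
    return {"hunger": 0.7 - state["food"], "thirst": 0.4 - state["water"]}

def utility(state):
    # one number: negative sum of squared control errors
    return -sum(e ** 2 for e in control_errors(state).values())

def utility_gradient(state, eps=1e-6):
    # finite-difference gradient of the scalar utility, one component
    # per state variable
    grad = {}
    for k in state:
        bumped = dict(state, **{k: state[k] + eps})
        grad[k] = (utility(bumped) - utility(state)) / eps
    return grad

state = {"food": 0.2, "water": 0.4}   # hungry, not thirsty
g = utility_gradient(state)
# The gradient separates the drives again: a large positive component
# for food (eating helps a lot) and a near-zero component for water.
```

So whether you read off the error vector directly or differentiate the scalar, you get the same multidimensional signal; the difference is which part of the system you call the 'value'.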

Of course, it could be the case that there are localized harmonics in the connectome, such that we get basically the same vector represented in the energy distribution, and don't have a good way to distinguish between them.

On that note, I remember we also talked about the general difficulty of distinguishing between theories in this space; for example, my current view is that Friston-style predictive coding approaches and PCT-style hierarchical control approaches end up predicting very similar brain architecture, and the difference is 'what seems natural' or 'which underlying theory gets more credit.' (Is it the case that the brain is trying to be Bayesian, or the brain is trying to be homeostatic, and embedded Bayesianism empirically performs well at that task?) I expect a similar thing could be true here, where whether symmetry is the target or the byproduct is unclear, but in such cases I normally find myself reaching for 'byproduct'. It's easy to see how evolution could want to build homeostatic systems, and harder to see how evolution could want to build Bayesian systems; I think a similar story goes through for symmetry and brains.

This makes me more sympathetic to something like "symmetry will turn out to be a marker for something important and good" (like, say, 'focus') than something like "symmetry is definitionally what feeling good is."

We Could Move $80 Million to Effective Charities, Pineapples Included

Thanks! Also, for future opportunities like this, probably the fastest person to respond will be Colm.

Against Modest Epistemology

But as I understand it, Eliezer regards himself as being able to do unusually well using the techniques he has described, and so would predict his own success in forecasting tournaments.

This is also my model of Eliezer; my point is that my thoughts on modesty / anti-modesty are mostly disconnected from whether or not Eliezer is right about his forecasting accuracy, and mostly connected to the underlying models of how modesty and anti-modesty work as epistemic positions.

How narrowly should you define the 'expert' group?

I want to repeat something to make sure there isn't confusion or a double illusion of transparency: "narrowness" doesn't refer just to the size of the group, but also to the qualities that are being compared to determine who counts as an expert and who doesn't.

Against Modest Epistemology

I think with Eliezer's approach, superforecasters should exist, and it should be possible to be aware that you are a superforecaster. Those both seem like they would be lower probability under the modest view. Whether Eliezer personally is a superforecaster seems about as relevant as whether Tetlock is one; you don't need to be a superforecaster to study them.

I expect Eliezer to agree that a careful aggregation of superforecasters will outperform any individual superforecaster; similarly, I expect Eliezer to think that a careful aggregation of anti-modest reasoners will outperform any individual anti-modest reasoner.

It's worth considering what careful aggregations look like when not dealing with binary predictions. The function of a careful aggregation is to disproportionately silence error while maintaining signal. With many short-term binary predictions, we can use methods that focus on the outcomes, without any reference to how those predictors are estimating those outcomes. With more complicated questions, we can't compare outcomes directly, and so need to use the reasoning processes themselves as data.

That suggests a potential disagreement to focus on: the anti-modest view suspects that one can do a careful aggregation based on reasoner methodology (say, weighting more heavily forecasters who adjust their estimates more frequently, or who report using Bayes, or so on), whereas I think the modest view suspects that this won't outperform uniform aggregation.

(The modest view has two components--approving of weighting past performance, and disapproving of other weightings. Since other approaches can agree on the importance of past performance, and the typical issues where the two viewpoints differ are those where we have little data on past performance, it seems more relevant to focus on whether the disapproval is correct than whether the approval is correct.)