AlexMennen

When pooling forecasts, use the geometric mean of odds

In fact I am quite puzzled by the fact that neither the average of probabilities nor the average of log odds seem to satisfy the basic invariance property of respecting annualized probabilities.

I think I can make sense of this. If you believe there's some underlying exponential distribution on when some event will occur, but you don't know the annual probability, then an exponential distribution is not a good model for your beliefs about when the event will occur, because a weighted average of exponential distributions with different annual probabilities is not an exponential distribution. This is because if time has gone by without the event occurring, this is evidence in favor of hypotheses with a low annual probability, so an average of exponential distributions should have its annual probability decrease over time.
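
A minimal sketch of this effect, assuming two hypothetical hypotheses about the annual probability (5% and 50%) that start out equally likely. The function names and numbers are illustrative only. It computes the probability that the event occurs in the next year, given that it has not occurred so far, under the mixture:

```python
import math

def mixture_survival(t, rates, weights):
    """P(event has not occurred by time t) under a weighted mix of exponentials."""
    return sum(w * math.exp(-r * t) for r, w in zip(rates, weights))

def conditional_annual_probability(t, rates, weights):
    """P(event occurs in (t, t+1] | it has not occurred by time t)."""
    s_now = mixture_survival(t, rates, weights)
    s_next = mixture_survival(t + 1, rates, weights)
    return 1 - s_next / s_now

# Hypothetical hypotheses: annual probability is either 5% or 50%.
annual_probs = [0.05, 0.50]
# Convert each annual probability to the rate of an exponential distribution.
rates = [-math.log(1 - p) for p in annual_probs]
weights = [0.5, 0.5]  # assumed equal prior credence in each hypothesis

for year in (0, 1, 5, 20):
    p = conditional_annual_probability(year, rates, weights)
    print(f"after {year} years without the event: {p:.3f}")
```

Under these assumptions the conditional annual probability starts between the two hypothesized values and falls toward the lower one as time passes without the event, which is the behavior a single exponential distribution cannot reproduce.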

An exponential distribution seems like the sort of probability distribution that I expect to be appropriate when the mechanism determining when the event occurs is well-understood, so different experts shouldn't disagree on what the annual probability is. If the true annual rate is unknown, then good experts should account for their uncertainty and not report an exponential distribution. Or, in the case where the experts are explicit models and you believe one of the models is roughly correct, then the experts would report exponential distributions, but the average of these distributions is not an exponential distribution, for good reason.

When pooling forecasts, use the geometric mean of odds

I do agree that when new evidence comes in about the experts we should change how we weight them. But when we are pooling the probabilities we aren't receiving any extra evidence about the experts (?).

Right, the evidence about the experts comes from the new evidence that's being updated on, not from the pooling procedure. Suppose we're pooling expert judgments, and we initially consider them all equally credible, so we use a symmetric pooling method. Then some evidence comes in. Our experts update on the evidence, and we also update on how credible each expert is, and pool their updated judgments together using an asymmetric pooling method, weighting experts by how well they anticipated the evidence we've seen so far. This is clearest in the case where each expert is using some model, and we believe one of their models is correct but don't know which one (the case in which you already agreed arithmetic averages of probabilities are appropriate). If we were weighting them all equally, and then we get some evidence that expert 1 thought was twice as likely as expert 2 did, then we should now think that expert 1 is twice as likely as expert 2 to be the one with the correct model, and take a weighted arithmetic mean of their new probabilities in which we weight expert 1 twice as heavily as expert 2. When you do this, your pooled probabilities handle Bayesian updates correctly.

My point was that, even outside of this particular situation, we should still be taking expert credibility into account in some way, and expert credibility should depend on how well the expert anticipated observed evidence. If two experts assign odds ratios $r$ and $s$ to some event before observing new evidence, and we pool these into the odds ratio $\sqrt{rs}$, and then we receive some evidence causing the experts to update to $r'$ and $s'$, respectively, but expert $r$ anticipated that evidence better than expert $s$ did, then I'd think this should mean we would weight expert $r$ more heavily, and pool their new odds ratios into $(r')^{w}(s')^{1-w}$ with $w > 1/2$, or something like that. But we won't handle Bayesian updates correctly if we do! The external Bayesianity property of the mean log odds pooling method means that to handle Bayesian updates correctly, we must update to the odds ratio $\sqrt{r's'}$, as if we learned nothing about the relative credibility of the two experts.
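
The two-model case can be checked numerically. The following is only a sketch: the two experts, their priors, and their likelihoods are made-up numbers, chosen so that expert 1 assigns the evidence E twice the probability that expert 2 does. It reweights the experts by how well they anticipated E, pools their posteriors arithmetically, and compares the result with a direct Bayesian update of the original equal-weight mixture:

```python
def bayes_posterior(prior, lik_if_true, lik_if_false):
    """Return (P(X | E), P(E)) under one expert's model."""
    marginal = prior * lik_if_true + (1 - prior) * lik_if_false
    return prior * lik_if_true / marginal, marginal

# Hypothetical experts: (prior P(X), P(E | X), P(E | not X)).
# Expert 1 assigns P(E) = 0.52, expert 2 assigns P(E) = 0.26.
experts = [(0.8, 0.6, 0.2), (0.3, 0.4, 0.2)]
weights = [0.5, 0.5]  # initially equally credible

posteriors, marginals = zip(*(bayes_posterior(*e) for e in experts))

# Reweight each expert in proportion to the probability they gave to E.
unnormalized = [w * m for w, m in zip(weights, marginals)]
new_weights = [u / sum(unnormalized) for u in unnormalized]

# Weighted arithmetic mean of the experts' updated probabilities.
pooled_after = sum(w * p for w, p in zip(new_weights, posteriors))

# Direct Bayesian update of the original equal-weight mixture.
joint_x_and_e = sum(w * p * lt for w, (p, lt, _) in zip(weights, experts))
direct = joint_x_and_e / sum(unnormalized)

print("new weights:", new_weights)
print("reweighted pool:", pooled_after)
print("direct update of mixture:", direct)
```

With these inputs the new weights come out 2:1 in favor of expert 1, and the reweighted arithmetic pool agrees with the direct update of the mixture.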

I agree that the way I presented it I framed the extreme expert as more knowledgeable. I did this for illustrative purposes. But I believe the setting works just as well when we take both experts to be equally knowledgeable / calibrated.

I suppose one reason not to see this as unfairly biased towards mean log odds is if you generally expect experts who give more extreme probabilities to actually be more knowledgeable in practice. I gave an example in my post illustrating why this isn't always true, but a couple commenters on my post gave models for why it's true under some assumptions, and I suppose it's probably true in the data you've been using that's been empirically supporting mean log odds.

Throwing away the information from the extreme prediction seems bad.

I can see where you're coming from, but I have an intuition that the geometric mean still trusts the information from outlying extreme predictions too much. That made a possible compromise solution occur to me, though to be clear, I'm not seriously endorsing it.

I notice this is very surprising to me, because averaging log odds is anything but senseless.

I called it that because of its poor theoretical properties (I'm still not convinced they arise naturally in any circumstances), but in retrospect I don't really endorse this given the apparently good empirical performance of mean log odds.

log odds make Bayes rule additive, and I expect means to work well when the underlying objects are additive

My take on this is that multiplying odds ratios is indeed a natural operation that you should expect to be an appropriate thing to do in many circumstances, but that taking the nth root of an odds ratio is not a natural operation, and neither is taking geometric means of odds ratios, which combines both of those operations. On the other hand, while adding probabilities is not a natural operation, taking weighted averages of probabilities is.

My gut feeling is this argument relies on an adversarial setting where you might get exploited. And this probably means that you should come up with a probability range for the additional evidence your opponent might have.

So if you think their evidence is uniformly distributed over -1 and 1 bits, you should combine that with your evidence by adding that evidence to your logarithmic odds. This gives you a probability distribution over the possible values. Then use that spread to decide which bet odds are worth the risk of exploitation.

Right, but I was talking about doing that backwards. If you've already worked out the odds at which it's worth accepting bets in each direction, recover the probability that you must currently be assigning to the event in question. The arithmetic mean of the bounds on probabilities implied by the bets you'd accept is a rough approximation to this: if you would bet on X at odds implying any probability less than 2%, and you'd bet against X at odds implying any probability greater than 50%, then this is consistent with you currently assigning probability 26% to X, with a 50% chance that an adversary has evidence against X (in which case X has a 2% chance of being true), and a 50% chance that an adversary has evidence for X (in which case X has a 50% chance of being true).
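
The arithmetic in that example can be spelled out as follows. This is just an illustrative check, with variable names of my own choosing; the geometric mean of odds is included only for comparison and is not part of the argument above:

```python
import math

# Bounds on the probability of X implied by the bets you would accept.
p_if_evidence_against = 0.02
p_if_evidence_for = 0.50
chance_evidence_against = 0.5  # assumed 50/50 split, as in the example

# Current probability as a mixture over what the adversary might know.
current_probability = (
    chance_evidence_against * p_if_evidence_against
    + (1 - chance_evidence_against) * p_if_evidence_for
)
print("arithmetic mean of the bounds:", current_probability)

# For comparison only: pooling the same two bounds by geometric mean of odds.
odds = [p / (1 - p) for p in (p_if_evidence_against, p_if_evidence_for)]
pooled_odds = math.sqrt(odds[0] * odds[1])
geometric_pool = pooled_odds / (1 + pooled_odds)
print("geometric mean of odds:", geometric_pool)
```

The arithmetic mean recovers the 26% figure, while the geometric mean of odds on the same two bounds lands noticeably lower.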

I do not understand how this is about pooling different expert probabilities. But I might be misunderstanding your point.

It isn't. My post was about pooling multiple probabilities of the same event. One source of multiple probabilities of the same event is the beliefs of different experts, which your post focused on exclusively. But a different possible source of multiple probabilities of the same event is the bounds in each direction on the probability of some event implied by the betting behavior of a single expert.

When pooling forecasts, use the geometric mean of odds

I wrote a post arguing for the opposite thesis, and was pointed here. A few comments about your arguments that I didn't address in my post:

Regarding the empirical evidence supporting averaging log odds, note that averaging log odds will always give more extreme pooled probabilities than averaging probabilities does, and in the contexts in which this empirical evidence was collected, the experts were systematically underconfident, so that extremizing the results could make them better calibrated. This easily explains why average log odds outperformed average probabilities, and I don't expect optimally-extremized average log odds to outperform optimally-extremized average probabilities (or similarly, I don't expect unextremized average log odds to outperform average probabilities extremized just enough to give results as extreme as average log odds on average).

External Bayesianity seems like an actively undesirable property for probability pooling methods that treat experts symmetrically. When new evidence comes in, this should change how credible each expert is if different experts assigned different probabilities to that evidence. Thus the experts should not all be treated symmetrically both before and after new evidence comes in. If you do this, you're throwing away the information that the evidence gives you about expert credibility, and if you throw away some of the evidence you receive, you should not expect your Bayesian updates to properly account for all the evidence you received. If you design some way of defining probabilities so that you somehow end up correctly updating on new evidence despite throwing away some of that evidence (as log odds averaging remarkably does), then, once you do adjust to account for the evidence that you were previously throwing away, you will no longer be correctly updating on new evidence (i.e. if you weight the experts differently depending on credibility, and update credibility in response to new evidence, then weighted averaging of log odds is no longer externally Bayesian, and weighted averaging of probabilities is if you do it right).

I talked about the argument that averaging probabilities ignores extreme predictions in my post, but the way you stated it, you added the extra twist that the expert giving more extreme predictions is known to be more knowledgeable than the expert giving less extreme predictions. If you know one expert is more knowledgeable, then of course you should not treat them symmetrically. As an argument for averaging log odds rather than averaging probabilities, this seems like cheating, by adding an extra assumption which supports extreme probabilities but isn't used by either pooling method, giving an advantage to pooling methods that produce extreme probabilities.
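
For concreteness, here is a small sketch of the two pooling methods discussed in these comments. The forecasts and the likelihood ratio are hypothetical, and the external Bayesianity check assumes every expert applies the same likelihood ratio to the new evidence:

```python
import math

def odds(p):
    return p / (1 - p)

def prob(o):
    return o / (1 + o)

def mean_probability(ps):
    """Arithmetic mean of probabilities."""
    return sum(ps) / len(ps)

def geometric_mean_odds(ps):
    """Pool by averaging log odds (geometric mean of odds)."""
    mean_log_odds = sum(math.log(odds(p)) for p in ps) / len(ps)
    return prob(math.exp(mean_log_odds))

def update(p, likelihood_ratio):
    """Bayesian update of probability p by a likelihood ratio."""
    return prob(odds(p) * likelihood_ratio)

forecasts = [0.6, 0.99]  # hypothetical expert probabilities
lr = 3.0                 # hypothetical likelihood ratio shared by all experts

print("mean probability:   ", mean_probability(forecasts))
print("geometric mean odds:", geometric_mean_odds(forecasts))

# External Bayesianity: pooling then updating vs. updating then pooling.
geo_pool_then_update = update(geometric_mean_odds(forecasts), lr)
geo_update_then_pool = geometric_mean_odds([update(p, lr) for p in forecasts])
arith_pool_then_update = update(mean_probability(forecasts), lr)
arith_update_then_pool = mean_probability([update(p, lr) for p in forecasts])

print("geometric:", geo_pool_then_update, geo_update_then_pool)
print("arithmetic (equal weights kept fixed):",
      arith_pool_then_update, arith_update_then_pool)
```

On these inputs the geometric mean of odds gives the more extreme pooled probability, and it commutes with the update while the fixed-weight arithmetic mean does not. The arithmetic mean only handles the update correctly once the weights are also updated.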

Announcing the Buddhists in EA Group

Weird, the link works for me now.

Announcing the Buddhists in EA Group

Thus, I present to you, the Buddhists in EA Facebook group.

Dead link. It says "Sorry, this content isn't available right now. The link you followed may have expired, or the page may only be visible to an audience you're not in."

Why I think the Foundational Research Institute should rethink its approach

My critique of analytic functionalism is that it is essentially nothing but an assertion of this vagueness.

That's no reason to believe that analytic functionalism is wrong, only that it is not sufficient by itself to answer very many interesting questions.

Without a bijective mapping between physical states/processes and computational states/processes, I think my point holds.

No, it doesn't. I only claim that most physical states/processes have only a very limited collection of computational states/processes that they can reasonably be interpreted as, not that every physical state/process has exactly one computational state/process that it can reasonably be interpreted as, and certainly not that every computational state/process has exactly one physical state/process that can reasonably be interpreted as it. Those are totally different things.

it feels as though you're pattern-matching me to IIT and channeling Scott Aaronson's critique of Tononi

Kind of. But to clarify, I wasn't trying to argue that there will be problems with the Symmetry Theory of Valence that derive from problems with IIT. And when I heard about IIT, I figured that there were probably trivial counterexamples to the claim that Phi measures consciousness and that perhaps I could come up with one if I thought about the formula enough, before Scott Aaronson wrote the blog post where he demonstrated this. So although I used that critique of IIT as an example, I was mainly going off of intuitions I had prior to it. I can see why this kind of very general criticism from someone who hasn't read the details could be frustrating, but I don't expect I'll look into it enough to say anything much more specific.

I mention all this because I think analytic functionalism- which is to say radical skepticism/eliminativism, the metaphysics of last resort- only looks as good as it does because nobody’s been building out any alternatives.

But people have tried developing alternatives to analytic functionalism.

Why I think the Foundational Research Institute should rethink its approach

That said, I do think theories like IIT are at least slightly useful insofar as they expand our vocabulary and provide additional metrics that we might care a little bit about.

If you expanded on this, I would be interested.

Why I think the Foundational Research Institute should rethink its approach

Speaking of the metaphysical correctness of claims about qualia sounds confused, and I think precise definitions of qualia-related terms should be judged by how useful they are for generalizing our preferences about central cases. I expect that any precise definition for qualia-related terms that anyone puts forward before making quite a lot of philosophical progress is going to be very wrong when judged by usefulness for describing preferences, and that the vagueness of the analytic functionalism used by FRI is necessary to avoid going far astray.

Regarding the objection that shaking a bag of popcorn can be interpreted as carrying out an arbitrary computation, I'm not convinced that this is actually true, and I suspect it isn't. It seems to me that the interpretation would have to be doing essentially all of the computation itself, and it should be possible to make precise the sense in which brains and computers simulating brains carry out a certain computation that waterfalls and bags of popcorn don't. The defense of this objection that you quote from McCabe is weak; the uncontroversial fact that many slightly different physical systems can carry out the same computation does not establish that an arbitrary physical system can be reasonably interpreted as carrying out an arbitrary computation.

I think the edge cases that you quote Scott Aaronson bringing up are good ones to think about, and I do have a large amount of moral uncertainty about them. But I don't see these as problems specific to analytic functionalism. These are hard problems, and the fact that some more precise theory about qualia may be able to easily answer them is not a point in favor of that theory, since wrong answers are not helpful.

The Symmetry Theory of Valence sounds wildly implausible. There are tons of claims that people put forward, often contradicting other such claims, that some qualia-related concept is actually some other simple thing. For instance, I've heard claims that goodness is complexity and that what humans value is increasing complexity. Complexity and symmetry aren't quite opposites, but they're certainly anti-correlated, and both theories can't be right. These sorts of theories never end up getting empirical support, although their proponents often claim to have empirical support. For example, proponents of Integrated Information Theory often cite that the cerebrum has a higher Phi value than the cerebellum does as support for the hypothesis that Phi is a good measure of the amount of consciousness a system has, as if comparing two data points were enough to support such a claim. It turns out that large regular rectangular grids of transistors, and the operation of multiplication by a large Vandermonde matrix, both have arbitrarily high Phi values, and yet the claim that Phi measures consciousness still survives and claims empirical support, despite this damning disconfirmation. And I think the “goodness is complexity” people also provided examples of good things that they thought they had established are complex and bad things that they thought they had established are not. I know this sounds totally unfair, but I won't be at all surprised if you claim to have found substantial empirical support for your theory, and I still won't take your theory at all seriously if you do, because any evidence you cite will inevitably be highly dubious. The heuristic that claims identifying a qualia-related concept with some other simple thing are wrong, and that claims of empirical support for them never hold up, seems to be pretty well supported.

I am almost certain that there are trivial counterexamples to the Symmetry Theory of Valence, even though perhaps you may have developed a theory sophisticated enough to avoid the really obvious failure modes like claiming that a square experiences more pleasure and less suffering than a rectangle because its symmetry group is twice as large.

My current thoughts on MIRI's "highly reliable agent design" work

There's a strong possibility, even in a soft takeoff, that an unaligned AI would not act in an alarming way until after it achieves a decisive strategic advantage. In that case, the fact that it takes the AI a long time to achieve a decisive strategic advantage wouldn't do us much good, since we would not pick up an indication that anything was amiss during that period.

Reasons an AI might act in a desirable manner before but not after achieving a decisive strategic advantage:

Prior to achieving a decisive strategic advantage, the AI relies on cooperation with humans to achieve its goals, which provides an incentive not to act in ways that would result in it getting shut down. An AI may be capable of following these incentives well before achieving a decisive strategic advantage.

It may be easier to give an AI a goal system that aligns with human goals in familiar circumstances than it is to give it a goal system that aligns with human goals in all circumstances. An AI with such a goal system would act in ways that align with human goals if it has little optimization power but in ways that are not aligned with human goals if it has sufficiently large optimization power, and it may attain that much optimization power only after achieving a decisive strategic advantage (or before achieving a decisive strategic advantage, but after acquiring the ability to behave deceptively, as in the previous reason).

It's the other way around for me. Historical baseline may be somewhat arbitrary and unreliable, but so is 1:1 odds. If the motivation for extremizing is that different forecasters have access to independent sources of information to move them away from a common prior, but that common prior is far from 1:1 odds, then extremizing away from 1:1 odds shouldn't work very well, and historical baseline seems closer to a common prior than 1:1 odds does.

I'm interested in how to get better-justified odds ratios to use as a baseline. One idea is to use past estimates of the same question. For example, suppose Metaculus asks "Does X happen in 2030", and the question closes at the end of 2021, and then it asks the exact same question again at the beginning of 2022. Then the aggregated odds that the first question closed at can be used as a baseline for the second question. Perhaps you could do something more sophisticated: instead of closing the question and opening an identical one, keep the question open, but use the odds that experts gave it at some point in the past as a baseline with which to interpret more recent odds estimates provided by experts. Of course, none of this works if no identical question has been asked previously and the question has only been open for a short time.

Another possibility is to use two pools of forecasters, both of which have done calibration training, but one of which consists of subject-matter experts, and the other of which consists of people with little specialized knowledge on the subject matter, and ask the latter group not to do much research before answering. Then the aggregated odds of the non-experts can be used as a baseline when aggregating odds given by the experts, on the theory that the non-experts can give you a well-calibrated prior because of their calibration training, but won't be taking into account the independent sources of knowledge that the experts have.
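
To make the role of the baseline concrete, here is a rough sketch of extremizing around a baseline rather than around 1:1 odds. The particular form used (scaling the mean log-odds deviation from the baseline by a fixed factor) is my own assumption for illustration, as are the function name, the forecasts, and the factor of 1.5:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def extremized_pool(forecasts, baseline=0.5, factor=1.5):
    """Mean log odds, pushed away from `baseline` by `factor`.

    factor=1 gives plain mean log odds, whatever the baseline is.
    baseline=0.5 corresponds to extremizing away from 1:1 odds.
    """
    base = logit(baseline)
    mean_shift = sum(logit(p) - base for p in forecasts) / len(forecasts)
    return inv_logit(base + factor * mean_shift)

forecasts = [0.10, 0.15, 0.20]  # hypothetical expert probabilities
historical_baseline = 0.05      # hypothetical baseline, e.g. from non-experts

plain = extremized_pool(forecasts, factor=1.0)
from_even_odds = extremized_pool(forecasts, baseline=0.5)
from_baseline = extremized_pool(forecasts, baseline=historical_baseline)

print("plain mean log odds:         ", plain)
print("extremized away from 1:1:    ", from_even_odds)
print("extremized away from baseline:", from_baseline)
```

In this example the experts all sit above the baseline but below 50%, so the two choices push the pooled estimate in opposite directions: extremizing away from 1:1 odds lowers it, while extremizing away from the baseline raises it.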