
Timidity seems unobjectionable to me, and the arguments against it in section 3 seem unconvincing.

3.1: Marginal utility in number of lives already dropping off very steeply by 1000 seems implausible, but if we replace 1000 with a sufficiently large number, an agent with a bounded utility function would deny that these prospects keep getting better for the same (rational, imo) reasons they would eventually stop taking the devil's deals to get more years of happy life with high probability.
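To make the analogy with the devil's deals concrete, here is a minimal sketch. The utility function, the `scale` of 1000, and the terms of each deal (ten times the years at 0.999 times the probability) are all assumptions chosen for illustration, not taken from the paper. It only shows that an agent with a bounded utility function accepts some deals and then stops.

```python
def bounded_utility(years, scale=1000.0):
    """Hypothetical bounded utility: increases with years but never exceeds 1."""
    return years / (years + scale)

def expected_utility(deals_taken, survival_per_deal=0.999, growth=10.0):
    """Expected utility after accepting `deals_taken` deals in a row.

    Assumed deal: multiply years of happy life by `growth`, at the price of
    multiplying the probability of getting anything by `survival_per_deal`.
    """
    years = growth ** deals_taken
    return survival_per_deal ** deals_taken * bounded_utility(years)

# Expected utility of stopping after 0, 1, 2, ... deals.
values = [expected_utility(k) for k in range(15)]
best = max(range(len(values)), key=values.__getitem__)
print(best, values[best])
```

Under these assumed numbers the agent takes the first few deals, where utility is far from its bound, and declines once further years add less expected utility than the extra risk costs.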

3.2: It seems perfectly reasonable to me to selectively create valuable things in situations in which value isn't already nearly maxed out, even if you can create valuable things much less efficiently in those situations, which is all that's going on here. Also disagree that it's strange for the value of creating lives to depend on what's happening far away to any significant degree at all; for an example that might seem more intuitive to some, what if the lives you're about to create already exist elsewhere? Wouldn't it be much better to create different ones instead?

3.3: If we replace 10^10 and 10^80 with 10^1000 and 10^8000, then I'd prefer A over B. I'm not sure what this is supposed to add that the initial example with the devil's deals for more years of happy life didn't.

3.4: Longshot bets like the "Lingering doubt" scenario are very different from longshot bets like Pascal's mugger, in ways that make them seem much more palatable to me (see 3.2). Furthermore, longshot bets as a criticism of recklessness can be seen as a finitistic stand-in for issues of nonconvergence of expected utility, which isn't a problem for timidity.

Furthermore, I think the paper seriously undersells (in section 4) how damning it is that recklessness violates prospect-outcome dominance. This implies vulnerability to Dutch books. After playing the St. Petersburg lottery, no matter what the outcome, it will only have finite value, less than the St. Petersburg lottery is worth in expectation, so if given the option, the agent will spend 1 util to try again, and replace their payout with whatever they get the next time around. They will do this no matter what the outcome is the first time, even though their prospects are no better on the second attempt.
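A small sketch of why every realized payout falls short of the lottery's expectation. It assumes the standard St. Petersburg lottery (a payout of 2^k utils with probability 2^-k), which may differ in detail from the version in the paper; the function names are hypothetical.

```python
from fractions import Fraction

def truncated_expectation(n_terms):
    """Expected payout counting only the first n_terms outcomes.

    Outcome k (k = 1, 2, ...) pays 2**k utils with probability 2**-k,
    so each term contributes exactly 1 util and the sum grows without bound.
    """
    return sum(Fraction(1, 2**k) * 2**k for k in range(1, n_terms + 1))

def terms_needed_to_beat(current_payout, cost=1):
    """Smallest truncation whose expectation, net of `cost`, beats the payout.

    Any finite payout is eventually exceeded, so an expected-value maximizer
    always prefers paying `cost` utils to replay over keeping what it has.
    """
    n_terms = 1
    while truncated_expectation(n_terms) - cost <= current_payout:
        n_terms += 1
    return n_terms

for payout in (2, 8, 64):
    print(payout, terms_needed_to_beat(payout))
```

Since the loop terminates for every finite payout, the agent pays to replay whatever the first outcome was, which is the Dutch book described above.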

I am more hesitant to recommend the more complex extremization method where we use the historical baseline resolution log-odds

It's the other way around for me. Historical baseline may be somewhat arbitrary and unreliable, but so is 1:1 odds. If the motivation for extremizing is that different forecasters have access to independent sources of information to move them away from a common prior, but that common prior is far from 1:1 odds, then extremizing away from 1:1 odds shouldn't work very well, and historical baseline seems closer to a common prior than 1:1 odds does.
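As a rough illustration of the difference, the sketch below extremizes an average of log odds away from a chosen baseline. The forecasts, the 5% baseline, and the extremizing factor of 1.5 are made-up values, and `extremized_pool` is a hypothetical helper rather than anyone's actual method.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def extremized_pool(probs, baseline=0.5, factor=1.5):
    """Average the forecasts' log odds relative to the baseline log odds,
    then scale that average shift by `factor` (factor > 1 extremizes)."""
    base = logit(baseline)
    mean_shift = sum(logit(p) - base for p in probs) / len(probs)
    return sigmoid(base + factor * mean_shift)

forecasts = [0.10, 0.15, 0.20]  # hypothetical expert forecasts
plain = extremized_pool(forecasts, factor=1.0)          # unextremized mean log odds
around_even = extremized_pool(forecasts, baseline=0.5)  # pushed away from 1:1 odds
around_prior = extremized_pool(forecasts, baseline=0.05)  # pushed away from a 5% prior
print(plain, around_even, around_prior)
```

If the common prior really were around 5%, these forecasters have all moved upward from it, so extremizing away from the prior pushes the pooled probability up, while extremizing away from 1:1 odds pushes it down, in the opposite direction from the experts' shared evidence.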

I'm interested in how to get better-justified odds ratios to use as a baseline. One idea is to use past estimates of the same question. For example, suppose Metaculus asks "Does X happen in 2030", and the question closes at the end of 2021, and then it asks the exact same question again at the beginning of 2022. Then the aggregated odds that the first question closed at can be used as a baseline for the second question. Perhaps you could do something more sophisticated, like, instead of closing the question and opening an identical one, keep the question open, but use the odds that experts gave it at some point in the past as a baseline with which to interpret more recent odds estimates provided by experts. Of course, none of this works if there hasn't been an identical question asked previously and the question has only been open for a short amount of time.

Another possibility is to use two pools of forecasters, both of which have done calibration training, but one of which consists of subject-matter experts, and the other of which consists of people with little specialized knowledge on the subject matter, and ask the latter group not to do much research before answering. Then the aggregated odds of the non-experts can be used as a baseline when aggregating odds given by the experts, on the theory that the non-experts can give you a well-calibrated prior because of their calibration training, but won't be taking into account the independent sources of knowledge that the experts have.

In fact I am quite puzzled by the fact that neither the average of probabilities nor the average of log odds seem to satisfy the basic invariance property of respecting annualized probabilities.

I think I can make sense of this. If you believe there's some underlying exponential distribution on when some event will occur, but you don't know the annual probability, then an exponential distribution is not a good model for your beliefs about when the event will occur, because a weighted average of exponential distributions with different annual probabilities is not an exponential distribution. This is because if time has gone by without the event occurring, this is evidence in favor of hypotheses with a low annual probability, so an average of exponential distributions should have its annual probability decrease over time.
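A short numerical sketch of this point, using a hypothetical 50/50 mixture of two exponential distributions with made-up rates. The conditional probability of the event happening in the next year, given that it has not happened yet, falls over time for the mixture but stays constant for a single exponential.

```python
import math

def survival(t, rates, weights):
    """P(no event by time t) under a weighted mixture of exponentials."""
    return sum(w * math.exp(-r * t) for r, w in zip(rates, weights))

def annual_probability(year, rates, weights):
    """P(event in [year, year + 1) | no event before `year`)."""
    return 1 - survival(year + 1, rates, weights) / survival(year, rates, weights)

rates = [0.01, 0.5]    # hypothetical low-rate and high-rate hypotheses
weights = [0.5, 0.5]   # equal prior credence in each

mixture = [annual_probability(y, rates, weights) for y in range(0, 20, 5)]
single = [annual_probability(y, [0.1], [1.0]) for y in range(0, 20, 5)]
print(mixture)
print(single)
```

The mixture's annual probability declines because survival shifts weight toward the low-rate hypothesis, which is the Bayesian update described above.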

An exponential distribution seems like the sort of probability distribution that I expect to be appropriate when the mechanism determining when the event occurs is well-understood, so different experts shouldn't disagree on what the annual probability is. If the true annual rate is unknown, then good experts should account for their uncertainty and not report an exponential distribution. Or, in the case where the experts are explicit models and you believe one of the models is roughly correct, then the experts would report exponential distributions, but the average of these distributions is not an exponential distribution, for good reason.

I do agree that when new evidence comes in about the experts we should change how we weight them. But when we are pooling the probabilities we aren't receiving any extra evidence about the experts (?).

Right, the evidence about the experts comes from the new evidence that's being updated on, not the pooling procedure. Suppose we're pooling expert judgments, and we initially consider them all equally credible, so we use a symmetric pooling method. Then some evidence comes in. Our experts update on the evidence, and we also update on how credible each expert is, and pool their updated judgments together using an asymmetric pooling method, weighting experts by how well they anticipated evidence we've seen so far. This is clearest in the case where each expert is using some model, and we believe one of their models is correct but don't know which one (the case where you already agreed arithmetic averages of probabilities are appropriate). If we were weighting them all equally, and then we get some evidence that expert 1 thought was twice as likely as expert 2 did, then now we should think that expert 1 is twice as likely to be the one with the correct model as expert 2 is, and take a weighted arithmetic mean of their new probabilities where we weight expert 1 twice as heavily as expert 2. When you do this, your pooled probabilities handle Bayesian updates correctly.

My point was that, even outside of this particular situation, we should still be taking expert credibility into account in some way, and expert credibility should depend on how well the expert anticipated observed evidence. If two experts assign odds ratios $r_0$ and $s_0$ to some event before observing new evidence, and we pool these into the odds ratio $r_0^{1/2} s_0^{1/2}$, and then we receive some evidence causing the experts to update to $r_1$ and $s_1$, respectively, but expert r anticipated that evidence better than expert s did, then I'd think this should mean we would weight expert r more heavily, and pool their new odds ratios into $r_1^{2/3} s_1^{1/3}$, or something like that. But we won't handle Bayesian updates correctly if we do! The external Bayesianity property of the mean log odds pooling method means that to handle Bayesian updates correctly, we must update to the odds ratio $r_1^{1/2} s_1^{1/2}$, as if we learned nothing about the relative credibility of the two experts.
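The model-uncertainty case can be checked numerically. In the sketch below, the two expert models and all their probabilities are invented for illustration, chosen so that expert 1 assigns the evidence E twice the probability that expert 2 does. It compares reweighting the experts by how well they anticipated E against updating the pooled mixture on E directly.

```python
import math

# Each hypothetical expert model: (P(X), P(E | X), P(E | not X)).
experts = [(0.6, 0.5, 0.25), (0.2, 0.5, 0.125)]
weights = [0.5, 0.5]  # equal credibility before seeing the evidence E

def prob_evidence(p_x, e_given_x, e_given_not_x):
    """Probability this expert's model assigned to the evidence E."""
    return p_x * e_given_x + (1 - p_x) * e_given_not_x

def posterior(p_x, e_given_x, e_given_not_x):
    """This expert's own Bayesian update P(X | E)."""
    return p_x * e_given_x / prob_evidence(p_x, e_given_x, e_given_not_x)

# Reweight experts in proportion to how likely each thought E was.
unnormalized = [w * prob_evidence(*e) for w, e in zip(weights, experts)]
new_weights = [u / sum(unnormalized) for u in unnormalized]

# Weighted arithmetic mean of the experts' updated probabilities.
pooled_after = sum(w * posterior(*e) for w, e in zip(new_weights, experts))

# Direct Bayesian update of the equally weighted mixture on E.
joint_x_and_e = sum(w * e[0] * e[1] for w, e in zip(weights, experts))
direct_update = joint_x_and_e / sum(unnormalized)

print(new_weights, pooled_after, direct_update)
```

The two routes agree, and expert 1 ends up with twice the weight of expert 2. If the weights were instead left at 0.5 each after the evidence, the arithmetic pool would no longer match the direct update.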

I agree that the way I presented it I framed the extreme expert as more knowledgeable. I did this for illustrative purposes. But I believe the setting works just as well when we take both experts to be equally knowledgeable / calibrated.

I suppose one reason not to see this as unfairly biased towards mean log odds is if you generally expect experts who give more extreme probabilities to actually be more knowledgeable in practice. I gave an example in my post illustrating why this isn't always true, but a couple commenters on my post gave models for why it's true under some assumptions, and I suppose it's probably true in the data you've been using that's been empirically supporting mean log odds.

Throwing away the information from the extreme prediction seems bad.

I can see where you're coming from, but I have an intuition that the geometric mean still trusts the information from outlying extreme predictions too much. That made a possible compromise solution occur to me, which, to be clear, I'm not seriously endorsing.

I notice this is very surprising to me, because averaging log odds is anything but senseless.

I called it that because of its poor theoretical properties (I'm still not convinced they arise naturally in any circumstances), but in retrospect I don't really endorse this given the apparently good empirical performance of mean log odds.

log odds make Bayes rule additive, and I expect means to work well when the underlying objects are additive

My take on this is that multiplying odds ratios is indeed a natural operation that you should expect to be an appropriate thing to do in many circumstances, but that taking the nth root of an odds ratio is not a natural operation, and neither is taking geometric means of odds ratios, which combines both of those operations. On the other hand, while adding probabilities is not a natural operation, taking weighted averages of probabilities is.

My gut feeling is this argument relies on an adversarial setting where you might get exploited. And this probably means that you should come up with a probability range for the additional evidence your opponent might have.

So if you think their evidence is uniformly distributed over -1 and 1 bits, you should combine that with your evidence by adding that evidence to your logarithmic odds. This gives you a probability distribution over the possible values. Then use that spread to decide which bet odds are worth the risk of exploitation.

Right, but I was talking about doing that backwards. If you've already worked out the odds at which it's worth accepting bets in each direction, you can recover the probability that you must currently be assigning to the event in question. The arithmetic mean of the bounds on probabilities implied by the bets you'd accept is a rough approximation to this: if you would bet on X at odds implying any probability less than 2%, and you'd bet against X at odds implying any probability greater than 50%, then this is consistent with you currently assigning probability 26% to X, with a 50% chance that an adversary has evidence against X (in which case X has a 2% chance of being true), and a 50% chance that an adversary has evidence for X (in which case X has a 50% chance of being true).
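The arithmetic behind the 26% figure, next to what a geometric mean of odds would give for the same two bounds. This is only a sketch of the calculation, and the helper names are hypothetical.

```python
import math

def arithmetic_pool(lower, upper, weight_on_lower=0.5):
    """Weighted average of the two probability bounds."""
    return weight_on_lower * lower + (1 - weight_on_lower) * upper

def geometric_odds_pool(lower, upper):
    """Geometric mean of the two bounds' odds, converted back to a probability."""
    odds = math.sqrt((lower / (1 - lower)) * (upper / (1 - upper)))
    return odds / (1 + odds)

# Bounds from the example: bet on X below 2%, bet against X above 50%.
print(arithmetic_pool(0.02, 0.50))      # about 0.26
print(geometric_odds_pool(0.02, 0.50))  # noticeably lower than the arithmetic pool
```

Only the arithmetic pool corresponds to the story above of a 50% chance of facing each kind of adversary.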

I do not understand how this is about pooling different expert probabilities. But I might be misunderstanding your point.

It isn't. My post was about pooling multiple probabilities of the same event. One source of multiple probabilities of the same event is the beliefs of different experts, which your post focused on exclusively. But a different possible source of multiple probabilities of the same event is the bounds in each direction on the probability of some event implied by the betting behavior of a single expert.

I wrote a post arguing for the opposite thesis, and was pointed here. A few comments about your arguments that I didn't address in my post:

Regarding the empirical evidence supporting averaging log odds, note that averaging log odds will always give more extreme pooled probabilities than averaging probabilities does, and in the contexts in which this empirical evidence was collected, the experts were systematically underconfident, so that extremizing the results could make them better calibrated. This easily explains why average log odds outperformed average probabilities, and I don't expect optimally-extremized average log odds to outperform optimally-extremized average probabilities (or similarly, I don't expect unextremized average log odds to outperform average probabilities extremized just enough to give results as extreme as average log odds on average).
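To see the extremizing effect on a single case, the sketch below pools one made-up set of forecasts both ways. It illustrates the direction of the difference for these numbers only and is not a proof of the general claim.

```python
import math

def mean_probability(probs):
    """Pool by taking the arithmetic mean of the probabilities."""
    return sum(probs) / len(probs)

def mean_log_odds(probs):
    """Pool by averaging log odds (a geometric mean of odds),
    returned as a probability."""
    avg = sum(math.log(p / (1 - p)) for p in probs) / len(probs)
    return 1 / (1 + math.exp(-avg))

forecasts = [0.6, 0.9, 0.99]  # hypothetical expert forecasts
arithmetic = mean_probability(forecasts)
geometric = mean_log_odds(forecasts)
print(arithmetic, geometric)
```

Here the mean log odds lands further from 50% than the mean probability does, so if the experts were systematically underconfident, it would look better calibrated for that reason alone.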

External Bayesianity seems like an actively undesirable property for probability pooling methods that treat experts symmetrically. When new evidence comes in, this should change how credible each expert is if different experts assigned different probabilities to that evidence. Thus the experts should not all be treated symmetrically both before and after new evidence comes in. If you do this, you're throwing away the information that the evidence gives you about expert credibility, and if you throw away some of the evidence you receive, you should not expect your Bayesian updates to properly account for all the evidence you received. If you design some way of defining probabilities so that you somehow end up correctly updating on new evidence despite throwing away some of that evidence (as log odds averaging remarkably does), then, once you do adjust to account for the evidence that you were previously throwing away, you will no longer be correctly updating on new evidence (i.e. if you weight the experts differently depending on credibility, and update credibility in response to new evidence, then weighted averaging of log odds is no longer externally Bayesian, and weighted averaging of probabilities is if you do it right).

I talked about the argument that averaging probabilities ignores extreme predictions in my post, but the way you stated it, you added the extra twist that the expert giving more extreme predictions is known to be more knowledgeable than the expert giving less extreme predictions. If you know one expert is more knowledgeable, then of course you should not treat them symmetrically. As an argument for averaging log odds rather than averaging probabilities, this seems like cheating, by adding an extra assumption which supports extreme probabilities but isn't used by either pooling method, giving an advantage to pooling methods that produce extreme probabilities.

Weird, the link works for me now.

Thus, I present to you, the Buddhists in EA Facebook group.

Dead link. It says "Sorry, this content isn't available right now

The link you followed may have expired, or the page may only be visible to an audience you're not in."

My critique of analytic functionalism is that it is essentially nothing but an assertion of this vagueness.

That's no reason to believe that analytic functionalism is wrong, only that it is not sufficient by itself to answer very many interesting questions.

Without a bijective mapping between physical states/processes and computational states/processes, I think my point holds.

No, it doesn't. I only claim that most physical states/processes have only a very limited collection of computational states/processes that they can reasonably be interpreted as, not that every physical state/process has exactly one computational state/process that it can reasonably be interpreted as, and certainly not that every computational state/process has exactly one physical state/process that can reasonably be interpreted as it. Those are totally different things.

it feels as though you're pattern-matching me to IIT and channeling Scott Aaronson's critique of Tononi

Kind of. But to clarify, I wasn't trying to argue that there will be problems with the Symmetry Theory of Valence that derive from problems with IIT. And when I heard about IIT, I figured that there were probably trivial counterexamples to the claim that Phi measures consciousness and that perhaps I could come up with one if I thought about the formula enough, before Scott Aaronson wrote the blog post where he demonstrated this. So although I used that critique of IIT as an example, I was mainly going off of intuitions I had prior to it. I can see why this kind of very general criticism from someone who hasn't read the details could be frustrating, but I don't expect I'll look into it enough to say anything much more specific.

I mention all this because I think analytic functionalism- which is to say radical skepticism/eliminativism, the metaphysics of last resort- only looks as good as it does because nobody’s been building out any alternatives.

But people have tried developing alternatives to analytic functionalism.

That said, I do think theories like IIT are at least slightly useful insofar as they expand our vocabulary and provide additional metrics that we might care a little bit about.

If you expanded on this, I would be interested.

But her defense wasn't that she was just following journalistic norms, but rather that she was in fact following significantly stricter norms than that.

And why would sharing the screenshots in particular be significant? Writing a news story from an interview would typically include quotes from the interview, and quoting text carries the same information content as a screenshot of it.