In short: There are many methods to pool forecasts. The most commonly used is the arithmetic mean of probabilities. However, there are empirical and theoretical reasons to prefer the geometric mean of the odds instead. This is particularly important when some of the predictions have extreme values. Therefore, I recommend defaulting to the geometric mean of odds to aggregate probabilities.
Epistemic status: My impression is that geometric mean of odds is the preferred pooling method among most researchers who have looked into the topic. That being said, I only cite one study with direct empirical evidence supporting my recommendation.
One key piece of advice I would give to people keen on forming better opinions of the world is to pay attention to many experts, and to reason things through a few times using different assumptions. In the context of quantitative forecasting, this results in many predictions that together paint a more complete picture.
But how can we aggregate these different predictions? Ideally, we would like a simple heuristic that pools the many experts and models we have considered and produces an aggregate prediction [1].
There are many possible choices for such a heuristic. A common one is to take the arithmetic mean of the individual probabilities:

p_pooled = (p_1 + p_2 + … + p_N) / N
We see an example of this approach in this article from Luisa Rodriguez, where it is used to aggregate some predictions about the chances of a nuclear war in a year.
A different heuristic, which I will argue in favor of, is to take the geometric mean of the odds:

o_pooled = (o_1 × o_2 × … × o_N)^(1/N)
Whereas the arithmetic mean adds the values together and divides by the number of values, the geometric mean multiplies all the values and then takes the N-th root of the product (where N = number of values).
And the odds equal the probability of an event divided by its complement, o = p / (1 − p) [2].
For example, in Rodriguez's article we have four predictions from different sources [3]:
| Probability | Odds |
|---|---|
| 1.40% | 1:70 |
| 2.21% | 1:44 |
| 0.39% | 1:255 |
| 0.40% | 1:249 |
Rodriguez takes as an aggregate the arithmetic mean of the probabilities, (1.40% + 2.21% + 0.39% + 0.40%) / 4 = 1.10%, which corresponds to pooled odds of about 1:90.
If we take the geometric mean of the odds instead, we end up with pooled odds of about 1:118, which corresponds to a pooled probability of about 0.84%.
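These calculations are easy to reproduce in code. Here is a minimal Python sketch (the helper names are my own), using the four predictions from the table above:

```python
import math

def prob_to_odds(p):
    """Odds of an event: o = p / (1 - p)."""
    return p / (1 - p)

def odds_to_prob(o):
    """Probability implied by odds: p = o / (1 + o)."""
    return o / (1 + o)

def arithmetic_mean_probs(probs):
    """Pool by averaging the probabilities directly."""
    return sum(probs) / len(probs)

def geometric_mean_odds(probs):
    """Pool by taking the N-th root of the product of the odds."""
    odds = [prob_to_odds(p) for p in probs]
    return odds_to_prob(math.prod(odds) ** (1 / len(odds)))

# The four nuclear-war predictions from the table above
probs = [0.0140, 0.0221, 0.0039, 0.0040]
print(arithmetic_mean_probs(probs))  # ≈ 0.011  (1.10%, odds ≈ 1:90)
print(geometric_mean_odds(probs))    # ≈ 0.0084 (0.84%, odds ≈ 1:118)
```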
In the remainder of the article I will argue that the geometric mean of odds is both empirically more accurate and has some compelling theoretical properties. In practice, I believe we should largely prefer to aggregate probabilities using the geometric mean of odds.
The (extremized) geometric mean of odds empirically results in more accurate predictions
(Satopää et al, 2014) empirically explores different aggregation methods, including the average probability, the median probability, and the geometric mean of the odds, as well as some more complex methods of aggregating forecasts, like an extremized version of the geometric mean of the odds and the beta-transformed linear opinion pool. They aggregate responses from 1,300 forecasters over 69 questions on geopolitics [4].
In summary, they find that the extremized geometric mean of odds performs best in terms of the Brier score of its predictions. The non-extremized geometric mean of odds robustly outperforms the arithmetic mean of probabilities and the median, though it performs worse than some of the more complex methods.
We haven't quite explained what extremizing is; it involves raising the pooled odds to a power d [5]:

o_extremized = o_pooled^d = (o_1 × o_2 × … × o_N)^(d/N)
In their dataset, the extremizing parameter that attains the best Brier score falls between and . As a handy heuristic, when extremizing I suggest using a power of in practice. My intuition is that extremizing makes most sense when aggregating data from underconfident experts, and I would not use it when aggregating personal predictions derived from different approaches. This is because extremizing is meant to be a correction for forecaster underconfidence [6]. That being said, it is unclear to me when extremizing helps (eg see Simon_M's comment for an example where extremizing does not help improve the aggregate predictions).
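For concreteness, extremizing can be sketched in a few lines of Python (the helper below and the example values of d are illustrative, not taken from the paper):

```python
import math

def extremized_geo_mean_odds(probs, d=1.0):
    """Geometric mean of the odds, raised to the extremizing power d.

    d = 1 recovers the plain geometric mean of odds;
    d > 1 pushes the pooled probability away from 50%."""
    odds = [p / (1 - p) for p in probs]
    pooled = math.prod(odds) ** (1 / len(odds))
    extremized = pooled ** d
    return extremized / (1 + extremized)

print(extremized_geo_mean_odds([0.6, 0.7, 0.8]))       # ≈ 0.71 (no extremizing)
print(extremized_geo_mean_odds([0.6, 0.7, 0.8], d=2))  # ≈ 0.85 (pushed away from 50%)
```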
What do other experiments comparing different pooling methods find? (Seaver, 1978) performs an experiment where the performance of the arithmetic mean of probabilities and the geometric mean of odds is similar (he studies 11 groups of 4 people each, on 100 general trivia questions). However, Seaver studies questions where the individual probabilities are in a range between 5% and 95%, where the difference between the two methods is small.
In my superficial exploration of the literature I haven't been able to find many more empirical studies (EDIT: see Simon_M's comment here for a comparison of pooling methods on Metaculus questions). There are plenty of simulation studies - for example (Allard et al, 2012) find better performance of the geometric mean of odds in simulations.
The geometric mean of odds satisfies external Bayesianity
(Allard et al, 2012) explore the theoretical properties of several aggregation methods, including the geometric average of odds.
They speak favorably of the geometric mean of odds, mainly because it is the only pooling method that satisfies external Bayesianity [7]. This result was proved earlier in (Genest, 1984).
External Bayesianity means that if the experts all agree on the strength of the Bayesian updates of each available piece of evidence, it does not matter whether they aggregate their posteriors, or if they aggregate their priors first then apply the updates - the result is the same.
External Bayesianity is compelling because it means that, from the outside, the group of experts behaves like a single Bayesian agent - it has a consistent set of priors that are updated according to Bayes' rule.
For more discussion on external Bayesianity, see (Madansky, 1964).
While suggestive, I consider external Bayesianity a weaker argument than the empirical study of (Satopää et al, 2014). This is because the arithmetic mean of probabilities also has some good theoretical properties of its own, and it is unclear which properties are most important. I do however believe that external Bayesianity is more compelling than the other properties I have seen discussed in the literature [8].
The arithmetic mean of probabilities ignores information from extreme predictions
The arithmetic mean of probabilities ignores extreme predictions in favor of tamer results, to the extent that even large changes to individual predictions will barely be reflected in the aggregate prediction.
As an illustrative example, consider an outsider expert and an insider expert who are asked to predict an event. The outsider is reasonably uncertain about the event, and assigns it a probability of around 10%. The insider has privileged information about the event, and assigns it a very low probability.
Ideally, we would like the aggregate probability to be reasonably sensitive to the strength of the evidence provided by the insider expert - the aggregate should be meaningfully different when the insider assigns a probability of 1 in 1,000 than when they assign a probability of 1 in 10,000 [9].
The arithmetic mean of probabilities does not achieve this - in both cases the pooled probability is around 5%. The uncertain prediction has effectively overwritten the information in the more precise prediction.
The geometric mean of odds works better in this situation. We have that √((1/9) × (1/999)) ≈ 1:95, while √((1/9) × (1/9999)) ≈ 1:300. Those correspond respectively to probabilities of 1.04% and 0.33% - showing the greater sensitivity to the evidence the insider brings to the table.
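A quick Python sketch reproducing the insider/outsider numbers (helper names are mine):

```python
import math

def pool_arith(probs):
    """Arithmetic mean of the probabilities."""
    return sum(probs) / len(probs)

def pool_geo_odds(probs):
    """Probability implied by the geometric mean of the odds."""
    odds = [p / (1 - p) for p in probs]
    pooled = math.prod(odds) ** (1 / len(odds))
    return pooled / (1 + pooled)

outsider = 0.10
for insider in (1 / 1000, 1 / 10_000):
    # The arithmetic mean barely moves (~5% both times);
    # the geometric mean of odds drops from ~1.04% to ~0.33%
    print(pool_arith([outsider, insider]), pool_geo_odds([outsider, insider]))
```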
See (Baron et al, 2014) for more discussion on the distortive effects of the arithmetic mean of probabilities and other aggregates.
Do the differences between geometric mean of odds and arithmetic mean of probabilities matter in practice?
Not often, but often enough. For example, we already saw that the geometric mean of odds outperforms all other simple methods in (Satopää et al, 2014), yet they perform similarly in (Seaver, 1978).
Indeed, the difference between one method and another in particular examples may be small. Case in point: in the nuclear war example, the difference between the geometric mean of odds and the arithmetic mean of probabilities was less than 3 in 1,000.
This is often the case. If the individual probabilities are in the 10% to 90% range, then the absolute difference in aggregated probabilities between these two methods will typically fall in the 0% to 3% range.
Though, even if the difference in probabilities is not large, the difference in odds might still be significant. In the nuclear war example above there was a factor of 1.3 between the odds implied by both methods. Depending on the application this might be important [10].
Furthermore, the choice of aggregation method starts making more of a difference as your probabilities become more extreme: if the individual probabilities are within the range 0.7% to 99.2%, then the difference will typically fall between 0% and 18% [11].
Conclusion
If you face a situation where you have to pool together some predictions, use the geometric mean of odds. Compared to the arithmetic mean of probabilities, the geometric mean of odds is similarly complex*, one empirical study and many simulation studies found that it results in more accurate predictions, and it satisfies some appealing theoretical properties, like external Bayesianity and not overweighting uncertain predictions.
* Provided you are willing to work with odds instead of probabilities, which you might not be comfortable with.
Acknowledgements
Ben Snodin helped me with detailed discussion and feedback, which helped me clarify my intuitions about the topic while we discussed some examples. Without his help I would have written a much poorer article.
I previously wrote about this topic on LessWrong, where UnexpectedValues and Nuño Sempere helped me clarify a few details I had wrong.
Spencer Greenberg wrote a Facebook post about aggregating forecasts that spurred discussion on the topic. I am particularly grateful to Spencer Greenberg, Gustavo Lacerda and Guy Srinivasan for their thoughts.
And thank you Luisa Rodriguez for the excellent post on nuclear war. Sorry for picking on your aggregation in the example!
Footnotes
[1] We will focus on predictions of binary events, summarized as a single number p, the probability of the event. Predictions for multiple-outcome events and continuous distributions fall outside the scope of this post, though equivalent concepts to the geometric mean of odds exist for those cases. For example, for multiple-outcome events we may use the geometric mean of the vector of odds, and for continuous distributions we may use the normalized geometric mean of the density functions, as in (Genest, 1984).
[2] There are many equivalent formulations of geometric mean odds pooling in terms of the probabilities p_1, …, p_N.
One possibility is:

p_pooled = (p_1 × … × p_N)^(1/N) / [ (p_1 × … × p_N)^(1/N) + ((1 − p_1) × … × (1 − p_N))^(1/N) ]
That is, the pooled probability equals the geometric mean of the probabilities divided by the sum of the geometric mean of the probabilities and the geometric mean of the complementary probabilities.
Another possible expression for the resulting pooled probability is:

p_pooled = 1 / (1 + [ (1 − p_1)/p_1 × … × (1 − p_N)/p_N ]^(1/N))
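A quick numerical sanity check, in Python, that the two expressions in this footnote agree with the direct definition (the example probabilities are the four from the nuclear war table):

```python
import math

probs = [0.014, 0.0221, 0.0039, 0.0040]
n = len(probs)

# Expression 1: geo mean of the p's over (geo mean of p's + geo mean of (1 - p)'s)
gp = math.prod(probs) ** (1 / n)
gq = math.prod(1 - p for p in probs) ** (1 / n)
expr1 = gp / (gp + gq)

# Expression 2: 1 over (1 + N-th root of the product of the inverse odds)
expr2 = 1 / (1 + math.prod((1 - p) / p for p in probs) ** (1 / n))

# Direct definition: probability implied by the geometric mean of the odds
pooled_odds = math.prod(p / (1 - p) for p in probs) ** (1 / n)
direct = pooled_odds / (1 + pooled_odds)

print(expr1, expr2, direct)  # all three agree
```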
[3] I used this calculator to compute nice approximations of the odds.
[4] (Satopää et al, 2014) also study simulations in their paper. The results of their simulations are similar to their empirical results.
[5] The method is hardly innovative - many others have proposed similar corrections to pooled aggregates, with similar proposals appearing as far back as (Karmarkar, 1978).
[6] Why use extremizing in the first place?
(Satopää et al, 2014) derive this correction from assuming that the predictions of the experts are individually underconfident, and need to be pushed towards an extreme. (Baron et al, 2014) derive the same correction from a toy scenario in which each forecaster regresses their forecast towards uncertainty, by assuming that calibrated forecasts tend to be distributed around 0.
Despite the wide usage of extremizing, I haven't yet read a theoretical justification for extremizing that fully convinces me. It does seem to get better results in practice, but there is a risk this is just overfitting from the choice of the extremizing parameter.
Because of this, I am more hesitant to outright recommend extremizing.
[7] Technically, any weighted geometric average of the odds satisfies external Bayesianity. Concretely, the family of aggregation methods given by the formula

o_pooled = o_1^(w_1) × o_2^(w_2) × … × o_N^(w_N)

where w_i ≥ 0 and w_1 + … + w_N = 1, covers all externally Bayesian methods. Among them, the only one that does not privilege any of the experts is of course the traditional geometric average of odds, where w_i = 1/N for every i.
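A minimal Python sketch of the weighted pool (the function name and example numbers are illustrative):

```python
import math

def weighted_geo_mean_odds(probs, weights):
    """Weighted geometric pool: o_pooled = prod(o_i ** w_i), with the w_i summing to 1."""
    assert abs(sum(weights) - 1) < 1e-9
    pooled = math.prod((p / (1 - p)) ** w for p, w in zip(probs, weights))
    return pooled / (1 + pooled)

# Equal weights recover the ordinary geometric mean of odds (≈ 1.76% here)
print(weighted_geo_mean_odds([0.014, 0.0221], [0.5, 0.5]))
# Tilting the weights towards the first expert pulls the pool towards 1.4%
print(weighted_geo_mean_odds([0.014, 0.0221], [0.9, 0.1]))
```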
[8] The most discussed property of the arithmetic mean of probabilities is marginalization. We say that a pooling method respects marginalization if the marginal distribution of the pooled probabilities equals the pooled distribution of the marginals.
There is some discussion on marginalization in (Lindley, 1983), where the author argues that it is a flawed concept. More discussion and a summary of Lindley's results can be found here.
[9] There are some conceivable scenarios where we might not want this behavior. For example, if we are risk averse in a way such that we prefer to defer to the most uncertain experts, or if we expect the predictions to be noisy and thus would like to avoid outliers. But largely I think those are uncommon and somewhat contrived situations.
[10] The difference between the methods is probably not significant when using the aggregate in a cost-benefit analysis, since expected value depends linearly on the probability which does not change much. But it is probably significant when using the aggregate as a base-rate for further analysis, since the posterior odds depend linearly on the prior odds, which change moderately.
[11] To compute the ranges I took 100,000 samples of 10 probabilities whose log-odd expression was normally distributed and reported the 5% and 95% quantiles for both the individual probabilities sampled and the difference between the pooled probabilities implied by both methods on each sample. Here is the code I used to compute these results.
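A sketch of that sampling procedure in Python - note that the mean and standard deviation of the log-odds and the sample counts below are illustrative stand-ins, so the quantiles will not exactly reproduce the figures quoted above:

```python
import math
import random

random.seed(0)

def quantile_gap(n_samples=10_000, n_preds=10, mu=0.0, sigma=2.0):
    """Sample sets of predictions whose log-odds are normally distributed and
    report the 5% and 95% quantiles of |arith mean of probs - geo mean of odds|."""
    diffs = []
    for _ in range(n_samples):
        log_odds = [random.gauss(mu, sigma) for _ in range(n_preds)]
        probs = [1 / (1 + math.exp(-lo)) for lo in log_odds]
        arith = sum(probs) / len(probs)
        # The geometric mean of odds is the sigmoid of the mean log-odds
        geo = 1 / (1 + math.exp(-sum(log_odds) / len(log_odds)))
        diffs.append(abs(arith - geo))
    diffs.sort()
    return diffs[int(0.05 * n_samples)], diffs[int(0.95 * n_samples)]

print(quantile_gap())  # (5% quantile, 95% quantile) of the absolute difference
```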
Bibliography
Allard, D., A. Comunian, and P. Renard. 2012. ‘Probability Aggregation Methods in Geoscience’. Mathematical Geosciences 44 (5): 545–81. https://doi.org/10.1007/s11004-012-9396-3.
Baron, Jonathan, Barb Mellers, Philip Tetlock, Eric Stone, and Lyle Ungar. 2014. ‘Two Reasons to Make Aggregated Probability Forecasts More Extreme’. Decision Analysis 11 (June): 133–45. https://doi.org/10.1287/deca.2014.0293.
Genest, Christian. 1984. ‘A Characterization Theorem for Externally Bayesian Groups’. The Annals of Statistics 12 (3): 1100–1105.
Karmarkar, Uday S. 1978. ‘Subjectively Weighted Utility: A Descriptive Extension of the Expected Utility Model’. Organizational Behavior and Human Performance 21 (1): 61–72. https://doi.org/10.1016/0030-5073(78)90039-9.
Lindley, Dennis. 1983. ‘Reconciliation of Probability Distributions’. Operations Research 31 (5): 866–80.
Madansky, Albert. 1964. ‘Externally Bayesian Groups’. RAND Corporation. https://www.rand.org/pubs/research_memoranda/RM4141.html.
Satopää, Ville A., Jonathan Baron, Dean P. Foster, Barbara A. Mellers, Philip E. Tetlock, and Lyle H. Ungar. 2014. ‘Combining Multiple Probability Predictions Using a Simple Logit Model’. International Journal of Forecasting 30 (2): 344–56. https://doi.org/10.1016/j.ijforecast.2013.09.009.
Seaver, David Arden. 1978. ‘Assessing Probability with Multiple Individuals: Group Interaction Versus Mathematical Aggregation.’ DECISIONS AND DESIGNS INC MCLEAN VA. https://apps.dtic.mil/sti/citations/ADA073363.
tl;dr The conclusions of this article hold up in an empirical test with Metaculus data
Looking at resolved binary Metaculus questions, I used 5 different methods to pool the community estimate.
Also looking at two different scoring rules (Brier and log), I find the following rankings (smaller is better in my table):
Another conclusion which follows from this is that weighting is much more important than how you aggregate your probabilities. Roughly speaking:
(I also did this analysis for both weighted[1] and unweighted odds)
(Analysis on ~850 questions; predictors per question: 34 / 51 / 78 / 122 / 188 at the 10th / 25th / 50th / 75th / 90th percentiles.)
[1] Metaculus weights its predictions by recency:
[2] This doesn't actually hold up more recently, where the Metaculus prediction has been underperforming.
META: Do you think you could edit this comment to include...
Thanks in advance!
Thanks for the analysis, Simon!
I think it would be valuable to repeat it specifically for questions where there is large variance across predictions, where the choice of aggregation method is especially relevant. Under these conditions, I suspect methods like the median or the geometric mean will do even better than methods like the mean, because the latter ignores information from extremely low predictions and overweights outliers.
Cool, that’s really useful to know. Can you also check how extremizing the odds with different parameters performs?
Thank you for the superb analysis!
This increases my confidence in the geo mean of the odds, and decreases my confidence in the extremization bit.
I find it very interesting that the extremized version was consistently below by a narrow margin. I wonder if this means that there is a subset of questions where it works well, and another where it underperforms.
One question / nitpick: what do you mean by geometric mean of the probabilities? If you just take the geometric mean of probabilities then you do not get a valid probability - the sum of the pooled ps and (1-p)s does not equal 1. You need to rescale them, at which point you end with the geometric mean of odds.
UnexpectedValues explains this better than me here.
I think it's actually that historically the Metaculus community was underconfident (see track record here before 2020 vs after 2020).
Extremizing fixes that underconfidence, but also the average predictor improving their ability also fixed that underconfidence.
Metaculus has a known bias towards questions resolving positive. Metaculus users have a known bias overestimating the probabilities of questions resolving positive (again - see the track record). Taking a geometric mean of the probabilities of the events happening will give a number between 0 and 1 (that is, a valid probability). It will be inconsistent with the estimate you'd get if you flipped the question; HOWEVER, Metaculus users also seem to be inconsistent in that way, so I thought it was a neat way to attempt to fix that bias. I should have made it more explicit, that's fair. Edit: Updated for clarity based on comments below.
What do you mean by this?
Oh I see!
It is very cool that this works.
One thing that confuses me - when you take the geometric mean of probabilities you end up with p_pooled + (1 − p)_pooled < 1. So the pooled probability gets slightly nudged towards 0 in comparison to what you would get with the geometric mean of odds. Doesn't that mean that it should be less accurate, given the bias towards questions resolving positively?
What am I missing?
I mean in the past people were underconfident (so extremizing would make their predictions better). Since then they've stopped being underconfident. My assumption is that this is because the average predictor is now more skilled, or because more predictors improve the quality of the average.
The bias isn't that more questions resolve positively than users expect. The bias is that users expect more questions to resolve positive than actually resolve positive. Shifting probabilities lower fixes this.
Basically lots of questions on Metaculus are "Will X happen?" where X is some interesting event people are talking about, but the base rate is perhaps low. People tend to overestimate the probability of X relative to what actually occurs.
I don't get what the difference between these is.
"more questions resolve positively than users expect"
Users expect 50 to resolve positively, but actually 60 resolve positive.
"users expect more questions to resolve positive than actually resolve positive"
Users expect 50 to resolve positive, but actually 40 resolve positive.
I have now edited the original comment to be clearer?
Cheers
Gotcha!
Oh I see!
(I note these scores are very different than in the first table; I assume these were meant to be the Brier scores instead?)
Yes - copy and paste fail - now corrected
I was curious about why the extremized geo mean of odds didn't seem to beat other methods. Eric Neyman suggested trying a smaller extremization factor, so I did that.
I tried an extremizing factor of 1.5, and reused your script to score the performance on recent binary questions. The result is that the extremized prediction comes on top.
This has restored my faith in extremization. In hindsight, recommending a fixed extremization factor was silly, since the correct extremization factor is going to depend on the predictors being aggregated and the topics they are talking about.
Going forward I would recommend people who want to apply extremization to study what extremization factors would have made sense in past questions from the same community.
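A sketch of what such a study could look like in Python - the search grid and the (pooled probability, outcome) history below are made up for illustration:

```python
import math

def extremize(p, d):
    """Raise the odds implied by p to the power d, return the implied probability."""
    o = (p / (1 - p)) ** d
    return o / (1 + o)

def brier(p, outcome):
    return (outcome - p) ** 2

def best_factor(history, grid=None):
    """Grid-search the extremizing factor minimizing the mean Brier score
    on (pooled probability, outcome) pairs from past resolved questions."""
    grid = grid or [1 + 0.1 * i for i in range(21)]  # 1.0 .. 3.0
    return min(grid, key=lambda d: sum(brier(extremize(p, d), o) for p, o in history))

# Made-up history where the pooled forecasts were underconfident
# (outcomes were more extreme than the forecasts), so a factor above 1 wins
history = [(0.7, 1), (0.65, 1), (0.3, 0), (0.8, 1), (0.25, 0), (0.6, 1)]
print(best_factor(history))
```

Note that this bakes in the caveat discussed below: on any imperfectly calibrated history some factor will look good in hindsight, which is not the same as it helping going forward.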
I talk more about this in my new post.
I think this is the wrong way to look at this.
Metaculus was way underconfident originally. (Prior to 2020, 22% using their metric). Recently it has been much better calibrated - (2020- now, 4% using their metric).
Of course if they are underconfident then extremizing will improve the forecast, but the question is what is most predictive going forward. Given that before 2020 they were 22% underconfident, more recently 4% underconfident, it seems foolhardy to expect them to be underconfident going forward.
I would NOT advocate extremizing the Metaculus community prediction going forward.
More than this, you will ALWAYS be able to find an extremizing parameter which improves the forecasts, unless they are perfectly calibrated. This will give you better predictions in hindsight but not better predictions going forward. If you have a reason to expect forecasts to be underconfident, by all means extremize them, but I think that's a strong claim which requires strong evidence.
I get what you are saying, and I also harbor doubts about whether extremization is just pure hindsight bias or if there is something else to it.
Overall I still think it's probably justified in cases like Metaculus to extremize based on the extremization factor that would have optimized the last 100 resolved questions, and I would expect the extremized geo mean with such a factor to outperform the unextremized geo mean in the next 100 binary questions to resolve (if pressed to put a number on it, maybe ~70% confidence without thinking too much).
My reasoning here is something like:
So overall I am not super convinced, and a big part of my argument is an appeal to authority.
Also, it seems to be the case that extremization by 1.5 also works when looking at the last 330 questions.
I'd be curious about your thoughts here. Do you think that a 1.5-extremized geo mean will outperform the unextremized geo mean in the next 100 questions? What if we choose a finetuned extremization factor that would optimize the last 100?
Looking at the rolling performance of your method (optimize on last 100 and use that to predict), median and geo mean odds, I find they have been ~indistinguishable over the last ~200 questions. If I look at the exact numbers, extremized_last_100 does win marginally, but looking at that chart I'd have a hard time saying "there's a 70% chance it wins over the next 100 questions". If you're interested in betting at 70% odds I'd be interested.
No offense, but the academic literature can do one.
Again, I don't find this very persuasive, given what I already knew about the history of Metaculus' underconfidence.
I think extremizing might make sense if the other forecasts aren't public. (Since then the forecasts might be slightly more independent). When the other forecasts are public, I think extremizing makes less sense. This goes doubly so when the forecasts are coming from a betting market.
I find this the most persuasive. I think it ultimately depends on how you think people adjust for their past calibration. It's taken the community ~5 years to reduce its underconfidence, so maybe it'll take another 5 years. If people immediately update, I would expect this to be very unpredictable.
Thanks for writing this up; I agree with your conclusions.
There's a neat one-to-one correspondence between proper scoring rules and probabilistic opinion pooling methods satisfying certain axioms, and this correspondence maps Brier's quadratic scoring rule to arithmetic pooling (averaging probabilities) and the log scoring rule to logarithmic pooling (geometric mean of odds). I'll illustrate the correspondence with an example.
Let's say you have two experts: one says 10% and one says 50%. You see these predictions and need to come up with your own prediction, and you'll be scored using the Brier loss: (1 - x)^2, where x is the probability you assign to whichever outcome ends up happening (you want to minimize this). Suppose you know nothing about pooling; one really basic thing you can do is to pick an expert to trust at random: report 10% with probability 1/2 and 50% with probability 1/2. Your expected Brier loss in the case of YES is (0.81 + 0.25)/2 = 0.53, and your expected loss in the case of NO is (0.01 + 0.25)/2 = 0.13.
But you can do better. Suppose you say 35% -- then your loss is 0.4225 in the case of YES and 0.1225 in the case of NO -- better in both cases! So you might ask: what is the strategy that gives me the largest possible guaranteed improvement over choosing a random expert? The answer is linear pooling (averaging the experts). This gets you 0.49 in the case of YES and 0.09 in the case of NO (an improvement of 0.04 in each case).
Now suppose you were instead being scored with a log loss -- so your loss is -ln(x), where x is the probability you assign to whichever outcome ends up happening. Your expected log loss in the case of YES is (-ln(0.1) - ln(0.5))/2 ~ 1.498, and in the case of NO is (-ln(0.9) - ln(0.5))/2 ~ 0.399.
Again you can ask: what is the strategy that gives you the largest possible guaranteed improvement over this "choose a random expert" strategy? This time, the answer is logarithmic pooling (taking the geometric mean of the odds). This is 25%, which has a loss of 1.386 in the case of YES and 0.288 in the case of NO, an improvement of about 0.111 in each case.
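These numbers are straightforward to verify in Python:

```python
import math

p1, p2 = 0.10, 0.50  # the two experts

# Brier: "pick a random expert" vs the linear pool (the average, 30%)
rand_yes = ((1 - p1) ** 2 + (1 - p2) ** 2) / 2   # 0.53
rand_no = (p1 ** 2 + p2 ** 2) / 2                # 0.13
lin = (p1 + p2) / 2
print(rand_yes - (1 - lin) ** 2, rand_no - lin ** 2)  # ≈ 0.04 improvement either way

# Log: "pick a random expert" vs the logarithmic pool (geo mean of odds, 25%)
rand_yes = (-math.log(p1) - math.log(p2)) / 2         # ≈ 1.498
rand_no = (-math.log(1 - p1) - math.log(1 - p2)) / 2  # ≈ 0.399
o = math.sqrt((p1 / (1 - p1)) * (p2 / (1 - p2)))
log_pool = o / (1 + o)                                # 0.25
print(rand_yes + math.log(log_pool), rand_no + math.log(1 - log_pool))  # ≈ 0.111 either way
```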
(This works just as well with weights: say you trust one expert more than the other. You could choose an expert at random in proportion to these weights; the strategy that guarantees the largest improvement over this is to take the weighted pool of the experts' probabilities.)
This generalizes to other scoring rules as well. I co-wrote a paper about this, which you can find here, or here's a talk if you prefer.
What's the moral here? I wouldn't say that it's "use arithmetic pooling if you're being scored with the Brier score and logarithmic pooling if you're being scored with the log score"; as Simon's data somewhat convincingly demonstrated (and as I think I would have predicted), logarithmic pooling works better regardless of the scoring rule.
Instead I would say: the same judgments that would influence your decision about which scoring rule to use should also influence your decision about which pooling method to use. The log scoring rule is useful for distinguishing between extreme probabilities; it treats 0.01% as substantially different from 1%. Logarithmic pooling does the same thing: the pool of 1% and 50% is about 10%, and the pool of 0.01% and 50% is about 1%. By contrast, if you don't care about the difference between 0.01% and 1% ("they both round to zero"), perhaps you should use the quadratic scoring rule; and if you're already not taking distinctions between low and extremely low probabilities seriously, you might as well use linear pooling.
I want to add a little explainer here on how to actually calculate the geometric mean of odds. At least I'm pretty sure how this works - please correct my math if I am not right!
Say you have four forecasts given in probabilities: 10%, 30%, 40%, and 90%.
First you must convert each probability to odds using o = p / (1 − p):

O1 = 0.1 / (1 − 0.1) = 0.111111111
O2 = 0.3 / (1 − 0.3) = 0.428571429
O3 = 0.4 / (1 − 0.4) = 0.666666667
O4 = 0.9 / (1 − 0.9) = 9
Now that you have odds, use the geometric mean. The geometric mean is the nth root of the product of n numbers.
geomean(O1, O2, O3, O4) = (O1 × O2 × O3 × O4)^(1/4)
= (0.111111111 × 0.428571429 × 0.666666667 × 9)^(1/4)
= 0.285714286^(1/4)
= 0.731110446
Now, if you're like me, it is easier to think with probabilities instead of odds, so you will want to transform it back. This is done using
p = o/(o+1)
p = 0.731110446 / (0.731110446 + 1) = ~42%
Note that this result (~42%) is different from the geometric mean of probabilities (~32%) and different from the mean of probabilities (~43%).
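The same walkthrough, condensed into a few lines of Python:

```python
import math

probs = [0.1, 0.3, 0.4, 0.9]

odds = [p / (1 - p) for p in probs]               # step 1: probabilities -> odds
pooled_odds = math.prod(odds) ** (1 / len(odds))  # step 2: geometric mean of the odds
pooled_prob = pooled_odds / (pooled_odds + 1)     # step 3: odds -> probability

print(pooled_prob)               # ≈ 0.42
print(math.prod(probs) ** 0.25)  # geometric mean of the probabilities, ≈ 0.32
print(sum(probs) / len(probs))   # plain mean of the probabilities, ≈ 0.43
```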
Interesting! Seems intuitively right.
I wonder though: how would this affect expected value calculations? Doesn't this have far-reaching consequences?
One thing I have always wondered about is how to aggregate predicted values that differ by orders of magnitude. E.g. person A's best guess is that the value of x will be 10, person B's guess is that it will be 10,000. Saying that the expected value of x is ~5,000 seems to lose a lot of information. For simple monetary betting, this seems fine. For complicated decision-making, I'm less sure.
Let's work this example through together! (but I will change the quantities to 10 and 20 for numerical stability reasons)
One thing we need to be careful with is not mixing the implied beliefs with the object level claims.
In this case, person A's claim that the value is m_A = 10 is more accurately a claim that the beliefs of person A can be summed up as some distribution over the positive numbers, e.g. a log-normal with parameters μ_A = log m_A and σ_A. So the density of A's belief distribution is

f_A(x) = 1 / (x σ_A √(2π)) × exp[−(ln x − μ_A)² / (2σ_A²)]

(and similarly for person B, with m_B = 20). The scale parameters σ_A, σ_B intuitively represent the uncertainty of person A and person B.
Taking σ_A = σ_B = 0.1, these densities look like:
Note that the mean of these distributions is slightly displaced upwards from the median exp(μ). Concretely, the mean is computed as exp[μ + σ²/2], and equals 10.05 and 20.10 for person A and person B respectively.
To aggregate the distributions, we can use the generalization of the geometric mean of odds referred to in footnote [1] of the post.
According to that, the aggregated distribution has density f = √(f_A × f_B) / ∫ √(f_A × f_B) dx.
The plot of the aggregated density looks like:
I actually notice that I am very surprised about this - I expected the aggregate distribution to be bimodal, but here it seems to have a single peak.
For this particular example, a numerical approximation of the expected value comes out to around 14.21 - which matches the geometric mean of the two means.
I am not taking away any solid conclusions from this exercise - I notice I am still very confused about what the aggregated distribution looks like, and I encountered serious numerical stability issues when changing the parameters, which make me suspect a bug.
Maybe a Monte Carlo approach for estimating the expected value would solve the stability issues - I'll see if I can get around to that at some point.
Meanwhile, here is my code for the results above.
EDIT: Diego Chicharro has pointed out to me that the expected value can be easily computed analytically in Mathematica.
The resulting expected value of the aggregated distribution is exp[(μ_A σ_B² + μ_B σ_A² + σ_A² σ_B²) / (σ_A² + σ_B²)].
In the case where σ_A² = σ_B² = σ², the expected value is exp[(μ_A σ² + μ_B σ² + σ⁴) / (2σ²)] = exp[(μ_A + μ_B)/2 + σ²/2] = √(exp[μ_A + σ²/2]) × √(exp[μ_B + σ²/2]), which is exactly the geometric mean of the expected values of the individual predictions.
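A quick numerical check of the equal-σ case in Python (the integration grid over [5, 30] is an assumption chosen to cover essentially all of the probability mass for these parameters):

```python
import math

mu_a, mu_b = math.log(10), math.log(20)
sigma = 0.1

def lognorm_pdf(x, mu, s):
    """Density of a log-normal with parameters mu, s."""
    return math.exp(-(math.log(x) - mu) ** 2 / (2 * s * s)) / (x * s * math.sqrt(2 * math.pi))

# Integrate f ∝ sqrt(f_A * f_B) on a grid over [5, 30], then take its mean
step = 0.001
xs = [5 + step * i for i in range(25_000)]
f = [math.sqrt(lognorm_pdf(x, mu_a, sigma) * lognorm_pdf(x, mu_b, sigma)) for x in xs]
z = sum(f) * step  # normalizing constant of the unnormalized pool
mean_numeric = sum(x * fi for x, fi in zip(xs, f)) * step / z

mean_analytic = math.exp((mu_a + mu_b) / 2 + sigma ** 2 / 2)  # ≈ 14.21
print(mean_numeric, mean_analytic)
```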
Thanks, Jaime!