One’s Future Behavior as a Domain of Calibration

by markus_over · 10 min read · 31st Dec 2020 · 6 comments

13

Forecasting · Personal Development · Rationality
Personal Blog

Summary: This post focuses on one particular domain of forecasting, which is one’s own behavior. I personally find it very useful to make frequent (e.g. weekly) predictions on how likely it is for me to get a list of planned tasks done in time. Achieving calibration this way allows for more accurate planning and reliability. It can be helpful in the planning process by making obvious which tasks/goals require further refinements due to a low probability of getting done. Consider this post a nudge to explore this direction of “self calibration” to figure out whether the proposed approach is valuable for you.

On Calibration and Resolution

(feel free to skip this section if you’re familiar with the terms, e.g. due to having read Superforecasting)

Calibration describes the alignment of a person’s predictions with reality in any given domain. If a person is well calibrated (in a domain), then of all their predictions (in that domain) stating something is, say, 60% likely to happen, about 60% actually come true. This alignment between predictions and reality makes their predictions actually useful for decision making. People who aren’t calibrated (in a domain) lack this reliable correlation between forecasts and reality: their predictions are systematically off, thus less reliable, and may consequently lead to suboptimal decisions more often than necessary.

Another metric to measure the performance of a forecaster is what Philip Tetlock calls resolution, which could be described as the boldness of one’s predictions: High resolution means your predictions are close to 0 or 100%, whereas low resolution means many of your predictions are closer to 50%.

An ideal forecaster would have great calibration as well as great resolution. Both can generally be improved: the former through calibration training, the latter e.g. through gaining expertise or gathering more information. Calibration can, in principle, get very close to perfection. This is probably not true of resolution (unless one is dishonest or extremely overconfident), due to the inherent uncertainty of the future.

Calibration training describes the process of deliberately making (and afterwards evaluating) many predictions within a particular domain, calibrating oneself in the process. While making well calibrated predictions may involve skills such as finding good reference classes and applying Bayesian updating, one highly relevant subskill seems to be mapping one’s feeling of certainty about something onto a concrete (numeric) probability. This is especially true for the domain discussed in this post. Improvements in calibration come relatively quickly: mere dozens of evaluated predictions can be enough to make measurable progress.

Forecasting Domains

The previous section relied heavily on the term “domain”. The concept of forecasting domains is important, because neither calibration nor resolution transfers well between domains. For resolution this is relatively obvious: Expertise in one area – say US politics – does not grant you expertise in others, such as start-up performance, climate science or replications of studies in social science. With respect to calibration it seems somewhat more plausible that e.g. getting rid of one’s general overconfidence may yield benefits in many domains rather than being limited to a single one. This only gets you so far, however, and calibration too needs to be trained separately in different domains to get the best possible results[1].

Most “classic” forecasting domains are external ones about things that happen in the outside world, which usually are not directly affected by the forecaster. This post however is about a very particular domain: your personal future behavior. This domain differs from typical domains in several ways:

  1. The forecaster determines both the predictions and their outcome; there is thus some risk of the predictions affecting the outcome, which may be undesired (unless they affect the outcome in a positive way, in which case the whole ordeal may be epistemically questionable but still instrumentally useful)
  2. Evaluation is very easy, as the forecaster directly observes their own behavior, so no research of any kind is involved
  3. It’s easy to find a lot of predictable events and thus calibrate quite quickly
  4. It’s a very personal domain where in most cases it makes very little sense to involve other forecasters
  5. It’s more important than in other domains to actively generate new prediction statements (whereas in other domains one could in principle rely on pre-existent statements and merely generate probabilities[2])

The first point will be addressed in a later section, but points 2 and 3 draw a positive picture of this particular domain as one that allows relatively easy calibration training. I also argue that it’s not only relatively easy, but also exceptionally useful, for a number of reasons:

  • It’s useful for planning, as you can attain detailed insight into what you’re likely to do and how modifications to your plans impact the expected outcomes
  • It allows people to become more reliable in their communication towards others
  • It’s a very robust domain which will always be useful for a person, independent of any changes in interests or career

However, I rarely see this domain mentioned when people discuss forecasting. It therefore appears quite underrated to me.

Predicting One’s Future Behavior

In order to achieve calibration quickly and easily, one should find ways to

  1. accumulate many predictions quickly,
  2. that are easy to formulate,
  3. easy to evaluate,
  4. and have a short time horizon.

While I’ve described my personal system in detail in the comments below this post, I’ll try to keep the post itself more general. So, to people who would like to invest a bit of their time and energy into calibration training in this domain, I’d suggest the following:

  1. Pick a recurring time. This could be a day of the week, every morning, or the first of each month. At these times, spend a few minutes on calibration training: evaluate prior predictions and make new predictions about concrete aspects of your behavior within a certain timeframe.
    1. Making this a recurring thing helps keep calibration at a high level over time. While it would be convenient to do calibration training once and then simply “be calibrated”, I assume it pays off to measure calibration recurrently, especially given that it can be done in a matter of minutes.
    2. If you already have a system for goal setting or task planning, I suggest simply appending this whole process to it.
  2. Before actually making new predictions, evaluate the predictions from your last session
    1. For every individual prediction, evaluate whether it turned out true or false (or, in rare cases, indecisive, in which case it should be dropped from the evaluation)
    2. After this, I recommend two metrics to check:
      1. Comparing your expected value from last session’s predictions (i.e. sum of all probabilities) to the number of predictions evaluated to true, to figure out a general under- or overconfidence in that time frame (example[3])
      2. Looking at percentage ranges for all predictions from e.g. the last few months, to identify more fine-grained patterns, such as “I tend to be underconfident in the 80-89% range”
  3. Based on the findings of the evaluation, you then make new predictions for the time until the next session, and try to adjust your predictions depending on your findings
    1. I prefer to make almost purely System 1 based predictions, i.e. very quick intuitive ones, without spending too much time on them.
    2. Which concrete things you predict is of course up to you; what I do is that this whole process is part of me planning my weeks, and I assign probabilities to each task on whether they’ll be crossed off the list by the end of the week or not. Alternatively you could predict how well you stick to your habits, how much you’ll sleep (given you’re tracking that), whether you’ll leave the house etc. Basically anything related to your behavior that seems relevant enough that you’d like to gain accurate views on it.

There are many ways to implement such a system. You could probably do it with https://predictionbook.com/, Predict, https://www.predictit.org/, https://www.foretold.io/ or any other such service, although I’m admittedly not sure how well evaluating subsets of your predictions, as recommended in point 2b above, is supported in all of them, if at all. In theory you could even do most of this on paper, but that would naturally make the evaluation difficult. My personal approach is using spreadsheets for this, as that makes it super quick to add predictions (no mouse clicks required, no page reloads, simply a table with one prediction per row containing name/statement, probability and evaluation (true or false) in individual cells), while also allowing me to fully automate the evaluation parts to my liking.
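The evaluation in step 2b.1 can be sketched in a few lines of code. This is a hypothetical minimal version, assuming predictions are stored as simple (statement, probability, outcome) records rather than in the author’s actual spreadsheet; the task names are placeholders, and the probabilities are the ones from footnote 3:

```python
def evaluate_session(predictions):
    """predictions: list of (statement, probability, came_true) tuples.

    Returns the expected value (sum of all predicted probabilities)
    and the number of predictions that actually came true."""
    expected = sum(p for _, p, _ in predictions)
    actual = sum(1 for _, _, came_true in predictions if came_true)
    return expected, actual

# Example week with four tasks (probabilities as in footnote 3)
week = [
    ("task 1", 0.95, True),
    ("task 2", 0.40, False),
    ("task 3", 0.86, False),
    ("task 4", 0.63, False),
]
expected, actual = evaluate_session(week)
print(f"expected {expected:.2f}, got {actual} done")  # expected 2.84, got 1 done
```

If the actual count repeatedly falls well below the expected value, that suggests overconfidence in that timeframe, and vice versa.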

Benefits

I see three primary benefits this form of calibration provides: Firstly, it allows me to predict how well certain changes to plans affect my behavior, and thus makes it easier for me to figure out effective ways to reach my goals. When setting a goal, I may realize that I’m only 30% likely to reach it before some deadline. I could then identify numerous ways to increase the likelihood of success (e.g. make the goal more attractive or punish failure, contact somebody who may be able to help me, buy some tool that’s helpful in making progress, put reminders in place to ensure I don’t forget working on it in time, etc.), and have a precise feeling about how big each adjustment’s impact on the overall probability is.

Secondly, I waste less time on failing projects, as it's easier to anticipate whether I'll be able to finish something or not.

Thirdly, it leads to being a more reliable person. When somebody asks me whether I’ll attend an event or get something done by next week, good calibration allows me to give genuinely accurate answers.

Another potential benefit is that this domain affords making many predictions frequently, while also blurring the lines between one’s personal life and forecasting (which otherwise may be seen as two very separate areas). It is also comparably easy/forgiving/rewarding, as you don’t require specific expertise of any kind to make good (high-resolution) predictions. Consequently it’s a good way to turn forecasting into a habit, and may generally be a good domain for people to get into forecasting in the first place.

Caveats

There are a number of things to keep in mind. Namely, I can think of five caveats to all of this. I’ll also explain why I believe these caveats aren’t all that problematic or how they can be addressed.

  1. Predictions may affect outcomes: as hinted at before, this domain is special in that predictions can affect outcomes much more than in other domains. If good calibration feels more important to you than behaving optimally, then you might deceive yourself into not doing things merely because you assigned a low probability to doing them. Similarly, assigning high probabilities might increase your motivation to do those things. Consequently, this form of calibration may in some cases be epistemically questionable, but could still be instrumentally useful. I personally don’t feel like this seriously affects me, as reaching my goals / getting things done feels much more important to me than cheating my way to a better calibration score. I do find it useful though to keep my predictions hidden from myself until evaluation, to reduce the likelihood of this becoming an issue in the first place.
  2. High correlation between predictions: Predictions in this domain are more interdependent than in others. This is because many plausible happenings, such as falling ill, personal tragedies, or even just mood fluctuations, may affect all or most of the predictions at a given time similarly, which might mess up the evaluation[4]. This is definitely a drawback, but one that can be addressed in multiple ways:
    1. You could generally try to very broadly factor in all such effects in the probabilities (e.g. by assuming a, say, 4% chance of illness per week). This could still lead to misleading evaluations in the short term, but things should even out in the long term.
    2. When making the predictions, you may implicitly append a “given no unforeseen major disruptions intervene” kind of qualifier, ideally with a well defined “major disruption” criterion to mitigate any self deception.
    3. Or you do what I did and mostly ignore this issue at first, until one day it becomes relevant, and only then decide what, if anything, to do about it.
  3. Reliance on anecdotal evidence: I have only limited concrete evidence about how useful this really is, as I’m not aware of other people using similar systems, and can thus only report anecdotally, which is reason enough to be skeptical. I can say that I personally have a strong impression that the aforementioned benefits exist. Calibration scores quickly converged for me in this domain and have remained in a good state. I’ve been using prediction based planning for my goals for around 1.5 years now, and am reasonably sure it does lead to achieving more of my personal goals. It’s certainly possible these benefits don’t generalize well to other people. On the other hand, the general case for calibration training is strong, so a reasonable prior for any suitable domain is that calibration training will be effective there. It also seems very plausible to me that having a more accurate view of one’s future behavior is useful.
  4. Benefits don’t come for free: The question of course is not only whether or not this is useful, but whether it’s worth the cost involved, i.e. time and energy. Setting up such a system always takes some effort, which I assume stops most people from following through on this idea even if they do think it sounds convincing. In the long run however, the cost isn’t all that high. As pointed out in one of my comments below, I usually spend ~10 minutes per week on the prediction & evaluation part, which is enough to consistently keep me well calibrated.
  5. Individual predictions are not that valuable: When we think of forecasting, we usually think of interesting and relevant questions regarding e.g. politics or AGI timelines. For such questions, it’s easy to see how getting closer to the truth is valuable. This is not necessarily the case for “will I spend at least four hours this week writing that forum post” or “will I stick to my plan of getting groceries tomorrow”. The value derived from getting closer to the truth of such questions is comparably low. My response to this would be that the reduced effort going into each single prediction makes up for this. So while the value per prediction may indeed be lower, this does not hold for the value per time invested[5]. Additionally, there certainly are cases where accurately predicting one’s future behavior is highly valuable, such as when considering whether to pursue a difficult degree, start a costly project with a significant risk of failure, or even just sign up for a gym membership. For such cases alone it may be worth having improved one’s self calibration.

Conclusion & Discussion

Many people know of calibration training and the value that lies in frequently making and evaluating predictions. Yet I’m not aware of anyone using such methods to gain a deeper self understanding. I believe that, while there are some caveats, this particular domain of forecasting is especially useful, and one that allows relatively fast and easy calibration training due to the forecaster being so fundamentally in control of things.

My hope is that this post causes a few people to consider this domain and possibly begin experimenting with it. Other welcome outcomes would be discussions on the drawbacks, benefits and the experiences (if any) other forecasters have with self calibration.

My hypothesis is that many people who aren’t active forecasters might agree with the key points of this post to some degree, but won’t take action as setting up such a system is somewhat effortful. The upfront effort combined with relatively indirect benefits that have merely anecdotal backing might, for many, not make this seem like a very good use of their time.

People who already have some experience in forecasting on the other hand might have the systems and habits in place, but disregard this domain due to its peculiarities mentioned earlier in this post (such as predictions affecting outcomes). Another explanation may be technical reasons, such as their preferred tools making it difficult to make many predictions in a short time, effectively nudging them to make fewer, more “high-stakes” predictions rather than many simple ones, which doesn't work well with this domain.

I’d particularly like to spark some discussion regarding the following questions:

  1. Is this forecasting domain actually as promising as I make it out to be?
  2. Is my impression correct that few people actively work on “self calibration”, or is that actually somewhat widespread behavior of which I’ve been ignorant until now?
  3. In case people really do think of this as valuable in general, but too difficult/effortful to actually apply in practice, what may be the best way to make this as easy as possible for others? Which forecasting tool would work best for this? Would a well designed spreadsheet template do the job?

Thanks for reading, and also thanks a lot to Leonard Dung and one anonymous friend for helping me with this post.


  1. https://www.researchgate.net/publication/335233545_Transferability_of_calibration_training_between_knowledge_domains ↩︎

  2. e.g. I’m sure a lot of people participate on https://www.metaculus.com/ without ever having submitted a question ↩︎

  3. Assuming you have a ToDo list of four items, and your predictions of getting them done respectively are 95%, 40%, 86% and 63%, your EV would be 0.95+0.4+0.86+0.63 = 2.84. If upon evaluation you realize you only got one of the items done, you might assume to have been overconfident in your predictions, and correct future predictions downward a bit. ↩︎

  4. For instance imagine a perfectly calibrated person making 40 predictions for the next week. Under normal circumstances that should be a decent enough sample size for any noise to mostly cancel out. If the EV (i.e. the sum of all predicted probabilities) was 32.9, the number of predictions evaluated to true should usually end up close to that number. If, however, the person falls ill on Monday evening and then spends the rest of the week in bed, this will affect almost all the predictions for that week equally, and in the end only 5 instead of ~33 things might get done. ↩︎

  5. This argument only works however, if the forecaster uses a prediction software that makes it reasonably easy and frictionless to create many new predictions. ↩︎


Comments

I personally tend to stick to the following system:

  • Every Monday morning I plan my week, usually collecting anything between 20 and 50 tasks I’d like to get done that week (this planning step usually takes me ~20 minutes)
    • Most such tasks are clear enough that I don’t need to specify any further definition of done; examples would be “publish a post in the EA forum”, “work 3 hours on project X”, “water the plants” or “attend my local group’s EA social” – very little “wiggle room” or risk of not knowing whether any of these evaluates to true or false in the end
    • In a few cases, I do need to specify in greater detail what it means for the task to be done; e.g. “tidy up bedroom” isn’t very concrete, and I thus either timebox it or add a less ambiguous evaluation criterion
  • Then I go through my predictions from the week before and evaluate them based on which items are crossed off my weekly to do list (~3 minutes)
    • “Evaluate” at first only means writing a 1 or a 0 in my spreadsheet next to the predicted probability
    • There are rare exceptions where I drop individual predictions entirely due to inability to evaluate them properly, e.g. because the criterion seemed clear during planning, but it later turned out I had failed to take some aspect or event into consideration[1], or because I deliberately decided to not do the task for unforeseeable reasons[2]. Of course I could invest more time into bulletproofing my predictions to prevent such cases altogether, but my impression is that it wouldn’t be worth the effort.
  • After that I check my performance of that week as well as of the most recent 250 predictions (~2 minutes)
    • For the week itself, I usually only compare the expected value (sum of probabilities) with actually resolved tasks, to check for general over- or underconfidence, as there aren’t enough predictions to evaluate individual percentage ranges
    • For the most recent 250 predictions I check my calibration by having the predictions sorted into probability ranges of 0..9%, 10..19%, …, 90..99%[3] and checking how much the average outcome ratio of each range deviates from the average of predictions in that range. This is just a quick visual check, which lets me know in which percentage range I tend to be far off.
    • I try to use both these results in order to adjust my predictions for the upcoming week in the next step
  • Finally I assign probabilities to all the tasks. I keep this list of predictions hidden from myself throughout the following week in order to minimize the undesired effect of my predictions affecting my behavior (~5 minutes)
    • These predictions are very much System 1 based and any single prediction usually takes no more than a few seconds.
    • I can’t remember how difficult this was when I started this system ~1.5 years ago, but by now coming up with probabilities feels highly natural and I differentiate between things being e.g. 81% likely or 83% likely without the distinction feeling arbitrary.
    • Depending on how striking the results from the evaluation steps were, I slightly adjust the intuitively generated numbers. This also happens intuitively as opposed to following some formal mathematical process.

While this may sound complex when explaining it, I added the time estimates to the list above in order to demonstrate that all of these steps are pretty quick and easy. Spending these 10 minutes[4] each week seems like a fair price for the benefits it brings.
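The probability-range check from the evaluation step could be sketched like this. This is a rough illustration, assuming predictions stored as (probability, outcome) pairs, not the actual spreadsheet formulas; the sample data is made up:

```python
from collections import defaultdict

def calibration_buckets(predictions):
    """Group predictions into 10%-wide probability buckets and compare,
    per bucket, the mean predicted probability with the observed hit rate."""
    buckets = defaultdict(list)
    for prob, came_true in predictions:
        # 0.0-0.099 -> bucket 0, ..., 0.90-0.99 -> bucket 9 (1.0 clamped to 9)
        buckets[min(int(prob * 10), 9)].append((prob, came_true))
    report = {}
    for b in sorted(buckets):
        entries = buckets[b]
        mean_pred = sum(p for p, _ in entries) / len(entries)
        hit_rate = sum(1 for _, t in entries if t) / len(entries)
        report[f"{b * 10}-{b * 10 + 9}%"] = (mean_pred, hit_rate, len(entries))
    return report

# Illustrative sample: three high-confidence and two low-confidence predictions
sample = [(0.85, True), (0.82, True), (0.88, False), (0.35, False), (0.31, True)]
for rng, (pred, hit, n) in calibration_buckets(sample).items():
    print(f"{rng}: predicted {pred:.2f}, observed {hit:.2f} (n={n})")
```

A large gap between the predicted and observed columns in some bucket would flag that percentage range as poorly calibrated.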


  1. An example would be “make check up appointment with my dentist”, but when calling during the week realizing the dentist is on vacation and no appointment can be made; given there’s no time pressure and I prefer making an appointment there later to calling a different dentist, the task itself was not achieved, yet my behavior was as desired; as there are arguments to be made to evaluate this both as true or false, I often just drop such cases entirely from my evaluation ↩︎

  2. I once had the task “sign up for library membership” on my list, but then during the week realized that membership was more expensive than I had thought, and thus decided to drop that goal; here too, you could either argue “the goal is concluded” (no todo remains open at the end of the week) or “I failed the task” (as I didn’t do the formulated action), so I usually ignore those cases instead of evaluating them arbitrarily ↩︎

  3. One could argue that a 5% and a 95% prediction should really end up in the same bucket, as they entail the same level of certainty; my experience with this particular forecasting domain however is that the symmetry implied by this argument is not necessarily given here. The category of things you’re very likely to do seems highly different in nature from the category of things you’re very unlikely to do. This lack of symmetry can also be observed in the fact that 90% predictions are ~10x more frequent for me in this domain than 10% predictions. ↩︎

  4. It’s 30 minutes total, but the first 20 are just the planning process itself, whereas the 3+2+5 afterwards are the actual forecasting & calibration training. ↩︎

Do you only make forecasts that resolve within the week? I imagine it would also be useful to sharpen one’s predictive skills for longer timeframes, e.g. achieving milestones of a project, finishing a chapter of your thesis, etc.

Good point, I also make predictions about quarterly goals (which I update twice a month) as well as my plans for the year. I find the latter especially difficult, as quite a lot can change within a year including my perspective on and priority of the goals. For short term goals you basically only need to predict to what degree you will act in accordance with your preferences, whereas for longer term goals you also need to take potential changes of your preferences into account.

It does appear to me that calibration can differ between the different time frames. I seem to be well calibrated regarding weekly plans, decently calibrated on the quarter level, and probably less so on the year level (I don't yet have any data for the latter). Admittedly that weakens the "calibration can be achieved quickly in this domain" to a degree, as calibrating on "behavior over the next year" might still take a year or two to significantly improve.

Cool, I definitely feel motivated to integrate something like this into my routines. E.g. every night I rate how productive my day was, so I’m now thinking about making a forecast every morning about that. Of course my influence on the outcome will be really high, but it seems like I’ll get useful info from this anyway.

I see, so at the end of the day you're assigning a number representing how productive the day was, and you consider predicting that number the day before? I guess in case that rating is based on your feeling about the day as opposed to more objectively predefined criteria, the "predictions affect outcomes" issue might indeed be a bit larger here than described in the post, as in this case the prediction would potentially not only affect your behavior, but also the rating itself, so it could have an effect of decoupling the metric from reality to a degree.

If you end up doing this, I'd be very interested in how things go. May I message you in a month or so?

so at the end of the day you're assigning a number representing how productive the day was, and you consider predicting that number the day before?

Almost – I’ve now integrated the prediction into my morning routine. Yeah, I could actively try to do nothing about it for the first month, e.g. if I predict an unproductive day. Or, whenever I catch myself consciously thinking about altering my plans, I think I will randomize the decision, so I get immediate benefits from the practice (fixing unproductive days) while also learning to have better calibration.

If you end up doing this, I'd be very interested in how things go. May I message you in a month or so?

For sure! :)