16 Human judgment

So far, we have considered mostly automatic methods for forecasting, where the responsibility of the human forecaster was limited to deciding on an appropriate model class, and possibly on the specific model within this class. However, while most “real” forecasts start out as the outputs of software, they are very often judgmentally adjusted. Such human inputs have their place and can improve on algorithmically generated forecasts, but we must be careful: human intervention is not always positive and can make the forecast worse. Adjusting forecasts also takes time and effort. It is therefore important to consider carefully when and how to adjust forecasts. This chapter examines this challenge.

16.1 Cognitive biases

Any visit to the business book section of a local bookstore will reveal plenty of titles that emphasize that managers need to trust their gut feeling and follow their instincts (e.g., Robinson, 2006). This emphasis on intuition illustrates that human judgment is ubiquitous in organizational decision-making.

However, recent decades of academic research have also produced a counter-movement to this view, centered on studying cognitive biases (Kahneman, 2012; Kahneman et al., 2011). This line of thought sees human judgment as inherently fallible. Intuition, as a decision-making system, has evolved to help us quickly make sense of the surrounding world. Its purpose is not to process all available data and carefully weigh alternatives. Managers can easily fall into the trap of trusting their initial feeling, thereby biasing their decisions, rather than carefully deliberating and reviewing all available data and options. A key to enabling proper human judgment in forecasting is to reflect upon initial impressions and let further reasoning and information possibly overturn one’s initial gut feeling (Moritz et al., 2014).

The presence of cognitive biases in time series forecasting is well documented. We will review several particularly salient ones in this chapter.

Recent data strongly influences forecasters (recency bias): they neglect to interpret newer observations in the context of the whole time series that generated them. This behavioral pattern is also called system neglect (Kremer et al., 2011). It implies that forecasters tend to over-react to short-term shocks in the market and under-react to fundamental, massive long-term shifts.

Further, forecasters are easily misled by visualized data into “finding” illusory patterns that objectively do not exist (pareidolia). Simple random walks (such as stock market data) will likely create sequences of observations that consistently increase or decrease over time by pure chance. This mirage of a trend is quickly taken to be real, even if the past series provides little indication that actual trends exist in the data. Using such illusory trends for predicting demand can be highly misleading.

If real trends do exist in the data, human decision-makers tend to dampen them as they extrapolate into the future; that is, their longer-range forecasts implicitly assume that these trends are temporary and will naturally weaken over time (Lawrence and Makridakis, 1989). Such behavior may benefit long-term forecasts, where trends usually require dampening. However, it can reduce accuracy for shorter-term forecasts.

This discussion allows us to point out representativeness as another important judgment bias. Forecasters tend to believe that the series of forecasts they produce should look like the series of demands they observe. Consider series 1 in Figure 7.3. As mentioned earlier, the best forecast for this series is a long-run average of observed demand. Thus, plotting forecasts for multiple future periods would result in a flat line: the forecast remains the same from month to month. Comparing the actual demand series with this series of forecasts reveals an odd picture. The demand series shows considerable variation, whereas the forecasts form a straight line. Human decision-makers tend to perceive this as strange and thus introduce variation into their sequence of forecasts, so that their forecasts more closely resemble the series of actual demands (Harvey et al., 1997). Such behavior can be quite detrimental to forecasting performance.

Another critical set of biases relates to how people deal with forecast uncertainty. One key finding in this context is over-precision: human forecasters tend to under-estimate forecast uncertainty (Mannes and Moore, 2013). This bias likely stems from a tendency to ignore or discount extreme cases. The result is that prediction intervals based on human judgment tend to be too narrow – people are too confident in the precision of their predictions. While this bias is persistent and difficult to remove, recent research has provided some promising results: we can reduce over-precision by forcing decision-makers to assign probabilities to extreme outcomes (Haran et al., 2010).

A related bias is the so-called hindsight bias: decision-makers tend to believe ex-post that their forecasts were more accurate than they actually were (Biais and Weber, 2009). This bias highlights the importance of constantly calculating, communicating, and learning from the accuracy of past judgmental forecasts.

In demand forecasting and inventory planning, a particular bias arises from service level anchoring (Fahimnia et al., 2022). Anchoring refers to the unconscious mental process of latching on to a specific number when forming a judgment. Forecasters preparing a point forecast should focus on the most likely outcome or roughly the 50th percentile of the demand distribution. A firm’s service level, often widely known and usually much higher than the 50th percentile, may anchor them on a much higher part of the demand distribution, leading to persistent over-forecasting.

Humans are prone to misallocating their scarce attentional resources to topics that are salient, which may not be the ones that are important. This effect has been called bikeshedding, after the observation that people will spend proportionally more time discussing a new bike shed costing $50,000 than a new factory costing $10,000,000. As forecasters, we need to be careful not to spend too much time optimizing less important forecasts (see Section 17.5 on business value).

Statistical models are often difficult for human decision-makers to understand (they are black boxes); as a result, decision-makers trust such methods less and are more likely to discount them. In laboratory experiments, users of forecasting software were more likely to accept the forecast from the software if they could select the model from several alternatives (Lawrence et al., 2002). Users also tend to be more likely to discount a forecast if its source is a statistical model rather than a human decision-maker (Önkal et al., 2009).

Quite interestingly, human decision-makers tend to forgive other experts who make forecast errors but quickly lose their trust in an algorithm that makes similar prediction errors, a phenomenon that has been called algorithm aversion (Dietvorst et al., 2015). Due to the noise inherent in time series, both humans and algorithms will make prediction errors. Over time, this may imply that algorithms will be trusted less than humans. However, whether the user understands and trusts a statistical model says nothing, by itself, about whether the model yields good forecasts! Since we care about our forecasts’ accuracy, not our model’s popularity, black-box or algorithm-aversion arguments should not bias us against a statistical model.

In summary, while cognitive biases relate more generally to organizational decision-making, they are very relevant in our context of demand forecasting. However, judgment can provide value. Statistical algorithms do not know what they do not know, and forecasters may have domain-specific knowledge that enables better forecasting performance than any algorithm can achieve. A sound forecasting system needs to allow human input to capture this domain-specific knowledge. Recent research shows that, with the proper decision-making avenues, the forecasting performance of human judgment can be extraordinary (Tetlock and Gardner, 2015).

16.2 Domain-specific knowledge

One crucial reason human judgment is still prevalent in forecasting is the role of domain-specific knowledge (Lawrence et al., 2000). Human forecasters may have information about the market that is not (or only imperfectly) incorporated into their current forecasting models. Such domain-specific knowledge, in turn, enables humans to create better forecasts than any statistical forecasting model could accomplish. From this perspective, the underlying forecasting models appear incomplete if they do not include key variables that influence demand. For example, forecasters often note that their models do not adequately factor in promotions, so their judgment is necessary to adjust any statistical model.

However, in times of widespread business analytics, such arguments seem increasingly outdated. Promotions are quantifiable in terms of discount, length, advertisement, etc. Good statistical models to incorporate promotions into sales forecasts are available, as discussed in Chapter 11 (see also Fildes, Ma, et al., 2022).

Besides knowing variables that a forecasting model misses, forecasters may have information that is hard to quantify or codify, that is, a highly tacit and personal understanding of the market. Salespeople may, for example, be able to subjectively assess and interpret the mood of their customers during their interactions. They may also get an idea of the customers’ estimate of their business development, even if no formal forecast information is shared. The presence of such information hints at model incompleteness as well. Yet, unlike promotions, some of this information may be highly subjective and difficult to quantify and include in any forecasting model.

Another argument for human judgment in forecasting is that such judgment can identify interactions among predictor variables (Seifert et al., 2015). An interaction effect means that the impact of one particular variable on demand depends on the presence (or absence) of another variable. While human judgment is quite good at discerning such interaction effects, identifying the proper interactions can be daunting for any statistical model due to the underlying dimensionality. For example, the number of possible two-way interactions among ten variables is 45, and the number of possible three-way interactions is 120 (see the calculation below). Including many interaction terms in a regression equation can make estimating and interpreting any statistical model challenging (see Section 11.7). Human judgment may be able to pre-select meaningful interactions more readily.
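These counts follow directly from the binomial coefficient: among \(p\) candidate variables, there are \(\binom{p}{k}\) possible \(k\)-way interactions, so that

\[\begin{align} \binom{10}{2} = \frac{10 \times 9}{2} = 45 \qquad \text{and} \qquad \binom{10}{3} = \frac{10 \times 9 \times 8}{3 \times 2 \times 1} = 120. \end{align}\]

The number of candidate interaction terms thus grows very quickly with the number of predictors, which is what makes an exhaustive statistical search impractical.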

16.3 Political and incentive aspects

Dividing firms into functional silos is often a necessary aspect of organizational design to achieve focus in decision-making. Such divisions usually go hand-in-hand with incentives – marketing and sales employees may, for example, be paid a bonus depending on the realized sales of a product. In contrast, operations employees receive incentives based on cost savings. The precise key performance indicators used vary significantly from firm to firm. While such incentives may provide an impetus for action and effort within the corresponding functions, they create diverging objectives. Such goal misalignment is particularly troublesome for cross-functional processes such as forecasting and sales and operations planning.

Since the forecast is crucial for many organizational decisions, decision-makers try to influence it to achieve their corporate objectives and personal goals. Relying on a statistical model will reduce or eliminate the ability to influence decision-making through the forecast; any organizational influence on the forecast will be visible and will encounter resistance.

Mello (2009) describes seven ways in which politics and incentives can distort forecasts:

  • Enforcing occurs when forecasters maintain a forecast that is higher than what they actually anticipate, to reduce the discrepancy between forecasts and company financial targets. Suppose senior management creates a climate in which targets must be met without question. In that case, forecasters may acquiesce and adjust their forecasts accordingly to reduce any dissonance between their forecasts and these targets.
  • Relatedly, the game of spinning occurs if lower-level employees or managers deliberately alter (usually increase) the forecast to influence higher-level managers’ responses. Such behavior can be a result of higher-level management killing the messenger: if forecasters are criticized for delivering low forecasts, they will adjust their behavior to deliver more pleasant forecasts.
  • Filtering occurs when forecasters lower their forecasts to reflect supply or capacity limitations. This phenomenon often occurs if operations personnel drive forecasts to mask their inability to meet predicted demand.
  • If sales personnel strongly influence forecasts, hedging can occur, where forecasts deliberately over-estimate demand to push operations to make more product available. Similarly, suppose downstream supply chain partners influence the forecast. In that case, they may inflate demand estimates in anticipation of a supply shortage, wanting to secure a larger proportion of the resulting allocation.
  • On the other hand, sandbagging involves lowering the sales forecast so that actual demand is likely to exceed it. This strategy becomes prevalent if an organization does not sufficiently differentiate between forecasts and sales targets and if salespeople’s targets are set based on forecasts: lower forecasts mean lower sales targets, which salespeople are then more likely to exceed, earning a larger bonus.
  • Second guessing occurs when influential individuals in the forecasting process override the forecast with their judgment. Such behavior is often a symptom of general mistrust in the forecast.
  • Finally, withholding occurs when members of the organization fail to share critical information related to the forecast. This behavior is often a deliberate ploy to create uncertainty about demand among other organization members.

In summary, human judgment influences forecasts for good and bad reasons. The question of whether it improves forecasting or not is ultimately an empirical one. In practice, most companies will use a statistical forecast as a basis for their discussion but often adjust this forecast based on the consensus of the people involved.

Forecasting is ultimately a statement about reality; thus, the quality of a forecast can be judged ex-post (see Chapter 17). One can compare whether the adjustments made in this consensus process improved or reduced the accuracy of the initial statistical forecast in a Forecast Value Added (FVA) analysis (Gilliland, 2013). In a study of over 60,000 forecasts across four different companies, such a comparison revealed that, on average, judgmental adjustments to the statistical forecast increased accuracy (Fildes et al., 2009). However, a more detailed look also revealed that smaller adjustments (which were also more frequent) tended to reduce accuracy, whereas larger adjustments increased it. One interpretation of this result is that larger adjustments usually resulted from model incompleteness, that is, promotions and foreseeable events that the statistical model did not consider, whereas the smaller adjustments reflected the remaining organizational quibbling, attempts to influence the forecast, and distrust in the forecasting software. One can thus conclude that a sound forecasting process should allow human judgment to affect the forecast only in exceptional circumstances, and only with a clear indication that the underlying model is incomplete. Otherwise, organizations should limit the influence of human judgment in the process.
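For illustration, the following is a minimal Python sketch of such an FVA comparison, assuming the actual demands, statistical forecasts, and judgmentally adjusted forecasts are available as arrays (the numbers below are made up):

```python
import numpy as np

# Hypothetical history: actual demand, the statistical forecast, and the
# judgmentally adjusted (consensus) forecast for the same periods.
actuals     = np.array([100.0, 120.0,  90.0, 110.0, 105.0,  95.0])
statistical = np.array([ 98.0, 112.0,  96.0, 104.0, 103.0,  99.0])
adjusted    = np.array([105.0, 118.0,  88.0, 115.0, 100.0, 101.0])

mae_statistical = np.mean(np.abs(actuals - statistical))
mae_adjusted    = np.mean(np.abs(actuals - adjusted))

# Forecast Value Added of the adjustment step: positive values mean the
# judgmental adjustments improved accuracy relative to the statistical forecast.
fva = mae_statistical - mae_adjusted
print(f"MAE statistical: {mae_statistical:.2f}, "
      f"MAE adjusted: {mae_adjusted:.2f}, FVA: {fva:.2f}")
```

In practice, one would compute such comparisons over many series and periods, and ideally split the analysis by the size and direction of the adjustments, mirroring the analysis of Fildes et al. (2009).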

16.4 Correction and combination

If we use judgmental forecasts in addition to statistical forecasts, two types of methods may help us improve the performance of these judgmental forecasts: combination and correction. Combination methods mechanically combine judgmental forecasts with statistical forecasts, as described in Section 8.6; the simple average of two forecasts – whether judgmental or statistical – can outperform either one (Clemen, 1989). Correction methods, on the other hand, attempt to de-bias a judgmental forecast before use. Theil’s correction method is one such attempt and follows a simple procedure. A forecaster runs a regression of past demand on their past forecasts in the following form:

\[\begin{align} \mathit{Demand}_{t} = a_{0} + a_{1} \times \mathit{Forecast}_{t} + \mathit{Error}_{t}. \tag{16.1} \end{align}\]

The forecaster can then use the results from this regression equation to de-bias all forecasts made after this estimation by calculating

\[\begin{align} \mathit{Corrected~Forecast}_{t+n} = a_{0} + a_{1} \times \mathit{Forecast}_{t+n}, \tag{16.2} \end{align}\]

where \(a_0\) and \(a_1\) in Equation (16.2) are the estimated regression intercept and slope parameters from Equation (16.1). There is some evidence that this method works well in de-biasing judgmental forecasts and leads to better performance of such forecasts (Goodwin, 2000). However, we should examine whether the sources of bias change over time. For example, the biases human forecasters experience when forecasting a time series for the first time may be very different from the biases they are subject to once they have more experience with the series. Thus, early data may not remain valid for estimating Equation (16.1). Further, if forecasters know that their forecasts will be bias-corrected in this fashion, they may adjust their behavior to outsmart the correction mechanism.
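To make the mechanics concrete, here is a minimal Python sketch of this correction with made-up forecast and demand histories; in practice, one would use a longer history and periodically re-estimate the regression:

```python
import numpy as np

# Past judgmental point forecasts and the demand that actually materialized
# (hypothetical numbers for illustration).
past_forecasts = np.array([102.0,  95.0, 110.0,  98.0, 105.0, 120.0,  90.0, 101.0])
past_demand    = np.array([ 96.0,  90.0, 104.0,  95.0,  99.0, 111.0,  88.0,  97.0])

# Estimate Equation (16.1): Demand_t = a0 + a1 * Forecast_t + Error_t.
# np.polyfit with degree 1 returns the slope (a1) first, then the intercept (a0).
a1, a0 = np.polyfit(past_forecasts, past_demand, deg=1)

# Apply Equation (16.2) to de-bias a new judgmental forecast.
new_forecast = 115.0
corrected_forecast = a0 + a1 * new_forecast
print(f"a0 = {a0:.2f}, a1 = {a1:.2f}, corrected forecast = {corrected_forecast:.1f}")
```

If the historical forecasts were unbiased, the estimates would be close to \(a_0 = 0\) and \(a_1 = 1\), and the correction would leave new forecasts essentially unchanged.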

Human-guided learning is an integration method that has recently been developed and tested (Brau et al., 2023). The main idea of this method is to let forecasters indicate only that a special event (e.g., a promotion) occurs in a period, instead of allowing them to adjust the forecast directly. An algorithm in the background then estimates the effect of this event and adjusts the forecast accordingly. This method has worked surprisingly well in a large-scale retail context.
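The following toy sketch illustrates the general flag-and-estimate idea; it is not the method of Brau et al. (2023), and the numbers and the simple least-squares model are assumptions for illustration only:

```python
import numpy as np

# Toy history: statistical baseline forecasts, human event flags (1 = promotion
# flagged for that period), and realized demand. All numbers are made up.
baseline   = np.array([100.0, 102.0,  98.0, 101.0,  99.0, 103.0, 100.0,  97.0])
event_flag = np.array([    0,     1,     0,     0,     1,     0,     1,     0], dtype=float)
demand     = np.array([ 99.0, 131.0,  97.0, 102.0, 128.0, 101.0, 133.0,  96.0])

# Estimate the average uplift attributable to flagged events via least squares:
# demand ~ b0 + b1 * baseline + b2 * event_flag
X = np.column_stack([np.ones_like(baseline), baseline, event_flag])
b0, b1, b2 = np.linalg.lstsq(X, demand, rcond=None)[0]

# Next period: the forecaster only flags the event; the estimated uplift is
# added automatically instead of letting the forecaster adjust the number.
next_baseline, next_flag = 100.0, 1.0
adjusted = b0 + b1 * next_baseline + b2 * next_flag
print(f"estimated event uplift: {b2:.1f}, adjusted forecast: {adjusted:.1f}")
```

The design choice is that the human contributes only the hard-to-codify information (the event happened), while the size of the effect is estimated from data rather than judged.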

16.5 Forecasting in groups

The essence of forecast combination methods has also been discussed in the so-called Wisdom of Crowds literature (Surowiecki, 2004). The key observation in this line of research is more than 100 years old: Francis Galton, a British polymath and statistician, famously observed that in weight-judging competitions at county fairs (where fairgoers would estimate the weight of an ox, with the best estimate winning a prize), individual estimates could be far off the actual weight, but the average of all estimates was spot on and even outperformed the estimates of experts. In general, estimates provided by groups of individuals tend to be closer to the actual value than estimates provided by individuals.

An academic debate ensued about whether this phenomenon was due to group decision-making, that is, groups being able to identify the more accurate opinions through discussion, or due to statistical aggregation, that is, the group judgment representing a consensus far from the extreme views within the group, thereby canceling out error. Decades of research established that the latter explanation is more likely to apply. Group consensus processes to discuss forecasts can be highly dysfunctional because of underlying group pressure and other behavioral phenomena.

Structured group processes that limit this dysfunctionality, such as the Delphi method and the nominal group technique, exist. Still, the benefits of such methods for forecasting, compared to simple averaging, are unclear. The simple average of opinions seems to work well (Larrick and Soll, 2006), a finding that parallels the analogous result for averaging statistical forecasts (the Forecast Combination Puzzle, see Section 8.6). Furthermore, the average of experts in a field is not necessarily better than the average of amateurs (Surowiecki, 2004). In other words, decision-makers in forecasting may be well advised to skip the process of group meetings to find a consensus; instead, forecasters should prepare their forecasts independently. A simple or weighted average of these independent forecasts can then establish the final consensus. This averaging process filters out the random error inherent in human judgment, as the small simulation below illustrates. The benefit of team meetings in forecasting may therefore be related more to improved stakeholder management and accountability than to improved forecast quality.
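The following small simulation, with made-up numbers, illustrates this error-cancellation effect under the assumption that individual judgment errors are independent and unbiased:

```python
import numpy as np

rng = np.random.default_rng(42)

true_value = 200.0                 # the quantity being forecast
n_forecasters, n_trials = 10, 10_000

# Each forecaster's judgment = truth + independent noise (assumed noise level).
judgments = true_value + rng.normal(0.0, 20.0, size=(n_trials, n_forecasters))

individual_mae = np.abs(judgments - true_value).mean()
crowd_mae = np.abs(judgments.mean(axis=1) - true_value).mean()
print(f"average individual error: {individual_mae:.1f}")
print(f"error of the averaged forecast: {crowd_mae:.1f}")
```

With ten forecasters, the error of the average is roughly a third (about \(1/\sqrt{10}\)) of the typical individual error; correlated or systematically biased judgments would, of course, reduce this benefit.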

This principle of aggregating independent opinions to create better forecasts is powerful but counterintuitive. Experts should be better judges, and reaching team consensus should create better decisions. The Wisdom of Crowds argument contradicts some of these beliefs, since it implies that seeking consensus in a group may not lead to better outcomes and that a group of amateurs can beat experts. The latter effect has recently been demonstrated in the context of predictions in the intelligence community (Spiegel, 2014; Tetlock and Gardner, 2015). As part of the Good Judgment Project, ordinary individuals from across the United States have been preparing probability judgments on global events. Their pooled predictions often beat the predictions of trained CIA analysts with access to confidential data. If amateurs can beat trained professionals in a context where those professionals have privileged domain knowledge, the power of the Wisdom of Crowds becomes quite apparent. The implication is that, for critical forecasts in an organization, having multiple forecasts prepared in parallel (and independently) and then simply taking their average may be a simple yet effective way of increasing judgmental forecasting accuracy.

Key takeaways

  1. Human judgment can improve forecasts, especially if humans possess information that is hard to consider within a statistical forecasting method.

  2. Humans have not evolved to deal well with judgment under uncertainty. Cognitive biases imply that human intervention will often make forecasts worse. That a statistical method is hard to understand does not mean that a human forecaster can improve the forecast.

  3. Incentive structures may reward people for making forecasts worse. People will try to influence the forecast since they are interested in influencing the decisions made based on the forecast.

  4. Measure whether and when human judgment improves forecasts. It may make sense to restrict judgmental adjustments to only those contexts where there is concrete evidence of model incompleteness (e.g., the forecast does not factor in promotions).

  5. When relying on human judgment in forecasting, get independent judgments from multiple forecasters and then average these opinions.

References

Biais, B., and Weber, M. (2009). Hindsight bias, risk perception, and investment performance. Management Science, 55(6), 1018–1029.
Brau, R., Aloysius, J., and Siemsen, E. (2023). Demand planning for the digital supply chain: How to integrate human judgment and predictive analytics. Journal of Operations Management.
Clemen, R. T. (1989). Combining forecasts: A review and annotated bibliography. International Journal of Forecasting, 5(4), 559–583.
Dietvorst, B. J., Simmons, J. P., and Massey, C. (2015). Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General, 144(1), 114–126.
Fahimnia, B., Arvan, M., Tan, T., and Siemsen, E. (2022). A hidden anchor: The influence of service levels on demand forecasts. Journal of Operations Management.
Fildes, R., Goodwin, P., Lawrence, M., and Nikolopoulos, K. (2009). Effective forecasting and judgmental adjustments: An empirical evaluation and strategies for improvement in supply-chain planning. International Journal of Forecasting, 25(1), 3–23.
Fildes, R., Ma, S., and Kolassa, S. (2022). Retail forecasting: Research and practice. International Journal of Forecasting, 38(4), 1283–1318.
Gilliland, M. (2013). FVA: A reality check on forecasting practices. Foresight: The International Journal of Applied Forecasting, 29, 14–18.
Goodwin, P. (2000). Correct or combine? Mechanically integrating judgmental forecasts with statistical methods. International Journal of Forecasting, 16(2), 261–275.
Haran, U., Moore, D. A., and Morewedge, C. K. (2010). A simple remedy for overprecision in judgment. Judgment and Decision Making, 5(7), 467–476.
Harvey, N., Ewart, T., and West, R. (1997). Effects of data noise on statistical judgement. Thinking & Reasoning, 3(2), 111–132.
Kahneman, D. (2012). Thinking, fast and slow. Penguin.
Kahneman, D., Lovallo, D., and Sibony, O. (2011). Before you make that big decision. Harvard Business Review, 89(6), 51–60.
Kremer, M., Moritz, B., and Siemsen, E. (2011). Demand forecasting behavior: System neglect and change detection. Management Science, 57(10), 1827–1843.
Larrick, R. P., and Soll, J. B. (2006). Intuitions about combining opinions: Misappreciation of the averaging principle. Management Science, 52(1), 111–127.
Lawrence, M., Goodwin, P., and Fildes, R. (2002). Influence of user participation on DSS use and decision accuracy. Omega, 30(5), 381–392.
Lawrence, M., and Makridakis, S. (1989). Factors affecting judgmental forecasts and confidence intervals. Organizational Behavior and Human Decision Processes, 43(2), 172–187.
Lawrence, M., O’Connor, M., and Edmundson, B. (2000). A field study of sales forecasting accuracy and processes. European Journal of Operational Research, 122, 151–160.
Mannes, A. E., and Moore, D. A. (2013). A behavioral demonstration of overconfidence in judgment. Psychological Science, 24(7), 1190–1197.
Mello, J. (2009). The impact of sales forecast game playing on supply chains. Foresight: The International Journal of Applied Forecasting, 13, 13–22.
Moritz, B., Siemsen, E., and Kremer, M. (2014). Judgmental forecasting: Cognitive reflection and decision speed. Production and Operations Management, 23(7), 1146–1160.
Önkal, D., Goodwin, P., Thomson, M., Gönül, S., and Pollock, A. (2009). The relative influence of advice from human experts and statistical methods on forecast adjustments. Journal of Behavioral Decision Making, 22(4), 390–409.
Robinson, L. A. (2006). Trust your gut: How the power of intuition can grow your business. Chicago, IL: Kaplan Publishing.
Seifert, M., Siemsen, E., Hadida, A. L., and Eisingerich, A. B. (2015). Effective judgmental forecasting in the context of fashion products. Journal of Operations Management, 36, 33–45.
Spiegel, A. (2014). So you think you’re smarter than a CIA agent.
Surowiecki, J. (2004). The wisdom of crowds. New York, NY: Anchor.
Tetlock, P. E., and Gardner, D. (2015). Superforecasting. Crown Publishers.