Some Simple Measures of Forecast Accuracy

So you've built a model and the predictive plots look nice, but you want to distill this information down to a single number. This is where measures of forecast accuracy come in. In this post we will recall some simple measures of forecast accuracy, and then I'll explain why I don't like them unless you intend to use them to compare more than one model (i.e. as a relative performance measure).

In the following descriptions, let the actuals be a_i and forecasts be f_i, where 1 \leq i \leq n.

Root Mean Squared Error (RMSE):

RMSE=\sqrt{\frac{1}{n} \sum_{i=1}^n (a_i-f_i)^2}

RMSE <- function(a,f){sqrt(mean((a-f)^2))}

Mean Forecast Error (MFE):

MFE=\frac{1}{n} \sum_{i=1}^n (a_i-f_i)

The ideal value for the MFE is 0. If MFE > 0, the model tends to under-forecast the actuals. If MFE < 0, the model tends to over-forecast the actuals.

MFE <- function(a,f){mean(a-f)}
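
To make the sign convention concrete, here is a minimal sketch using made-up numbers (the vectors a, f_low, f_high and f_mixed are hypothetical and only for illustration). It also shows that errors of opposite sign cancel, so an MFE of 0 says the forecasts are unbiased on average, not that the individual errors are small.

```r
# Mean forecast error, as defined above
MFE <- function(a, f) mean(a - f)

# Hypothetical actuals (made-up numbers, for illustration only)
a <- c(10, 12, 14, 16)

f_low  <- a - 2   # forecasts consistently below the actuals
f_high <- a + 2   # forecasts consistently above the actuals

MFE(a, f_low)     # 2: positive, so the model under-forecasts
MFE(a, f_high)    # -2: negative, so the model over-forecasts

# Errors of opposite sign cancel each other out
f_mixed <- a + c(5, -5, 5, -5)
MFE(a, f_mixed)   # 0, even though every forecast is off by 5
```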

Mean Absolute Error (MAE):

MAE=\frac{1}{n} \sum_{i=1}^n | a_i-f_i |

MAE <- function(a,f){mean(abs(a-f))}

Tracking Signal (TS):

TS=\frac{1}{MAE(a,f)} \sum_{i=1}^n (a_i-f_i)

The TS has a general rule of thumb: if -4 < TS < 4, then the model is assumed to be producing accurate forecasts.

TS <- function(a,f){sum(a-f)/MAE(a,f)}
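
As a rough illustration of the rule of thumb, the sketch below applies the TS to two hypothetical sets of forecasts (all numbers are made up). When every error has the same sign, the sum of errors equals n times the MAE, so the TS equals n in absolute value; with six observations that already falls outside the (-4, 4) band.

```r
# Definitions from above, repeated so the snippet runs on its own
MAE <- function(a, f) mean(abs(a - f))
TS  <- function(a, f) sum(a - f) / MAE(a, f)

# Hypothetical actuals (made-up numbers, for illustration only)
a <- c(100, 110, 120, 130, 140, 150)

# Consistently biased forecasts: every error is +3
f_biased <- a - 3
TS(a, f_biased)     # 18 / 3 = 6, outside the (-4, 4) band

# Errors of the same size that alternate in sign
f_balanced <- a + c(3, -3, 3, -3, 3, -3)
TS(a, f_balanced)   # 0 / 3 = 0, inside the band
```

Note that both sets of forecasts have the same MAE of 3; the TS only distinguishes them by whether the errors accumulate in one direction.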

Mean Percentage Error (MPE):

MPE=\frac{1}{n} \sum_{i=1}^n \frac{a_i-f_i}{a_i}

MPE <- function(a,f){mean((a-f)/a)}

Mean Absolute Percentage Error (MAPE):

MAPE=\frac{1}{n} \sum_{i=1}^n | \frac{a_i-f_i}{a_i} |

MAPE <- function(a,f){mean(abs((a-f)/a))}

The problem with using many of the aforementioned measures on a single model is that it can be difficult to specify a threshold for accuracy that makes intuitive sense. While these measures may be great for condensing a model's errors into a single statistic, how truly 'accurate' are the forecasts? A MAPE of 0.25 is obviously better than a MAPE of 0.55 (ceteris paribus), but can we specify an appropriate threshold for a general class of models? In theory, perhaps; in practice, I'm not aware of any that I would confidently apply. Rules of thumb can be defined (as in the case of the TS), but since many of these measures are not normalized, their interpretation depends on the magnitude of the model's target variable. The root of the problem is that all of these error measures are unbounded (in at least one direction). If we know a value is bounded within a range, we can assess how extreme it is within that range. No analogous reference exists for these forecast measures. The smaller the error, the better, but it is unclear how large a model's error can or should be before one should re-evaluate the model... and such is the nature of statistics.
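
The dependence on the magnitude of the target variable can be seen in a small sketch with made-up numbers (the vectors a and f are hypothetical). Rescaling the same data, say from thousands of units to single units, multiplies the MAE by the same factor, while the MAPE is unchanged because it is a ratio.

```r
# Definitions from above, repeated so the snippet runs on its own
MAE  <- function(a, f) mean(abs(a - f))
MAPE <- function(a, f) mean(abs((a - f) / a))

# Hypothetical actuals and forecasts (made-up numbers)
a <- c(20, 25, 40, 50)
f <- c(22, 20, 44, 45)

MAE(a, f)                 # 4
MAPE(a, f)                # 0.125

# Same data expressed in units 1000 times smaller
MAE(1000 * a, 1000 * f)   # 4000: scales with the target variable
MAPE(1000 * a, 1000 * f)  # 0.125: unchanged
```

So a threshold such as "MAE below 10" means nothing without knowing the units, and even the scale-free MAPE has no upper bound against which to judge a value like 0.125.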

While it can be argued that there are ways to introduce a forecast criterion given a particular measure (and I would be interested in any papers that establish an appropriate criterion), it is wiser to avoid this in practice. Simple forecast measures are better left for making model-to-model comparisons of performance.
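
A model-to-model comparison might look like the following sketch. The actuals, the two sets of forecasts and the names model_1 and model_2 are all hypothetical; the point is only that the measures are read relative to each other on the same data, not against a fixed threshold.

```r
# Definitions from above, repeated so the snippet runs on its own
MAE  <- function(a, f) mean(abs(a - f))
MAPE <- function(a, f) mean(abs((a - f) / a))

# Hypothetical actuals and two competing sets of forecasts (made-up numbers)
a <- c(100, 200, 400, 500)
forecasts <- list(
  model_1 = c(110, 190, 420, 480),
  model_2 = c(120, 180, 440, 450)
)

# One column per model, one row per measure
scores <- sapply(forecasts, function(f) c(MAE = MAE(a, f), MAPE = MAPE(a, f)))
scores

# Model with the smallest MAE on this data
best <- names(which.min(scores["MAE", ]))
best   # "model_1"
```

Here model_1 has the lower MAE (15 versus 32.5) and the lower MAPE (0.06 versus 0.125), so it is preferred on this data; whether an MAE of 15 is good in any absolute sense is left open.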