Incremental complexity: A forecaster’s odyssey

1 Introduction

Welcome! Our adventures in time-series analysis continue, but this time with a focus on forecasting. Forecasting is an incredibly important tool for many researchers and analysts, but it is one that is often abused, obfuscated, or just outright done poorly. In this article, we are going to explore forecasting from a ‘bottom-up’ approach to complexity. That is, we want to try simple models first, before jumping into something more complex and, therefore, harder to interpret. This is something I call incremental complexity, and it is an idea I introduced in this recent paper. Now, in some applications, such as process engineering, you might not need to care about interpretability because maximising accuracy is what matters. But for most applied settings in which I typically work, people need to know how conclusions were reached, or how numbers were generated. Using some off-the-shelf convoluted neural network just does not cut it. Besides, that stuff is boring anyway; we care about statistical uncertainty and distributions in this house!

Before we get started, we will load the R packages we need:

library(dplyr)
library(ggplot2)
library(lubridate)
library(tsibble)
library(feasts)
library(fable)

2 Downloading the dataset

We are going to analyse monthly laboratory confirmed influenza cases in Australia for no reason other than it seemed interesting. There is a bit of data wrangling to get it in the format we need, so stay with me here. We are just going to take the data up to and including 2017 to avoid the weirdness of COVID-19 (I will save that for a future post!) and the strange dip exhibited in 2018. Since data wrangling is not the focus of this article, feel free to just run this and move on:

library(httr)
library(readxl)
library(janitor)

url <- "https://www.health.gov.au/sites/default/files/2023-09/national-notifiable-diseases-surveillance-system-nndss-public-dataset-influenza-laboratory-confirmed.xlsx"

GET(url, write_disk(tf <- tempfile(fileext = ".xlsx")))
flu_2008_2015 <- read_excel(tf, sheet = 1, skip = 1)
flu_2016_2018 <- read_excel(tf, sheet = 2, skip = 1)

flu <- bind_rows(flu_2008_2015, flu_2016_2018) %>%
  clean_names() %>%
  mutate(week_ending_friday = as.Date(week_ending_friday, format = "%Y-%m-%d"),
         month = yearmonth(as.Date(week_ending_friday, format = "%Y-%m-%d"))) %>%
  reframe(cases = n(), .by = "month") %>%
  mutate(id = 1) %>%
  filter(month <= yearmonth(as.Date("2017-12-31", format = "%Y-%m-%d")))

Here is a quick visualisation of what we are dealing with:

flu %>%
  ggplot(aes(x = month, y = cases)) +
  geom_line(aes(group = 1)) +
  labs(x = "Month",
       y = "Lab confirmed influenza cases",
       colour = NULL) +
  theme_bw() +
  theme(legend.position = "bottom")

However, the temporal dynamics are quite difficult to observe on this scale. Let’s log-scale it and try again:

flu %>%
  ggplot(aes(x = month, y = cases)) +
  geom_line(aes(group = 1)) +
  labs(x = "Month",
       y = "log-scaled lab confirmed influenza cases",
       colour = NULL) +
  scale_y_log10() +
  theme_bw() +
  theme(legend.position = "bottom")

Beautiful! We can now make out a seasonal pattern and an upwards trend. But more on these things later! It's time to try some basic models.

3 Fitting forecast models

In R, forecasting should be done using the tidyverts – a collection of packages developed by the incredible Rob Hyndman and colleagues for handling forecasting and time-series data more generally using tidy data principles. Specifically, we are interested in the following packages:

  • tsibble – package for the tsibble data type, which is a time series version of the regular tibble data frame object
  • feasts – package for computing time-series features and useful associated quantities
  • fable – the actual forecasting package

To start with, we will convert our data frame to a tsibble:

tsbl <- as_tsibble(flu, key = id, index = month)
head(tsbl)
## # A tsibble: 6 x 3 [1M]
## # Key:       id [1]
##      month cases    id
##      <mth> <int> <dbl>
## 1 2008 Jan   121     1
## 2 2008 Feb   155     1
## 3 2008 Mar   149     1
## 4 2008 Apr   163     1
## 5 2008 May   270     1
## 6 2008 Jun   250     1

3.1 Some simple approaches

Okay, so where should we start? Well, if you open the highly-recommended and exceptional Forecasting: Principles and Practice (3rd Edition), you would find a short list of so-called ‘simple’ forecasting methods. These include:

  • Mean method (MEAN function in fable) – forecasts of all future values are equal to the average (i.e., mean) of the historical data. With historical data denoted by \(y_{1}, \dots , y_{T}\), forecasts can be written as:

\[ \hat{y}_{T+h|T} = \bar{y} = \frac{(y_{1} + \dots + y_{T})}{T} \] where \(\hat{y}_{T+h|T}\) basically says ‘the estimate of the value for \(y_{T+h}\) based on the data \(y_{1}, \dots, y_{T}\)’.

  • Naive method (NAIVE function in fable) – forecasts are the value of the last observation. Mathematically, this approach is expressed by:

\[ \hat{y}_{T+h|T} = y_{T} \]

  • Seasonal naive method (SNAIVE function in fable) – forecasts are equal to the last observed value from the same season. Mathematically, this is:

\[ \hat{y}_{T+h|T} = y_{T+h-m(k+1)} \] where \(m\) is the seasonal period and \(k\) is the integer part of \((h-1)/m\) (i.e., the number of complete years in the forecast period prior to time \(T+h\)).

  • Drift method (RW function with drift() in fable) – forecasts are similar to the naive method, but they are allowed to increase or decrease over time, where the amount of change over time (i.e., ‘drift’) is the average change observed in the historical data. Mathematically, this is:

\[ \hat{y}_{T+h|T} = y_{T} + \frac{h}{T-1} \sum_{t=2}^{T}(y_{t}-y_{t-1}) = y_{T}+h \left( \frac{y_{T}-y_{1}}{T-1} \right) \] In fpp3, Rob Hyndman cleverly describes this as being equivalent to drawing a straight line between the first and last observations, and extrapolating it forward. To make these formulas concrete, the short sketch below computes each one-step-ahead forecast by hand for a toy series before we let fable do the work.
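
Here is that minimal base-R sketch (the toy vector y and seasonal period m are assumptions purely for illustration; this is not our flu data):

# Toy monthly series spanning two years (values are arbitrary)
y <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
       115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140)
T_len <- length(y)
m <- 12  # seasonal period

c(
  mean   = mean(y),                                    # mean method
  naive  = y[T_len],                                   # naive method
  snaive = y[T_len - m + 1],                           # seasonal naive: same month last year
  drift  = y[T_len] + (y[T_len] - y[1]) / (T_len - 1)  # drift method with h = 1
)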

3.1.1 Fitting simple models in R

As highlighted above, there are simple one-line model functions in fable that enable us to fit these models. But there is another reason why fable is so good – we can fit them all at the same time using the model function! This makes our code more compact and also makes it easier to compare models. Here is an example using our sample data, remembering that we want to log-scale cases1:

fit <- tsbl %>%
  model(
    mean = MEAN(log(cases)),
    naive = NAIVE(log(cases)),
    snaive = SNAIVE(log(cases) ~ lag("year")),
    drift = RW(log(cases) ~ drift())
  )

fit
## # A mable: 1 x 5
## # Key:     id [1]
##      id    mean   naive   snaive         drift
##   <dbl> <model> <model>  <model>       <model>
## 1     1  <MEAN> <NAIVE> <SNAIVE> <RW w/ drift>

This returns a mable object (a ‘model table’) where each row corresponds to a unique time series (in our case we only have one) and each column contains an object for each model that was fit. Very neat.

What can we do with this? Well, a lot!

We might want to produce some quick forecasts for an entire year for all models:

first_fc <- fit %>% 
  forecast(h = 12)

head(first_fc)
## # A fable: 6 x 5 [1M]
## # Key:     id, .model [1]
##      id .model    month          cases .mean
##   <dbl> <chr>     <mth>         <dist> <dbl>
## 1     1 mean   2018 Jan t(N(7.4, 2.3)) 3638.
## 2     1 mean   2018 Feb t(N(7.4, 2.3)) 3638.
## 3     1 mean   2018 Mar t(N(7.4, 2.3)) 3638.
## 4     1 mean   2018 Apr t(N(7.4, 2.3)) 3638.
## 5     1 mean   2018 May t(N(7.4, 2.3)) 3638.
## 6     1 mean   2018 Jun t(N(7.4, 2.3)) 3638.

The forecast function knows how to handle mable objects, which makes our life easy. We now have a data frame containing full forecast distributions for our forecast period. We can easily plot them:

first_fc %>%
  autoplot(tsbl) +
  scale_y_log10()

We can also disable the prediction intervals to just observe the means for visual clarity:

first_fc %>%
  autoplot(tsbl, level = NULL) +
  scale_y_log10()

This is great, and it appears that the seasonal naive model performs the best, but how can we quantify that assertion? Let’s take a look at some ways to evaluate performance.

3.1.2 Evaluating forecast accuracy

Typically in forecasting work we would start with analysis of residuals and other diagnostics. But demonstrating all of those for every model here would take a long time. I encourage readers to check out the relevant chapters of fpp3. Instead, we are going to focus on evaluating point forecast and forecast distribution accuracy.

3.1.2.1 Point forecast accuracy

Proper evaluation of accuracy instead of just looking at residuals is critical, however. This is because residuals are not a good indicator of how large actual forecast errors might be. Similar to other areas of statistics and machine learning, we need to split our data into training and test data to evaluate genuine forecasts over unseen time points. Typical guidance is that the test set should be at least as large as the longest forecast time horizon.

Let’s filter our data to contain everything up to and including 2016, and then use 2017 as our test set2:

train <- tsbl %>%
  filter(lubridate::year(month) < 2017)

test <- tsbl %>%
  filter(lubridate::year(month) >= 2017)

Now we can re-fit the models using only the training data and generate forecasts for one year:

fit_train <- train %>%
  model(
    mean = MEAN(log(cases)),
    naive = NAIVE(log(cases)),
    snaive = SNAIVE(log(cases) ~ lag("year")),
    drift = RW(log(cases) ~ drift())
  )

test_fc <- fit_train %>%
  forecast(h = 12)

Here is a plot of our forecasts against the actuals:

test_fc %>%
  autoplot(tsbl, level = NULL) +
  labs(x = "Month",
       y = "log-scaled lab confirmed influenza cases") +
  scale_y_log10() +
  guides(colour = guide_legend(title = "Forecast"))

Clearly, these are pretty poor forecasts (aside from maybe the seasonal naive model), which suggests that a more sophisticated approach is warranted (i.e., one that captures the temporal dynamics our eyes can see in the data, at a minimum). Indeed, this is the point of this entire article – we need to start with basic methods that are completely transparent and interpretable, and build the case for additional complexity. In our circumstances here, we have a pretty strong case for moving to more complex methods. We can further this case by computing a range of error metrics for the forecasts3:

accuracy(test_fc, test)
## # A tibble: 4 × 11
##   .model    id .type     ME   RMSE    MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <dbl> <chr>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 drift      1 Test   8486. 31413. 18776. -164.  201.    NaN   NaN 0.516
## 2 mean       1 Test  17745. 35926. 18164.   27.9  47.4   NaN   NaN 0.560
## 3 naive      1 Test  11597. 32528. 17889.  -99.3 143.    NaN   NaN 0.532
## 4 snaive     1 Test   7982. 18229.  9741.  -17.2  53.9   NaN   NaN 0.542

3.1.2.2 Forecast distribution accuracy

So far, we have only evaluated point forecast accuracy. But we would be terrible forecasters if that is all we cared about! Why interpret a single value, when you can make use of an entire distribution which encapsulates your uncertainty about the future! A useful metric for this scenario is the Continuous Ranked Probability Score (CRPS), which is essentially a weighted absolute error computed from the entire forecast distribution, where the weighting explicitly accounts for the probabilities. Here is how we can use the CRPS to evaluate forecast distributions for all time points in the prediction set:

test_fc %>%
  accuracy(test, list(crps = CRPS))
## # A tibble: 4 × 4
##   .model    id .type   crps
##   <chr>  <dbl> <chr>  <dbl>
## 1 drift      1 Test  14487.
## 2 mean       1 Test  17314.
## 3 naive      1 Test  14348.
## 4 snaive     1 Test   8222.

For the CRPS, lower values indicate better distributional forecasts. We see that the snaive method is producing better forecasts than the other simple approaches. However, as noted earlier, the other methods are quite poor, which suggests we may need to move to something that better captures the relevant temporal properties.
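
As an aside, the CRPS can also be approximated directly from samples of a forecast distribution, which is a nice way to build intuition for what the metric is doing. Here is a minimal sketch (the crps_sample helper and the toy numbers are hypothetical; fabletools computes CRPS for us):

crps_sample <- function(samples, y_obs) {
  # Sample-based CRPS: E|X - y| - 0.5 * E|X - X'|, where X and X' are draws
  # from the forecast distribution and y is the observed value
  mean(abs(samples - y_obs)) - 0.5 * mean(abs(outer(samples, samples, "-")))
}

set.seed(123)
crps_sample(rnorm(1000, mean = 500, sd = 50), y_obs = 520)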

3.2 More complex methods

Okay, so we have built up a case for moving to more complex methods. Remember, complexity often comes at the cost of interpretability. However, since we are trying to be truly incremental about all of this, we are going to progress to methods that are indeed more complex than a mean or naive forecast, but are still incredibly interpretable and well-studied. Best of all, since we are using fable, we can easily compare the performance of all of these approaches. Let’s fit the same forecasts as before but with the addition of an autoregressive integrated moving average (ARIMA) model, an exponential smoothing (ETS) model, a Seasonal and Trend decomposition using Loess (STL) model, and a time-series linear model which just captures trend and annual seasonality (where seasonality is modelled with simple seasonal dummy variables):

fit_train_complex <- train %>%
  model(
    mean = MEAN(log(cases)),
    naive = NAIVE(log(cases)),
    snaive = SNAIVE(log(cases) ~ lag("year")),
    drift = RW(log(cases) ~ drift()),
    stepwise = ARIMA(log(cases)),
    search = ARIMA(log(cases), stepwise = FALSE, approx = FALSE),
    ets = ETS(log(cases)),
    stlf = decomposition_model(
      STL(log(cases) ~ trend(window = 21), robust = TRUE), 
      NAIVE(season_adjust)
      ), 
    tslm = TSLM(log(cases) ~ trend() +  season())
  )

test_fc_complex <- fit_train_complex %>%
  forecast(h = 12)

Let’s unpack the new models a bit further. There are two ARIMA specifications. They both use fable’s amazing auto-ARIMA functionality (meaning we don’t have to manually specify the \(p\), \(d\), and \(q\) parameters), but the stepwise model uses the default stepwise automation procedure, while search does a much larger model space search to find the optimal model4. Similarly, we are letting fable do the automated search for the optimal parameters of the ETS model too.

But why are ARIMA, ETS, and STL models ‘interpretable’ despite the substantial jump in complexity? Well, an ARIMA model is a generative process. That is, we can simulate time series from an ARIMA process because it describes the temporal mechanics: an autoregressive lag order (\(p\); i.e., each value is modelled as a function of the previous \(p\) values), a degree of differencing (\(d\); i.e., the number of times the series is differenced to remove trend and render it stationary), and a moving average lag order (\(q\); i.e., each value is modelled as a function of the previous \(q\) forecast errors rather than the values themselves, as in the autoregressive part). This leads us to a formulation that looks a lot like an extension of linear regression, where \(y_{t}^{'}\) is the differenced time series:

\[ y_{t}^{'} = c + \phi_{1} y_{t-1}^{'} + \dots + \phi_{p} y_{t-p}^{'} + \theta_{1} \epsilon_{t-1} + \dots + \theta_{q} \epsilon_{t-q} + \epsilon_{t} \]

The \(\phi\) components describe the autoregressive part of the model, while the \(\theta\) components describe the moving average part of the model. \(\epsilon\) is the error term.
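
To emphasise the ‘generative process’ point, here is a minimal sketch that simulates a series from an assumed ARIMA(1,0,1) process using base R (the coefficients are arbitrary choices for illustration):

set.seed(123)
# phi = 0.7 (autoregressive), theta = 0.4 (moving average), no differencing
sim <- arima.sim(model = list(ar = 0.7, ma = 0.4), n = 120)
plot.ts(sim)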

With regards to ETS, put simply, an exponential smoothing model employs weighted averages of past observations, with the weights decaying exponentially as the observations get older. This approach can then be extended to account for both trends and seasonality. Mathematically, we can represent the weighted average form of the simplest case (simple exponential smoothing) as:

\[ \hat{y}_{T+1|T} = \sum_{j=0}^{T-1} \alpha(1 - \alpha)^{j} y_{T-j} + (1 - \alpha)^{T} \ell_{0} \]

where \(\alpha\) is a smoothing parameter between 0 and 1 and \(\ell_{0}\) is the initial level.
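
To see the exponentially decaying weights in action, here is a minimal base-R sketch of this weighted-average computation (the ses_forecast helper and the \(\alpha\) and \(\ell_{0}\) values are purely illustrative; fable estimates these for us):

ses_forecast <- function(y, alpha, l0) {
  T_len <- length(y)
  weights <- alpha * (1 - alpha)^(0:(T_len - 1))  # weights decay exponentially with age
  sum(weights * rev(y)) + (1 - alpha)^T_len * l0  # rev(y) puts the most recent value first
}

ses_forecast(y = c(10, 12, 11, 14, 13, 15), alpha = 0.3, l0 = 10)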

Finally, STL is the most intuitive of the three new additions. Its usefulness is best seen graphically, since it decomposes the time series into trend, seasonality, and the remainder of the signal. This leads to an additive component interpretation of time series. Here is the STL decomposition for the train set, where the upwards trend and annual seasonality are made very clear:

train %>%
  model(stl = STL(log(cases))) %>%
  components() %>%
  autoplot()

We can now directly compare the forecast distributions of all of the models fit thus far:

test_fc_complex %>%
  accuracy(test, list(crps = CRPS)) %>%
  arrange(crps)
## # A tibble: 9 × 4
##   .model      id .type   crps
##   <chr>    <dbl> <chr>  <dbl>
## 1 search       1 Test   5160.
## 2 stepwise     1 Test   5226.
## 3 tslm         1 Test   6749.
## 4 stlf         1 Test   8134.
## 5 snaive       1 Test   8222.
## 6 ets          1 Test  11185.
## 7 naive        1 Test  14348.
## 8 drift        1 Test  14487.
## 9 mean         1 Test  17314.

We see that the search ARIMA model produces the best forecast distributions, and four of the five newer complex models are the top performers. Let’s visualise the search ARIMA’s forecasts independently:

test_fc_complex %>%
  filter(.model == "search") %>%
  autoplot(tsbl) +
  labs(x = "Month",
       y = "log-scaled lab confirmed influenza cases") +
  scale_y_log10() +
  guides(colour = guide_legend(title = "Forecast"))

The model appears to capture 2017’s data reasonably well in terms of both the prediction interval coverage and the mean forecast against the actual. However, there are other methods for evaluating forecast accuracy. One of these is the Winkler score. If the \(100(1- \alpha) \%\) prediction interval at time \(t\) is given by \([\ell_{\alpha,t}, u_{\alpha,t}]\), then the Winkler score is defined as the length of the interval plus a penalty if the observation is outside the interval. As Rob Hyndman succinctly states in fpp3:

For observations that fall within the interval, the Winkler score is simply the length of the interval. Thus, low scores are associated with narrow intervals. However, if the observation falls outside the interval, the penalty applies, with the penalty proportional to how far the observation is outside the interval.
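
To make the definition concrete, here is a minimal sketch of the Winkler score for a single observation and interval (the winkler_score helper is hypothetical; fable’s interval_accuracy_measures handles this across whole forecast objects):

winkler_score <- function(y, lower, upper, alpha = 0.05) {
  width <- upper - lower
  if (y < lower) {
    width + (2 / alpha) * (lower - y)  # penalty for falling below the interval
  } else if (y > upper) {
    width + (2 / alpha) * (y - upper)  # penalty for falling above the interval
  } else {
    width                              # inside the interval: just the width
  }
}

winkler_score(y = 1200, lower = 400, upper = 1000, alpha = 0.05)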

We can compare Winkler scores for forecasts easily in fable, but the decision of the width of the interval is important (we use \(95\%\) here for demonstration purposes):

test_fc_complex %>%
  accuracy(tsbl, measures = interval_accuracy_measures,
    level = 95) %>%
  arrange(winkler)
## # A tibble: 9 × 4
##   .model      id .type winkler
##   <chr>    <dbl> <chr>   <dbl>
## 1 tslm         1 Test   43905.
## 2 stepwise     1 Test   50738.
## 3 search       1 Test   52982.
## 4 snaive       1 Test   75465.
## 5 naive        1 Test  230301.
## 6 stlf         1 Test  231972.
## 7 drift        1 Test  389376.
## 8 mean         1 Test  466614.
## 9 ets          1 Test  590724.

Winkler scores suggest that the time-series linear model (i.e., a model consisting of a linear trend component added to a seasonal periodicity component) is the best performing method. The two ARIMA specifications are the second and third-best options, followed by the seasonal naive method. Clearly, this finding contrasts with the result obtained from the CRPS method. Statistical decision making is hard!

The above accuracy comparisons were computed on the prediction intervals. But we can also just compare pointwise forecasts across a wide range of metrics:

test_fc_complex %>%
  accuracy(test) %>%
  arrange(RMSE)
## # A tibble: 9 × 11
##   .model      id .type      ME   RMSE    MAE    MPE  MAPE  MASE RMSSE   ACF1
##   <chr>    <dbl> <chr>   <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 stlf         1 Test    -906.  7719.  5183.  -75.6  83.0   NaN   NaN -0.206
## 2 search       1 Test    4338. 11623.  5371.  -13.5  31.0   NaN   NaN  0.220
## 3 stepwise     1 Test    4562. 12019.  5913.  -18.4  37.6   NaN   NaN  0.302
## 4 tslm         1 Test    6821. 17958.  8050.  -12.3  32.9   NaN   NaN  0.378
## 5 snaive       1 Test    7982. 18229.  9741.  -17.2  53.9   NaN   NaN  0.542
## 6 ets          1 Test  -15075. 20602. 15075. -160.  160.    NaN   NaN  0.660
## 7 drift        1 Test    8486. 31413. 18776. -164.  201.    NaN   NaN  0.516
## 8 naive        1 Test   11597. 32528. 17889.  -99.3 143.    NaN   NaN  0.532
## 9 mean         1 Test   17745. 35926. 18164.   27.9  47.4   NaN   NaN  0.560

Now we see that the decomposition model is regarded as the best. However, as the table shows, interpretation of different error metrics can lead to different decisions regarding the best performing model. Regardless, the overarching finding here is that more complex models appear to perform better than simple models – even from a pointwise perspective.

3.3 Forecast combinations and method comparisons

So far, we have directly compared multiple individual forecasting methods. But what about if we combined them? Could a composite model outperform any individual forecast method? As it turns out, this 1989 review showed that combining multiple forecasts often leads to increased forecast accuracy, and that for most applications, averaging the different forecast methods can provide substantial performance improvement. Now, this is again a step up in complexity, but if the performance gains truly are that meaningful, the increase may be worth it. Further, if the underlying methods used in the composite forecast are themselves interpretable, then producing an average of them should not harm overall interpretability. Let’s try a combination of the search ARIMA and the time-series linear model and see how well it does:

fit_train_composite <- train %>%
  model(
    mean = MEAN(log(cases)),
    naive = NAIVE(log(cases)),
    snaive = SNAIVE(log(cases) ~ lag("year")),
    drift = RW(log(cases) ~ drift()),
    stepwise = ARIMA(log(cases)),
    search = ARIMA(log(cases), stepwise = FALSE, approx = FALSE),
    ets = ETS(log(cases)),
    stlf = decomposition_model(
      STL(log(cases) ~ trend(window = 21), robust = TRUE), 
      NAIVE(season_adjust)
      ), 
    tslm = TSLM(log(cases) ~ trend() +  season())
  ) %>%
    mutate(arima_tslm = (search + tslm) / 2)

Now, you may note that I haven’t just piped straight into the forecast function like we did in the previous sections. This is because fable is not currently able to combine different distribution types (e.g., a normal distribution for one component model and a Student’s \(t\) distribution for another component model), so we need to compute an empirical forecast distribution by simulating many futures from the model. We do this for all the models to ensure fair comparisons:

futures <- fit_train_composite %>%
  generate(h = nrow(test), new_data = test, times = 1000) %>%
  as_tibble() %>%
  group_by(month, .model) %>%
  summarise(dist = distributional::dist_sample(list(.sim))) %>%
  ungroup() %>%
  as_fable(index = month, key = .model,
           distribution = dist, response = "cases")

We can then compare the empirical forecast distributions using the same methods as before:

futures %>%
  accuracy(tsbl, measures = interval_accuracy_measures,
    level = 95) %>%
  arrange(winkler)
## # A tibble: 10 × 3
##    .model     .type winkler
##    <chr>      <chr>   <dbl>
##  1 arima_tslm Test   35734.
##  2 tslm       Test   37146.
##  3 search     Test   50872.
##  4 stepwise   Test   53267.
##  5 snaive     Test   63864.
##  6 naive      Test  205846.
##  7 stlf       Test  265227.
##  8 drift      Test  446284.
##  9 mean       Test  463255.
## 10 ets        Test  931365.

We see that the combination forecast of the search ARIMA and the time-series linear model performs the best, followed by the component models of the TSLM and both ARIMA specifications. The highest-ranked ‘simple’ forecasting method – seasonal naive – is ranked fifth best. This provides us with further evidence that the increase in complexity is warranted.

Finally, with a bit more code, we can produce a visualisation of the distribution of futures for the top composite model:

fit_train_composite %>%
  generate(h = nrow(test), new_data = test, times = 1000) %>%
  as_tibble() %>%
  group_by(month, .model) %>%
  summarise(dist = distributional::dist_sample(list(.sim))) %>%
  ungroup() %>%
  filter(.model == "arima_tslm") %>%
  mutate(id = 1) %>%
  dplyr::select(-c(.model)) %>%
  as_fable(index = month, key = id,
           distribution = dist, response = "cases") %>%
  autoplot(tsbl) +
  labs(x = "Month",
       y = "log-scaled lab confirmed influenza cases") +
  scale_y_log10() +
  guides(colour = guide_legend(title = "Forecast"))

Looks decent!

4 Final thoughts

Thank you for making it to the end! If you have any questions, please feel free to reach out to me via email. I intend to write a follow-up piece to this post exploring the magnitude of the impact of COVID-19 on influenza rates, so stay tuned for that!


  1. Doing this within fable::model is great because it will automatically back-transform forecasts and prediction intervals to the scale of interest for us.↩︎

  2. 2018 saw a strange decrease in average cases, so we will park this complexity for now.↩︎

  3. These become far more meaningful once we directly compare the simple and complex methods. This is coming up shortly!↩︎

  4. Note that this substantially increases computation time.↩︎


Ordinance: Make ordinal regression intuitive with marginal effects

Introduction

Howdy! Today we are going to be talking all things ordinal variables. Ordinal data is very common in applied settings. It may come in surveys where items are measured on a Likert scale (i.e., 1 = Strongly disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly agree). It may come in workforce data, where someone’s position in a company is captured. It may come in sports data, where we have finish positions in a race. Irrespective of the source, it would be disingenuous to model such data using a method that ignores its ordered categorical nature (such as fitting a linear regression model). Instead, we should use a tool developed specifically for the job.

Enter ordinal regression.

Regression for ordinal data

Ordinal regression is an extension of linear models much like a generalised linear model (GLM) is. The architecture is basically the same: linear predictors are constructed from the covariates, which are then used to model the probability of membership of each level of the ordinal outcome variable. There are several ways to go about doing this: (i) we can model the levels cumulatively; (ii) adjacently; or (iii) sequentially. Regardless, we often see the trusty old logit used as the link function, since its inverse maps values onto the \([0,1]\) domain of probabilities.

For our purposes, we are going to adopt the cumulative approach since it is the most common. A cumulative model assumes our outcome variable arises from an unknown latent continuous variable, with each observed level defined by breakpoints (thresholds)1. Put another way, a cumulative ordinal model uses cumulative probabilities up to a threshold, thereby making the whole range of ordinal categories binary at each threshold. In practice, such an approach is typically implemented using proportional odds logistic regression. But more on that later.

Mathematically, we can express the cumulative logit model as2:

\[ \log \left( \frac{P(Y \leq j)}{P(Y > j)} \right) = \log \left(\frac{P(Y \leq j)}{1 -P(Y \leq j)} \right) = \log \left( \frac{\pi_{1} + \dots + \pi_{j}}{\pi_{j+1} + \dots + \pi_{J}} \right) \]

where our response variable \(Y\) is represented by its levels \(Y = 1, 2, \dots, J\) and the associated probabilities are \(\pi_{1}, \dots, \pi_{J}\). The above equation describes the log-odds of the event that \(Y \leq j\): it measures how likely the response is to be in category \(j\) or below versus in a category higher than \(j\).
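
To see how this machinery generates probabilities, here is a minimal sketch of how a cumulative logit model converts thresholds and a linear predictor into category probabilities (the threshold and linear predictor values are assumptions purely for illustration):

thresholds <- c(-2, 0, 1.5)  # cut-points separating a 4-level ordinal outcome
eta <- 0.8                   # linear predictor for one observation

cum_probs <- plogis(thresholds - eta)  # P(Y <= j) for j = 1, 2, 3
cat_probs <- diff(c(0, cum_probs, 1))  # P(Y = j) for j = 1, ..., 4
round(cat_probs, 3)                    # the four category probabilities sum to 1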

For those familiar with GLMs, you already know how difficult log-odds are to interpret. The difficulty usually remains even after transforming to something like odds. And all of that relates to just one (!!!) of the who-knows-how-many predictor variables you have! So what can you do? Clearly, you need to adopt a more appropriate analytical strategy.

The case for marginal effects

As stated above, interpreting regression models is hard! Especially when we have non-normal likelihoods that require link functions and/or models with numerous covariates. In case it wasn’t already obvious: a coefficient in a complex regression model does not tell you what you think it does.

Why is that? The coefficient is just a parameter. In the simple case of linear regression with one predictor, the coefficient tells us what we need to know. But for all other purposes, its meaningfulness is greatly diminished because it does not account for the values of the other covariates in the model. In order to interpret the effect of a given predictor on the outcome variable, we need to isolate it.

I’m going to borrow a brilliant analogy from this excellent post by Andrew Heiss. Think of your regression model as an audio mixing board. Every covariate/predictor is represented on the board. Continuous predictors (such as income) are represented by sliders and categorical predictors (such as sex) are represented as switches. If we want to determine the effect of one of these switches or sliders on our outcome, we need to fix the others (i.e., hold them constant), move only the one of interest, and observe how the outcome value changes. Mathematically, this is called a partial derivative, and is represented in notation as \(\frac{\partial y}{\partial x}\). This leads us to the formal definition of a marginal effect: a marginal effect is a partial derivative from a regression equation.

Note that for categorical predictors (i.e., switches), this effect is actually called a ‘conditional effect’ or a ‘group contrast’ since a non-continuous variable does not have a slope/rate of change (i.e., a derivative). Instead, we compare the difference in means for each level of the switch (i.e., each ‘condition’).

Now, all of that should already (hopefully) sound appealing, but it gets even better! Marginal effects and the associated predictions are usually3 calculated and reported on the outcome scale of interest. This is a big deal so let me explain further. We do not need to make inferences in terms of log-odds or some other confusing measurement scale. In the case of ordinal regression4, we can instead talk about something far more intuitive and elegant: probabilities.

That’s right. The marginal effects of an ordinal regression model are expressed in terms of the probability of belonging to each level of the outcome variable. Beautiful stuff.

The dataset

Enough math and explanations! What data are we looking at? For this tutorial, we are going to explore the 2022 Australian Election Study (the ‘AES’). The AES examines important issues such as changing support for political parties, the role of the party leaders, and the importance of political issues like the economy and the environment. We are going to examine one item in particular:

Generally speaking, how much interest do you usually have in what’s going on in politics?

This item is measured on a four-point scale:

  1. A good deal
  2. Some
  3. Not much
  4. None

Clearly, we have an ordinal context. Since we want to explore differences in political interest5, we need some variables to explore differences over. We are going to use the following as predictors:

  • Age
  • Gender
  • Sexual orientation
  • Years of tertiary study
  • Employment status
  • Managerial status at work

Let’s load the R packages we need to get started:

library(dplyr)
library(tidyr)
library(ggplot2)
library(MASS)
library(marginaleffects)

Data preprocessing

Getting the dataset in a nice and clean format for what we need takes a little bit of work so bear with me here. The below code just reads in the file (which I have conveniently stored on my GitHub), filters out erroneous values, and recodes factor variables from their integer codes to qualitative labels to make our lives a little easier when visualising results. Since data cleaning isn’t the focus of this tutorial, feel free to just run this code and not dig in too deep.

aes22 <- read.csv("https://raw.githubusercontent.com/hendersontrent/hendersontrent.github.io/master/files/aes22_unrestricted_v2.csv") %>%
  dplyr::select(c(ID, A1, G2, G4, G5_D, H1, H1b, H2)) %>%
  rename(years_tertiary_study = G2,
         work = G4,
         managerial = G5_D,
         gender = H1,
         sexual_orientation = H1b,
         dob = H2) %>%
  filter(A1 %in% c(1, 2, 3, 4)) %>%
  filter(years_tertiary_study != 999) %>%
  filter(work %in% c(1, 2, 3, 4, 5, 6, 7)) %>%
  filter(managerial %in% c(1, 2, 3, 4, 5)) %>%
  filter(gender %in% c(1, 2, 3)) %>%
  filter(sexual_orientation %in% c(1, 2, 3, 4, 5)) %>%
  filter(dob != 999) %>%
  filter(dob != -97) %>%
  mutate(age = as.integer(2022 - dob)) %>% # Age at 2022 survey
  dplyr::select(-c(dob)) %>%
  mutate(gender = case_when(
          gender == "1" ~ "Male",
          gender == "2" ~ "Female",
          gender == "3" ~ "Other")) %>%
  mutate(sexual_orientation = case_when(
    sexual_orientation == "1" ~ "Heterosexual or Straight",
    sexual_orientation == "2" ~ "Gay or Lesbian",
    sexual_orientation == "3" ~ "Bisexual",
    sexual_orientation == "4" ~ "Other",
    sexual_orientation == "5" ~ "Prefer not to say")) %>%
  mutate(work = case_when(
    work == "1" ~ "Working full-time for pay",
    work == "2" ~ "Working part-time for pay",
    work == "3" ~ "Unemployed - looking for full-time work",
    work == "4" ~ "Unemployed - looking for part-time work",
    work == "5" ~ "Retired from paid work",
    work == "6" ~ "A full-time school or university student",
    work == "7" ~ "Family or caring responsibilities")) %>%
  mutate(managerial = case_when(
    managerial == "1" ~ "Upper managerial",
    managerial == "2" ~ "Middle managerial",
    managerial == "3" ~ "Lower managerial",
    managerial == "4" ~ "Supervisory",
    managerial == "5" ~ "Non-supervisory")) %>%
  mutate(A1 = case_when(
          A1 == "4" ~ "None",
          A1 == "3" ~ "Not much",
          A1 == "2" ~ "Some",
          A1 == "1" ~ "A good deal")) %>%
  mutate(A1 = factor(A1, levels = c("None", "Not much", "Some", "A good deal"), ordered = TRUE),
         managerial = factor(managerial, 
                             levels = c("Non-supervisory", "Supervisory", 
                                        "Lower managerial", "Middle managerial", 
                                        "Upper managerial"), 
                             ordered = TRUE),
         gender = as.factor(gender),
         sexual_orientation = as.factor(sexual_orientation),
         work = as.factor(work),
         years_tertiary_study = as.integer(years_tertiary_study))

Statistical modelling

Typically, we would do extensive exploratory data analysis prior to conducting any statistical modelling. But this is a modelling and marginal effects tutorial and not a research paper, so we will skip it here6. Please do not skip this stage in your own work!

The easiest function for fitting a proportional odds logistic regression (i.e., ordinal regression) in R is the polr function in the MASS package7. The interface is the same as other R regression functions, such as lm. Note that by default, polr uses the logit link function, so we do not need to explicitly specify it. Here is the model:

mod <- polr(A1 ~ years_tertiary_study + work + managerial +
              gender + sexual_orientation + age,
            data = aes22)

We are also going to skip over model diagnostics and checks and all of that here so we can get to the crux of the tutorial content. We can print a quick summary of mod to obtain coefficient values, intercepts for each cumulative threshold, and other information:

summary(mod)
## Call:
## polr(formula = A1 ~ years_tertiary_study + work + managerial + 
##     gender + sexual_orientation + age, data = aes22)
## 
## Coefficients:
##                                                Value Std. Error t value
## years_tertiary_study                         0.09898   0.016491  6.0022
## workFamily or caring responsibilities       -0.88470   0.422507 -2.0939
## workRetired from paid work                  -0.44590   0.394436 -1.1305
## workUnemployed - looking for full-time work -0.40188   0.573699 -0.7005
## workUnemployed - looking for part-time work -0.69857   0.566880 -1.2323
## workWorking full-time for pay               -0.64700   0.362709 -1.7838
## workWorking part-time for pay               -0.78457   0.371394 -2.1125
## managerial.L                                 0.51695   0.100317  5.1531
## managerial.Q                                 0.06986   0.105543  0.6619
## managerial.C                                -0.02053   0.105220 -0.1951
## managerial^4                                -0.34185   0.115103 -2.9699
## genderMale                                   0.40115   0.095092  4.2185
## genderOther                                  2.65070   1.329392  1.9939
## sexual_orientationGay or Lesbian            -0.51216   0.423282 -1.2100
## sexual_orientationHeterosexual or Straight  -1.13469   0.313564 -3.6187
## sexual_orientationOther                     -1.75812   0.554949 -3.1681
## age                                          0.03101   0.003983  7.7852
## 
## Intercepts:
##                  Value   Std. Error t value
## None|Not much    -3.8626  0.4946    -7.8092
## Not much|Some    -1.3382  0.4639    -2.8846
## Some|A good deal  0.8813  0.4635     1.9014
## 
## Residual Deviance: 3769.865 
## AIC: 3809.865

However, as noted above, these coefficients are prohibitively difficult to interpret directly. So, we are going to turn to our friend marginal effects for emotional and computational support.

Marginal effects

Marginal effects can be calculated in R a few different ways and often with very different methods/quantities behind the scenes. We are going to use the excellent marginaleffects package for our purposes. Seriously, please visit that link and look through the incredibly detailed documentation. It’s basically my web browser home page at this point.

The marginaleffects package affords an honestly astonishing amount of functionality. As a starting point, we are going to focus on three types of quantities:

  1. Predictions
  2. Slopes
  3. Comparisons

Predictions

In the context of marginaleffects, an adjusted prediction is the outcome predicted by a fitted model on a specified scale for a given combination of values of the predictor variables, such as their observed values, their means, or factor levels. That’s a lot of words to say that we can calculate predicted values for any given subset of the data used in the model. In the most extreme case, we can calculate predictions for a given predictor which are averaged over the values of all other predictors. This procedure is extremely useful for answering questions in practice, since we often want to account for the values of other variables while exploring the effects of the focal variable, rather than just holding the others constant.

In marginaleffects, this is easy through the predictions function (and the associated plot_predictions visualisation function). Let’s examine age. We’ll be sure to specify that we want our output (i.e., \(y\)-axis) to be probabilities:

plot_predictions(mod, condition = "age", type = "probs") + 
  facet_wrap(~factor(group, levels = c("None", "Not much", "Some", "A good deal")), 
             nrow = 1)

Beautiful! We see that as age increases, the predicted probability of rating political interest as ‘A good deal’ increases almost linearly. Compare that to ‘Not much’, where as age increases, the predicted probability of scarce political interest decreases. These results are far, far more interpretable and immediately practically meaningful than trying to dredge understanding out of log-odds coefficients.

Let’s look at a few more examples. Here is years of tertiary study:

plot_predictions(mod, condition = "years_tertiary_study", type = "probs") + 
  facet_wrap(~factor(group, levels = c("None", "Not much", "Some", "A good deal")), 
             nrow = 1)

We see similar results to age which is somewhat unsurprising.

How about categorical predictors? Here is gender:

plot_predictions(mod, condition = "gender", type = "probs") + 
  facet_wrap(~factor(group, levels = c("None", "Not much", "Some", "A good deal")), 
             nrow = 1)

As expected, we do not get nice smooth curves like we did for numerical variables, but instead get point estimates for each factor level along with the associated confidence intervals. From this plot, it appears that respondents with a gender outside the binary system may be more likely to express ‘A good deal’ of political interest than males or females. Again, how intuitive is this to interpret!
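
If we want the numbers behind a plot like this, one option is to average the predicted probabilities over the observed data within each gender. Here is a minimal sketch (assuming avg_predictions accepts our polr model just as the other marginaleffects functions do):

avg_predictions(mod, by = "gender", type = "probs")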

Let’s do one more. Here is employment managerial status:

plot_predictions(mod, condition = "managerial", type = "probs") + 
  facet_wrap(~factor(group, levels = c("None", "Not much", "Some", "A good deal")), 
             nrow = 1) +
  theme(axis.text.x = element_text(angle = 90))

We see that respondents in middle and upper managerial positions may be more likely to express ‘A good deal’ of political interest than others. Interestingly, probabilities of rating ‘Some’, ‘Not much’, and ‘None’ are quite similar between managerial positions.

A more specific example

The above plots are both interesting and useful, but we need to go deeper. In other words, this is where the fun begins8.

Consider the following hypothetical AES respondent: A man who studied for 12 years at university, currently works full-time for pay, is a middle manager at work, is bisexual, and is 30 years old. We can define this individual in a data frame:

respondent <- data.frame(years_tertiary_study = 12, work = "Working full-time for pay",
                         managerial = "Middle managerial", sexual_orientation = "Bisexual",
                         age = 30, gender = "Male")

We can easily calculate the expected predicted probability for each outcome level (political interest) for this person:

predictions(mod, newdata = respondent)
## 
##        Group Estimate Std. Error     z Pr(>|z|)     S    2.5 %  97.5 %
##  None         0.00234   0.000933  2.51  0.01222   6.4 0.000509 0.00417
##  Not much     0.02609   0.009061  2.88  0.00399   8.0 0.008328 0.04385
##  Some         0.18370   0.048728  3.77  < 0.001  12.6 0.088195 0.27920
##  A good deal  0.78787   0.058370 13.50  < 0.001 135.5 0.673472 0.90228
##  years_tertiary_study                      work        managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##  sexual_orientation age gender
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
## 
## Columns: rowid, group, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

We see that the probability this person rates their political interest as ‘A good deal’ is \(79\%\), ‘Some’ is \(18\%\), ‘Not much’ is \(3\%\), and ‘None’ is almost \(0\%\). How cool is that!

Slopes

Slopes. Partial derivatives. Marginal effects. The rate at which the probability of our outcome changes with respect to a given variable.

Calculation of these quantities is made easy in marginaleffects through the slopes() function. Continuing with our hypothetical respondent, we can calculate the slopes for a given variable. Here is age:

slopes(mod, variables = "age", newdata = respondent)
## 
##        Group Term   Estimate Std. Error     z Pr(>|z|)    S     2.5 %    97.5 %
##  None         age -0.0000723  0.0000302 -2.39  0.01680  5.9 -0.000132 -0.000013
##  Not much     age -0.0007839  0.0002842 -2.76  0.00581  7.4 -0.001341 -0.000227
##  Some         age -0.0043257  0.0009707 -4.46  < 0.001 16.9 -0.006228 -0.002423
##  A good deal  age  0.0051820  0.0012653  4.10  < 0.001 14.5  0.002702  0.007662
##  years_tertiary_study                      work        managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##  sexual_orientation age gender
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
## 
## Columns: rowid, group, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

And here is years of tertiary study:

slopes(mod, variables = "years_tertiary_study", newdata = respondent)
## 
##        Group                 Term  Estimate Std. Error     z Pr(>|z|)    S
##  None        years_tertiary_study -0.000231  0.0000873 -2.64  0.00820  6.9
##  Not much    years_tertiary_study -0.002502  0.0007962 -3.14  0.00167  9.2
##  Some        years_tertiary_study -0.013808  0.0027294 -5.06  < 0.001 21.2
##  A good deal years_tertiary_study  0.016541  0.0035167  4.70  < 0.001 18.6
##      2.5 %     97.5 % years_tertiary_study                      work
##  -0.000402 -0.0000597                   12 Working full-time for pay
##  -0.004063 -0.0009418                   12 Working full-time for pay
##  -0.019158 -0.0084586                   12 Working full-time for pay
##   0.009649  0.0234341                   12 Working full-time for pay
##         managerial sexual_orientation age gender
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
## 
## Columns: rowid, group, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

Alternatively, we can calculate average marginal effects using avg_slopes(), which repeats the same computation as the slopes() call for every row of the original dataset and then reports the average slope for each level of the outcome variable:

avg_slopes(mod)
## 
##        Group                 Term
##  None        years_tertiary_study
##  None        work                
##  None        work                
##  None        work                
##  None        work                
##                                                                            Contrast
##  dY/dX                                                                             
##  Family or caring responsibilities - A full-time school or university student      
##  Retired from paid work - A full-time school or university student                 
##  Unemployed - looking for full-time work - A full-time school or university student
##  Unemployed - looking for part-time work - A full-time school or university student
##  Estimate Std. Error      z Pr(>|z|)    S     2.5 %    97.5 %
##  -0.00151   0.000365 -4.126   <0.001 14.7 -0.002222 -0.000791
##   0.01152   0.005575  2.066   0.0388  4.7  0.000591  0.022446
##   0.00461   0.003619  1.273   0.2030  2.3 -0.002486  0.011700
##   0.00406   0.006257  0.649   0.5165  1.0 -0.008205  0.016323
##   0.00823   0.007811  1.054   0.2918  1.8 -0.007075  0.023545
## --- 58 rows omitted. See ?print.marginaleffects --- 
##  A good deal gender              
##  A good deal sexual_orientation  
##  A good deal sexual_orientation  
##  A good deal sexual_orientation  
##  A good deal age                 
##                                                                            Contrast
##  Other - Female                                                                    
##  Gay or Lesbian - Bisexual                                                         
##  Heterosexual or Straight - Bisexual                                               
##  Other - Bisexual                                                                  
##  dY/dX                                                                             
##  Estimate Std. Error      z Pr(>|z|)    S     2.5 %    97.5 %
##   0.47911   0.134333  3.567   <0.001 11.4  0.215821  0.742398
##  -0.10433   0.085494 -1.220   0.2224  2.2 -0.271892  0.063240
##  -0.23807   0.060119 -3.960   <0.001 13.7 -0.355904 -0.120243
##  -0.36413   0.104987 -3.468   <0.001 10.9 -0.569901 -0.158358
##   0.00661   0.000817  8.094   <0.001 50.6  0.005012  0.008215
## Columns: group, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high

Together, these functions enable us to infer a great many things about how the rate of change of the probability of outcome category membership varies as a function of a given predictor variable. Useful stuff!

Comparisons

So far, we have computed adjusted predictions and slopes. What else can we do? A whole lot, but for now we’re just going to explore comparisons. We are going to continue using our hypothetical AES respondent from above.

We can ask lots of interesting questions about this person from a comparative lens, such as ‘what happens to the predicted probability of their political interest if we increased their age by 1?’ Or ‘what happens to the predicted probability of their political interest if they were instead female?’ The comparisons function provides answers to these questions.

Let’s calculate the comparisons first:

comps <- comparisons(mod, newdata = respondent)
head(comps)
## 
##  Group                 Term
##   None years_tertiary_study
##   None work                
##   None work                
##   None work                
##   None work                
##   None work                
##                                                                            Contrast
##  +1                                                                                
##  Family or caring responsibilities - A full-time school or university student      
##  Retired from paid work - A full-time school or university student                 
##  Unemployed - looking for full-time work - A full-time school or university student
##  Unemployed - looking for part-time work - A full-time school or university student
##  Working full-time for pay - A full-time school or university student              
##   Estimate Std. Error      z Pr(>|z|)   S     2.5 %     97.5 %
##  -0.000231  0.0000874 -2.644   0.0082 6.9 -0.000402 -0.0000597
##   0.001738  0.0010737  1.619   0.1055 3.2 -0.000366  0.0038425
##   0.000687  0.0006028  1.140   0.2542 2.0 -0.000494  0.0018689
##   0.000605  0.0009575  0.632   0.5274 0.9 -0.001271  0.0024818
##   0.001236  0.0012819  0.964   0.3350 1.6 -0.001277  0.0037485
##   0.001113  0.0006367  1.747   0.0806 3.6 -0.000135  0.0023604
##  years_tertiary_study                      work        managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##  sexual_orientation age gender
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
## 
## Columns: rowid, group, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

That’s a lot of information! Let’s narrow it down to make it a bit easier to interrogate. Here is age:

comps %>%
  filter(term == "age")
## 
##        Group Term Contrast   Estimate Std. Error     z Pr(>|z|)    S     2.5 %
##  None         age       +1 -0.0000723  0.0000303 -2.39  0.01680  5.9 -0.000132
##  Not much     age       +1 -0.0007841  0.0002842 -2.76  0.00581  7.4 -0.001341
##  Some         age       +1 -0.0043260  0.0009706 -4.46  < 0.001 16.9 -0.006228
##  A good deal  age       +1  0.0051824  0.0012654  4.10  < 0.001 14.5  0.002702
##     97.5 % years_tertiary_study                      work        managerial
##  -0.000013                   12 Working full-time for pay Middle managerial
##  -0.000227                   12 Working full-time for pay Middle managerial
##  -0.002424                   12 Working full-time for pay Middle managerial
##   0.007662                   12 Working full-time for pay Middle managerial
##  sexual_orientation age gender
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
## 
## Columns: rowid, group, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

We see that an increase of 1 year of age (as evidenced by Contrast being +1) is associated with decreases in the probability of rating political interest as ‘None’, ‘Not much’, and ‘Some’, but an increase in the probability of rating ‘A good deal’ by about \(0.5\) percentage points. Neat!

What about gender? Let’s focus on the effect associated with a change from Male to Female:

comps %>%
  filter(term == "gender") %>%
  filter(grepl("Male", contrast))
## 
##        Group   Term      Contrast Estimate Std. Error     z Pr(>|z|)    S
##  None        gender Male - Female -0.00115   0.000511 -2.25   0.0245  5.4
##  Not much    gender Male - Female -0.01229   0.004782 -2.57   0.0102  6.6
##  Some        gender Male - Female -0.06123   0.016425 -3.73   <0.001 12.3
##  A good deal gender Male - Female  0.07467   0.021208  3.52   <0.001 11.2
##     2.5 %    97.5 % years_tertiary_study                      work
##  -0.00215 -0.000148                   12 Working full-time for pay
##  -0.02166 -0.002919                   12 Working full-time for pay
##  -0.09342 -0.029034                   12 Working full-time for pay
##   0.03310  0.116233                   12 Working full-time for pay
##         managerial sexual_orientation age gender
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
## 
## Columns: rowid, group, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

Changing gender to Female and keeping everything else the same is associated with decreases in the probability of rating political interest as ‘None’, ‘Not much’, and ‘Some’, but an increase in the probability of rating ‘A good deal’ (by about \(7\) percentage points). Hopefully now it’s evident just how intuitive complex models like ordinal regression become when interpreted using appropriate tools!

Final thoughts

Thanks for making it to the end! Hopefully this post has inspired you to leave the world of coefficient interpretation in favour of more desirable shores. As an aside, this dataset is incredibly interesting, so maybe I will revisit it sometime. There is a panel data version where respondents across time were tracked, so maybe I will do something with that!

As always, please feel free to reach out to me via email if you have any questions or just want to talk about stats.


  1. Agresti, A. (2002) Categorical Data. Second edition. Wiley.↩︎

  2. https://online.stat.psu.edu/stat504/lesson/8/8.4↩︎

  3. But not always. Sometimes you may have use for values on the link scale or otherwise.↩︎

  4. The beauty of outcome scales extends to other models, too. For example, marginal effects calculated on a Poisson model will give you results in terms of counts.↩︎

  5. This is my high-level summary term for what the survey item measures.↩︎

  6. I did do some for my own sanity, however.↩︎

  7. Although I think everyone should go Bayes and use Stan or brms.↩︎

  8. Okay, okay. I’ll stop!↩︎

regression

Ordinance: Make ordinal regression intuitive with marginal effects

26 minute read

Published:

Introduction

Howdy! Today we are going to be talking all things ordinal variables. Ordinal data is very common in applied settings. It may come in surveys where items are measured on a Likert scale (i.e., 1 = Strongly disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly agree). It may come in workforce data, where someone’s position in a company is captured. It may come in sports data, where we have finish positions in a race. Irrespective of the source, it would be disingenuous to model such data using a method that ignores its ordered categorical nature (such as fitting a linear regression model). Instead, we should use a tool developed specifically for the job.

Enter ordinal regression.

Regression for ordinal data

Ordinal regression is an extension of linear models much like a generalised linear model (GLM) is. The architecture is basically the same: linear predictors are constructed from the covariates, and these are then used to model the probability of membership of each level of the ordinal outcome variable. There are several ways to go about doing this: (i) we can model the levels cumulatively; (ii) adjacently; or (iii) sequentially. Regardless, we often see the trusty old logit used as the link function since it converts values into the \([0,1]\) domain of probabilities.

For our purposes, we are going to adopt the cumulative approach since it is the most common. A cumulative model assumes our outcome variable is some unknown latent variable where each level is represented by breakpoints1. Put another way, a cumulative ordinal model uses cumulative probabilities up to a threshold, thereby making the whole range of ordinal categories binary at each threshold. In practice, such an approach is typically implemented using proportional odds logistic regression. But more on that later.

Mathematically, we can express the cumulative logit model as2:

\[ \log \left( \frac{P(Y \leq j)}{P(Y > j)} \right) = \log \left(\frac{P(Y \leq j)}{1 -P(Y \leq j)} \right) = \log \left( \frac{\pi_{1} + \dots + \pi_{j}}{\pi_{j+1} + \dots + \pi_{J}} \right) \]

where our response variable \(Y\) is represented by its levels \(Y = 1, 2, \dots, J\) and the associated probabilities are \(\pi_{1}, \dots, \pi_{J}\). The above equation describes the log-odds of the event that \(Y \leq j\), and measures how likely the response is to be in category \(j\) or below versus in a category higher than \(j\).
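To make the threshold idea concrete, here is a tiny illustrative sketch. The cutpoints and the linear predictor below are made up, and the parameterisation follows the \(\text{logit} \, P(Y \leq j) = \zeta_{j} - \eta\) convention used by MASS::polr (which we use later):

cutpoints <- c(-2, 0, 1.5)            # J - 1 = 3 thresholds for J = 4 categories
eta <- 0.8                            # made-up linear predictor for one observation

cum_probs <- plogis(cutpoints - eta)  # cumulative probabilities P(Y <= j) for j = 1, 2, 3
probs <- diff(c(0, cum_probs, 1))     # per-category probabilities P(Y = j)
probs                                 # four probabilities that sum to 1

Each threshold turns the ordinal outcome into a binary split, and differencing the cumulative probabilities recovers the probability of each individual category.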

For those familiar with GLMs, you already know how difficult log-odds are to interpret. The difficulty usually remains even after transforming them to something like odds. And all of that relates to just one (!!!) of the who-knows-how-many predictor variables you have! So what can you do? Clearly, you need to adopt a more appropriate analytical strategy.

The case for marginal effects

As stated above, interpreting regression models is hard! Especially when we have non-normal likelihoods that require link functions and/or models with numerous covariates. In case it wasn’t already obvious: a coefficient in a complex regression model does not tell you what you think it does.

Why is that? The coefficient is just a parameter. In the simple case of linear regression with one predictor, the coefficient tells us what we need to know. But for all other purposes, its meaningfulness is greatly diminished because it does not account for the values of the other covariates in the model. In order to interpret the effect of a given predictor on the outcome variable, we need to isolate it.

I’m going to borrow an analogy from this excellent post by Andrew Heiss. Think of your regression model as an audio mixing board. Every covariate/predictor is represented on the board. Continuous predictors (such as income) are represented by sliders and categorical predictors (such as sex) are represented as switches. If we want to determine the effect of one of these switches or sliders on our outcome, we need to fix the others (i.e., hold them constant), move only the one of interest, and observe how the outcome value changes. Mathematically, this is called a partial derivative, and is represented in notation as \(\frac{\partial y}{\partial x}\). This leads us to the formal definition of a marginal effect: a marginal effect is a partial derivative from a regression equation.

Note that for categorical predictors (i.e., switches), this effect is actually called a ‘conditional effect’ or a ‘group contrast’ since a non-continuous variable does not have a slope/rate of change (i.e., a derivative). Instead, we compare the difference in means for each level of the switch (i.e., each ‘condition’).
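To make the mixing board tangible, here is a rough sketch of the two cases in base R. Everything here is hypothetical (the fitted model fit and the baseline row are placeholders, and real tooling such as the marginaleffects package handles this, plus the uncertainty, for us):

# A hypothetical row of data with every slider and switch fixed at a chosen value
baseline <- data.frame(age = 30, gender = "Male", income = 50000)

# Slider (continuous predictor): nudge it slightly and look at the rate of change
nudged <- transform(baseline, age = age + 0.001)
(predict(fit, newdata = nudged, type = "probs") -
    predict(fit, newdata = baseline, type = "probs")) / 0.001

# Switch (categorical predictor): flip it and compare the two conditions
flipped <- transform(baseline, gender = "Female")
predict(fit, newdata = flipped, type = "probs") -
  predict(fit, newdata = baseline, type = "probs")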

Now, all of that should already (hopefully) sound appealing, but it gets even better! Marginal effects and the associated predictions are usually3 calculated and reported on the outcome scale of interest. This is a big deal so let me explain further. We do not need to make inferences in terms of log-odds or some other confusing measurement scale. In the case of ordinal regression4, we can instead talk about something far more intuitive and elegant: probabilities.

That’s right. The marginal effects of an ordinal regression model are expressed in terms of the probability of belonging to each level of the outcome variable. Beautiful stuff.

The dataset

Enough math and explanations! What data are we looking at? For this tutorial, we are going to explore the 2022 Australian Election Study (the ‘AES’). The AES examines important issues such as changing support for political parties, the role of the party leaders, and the importance of political issues like the economy and the environment. We are going to examine one item in particular:

Generally speaking, how much interest do you usually have in what’s going on in politics?

This item is measured on a four-point scale:

  1. A good deal
  2. Some
  3. Not much
  4. None

Clearly, we have an ordinal context. Since we want to explore differences in political interest5, we need some variables to explore differences over. We are going to use the following as predictors:

  • Age
  • Gender
  • Sexual orientation
  • Years of tertiary study
  • Employment status
  • Managerial status at work

Let’s load the R packages we need to get started:

library(dplyr)
library(tidyr)
library(ggplot2)
library(MASS)
library(marginaleffects)

Data preprocessing

Getting the dataset in a nice and clean format for what we need takes a little bit of work so bear with me here. The below code just reads in the file (which I have conveniently stored on my GitHub), filters out erroneous values, and recodes factor variables from their integer codes to qualitative labels to make our lives a little easier when visualising results. Since data cleaning isn’t the focus of this tutorial, feel free to just run this code and not dig in too deep.

aes22 <- read.csv("https://raw.githubusercontent.com/hendersontrent/hendersontrent.github.io/master/files/aes22_unrestricted_v2.csv") %>%
  dplyr::select(c(ID, A1, G2, G4, G5_D, H1, H1b, H2)) %>%
  rename(years_tertiary_study = G2,
         work = G4,
         managerial = G5_D,
         gender = H1,
         sexual_orientation = H1b,
         dob = H2) %>%
  filter(A1 %in% c(1, 2, 3, 4)) %>%
  filter(years_tertiary_study != 999) %>%
  filter(work %in% c(1, 2, 3, 4, 5, 6, 7)) %>%
  filter(managerial %in% c(1, 2, 3, 4, 5)) %>%
  filter(gender %in% c(1, 2, 3)) %>%
  filter(sexual_orientation %in% c(1, 2, 3, 4, 5)) %>%
  filter(dob != 999) %>%
  filter(dob != -97) %>%
  mutate(age = as.integer(2022 - dob)) %>% # Age at 2022 survey
  dplyr::select(-c(dob)) %>%
  mutate(gender = case_when(
          gender == "1" ~ "Male",
          gender == "2" ~ "Female",
          gender == "3" ~ "Other")) %>%
  mutate(sexual_orientation = case_when(
    sexual_orientation == "1" ~ "Heterosexual or Straight",
    sexual_orientation == "2" ~ "Gay or Lesbian",
    sexual_orientation == "3" ~ "Bisexual",
    sexual_orientation == "4" ~ "Other",
    sexual_orientation == "5" ~ "Prefer not to say")) %>%
  mutate(work = case_when(
    work == "1" ~ "Working full-time for pay",
    work == "2" ~ "Working part-time for pay",
    work == "3" ~ "Unemployed - looking for full-time work",
    work == "4" ~ "Unemployed - looking for part-time work",
    work == "5" ~ "Retired from paid work",
    work == "6" ~ "A full-time school or university student",
    work == "7" ~ "Family or caring responsibilities")) %>%
  mutate(managerial = case_when(
    managerial == "1" ~ "Upper managerial",
    managerial == "2" ~ "Middle managerial",
    managerial == "3" ~ "Lower managerial",
    managerial == "4" ~ "Supervisory",
    managerial == "5" ~ "Non-supervisory")) %>%
  mutate(A1 = case_when(
          A1 == "4" ~ "None",
          A1 == "3" ~ "Not much",
          A1 == "2" ~ "Some",
          A1 == "1" ~ "A good deal")) %>%
  mutate(A1 = factor(A1, levels = c("None", "Not much", "Some", "A good deal"), ordered = TRUE),
         managerial = factor(managerial, 
                             levels = c("Non-supervisory", "Supervisory", 
                                        "Lower managerial", "Middle managerial", 
                                        "Upper managerial"), 
                             ordered = TRUE),
         gender = as.factor(gender),
         sexual_orientation = as.factor(sexual_orientation),
         work = as.factor(work),
         years_tertiary_study = as.integer(years_tertiary_study))

Statistical modelling

Typically, we would do extensive exploratory data analysis prior to conducting any statistical modelling. But this is a modelling and marginal effects tutorial and not a research paper, so we will skip it here6. Please do not skip this stage in your own work!

The easiest function for fitting a proportional odds logistic regression (i.e., ordinal regression) in R is the polr function in the MASS package7. The interface is the same as other R regression functions, such as lm. Note that by default, polr uses the logit link function, so we do not need to explicitly specify it. Here is the model:

mod <- polr(A1 ~ years_tertiary_study + work + managerial +
              gender + sexual_orientation + age,
            data = aes22)

We are also going to skip over model diagnostics and checks and all of that here so we can get to the crux of the tutorial content. We can print a quick summary of mod to obtain coefficient values, intercepts for each cumulative threshold, and other information:

summary(mod)
## Call:
## polr(formula = A1 ~ years_tertiary_study + work + managerial + 
##     gender + sexual_orientation + age, data = aes22)
## 
## Coefficients:
##                                                Value Std. Error t value
## years_tertiary_study                         0.09898   0.016491  6.0022
## workFamily or caring responsibilities       -0.88470   0.422507 -2.0939
## workRetired from paid work                  -0.44590   0.394436 -1.1305
## workUnemployed - looking for full-time work -0.40188   0.573699 -0.7005
## workUnemployed - looking for part-time work -0.69857   0.566880 -1.2323
## workWorking full-time for pay               -0.64700   0.362709 -1.7838
## workWorking part-time for pay               -0.78457   0.371394 -2.1125
## managerial.L                                 0.51695   0.100317  5.1531
## managerial.Q                                 0.06986   0.105543  0.6619
## managerial.C                                -0.02053   0.105220 -0.1951
## managerial^4                                -0.34185   0.115103 -2.9699
## genderMale                                   0.40115   0.095092  4.2185
## genderOther                                  2.65070   1.329392  1.9939
## sexual_orientationGay or Lesbian            -0.51216   0.423282 -1.2100
## sexual_orientationHeterosexual or Straight  -1.13469   0.313564 -3.6187
## sexual_orientationOther                     -1.75812   0.554949 -3.1681
## age                                          0.03101   0.003983  7.7852
## 
## Intercepts:
##                  Value   Std. Error t value
## None|Not much    -3.8626  0.4946    -7.8092
## Not much|Some    -1.3382  0.4639    -2.8846
## Some|A good deal  0.8813  0.4635     1.9014
## 
## Residual Deviance: 3769.865 
## AIC: 3809.865

However, as noted above, these coefficients are prohibitively difficult to interpret. So, we are going to turn to our friend marginal effects for emotional and computational support.

Marginal effects

Marginal effects can be calculated in R a few different ways and often with very different methods/quantities behind the scenes. We are going to use the excellent marginaleffects package for our purposes. Seriously, please visit that link and look through the incredibly detailed documentation. It’s basically my web browser home page at this point.

The marginaleffects package affords an honestly astonishing amount of functionality. As a starting point, we are going to focus on three types of quantities:

  1. Predictions
  2. Slopes
  3. Comparisons

Predictions

In the context of marginaleffects, an adjusted prediction is the outcome predicted by a fitted model on a specified scale for a given combination of values of the predictor variables, such as their observed values, their means, or factor levels. That’s a lot of words to say that we can calculate predicted values for any given subset of the data used in the model. In the most extreme case, we can calculate predictions for a given predictor which are averaged over the values of all other predictors. This procedure is extremely useful for answering questions in practice, since we often want to account for the values of other variables while exploring the effects of the focal variable, rather than just holding the others constant.
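As a quick sketch of that idea using the model we fit above, something like the following should return the average predicted probability of each political interest level by gender, averaging unit-level predictions over the observed values of all other predictors (treat the exact call as an assumption and check the marginaleffects documentation):

avg_predictions(mod, by = "gender", type = "probs")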

In marginaleffects, this is easy through the predictions function (and the associated plot_predictions visualisation function). Let’s examine age. We’ll be sure to specify that we want our output (i.e., \(y\)-axis) to be probabilities:

plot_predictions(mod, condition = "age", type = "probs") + 
  facet_wrap(~factor(group, levels = c("None", "Not much", "Some", "A good deal")), 
             nrow = 1)

Beautiful! We see that as age increases, the predicted probability of rating political interest as ‘A good deal’ increases almost linearly. Compare that to ‘Not much’, where the predicted probability of scarce political interest decreases as age increases. These results are far, far more interpretable and immediately practically meaningful than trying to dredge understanding out of log-odds coefficients.

Let’s look at a few more examples. Here is years of tertiary study:

plot_predictions(mod, condition = "years_tertiary_study", type = "probs") + 
  facet_wrap(~factor(group, levels = c("None", "Not much", "Some", "A good deal")), 
             nrow = 1)

We see similar results to age, which is somewhat unsurprising.

How about categorical predictors? Here is gender:

plot_predictions(mod, condition = "gender", type = "probs") + 
  facet_wrap(~factor(group, levels = c("None", "Not much", "Some", "A good deal")), 
             nrow = 1)

As expected, we do not get nice smooth curves like we did for numerical variables but instead get point estimates for the factor level along with the associated confidence interval. From this plot, it appears that respondents with a gender outside the binary system may be more likely to express ‘A good deal’ of political interest than males or females. Again, how intuitive is this to interpret!

Let’s do one more. Here is employment managerial status:

plot_predictions(mod, condition = "managerial", type = "probs") + 
  facet_wrap(~factor(group, levels = c("None", "Not much", "Some", "A good deal")), 
             nrow = 1) +
  theme(axis.text.x = element_text(angle = 90))

We see that respondents in middle and upper managerial positions may be more likely to express ‘A good deal’ of political interest than others. Interestingly, probabilities of rating ‘Some’, ‘Not much’, and ‘None’ are quite similar between managerial positions.

A more specific example

The above plots are both interesting and useful, but we need to go deeper. In other words, this is where the fun begins8.

Consider the following hypothetical AES respondent: A man who studied for 12 years at university, currently works full-time for pay, is a middle manager at work, is bisexual, and is 30 years old. We can define this individual in a data frame:

respondent <- data.frame(years_tertiary_study = 12, work = "Working full-time for pay",
                         managerial = "Middle managerial", sexual_orientation = "Bisexual",
                         age = 30, gender = "Male")

We can easily calculate the predicted probability of each outcome level (political interest) for this person:

predictions(mod, newdata = respondent)
## 
##        Group Estimate Std. Error     z Pr(>|z|)     S    2.5 %  97.5 %
##  None         0.00234   0.000933  2.51  0.01222   6.4 0.000509 0.00417
##  Not much     0.02609   0.009061  2.88  0.00399   8.0 0.008328 0.04385
##  Some         0.18370   0.048728  3.77  < 0.001  12.6 0.088195 0.27920
##  A good deal  0.78787   0.058370 13.50  < 0.001 135.5 0.673472 0.90228
##  years_tertiary_study                      work        managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##  sexual_orientation age gender
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
## 
## Columns: rowid, group, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

We see that the probability this person rates their political interest as ‘A good deal’ is \(79\%\), ‘Some’ is \(18\%\), ‘Not much’ is \(3\%\), and ‘None’ is almost \(0\%\). How cool is that!

Slopes

Slopes. Partial derivatives. Marginal effects. The rate at which the probability of our outcome changes with respect to a given variable.

Calculation of these quantities is made easy in marginaleffects through the slopes() function. Continuing with our hypothetical respondent, we can calculate the slopes for a given variable. Here is age:

slopes(mod, variables = "age", newdata = respondent)
## 
##        Group Term   Estimate Std. Error     z Pr(>|z|)    S     2.5 %    97.5 %
##  None         age -0.0000723  0.0000302 -2.39  0.01680  5.9 -0.000132 -0.000013
##  Not much     age -0.0007839  0.0002842 -2.76  0.00581  7.4 -0.001341 -0.000227
##  Some         age -0.0043257  0.0009707 -4.46  < 0.001 16.9 -0.006228 -0.002423
##  A good deal  age  0.0051820  0.0012653  4.10  < 0.001 14.5  0.002702  0.007662
##  years_tertiary_study                      work        managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##  sexual_orientation age gender
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
## 
## Columns: rowid, group, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

And here is years of tertiary study:

slopes(mod, variables = "years_tertiary_study", newdata = respondent)
## 
##        Group                 Term  Estimate Std. Error     z Pr(>|z|)    S
##  None        years_tertiary_study -0.000231  0.0000873 -2.64  0.00820  6.9
##  Not much    years_tertiary_study -0.002502  0.0007962 -3.14  0.00167  9.2
##  Some        years_tertiary_study -0.013808  0.0027294 -5.06  < 0.001 21.2
##  A good deal years_tertiary_study  0.016541  0.0035167  4.70  < 0.001 18.6
##      2.5 %     97.5 % years_tertiary_study                      work
##  -0.000402 -0.0000597                   12 Working full-time for pay
##  -0.004063 -0.0009418                   12 Working full-time for pay
##  -0.019158 -0.0084586                   12 Working full-time for pay
##   0.009649  0.0234341                   12 Working full-time for pay
##         managerial sexual_orientation age gender
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
## 
## Columns: rowid, group, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

Alternatively, we can calculate average marginal effects using avg_slopes(). This is similar to the slopes() call above, except the computation is repeated for every row of the original dataset, and the average slope for each level of the outcome variable is then reported:

avg_slopes(mod)
## 
##        Group                 Term
##  None        years_tertiary_study
##  None        work                
##  None        work                
##  None        work                
##  None        work                
##                                                                            Contrast
##  dY/dX                                                                             
##  Family or caring responsibilities - A full-time school or university student      
##  Retired from paid work - A full-time school or university student                 
##  Unemployed - looking for full-time work - A full-time school or university student
##  Unemployed - looking for part-time work - A full-time school or university student
##  Estimate Std. Error      z Pr(>|z|)    S     2.5 %    97.5 %
##  -0.00151   0.000365 -4.126   <0.001 14.7 -0.002222 -0.000791
##   0.01152   0.005575  2.066   0.0388  4.7  0.000591  0.022446
##   0.00461   0.003619  1.273   0.2030  2.3 -0.002486  0.011700
##   0.00406   0.006257  0.649   0.5165  1.0 -0.008205  0.016323
##   0.00823   0.007811  1.054   0.2918  1.8 -0.007075  0.023545
## --- 58 rows omitted. See ?print.marginaleffects --- 
##  A good deal gender              
##  A good deal sexual_orientation  
##  A good deal sexual_orientation  
##  A good deal sexual_orientation  
##  A good deal age                 
##                                                                            Contrast
##  Other - Female                                                                    
##  Gay or Lesbian - Bisexual                                                         
##  Heterosexual or Straight - Bisexual                                               
##  Other - Bisexual                                                                  
##  dY/dX                                                                             
##  Estimate Std. Error      z Pr(>|z|)    S     2.5 %    97.5 %
##   0.47911   0.134333  3.567   <0.001 11.4  0.215821  0.742398
##  -0.10433   0.085494 -1.220   0.2224  2.2 -0.271892  0.063240
##  -0.23807   0.060119 -3.960   <0.001 13.7 -0.355904 -0.120243
##  -0.36413   0.104987 -3.468   <0.001 10.9 -0.569901 -0.158358
##   0.00661   0.000817  8.094   <0.001 50.6  0.005012  0.008215
## Columns: group, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
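To demystify what avg_slopes() is doing, here is a rough sketch of where its point estimates come from (the standard errors are not reproduced this way; marginaleffects computes those via the delta method):

# Unit-level slopes for every row of the original data, then averaged within
# each combination of outcome level, term, and contrast
slopes(mod) %>%
  summarise(estimate = mean(estimate), .by = c(group, term, contrast))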

Together, these functions enable us to infer a great many things about how the rate of change of the probability of outcome category membership varies as a function of a given predictor variable. Useful stuff!

Comparisons

So far, we have computed adjusted predictions and slopes. What else can we do? A whole lot, but for now we’re just going to explore comparisons. We are going to continue using our hypothetical AES respondent from above.

We can ask lots of interesting questions about this person through a comparative lens, such as ‘what happens to the predicted probability of their political interest if we increased their age by 1?’ Or ‘what happens to the predicted probability of their political interest if they were instead female?’ The comparisons function provides answers to these questions.

Let’s calculate the comparisons first:

comps <- comparisons(mod, newdata = respondent)
head(comps)
## 
##  Group                 Term
##   None years_tertiary_study
##   None work                
##   None work                
##   None work                
##   None work                
##   None work                
##                                                                            Contrast
##  +1                                                                                
##  Family or caring responsibilities - A full-time school or university student      
##  Retired from paid work - A full-time school or university student                 
##  Unemployed - looking for full-time work - A full-time school or university student
##  Unemployed - looking for part-time work - A full-time school or university student
##  Working full-time for pay - A full-time school or university student              
##   Estimate Std. Error      z Pr(>|z|)   S     2.5 %     97.5 %
##  -0.000231  0.0000874 -2.644   0.0082 6.9 -0.000402 -0.0000597
##   0.001738  0.0010737  1.619   0.1055 3.2 -0.000366  0.0038425
##   0.000687  0.0006028  1.140   0.2542 2.0 -0.000494  0.0018689
##   0.000605  0.0009575  0.632   0.5274 0.9 -0.001271  0.0024818
##   0.001236  0.0012819  0.964   0.3350 1.6 -0.001277  0.0037485
##   0.001113  0.0006367  1.747   0.0806 3.6 -0.000135  0.0023604
##  years_tertiary_study                      work        managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##  sexual_orientation age gender
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
## 
## Columns: rowid, group, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

That’s a lot of information! Let’s narrow it down to make it a bit easier to interrogate. Here is age:

comps %>%
  filter(term == "age")
## 
##        Group Term Contrast   Estimate Std. Error     z Pr(>|z|)    S     2.5 %
##  None         age       +1 -0.0000723  0.0000303 -2.39  0.01680  5.9 -0.000132
##  Not much     age       +1 -0.0007841  0.0002842 -2.76  0.00581  7.4 -0.001341
##  Some         age       +1 -0.0043260  0.0009706 -4.46  < 0.001 16.9 -0.006228
##  A good deal  age       +1  0.0051824  0.0012654  4.10  < 0.001 14.5  0.002702
##     97.5 % years_tertiary_study                      work        managerial
##  -0.000013                   12 Working full-time for pay Middle managerial
##  -0.000227                   12 Working full-time for pay Middle managerial
##  -0.002424                   12 Working full-time for pay Middle managerial
##   0.007662                   12 Working full-time for pay Middle managerial
##  sexual_orientation age gender
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
## 
## Columns: rowid, group, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

We see that an increase of 1 year of age (as evidenced by Contrast being +1) is associated with decreases in the probability of rating political interest as ‘None’, ‘Not much’, and ‘Some’, but an increase in the probability of rating ‘A good deal’ by about \(0.5\) percentage points. Neat!
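A one-unit change is not the only contrast we can request. For example, something like the following should give the effect of a ten-year increase in age for our hypothetical respondent (I am treating this form of the variables argument as an assumption, so double-check the marginaleffects documentation):

comparisons(mod, variables = list(age = 10), newdata = respondent)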

What about gender? Let’s focus on the ‘Male - Female’ contrast:

comps %>%
  filter(term == "gender") %>%
  filter(grepl("Male", contrast))
## 
##        Group   Term      Contrast Estimate Std. Error     z Pr(>|z|)    S
##  None        gender Male - Female -0.00115   0.000511 -2.25   0.0245  5.4
##  Not much    gender Male - Female -0.01229   0.004782 -2.57   0.0102  6.6
##  Some        gender Male - Female -0.06123   0.016425 -3.73   <0.001 12.3
##  A good deal gender Male - Female  0.07467   0.021208  3.52   <0.001 11.2
##     2.5 %    97.5 % years_tertiary_study                      work
##  -0.00215 -0.000148                   12 Working full-time for pay
##  -0.02166 -0.002919                   12 Working full-time for pay
##  -0.09342 -0.029034                   12 Working full-time for pay
##   0.03310  0.116233                   12 Working full-time for pay
##         managerial sexual_orientation age gender
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
## 
## Columns: rowid, group, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

These contrasts are expressed as Male minus Female: being Male rather than Female, holding everything else the same, is associated with lower probabilities of rating political interest as ‘None’, ‘Not much’, and ‘Some’, but a higher probability of rating ‘A good deal’ (by about \(7\) percentage points). Hopefully now it’s evident just how intuitive complex models like ordinal regression become when interpreted using appropriate tools!

Final thoughts

Thanks for making it to the end! Hopefully this post has inspired you to leave the world of coefficient interpretation in favour of more desirable shores. As an aside, this dataset is incredibly interesting, so maybe I will revisit it sometime. There is a panel data version where respondents across time were tracked, so maybe I will do something with that!

As always, please feel free to reach out to me via email if you have any questions or just want to talk about stats.


  1. Agresti, A. (2002). Categorical Data Analysis. Second edition. Wiley.↩︎

  2. https://online.stat.psu.edu/stat504/lesson/8/8.4↩︎

  3. But not always. Sometimes you may have use for values on the link scale or otherwise.↩︎

  4. The beauty of outcome scales extends to other models, too. For example, marginal effects calculated on a Poisson model will give you results in terms of counts.↩︎

  5. This is my high-level summary term for what the survey item measures.↩︎

  6. I did do some for my own sanity, however.↩︎

  7. Although I think everyone should go Bayes and use Stan or brms.↩︎

  8. Okay, okay. I’ll stop!↩︎

statistics

Incremental complexity: A forecaster’s odyssey

23 minute read

Published:

1 Introduction

Welcome! Our adventures in time-series analysis continue, but this time with a focus on forecasting. Forecasting is an incredibly important tool for many researchers and analysts, but is often one that is easily abused, obfuscated, or just outright done poorly. In this article, we are going to explore forecasting from a ‘bottom-up’ approach to complexity. That is, we want to try simple models first, before jumping into something more complex, and, therefore, harder to interpret. This is something I call incremental complexity, and is an idea I introduced in this recent paper. Now, in some applications, you might not need to care about interpretability, such as maybe process engineering, where maximising accuracy is what matters. But for most applied settings in which I typically work, people need to know how conclusions were reached, or how numbers were generated. Using some off-the-shelf convoluted neural network just does not cut it. Besides, that stuff is boring anyway, we care about statistical uncertainty and distributions in this house!

Before we get started, we will load the R packages we need:

library(dplyr)
library(ggplot2)
library(lubridate)
library(tsibble)
library(feasts)
library(fable)

2 Downloading the dataset

We are going to analyse monthly laboratory confirmed influenza cases in Australia for no reason other than it seemed interesting. There is a bit of data wrangling to get it in the format we need, so stay with me here. We are just going to take the data up to and including 2017 to avoid the weirdness of COVID-19 (I will save that for a future post!) and the strange dip exhibited in 2018. Since data wrangling is not the focus of this article, feel free to just run this and move on:

library(httr)
library(readxl)
library(janitor)

url <- "https://www.health.gov.au/sites/default/files/2023-09/national-notifiable-diseases-surveillance-system-nndss-public-dataset-influenza-laboratory-confirmed.xlsx"

GET(url, write_disk(tf <- tempfile(fileext = ".xlsx")))
flu_2008_2015 <- read_excel(tf, sheet = 1, skip = 1)
flu_2016_2018 <- read_excel(tf, sheet = 2, skip = 1)

flu <- bind_rows(flu_2008_2015, flu_2016_2018) %>%
  clean_names() %>%
  mutate(week_ending_friday = as.Date(week_ending_friday, format = "%Y-%m-%d"),
         month = yearmonth(as.Date(week_ending_friday, format = "%Y-%m-%d"))) %>%
  reframe(cases = n(), .by = "month") %>%
  mutate(id = 1) %>%
  filter(month <= yearmonth(as.Date("2017-12-31", format = "%Y-%m-%d")))

Here is a quick visualisation of what we are dealing with:

flu %>%
  ggplot(aes(x = month, y = cases)) +
  geom_line(aes(group = 1)) +
  labs(x = "Month",
       y = "Lab confirmed influenza cases",
       colour = NULL) +
  theme_bw() +
  theme(legend.position = "bottom")

However, the temporal dynamics are quite difficult to observe on this scale. Let’s log-scale it and try again:

flu %>%
  ggplot(aes(x = month, y = cases)) +
  geom_line(aes(group = 1)) +
  labs(x = "Month",
       y = "log-scaled lab confirmed influenza cases",
       colour = NULL) +
  scale_y_log10() +
  theme_bw() +
  theme(legend.position = "bottom")

Beautiful! We can now make out a seasonal pattern and an upwards trend. But more on these things later! It's time to try some basic models.

3 Fitting forecast models

In R, forecasting should be done using the tidyverts – a collection of packages developed by the incredible Rob Hyndman and colleagues for handling forecasting and time-series data more generally using tidy data principles. Specifically, we are interested in the following packages:

  • tsibble – package for the tsibble data type, which is a time series version of the regular tibble data frame object
  • feasts – package for computing time-series features and useful associated quantities
  • fable – the actual forecasting package

To start with, we will convert our data frame to a tsibble:

tsbl <- as_tsibble(flu, key = id, index = month)
head(tsbl)
## # A tsibble: 6 x 3 [1M]
## # Key:       id [1]
##      month cases    id
##      <mth> <int> <dbl>
## 1 2008 Jan   121     1
## 2 2008 Feb   155     1
## 3 2008 Mar   149     1
## 4 2008 Apr   163     1
## 5 2008 May   270     1
## 6 2008 Jun   250     1

3.1 Some simple approaches

Okay, so where should we start? Well, if you open the highly-recommended and exceptional Forecasting: Principles and Practice (3rd Edition), you would find a short list of so-called ‘simple’ forecasting methods. These include:

  • Mean method (MEAN function in fable) – forecasts of all future values are equal to the average (i.e., mean) of the historical data. With historical data denoted by \(y_{1}, \dots , y_{T}\), forecasts can be written as:

\[ \hat{y}_{T+h|T} = \bar{y} = \frac{(y_{1} + \dots + y_{T})}{T} \] where \(\hat{y}_{T+h|T}\) basically says ‘the estimate of the value for \(y_{T+h}\) based on the data \(y_{1}, \dots, y_{T}\)’.

  • Naive method (NAIVE function in fable) – forecasts are the value of the last observation. Mathematically, this approach is expressed by:

\[ \hat{y}_{T+h|T} = y_{T} \]

  • Seasonal naive method (SNAIVE function in fable) – forecasts are equal to the last observed value from the same season. Mathematically, this is:

\[ \hat{y}_{T+h|T} = y_{T+h-m(k+1)} \] where \(m\) is the seasonal period and \(k\) is the number of complete years in the forecast period.

  • Drift method (RW function with drift() in fable) – forecasts are similar to the naive method, but they are allowed to increase or decrease over time, where the amount of change over time (i.e., ‘drift’) is the average change observed in the historical data. Mathematically, this is:

\[ \hat{y}_{T+h|T} = y_{T} + \frac{h}{T-1} \sum_{t=2}^{T}(y_{t}-y_{t-1}) = y_{T}+h \left( \frac{y_{T}-y_{1}}{T-1} \right) \] In fpp3, Rob Hyndman cleverly describes this as being equivalent to drawing a straight line between the first and last observations, and extrapolating it forward.

3.1.1 Fitting simple models in R

As highlighted above, there are simple one-line model functions in fable that enable us to fit these models. But there is another reason why fable is so good – we can fit them all at the same time using the model function! This makes our code more compact and also makes it easier to compare models. Here is an example using our sample data, remembering that we want to log-scale cases1:

fit <- tsbl %>%
  model(
    mean = MEAN(log(cases)),
    naive = NAIVE(log(cases)),
    snaive = SNAIVE(log(cases) ~ lag("year")),
    drift = RW(log(cases) ~ drift()),
  )

fit
## # A mable: 1 x 5
## # Key:     id [1]
##      id    mean   naive   snaive         drift
##   <dbl> <model> <model>  <model>       <model>
## 1     1  <MEAN> <NAIVE> <SNAIVE> <RW w/ drift>

This returns a mable object (a ‘model fable’) where each row corresponds to a unique time series (in our case we only have one) and each column contains an object for each model that was fit. Very neat.

What can we do with this? Well, a lot!

We might want to produce some quick forecasts for an entire year for all models:

first_fc <- fit %>% 
  forecast(h = 12)

head(first_fc)
## # A fable: 6 x 5 [1M]
## # Key:     id, .model [1]
##      id .model    month          cases .mean
##   <dbl> <chr>     <mth>         <dist> <dbl>
## 1     1 mean   2018 Jan t(N(7.4, 2.3)) 3638.
## 2     1 mean   2018 Feb t(N(7.4, 2.3)) 3638.
## 3     1 mean   2018 Mar t(N(7.4, 2.3)) 3638.
## 4     1 mean   2018 Apr t(N(7.4, 2.3)) 3638.
## 5     1 mean   2018 May t(N(7.4, 2.3)) 3638.
## 6     1 mean   2018 Jun t(N(7.4, 2.3)) 3638.

The forecast function knows how to handle mable objects, which makes our lives easy. We now have a data frame containing full forecast distributions for our forecast period. We can easily plot them:

first_fc %>%
  autoplot(tsbl) +
  scale_y_log10()

We can also disable the prediction intervals to just observe the means for visual clarity:

first_fc %>%
  autoplot(tsbl, level = NULL) +
  scale_y_log10()

This is great, and it appears that the seasonal naive model performs the best, but how can we quantify that assertion? Let’s take a look at some ways to evaluate performance.

3.1.2 Evaluating forecast accuracy

Typically in forecasting work we would start with analysis of residuals and other diagnostics. But demonstrating all of those for every model here would take a long time. I encourage readers to check out the relevant chapters of fpp3. Instead, we are going to focus on evaluating point forecast and forecast distribution accuracy.
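That said, for completeness, here is one quick way you could eyeball residual diagnostics for a single model from the mable we created earlier (a sketch; gg_tsresiduals() comes from feasts and expects one model at a time, hence the select()):

# Residual time plot, ACF, and histogram for the seasonal naive model
fit %>%
  select(snaive) %>%
  gg_tsresiduals()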

3.1.2.1 Point forecast accuracy

Proper evaluation of accuracy, rather than just inspecting residuals, is critical, however: residuals are not a good indicator of how large actual forecast errors might be. As in other areas of statistics and machine learning, we need to split our data into training and test sets so that we can evaluate genuine forecasts over unseen time points. Typical guidance is that the test set should be at least as large as the longest forecast horizon of interest.

Let’s filter our data to contain everything up to and including 2016, and then use 2017 as our test set2:

train <- tsbl %>%
  filter(lubridate::year(month) < 2017)

test <- tsbl %>%
  filter(lubridate::year(month) >= 2017)

Now we can re-fit the models using only the training data and generate forecasts for one year:

fit_train <- train %>%
  model(
    mean = MEAN(log(cases)),
    naive = NAIVE(log(cases)),
    snaive = SNAIVE(log(cases) ~ lag("year")),
    drift = RW(log(cases) ~ drift())
  )

test_fc <- fit_train %>%
  forecast(h = 12)

Here is a plot of our forecasts against the actuals:

test_fc %>%
  autoplot(tsbl, level = NULL) +
  labs(x = "Month",
       y = "log-scaled lab confirmed influenza cases") +
  scale_y_log10() +
  guides(colour = guide_legend(title = "Forecast"))

Clearly, these are pretty poor forecasts (aside from maybe the seasonal naive model), which suggests that a more sophisticated approach is warranted (i.e., one that, at a minimum, captures the temporal dynamics our eyes can see in the data). Indeed, this is the point of this entire article – we need to start with basic methods that are completely transparent and interpretable, and build the case for additional complexity. In our circumstances here, we have a pretty strong case for moving to more complex methods. We can further this case by computing a range of error metrics for the forecasts3:

accuracy(test_fc, test)
## # A tibble: 4 × 11
##   .model    id .type     ME   RMSE    MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <dbl> <chr>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 drift      1 Test   8486. 31413. 18776. -164.  201.    NaN   NaN 0.516
## 2 mean       1 Test  17745. 35926. 18164.   27.9  47.4   NaN   NaN 0.560
## 3 naive      1 Test  11597. 32528. 17889.  -99.3 143.    NaN   NaN 0.532
## 4 snaive     1 Test   7982. 18229.  9741.  -17.2  53.9   NaN   NaN 0.542

3.1.2.2 Forecast distribution accuracy

So far, we have only evaluated point forecast accuracy. But we would be terrible forecasters if that is all we cared about! Why interpret a single value when you can make use of an entire distribution which encapsulates your uncertainty about the future! A useful metric for this scenario is the Continuous Ranked Probability Score (CRPS), which is essentially a weighted absolute error computed from the entire forecast distribution, where the weighting explicitly accounts for the probabilities. Here is how we can use the CRPS to evaluate forecast distributions for all time points in the prediction set:

test_fc %>%
  accuracy(test, list(crps = CRPS))
## # A tibble: 4 × 4
##   .model    id .type   crps
##   <chr>  <dbl> <chr>  <dbl>
## 1 drift      1 Test  14487.
## 2 mean       1 Test  17314.
## 3 naive      1 Test  14348.
## 4 snaive     1 Test   8222.
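For reference (this is the standard definition rather than anything specific to fable), the CRPS for a single observation \(y\) with forecast cumulative distribution function \(F\) can be written as:

\[ \text{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbb{1}\{x \geq y\} \right)^{2} dx \]

where \(\mathbb{1}\{x \geq y\}\) is an indicator function. The values in the table above are averages of this quantity over the test set.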

For the CRPS, lower values indicate better distributional forecasts. We see that the snaive method is producing better forecasts than the other simple approaches. However, as noted earlier, the other methods are quite poor, which suggests we may need to move to something that better captures the relevant temporal properties.

3.2 More complex methods

Okay, so we have built up a case for moving to more complex methods. Remember, complexity often comes at the cost of interpretability. However, since we are trying to be truly incremental about all of this, we are going to progress to methods that are indeed more complex than a mean or naive forecast, but are still incredibly interpretable and well-studied. Best of all, since we are using fable, we can easily compare the performance of all of these approaches. Let’s fit the same forecasts as before but with the addition of an autoregressive integrated moving average (ARIMA) model, an exponential smoothing (ETS) model, a Seasonal and Trend decomposition using Loess (STL) model, and a time-series linear model (TSLM) which just captures a trend and annual seasonality (with seasonality modelled via simple seasonal dummy variables):

fit_train_complex <- train %>%
  model(
    mean = MEAN(log(cases)),
    naive = NAIVE(log(cases)),
    snaive = SNAIVE(log(cases) ~ lag("year")),
    drift = RW(log(cases) ~ drift()),
    stepwise = ARIMA(log(cases)),
    search = ARIMA(log(cases), stepwise = FALSE, approx = FALSE),
    ets = ETS(log(cases)),
    stlf = decomposition_model(
      STL(log(cases) ~ trend(window = 21), robust = TRUE), 
      NAIVE(season_adjust)
      ), 
    tslm = TSLM(log(cases) ~ trend() +  season())
  )

test_fc_complex <- fit_train_complex %>%
  forecast(h = 12)

Let’s unpack the new models a bit further. There are two ARIMA specifications. They both use fable’s amazing auto-ARIMA functionality (meaning we don’t have to manually specify the \(p\), \(d\), and \(q\) parameters), but the stepwise model uses the default stepwise automation procedure, while search does a much larger model space search to find the optimal model4. Similarly, we are letting fable do the automated search for the optimal parameters of the ETS model too.

But why are ARIMA, ETS, and STL models ‘interpretable’ despite the substantial jump in complexity? Well, an ARIMA model is a generative process. That is, we can simulate time series from an ARIMA process because it describes the temporal mechanics: an autoregressive lag order (\(p\); i.e., each time point is correlated with the \(p\) points behind it), a degree of differencing (\(d\); this removes trend so that the series is stationary), and a moving average lag order (\(q\); i.e., we use past forecast errors rather than past values as in the autoregressive part). This leads us to a formulation that looks very much like a natural extension of linear regression, where \(y_{t}^{'}\) is the differenced time series:

\[ y_{t}^{'} = c + \phi_{1} y_{t-1}^{'} + \dots + \phi_{p} y_{t-p}^{'} + \theta_{1} \epsilon_{t-1} + \dots + \theta_{q} \epsilon_{t-q} + \epsilon_{t} \]

The \(\phi\) components describe the autoregressive part of the model, while the \(\theta\) components describe the moving average part of the model. \(\epsilon\) is the error term.
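To make the ‘generative’ point concrete, here is a minimal base R sketch (the parameter values are arbitrary) that simulates a series from an ARIMA(1,1,1) process:

set.seed(123)

# Simulate 120 observations from an ARIMA(1,1,1) process with
# phi_1 = 0.7 and theta_1 = 0.4
sim <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.7, ma = 0.4), n = 120)
plot(sim)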

As for ETS: put simply, an ETS model employs weighted averages of past observations, with the weights decaying exponentially as the observations get older. This approach can then be extended to account for both trend and seasonality. Mathematically, we can represent the weighted-average form of the simplest ETS model (simple exponential smoothing) as:

\[ \hat{y}_{T+1|T} = \sum_{j=0}^{T-1} \alpha(1 - \alpha)^{j} y_{T-j} + (1 - \alpha)^{T} \ell_{0} \]
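A quick numerical sketch of those exponentially decaying weights (with an arbitrary \(\alpha\)):

alpha <- 0.3
weights <- alpha * (1 - alpha)^(0:9)  # weights on y_T, y_(T-1), ..., y_(T-9)
round(weights, 3)                     # each weight is (1 - alpha) times the previous one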

Finally, STL is the most intuitive of the three new additions. Its usefulness is best seen graphically, since it decomposes the time series into trend, seasonality, and the remainder of the signal. This leads to an additive component interpretation of time series. Here is the STL decomposition for the train set, where the upwards trend and annual seasonality are made very clear:

train %>%
  model(stl = STL(log(cases))) %>%
  components() %>%
  autoplot()

We can now directly compare the forecast distributions all of the models fit thus far:

test_fc_complex %>%
  accuracy(test, list(crps = CRPS)) %>%
  arrange(crps)
## # A tibble: 9 × 4
##   .model      id .type   crps
##   <chr>    <dbl> <chr>  <dbl>
## 1 search       1 Test   5160.
## 2 stepwise     1 Test   5226.
## 3 tslm         1 Test   6749.
## 4 stlf         1 Test   8134.
## 5 snaive       1 Test   8222.
## 6 ets          1 Test  11185.
## 7 naive        1 Test  14348.
## 8 drift        1 Test  14487.
## 9 mean         1 Test  17314.

We see that the search ARIMA model produces the best forecast distributions, and four of the five newer complex models are the top performers. Let’s visualise the search ARIMA’s forecasts independently:

test_fc_complex %>%
  filter(.model == "search") %>%
  autoplot(tsbl) +
  labs(x = "Month",
       y = "log-scaled lab confirmed influenza cases") +
  scale_y_log10() +
  guides(colour = guide_legend(title = "Forecast"))

The model appears to capture 2017’s data reasonably well in terms of both the prediction interval coverage and the mean forecast against the actual. However, there are other methods for evaluating forecast accuracy. One of these is a Winkler score. If the \(100(1- \alpha) \%\) prediction interval at time \(t\) is given by \([\ell_{\alpha,t}, u_{\alpha,t}]\), then the Winkler score is defined as the length of the interval plus a penalty if the observation is outside the interval. As Rob Hyndman succinctly states in fpp3:

For observations that fall within the interval, the Winkler score is simply the length of the interval. Thus, low scores are associated with narrow intervals. However, if the observation falls outside the interval, the penalty applies, with the penalty proportional to how far the observation is outside the interval.
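Written out (following the definition in fpp3), the score for an observation \(y_{t}\) and a \(100(1-\alpha)\%\) interval \([\ell_{\alpha,t}, u_{\alpha,t}]\) is:

\[ W_{\alpha,t} = \begin{cases} (u_{\alpha,t} - \ell_{\alpha,t}) + \frac{2}{\alpha} (\ell_{\alpha,t} - y_{t}) & \text{if } y_{t} < \ell_{\alpha,t} \\ (u_{\alpha,t} - \ell_{\alpha,t}) & \text{if } \ell_{\alpha,t} \leq y_{t} \leq u_{\alpha,t} \\ (u_{\alpha,t} - \ell_{\alpha,t}) + \frac{2}{\alpha} (y_{t} - u_{\alpha,t}) & \text{if } y_{t} > u_{\alpha,t} \end{cases} \]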

We can compare Winkler scores for forecasts easily in fable, but the choice of interval width is important (we use \(95\%\) here for demonstration purposes):

test_fc_complex %>%
  accuracy(tsbl, measures = interval_accuracy_measures,
    level = 95) %>%
  arrange(winkler)
## # A tibble: 9 × 4
##   .model      id .type winkler
##   <chr>    <dbl> <chr>   <dbl>
## 1 tslm         1 Test   43905.
## 2 stepwise     1 Test   50738.
## 3 search       1 Test   52982.
## 4 snaive       1 Test   75465.
## 5 naive        1 Test  230301.
## 6 stlf         1 Test  231972.
## 7 drift        1 Test  389376.
## 8 mean         1 Test  466614.
## 9 ets          1 Test  590724.

Winkler scores suggest that the time-series linear model (i.e., a model consisting of a linear trend component added to a seasonal periodicity component) is the best-performing method. The two ARIMA specifications are the second- and third-best options, followed by the seasonal naive method. Clearly, this finding contrasts with the result obtained from the CRPS method. Statistical decision making is hard!

The above accuracy comparisons were computed on the prediction intervals. But we can also just compare pointwise forecasts across a wide range of metrics:

test_fc_complex %>%
  accuracy(test) %>%
  arrange(RMSE)
## # A tibble: 9 × 11
##   .model      id .type      ME   RMSE    MAE    MPE  MAPE  MASE RMSSE   ACF1
##   <chr>    <dbl> <chr>   <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 stlf         1 Test    -906.  7719.  5183.  -75.6  83.0   NaN   NaN -0.206
## 2 search       1 Test    4338. 11623.  5371.  -13.5  31.0   NaN   NaN  0.220
## 3 stepwise     1 Test    4562. 12019.  5913.  -18.4  37.6   NaN   NaN  0.302
## 4 tslm         1 Test    6821. 17958.  8050.  -12.3  32.9   NaN   NaN  0.378
## 5 snaive       1 Test    7982. 18229.  9741.  -17.2  53.9   NaN   NaN  0.542
## 6 ets          1 Test  -15075. 20602. 15075. -160.  160.    NaN   NaN  0.660
## 7 drift        1 Test    8486. 31413. 18776. -164.  201.    NaN   NaN  0.516
## 8 naive        1 Test   11597. 32528. 17889.  -99.3 143.    NaN   NaN  0.532
## 9 mean         1 Test   17745. 35926. 18164.   27.9  47.4   NaN   NaN  0.560

Now we see that the decomposition model is regarded as the best. However, as the table shows, different error metrics can lead to different decisions regarding the best-performing model. Regardless, the overarching finding here is that more complex models appear to perform better than simple models – even from a pointwise perspective.

3.3 Forecast combinations and method comparisons

So far, we have directly compared multiple individual forecasting methods. But what if we combined them? Could a composite model outperform any individual forecast method? As it turns out, this 1989 review showed that combining multiple forecasts often leads to increased forecast accuracy, and that for most applications, averaging the different forecast methods can provide a substantial performance improvement. Now, this is again a step up in complexity, but if the performance gains truly are that meaningful, the increase may be worth it. Further, if the underlying methods used in the composite forecast are themselves interpretable, then producing an average of them should not harm overall interpretability. Let’s try a combination of the search ARIMA and the time-series linear model and see how well it does:

fit_train_composite <- train %>%
  model(
    mean = MEAN(log(cases)),
    naive = NAIVE(log(cases)),
    snaive = SNAIVE(log(cases) ~ lag("year")),
    drift = RW(log(cases) ~ drift()),
    stepwise = ARIMA(log(cases)),
    search = ARIMA(log(cases), stepwise = FALSE, approx = FALSE),
    ets = ETS(log(cases)),
    stlf = decomposition_model(
      STL(log(cases) ~ trend(window = 21), robust = TRUE), 
      NAIVE(season_adjust)
      ), 
    tslm = TSLM(log(cases) ~ trend() +  season())
  ) %>%
    mutate(arima_tslm = (search + tslm) / 2)

Now, you may note that I haven’t just piped straight into the forecast function like we did in the previous sections. This is because fable cannot currently combine different distribution types (e.g., a normal distribution for one component model and a Student’s \(t\) distribution for another), so we need to compute an empirical forecast distribution by simulating many futures from the model. We do this for all the models to ensure fair comparisons:

futures <- fit_train_composite %>%
  generate(h = nrow(test), new_data = test, times = 1000) %>%
  as_tibble() %>%
  group_by(month, .model) %>%
  summarise(dist = distributional::dist_sample(list(.sim))) %>%
  ungroup() %>%
  as_fable(index = month, key = .model,
           distribution = dist, response = "cases")

We can then compare the empirical forecast distributions using the same methods as before:

futures %>%
  accuracy(tsbl, measures = interval_accuracy_measures,
    level = 95) %>%
  arrange(winkler)
## # A tibble: 10 × 3
##    .model     .type winkler
##    <chr>      <chr>   <dbl>
##  1 arima_tslm Test   35734.
##  2 tslm       Test   37146.
##  3 search     Test   50872.
##  4 stepwise   Test   53267.
##  5 snaive     Test   63864.
##  6 naive      Test  205846.
##  7 stlf       Test  265227.
##  8 drift      Test  446284.
##  9 mean       Test  463255.
## 10 ets        Test  931365.

We see that the combination forecast of the search ARIMA and the time-series linear model performs the best, followed by the component models of the TSLM and both ARIMA specifications. The highest-ranked ‘simple’ forecasting method – seasonal naive – is ranked fifth best. This provides us with further evidence that the increase in complexity is warranted.

Finally, with a bit more code, we can produce a visualisation of the distribution of futures for the top composite model:

fit_train_composite %>%
  generate(h = nrow(test), new_data = test, times = 1000) %>%
  as_tibble() %>%
  group_by(month, .model) %>%
  summarise(dist = distributional::dist_sample(list(.sim))) %>%
  ungroup() %>%
  filter(.model == "arima_tslm") %>%
  mutate(id = 1) %>%
  dplyr::select(-c(.model)) %>%
  as_fable(index = month, key = id,
           distribution = dist, response = "cases") %>%
  autoplot(tsbl) +
  labs(x = "Month",
       y = "log-scaled lab confirmed influenza cases") +
  scale_y_log10() +
  guides(colour = guide_legend(title = "Forecast"))

Looks decent!

4 Final thoughts

Thank you for making it to the end! If you have any questions, please feel free to reach out to me via email. I intend to write a follow-up piece to this post exploring the magnitude of the impact of COVID-19 on influenza rates, so stay tuned for that!


  1. Doing this within fable::model is great because it will automatically back-transform forecasts and prediction intervals to the scale of interest for us.↩︎

  2. 2018 saw a strange decrease in average cases, so we will park this complexity for now.↩︎

  3. These become far more meaningful once we directly compare the simple and complex methods. This is coming up shortly!↩︎

  4. Note that this substantially increases computation time.↩︎


Ordinance: Make ordinal regression intuitive with marginal effects

26 minute read

Published:

Introduction

Howdy! Today we are going to be talking all things ordinal variables. Ordinal data is very common in applied settings. It may come in surveys where items are measured on a Likert scale (i.e., 1 = Strongly disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly agree). It may come in workforce data, where someone’s position in a company is captured. It may come in sports data, where we have finish positions in a race. Irrespective of the source, it would be disingenuous to model such data using a method that ignores its ordered categorical nature (such as fitting a linear regression model). Instead, we should use a tool developed specifically for the job.

Enter ordinal regression.

Regression for ordinal data

Ordinal regression is an extension of linear models much like a generalised linear model (GLM) is. The architecture is basically the same: linear predictors are constructed from the covariates, and these are then used to model the probability of membership of each level of the ordinal outcome variable. There are several ways to go about doing this: (i) we can model the levels cumulatively; (ii) adjacently; or (iii) sequentially. Regardless, we often see the trusty old logit used as the link function since it converts values into the \([0,1]\) domain of probabilities.

For our purposes, we are going to adopt the cumulative approach since it is the most common. A cumulative model assumes our outcome variable reflects an unknown continuous latent variable, with each observed level defined by breakpoints1 on that latent scale. Put another way, a cumulative ordinal model works with the cumulative probability up to each threshold, thereby turning the full range of ordinal categories into a binary split at every threshold. In practice, such an approach is typically implemented using proportional odds logistic regression. But more on that later.

Mathematically, we can express the cumulative logit model as2:

\[ \log \left( \frac{P(Y \leq j)}{P(Y > j)} \right) = \log \left(\frac{P(Y \leq j)}{1 -P(Y \leq j)} \right) = \log \left( \frac{\pi_{1} + \dots + \pi_{j}}{\pi_{j+1} + \dots + \pi_{J}} \right) \]

where our response variable \(Y\) is represented by its levels \(Y = 1, 2, \dots, J\) and the associated probabilities are \(\pi_{1}, \dots, \pi_{J}\). The above equation describes the log-odds of the event that \(Y \leq j\); it measures how likely the response is to be in category \(j\) or below versus in a category higher than \(j\).
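To make this concrete, here is a tiny worked example with invented numbers. Suppose \(J = 4\) and the category probabilities are \(\pi_{1} = 0.1\), \(\pi_{2} = 0.2\), \(\pi_{3} = 0.4\), and \(\pi_{4} = 0.3\). The cumulative logit at \(j = 2\) is then:

\[ \log \left( \frac{P(Y \leq 2)}{P(Y > 2)} \right) = \log \left( \frac{0.1 + 0.2}{0.4 + 0.3} \right) = \log \left( \frac{0.3}{0.7} \right) \approx -0.85 \]

and the model works with one such quantity for each threshold \(j = 1, \dots, J - 1\).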

For those familiar with GLMs, you already know how difficult log-odds are to interpret. The difficulty usually remains even after transforming to something like odds. And all of that relates to just one (!!!) of the who-knows-how-many predictor variables you have! So what can you do? Clearly, you need to adopt a more appropriate analytical strategy.

The case for marginal effects

As stated above, interpreting regression models is hard! Especially when we have non-normal likelihoods that require link functions and/or models with numerous covariates. In case it wasn’t already obvious: a coefficient in a complex regression model does not tell you what you think it does.

Why is that? The coefficient is just a parameter. In the simple case of linear regression with one predictor, the coefficient tells us what we need to know. But for all other purposes, its meaningfulness is greatly diminished because it does not account for the values of the other covariates in the model. In order to interpret the effect of a given predictor on the outcome variable, we need to isolate it.

I’m going to borrow an analogy from this excellent post by Andrew Heiss. Think of your regression model as an audio mixing board. Every covariate/predictor is represented on the board. Continuous predictors (such as income) are represented by sliders, and categorical predictors (such as sex) are represented as switches. If we want to determine the effect of one of these switches or sliders on our outcome, we need to fix the others (i.e., hold them constant), move only the one of interest, and observe how the outcome value changes. Mathematically, this is called a partial derivative, and is represented in notation as \(\frac{\partial y}{\partial x}\). This leads us to the formal definition of a marginal effect: a marginal effect is a partial derivative from a regression equation.

Note that for categorical predictors (i.e., switches), this effect is actually called a ‘conditional effect’ or a ‘group contrast’ since a non-continuous variable does not have a slope/rate of change (i.e., a derivative). Instead, we compare the difference in means for each level of the switch (i.e., each ‘condition’).
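To make the sliders-and-switches idea concrete, here is a minimal toy sketch (the data and variable names are simulated purely for illustration and have nothing to do with the analysis below). We approximate the marginal effect of a continuous predictor with a finite difference, and the conditional effect of a categorical predictor as a difference in predictions between its levels, holding everything else fixed:

# Toy data: one 'slider' (x) and one 'switch' (g)
set.seed(123)
toy <- data.frame(x = rnorm(200), g = factor(sample(c("A", "B"), 200, replace = TRUE)))
toy$y <- 1 + 2 * toy$x + 0.5 * (toy$g == "B") + rnorm(200)
fit <- lm(y ~ x + g, data = toy)

# Points at which to evaluate the effects
base_case <- data.frame(x = 0, g = factor("A", levels = levels(toy$g)))
nudge_x   <- data.frame(x = 1e-6, g = factor("A", levels = levels(toy$g)))
flip_g    <- data.frame(x = 0, g = factor("B", levels = levels(toy$g)))

# Marginal effect of the slider: finite-difference approximation to dy/dx, holding g fixed
(predict(fit, nudge_x) - predict(fit, base_case)) / 1e-6

# Conditional effect (group contrast) of the switch: difference in predictions, holding x fixed
predict(fit, flip_g) - predict(fit, base_case)

For a plain linear model these simply recover the fitted coefficients, but the same logic carries over to models where the effects are no longer constant.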

Now, all of that should already (hopefully) sound appealing, but it gets even better! Marginal effects and the associated predictions are usually3 calculated and reported on the outcome scale of interest. This is a big deal so let me explain further. We do not need to make inferences in terms of log-odds or some other confusing measurement scale. In the case of ordinal regression4, we can instead talk about something far more intuitive and elegant: probabilities.

That’s right. The marginal effects of an ordinal regression model are expressed in terms of the probability of belonging to each level of the outcome variable. Beautiful stuff.

The dataset

Enough math and explanations! What data are we looking at? For this tutorial, we are going to explore the 2022 Australian Election Study (the ‘AES’). The AES examines important issues such as changing support for political parties, the role of the party leaders, and the importance of political issues like the economy and the environment. We are going to examine one item in particular:

Generally speaking, how much interest do you usually have in what’s going on in politics?

This item is measured on a four-point scale:

  1. A good deal
  2. Some
  3. Not much
  4. None

Clearly, we have an ordinal context. Since we want to explore differences in political interest5, we need some variables to explore differences over. We are going to use the following as predictors:

  • Age
  • Gender
  • Sexual orientation
  • Years of tertiary study
  • Employment status
  • Managerial status at work

Let’s load the R packages we need to get started:

library(dplyr)
library(tidyr)
library(ggplot2)
library(MASS)
library(marginaleffects)

Data preprocessing

Getting the dataset in a nice and clean format for what we need takes a little bit of work so bear with me here. The below code just reads in the file (which I have conveniently stored on my GitHub), filters out erroneous values, and recodes factor variables from their integer codes to qualitative labels to make our lives a little easier when visualising results. Since data cleaning isn’t the focus of this tutorial, feel free to just run this code and not dig in too deep.

aes22 <- read.csv("https://raw.githubusercontent.com/hendersontrent/hendersontrent.github.io/master/files/aes22_unrestricted_v2.csv") %>%
  dplyr::select(c(ID, A1, G2, G4, G5_D, H1, H1b, H2)) %>%
  rename(years_tertiary_study = G2,
         work = G4,
         managerial = G5_D,
         gender = H1,
         sexual_orientation = H1b,
         dob = H2) %>%
  filter(A1 %in% c(1, 2, 3, 4)) %>%
  filter(years_tertiary_study != 999) %>%
  filter(work %in% c(1, 2, 3, 4, 5, 6, 7)) %>%
  filter(managerial %in% c(1, 2, 3, 4, 5)) %>%
  filter(gender %in% c(1, 2, 3)) %>%
  filter(sexual_orientation %in% c(1, 2, 3, 4, 5)) %>%
  filter(dob != 999) %>%
  filter(dob != -97) %>%
  mutate(age = as.integer(2022 - dob)) %>% # Age at 2022 survey
  dplyr::select(-c(dob)) %>%
  mutate(gender = case_when(
          gender == "1" ~ "Male",
          gender == "2" ~ "Female",
          gender == "3" ~ "Other")) %>%
  mutate(sexual_orientation = case_when(
    sexual_orientation == "1" ~ "Heterosexual or Straight",
    sexual_orientation == "2" ~ "Gay or Lesbian",
    sexual_orientation == "3" ~ "Bisexual",
    sexual_orientation == "4" ~ "Other",
    sexual_orientation == "5" ~ "Prefer not to say")) %>%
  mutate(work = case_when(
    work == "1" ~ "Working full-time for pay",
    work == "2" ~ "Working part-time for pay",
    work == "3" ~ "Unemployed - looking for full-time work",
    work == "4" ~ "Unemployed - looking for part-time work",
    work == "5" ~ "Retired from paid work",
    work == "6" ~ "A full-time school or university student",
    work == "7" ~ "Family or caring responsibilities")) %>%
  mutate(managerial = case_when(
    managerial == "1" ~ "Upper managerial",
    managerial == "2" ~ "Middle managerial",
    managerial == "3" ~ "Lower managerial",
    managerial == "4" ~ "Supervisory",
    managerial == "5" ~ "Non-supervisory")) %>%
  mutate(A1 = case_when(
          A1 == "4" ~ "None",
          A1 == "3" ~ "Not much",
          A1 == "2" ~ "Some",
          A1 == "1" ~ "A good deal")) %>%
  mutate(A1 = factor(A1, levels = c("None", "Not much", "Some", "A good deal"), ordered = TRUE),
         managerial = factor(managerial, 
                             levels = c("Non-supervisory", "Supervisory", 
                                        "Lower managerial", "Middle managerial", 
                                        "Upper managerial"), 
                             ordered = TRUE),
         gender = as.factor(gender),
         sexual_orientation = as.factor(sexual_orientation),
         work = as.factor(work),
         years_tertiary_study = as.integer(years_tertiary_study))
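Before modelling, it is worth at least a quick look at how responses are distributed across the four levels of the outcome (a minimal sanity check only; output omitted here):

aes22 %>%
  count(A1) # tabulate the ordered outcome variable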

Statistical modelling

Typically, we would do extensive exploratory data analysis prior to conducting any statistical modelling. But this is a modelling and marginal effects tutorial and not a research paper, so we will skip it here6. Please do not skip this stage in your own work!

The easiest function for fitting a proportional odds logistic regression (i.e., ordinal regression) in R is the polr function in the MASS package7. The interface is the same as other R regression functions, such as lm. Note that by default, polr uses the logit link function, so we do not need to explicitly specify it. Here is the model:

mod <- polr(A1 ~ years_tertiary_study + work + managerial +
              gender + sexual_orientation + age,
            data = aes22)
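One optional tweak, noted here as an aside rather than something this post relies on: polr() does not store the Hessian by default, so summary() has to re-fit the model to compute standard errors. Passing Hess = TRUE stores it up front:

mod <- polr(A1 ~ years_tertiary_study + work + managerial +
              gender + sexual_orientation + age,
            data = aes22, Hess = TRUE) # keep the observed information matrix for summary()/vcov()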

We are also going to skip over model diagnostics and checks and all of that here so we can get to the crux of the tutorial content. We can print a quick summary of mod to obtain coefficient values, intercepts for each cumulative threshold, and other information:

summary(mod)
## Call:
## polr(formula = A1 ~ years_tertiary_study + work + managerial + 
##     gender + sexual_orientation + age, data = aes22)
## 
## Coefficients:
##                                                Value Std. Error t value
## years_tertiary_study                         0.09898   0.016491  6.0022
## workFamily or caring responsibilities       -0.88470   0.422507 -2.0939
## workRetired from paid work                  -0.44590   0.394436 -1.1305
## workUnemployed - looking for full-time work -0.40188   0.573699 -0.7005
## workUnemployed - looking for part-time work -0.69857   0.566880 -1.2323
## workWorking full-time for pay               -0.64700   0.362709 -1.7838
## workWorking part-time for pay               -0.78457   0.371394 -2.1125
## managerial.L                                 0.51695   0.100317  5.1531
## managerial.Q                                 0.06986   0.105543  0.6619
## managerial.C                                -0.02053   0.105220 -0.1951
## managerial^4                                -0.34185   0.115103 -2.9699
## genderMale                                   0.40115   0.095092  4.2185
## genderOther                                  2.65070   1.329392  1.9939
## sexual_orientationGay or Lesbian            -0.51216   0.423282 -1.2100
## sexual_orientationHeterosexual or Straight  -1.13469   0.313564 -3.6187
## sexual_orientationOther                     -1.75812   0.554949 -3.1681
## age                                          0.03101   0.003983  7.7852
## 
## Intercepts:
##                  Value   Std. Error t value
## None|Not much    -3.8626  0.4946    -7.8092
## Not much|Some    -1.3382  0.4639    -2.8846
## Some|A good deal  0.8813  0.4635     1.9014
## 
## Residual Deviance: 3769.865 
## AIC: 3809.865

However, as noted above, these coefficients are quite difficult to interpret. So, we are going to turn to our friend marginal effects for emotional and computational support.

Marginal effects

Marginal effects can be calculated in R in a few different ways, often with very different methods/quantities behind the scenes. We are going to use the excellent marginaleffects package for our purposes. Seriously, please visit that link and look through the incredibly detailed documentation. It’s basically my web browser home page at this point.

The marginaleffects package affords an honestly astonishing amount of functionality. As a starting point, we are going to focus on three types of quantities:

  1. Predictions
  2. Slopes
  3. Comparisons

Predictions

In the context of marginaleffects, an adjusted prediction is the outcome predicted by a fitted model on a specified scale for a given combination of values of the predictor variables, such as their observed values, their means, or factor levels. That’s a lot of words to say that we can calculate predicted values for any given subset of the data used in the model. In the most extreme case, we can calculate predictions for a given predictor which are averaged over the values of all other predictors. This procedure is extremely useful for answering questions in practice, since we often want to account for the values of other variables while exploring the effects of the focal variable, rather than just holding the others constant.
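As a sketch of that idea, if we wanted the average predicted probability of each level of political interest by gender, averaging over the observed values of every other predictor, a call along these lines should do it (avg_predictions() is my assumption here; the plots below use the closely related plot_predictions() instead):

avg_predictions(mod, by = "gender", type = "probs")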

In marginaleffects, this is easy through the predictions function (and the associated plot_predictions visualisation function). Let’s examine age. We’ll be sure to specify that we want our output (i.e., \(y\)-axis) to be probabilities:

plot_predictions(mod, condition = "age", type = "probs") + 
  facet_wrap(~factor(group, levels = c("None", "Not much", "Some", "A good deal")), 
             nrow = 1)

Beautiful! We see that as age increases, the predicted probability of rating political interest as ‘A good deal’ increases almost linearly. Compare that to ‘Not much’, where the predicted probability of scarce political interest decreases as age increases. These results are far, far more interpretable and immediately practically meaningful than trying to dredge understanding out of log-odds coefficients.

Let’s look at a few more examples. Here is years of tertiary study:

plot_predictions(mod, condition = "years_tertiary_study", type = "probs") + 
  facet_wrap(~factor(group, levels = c("None", "Not much", "Some", "A good deal")), 
             nrow = 1)

We see similar results to age, which is somewhat unsurprising.

How about categorical predictors? Here is gender:

plot_predictions(mod, condition = "gender", type = "probs") + 
  facet_wrap(~factor(group, levels = c("None", "Not much", "Some", "A good deal")), 
             nrow = 1)

As expected, we do not get nice smooth curves like we did for numerical variables but instead get point estimates for the factor level along with the associated confidence interval. From this plot, it appears that respondents with a gender outside the binary system may be more likely to express ‘A good deal’ of political interest than males or females. Again, how intuitive is this to interpret!

Let’s do one more. Here is employment managerial status:

plot_predictions(mod, condition = "managerial", type = "probs") + 
  facet_wrap(~factor(group, levels = c("None", "Not much", "Some", "A good deal")), 
             nrow = 1) +
  theme(axis.text.x = element_text(angle = 90))

We see that respondents in middle and upper managerial positions may be more likely to express ‘A good deal’ of political interest than others. Interestingly, probabilities of rating ‘Some’, ‘Not much’, and ‘None’ are quite similar between managerial positions.

A more specific example

The above plots are both interesting and useful, but we need to go deeper. In other words, this is where the fun begins8.

Consider the following hypothetical AES respondent: A man who studied for 12 years at university, currently works full-time for pay, is a middle manager at work, is bisexual, and is 30 years old. We can define this individual in a data frame:

respondent <- data.frame(years_tertiary_study = 12, work = "Working full-time for pay",
                         managerial = "Middle managerial", sexual_orientation = "Bisexual",
                         age = 30, gender = "Male")

We can easily calculate the expected predicted probability for each outcome level (political interest) for this person:

predictions(mod, newdata = respondent)
## 
##        Group Estimate Std. Error     z Pr(>|z|)     S    2.5 %  97.5 %
##  None         0.00234   0.000933  2.51  0.01222   6.4 0.000509 0.00417
##  Not much     0.02609   0.009061  2.88  0.00399   8.0 0.008328 0.04385
##  Some         0.18370   0.048728  3.77  < 0.001  12.6 0.088195 0.27920
##  A good deal  0.78787   0.058370 13.50  < 0.001 135.5 0.673472 0.90228
##  years_tertiary_study                      work        managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##  sexual_orientation age gender
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
## 
## Columns: rowid, group, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

We see that the probability this person rates their political interest as ‘A good deal’ is \(79\%\), ‘Some’ is \(18\%\), ‘Not much’ is \(3\%\), and ‘None’ is almost \(0\%\). How cool is that!

Slopes

Slopes. Partial derivatives. Marginal effects. The rate at which the probability of our outcome changes with respect to a given variable.

Calculation of these quantities is made easy in marginaleffects through the slopes() function. Continuing with our hypothetical respondent, we can calculate the slopes for a given variable. Here is age:

slopes(mod, variables = "age", newdata = respondent)
## 
##        Group Term   Estimate Std. Error     z Pr(>|z|)    S     2.5 %    97.5 %
##  None         age -0.0000723  0.0000302 -2.39  0.01680  5.9 -0.000132 -0.000013
##  Not much     age -0.0007839  0.0002842 -2.76  0.00581  7.4 -0.001341 -0.000227
##  Some         age -0.0043257  0.0009707 -4.46  < 0.001 16.9 -0.006228 -0.002423
##  A good deal  age  0.0051820  0.0012653  4.10  < 0.001 14.5  0.002702  0.007662
##  years_tertiary_study                      work        managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##  sexual_orientation age gender
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
## 
## Columns: rowid, group, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

And here is years of tertiary study:

slopes(mod, variables = "years_tertiary_study", newdata = respondent)
## 
##        Group                 Term  Estimate Std. Error     z Pr(>|z|)    S
##  None        years_tertiary_study -0.000231  0.0000873 -2.64  0.00820  6.9
##  Not much    years_tertiary_study -0.002502  0.0007962 -3.14  0.00167  9.2
##  Some        years_tertiary_study -0.013808  0.0027294 -5.06  < 0.001 21.2
##  A good deal years_tertiary_study  0.016541  0.0035167  4.70  < 0.001 18.6
##      2.5 %     97.5 % years_tertiary_study                      work
##  -0.000402 -0.0000597                   12 Working full-time for pay
##  -0.004063 -0.0009418                   12 Working full-time for pay
##  -0.019158 -0.0084586                   12 Working full-time for pay
##   0.009649  0.0234341                   12 Working full-time for pay
##         managerial sexual_orientation age gender
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
## 
## Columns: rowid, group, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

Alternatively, we can calculate average marginal effects using avg_slopes(). This is similar to the slopes() call above, except that the computation is repeated for every row of the original dataset and the average slope for each level of the outcome variable is then reported:

avg_slopes(mod)
## 
##        Group                 Term
##  None        years_tertiary_study
##  None        work                
##  None        work                
##  None        work                
##  None        work                
##                                                                            Contrast
##  dY/dX                                                                             
##  Family or caring responsibilities - A full-time school or university student      
##  Retired from paid work - A full-time school or university student                 
##  Unemployed - looking for full-time work - A full-time school or university student
##  Unemployed - looking for part-time work - A full-time school or university student
##  Estimate Std. Error      z Pr(>|z|)    S     2.5 %    97.5 %
##  -0.00151   0.000365 -4.126   <0.001 14.7 -0.002222 -0.000791
##   0.01152   0.005575  2.066   0.0388  4.7  0.000591  0.022446
##   0.00461   0.003619  1.273   0.2030  2.3 -0.002486  0.011700
##   0.00406   0.006257  0.649   0.5165  1.0 -0.008205  0.016323
##   0.00823   0.007811  1.054   0.2918  1.8 -0.007075  0.023545
## --- 58 rows omitted. See ?print.marginaleffects --- 
##  A good deal gender              
##  A good deal sexual_orientation  
##  A good deal sexual_orientation  
##  A good deal sexual_orientation  
##  A good deal age                 
##                                                                            Contrast
##  Other - Female                                                                    
##  Gay or Lesbian - Bisexual                                                         
##  Heterosexual or Straight - Bisexual                                               
##  Other - Bisexual                                                                  
##  dY/dX                                                                             
##  Estimate Std. Error      z Pr(>|z|)    S     2.5 %    97.5 %
##   0.47911   0.134333  3.567   <0.001 11.4  0.215821  0.742398
##  -0.10433   0.085494 -1.220   0.2224  2.2 -0.271892  0.063240
##  -0.23807   0.060119 -3.960   <0.001 13.7 -0.355904 -0.120243
##  -0.36413   0.104987 -3.468   <0.001 10.9 -0.569901 -0.158358
##   0.00661   0.000817  8.094   <0.001 50.6  0.005012  0.008215
## Columns: group, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high

Together, these functions enable us to infer a great many things about how the rate of change of the probability of outcome category membership varies as a function of a given predictor variable. Useful stuff!
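There is also a plotting helper in the same spirit as plot_predictions(). As a sketch (this call is my assumption, mirroring the prediction plots above), we could visualise how the marginal effect of age itself varies across the age range for each outcome level:

plot_slopes(mod, variables = "age", condition = "age", type = "probs") + 
  facet_wrap(~factor(group, levels = c("None", "Not much", "Some", "A good deal")), 
             nrow = 1)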

Comparisons

So far, we have computed adjusted predictions and slopes. What else can we do? A whole lot, but for now we’re just going to explore comparisons. We are going to continue using our hypothetical AES respondent from above.

We can ask lots of interesting questions about this person from a comparative lens, such as ‘what happens to the predicted probability of their political interest if we increased their age by 1?’ Or ‘what happens to the predicted probability of their political interest if they were instead female?’ The comparisons function provides answers to these questions.

Let’s calculate the comparisons first:

comps <- comparisons(mod, newdata = respondent)
head(comps)
## 
##  Group                 Term
##   None years_tertiary_study
##   None work                
##   None work                
##   None work                
##   None work                
##   None work                
##                                                                            Contrast
##  +1                                                                                
##  Family or caring responsibilities - A full-time school or university student      
##  Retired from paid work - A full-time school or university student                 
##  Unemployed - looking for full-time work - A full-time school or university student
##  Unemployed - looking for part-time work - A full-time school or university student
##  Working full-time for pay - A full-time school or university student              
##   Estimate Std. Error      z Pr(>|z|)   S     2.5 %     97.5 %
##  -0.000231  0.0000874 -2.644   0.0082 6.9 -0.000402 -0.0000597
##   0.001738  0.0010737  1.619   0.1055 3.2 -0.000366  0.0038425
##   0.000687  0.0006028  1.140   0.2542 2.0 -0.000494  0.0018689
##   0.000605  0.0009575  0.632   0.5274 0.9 -0.001271  0.0024818
##   0.001236  0.0012819  0.964   0.3350 1.6 -0.001277  0.0037485
##   0.001113  0.0006367  1.747   0.0806 3.6 -0.000135  0.0023604
##  years_tertiary_study                      work        managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##                    12 Working full-time for pay Middle managerial
##  sexual_orientation age gender
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
## 
## Columns: rowid, group, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

That’s a lot of information! Let’s narrow it down to make it a bit easier to interrogate. Here is age:

comps %>%
  filter(term == "age")
## 
##        Group Term Contrast   Estimate Std. Error     z Pr(>|z|)    S     2.5 %
##  None         age       +1 -0.0000723  0.0000303 -2.39  0.01680  5.9 -0.000132
##  Not much     age       +1 -0.0007841  0.0002842 -2.76  0.00581  7.4 -0.001341
##  Some         age       +1 -0.0043260  0.0009706 -4.46  < 0.001 16.9 -0.006228
##  A good deal  age       +1  0.0051824  0.0012654  4.10  < 0.001 14.5  0.002702
##     97.5 % years_tertiary_study                      work        managerial
##  -0.000013                   12 Working full-time for pay Middle managerial
##  -0.000227                   12 Working full-time for pay Middle managerial
##  -0.002424                   12 Working full-time for pay Middle managerial
##   0.007662                   12 Working full-time for pay Middle managerial
##  sexual_orientation age gender
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
##            Bisexual  30   Male
## 
## Columns: rowid, group, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

We see that an increase of 1 year of age (as evidenced by Contrast being +1) is associated with decreases in the probability of rating political interest as ‘None’, ‘Not much’, and ‘Some’, but an increase in the probability of rating ‘A good deal’ by about \(0.5\) percentage points. Neat!
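A one-year bump is quite small, though. comparisons() also accepts a named list giving the size of the change for a focal variable (at least in recent versions of marginaleffects, so treat this as a sketch), which lets us ask the same question for, say, a 10-year increase in age; because the model is nonlinear, this is not simply ten times the one-year effect:

comparisons(mod, variables = list(age = 10), newdata = respondent)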

What about gender? Let’s focus on the contrast between Male and Female:

comps %>%
  filter(term == "gender") %>%
  filter(grepl("Male", contrast))
## 
##        Group   Term      Contrast Estimate Std. Error     z Pr(>|z|)    S
##  None        gender Male - Female -0.00115   0.000511 -2.25   0.0245  5.4
##  Not much    gender Male - Female -0.01229   0.004782 -2.57   0.0102  6.6
##  Some        gender Male - Female -0.06123   0.016425 -3.73   <0.001 12.3
##  A good deal gender Male - Female  0.07467   0.021208  3.52   <0.001 11.2
##     2.5 %    97.5 % years_tertiary_study                      work
##  -0.00215 -0.000148                   12 Working full-time for pay
##  -0.02166 -0.002919                   12 Working full-time for pay
##  -0.09342 -0.029034                   12 Working full-time for pay
##   0.03310  0.116233                   12 Working full-time for pay
##         managerial sexual_orientation age gender
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
##  Middle managerial           Bisexual  30   Male
## 
## Columns: rowid, group, term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, years_tertiary_study, work, managerial, sexual_orientation, age, gender, A1

Remembering that the contrast here is Male - Female, we see that being male rather than female (keeping everything else the same) is associated with small decreases in the probability of rating political interest as ‘None’, ‘Not much’, and ‘Some’, but an increase in the probability of rating ‘A good deal’ of about \(7\) percentage points. Hopefully now it’s evident just how intuitive complex models like ordinal regression become when interpreted using appropriate tools!

Final thoughts

Thanks for making it to the end! Hopefully this post has inspired you to leave the world of coefficient interpretation in favour of more desirable shores. As an aside, this dataset is incredibly interesting, so maybe I will revisit it sometime. There is a panel data version where respondents across time were tracked, so maybe I will do something with that!

As always, please feel free to reach out to me via email if you have any questions or just want to talk about stats.


  1. Agresti, A. (2002). Categorical Data Analysis. Second edition. Wiley.↩︎

  2. https://online.stat.psu.edu/stat504/lesson/8/8.4↩︎

  3. But not always. Sometimes you may have use for values on the link scale or otherwise.↩︎

  4. The beauty of outcome scales extends to other models, too. For example, marginal effects calculated on a Poisson model will give you results in terms of counts.↩︎

  5. This is my high-level summary term for what the survey item measures.↩︎

  6. I did do some for my own sanity, however.↩︎

  7. Although I think everyone should go Bayes and use Stan or brms.↩︎

  8. Okay, okay. I’ll stop!↩︎

timeseries

Incremental complexity: A forecaster’s odyssey

23 minute read

Published:
