Interpretable time-series modelling using Gaussian processes (Stan’s version)


Ahoy! Welcome back to my nascent blog. It turned out that quite a few people found my last post on interpretable time-series modelling using Gaussian processes useful and/or informative, so I felt sufficiently motivated to write a follow-up piece¹. In the first article, I delved into the underlying mechanics of Gaussian processes (GPs) and built them from the ground up using linear algebra. The result was a nice, principled way of decomposing a time series into its structural components (e.g., trend, seasonality, noise) by specifying an additive covariance function (kernel) that captures each of the relevant components. Put another way, we modelled a time series in an interpretable way². To make it easier for readers to explore the workflow for themselves, I have since improved the code and turned it into a small R package called tsgp.

The overall approach was neat, and really demonstrated the value of transparency in the modelling process. But despite its elegance, the approach had a major drawback: it was quite restrictive. What do I mean by that? Well, the core code in tsgp (the final product of the previous article) is something like:

GP(x1, xprime = x1, y, CovarianceFunction, noise = 0.8,
   sigma_1 = 5, sigma_2 = 1, l_1 = 75, l_2 = 1, p = 25)

where x1 is our input vector of \(x\) values, xprime is our target vector of \(x'\) values to predict \(y\) values for, y is our vector of observed values to learn from, CovarianceFunction is our custom covariance function composed of the sum of an exponentiated quadratic (squared exponential) kernel and a periodic kernel, noise is the scaling factor of uncertainty to include, and the remaining arguments are hyperparameters of the two covariance functions contained within CovarianceFunction.
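
To make the structure of that additive kernel concrete, here is a rough sketch of the maths it encodes (this is only an illustration, not the actual tsgp internals): the exponentiated quadratic and periodic kernels are computed separately and simply summed element-wise.

# Illustrative sketch of an additive kernel, not the exact tsgp implementation
cov_exp_quad <- function(x, xprime, sigma, l) {
  sigma^2 * exp(-outer(x, xprime, "-")^2 / (2 * l^2))
}

cov_periodic <- function(x, xprime, sigma, l, p) {
  sigma^2 * exp(-2 * sin(pi * abs(outer(x, xprime, "-")) / p)^2 / l^2)
}

cov_additive <- function(x, xprime, sigma_1, l_1, sigma_2, l_2, p) {
  cov_exp_quad(x, xprime, sigma_1, l_1) + cov_periodic(x, xprime, sigma_2, l_2, p)
}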

Why is this restrictive? Well, we have clearly set fixed values for the covariance function hyperparameters. While we could certainly optimise them, for example by maximising the marginal likelihood, in practice we often have great uncertainty about what the parameters might actually be. An exception to this might be the period argument of the periodic kernel: for monthly data we would set \(p = 12\), or for daily data \(p = 7\). For real-world datasets, we might just have a hunch (or knowledge from previous research) of the likely range of plausible values for the hyperparameters. For example, we might expect the variance (\(\sigma^{2}\)) of the exponentiated quadratic kernel for the temporal trend to be somewhere around \(2-3\), but not know it precisely. We could express this uncertainty as a probability distribution, such as a normal distribution like \(\mathcal{N}(2,1)\).
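
As a quick sanity check on that choice, the quantiles of \(\mathcal{N}(2,1)\) show that it places roughly 95% of its mass between about 0 and 4, which comfortably covers our hunch of \(2-3\) while leaving room to be wrong:

# Where does a Normal(2, 1) prior place most of its mass?
qnorm(c(0.025, 0.5, 0.975), mean = 2, sd = 1)
# roughly 0.04, 2.00 and 3.96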

Further, the previous approach assumed a time series of continuous values. But what if our data were discrete (e.g., counts of homeless people, or binary values for whether a manufacturing line had a fault or not)? Well, clearly we need to extend our capabilities.

Therefore, this article endeavours to answer the question: how can we extend the interpretable modelling workflow presented in the first article to incorporate uncertainty and account for additional complexity?

Spoiler alert, the answer is we use Stan!

Stan

Stan is a probabilistic programming language written in C++ with interfaces to most major scientific computing languages, such as R, Python, and Julia. It is an utterly remarkable achievement, and even has its own automatic differentiation library (stanmath). Stan is an excellent resource, and has numerous incredible packages built on top of it to make Bayesian inference more accessible, such as brms.

Stan’s appeal (at least to me!) is the honestly astounding amount of flexibility it affords the modeller. Stan can be used to do anything from Bayesian linear regression, to mixed-effects regression modelling, to ordinary differential equations. It is quite possibly the closest thing to simply writing down the mathematical structure of a model. Naturally, with a tool this flexible, the barrier to entry or perceived complexity can be high. Both of those things are true. But there are many, many excellent resources to help newcomers to the language, such as the comprehensive Stan User’s Guide.

At its core, Stan uses Markov chain Monte Carlo to sample from the posterior distribution, rather than solving the model analytically (as we did in my last post)³. This facilitates the high degree of flexibility and permits excellent model diagnostics, but it also introduces a rather large computational cost, especially for complex models such as GPs, or for models trained on large datasets. However, in most cases, modelling something appropriately with proper quantification of uncertainty is far more important than getting an ‘answer’ quickly.

Alright, enough chatter. Let’s code! Here are the packages we will need:

library(dplyr)
library(ggplot2)
library(tidyr)
library(rstan)
library(bayesplot)

Replicating the model from the last article

Recall that last time we simulated a time series with temporal dynamics commonly encountered in applied settings. Specifically, we generated a noisy sine wave (i.e., to emulate periodic or seasonal data) with a linearly increasing trend:

set.seed(123)

# Noisy sine wave (seasonal component) plus a linearly increasing trend
y <- 3 * sin(2 * seq(0, 4 * pi, length.out = 100)) + runif(100) * 2
trend <- 0.08 * seq(from = 1, to = 100, by = 1)
y <- y + trend

# Collect into a data frame for plotting
tmp <- data.frame(timepoint = 1:length(y), y = y)

# Draw plot

ggplot(data = tmp) +
  geom_point(aes(x = timepoint, y = y), colour = "black") +
  geom_line(aes(x = timepoint, y = y), colour = "black") +
  labs(x = "Timepoint",
       y = "Values") +
  theme_bw()

Let’s assume we want to predict \(y\) values for a set of points over this time domain, such as \(50\) equidistant points between \(1\) and \(100\), just like we did at the end of the previous article. Similarly, we still want to use an additive kernel that combines a covariance function for the trend (which we know is linear, but we will use an exponentiated quadratic because non-linearity is more common in the real world) and a periodic kernel for the seasonality (which we showed last time has a period of \(p = 25\)). However, since we are now using Stan, we can generate posterior samples for all the hyperparameters as well as the values we want to predict, and interrogate the model in considerable depth.

Without further ado, here is the GP in Stan:

stan_code <- "
data {
  int<lower=1> N1;
  array[N1] real x1;
  vector[N1] y1;
  int<lower=1> N2;
  array[N2] real x2;
  real p_periodic;
}

transformed data {
  real delta = 1e-9;
  int<lower=1> N = N1 + N2;
  array[N] real x;
  
  for (n1 in 1:N1) {
    x[n1] = x1[n1];
  }
  for (n2 in 1:N2) {
    x[N1 + n2] = x2[n2];
  }
}

parameters {
  real<lower=0> sigma;
  real<lower=0> sigma_trend;
  real<lower=0> l_trend;
  real<lower=0> sigma_periodic;
  real<lower=0> l_periodic;
  vector[N] eta;
}

transformed parameters {
  vector[N] f;
  {
    matrix[N, N] L_K;
    matrix[N, N] K = gp_exp_quad_cov(x, sigma_trend, l_trend)
                     + gp_periodic_cov(x, sigma_periodic, l_periodic, p_periodic);

    // Diagonal elements
    
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }

    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }
}

model {

  // Priors

  sigma ~ std_normal();
  eta ~ std_normal();
  sigma_trend ~ std_normal();
  l_trend ~ std_normal();
  sigma_periodic ~ std_normal();
  l_periodic ~ std_normal();
  
  // Likelihood

  y1 ~ normal(f[1:N1], sigma);
}

generated quantities {
  vector[N2] y2;
  
  for (n2 in 1:N2) {
    y2[n2] = normal_rng(f[N1 + n2], sigma);
  }
}
"

That’s a lot! Let’s break it down block-by-block.

Structure of the Stan model

As mentioned previously, Stan is written in C++, and its syntax reflects that heritage: variables are statically typed (e.g., real, int, vector) and every statement ends with a ;. I won’t go into too much detail on the syntax here since there are plenty of better resources out there.

data block

This is where we tell Stan what inputs we want to pass in ourselves. In our case, we are specifying inputs for the training data x1 (and its size N1), the data to predict values for x2 (and its size N2), the observations y1, and the period for the seasonal kernel p_periodic.
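On the R side, these inputs are supplied as a named list whose element names match the data block declarations exactly. We build precisely such a list when we call rstan::stan further below (x1 and xprime are the input and prediction vectors defined in the next section):

# Element names must match the Stan data block
stan_data <- list(N1 = length(x1), x1 = x1, y1 = y,
                  N2 = length(xprime), x2 = xprime,
                  p_periodic = 25)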

parameters block

This is where we declare the parameters we want Stan to sample posterior values for. We are declaring all of our covariance function hyperparameters (except the period for the seasonal kernel, since we know it), as well as an error term sigma and a vector eta of standard normal variates that we will transform via the Cholesky factor. Note that we are fixing the lower bounds of the parameters to be \(0\). This both ensures that terms such as the standard deviation of the Gaussian likelihood (sigma) are valid (i.e., non-negative) and helps the model sample more efficiently.

transformed parameters block

This is where we do any calculations that come before the prior and likelihood sampling. In our case, we perform the covariance function summation here and add the tiny constant delta to the diagonal for numerical stability. We also make use of the Cholesky decomposition for computational efficiency. Note that most of the commonly-encountered covariance functions are already included as functions in Stan. Here we make use of gp_exp_quad_cov for the exponentiated quadratic kernel (i.e., trend) and gp_periodic_cov for the periodic kernel (i.e., seasonality).
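
The f = L_K * eta construction is the non-centred parameterisation: because eta is a vector of independent standard normals and L_K is the lower-triangular Cholesky factor of K, the product L_K * eta is a draw from a multivariate normal with covariance K, which is typically much friendlier to the sampler than declaring f as multivariate normal directly. A minimal R sketch of the same idea, using arbitrary hyperparameter values:

# Non-centred parameterisation sketch (arbitrary hyperparameters)
x_demo <- 1:10
K <- 1^2 * exp(-outer(x_demo, x_demo, "-")^2 / (2 * 3^2))  # exponentiated quadratic
K <- K + diag(1e-9, length(x_demo))                        # jitter on the diagonal
L_K <- t(chol(K))            # chol() returns the upper triangle, so transpose
eta <- rnorm(length(x_demo)) # independent standard normals
f <- L_K %*% eta             # one draw from MVN(0, K)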

model block

This is where we define our priors and the likelihood. For this example, we are just using standard normal priors for all parameters.
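
Note that because each parameter is declared with <lower=0>, a std_normal() prior effectively acts as a half-normal prior. A quick way to see what that implies:

# A standard normal prior on a <lower=0> parameter behaves like a half-normal
prior_draws <- abs(rnorm(1e5))
quantile(prior_draws, probs = c(0.5, 0.95))
# roughly 0.67 and 1.96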

generated quantities block

This is where the magic happens! Almost anything can be calculated here (e.g., test-set predictions, pointwise log-likelihood, etc.), and it is not uncommon for real-world models to make complete use of its flexibility. For our purposes, we are just going to draw posterior predictive samples for each time point we wish to predict values for. These are drawn from a normal random number generator (‘rng’) parameterised by the latent GP values f and the error standard deviation sigma.
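
For intuition, the same posterior predictive draws could be computed after the fact in R, given posterior draws of f and sigma; the generated quantities block simply does it inside Stan as part of sampling. A sketch of the equivalent R logic (runnable once the model further below has been fitted):

# Post-hoc equivalent of the generated quantities block (after fitting 'model')
post <- rstan::extract(model)                     # list of posterior draws
idx <- (length(x1) + 1):(length(x1) + length(xprime))
y2_manual <- sapply(idx, function(n) rnorm(nrow(post$f), post$f[, n], post$sigma))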

Fitting the Stan model in R

First, we will define our \(x\) and \(x'\) vectors:

x1 <- 1:100
xprime <- seq(from = 1, to = 100, by = 2)

Now we can fit the model. In order to do this, we need to pass a few things to rstan::stan: the Stan model code, the input data to satisfy the requirements of the data block, and whatever other MCMC settings we want. I’m going to turn on parallel processing so the chains run simultaneously, and increase the number of iterations per chain so that the effective sample size in the distribution’s tails is large enough for appropriate inference.

Fair warning, this will take a while!

options(mc.cores = parallel::detectCores())

model <- stan(model_code = stan_code, 
              data = list(N1 = length(x1), x1 = x1, y1 = y, N2 = length(xprime), x2 = xprime,
                          p_periodic = 25), 
              iter = 5000, chains = 3, seed = 123)

Beautiful, we now have everything we need! Let’s convert our model into a data frame and pull out the posterior samples for the predicted \(y\) values for each of our \(x'\) time points (the other columns are the posterior draws for all our parameters!):

draws <- as.data.frame(model)
preds <- draws[, grepl("y2", names(draws))]
colnames(preds) <- paste0("y_", xprime)
preds[1:5, 1:5]
##         y_1      y_3      y_5      y_7      y_9
## 1 1.4010139 2.231305 2.663115 4.757358 4.701955
## 2 0.5189433 3.760109 3.763193 4.743285 3.802899
## 3 1.2850740 2.429294 3.985677 4.408179 4.405878
## 4 1.7550481 1.413475 3.353406 3.866813 4.384098
## 5 0.9879761 3.002428 3.774711 4.327413 3.496659

From here, we can do several things. The first might be to replicate the nice graph from last time, showing the actual data with the posterior predictions overlaid. Here is the posterior mean and its associated \(95\%\) credible interval:

prob <- 0.95
u_bound <- ((prob + 1) / 2)
l_bound <- u_bound - prob
  
preds <- preds %>%
  mutate(draw = dplyr::row_number()) %>%
  pivot_longer(cols = 1:ncol(preds), names_to = "timepoint", values_to = "values") %>%
  mutate(timepoint = as.integer(gsub(".*_", "", timepoint))) %>%
  reframe(.mean = mean(values),
          .lower = quantile(values, probs = l_bound),
          .upper = quantile(values, probs = u_bound),
          .by = "timepoint")

ggplot() +
  geom_ribbon(data = preds, aes(x = timepoint, ymin = .lower, ymax = .upper, group = 1), 
              fill = "steelblue2", alpha = 0.3) +
  geom_line(data = tmp, aes(x = timepoint, y = y), colour = "black") +
  geom_point(data = tmp, aes(x = timepoint, y = y), colour = "black") +
  geom_line(data = preds, aes(x = timepoint, y = .mean), colour = "steelblue2", size = 1) +
  labs(title = paste0("Posterior mean and ", round(prob * 100, digits = 0), "% credible interval"),
       subtitle = paste0("Interval calculated over ", nrow(draws), " samples"),
       x = "Timepoint",
       y = "Values") +
  theme_bw()

It appears that our model does quite well in predicting plausible values for each of the \(50\) time points. So, where do we go now? Well, while this post won’t go into model diagnostics in detail, we can at least explore a few important visualisations.

Here are trace plots for the parameters, which we use to assess chain convergence (we want these to look like white noise, with no visible trends or patterns). We'll just generate the first ten as an example:

traceplot(model)
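
Since we loaded bayesplot earlier, an alternative is to plot only the parameters we actually care about here, namely the covariance function hyperparameters and sigma:

# Trace plots for the hyperparameters only
mcmc_trace(as.array(model),
           pars = c("sigma", "sigma_trend", "l_trend", "sigma_periodic", "l_periodic"))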

We can also visualise the posterior distributions for the parameters. All appear consistent with our expectations given the standard normal priors we specified in the model. For those who read the first post, you may recall that we fixed the hyperparameters. Here we have estimated entire probability distributions for them based on our prior beliefs and the actual observations. Very cool.

as.data.frame(model) %>%
  dplyr::select(c(sigma, sigma_trend, l_trend, sigma_periodic, l_periodic)) %>%
  mutate(iteration = row_number()) %>%
  pivot_longer(cols = sigma:l_periodic, names_to = "parameter", values_to = "values") %>%
  ggplot(aes(x = values)) +
  geom_histogram(aes(y = after_stat(density)), fill = "steelblue2", alpha = 0.9) +
  labs(x = "Parameter value",
       y = "Probability density") +
  theme_bw() +
  facet_wrap(~parameter, scales = "free")
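
The plots can be complemented with numerical summaries (posterior means, credible intervals, R-hat, and effective sample sizes), which rstan's print method provides:

# Numerical posterior summaries plus Rhat and effective sample sizes
print(model, pars = c("sigma", "sigma_trend", "l_trend", "sigma_periodic", "l_periodic"),
      probs = c(0.025, 0.5, 0.975))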

With all of that, we have successfully implemented a GP in Stan that can generate plausible predictions, just like the first article. While the computation time here was orders of magnitude longer, we also exercised far more control over the incorporation and propagation of uncertainty, and have considerably more information at our disposal through which to evaluate the model.

Final thoughts

Thank you for making it to the end! This concludes the short series on simple GPs. In the future, I might write on common extensions, such as latent GPs for modelling non-normal outcomes like counts or binary categories.

Again, for more information on GPs, please check out these much better resources:

  • Gaussian processes for machine learning
  • Gaussian Processes for Timeseries Modelling
  • Robust Gaussian Process Modelling
  • Kernel Cookbook
  • Statistical Rethinking 2022 lecture series

Or, feel free to reach out to me via email.


  1. Who am I kidding, I would have written it anyway!

  2. There will be no black boxes in this blog!

  3. Specifically, Stan implements the Hamiltonian Monte Carlo (HMC) algorithm for full Bayesian inference and uses its adaptive variant known as the no-U-turn sampler (NUTS).