Interpretable time-series modelling using Gaussian processes (Stan’s version)


Ahoy! Welcome back to my nascent blog. It turned out that quite a few people found my last post on interpretable time-series modelling using Gaussian processes useful and/or informative, so I felt sufficiently motivated to write a follow-up piece¹. In the first article, I delved into the underlying mechanics of Gaussian processes (GPs) and built them from the ground up using linear algebra. The result was a nice, principled way of decomposing a time series into its structural components (e.g., trend, seasonality, noise) by specifying an additive covariance function (kernel) that captures each of the relevant components. Put another way, we modelled a time series in an interpretable way². To make it easier for readers to explore the workflow for themselves, I have since improved the code and turned it into a small R package called tsgp.

The overall approach was neat, and really demonstrated the value of transparency in the modelling process. But despite its elegance, the approach had a major drawback: it was quite restrictive. What do I mean by that? Well, the core code in tsgp (the final product of the previous article) is something like:

GP(x1, xprime = x1, y, CovarianceFunction, noise = 0.8,
   sigma_1 = 5, sigma_2 = 1, l_1 = 75, l_2 = 1, p = 25)

where x1 is our input vector of \(x\) values, xprime is our target vector of \(x'\) values to predict \(y\) values for, y is our vector of observed values to learn from, CovarianceFunction is our custom covariance function composed of the sum of an exponentiated quadratic (squared exponential) kernel and a periodic kernel, noise is the scaling factor of uncertainty to include, and the remaining arguments are hyperparameters of the two covariance functions contained within CovarianceFunction.
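
To make the structure of that additive kernel concrete, here is a rough sketch of the maths it encodes (this is only an illustration, not the actual tsgp internals): the exponentiated quadratic and periodic kernels are computed separately and simply summed element-wise.

# Illustrative sketch of an additive kernel, not the exact tsgp implementation
cov_exp_quad <- function(x, xprime, sigma, l) {
  sigma^2 * exp(-outer(x, xprime, "-")^2 / (2 * l^2))
}

cov_periodic <- function(x, xprime, sigma, l, p) {
  sigma^2 * exp(-2 * sin(pi * abs(outer(x, xprime, "-")) / p)^2 / l^2)
}

cov_additive <- function(x, xprime, sigma_1, l_1, sigma_2, l_2, p) {
  cov_exp_quad(x, xprime, sigma_1, l_1) + cov_periodic(x, xprime, sigma_2, l_2, p)
}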

Why is this restrictive? Well, we have clearly set fixed values for the covariance function hyperparameters. While we could certainly optimise them, for example by maximising the marginal likelihood, in practice we often have great uncertainty about what the parameters might actually be. An exception to this might be the period argument of the periodic kernel: for monthly data we would set \(p = 12\), or for daily data \(p = 7\). For real-world datasets, we might just have a hunch (or knowledge from previous research) of the likely range of plausible values for the hyperparameters. For example, we might expect the variance (\(\sigma^{2}\)) of the exponentiated quadratic kernel for the temporal trend to be somewhere around \(2-3\), but not know it precisely. We could express this uncertainty as a probability distribution, such as a normal distribution like \(\mathcal{N}(2,1)\).
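
As a quick sanity check on that choice, the quantiles of \(\mathcal{N}(2,1)\) show that it places roughly 95% of its mass between about 0 and 4, which comfortably covers our hunch of \(2-3\) while leaving room to be wrong:

# Where does a Normal(2, 1) prior place most of its mass?
qnorm(c(0.025, 0.5, 0.975), mean = 2, sd = 1)
# roughly 0.04, 2.00 and 3.96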

Further, the previous approach assumed a time series of continuous values. But what if our data were discrete (e.g., counts of homeless people, or binary values for whether a manufacturing line had a fault or not)? Well, clearly we need to extend our capabilities.

Therefore, this article endeavours to answer the question: how can we extend the interpretable modelling workflow presented in the first article to incorporate uncertainty and account for additional complexity?

Spoiler alert, the answer is we use Stan!

Stan

Stan is a probabilistic programming language written in C++ with interfaces to most major scientific computing languages, such as R, Python, and Julia. It is an utterly remarkable achievement, and even has its own automatic differentiation library (stanmath). Stan is an excellent resource, and has numerous incredible packages built on top of it to make Bayesian inference more accessible, such as brms.

Stan’s appeal (at least to me!) is the honestly astounding amount of flexibility it affords the modeller. Stan can be used to do anything from Bayesian linear regression, to mixed-effects regression modelling, to ordinary differential equations. It is quite possibly the closest thing to simply writing down the mathematical structure of a model. Naturally, with a tool this flexible, the barrier to entry or perceived complexity can be high. Both of those things are true. But there are many, many excellent resources to help newcomers to the language, such as the comprehensive Stan User’s Guide.

At its core, Stan uses Markov chain Monte Carlo to sample from the posterior distribution, rather than solving the model analytically (as we did in my last post)³. This facilitates the high degree of flexibility and permits excellent model diagnostics, but it also introduces a rather large computational cost, especially for complex models such as GPs, or for models trained on large datasets. However, in most cases, modelling something appropriately with proper quantification of uncertainty is far more important than getting an ‘answer’ quickly.

Alright, enough chatter. Let’s code! Here are the packages we will need:

library(dplyr)
library(ggplot2)
library(tidyr)
library(rstan)
library(bayesplot)

Replicating the model from the last article

Recall that last time we simulated a time series with temporal dynamics commonly encountered in applied settings. Specifically, we generated a noisy sine wave (i.e., to emulate periodic or seasonal data) with a linearly increasing trend:

set.seed(123)

# Noisy sine wave (seasonal component) plus a linearly increasing trend
y <- 3 * sin(2 * seq(0, 4 * pi, length.out = 100)) + runif(100) * 2
trend <- 0.08 * seq(from = 1, to = 100, by = 1)
y <- y + trend

# Collect into a data frame for plotting
tmp <- data.frame(timepoint = 1:length(y), y = y)

# Draw plot

ggplot(data = tmp) +
  geom_point(aes(x = timepoint, y = y), colour = "black") +
  geom_line(aes(x = timepoint, y = y), colour = "black") +
  labs(x = "Timepoint",
       y = "Values") +
  theme_bw()

Let’s assume we want to predict \(y\) values for a set of points over this time domain, such as \(50\) equidistant points between \(1\) and \(100\), just like we did at the end of the previous article. Similarly, we still want to use an additive kernel that combines a covariance function for the trend (which we know is linear, but we will use an exponentiated quadratic because non-linearity is more common in the real world) and a periodic kernel for the seasonality (which we showed last time has a period of \(p = 25\)). However, since we are now using Stan, we can generate posterior samples for all the hyperparameters as well as the values we want to predict, and interrogate the model in considerable depth.

Without further ado, here is the GP in Stan:

stan_code <- "
data {
  int<lower=1> N1;
  array[N1] real x1;
  vector[N1] y1;
  int<lower=1> N2;
  array[N2] real x2;
  real p_periodic;
}

transformed data {
  real delta = 1e-9;
  int<lower=1> N = N1 + N2;
  array[N] real x;
  
  for (n1 in 1:N1) {
    x[n1] = x1[n1];
  }
  for (n2 in 1:N2) {
    x[N1 + n2] = x2[n2];
  }
}

parameters {
  real<lower=0> sigma;
  real<lower=0> sigma_trend;
  real<lower=0> l_trend;
  real<lower=0> sigma_periodic;
  real<lower=0> l_periodic;
  vector[N] eta;
}

transformed parameters {
  vector[N] f;
  {
    matrix[N, N] L_K;
    matrix[N, N] K = gp_exp_quad_cov(x, sigma_trend, l_trend)
                     + gp_periodic_cov(x, sigma_periodic, l_periodic, p_periodic);

    // Diagonal elements
    
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }

    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }
}

model {

  // Priors

  sigma ~ std_normal();
  eta ~ std_normal();
  sigma_trend ~ std_normal();
  l_trend ~ std_normal();
  sigma_periodic ~ std_normal();
  l_periodic ~ std_normal();
  
  // Likelihood

  y1 ~ normal(f[1:N1], sigma);
}

generated quantities {
  vector[N2] y2;
  
  for (n2 in 1:N2) {
    y2[n2] = normal_rng(f[N1 + n2], sigma);
  }
}
"

That’s a lot! Let’s break it down block-by-block.

Structure of the Stan model

As mentioned previously, Stan is written in C++, and its syntax reflects that heritage: variables are statically typed (e.g., real, int, vector) and every statement ends with a ;. I won’t go into too much detail on the syntax here since there are plenty of better resources out there.

data block

This is where we tell Stan what inputs we want to pass in ourselves. In our case, we are specifying inputs for the training data x1 (and its size N1), the data to predict values for x2 (and its size N2), the observations y1, and the period for the seasonal kernel p_periodic.
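On the R side, these inputs are supplied as a named list whose element names match the data block declarations exactly. We build precisely such a list when we call rstan::stan further below (x1 and xprime are the input and prediction vectors defined in the next section):

# Element names must match the Stan data block
stan_data <- list(N1 = length(x1), x1 = x1, y1 = y,
                  N2 = length(xprime), x2 = xprime,
                  p_periodic = 25)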

parameters block

This is where we declare the parameters we want Stan to sample posterior values for. We are declaring all of our covariance function hyperparameters (except the period for the seasonal kernel, since we know it), as well as an error term sigma and a vector eta of standard normal variates that we will transform via the Cholesky factor. Note that we are fixing the lower bounds of the parameters to be \(0\). This both ensures that terms such as the standard deviation of the Gaussian likelihood (sigma) are valid (i.e., non-negative) and helps the model sample more efficiently.

transformed parameters block

This is where we do any calculations that come before the prior and likelihood sampling. In our case, we perform the covariance function summation here and add the tiny constant delta to the diagonal for numerical stability. We also make use of the Cholesky decomposition for computational efficiency. Note that most of the commonly-encountered covariance functions are already included as functions in Stan. Here we make use of gp_exp_quad_cov for the exponentiated quadratic kernel (i.e., trend) and gp_periodic_cov for the periodic kernel (i.e., seasonality).
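
The f = L_K * eta construction is the non-centred parameterisation: because eta is a vector of independent standard normals and L_K is the lower-triangular Cholesky factor of K, the product L_K * eta is a draw from a multivariate normal with covariance K, which is typically much friendlier to the sampler than declaring f as multivariate normal directly. A minimal R sketch of the same idea, using arbitrary hyperparameter values:

# Non-centred parameterisation sketch (arbitrary hyperparameters)
x_demo <- 1:10
K <- 1^2 * exp(-outer(x_demo, x_demo, "-")^2 / (2 * 3^2))  # exponentiated quadratic
K <- K + diag(1e-9, length(x_demo))                        # jitter on the diagonal
L_K <- t(chol(K))            # chol() returns the upper triangle, so transpose
eta <- rnorm(length(x_demo)) # independent standard normals
f <- L_K %*% eta             # one draw from MVN(0, K)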

model block

This is where we define our priors and the likelihood. For this example, we are just using standard normal priors for all parameters.
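
Note that because each parameter is declared with <lower=0>, a std_normal() prior effectively acts as a half-normal prior. A quick way to see what that implies:

# A standard normal prior on a <lower=0> parameter behaves like a half-normal
prior_draws <- abs(rnorm(1e5))
quantile(prior_draws, probs = c(0.5, 0.95))
# roughly 0.67 and 1.96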

generated quantities block

This is where the magic happens! Almost anything can be calculated here (e.g., test-set predictions, pointwise log-likelihood, etc.), and it is not uncommon for real-world models to make complete use of its flexibility. For our purposes, we are just going to draw posterior predictive samples for each time point we wish to predict values for. These are drawn from a normal random number generator (‘rng’) parameterised by the latent GP values f and the error standard deviation sigma.
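
For intuition, the same posterior predictive draws could be computed after the fact in R, given posterior draws of f and sigma; the generated quantities block simply does it inside Stan as part of sampling. A sketch of the equivalent R logic (runnable once the model further below has been fitted):

# Post-hoc equivalent of the generated quantities block (after fitting 'model')
post <- rstan::extract(model)                     # list of posterior draws
idx <- (length(x1) + 1):(length(x1) + length(xprime))
y2_manual <- sapply(idx, function(n) rnorm(nrow(post$f), post$f[, n], post$sigma))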

Fitting the Stan model in R

First, we will define our \(x\) and \(x'\) vectors:

x1 <- 1:100
xprime <- seq(from = 1, to = 100, by = 2)

Now we can fit the model. In order to do this, we need to pass a few things to rstan::stan: the Stan model code, the input data to satisfy the requirements of the data block, and whatever other MCMC settings we want. I’m going to turn on parallel processing so the chains run simultaneously, and increase the number of iterations per chain so that the effective sample size in the distribution’s tails is large enough for appropriate inference.

Fair warning, this will take a while!

options(mc.cores = parallel::detectCores())

model <- stan(model_code = stan_code, 
              data = list(N1 = length(x1), x1 = x1, y1 = y, N2 = length(xprime), x2 = xprime,
                          p_periodic = 25), 
              iter = 5000, chains = 3, seed = 123)

Beautiful, we now have everything we need! Let’s convert our model into a data frame and pull out the posterior samples for the predicted \(y\) values for each of our \(x'\) time points (the other columns are the posterior draws for all our parameters!):

draws <- as.data.frame(model)
preds <- draws[, grepl("y2", names(draws))]
colnames(preds) <- paste0("y_", xprime)
preds[1:5, 1:5]
##         y_1      y_3      y_5      y_7      y_9
## 1 1.4010139 2.231305 2.663115 4.757358 4.701955
## 2 0.5189433 3.760109 3.763193 4.743285 3.802899
## 3 1.2850740 2.429294 3.985677 4.408179 4.405878
## 4 1.7550481 1.413475 3.353406 3.866813 4.384098
## 5 0.9879761 3.002428 3.774711 4.327413 3.496659

From here, we can do several things. The first might be to replicate the nice graph from last time, showing the actual data with the posterior predictions overlaid. Here is the posterior mean and its associated \(95\%\) credible interval:

prob <- 0.95
u_bound <- ((prob + 1) / 2)
l_bound <- u_bound - prob
  
preds <- preds %>%
  mutate(draw = dplyr::row_number()) %>%
  pivot_longer(cols = 1:ncol(preds), names_to = "timepoint", values_to = "values") %>%
  mutate(timepoint = as.integer(gsub(".*_", "", timepoint))) %>%
  reframe(.mean = mean(values),
          .lower = quantile(values, probs = l_bound),
          .upper = quantile(values, probs = u_bound),
          .by = "timepoint")

ggplot() +
  geom_ribbon(data = preds, aes(x = timepoint, ymin = .lower, ymax = .upper, group = 1), 
              fill = "steelblue2", alpha = 0.3) +
  geom_line(data = tmp, aes(x = timepoint, y = y), colour = "black") +
  geom_point(data = tmp, aes(x = timepoint, y = y), colour = "black") +
  geom_line(data = preds, aes(x = timepoint, y = .mean), colour = "steelblue2", size = 1) +
  labs(title = paste0("Posterior mean and ", round(prob * 100, digits = 0), "% credible interval"),
       subtitle = paste0("Interval calculated over ", nrow(draws), " samples"),
       x = "Timepoint",
       y = "Values") +
  theme_bw()

It appears that our model does quite well in predicting plausible values for each of the \(50\) time points. So, where do we go now? Well, while this post won’t go into model diagnostics in detail, we can at least explore a few important visualisations.

Here are trace plots for the parameters, which we use to assess chain convergence (we want these to look like white noise, with no visible trends or patterns). We'll just generate the first ten as an example:

traceplot(model)
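
Since we loaded bayesplot earlier, an alternative is to plot only the parameters we actually care about here, namely the covariance function hyperparameters and sigma:

# Trace plots for the hyperparameters only
mcmc_trace(as.array(model),
           pars = c("sigma", "sigma_trend", "l_trend", "sigma_periodic", "l_periodic"))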

We can also visualise the posterior distributions for the parameters. All appear consistent with our expectations given the standard normal priors we specified in the model. For those who read the first post, you may recall that we fixed the hyperparameters. Here we have estimated entire probability distributions for them based on our prior beliefs and the actual observations. Very cool.

as.data.frame(model) %>%
  dplyr::select(c(sigma, sigma_trend, l_trend, sigma_periodic, l_periodic)) %>%
  mutate(iteration = row_number()) %>%
  pivot_longer(cols = sigma:l_periodic, names_to = "parameter", values_to = "values") %>%
  ggplot(aes(x = values)) +
  geom_histogram(aes(y = after_stat(density)), fill = "steelblue2", alpha = 0.9) +
  labs(x = "Parameter value",
       y = "Probability density") +
  theme_bw() +
  facet_wrap(~parameter, scales = "free")
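
The plots can be complemented with numerical summaries (posterior means, credible intervals, R-hat, and effective sample sizes), which rstan's print method provides:

# Numerical posterior summaries plus Rhat and effective sample sizes
print(model, pars = c("sigma", "sigma_trend", "l_trend", "sigma_periodic", "l_periodic"),
      probs = c(0.025, 0.5, 0.975))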

With all of that, we have successfully implemented a GP in Stan that can generate plausible predictions, just like the first article. While the computation time here was orders of magnitude longer, we also exercised far more control over the incorporation and propagation of uncertainty, and have considerably more information at our disposal through which to evaluate the model.

Final thoughts

Thank you for making it to the end! This concludes the short series on simple GPs. In the future, I might write on common extensions, such as latent GPs for modelling non-normal outcomes like counts or binary categories.

Again, for more information on GPs, please check out these much better resources:

  • Gaussian processes for machine learning
  • Gaussian Processes for Timeseries Modelling
  • Robust Gaussian Process Modelling
  • Kernel Cookbook
  • Statistical Rethinking 2022 lecture series

Or, feel free to reach out to me via email.


  1. Who am I kidding, I would have written it anyway!

  2. There will be no black boxes in this blog!

  3. Specifically, Stan implements the Hamiltonian Monte Carlo (HMC) algorithm for full Bayesian inference and uses its adaptive variant known as the no-U-turn sampler (NUTS).