Interpretable time-series modelling using Gaussian processes


1 Introduction

Welcome! If it wasn’t already obvious, I love time-series analysis, so we are going to dive deep into it here. Another thing I love is Gaussian processes (GPs), so why not combine the two? We’ll get into the details later, but GPs are an insanely powerful tool that can model an absurd range of data (including continuous and discrete) in an interpretable way that gives the modeller a large degree of control over how the technique learns from data. This makes GPs an invaluable tool in any applied statistician/machine learning practitioner’s toolkit.

It is important to remember that using a GP for time series basically converts the problem from modelling a generative process (as we might with an ARIMA model) into what is essentially a curve-fitting exercise. GPs are a Bayesian method, so if you are unfamiliar with Bayesian inference, please go check out some resources on that, such as this great video by Richard McElreath. The key bits to remember for this post are that in Bayesian inference we are interested in the posterior distribution, which is proportional to the product of the prior (our beliefs about an event before seeing the data) and the likelihood (the probability of the data given the model parameters). Throughout this post, we are going to build a basic GP from scratch to highlight the intuition and mechanics underlying the modelling process.
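In symbols, for model parameters \(\theta\) and data \(\mathbf{y}\), the posterior is proportional to the likelihood times the prior:

\[ p(\theta \mid \mathbf{y}) \propto p(\mathbf{y} \mid \theta) \, p(\theta) \]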

In practice, we would most likely fit GPs using a purpose-built tool, such as coding one in the probabilistic programming language Stan to utilise Markov chain Monte Carlo for sampling, or using a dedicated GP package such as GPy, GauPro, or even Tensorflow Probability.

Before we get started, we will load the R packages we need:

library(Rcpp)
library(dplyr)
library(tidyr)
library(ggplot2)
library(MASS)
library(TSA)

2 Generating some data

We are going to simulate some time series data with temporal dynamics that are commonly encountered in applied settings. Specifically, we are going to generate a noisy sine wave (i.e., to emulate periodic or seasonal data) with a linearly increasing trend:

set.seed(123)
y <- 3 * sin(2 * seq(0, 4 * pi, length.out = 100)) + runif(100) * 2
trend <- 0.08 * seq(from = 1, to = 100, by = 1)
y <- y + trend
tmp <- data.frame(timepoint = 1:length(y), y = y)

# Draw plot

ggplot(data = tmp) +
  geom_point(aes(x = timepoint, y = y), colour = "black") +
  geom_line(aes(x = timepoint, y = y), colour = "black") +
  labs(x = "Timepoint",
       y = "Values") +
  theme_bw()

3 Decomposing a time series for modelling

There are a great many ways to do time-series analysis. A lot of it hinges on the intended use of the model, such as forecasting or classification. In our case, we are interested in constructing a model that can accurately capture temporal properties and generate sensible predictions on which to do inference. One approach for doing this is known as structural time series modelling. Basically, this involves decomposing a time series into the sum of simpler parts. Such a model might take the following form:

\[ f(x) = f_{1}(x) + f_{2}(x) + \dots + f_{n}(x) + \epsilon \]

where \(f_{1}(x) \dots f_{n}(x)\) are the different parts of the time series we are explicitly modelling, and \(\epsilon\) is the error term. Examples of these parts may include:

  • Trend — Is there an average increase/decrease over time?
  • Autocorrelation structure — How correlated are points with those that came before them? At which lag is this relationship meaningful (e.g., one point behind? Five points behind?)?
  • Seasonality — Are there recurring periodicities in the data (e.g., cyclic patterns observed at a weekly and monthly level for retail sales)?
  • Holidays — Are there infrequent holiday periods which are important to capture? We are going to ignore this here for simplicity.
  • Likelihood — What probability distribution best describes my data (e.g., for continuous data this might be a Gaussian, for count data it might be Poisson or Negative Binomial, etc.)?

Hopefully the intention here is reasonably clear — we can model time series in an extremely flexible and transparent manner through the explicit modelling of the various statistical processes described by the data. While a GP doesn’t look like the above equation, it is important to keep this idea of components/parts in mind, as it will become quite useful very soon after we unpack the mechanics of GPs.
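To make this concrete, the series we simulated in Section 2 was literally built this way — a seasonal part, a trend part, and noise summed together (a quick restatement of the components, with new variable names used purely for illustration):

# The components we summed when simulating the data in Section 2
seasonal_part <- 3 * sin(2 * seq(0, 4 * pi, length.out = 100)) # f_1(x): periodic component
trend_part <- 0.08 * (1:100)                                   # f_2(x): linear trend
# The remaining runif(100) * 2 term plays the role of epsilon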

4 Gaussian processes

The most general definition of a GP is that it is a generalisation of the multivariate normal distribution to an infinite number of dimensions. That is not super helpful… Let’s unpack it a bit.

Recall that a Gaussian distribution is parameterised by two scalar values: (i) mean \(\mu\); and (ii) variance \(\sigma^{2}\) (or the standard deviation which is the square root of the variance). We can write it as:

\[\mathcal{N}(\mu, \sigma^{2})\]

We can increase the dimensionality of this univariate case to describe multivariate Gaussian distributions, by increasing the dimensionality of the parameters. This results in our mean becoming a mean vector \(\mu\) and our variance becoming a covariance matrix \(\Sigma\) (as we now need to describe both the variance of each variable and the covariance between them), written as:

\[\textbf{X} \sim \mathcal{N}(\mu, \Sigma)\]
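As a quick two-dimensional illustration (with an arbitrary covariance matrix, using the MASS package we loaded earlier):

# Two draws from a bivariate Gaussian whose dimensions are strongly correlated
mu_demo <- c(0, 0)
Sigma_demo <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)
MASS::mvrnorm(n = 2, mu = mu_demo, Sigma = Sigma_demo)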

A GP, by its broad definition, is just a generalisation of this idea to an infinite number of dimensions. This means our mean vector and covariance matrix become a mean function \(m(x)\) and covariance function \(k(x, x')\) (also called a ‘kernel’). See how we have abstracted from scalar values to functions without changing the broad underlying idea? Putting this all together, we have the general form of a GP:

\[ f(x) = \mathcal{GP}(m(x), k(x, x')) \]

Now, it turns out that for most applications the mean function plays a relatively minor role in contributing to the outputs of a GP. In other words, almost all of the details are driven by the covariance function. Practically, this means that oftentimes \(m(x)\) is just set to \(0\). In contrast, by specifying different covariance functions, we can exercise substantial control over the resulting model fit and predictions. This is where domain and statistical expertise comes in.

There are a great number of covariance functions available to us (noting that they must be positive definite to be valid), and deciding which of them to use is a key aspect of the modelling process. In the GP regression context, the kernel specifies a prior over functions. Bayesian updating then combines this prior with the likelihood (the data) to give the posterior.

The following sections explore a few of the most common kernels you might encounter. To demonstrate them, we are just going to generate a bunch of data points in a constrained range (don’t worry, we’ll return to our original time series soon!):

x1 <- seq(from = -2, to = 2, length.out = 100)

4.1 Exponentiated quadratic kernel

This is by far the most common kernel (also known as the “radial basis function”, RBF, or “squared exponential” kernel), and it even forms the basis for simpler machine learning methods, such as RBF support vector machines (SVM). One reason for this is that it is quite similar in form to a univariate Gaussian density. We can write it mathematically as:

\[ k(x, x') = \sigma^{2} \text{exp} \left( -\frac{||x - x'||^{2}}{2\mathcal{l}^{2}} \right) \]

where \(\sigma^{2}\) is the variance (average distance of a function away from its mean) and \(\mathcal{l}\) is the lengthscale (the length of the “wiggles” in a function). You can refer to these as hyperparameters that we can set or learn to control the shape of the kernel.

In R, we can code this manually (note we are using Rcpp for computational efficiency):

Rcpp::cppFunction("
NumericMatrix cov_exp_quad(NumericVector xa, NumericVector xb, double sigma, double l) {
  int n1 = xa.size();
  int n2 = xb.size();
  NumericMatrix K(n1, n2);
  
  for (int i = 0; i < n1; ++i) {
    for (int j = 0; j < n2; ++j) {
      double diff = xa[i] - xb[j];
      double exp_term = exp(-0.5 * pow(diff / l, 2));
      K(i, j) = pow(sigma, 2) * exp_term;
    }
  }
  
  return K;
}
")

We can visualise the resulting matrix of this function on a subset of the datapoints in our input vector x1. Let’s say the first 5 points for clarity:

mat_exp_quad <- cov_exp_quad(x1, x1, 1, 1)
mat_exp_quad[1:5, 1:5]
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 1.0000000 0.9991841 0.9967404 0.9926807 0.9870250
## [2,] 0.9991841 1.0000000 0.9991841 0.9967404 0.9926807
## [3,] 0.9967404 0.9991841 1.0000000 0.9991841 0.9967404
## [4,] 0.9926807 0.9967404 0.9991841 1.0000000 0.9991841
## [5,] 0.9870250 0.9926807 0.9967404 0.9991841 1.0000000

We can see ones on the diagonal (variance with itself) and numbers everywhere else (covariance). Beautiful. We can also confirm that it is symmetrical:

isSymmetric(mat_exp_quad)
## [1] TRUE
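We can also check the positive definiteness requirement mentioned earlier by inspecting the eigenvalues, which should all be non-negative (in practice, a few may come out as tiny negative numbers due to floating-point error):

min(eigen(mat_exp_quad, symmetric = TRUE, only.values = TRUE)$values)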

We can generate some random samples from the kernel to visualise what it looks like on its own (we are going to define a reusable function here to use later — note that most of this code is just data wrangling and plotting — the actual GP mechanics only take 3 lines):

#' Draw random samples from kernel (assumed mean vector is zero)
#' @param n \code{integer} denoting number of observations
#' @param k \code{integer} denoting number  of realisations to draw. Defaults to \code{5}
#' @param mu \code{vector} for the mean. Defaults to \code{NULL} for a vector of zeros
#' @param X \code{vector} of numeric values used to generate \code{Sigma}
#' @param Sigma \code{matrix} of covariance
#' @returns object of class \code{ggplot}
#'

sample_kernel <- function(n, k = 5, mu = NULL, X, Sigma){
  
  # Prepare mean vector if NULL
  
  if(is.null(mu)){
    mu <- integer(length = n)
  }
  
  # Produce dynamic column names, draw from multivariate normal, and convert to dataframe
  
  thenames <- list()
  
  for(i in 1:k){
    thenames[[i]] <- paste0("Sample ", i)
  }
  
  thenames <- unlist(thenames)
  y <- as.data.frame(t(MASS::mvrnorm(n = k, mu = mu, Sigma = Sigma)))
  colnames(y) <- thenames
  draws <- data.frame(x = X)
  draws <- cbind(draws, y)
  
  # Wrangle into long format for ggplot
  
  draws <- reshape(draws, direction = "long",
                   v.names = "y",
                   varying = 2:ncol(draws),
                   times = names(draws)[2:ncol(draws)],
                   timevar = "sample_ind")
  
  # Draw plot
  
  p <- ggplot(data = draws, aes(x = x, y = y, colour = sample_ind)) +
      geom_line(linewidth = 0.7) +
      labs(x = "x",
           y = "y = f(x)",
           colour = NULL) +
      theme_bw() +
      theme(legend.position = "bottom")
  
  return(p)
}

p <- sample_kernel(n = nrow(mat_exp_quad), k = 5, mu = NULL, X = x1, Sigma = mat_exp_quad)
print(p)

Notice how these curves could realistically be used to model a long-term effect (even a non-linear one), such as a trend? You might also notice that they are essentially shaped like splines (similar to what is used in generalised additive models).

As another example, assume that for our data we believed the pattern to be highly non-linear with substantial variation. Well, we can make the lengthscale (\(\mathcal{l}\)) shorter to increase the function “wiggliness” (yes, that’s the technical term). Let’s try \(\mathcal{l} = 0.3\) instead of \(\mathcal{l} = 1\) and see what happens:

mat_exp_quad2 <- cov_exp_quad(x1, x1, 1, 0.3)
p_b <- sample_kernel(n = nrow(mat_exp_quad2), k = 5, mu = NULL, X = x1, Sigma = mat_exp_quad2)
print(p_b)

Our resulting functions are now far more wiggly. Notice how the statistical implementation follows directly from the conceptual understanding of the data? This is the idea at the heart of interpretable modelling.

Another way to visualise the covariance matrix is through a heatmap. Here is a function to do that, and the plot for all hyperparameters set to \(1\):

#' Convert covariance matrix to dataframe and plot
#' @param matrix the covariance matrix
#' @return object of class \code{ggplot}
#' 

plot_covariance <- function(matrix){
  
  mat_df <- as.data.frame(matrix)
  mat_df$x <- seq_len(nrow(mat_df))
  
  mat_df <- reshape(mat_df, 
                    direction = "long",
                    v.names = "values",
                    varying = 1:(ncol(mat_df) - 1),
                    times = 1:nrow(mat_df),
                    idvar = "x")
  
  p <- ggplot(data = mat_df) +
    geom_tile(aes(x = x, y = time, fill = values)) +
    labs(title = "Heatmap of covariance matrix",
         x = "x",
         y = "x",
         fill = "k(x,x')") +
    scale_fill_viridis_c(limits = c(0, 1),
                         breaks = seq(from = 0, to = 1, by = 0.2)) +
    theme_bw() +
    theme(legend.position = "bottom")
  
  return(p)
}

plot_covariance(matrix = mat_exp_quad)

And with the shorter lengthscale of \(\mathcal{l} = 0.3\) for comparison:

plot_covariance(matrix = mat_exp_quad2)

4.2 Rational quadratic kernel

Another common kernel is the rational quadratic. You can think of this as being equivalent to adding together many squared exponential kernels with different lengthscales (credit to this great resource for the brief description).

\[ k(x, x') = \sigma^{2} \left( 1 + \frac{||x - x'||^{2}}{2\alpha \mathcal{l}^{2}} \right)^{-\alpha} \]

where \(\sigma^{2}\) is the variance, \(\mathcal{l}\) is the lengthscale, and \(\alpha\) is the mixing coefficient (where \(\alpha > 0\)).

We can code this in R as:

cppFunction("
NumericMatrix cov_rat_quad(NumericVector xa, NumericVector xb, double sigma, double alpha, double l) {
  int n1 = xa.size();
  int n2 = xb.size();
  NumericMatrix K(n1, n2);
  
  for (int i = 0; i < n1; ++i) {
    for (int j = 0; j < n2; ++j) {
      double diff = xa[i] - xb[j];
      double norm_sq = pow(diff, 2);
      double exp_term = pow(1 + norm_sq / (2 * alpha * pow(l, 2)), -alpha);
      K(i, j) = pow(sigma, 2) * exp_term;
    }
  }
  
  return K;
}
")

Again, a sample of the matrix:

mat_rat_quad <- cov_rat_quad(x1, x1, 1, 1, 1)
mat_rat_quad[1:5, 1:5]
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 1.0000000 0.9991844 0.9967457 0.9927074 0.9871085
## [2,] 0.9991844 1.0000000 0.9991844 0.9967457 0.9927074
## [3,] 0.9967457 0.9991844 1.0000000 0.9991844 0.9967457
## [4,] 0.9927074 0.9967457 0.9991844 1.0000000 0.9991844
## [5,] 0.9871085 0.9927074 0.9967457 0.9991844 1.0000000
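As a quick numerical check of the “sum of many lengthscales” intuition: as \(\alpha \rightarrow \infty\), the rational quadratic converges to the exponentiated quadratic. Comparing the two matrices with a very large \(\alpha\) (a rough sketch, not a proof):

# The two kernels should agree almost exactly for very large alpha
max(abs(cov_rat_quad(x1, x1, 1, 1e6, 1) - cov_exp_quad(x1, x1, 1, 1)))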

And some random samples from the kernel:

p1 <- sample_kernel(n = nrow(mat_rat_quad), k = 5, mu = NULL, X = x1, Sigma = mat_rat_quad)
print(p1)

Note that we can again control wiggliness through the lengthscale \(\mathcal{l}\), but now also through \(\alpha\). Here is a reduction in \(\mathcal{l}\):

mat_rat_quad2 <- cov_rat_quad(x1, x1, 1, 1, 0.3)
p1_b <- sample_kernel(n = nrow(mat_rat_quad2), k = 5, mu = NULL, X = x1, Sigma = mat_rat_quad2)
print(p1_b)

And here is a reduction in \(\alpha\):

mat_rat_quad3 <- cov_rat_quad(x1, x1, 1, 0.1, 1)
p1_c <- sample_kernel(n = nrow(mat_rat_quad3), k = 5, mu = NULL, X = x1, Sigma = mat_rat_quad3)
print(p1_c)

Finally, we can visualise the heatmap of the matrix with all hyperparameters set to \(1\):

plot_covariance(matrix = mat_rat_quad)

And with \(\alpha = 0.1\) as a comparison:

plot_covariance(matrix = mat_rat_quad3)

4.3 Periodic kernel

Since we are interested in time series in this post, it would make sense to introduce some kernels which capture properties that are common in temporal data. One of these is seasonality (i.e., broad periodic patterns which repeat at regular intervals). We can capture this through a periodic kernel (also known as an ‘exponentiated sine squared’ kernel):

\[ k(x, x') = \sigma^{2} \text{exp} \left( -\frac{2}{\mathcal{l}^{2}}\text{sin}^{2} \left( \pi\frac{|x - x'|}{p} \right) \right) \]

where \(\sigma^{2}\) is the variance (or amplitude in this case), \(\mathcal{l}\) is the lengthscale, and \(p\) is the period (distance between repetitions). We can sometimes determine \(p\) beforehand by either eyeballing the time-series plot, or numerically decomposing the data and determining periods using a Fourier transform. We will just use \(p = 1\) here for demonstration.

We can code this kernel in R as:

cppFunction("
NumericMatrix cov_periodic(NumericVector xa, NumericVector xb, double sigma, double l, double p) {
  int n1 = xa.size();
  int n2 = xb.size();
  NumericMatrix K(n1, n2);
  
  for (int i = 0; i < n1; ++i) {
    for (int j = 0; j < n2; ++j) {
      double diff = std::abs(xa[i] - xb[j]);
      double sin_term = std::sin(M_PI * diff / p);
      double exp_term = std::exp(-2 / pow(l, 2) * std::pow(sin_term, 2));
      K(i, j) = pow(sigma, 2) * exp_term;
    }
  }
  
  return K;
}
")

Here is a sample of the matrix:

mat_periodic <- cov_periodic(x1, x1, 1, 2, 1)
mat_periodic[1:5, 1:5]
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 1.0000000 0.9920192 0.9689545 0.9332646 0.8885240
## [2,] 0.9920192 1.0000000 0.9920192 0.9689545 0.9332646
## [3,] 0.9689545 0.9920192 1.0000000 0.9920192 0.9689545
## [4,] 0.9332646 0.9689545 0.9920192 1.0000000 0.9920192
## [5,] 0.8885240 0.9332646 0.9689545 0.9920192 1.0000000

And some random samples from the kernel:

p2 <- sample_kernel(n = nrow(mat_periodic), k = 5, mu = NULL, X = x1, Sigma = mat_periodic)
print(p2)

Notice the cyclic nature? This is what enables us to capture seasonality in time series.

Here is the heatmap (how cool is it seeing periodicity in this format!):

plot_covariance(matrix = mat_periodic)

4.4 White noise kernel

The last common kernel we will introduce here is the white noise kernel. This does exactly as you might expect — incorporates independent and identically-distributed (i.i.d.) noise into the GP. Mathematically, it is simple:

\[ k(x, x') = \sigma^{2}I_{n} \]

where \(\sigma^{2}\) is the variance of the noise and \(I_{n}\) is the identity matrix (a matrix of ones on the diagonal and zeros everywhere else — kind of like the number one but for matrices). This kernel produces a matrix that has zeros everywhere except the diagonal, which now contains the variances.

We can code this in R as:

#' White noise covariance function
#' @param x \code{numeric} vector
#' @param sigma \code{numeric} scalar denoting variance. Defaults to \code{1}
#' @returns an object of class \code{matrix}
#' 

cov_noise <- function(x, sigma = 1){
  k <- sigma * diag(length(x))
  return(k)
}

Here is a sample of the matrix:

mat_white_noise <- cov_noise(x = x1, sigma = 0.01)
mat_white_noise[1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,] 0.01 0.00 0.00 0.00 0.00
## [2,] 0.00 0.01 0.00 0.00 0.00
## [3,] 0.00 0.00 0.01 0.00 0.00
## [4,] 0.00 0.00 0.00 0.01 0.00
## [5,] 0.00 0.00 0.00 0.00 0.01

And some random samples from the kernel:

p3 <- sample_kernel(n = nrow(mat_white_noise), k = 5, mu = NULL, X = x1, Sigma = mat_white_noise)
print(p3)

And the heatmap (which appears essentially flat at zero, as we would expect from white noise with such a small variance):

plot_covariance(matrix = mat_white_noise)

Side note: You may have picked up that every covariance function introduced here has the parameter \(\sigma^{2}\) out front. This parameter acts as a scaling factor. By increasing its value, we can increase the contribution of a given covariance function to the model.
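For example, setting \(\sigma = 2\) in the exponentiated quadratic kernel scales every entry of the covariance matrix by \(\sigma^{2} = 4\) relative to the unit-variance version from earlier:

mat_exp_quad_scaled <- cov_exp_quad(x1, x1, 2, 1)
max(abs(mat_exp_quad_scaled - 4 * mat_exp_quad)) # Differences are zero, up to floating-point error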

5 In the time series context

Now for the key piece of information:

GPs are not limited to a single kernel. We can combine kernels through multiplication or addition to generate a final kernel that captures all the dynamics we need.

Here is where the structural time series decomposition idea comes in. Using GPs, we can apply the same logic: summing or multiplying different kernels that capture different statistical properties of our data can produce a model that reflects the various dynamics appropriately. Summation of kernels can be thought of as an OR procedure, while multiplication can be thought of as an AND procedure (see the sketch below). Looking at our original time-series data again, we have a linearly increasing trend, seasonality, and noise. If we can specify a kernel that combines these individual components, we should be able to construct a GP that can model the data with reasonable accuracy.
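We won’t need kernel multiplication for our data, but as a brief illustration, multiplying the exponentiated quadratic by the periodic kernel gives a “locally periodic” kernel, one whose repeating pattern can slowly change over time (a minimal sketch using the functions defined above):

# Element-wise product of two valid kernels is itself a valid kernel (Schur product theorem)
mat_local_periodic <- cov_exp_quad(x1, x1, 1, 1) * cov_periodic(x1, x1, 1, 2, 1)
p_lp <- sample_kernel(n = nrow(mat_local_periodic), k = 5, mu = NULL, X = x1, Sigma = mat_local_periodic)
print(p_lp)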

5.1 Preparing the covariance function

We will start with the trend kernel. We know that in our case the trend itself is linear, but for posterity (pun intended), we will use a squared exponential kernel anyway, as it can model long-term smooth effects that may not necessarily be linear in other applications. Recall that we coded this earlier as cov_exp_quad.

Next we will add a periodic kernel, as we know there is periodicity in our data. In our case, since we simulated the data, we know the period (\(p\) in the kernel): the argument of the sine runs from \(0\) to \(8\pi\) across the 100 timepoints, giving four full cycles, so the period is \(100 / 4 = 25\) timepoints and \(p = 25\) is a good option. In practice, we would need to numerically determine the period (or maybe 2 or more if we have multiple seasonalities) using something like a Fourier transform.

We are also going to add a white noise kernel as we know there is some error around the data (as we added noise to begin with through runif). We can now compute the three covariance matrices, and sum them for a final matrix which captures all the individual dynamics:

Sigma_exp_quad <- cov_exp_quad(y, y, 1, length(y))
Sigma_periodic <- cov_periodic(y, y, 1, 1, 25)
Sigma_white_noise <- cov_noise(x = 1:nrow(Sigma_exp_quad), sigma = 0.01)
Sigma_comb <- Sigma_exp_quad + Sigma_periodic + Sigma_white_noise

As a bonus, I mentioned earlier we can detect periods in data using a Fourier transform. Here is one example of how we can find the most important period in our data using a Fourier transform from the TSA package:

peri <- TSA::periodogram(y, plot = FALSE) # Do Fourier transform to decompose signal
dd <- data.frame(freq = peri$freq, spec = peri$spec) # Extract frequency and spectral value
order <- dd[order(-dd$spec),] # Rearrange to descending order
time <- 1 / order$f # Convert frequency to original scale (i.e., "timepoint" index)
time <- time[!time %in% length(y)] # Remove the period equal to the length of y as it's not informative
time[1]
## [1] 25

As we see, we get \(p = 25\), the same period value we knew from the parameters of our simulated data. Hopefully this demonstrates to you how we can numerically test for periods on unknown data.

5.2 Sampling from the prior

Bada bing, bada boom!¹ Our full covariance function is specified, our data are ready to go, and we are now ready to specify our full GP! Well… sort of. We are first going to draw samples from the GP prior to visualise a bunch of different candidate functions before we account for any actual data:

p4 <- sample_kernel(n = nrow(Sigma_comb), k = 5, mu = NULL, X = y, Sigma = Sigma_comb)
print(p4)

This is starting to look more like a time series!

We can also visualise the heatmap:

plot_covariance(matrix = Sigma_comb) +
  scale_fill_viridis_c()

Clearly we still have some work to do, but it’s nice to know our prior is in the ballpark. Now we can incorporate the actual data (likelihood) and compute the thing of interest — the posterior.

5.3 Predicting from the posterior

To reiterate, we are trying to predict \(\mathbf{y}_{2} = f(X_{2})\) points for \(n_{2}\) new samples using our GP prior and \(n_{1}\) datapoints (\((X_{1}, \mathbf{y}_{1})\)):

\[ \left[\begin{array}{c} \mathbf{y}_{1} \\ \mathbf{y}_{2} \end{array}\right] \sim \mathcal{N} \left( \left[\begin{array}{c} \mu_{1} \\ \mu_{2} \end{array}\right], \left[ \begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array} \right] \right) \]

We can summarise this using conditional probability for the posterior distribution: \(p(\mathbf{y}_{2} | \; \mathbf{y}_{1}, X_{1}, X_{2})\).

Remember that \(\mu_{1} = m(X_{1})\) (a mean vector of size \(n_{1} \times 1\)) and \(\Sigma_{11} = k(X_{1}, X_{1})\) (a covariance matrix of size \(n_{1} \times n_{1}\)) specify our GP. This can be broken down further to get the conditional distributions, but I will save you most of the linear algebra (see this post for a nice overview). However, it is this linear algebra that makes GPs tricky to apply in practice, especially when our input data is very large. GPs rely on computing the inverse of covariance matrices, pushing their computation time to \(\mathcal{O}(N^{3})\) (compared to \(\mathcal{O}(N)\) for standard Bayesian linear regression). Keep this in mind if you decide to use GPs in your own work.
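For reference, the standard multivariate normal conditioning results (with the zero mean function we are assuming throughout, and the noise variance folded into \(\Sigma_{11}\)) are:

\[ \mu_{2|1} = \Sigma_{21}\Sigma_{11}^{-1}\mathbf{y}_{1} \]

\[ \Sigma_{2|1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \]

These are exactly the two quantities the GP function below computes.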

Before we define the GP itself, for ease and clarity, we are going to define a simple function which automatically computes and sums the covariance functions we want. Note that we are going to add the noise term inside the GP function itself, since only the covariance matrix between the observations (\(\Sigma_{11}\)) contains noise. There are much more flexible and extensible ways to define a composite kernel, but the method below is simple enough for our purposes.

#' Additive kernel of exponentiated quadratic, periodic, and white noise
#' @param xa \code{numeric} vector of input data
#' @param xb \code{numeric} vector of data to predict for
#' @param sigma_1 \code{numeric} scalar denoting the variance of \code{cov_exp_quad}. Defaults to \code{1}
#' @param sigma_2 \code{numeric} scalar denoting the variance of \code{cov_periodic}. Defaults to \code{1}
#' @param l_1 \code{numeric} scalar denoting the lengthscale of \code{cov_exp_quad}. Defaults to \code{1}
#' @param l_2 \code{numeric} scalar denoting the lengthscale of \code{cov_periodic}. Defaults to \code{1}
#' @param p \code{numeric} scalar denoting the period of \code{cov_periodic}. Defaults to \code{1}
#' @returns an object of class \code{matrix}
#' 

cov_summed <- function(xa, xb, sigma_1 = 1, sigma_2 = 1, l_1 = 1, l_2 = 1, p = 1){
  Sigma_exp_quad <- cov_exp_quad(xa, xb, sigma_1, l_1)
  Sigma_periodic <- cov_periodic(xa, xb, sigma_2, l_2, p)
  Sigma <- Sigma_exp_quad + Sigma_periodic
  return(Sigma)
}

Okay, we can now define our GP function to calculate the posterior mean vector and covariance matrix. We are going to predict values for a set of evenly-spaced time points across the \([1-100]\) domain.

#' Calculate posterior mean vector and covariance matrix
#' @param x \code{numeric} vector of input data
#' @param xprime \code{numeric} vector of data points to predict
#' @param y \code{numeric} vector values to learn from
#' @param sigma \code{numeric} scalar denoting variance of the noise kernel. Defaults to \code{0.05}
#' @param cov_function \code{function} which calculates the additive covariance matrix
#' @param ... covariance function hyperparameters to be passed to the covariance functions within \code{cov_function}
#' @return \code{GaussProc} object containing the input data, posterior mean function and covariance matrix
#'

GP <- function(x, xprime, y, sigma = 0.05, cov_function, ...){
  Sigma_11 <- cov_function(x, x, ...) + diag(sigma ^ 2, length(x))
  Sigma_12 <- cov_function(x, xprime, ...)
  Sigma_inv <- t(solve(Sigma_11, Sigma_12))
  Sigma_22 <- cov_function(xprime, xprime, ...)
  Sigma_2 <- Sigma_22 - (Sigma_inv %*% Sigma_12) # Posterior covariance matrix
  mu_2 <- Sigma_inv %*% y # Posterior mean vector
  gp <- list(x, xprime, y, mu_2, Sigma_2)
  names(gp) <- c("x", "xprime", "y", "mu", "Sigma")
  gp <- structure(gp, class = c("GaussProc", "list"))
  return(gp)
}

# Run the function

mod <- GP(x = 1:length(y), xprime = seq(from = 1, to = 100, by = 10), 
         y = y, sigma = 0.8, cov_function = cov_summed, 
         sigma_1 = 5, sigma_2 = 1, l_1 = 75, l_2 = 1, p = 25)

Now that we have the mean vector and covariance matrix of our posterior distribution, we can do useful things, such as sample draws from it and summarise it in terms of the posterior mean and a credible interval (which we will set to \(95\%\), though there is much discussion about the appropriate interval for Bayesian posteriors). We are going to define a function to do some of the same data frame wrangling we have done before and draw a plot to visualise model predictions against the input data. Again, this code is mostly just preparatory and visualisation stuff — the actual draw from the posterior is one line of code:

posterior <- as.data.frame(t(MASS::mvrnorm(n = draws, mu = x$mu, Sigma = x$Sigma)))

#' Plot summary of draws from GP posterior
#' @param x \code{GaussProc} object
#' @param data \code{data.frame} containing time points and \code{y} values
#' @param draws \code{integer} denoting the number of draws from the posterior. Defaults to \code{100}
#' @param ... arguments to be passed to methods
#' @return object of class \code{ggplot}
#'

plot.GaussProc <- function(x, data, draws = 100, ...){
  
  stopifnot(inherits(x, "GaussProc") == TRUE)
  
  # Generate dataframe of draws from posterior
  
  thenames <- list()
    
  for(i in 1:draws){
    thenames[[i]] <- paste0("Sample ", i)
  }
  
  thenames <- unlist(thenames)
  posterior <- as.data.frame(t(MASS::mvrnorm(n = draws, mu = x$mu, Sigma = x$Sigma)))
  colnames(posterior) <- thenames
  
  posterior <- posterior %>%
    mutate(timepoint = x$xprime) %>%
    pivot_longer(cols = 1:ncol(posterior), names_to = "sample", values_to = "y")
  
  posterior_mean <- posterior %>%
    reframe(mu = mean(y), .by = "timepoint")

  posterior_sd <- data.frame(timepoint = x$xprime,
                             sigma = sqrt(diag(x$Sigma))) # SD is square root of variance

  posterior_summary <- merge(x = posterior_mean, y = posterior_sd, by = "timepoint")
  multiplier <- 1.96 # Multiplier for 95% interval over normal distribution
  posterior_summary$lower <- posterior_summary$mu - (multiplier * posterior_summary$sigma)
  posterior_summary$upper <- posterior_summary$mu + (multiplier * posterior_summary$sigma)
  
  # Draw plot
  
  p <- ggplot(data = posterior_summary) +
    geom_ribbon(aes(x = timepoint, ymin = lower, ymax = upper), fill = "steelblue2", alpha = 0.5) +
    geom_line(data = data, aes(x = timepoint, y = y), colour = "black") +
    geom_point(data = data, aes(x = timepoint, y = y), colour = "black") +
    geom_line(aes(x = timepoint, y = mu), colour = "steelblue2", linewidth = 1) +
    labs(title = "Posterior mean and 95% prediction interval",
         subtitle = paste0("Interval calculated over ", draws, " samples"),
         x = "Timepoint",
         y = "Value") +
    theme_bw()
  
  return(p)
}

# Apply function

p6 <- plot(mod, data = tmp, 100)
print(p6)

Very cool! Our model has clearly learnt the right composition of dynamics from the input data, in that it follows the data reasonably closely. However, there are a couple of concerns:

  • The \(95\%\) prediction interval is pretty narrow — Our model is very confident in the numbers despite not matching the \(y\) values exactly. This is mildly worrisome, and may call for re-parameterisation with increased uncertainty, or other considerations
  • The model underestimates many of the “peak” values in the cycles and overestimates the “trough” values — This isn’t the case for all, but it is common enough to warrant further exploration (and potentially re-parameterisation)

Remember: Model fitting, evaluation, and re-parameterisation is all part of the modelling process! Very rarely do we land a final model first go. This iterative approach to statistical modelling ensures you are in-the-loop the whole time, making active decisions that can be defended by empirical evidence, theory, or domain knowledge.

The above model examined ten uniformly spaced time points across our available domain. How well does our model do if we predict for 20?

mod2 <- GP(x = 1:length(y), xprime = seq(from = 1, to = 100, by = 5),
           y = y, sigma = 0.8, cov_function = cov_summed, 
           sigma_1 = 5, sigma_2 = 1, l_1 = 75, l_2 = 1, p = 25)

p7 <- plot(mod2, tmp, 100)
print(p7)

What about all 100 time points?

mod3 <- GP(x = 1:length(y), xprime = 1:length(y),
           y = y, sigma = 0.8, cov_function = cov_summed, 
           sigma_1 = 5, sigma_2 = 1, l_1 = 75, l_2 = 1, p = 25)

p8 <- plot(mod3, tmp, 100)
print(p8)

How about if we increased the noise scaling factor (i.e., uncertainty)?

mod4 <- GP(x = 1:length(y), xprime = 1:length(y),
           y = y, sigma = 1.2, cov_function = cov_summed, 
           sigma_1 = 5, sigma_2 = 1, l_1 = 75, l_2 = 1, p = 25)

p9 <- plot(mod4, tmp, 100)
print(p9)

Intuitively, we see that the prediction intervals widen (i.e., become more uncertain) as we include more noise. Keep in mind that if we used an actual GP tool, hyperparameter optimisation would usually be performed for all covariance function parameters through a method such as marginal maximum likelihood.
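To give a flavour of what such a tool optimises, here is a minimal sketch of the log marginal likelihood for our additive kernel (this is the quantity that would be maximised over the hyperparameters; the function below is purely illustrative, evaluated at the values we used for our first posterior fit):

log_marginal_likelihood <- function(x, y, sigma, ...){
  K <- cov_summed(x, x, ...) + diag(sigma^2, length(x)) # Noisy covariance of the observations
  L <- chol(K)                                          # Upper-triangular Cholesky factor: K = t(L) %*% L
  alpha <- backsolve(L, forwardsolve(t(L), y))          # Solves K %*% alpha = y
  -0.5 * sum(y * alpha) - sum(log(diag(L))) - 0.5 * length(y) * log(2 * pi)
}

log_marginal_likelihood(x = 1:length(y), y = y, sigma = 0.8,
                        sigma_1 = 5, sigma_2 = 1, l_1 = 75, l_2 = 1, p = 25)

In a real workflow we would search over the hyperparameters (or hand this function to an optimiser such as optim) and keep the combination with the highest value.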

Lastly, just like with the GP prior, we can also sample individual realisations from the posterior. Our method above of summarising a \(\pm1.96\sigma\) prediction interval essentially does this over a large number of draws, but it is really cool to see how individual realisations progress from the prior stage, where they do not fit the shape of our data, to the posterior stage, where each realisation is a feasible data-generating function for our observations:

p10 <- sample_kernel(n = nrow(mod4$Sigma), k = 5, X = mod4$x, mu = mod4$mu, Sigma = mod4$Sigma) +
  labs(title = "5 realisations of the posterior distribution")

print(p10)

And there we have it. A GP from scratch that can model time-series data with trend, seasonality, and noise in an interpretable way².

6 Final thoughts

Thank you for making it to the end of this behemoth! I hope you found at least something useful and feel more empowered to consider using Gaussian processes in your own time series work. For more information, please check out these much better resources:

  • Gaussian processes for machine learning
  • Gaussian Processes for Timeseries Modelling
  • Robust Gaussian Process Modelling
  • Kernel Cookbook
  • Statistical Rethinking 2022 lecture series

Or, feel free to reach out to me via email.


  1. Mr. Worldwide as I step in the room

  2. Because nobody wants a black-box algorithm.