Learning the temporal dynamics of climate data using interpretable machine learning

24 minute read

Published:

Introduction

It has been quite some time since my last past, apologies. It turns out that life gets busy… But I am back with a very detailed banger! In today’s installment, we are going to investigate the empirical dynamics of a bunch of time-series data associated with climate. More specifically, I want to see if we can determine whether different climate measurements exhibit different temporal dynamics (i.e., patterns observed over time). Following that, if we are successful in finding that they do, I then want to understand what they are.

So what are we working with? I have gone and procured data back to 1 January 1980 and up until 1 January 2020 for every ABS Statistical Area 2 in Australia for the following measurement types:

  • Monthly average rainfall
  • Monthly average temperature
  • Monthly average Forest Fire Index
  • Monthly average soil moisture

Unfortunately, this data is absolutely huge and comes from a SQL server. I have also done some behind the scenes work to join, aggregate, and preprocess it so that this analysis would be possible, which means you will not be able to run the code below. Sorry! In any case, I hope the workflow, intuition, and findings are still helpful, though!

Loading R packages and visualising the raw data

As with any new data project, you should always start by plotting your data. Let’s load the R packages we’ll need and produce a graph for one of the SA2s:

library(dplyr)
library(tidyr)
library(ggplot2)
library(tsibble)
library(theft)
library(theftdlc)
library(normaliseR)

Here is a snippet of the tsibble that holds the raw data:

head(climate_raw)
## # A tsibble: 6 x 6 [1M]
## # Key:       sa2_name_2016 [1]
##   rainfall_mm observation_date sa2_name_2016    mean_temperature soil_moisture
##         <dbl>            <mth> <chr>                       <dbl>         <dbl>
## 1        2.91         1960 Jan ACT - South West             18.4         0.368
## 2        2.60         1960 Feb ACT - South West             18.3         0.362
## 3        2.30         1960 Mar ACT - South West             18.3         0.332
## 4        2.12         1960 Apr ACT - South West             18.3         0.303
## 5        2.44         1960 May ACT - South West             18.0         0.313
## 6        2.19         1960 Jun ACT - South West             17.9         0.319
## # ℹ 1 more variable: fire_index <dbl>

I also have a slightly more wrangled long-form version of the data ready for some subsequent analysis where each time series has been z-scored, and it looks like this:

head(climate)
## # A tsibble: 6 x 4 [1]
## # Key:       id, metric [1]
##   id                          timepoint metric            values
##   <chr>                           <int> <chr>              <dbl>
## 1 ACT - South West_fire_index         1 Forest Fire Index -1.06 
## 2 ACT - South West_fire_index         2 Forest Fire Index -0.940
## 3 ACT - South West_fire_index         3 Forest Fire Index -0.712
## 4 ACT - South West_fire_index         4 Forest Fire Index -0.558
## 5 ACT - South West_fire_index         5 Forest Fire Index -0.648
## 6 ACT - South West_fire_index         6 Forest Fire Index -0.638

Now we can define a plotting function (because functional programming is the way!) and apply it to a given SA2 to see what we are working with:

#' Draw climate panel plot for a given SA2
#'
#' @param data \code{data.frame} containing the climate data
#' @param sa2 \code{character} denoting the ABS SA2 to draw the plot for. Defaults to \code{"North Melbourne"}
#' @return \code{ggplot} containing the plot
#' @author Trent Henderson
#'

plot_climate <- function(data, sa2 = "North Melbourne"){

  p <- data %>%
    filter(sa2_name_2016 == sa2) %>%
    pivot_longer(cols = c("rainfall_mm", "mean_temperature",
                          "soil_moisture", "fire_index"),
                 names_to = "metric", values_to = "values") %>%
    mutate(metric = case_when(
      metric == "rainfall_mm"      ~ "Mean daily average rainfall in past 12 months (mm)",
      metric == "mean_temperature" ~ "Mean maximum temperature over last 12 months (C)",
      metric == "soil_moisture"    ~ "Mean root zone soil moisture for the past 12 months as an absolute value (proportion)",
      metric == "fire_index"       ~ "Mean Forest Fire Danger Index over past 12 months")) %>%
    ggplot(aes(x = observation_date, y = values, colour = metric)) +
    geom_line() +
    labs(title = paste0("Time series of climate metrics for ", sa2),
         x = "Observation date",
         y = "Value",
         colour = NULL) +
    scale_colour_manual(values = RColorBrewer::brewer.pal(4, name = "Dark2")) +
    theme_bw() +
    theme(legend.position = "none",) +
    facet_wrap(~metric, scales = "free_y", nrow = 4)

  return(p)
}

plot_climate(climate_raw, sa2 = "North Melbourne")

Very cool! We can immediately notice some interesting patterns: For example, (this is most noticeable in the large rainfall peak around 2010), we see that when rainfall values increase, soil moisture also increases (which makes sense!), whereas temperature and the Forest Fire Danger Index decrease. Again, all of this makes intuitive sense, but it’s still cool to visualise nonetheless. However, we are not interested in validating what we likely already knew – we are more interested in the dynamics! To develop that understanding, we need to turn to a different analytical tool.

Feature-based time-series analysis

One method of doing time-series analysis is to take a feature-based approach. The broad idea is to reduce a time series down to a single scalar summary statistic (or a set of summary statistics). This means we can reduce the cumbersome temporal domain down to a potentially more usable time series \(\times\) feature matrix. Time-series features vary substantially in their complexity, ranging from measures that ignore temporality, such as the moments of the distribution, all the way to coefficients of model fits, information criteria, and countless other statistical properties. The first attempt to organise features across the scientific literature was called highly comparative time-series analysis (“hctsa”), which culminated in a very large feature set of \(>7700\) features. Unfortunately, hctsa exists in proprietary Matlab software (ewww). So, what can we do?

Enter the theft ecosystem for R.

The theft ecosystem for R

The theft (Tools for Handing Extraction of Features from Time series) ecosystem is a collection of (currently) two R packages – theft and theftdlc (theft ‘downloadable content’, just like a video game has DLC) – that were purposefully designed by me to do the following:

  • Unify and standardise the outputs of existing open source feature sets across both R and Python
  • Provide an extensive and extensible framework of feature-based analyses to help users get to scientific understanding faster
  • Make my life easier!

Regarding the first point, there exists six popular open-source time-series feature sets:

  1. catch22 (R) – a collection of 22 general purpose features, implemented in the R package Rcatch22 by me
  2. tsfeatures (R) – a collection of 62 features from methods commonly used by econometricians and forecasters, such as crossing points, seasonal and trend decomposition using Loess (STL), autoregressive conditional heteroscedasticity (ARCH) models, unit-root tests, and sliding windows
  3. feasts (R) – a collection of 43 features that largely capture quantities of interest in econometrics, such as autocorrelation, seasonality, and STL decompositions
  4. tsfresh (Python) – a large collection of 783 features that measure properties of the autocorrelation function, entropy, quantiles, fast Fourier transforms, and distributional characteristics
  5. kats (Python) – a collection of 40 features from a broad range of time-series methods, including operations for forecasting, outlier and property detection, and feature calculation
  6. tsfel (Python) – a collection of 156 features that measure properties associated with distributional characteristics, the autocorrelation function, spectral quantities, and wavelets

As you can see, all of the feature sets contain a wide range of features, and without strong a priori understanding, it can sometimes be difficult to know which features might be best suited to solving your problem. To make matters worse, they all expect different input formats and produce different output formats, let alone the fact that three are in Python and three are in R. Thankfully, I suffered through unifying all of this within theft so you don’t have to!

The broad functionality provided by theft and theftdlc is presented below. I have a very detailed paper hopefully being published soon that details the underlying philosophy of theft and theftdlc with a very interesting machine learning multiclass classification case study, but for now, this infographic will have to suffice!

knitr::include_graphics("workflow-graphic_ecosystem-final.png")

As you have likely guessed, we are going to use the theft ecosystem here to calculate features from our climate time series and subsequently do analysis in the feature space to see if we can glean some insight into the temporal dynamics that distinguish rainfall from soil moisture from the Forest Fire Danger Index from temperature.

Calculating time-series features

theft makes our lives easy by letting us creating a new virtual environment (we’ll call it "climatetheft"), installing the requisite Python packages, and then pointing R to the correct virtual environment. Note that Python \(\geq3.10\) is required.

install_python_pkgs("climatetheft", python = "/usr/local/bin/python3.10")
init_theft("climatetheft")

Now we can calculate features for each of the specified sets. Note that we are just going to use catch22, kats, tsfresh, and tsfel here. These are quite fast to evaluate compared to feasts and tsfeatures (see my conference paper for a comprehensive test of computation speed between the sets), and my poor computer has already suffered enough throughout my PhD and recording, mixing, and mastering my debut album.

features <- calculate_features(climate, 
                               feature_set = c("catch22", "kats", 
                                               "tsfresh", "tsfel"))

This gives us a nice data frame of class "feature_calculations" which means we can use the defined generic functions in theftdlc that are designed to effortlessly work with objects of this class:

head(features)
##                            id             group feature_set
## 1 ACT - South West_fire_index Forest Fire Index     catch22
## 2 ACT - South West_fire_index Forest Fire Index     catch22
## 3 ACT - South West_fire_index Forest Fire Index     catch22
## 4 ACT - South West_fire_index Forest Fire Index     catch22
## 5 ACT - South West_fire_index Forest Fire Index     catch22
## 6 ACT - South West_fire_index Forest Fire Index     catch22
##                      names     values
## 1       DN_HistogramMode_5 -0.4061680
## 2      DN_HistogramMode_10 -0.1713925
## 3                CO_f1ecac 10.6526715
## 4           CO_FirstMin_ac 21.0000000
## 5 CO_HistogramAMI_even_2_5  0.6974725
## 6            CO_trev_1_num -0.0018446

Visualising the feature matrix

One of the foundational plots in feature-based time-series analysis is a heat map graphic of the time series \(\times\) feature matrix where values are represented as shaded cells and rows and columns are organised via hierarchical clustering to pull out structure from the data. theftdlc can easily do this for us (and we are using a pretty cool outlier-robust normalisation method that was introduced in this excellent paper):

plot(features, type = "matrix", norm_method = "RobustSigmoid")

This is pretty interesting. We can see some clusters of time series which exhibit similar values across a range of features. The natural intuition might be that time series from a given class (eg., rainfall) are more likely to group together.

Projecting the feature space into a low-dimensional representation

As seen in the feature matrix graphic, we are working in a very high dimensional space. Dimension reduction methods – such as principal component analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE) – can project the high dimensional data into a lower dimensional space (such as 2-D) to enable effective visualisation of groups. theftdlc can implement all modern dimension reduction techniques through its simple and consistent project function. Keeping with the object-oriented theme, project returns an object of class feature_projection which R’s generic functions – such as plot – have been written to know how to handle. Here’s a simple PCA:

pca <- project(features,
        norm_method = "RobustSigmoid",
        unit_int = TRUE,
        low_dim_method = "PCA")

pca %>%
  plot(show_covariance = TRUE)

We can see that each class of climate measurement largely occupies its own region of the two-dimensional space, though with some overlap. This suggests that this dataset might be well classified by time-series features. Interestingly, soil moisture appears to be the most distinctive in terms of its position on the first two principal components. Lastly, we see that the first principal component explains over three-quarters of the variance in the data. That’s pretty huge considering the very large number of features (\(985\) after filtering out NAs) that went into it. Further inspection as to the features which loaded most strongly onto the first principal component were features associated with Shannon entropy (i.e., information or ‘surprise’ or uncertainty), fast Fourier transform coefficients, first and location of the maximum value in the time series, permutation entropy (which is a measure of the complexity of a system that captures the order relations between values and extracts a probability distribution of the ordinal patterns – basically, smaller values on this feature indicate a more ‘regular’ or deterministic system), and trend quantities.

#' Extract subset of strong factor loadings on the first principal component
#' 
#' @param pca the model output object from \code{prcomp}
#' @return \code{data.frame} containing the feature name and loading on the first principal component
#' @author Trent Henderson
#' 

get_loadings <- function(pca){
  
  contribs <- factoextra::get_pca_var(pca)
  loadings <- as.data.frame(contribs$contrib)
  n_vars <- ncol(loadings)
  features <- rownames(loadings)
  
  loadings <- loadings %>%
    tibble::rownames_to_column(var = "feature_name") %>%
    pivot_longer(cols = 2:(n_vars + 1), names_to = "names", values_to = "value") %>%
    filter(names == "Dim.1") %>%
    filter(value >= 0.15) %>%
    dplyr::select(c(feature_name, value)) %>%
    arrange(desc(value))
  
  return(loadings)
}

get_loadings(pca$ProjectedData)
## # A tibble: 17 × 2
##    feature_name                                                            value
##    <chr>                                                                   <dbl>
##  1 "tsfresh_values__ratio_value_number_to_time_series_length"              0.241
##  2 "TSFEL_0_Entropy"                                                       0.241
##  3 "tsfresh_values__linear_trend__attr_\"stderr\""                         0.216
##  4 "tsfresh_values__last_location_of_maximum"                              0.182
##  5 "tsfresh_values__first_location_of_maximum"                             0.182
##  6 "tsfresh_values__agg_linear_trend__attr_\"stderr\"__chunk_len_5__f_agg… 0.166
##  7 "tsfresh_values__permutation_entropy__dimension_3__tau_1"               0.164
##  8 "tsfresh_values__fft_coefficient__attr_\"angle\"__coeff_49"             0.161
##  9 "tsfresh_values__permutation_entropy__dimension_4__tau_1"               0.158
## 10 "tsfresh_values__fft_coefficient__attr_\"angle\"__coeff_3"              0.158
## 11 "tsfresh_values__fft_coefficient__attr_\"angle\"__coeff_12"             0.156
## 12 "tsfresh_values__fft_coefficient__attr_\"angle\"__coeff_7"              0.155
## 13 "tsfresh_values__ar_coefficient__coeff_2__k_10"                         0.155
## 14 "tsfresh_values__fft_coefficient__attr_\"angle\"__coeff_70"             0.155
## 15 "tsfresh_values__permutation_entropy__dimension_7__tau_1"               0.155
## 16 "Kats_trend_strength"                                                   0.153
## 17 "Kats_heterogeneity"                                                    0.153

What does all of this suggest? Well, based on the types of features that contribute to most of the variance explained in the data, it appears that information about ordered temporal patterns and frequency domain phase information are important in explaining this dataset. However, it can be hard to interpret this list without knowing how the classes differ in terms of their distributions on the features. Let’s plot the class distributions for a few of them:

features %>%
  filter(names %in% c("0_Entropy", "values__first_location_of_maximum", "trend_strength")) %>%
  plot(type = "violin", feature_names = c("0_Entropy", "values__first_location_of_maximum", 
                                          "trend_strength"))

One of the immediate things that leaps out is the size of the distribution of values for soil moisture on all the chosen features except for entropy (which is operationalised in tsfel as normalised entropy), where its values are clustered tightly about \(1\), which represents a high degree of uncertainty (i.e., randomness or ‘surprise’). Put another way, \(z\)-score soil moisture time series have a near uniform distribution of probabilities across all unique values observed within the time series (i.e., each unique value is equally likely to occur). Pretty fascinating stuff!

Stochastic dimension reduction techniques

Since we have all the power of theftdlc at our fingertips, let’s try one more dimension reduction method. Here is a t-SNE version with a perplexity value of 30 to start with:

features %>%
  project(norm_method = "zScore",
          low_dim_method = "tSNE",
          perplexity = 30,
          seed = 123) %>%
  plot(show_covariance = FALSE)

How about with a much lower perplexity value?

features %>%
  project(norm_method = "zScore",
          low_dim_method = "tSNE",
          perplexity = 5,
          seed = 123) %>%
  plot(show_covariance = FALSE)

Oh boy! Hopefully now you can see that visualisations associated with stochastic dimension reduction techniques can be problematic to interpret! Please see this incredible interactive article for more.

Finding high-performing features

A useful next step for a setting such as this can be to understand which individual features best distinguish between the four classes. theftdlc contains a function classify which fits a resample-based classification procedure (first introduced in this seminal paper). classify handles all the preprocessing for you, making sure to normalise both the train and test set against only the train set (to keep the test set data fully unseen) and retain original class proportions in each resample. We’ll just use 10 resamples here so my computer doesn’t die rendering this blog post, and we’ll use a radial basis function support vector machine to let us model a non-linear decision boundary compared to the default option which is a linear SVM. classify also allows us to calculate a set of ‘null’ models where class labels are shuffled (therefore severing the input-output relationship) which we can use to form an empirical null distribution against which to compare performance of the main unshuffled models. Keep in mind that this is more useful with a large number of resamples, though! The upcoming theft ecosystem paper walks through this permutation testing methodology in far more detail, so please stay tuned for that. But broadly speaking, we would expect the empirical null distributions to be distributed about a mean equivalent to the chance probability for the problem, which, given we have four classes here would be \(25\%\).

rbf <- function(formula, data){
  mod <- e1071::svm(formula, data = data, kernel = "radial", scale = FALSE,
                    probability = TRUE)
}

classifiers <- classify(features,
                        classifier = rbf,
                        by_set = FALSE,
                        n_resamples = 10,
                        use_null = TRUE)

We can make sense of this vast amount of information by computing comparisons; namely, we can compare each feature against its empirical null distribution or we can compare features pairwise. Since we just want to get a sense for what works well, let’s compare features to their nulls and pull out the top ten according to the mean classification accuracy (five samples per main and null model isn’t enough to robustly use the p-value here). compare_features in theftdlc was designed for exactly this task, and it can be easily parallelised for efficiency:

comps <- compare_features(classifiers,
                          by_set = FALSE,
                          hypothesis = "null",
                          n_workers = 6)

comps %>%
  slice_max(feature_mean, n = 20) %>%
  dplyr::select(-c(names, original_names, feature_set))
##                                                             hypothesis   metric
## 1  tsfresh_values__permutation_entropy__dimension_4__tau_1 != own null accuracy
## 2  tsfresh_values__permutation_entropy__dimension_5__tau_1 != own null accuracy
## 3  tsfresh_values__permutation_entropy__dimension_6__tau_1 != own null accuracy
## 4  tsfresh_values__permutation_entropy__dimension_7__tau_1 != own null accuracy
## 5  tsfresh_values__permutation_entropy__dimension_3__tau_1 != own null accuracy
## 6                          TSFEL_0_Positive turning points != own null accuracy
## 7                        tsfresh_values__number_peaks__n_1 != own null accuracy
## 8                          TSFEL_0_Negative turning points != own null accuracy
## 9            tsfresh_values__ar_coefficient__coeff_1__k_10 != own null accuracy
## 10                                          TSFEL_0_MFCC_0 != own null accuracy
## 11          tsfresh_values__partial_autocorrelation__lag_2 != own null accuracy
## 12      tsfresh_values__fft_aggregated__aggtype_"kurtosis" != own null accuracy
## 13      tsfresh_values__fft_aggregated__aggtype_"variance" != own null accuracy
## 14                                 TSFEL_0_Spectral spread != own null accuracy
## 15                               TSFEL_0_Spectral kurtosis != own null accuracy
## 16                               TSFEL_0_Maximum frequency != own null accuracy
## 17                               TSFEL_0_Spectral roll-off != own null accuracy
## 18           tsfresh_values__ar_coefficient__coeff_2__k_10 != own null accuracy
## 19                                          TSFEL_0_LPCC_1 != own null accuracy
## 20                                         TSFEL_0_LPCC_11 != own null accuracy
##    feature_mean null_mean t_statistic      p.value
## 1     0.8404726 0.2456555    4.791151 4.929049e-04
## 2     0.8401677 0.2561738    4.373946 8.933965e-04
## 3     0.8384146 0.2577744    4.830720 4.664852e-04
## 4     0.8285823 0.2701982    4.760058 5.147934e-04
## 5     0.8228659 0.2300305    6.182517 8.111924e-05
## 6     0.8220274 0.2724848    7.057098 2.969744e-05
## 7     0.8220274 0.2724848    7.057098 2.969744e-05
## 8     0.8185213 0.2641768    8.091412 1.010588e-05
## 9     0.8040396 0.2325457    3.948442 1.681320e-03
## 10    0.7967226 0.2625762    3.519201 3.261813e-03
## 11    0.7907774 0.2570122    3.155916 5.813174e-03
## 12    0.7757622 0.2978659    3.515284 3.281947e-03
## 13    0.7754573 0.2686738    3.883694 1.855224e-03
## 14    0.7749238 0.3028963    3.728541 2.354033e-03
## 15    0.7420732 0.3217988    3.693299 2.485950e-03
## 16    0.7419207 0.2732470    3.532125 3.196296e-03
## 17    0.7419207 0.2732470    3.532125 3.196296e-03
## 18    0.7406250 0.2103659    4.944764 3.985041e-04
## 19    0.7333841 0.2587652    3.777485 2.182938e-03
## 20    0.7333841 0.2587652    3.777485 2.182938e-03

We see that the top \(20\) features all vastly outperform their respective empirical null distributions, which are all, as expected, distributed around the chance probability of \(25\%\) (with some variability due to the fact we only used \(10\) resamples). We see some similar features to those that loaded strongly onto the first principal component – such as permutation entropy – but we also see a few new high-performers. Features associated with turning points (i.e., on a differenced time series, positive turning points represent the number of times the signal value crosses from below to above zero), autocorrelation, and statistical properties of the spectral domain appear to distinguish between classes well. Individually, each of the top ten features (as determined by mean classification accuracy) classifies the data with \(>80\%\) accuracy. A single feature! Let’s plot the top performer – values__permutation_entropy__dimension_4__tau_1 from tsfresh – and see what the class distributions look like to give us some insight into the differences in temporal dynamics:

features %>%
  filter(names %in% c("values__permutation_entropy__dimension_4__tau_1")) %>%
  plot(type = "box", feature_names = c("values__permutation_entropy__dimension_4__tau_1"))

We see that the interquartile range of all four classes are well separated along values for this feature, which is no surprise given its classification performance. We see that soil moisture exhibits far lower average values that the other three classes (with quite a wide dispersion) and rainfall exhibits the highest values (with comparatively tight dispersion). But what does a high or low value mean on this feature? Recall that permutation entropy measures the complexity of a dynamic system through the calculation of a probability distribution of ordinal patterns. That sounds like a lot, but it essentially works like this:

  1. Divide the time series into sub-windows of size \(d\), starting every \(\tau\)
  2. Replace each sub-window by the permutation that captures the ordinal ranking of the data (i.e., a pattern of [4,7,9] would get a ranking system of [0,1,2], remembering that tsfresh is a Python library which is 0-indexed)
  3. Count the frequencies of each permutation, convert to probabilities \(p\) (i.e., via a proportion) and calculate their entropy (\(-\sum_{i}p_{i}\text{ln}p_{i}\)), noting that tsfresh uses natural logarithm, not log base 2 like many applications of entropy in information theory

So, our case study feature here implements this procedure with \(d=4\) and \(\tau=1\), meaning that windows along the time series of 4 points wide are created for every 1 time step. All of this leads to the high-level takeaway message for this top-performing feature which is that higher values indicate more noise or randomness. So, rainfall appears to exhibit more randomness in its signal when considering the ordinal patterns in small windows sizes along it. Interestingly, soil moisture apparently exhibits the least randomness in its signal when measured this way – a finding that opposes what we found earlier regarding TSFEL’s entropy feature. This suggests that when considering the whole signal (i.e., as the entropy feature does), soil moisture takes on a wide range of unique values, but values close in time in small sub-windows tend to be similar. Now, I’m not a climate expert, but I guess that makes sense in this context, given that rainfall can be sporadic when considered over a short window, yet I imagine soil moisture holds some degree of consistency over the same window. Pretty interesting stuff!

A final question you might have after all of this might be Trent this is all very interesting, but what can I do with this information? Aside from just being cool to dive into, there are a few applications that come to mine. First, if new time series were made available but it was not known which measurement they came from, we could train a feature-based classifier to predict with what seems to be quite high accuracy what class of data the time series belongs to. Second, we could use features to forecast future values of each time series, rather than relying on a simple univariate forecasting approach or a multivariate method such as vector autoregression. Third, by calculating features over sliding windows and observing future periods, we might be able to train a classifier to predict the probability of a lack of soil moisture, or an elevated Forest Fire Danger Index, given the short-range historical temporal dynamics. Maybe I’ll get around to exploring one of these in the future.

Conclusion

Thanks for making it to the end! I hope you found it some value in seeing how a feature-based time-series analysis workflow can provide detailed understanding about your data. If you have any questions, please reach out on any of my socials found on the home page of my website – I’d love to chat time series! Please check back in the near future for hopefully the published version of the paper introducing theft and theftdlc.