Using deep learning in PyTorch on rainfall data to predict Australian area remoteness

17 minute read

Published:

Introduction

Ahoy! Following on from my last blog post we are again going to make use of a pretty sweet dataset of Australian climate data from 1 January 1980 and up until 1 January 2020 for every ABS Statistical Area 2. I did a bunch of behind-the-scenes SQL queries to construct this dataset using information stored on a warehouse, meaning that unfortunately it’s not publicly available to access, but hopefully the code contained here will still be useful!

The last post explored the differences in temporal dynamics between four different climate measurements – temperature, rainfall, soil moisture, and the Forest Fire Index. What we are going to do this time is use time-series features computed on the rainfall data for every ABS SA2 to train a deep neural network in PyTorch to predict whether a time series comes from one of the following ABS remoteness areas:

  1. Major Cities of Australia
  2. Inner Regional Australia
  3. Outer Regional Australia
  4. Remote Australia
  5. Very Remote Australia

Importantly, we are not making these predictions based on average rainfall or another similarly simple idea. While this may in fact work, it’s also not particularly interesting for the purpose of a blog post! Instead, our approach here – if successful – will provide evidence that differences in empirical temporal dynamics in rainfall can be used to classify Australia’s various granular statistical areas. Sounds like a pretty novel idea to me!

Let’s start by loading the packages we need:

library(dplyr)
library(tidyr)
library(tsibble)
library(theft)
library(theftdlc)
library(reticulate)

Here is a snippet of the data frame that we are working with after my preprocessing:

head(climate)
## # A tsibble: 6 x 4 [1]
## # Key:       sa2_name_2016, remoteness_name_2016 [1]
##   sa2_name_2016    remoteness_name_2016 timepoint rainfall_mm
##   <chr>                           <dbl>     <int>       <dbl>
## 1 ACT - South West                    1         1        2.91
## 2 ACT - South West                    1         2        2.60
## 3 ACT - South West                    1         3        2.30
## 4 ACT - South West                    1         4        2.12
## 5 ACT - South West                    1         5        2.44
## 6 ACT - South West                    1         6        2.19

Note that I have coded the remoteness areas into integers reading for machine learning by using the following taxonomy:

  • 0 = Major Cities of Australia
  • 1 = Inner Regional Australia
  • 2 = Outer Regional Australia
  • 3 = Remote Australia
  • 4 = Very Remote Australia

Calculate time-series features

Recall that the core idea of feature-based time-series analysis is to reduce a time series down to a single scalar summary statistic (or a set of summary statistics) which we call a ‘feature’. This means we can reduce the (potentially) cumbersome temporal domain data of our time series \(\times\) time matrix down to a potentially more usable time series \(\times\) feature matrix. Time-series features vary substantially in their complexity, ranging from measures that ignore temporality – such as the moments of the distribution – all the way to coefficients of autoregressive model fits, information criteria, and countless other statistical properties.

My theft package for R simplifies the computation of time-series features by providing instant and standardised access to three feature sets in R (catch22, tsfeatures, and feasts) and three feature sets in Python (tsfresh, tsfel, and kats), all in the one place. You can also calculate your own features too, if you want! Since we want to use some of the Python libraries here, we will first usetheft to create a new virtual environment (we’ll call it "theftclimate") and install the requisite Python packages into it, then point R to the virtual environment. Note that Python \(\geq3.10\) is required.

install_python_pkgs("theftclimate", python = "/usr/local/bin/python3.10")
init_theft("theftclimate")

We can now calculate time-series features using the calculate_features function in theft. We are going to use catch22, kats, and tsfel here because each set contains relatively distinct features from one-another and the resulting time series \(\times\) feature matrix will not be so big that we risk issues with training the neural network. Note that I am also z-scoring the time series prior to calculating features to eliminate feature sensitivity to the first two moments of the distribution (i.e., mean and variance) because I care more about temporal dynamics here – i.e., features sensitive to patterns over time.

features <- calculate_features(climate, feature_set = c("catch22", "kats", "tsfel"), 
                               z_score = TRUE, n_jobs = 5, warn = FALSE)

This produces a data frame that looks like this:

head(features)
##                 id group feature_set                    names      values
## 1 ACT - South West     1     catch22       DN_HistogramMode_5 -0.67214804
## 2 ACT - South West     1     catch22      DN_HistogramMode_10 -0.40865463
## 3 ACT - South West     1     catch22                CO_f1ecac  9.42065730
## 4 ACT - South West     1     catch22           CO_FirstMin_ac 22.00000000
## 5 ACT - South West     1     catch22 CO_HistogramAMI_even_2_5  0.51096367
## 6 ACT - South West     1     catch22            CO_trev_1_num -0.00262028

Prepare the feature data for deep learning

Alright, this is the final step before we transition to Python and get the party started. We are going to prepend the feature set name to each feature name in order to safely create unique column names and then widen our tidy data frame to get a matrix. From there, we’ll drop the class labels from the data matrix and put them in their own output vector.

featmat <- features %>%
  mutate(names = paste0(feature_set, "_", names)) %>%
  pivot_wider(id_cols = c("id", "group"), names_from = "names", values_from = "values") %>%
  dplyr::select(where(~!any(is.na(.)))) # Remove features with NA values

X <- featmat %>%
  dplyr::select(-c(id, group))

y <- featmat %>%
  pull(group) %>%
  as.integer()

Build a deep learning model in PyTorch

It’s time to rock and roll! We are now going to switch to Python within this same document due to the power of R Markdown and the glorious reticulate package. We’ll start by loading all the libraries we need:

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data_utils
import matplotlib.pyplot as plt
from torch.optim import Adam, SGD
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

Create train-test split

First of all, we will z-score our time-series features to make training the neural network more effective and efficient and then produce a train-test split for the data. Note the prepended r. – since we are using R and Python within a single R Markdown document, we need to tell Python (through `reticulate``) to access the objects from the R environment.

scaler = StandardScaler()
X = scaler.fit_transform(r.X)
X_train, X_test, y_train, y_test = train_test_split(X, r.y, test_size=0.2, random_state=123)

Convert data to tensors

With our data split into normalised train-test sets, we now just need to convert the pandas data frames and NumPy arrays (i.e., class labels) into tensors ready for use with PyTorch. We are also going to set the batch size which is essentially the number of training samples processed in a forward and backward pass through the algorithm. Passing in the entire data would be computationally expensive and this is one way around that.

X_train = torch.tensor(X_train, dtype=torch.float)
X_test = torch.tensor(X_test, dtype=torch.float)
y_train = torch.tensor(y_train, dtype=torch.long)
y_test = torch.tensor(y_test, dtype=torch.long)
batch_size = 32
torch.manual_seed(123) # For reproducibility
## <torch._C.Generator object at 0x12afe8830>
train_tensor = data_utils.TensorDataset(X_train, y_train) 
train_loader = data_utils.DataLoader(dataset = train_tensor, batch_size = batch_size, shuffle = True)
test_tensor = data_utils.TensorDataset(X_test, y_test) 
test_loader = data_utils.DataLoader(dataset = test_tensor, batch_size = batch_size, shuffle = True)

Build neural network architecture

With the data prepared, we can now define our model. The basic structure of a neural network in PyTorch contains an initialisation followed by an intialisation of the parent class we defined, followed by definitions of how each layer is connected and what the structure of those connections look like (i.e., the number of neurons in each layer). Finally, we also need to define what a forward pass through the network’s layers we just defined looks like. This involves the specification of activation functions to pass the neurons from the previous layer to those in the next layer (or output). Note the use of log_softmax right at the end – this ensures the final output is probabilities of membership to each class which is essential for a multiclass classification problem.

class ClimateClassifier(nn.Module):
  def __init__(self, n_features, n_classes, neurons_h1, neurons_h2):
    super().__init__()
    self.input_to_h1 = nn.Linear(n_features, neurons_h1)
    self.h1_to_h2 = nn.Linear(neurons_h1, neurons_h2)
    self.h2_to_output = nn.Linear(neurons_h2, n_classes)
    self.log_softmax = nn.LogSoftmax(dim=1)
    
  def forward(self, x):
    x = self.input_to_h1(x)
    x = torch.relu(x)
    x = self.h1_to_h2(x)
    x = torch.relu(x)
    x = self.h2_to_output(x)
    x = self.log_softmax(x)
    return x

Basically, in plain English, we are just saying ‘connect the number of columns in the input data to neurons_h1 number of neurons in the first hidden layer using a ReLU activation function, then connect these neurons_h1 neurons to neurons_h2 number of neurons in the second hidden layer also using a ReLU activation function, and then finally connect these neurons_h2 number of neurons to n_classes (i.e., 5 in our case; one for each remoteness area) number of classes using probabilities produced by the softmax function followed by a logarithm.’

From there, we can define these parameters so we don’t need to hardcode them into the above architecture and then formally define the model:

n_features = X.shape[1]
neurons_h1 = 64
neurons_h2 = 32
n_classes = len(set(r.y))
model = ClimateClassifier(n_features=n_features, n_classes=n_classes, neurons_h1=neurons_h1, neurons_h2=neurons_h2)
print(model)
## ClimateClassifier(
##   (input_to_h1): Linear(in_features=215, out_features=64, bias=True)
##   (h1_to_h2): Linear(in_features=64, out_features=32, bias=True)
##   (h2_to_output): Linear(in_features=32, out_features=5, bias=True)
##   (log_softmax): LogSoftmax(dim=1)
## )

Train the neural network

Here is where the coding process starts to ramp up. We now have the critical task of defining the training process. This can be an incredibly complex task, but a fundamental version involve specifying the following key components:

  • criterion – the loss function
  • learning rate – how much the model changes in response to the estimated error
  • epochs – number of complete passes through the entire training dataset
  • optimiser – algorithm that adjusts model parameters to minimise our criterion (i.e., loss function)
  • device (optional) – whether PyTorch should use a GPU if CUDA is available for it (for immense performance gains) or else the computer’s CPU

We are going to use cross-entropy loss (which is the standard loss function for multiclass classification problems) with the Adam optimiser. The other parameters are a little hand-wavy for now and just based on my intuition, but we are going to set a learning rate of \(0.01\) and \(100\) epochs.

criterion = nn.CrossEntropyLoss()
lr = 0.01
optimiser = Adam(model.parameters(), lr=lr)
epochs = 100
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Now we can actually train our model! We’ll set up a list to store our losses for each iteration and then specify the training process. Basically, we are passing the training data to whatever device PyTorch detected (GPU or CPU), then computing predicted class labels on the training data, determining the most likely class for each sample (through finding the class label with the corresponding maximum probability produced by the softmax function), computing the loss as the difference between the predicted class label and the actual label, and then engaging backpropagation to update model parameters by first zeroing the gradient (as PyTorch accumulates gradients on subsequent backward passes), then initialising the backpropagation process through loss.backward(), and then performing a single optimisation step to update parameters based on the gradients. That sounds like a lot, but thanks to PyTorch, the code is relatively straightforward once you get used to the syntax:

losses = []
epoch_losses = []

for epoch in range(epochs):
  running_loss = 0.0
  for X, y in train_loader:
    inputs, labels = X.to(device), y.to(device)
    # Forward pass
    preds = model(inputs)
    pred_labels = torch.argmax(preds, axis=1)
    loss = criterion(preds, labels)
    losses.append(loss.item())

    # Backpropagation and optimisation
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    running_loss += loss.item()
    
  epoch_loss = running_loss / len(train_loader)
  epoch_losses.append(epoch_loss)

We can plot the loss curve over the epochs to visualise the learning process:

plt.figure(figsize=(8,6))
plt.plot(epoch_losses, label = "Training loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training loss over epochs")
plt.legend()
plt.grid(True)
plt.show()

Evaluate test set performance

With our model trained we can now evaluate predictive performance. We are going to first set the mode to evaluation mode which disables layers and operations such as dropout, batch normalisation etc. This is essential when ‘inferencing’ or evaluating the model. We are then going to disable gradient calculations via torch.no_grad() which is not needed outside of training. This greatly improves computational speed. Finally, we will iterate through our test_loader of test data and calculate model class predictions (where the class label with the highest probability for each sample is assigned as the prediction) followed by a tallying of whether our model predicted correctly or not. This can then be used to calculate the final accuracy statistic. Here is all of that in code:

model.eval()
## ClimateClassifier(
##   (input_to_h1): Linear(in_features=215, out_features=64, bias=True)
##   (h1_to_h2): Linear(in_features=64, out_features=32, bias=True)
##   (h2_to_output): Linear(in_features=32, out_features=5, bias=True)
##   (log_softmax): LogSoftmax(dim=1)
## )
with torch.no_grad():
  num_correct = 0
  num_samples = 0
  
  for x, y in test_loader:
        x = x.to(device)
        y = y.to(device)
        scores = model(x)
        _, predictions = scores.max(1)
        num_correct += (predictions == y).sum().item()
        num_samples += predictions.size(0)
        
accuracy = float(num_correct) / float(num_samples) * 100
print(f"Test Accuracy: {accuracy:.2f}%")
## Test Accuracy: 90.19%

We get a test set accuracy of \(90.19\%\). Pretty good!

Try a different optimiser

As a comparison to the previous model which used the Adam optimiser, we will also try one other method: stochastic gradient descent (SGD). Typically, Adam converges faster than SGD which is why I tried it first, and can sometimes be more robust to poor hyperparameters. However, SGD is a certified classic in the machine learning community, so why not apply it here?

model_sgd = ClimateClassifier(n_features=n_features, n_classes=n_classes, neurons_h1=neurons_h1, neurons_h2=neurons_h2)
optimiser_sgd = SGD(model_sgd.parameters(), lr=lr)
losses_sgd = []

for epoch in range(epochs):
  for X, y in train_loader:
    inputs, labels = X.to(device), y.to(device)
    preds_sgd = model_sgd(inputs)
    pred_labels_sgd = torch.argmax(preds_sgd, axis=1)
    loss_sgd = criterion(preds_sgd, labels)
    losses_sgd.append(loss_sgd.item())
    optimiser_sgd.zero_grad()
    loss_sgd.backward()
    optimiser_sgd.step()
    
model_sgd.eval()
## ClimateClassifier(
##   (input_to_h1): Linear(in_features=215, out_features=64, bias=True)
##   (h1_to_h2): Linear(in_features=64, out_features=32, bias=True)
##   (h2_to_output): Linear(in_features=32, out_features=5, bias=True)
##   (log_softmax): LogSoftmax(dim=1)
## )
with torch.no_grad():
  num_correct_sgd = 0
  num_samples_sgd = 0
  
  for x, y in test_loader:
        x = x.to(device)
        y = y.to(device)
        scores_sgd = model_sgd(x)
        _, predictions_sgd = scores_sgd.max(1)
        num_correct_sgd += (predictions_sgd == y).sum().item()
        num_samples_sgd += predictions_sgd.size(0)
        
accuracy_sgd = float(num_correct_sgd) / float(num_samples_sgd) * 100
print(f"Test Accuracy: {accuracy_sgd:.2f}%")
## Test Accuracy: 91.70%

Woah, we outperformed the Adam optimiser with an accuracy of \(91.70\%\)! This highlights why careful consideration and thorough exploration of different model architectures and parameters is crucial in real applications for developing production-grade and state-of-the-art models.

Try a simpler neural network architecture

As one final example, let’s define a model with just a single hidden layer and see how we do:

class SimpleClimateClassifier(nn.Module):
  def __init__(self, n_features, n_classes, neurons_h1):
    super().__init__()
    self.input_to_h1 = nn.Linear(n_features, neurons_h1)
    self.h1_to_output = nn.Linear(neurons_h1, n_classes)
    self.log_softmax = nn.LogSoftmax(dim=1)
    
  def forward(self, x):
    x = self.input_to_h1(x)
    x = torch.relu(x)
    x = self.h1_to_output(x)
    x = self.log_softmax(x)
    return x
  
simple_model = SimpleClimateClassifier(n_features=n_features, n_classes=n_classes, neurons_h1=64)
optimiser_simple = SGD(simple_model.parameters(), lr=lr)

simple_losses = []

for epoch in range(epochs):
  for X, y in train_loader:
    inputs, labels = X.to(device), y.to(device)
    # Forward pass
    preds_simple = simple_model(inputs)
    pred_labels_simple = torch.argmax(preds_simple, axis=1)
    loss_simple = criterion(preds_simple, labels)
    simple_losses.append(loss_simple.item())

    # Backpropagation and optimisation
    optimiser_simple.zero_grad()
    loss_simple.backward()
    optimiser_simple.step()
    
simple_model.eval()
## SimpleClimateClassifier(
##   (input_to_h1): Linear(in_features=215, out_features=64, bias=True)
##   (h1_to_output): Linear(in_features=64, out_features=5, bias=True)
##   (log_softmax): LogSoftmax(dim=1)
## )
with torch.no_grad():
  num_correct_simple = 0
  num_samples_simple = 0
  
  for x, y in test_loader:
        x = x.to(device)
        y = y.to(device)
        scores_simple = simple_model(x)
        _, predictions_simple = scores_simple.max(1)
        num_correct_simple += (predictions_simple == y).sum().item()
        num_samples_simple += predictions_simple.size(0)
        
accuracy_simple = float(num_correct_simple) / float(num_samples_simple) * 100
print(f"Test Accuracy: {accuracy_simple:.2f}%")
## Test Accuracy: 92.08%

Oooh, \(92.08\%\)! This outperforms both of our more complex with two hidden layers and \(32\) more neurons. This suggests that the more sophisticated models are either overfitting or are just too complex for the amount of training data and features that we have. These findings highlight a very important takeaway message that I like to emphasise in my own research: Incremental complexity is important – start with a simple baseline and add complexity as it justifiably improves performance.

Conclusion

Thanks for making it through! After all of the above, what did we learn? Well, based on the strong predictive performance, I think we can say that the empirical temporal dynamics of rainfall (as quantified by a set of time-series features) can meaningfully distinguish major cities from regional areas from remote areas of Australia without considering just ‘do remote areas get less rainfall than cities’ which is not a particularly novel or exciting question (at least in my opinion!). In other words, ABS SA2s across Australia varying in the dynamics of their rainfall time series, and these dynamics are informative enough to construct a classifier that can predict if an unseen time series comes from one of five classifications of remoteness with a high degree of accuracy. A natural next step would be to investigate which of the time-series features contribute most to the model, but I’ll save that for a future post (probably).