tidymodels-assessments.Rmd
Automated assessment methods use parameters, or predictors, calculated for each species to predict or assign preliminary IUCN Red List categories. tidymodels helps by providing a consistent interface to a range of statistical and machine learning models.
Here, I’ll go through the general flow of setting up and running a model using tidymodels.
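Throughout, I'll assume the tidymodels packages are loaded, along with a couple of helper packages used for printing results. A minimal setup sketch (the helper packages are my assumption, based on the functions used below):
library(tidymodels)  # rsample, recipes, parsnip, workflows, tune, yardstick, dplyr, ...

# helpers used for printing results in this article
library(glue)    # glue() for string interpolation
library(scales)  # percent_format() for formatting proportions

# the package this article belongs to also needs to be attached,
# to provide the myrcia_predictors dataset used below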
Before getting to tidymodels, the first thing we need is some data to train our model with.
Running an automated conservation assessment usually requires a set of parameters or predictors calculated at the species level. So that we can focus on the tidymodels approach, we'll just use something pre-prepared.
I’ve included an example dataset in this package called myrcia_predictors. This is a set of 13 continuous numeric quantities calculated from GBIF occurrence records for all accepted species in the genus Myrcia. We calculated these as part of our paper “Evidence-based guidelines for automated conservation assessments of plant species”.
glimpse(myrcia_predictors)
#> Rows: 666
#> Columns: 18
#> $ wcvp_id <chr> "1005985-2", "165955-2", "165982-2", "166012-2", …
#> $ eoo <dbl> 5.852548e+06, 2.189483e+05, 5.216963e+06, 7.21987…
#> $ human_footprint <dbl> 1.40550539, 0.79760157, 6.43746221, 3.37791242, 1…
#> $ human_population <dbl> 2.046968e-02, 3.165635e-04, 1.983079e-02, 2.38729…
#> $ forest_loss <dbl> 0.0229078334, 0.0058167546, 0.0870502486, 0.01589…
#> $ annual_temperature <dbl> 256.0336, 227.5473, 256.3566, 264.0668, 235.7427,…
#> $ precipitation_driest <dbl> 323.52557, 223.27577, 234.64264, 238.36092, 163.9…
#> $ elevation <dbl> 1523.86560, 1507.69812, 1911.90186, 2229.42627, 7…
#> $ conr_eoo <dbl> 6048292, 218811, 5277809, 726755, 10752, NA, 4007…
#> $ conr_aoo <dbl> 328, 76, 132, 212, 28, 4, 176, 1004, 132, 308, 32…
#> $ conr_locs <dbl> 59, 16, 33, 38, 7, 1, 34, 222, 27, 71, 4, 172, 8,…
#> $ conr_obs <dbl> 102, 22, 41, 68, 7, 1, 58, 301, 45, 93, 14, 264, …
#> $ n_specimens <dbl> 181, 27, 61, 107, 9, 2, 89, 470, 102, 127, 15, 59…
#> $ centroid_latitude <dbl> 0.05035375, 5.47690836, -1.49773510, 4.25111231, …
#> $ name <chr> "Myrcia clusiifolia", "Myrcia rotundata", "Myrcia…
#> $ category <chr> "LC", "LC", "LC", "LC", "EN", "DD", "LC", "LC", "…
#> $ genus <chr> "Myrcia", "Myrcia", "Myrcia", "Myrcia", "Myrcia",…
#> $ section <chr> "Myrcia", "Aulomyrcia", "Myrcia", "Aulomyrcia", "…
As you can see, there are a variety of numeric predictors alongside the species name and identifier (“wcvp_id”). There is also some taxonomic info (“genus”, “section”) and the IUCN Red List category (“category”).
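Before splitting anything, it can help to tally the Red List categories, including the species with no assessment yet. A quick sketch using dplyr's count(), which tidymodels attaches:
# number of species in each Red List category (NA = not yet assessed)
count(myrcia_predictors, category, sort=TRUE)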
First we’ll split the species into ones that have been assessed (labelled) and those that haven’t (unlabelled). We’ll include Data Deficient species in unlabelled.
labelled <- filter(myrcia_predictors, category != "DD", ! is.na(category))
unlabelled <- filter(myrcia_predictors, category == "DD" | is.na(category))
glue("{nrow(labelled)} labelled examples for training and evaluation",
"{nrow(unlabelled)} unlabelled examples for prediction", .sep="\n")
#> 346 labelled examples for training and evaluation
#> 320 unlabelled examples for prediction
Next, we’ll make our prediction target the binarised categories “threatened” (VU, EN, CR) and “not threatened” (LC and NT), to simplify things.
labelled$obs <- ifelse(labelled$category %in% c("LC", "NT"), "not threatened", "threatened")
labelled$obs <- factor(labelled$obs, levels=c("threatened", "not threatened"))
glue("{percent_format(accuracy=0.1)(mean(labelled$obs == 'threatened'))}",
"of training/evaluation species are threatened", .sep=" ")
#> 44.8% of training/evaluation species are threatened
Before training any model, we need to determine our data budget - how we’ll split our data for the different steps of our process. How we split the data will depend on what exactly we need to do. This paper from Sebastian Raschka gives a detailed overview of different approaches to splitting data, and when you might want to use each one.
In our case, we just want to estimate how our model will perform on unseen data. We have a relatively small dataset, so a single training/test split could give us an unreliable estimate of performance. Instead, we’ll use 5-fold cross-validation. Alternatively, we could use bootstrap resampling, which would also give us confidence intervals on the performance (a sketch of this follows the folds below).
folds <- vfold_cv(v=5, data=labelled)
folds
#> # 5-fold cross-validation
#> # A tibble: 5 × 2
#> splits id
#> <list> <chr>
#> 1 <split [276/70]> Fold1
#> 2 <split [277/69]> Fold2
#> 3 <split [277/69]> Fold3
#> 4 <split [277/69]> Fold4
#> 5 <split [277/69]> Fold5
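If we wanted the bootstrap alternative mentioned above, rsample offers it through a very similar interface. A sketch, not used in the rest of this example (the number of resamples is arbitrary):
# 25 bootstrap resamples of the labelled data; a resample set like this can be
# passed to fit_resamples() later in place of the cross-validation folds
boots <- bootstraps(labelled, times=25)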
Another thing to set up before we actually get to training a model is which metrics we want to use to evaluate it.
For our purposes we care about a few things:
- accuracy: the overall proportion of species that the model classifies correctly
- sensitivity: the proportion of threatened species that the model correctly identifies as threatened
- specificity: the proportion of not threatened species that the model correctly identifies as not threatened
The picture of model performance given by these three measures can sometimes be contradictory (e.g. high accuracy and high specificity, but low sensitivity), and it can be hard to weigh up how good a model is without a single measure of performance. Ideally, we would associate different costs (in terms of money, time, or some other quantity) with wrongly classifying a threatened species and with wrongly classifying a not threatened species. But when we’re not clear on the costs, we can use a single measure that balances the different types of misclassification.
In this case, we’ll use the true skill statistic (also called Youden’s J-index), which is defined as sensitivity + specificity - 1.
metrics <- metric_set(accuracy, sensitivity, specificity, j_index)
metrics
#> # A tibble: 4 × 3
#> metric class direction
#> <chr> <chr> <chr>
#> 1 accuracy class_metric maximize
#> 2 sensitivity class_metric maximize
#> 3 specificity class_metric maximize
#> 4 j_index class_metric maximize
We’ll calculate all four of these metrics so we get a full picture of how our model is performing.
The next step in our process is to define which predictors our model will use and any pre-processing steps we need to do to them.
To define our predictors, we use the R formula specification. For many machine learning models, we just need to add each predictor as a term in the formula. For some other models, like a GLM, you might need to specify interactions between terms as well.
form <- formula(
obs ~ eoo + centroid_latitude + human_population + precipitation_driest + forest_loss
)
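For comparison, if we were fitting something like a GLM we might spell out an interaction between two predictors in the formula. A hypothetical sketch, not used below:
# main effects plus an interaction between range size and forest loss
glm_form <- formula(
  obs ~ eoo + forest_loss + eoo:forest_loss + human_population
)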
Once we have our formula, we set up a recipe that details each of our pre-processing steps. The ones we’ll use here:
- impute any missing predictor values using k-nearest neighbours (step_impute_knn)
- remove predictors that are highly correlated with other predictors (step_corr)
- remove any predictors with zero variance (step_zv)
- centre and scale (normalise) all predictors (step_normalize)
rec <-
recipe(form, data=labelled) %>%
step_impute_knn(all_predictors()) %>%
step_corr(all_predictors(), threshold=0.9) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
rec
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 5
#>
#> Operations:
#>
#> K-nearest neighbor imputation for all_predictors()
#> Correlation filter on all_predictors()
#> Zero variance filter on all_predictors()
#> Centering and scaling for all_predictors()
The important thing here is that we haven’t applied any of these steps to our data yet; this is just the specification.
One advantage of this is that it prevents data leakage between our training and test sets, because all of the calculations (like the mean and standard deviation used for normalisation) will only be performed on the training data. It also lets us reuse the same recipe for different models and different datasets.
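If you want to inspect what the pre-processing actually produces, you can estimate the recipe with prep() and apply it with bake(). This is just for inspection; the workflow we build below takes care of it for us:
# estimate the imputation model, correlation filter, and normalisation
# statistics from the labelled data
prepped <- prep(rec, training=labelled)

# new_data=NULL returns the processed copy of the data the recipe was estimated on
bake(prepped, new_data=NULL)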
Now we get to specifying the model we want to use.
The thinking behind these specifications is to separate the type of model from the package that implements it.
So below, we’ve said that we want a random forest model with 1000 trees.
spec <-
rand_forest(trees=1000) %>%
set_engine("randomForest", importance=TRUE) %>%
set_mode("classification")
spec
#> Random Forest Model Specification (classification)
#>
#> Main Arguments:
#> trees = 1000
#>
#> Engine-Specific Arguments:
#> importance = TRUE
#>
#> Computational engine: randomForest
After that, we say that we want to use the implementation from the “randomForest” package. We can also specify which engine-specific arguments we want at this point - for instance that we want to calculate feature importance.
The final part of the specification is the mode we want to use. This is determined by the problem we’re trying to solve. In the case of automated assessments, we’re usually trying to predict which IUCN category a species belongs to, so we choose “classification”.
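Because the model type is separated from the engine, switching implementations is a small change to the specification. For example, a sketch of the same model using the “ranger” engine, which names its importance argument differently:
# same random forest, different implementation; importance="impurity" is
# ranger's way of asking for feature importance scores
ranger_spec <-
  rand_forest(trees=1000) %>%
  set_engine("ranger", importance="impurity") %>%
  set_mode("classification")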
The final part of setting everything up is to link the pre-processing recipe to the model specification in a workflow.
wf <-
workflow() %>%
add_recipe(rec) %>%
add_model(spec)
wf
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#>
#> • step_impute_knn()
#> • step_corr()
#> • step_zv()
#> • step_normalize()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Random Forest Model Specification (classification)
#>
#> Main Arguments:
#> trees = 1000
#>
#> Engine-Specific Arguments:
#> importance = TRUE
#>
#> Computational engine: randomForest
As we mentioned before, this helps to prevent data leakage when training and evaluating our model. It also has the benefit of allowing us to set up multiple workflows, using the same pre-processing recipe with a range of different model specifications, or multiple pre-processing recipes with the same model.
The possibilities are endless!
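For instance, a second workflow could pair the same recipe with a plain logistic regression. A sketch, not evaluated here:
# logistic regression via the "glm" engine
logistic_spec <-
  logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# reuse the same pre-processing recipe with the new model
logistic_wf <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_spec)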
We can now evaluate the performance of our whole workflow using the data budget we set up before.
results <- fit_resamples(
wf,
folds,
metrics=metrics,
control=control_resamples(save_pred=TRUE, save_workflow=TRUE)
)
collect_metrics(results)
#> # A tibble: 4 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 accuracy binary 0.795 5 0.0125 Preprocessor1_Model1
#> 2 j_index binary 0.587 5 0.0309 Preprocessor1_Model1
#> 3 sensitivity binary 0.756 5 0.0436 Preprocessor1_Model1
#> 4 specificity binary 0.831 5 0.0338 Preprocessor1_Model1
This gives us an estimate of how our model will perform on completely new data that it hasn’t seen before.
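As a quick check, the j_index above is just sensitivity + specificity - 1 (0.756 + 0.831 - 1 ≈ 0.587). And because we set save_pred=TRUE, we can also pull out the held-out predictions from every fold, for example to build a pooled confusion matrix. A sketch:
# predictions made on the held-out fold of each resample,
# pooled into a single confusion matrix
collect_predictions(results) %>%
  conf_mat(truth=obs, estimate=.pred_class)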
The final step is to use our model to make predictions.
We want our model to learn from as much data as possible, so we fit the model for a final time on our whole labelled dataset. This is fine, because we don’t intend to evaluate the performance of our model on any part of it again.
m <- fit(wf, labelled)
m
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#>
#> • step_impute_knn()
#> • step_corr()
#> • step_zv()
#> • step_normalize()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call:
#> randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, importance = ~TRUE)
#> Type of random forest: classification
#> Number of trees: 1000
#> No. of variables tried at each split: 2
#>
#> OOB estimate of error rate: 19.65%
#> Confusion matrix:
#> threatened not threatened class.error
#> threatened 118 37 0.2387097
#> not threatened 31 160 0.1623037
One thing that you might notice from the model summary above is that it has its own report of the model performance called the “OOB estimate of error rate”. This is the “out-of-bag” error - each tree of the random forest model is trained on a different subset of the training data, and the OOB error is evaluated for each tree on the data not included in that subset.
We could use that as our estimate of how well the model performs, but I generally prefer the cross-validation setup we used above. One reason is that the same cross-validation splits can be used to compare the performance of multiple models. Another is that the OOB error can sometimes be a little more pessimistic than cross-validation, although generally they agree quite well.
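Since we asked the randomForest engine to calculate feature importance, we can also pull the underlying fit out of the trained workflow and look at which predictors it relied on. A sketch:
# the raw randomForest object inside the fitted workflow
rf_fit <- extract_fit_engine(m)

# importance scores for each predictor (mean decrease in accuracy and in Gini impurity)
randomForest::importance(rf_fit)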
Now we have our final model, we can at last estimate the total proportion of Myrcia species that are threatened (including those already assessed).
preds <- predict(m, myrcia_predictors)
prop_threatened <- mean(preds$.pred_class == "threatened")
glue("{percent_format(accuracy=0.1)(prop_threatened)} of Myrcia species predicted threatened")
#> 61.9% of Myrcia species predicted threatened
Much higher than the proportion of assessed Myrcia species that are threatened!
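If we want to dig into individual species, we can attach the predicted classes (and class probabilities) back to the species names. A sketch for the unassessed species only:
# predicted class and class probabilities for each unassessed species
unlabelled_preds <- bind_cols(
  select(unlabelled, name),
  predict(m, unlabelled),              # predicted class
  predict(m, unlabelled, type="prob")  # class probabilities
)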