tidysdm overview

SDMs with tidymodels

Species Distribution Modelling relies on several algorithms, many of which have a number of hyperparameters that require tuning. The tidymodels universe includes a number of packages specifically designed to fit, tune and validate models. The advantage of tidymodels is that the model syntax and the results returned to the user are standardised, thus providing a coherent interface to modelling. Given the variety of models required for SDM, tidymodels is an ideal framework. tidysdm provides a number of wrappers and specialised functions to facilitate the fitting of SDMs with tidymodels.

This article provides an overview of how tidysdm facilitates fitting SDMs. Further articles, detailing how to use the package for palaeodata, fit more complex models, and troubleshoot models, can be found on the tidysdm website. As tidysdm relies on tidymodels, users are advised to familiarise themselves with the introductory tutorials on the tidymodels website.

When we load tidysdm, it automatically loads tidymodels and all associated packages necessary to fit models:

library(tidysdm)
#> Loading required package: tidymodels
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
#> ✔ broom        1.0.5      ✔ recipes      1.0.10
#> ✔ dials        1.2.1      ✔ rsample      1.2.0 
#> ✔ dplyr        1.1.4      ✔ tibble       3.2.1 
#> ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
#> ✔ infer        1.0.6      ✔ tune         1.1.2 
#> ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
#> ✔ parsnip      1.2.0      ✔ workflowsets 1.0.1 
#> ✔ purrr        1.0.2      ✔ yardstick    1.3.0
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Search for functions across packages at https://www.tidymodels.org/find/
#> Loading required package: spatialsample

Preparing your data

We start by reading in a set of presences for a species of lizard that inhabits the Iberian peninsula, Lacerta schreiberi.

data(lacerta)
lacerta
#> # A tibble: 1,297 × 3
#>           ID latitude longitude
#>        <dbl>    <dbl>     <dbl>
#>  1 858029749     42.6     -7.09
#>  2 858029738     42.6     -7.09
#>  3 614631090     41.4     -7.90
#>  4 614631085     41.3     -7.81
#>  5 614631083     41.3     -7.81
#>  6 614631080     41.4     -7.83
#>  7 614631072     41.4     -7.81
#>  8 614559731     40.3     -7.70
#>  9 614559728     40.4     -7.70
#> 10 614559657     40.4     -7.56
#> # ℹ 1,287 more rows

First, let us visualise our presences by plotting them on a map. tidysdm works with sf objects to represent locations, so we will cast our coordinates into an sf object, and set its projection to standard lonlat (crs = 4326).

library(sf)
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
lacerta <- st_as_sf(lacerta, coords = c("longitude", "latitude"))
st_crs(lacerta) <- 4326

It is usually advisable to plot the locations directly on the raster that will be used to extract climatic variables, to see how the locations fall within the discrete space of the raster. For this vignette, we will use WorldClim as our source of climatic information. We will access the WorldClim data via the library pastclim; even though this library, as the name suggests, is mostly designed to handle palaeoclimatic reconstructions, it also provides convenient functions to access present-day reconstructions and future projections. pastclim has a handy function to get the land mask for the available datasets, which we can use as a background for our locations. We will crop the raster to the Iberian peninsula, where our lizard lives. For this simple illustration, we will not bother to project the raster, but an equal area projection would be desirable…

library(pastclim)
download_dataset(dataset = "WorldClim_2.1_10m")
land_mask <-
  get_land_mask(time_ce = 1985, dataset = "WorldClim_2.1_10m")

# Iberian peninsula extent
iberia_poly <-
  terra::vect(
    "POLYGON((-9.8 43.3,-7.8 44.1,-2.0 43.7,3.6 42.5,3.8 41.5,1.3 40.8,0.3 39.5,
     0.9 38.6,-0.4 37.5,-1.6 36.7,-2.3 36.3,-4.1 36.4,-4.5 36.4,-5.0 36.1,
    -5.6 36.0,-6.3 36.0,-7.1 36.9,-9.5 36.6,-9.4 38.0,-10.6 38.9,-9.5 40.8,
    -9.8 43.3))"
  )

crs(iberia_poly) <- "lonlat"
# crop the extent
land_mask <- crop(land_mask, iberia_poly)
# and mask to the polygon
land_mask <- mask(land_mask, iberia_poly)
#> Loading required package: terra
#> terra 1.7.71
#> 
#> Attaching package: 'terra'
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
#> The following object is masked from 'package:scales':
#> 
#>     rescale

For plotting, we will take advantage of tidyterra, which makes handling of terra rasters with ggplot a breeze.

library(tidyterra)
#> 
#> Attaching package: 'tidyterra'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(ggplot2)
ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_1985)) +
  geom_sf(data = lacerta)

Thinning step

Now, we thin the observations to have one per cell in the raster (it would be better if we had an equal area projection…):

set.seed(1234567)
lacerta <- thin_by_cell(lacerta, raster = land_mask)
nrow(lacerta)
#> [1] 231
ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_1985)) +
  geom_sf(data = lacerta)

Now, we thin further to remove points that are closer than 20km. However, note that the standard map units for a ‘lonlat’ projection are meters. tidysdm provides a convenient conversion function, km2m(), to avoid having to write lots of zeroes:

set.seed(1234567)
lacerta_thin <- thin_by_dist(lacerta, dist_min = km2m(20))
nrow(lacerta_thin)
#> [1] 113

Let’s see what we have left of our points:

ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_1985)) +
  geom_sf(data = lacerta_thin)

Now we sample pseudo-absences (we will constrain them to be at least 50km away from any presences), selecting three times as many points as presences:

set.seed(1234567)
lacerta_thin <- sample_pseudoabs(lacerta_thin,
  n = 3 * nrow(lacerta_thin),
  raster = land_mask,
  method = c("dist_min", km2m(50))
)

Let’s see our presences and absences:

ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_1985)) +
  geom_sf(data = lacerta_thin, aes(col = class))

Generally, we can use pastclim to check what variables are available for the WorldClim dataset:

climate_vars <- get_vars_for_dataset("WorldClim_2.1_10m")

We first download the dataset at the right resolution (here 10 arc-minutes):

download_dataset("WorldClim_2.1_10m")

And then create a terra SpatRaster object. The dataset covers the period 1970-2000, so pastclim dates it as 1985 (the midpoint). We can directly crop to the Iberian peninsula:

climate_present <- pastclim::region_slice(
  time_ce = 1985,
  bio_variables = climate_vars,
  data = "WorldClim_2.1_10m",
  crop = iberia_poly
)

We start by selecting variables for which presences are markedly different from the underlying background. First, let’s extract climate for all presences and pseudo-absences:

lacerta_thin <- lacerta_thin %>%
  bind_cols(terra::extract(climate_present, lacerta_thin, ID = FALSE))

We can use violin plots to compare the distribution of the climatic variables for presences and pseudo-absences:

lacerta_thin %>% plot_pres_vs_bg(class)

We want to choose variables for which presences use values different from the background (pseudo-absences). We can qualitatively look at the plots, or use a quantitative approach that ranks them based on the overlap of the respective density plots:

lacerta_thin %>% dist_pres_vs_bg(class)
#>     bio19     bio12     bio16     bio02     bio13     bio07     bio05     bio04 
#> 0.6935026 0.6929654 0.6885097 0.6776561 0.6679894 0.6542063 0.6243068 0.6071527 
#>     bio09     bio10     bio17     bio15     bio18     bio06     bio08     bio14 
#> 0.5506741 0.4855064 0.4318872 0.4241999 0.3600587 0.3454245 0.3108560 0.2954351 
#>     bio11     bio03     bio01  altitude 
#> 0.2317079 0.2173934 0.2045136 0.1554400

As a first step, we can focus on variables that have at least 30% of non-overlapping distribution between presences and pseudo-absences:

vars_to_keep <- lacerta_thin %>% dist_pres_vs_bg(class)
vars_to_keep <- names(vars_to_keep[vars_to_keep > 0.30])
lacerta_thin <- lacerta_thin %>% select(all_of(c(vars_to_keep, "class")))
vars_to_keep
#>  [1] "bio19" "bio12" "bio16" "bio02" "bio13" "bio07" "bio05" "bio04" "bio09"
#> [10] "bio10" "bio17" "bio15" "bio18" "bio06" "bio08"

This is still a lot of variables. Among them, several have been argued in the past to be important for the distribution of our species of interest. Based on this paper (https://doi.org/10.1007/s10531-010-9865-2), we are interested in:

suggested_vars <- c("bio05", "bio06", "bio13", "bio14", "bio15")

Environmental variables are often highly correlated, and collinearity is an issue for several types of models. We can inspect the correlation among variables with:

pairs(climate_present[[suggested_vars]])

We can see that some variables have rather high correlation (e.g. bio05 vs bio14). We can subset to variables below a certain threshold correlation (e.g. 0.7) with:

climate_present <- climate_present[[suggested_vars]]
vars_uncor <- filter_high_cor(climate_present, cutoff = 0.7)
vars_uncor
#> [1] "bio15" "bio05" "bio13" "bio06"
#> attr(,"to_remove")
#> [1] "bio14"

So, removing bio14 leaves us with a set of uncorrelated variables.

lacerta_thin <- lacerta_thin %>% select(all_of(c(vars_uncor, "class")))
climate_present <- climate_present[[vars_uncor]]

Fit the model by cross-validation

Next, we need to set up a recipe to define how to handle our dataset. We don’t want to do anything to our data in terms of transformations, so we just need to define the formula (class is the outcome, all other variables are predictors; note that, for sf objects, geometry is automatically replaced by X and Y columns which are assigned a role of coords, and thus not used as predictors):

lacerta_rec <- recipe(lacerta_thin, formula = class ~ .)
lacerta_rec
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 4
#> coords:    2

In classification models for tidymodels, the assumption is that the level of interest for the response (in our case, presences) is the reference level. We can confirm that we have the data correctly formatted with:

lacerta_thin %>% check_sdm_presence(class)
#> [1] TRUE
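
Had the check failed (i.e. had “presence” not been the reference level), we would need to reorder the factor levels before fitting any model. A minimal sketch, only needed if your own data are mis-formatted (lacerta_thin is already correct):

# hypothetical fix: make "presence" the first (reference) level of the factor
lacerta_thin$class <- relevel(lacerta_thin$class, ref = "presence")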

We now build a workflow_set of different models, defining which hyperparameters we want to tune. We will use glm, random forest, boosted trees and maxent as our models (for more details on how to use workflow_sets, see this tutorial). The latter three models have tunable hyperparameters. For the most commonly used models, tidysdm automatically chooses the most important parameters, but it is possible to fully customise model specifications (e.g. see the help for sdm_spec_rf, and the sketch after the next code block).

lacerta_models <-
  # create the workflow_set
  workflow_set(
    preproc = list(default = lacerta_rec),
    models = list(
      # the standard glm specs
      glm = sdm_spec_glm(),
      # rf specs with tuning
      rf = sdm_spec_rf(),
      # boosted tree model (gbm) specs with tuning
      gbm = sdm_spec_boost_tree(),
      # maxent specs with tuning
      maxent = sdm_spec_maxent()
    ),
    # make all combinations of preproc and models,
    cross = TRUE
  ) %>%
  # tweak controls to store information needed later to create the ensemble
  option_add(control = control_ensemble_grid())
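
As a hedged illustration of that customisation (check ?sdm_spec_rf for the exact options), the sdm_spec_* functions take a tune argument controlling which hyperparameters are flagged for tuning:

# illustrative only: tune all available hyperparameters,
# rather than the SDM-relevant defaults
sdm_spec_rf(tune = "all")
# or fit with fixed defaults and no tuning at all
sdm_spec_rf(tune = "none")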

We now want to set up a spatial block cross-validation scheme to tune and assess our models. We will do an 80:20 split, by creating 5 folds. We use the spatial_block_cv function from the package spatialsample. spatialsample offers a number of sampling approaches for spatial data; it is also possible to convert objects created with blockCV (which offers further features for spatial sampling, such as stratified sampling) into an rsample object suitable for tidysdm with the function blockcv2rsample (see the sketch after the next code block).

library(tidysdm)
set.seed(100)
lacerta_cv <- spatial_block_cv(lacerta_thin, v = 5)
autoplot(lacerta_cv)
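
As a hedged sketch of the blockCV route (the cv_spatial() arguments below are illustrative; consult the blockCV documentation for appropriate settings):

# illustrative only: build spatial blocks with blockCV, then convert
# them into an rsample object that tidysdm can use
library(blockCV)
lacerta_blocks <- cv_spatial(
  x = lacerta_thin, column = "class",
  k = 5, size = km2m(50) # block size in metres (illustrative)
)
lacerta_cv_blockcv <- blockcv2rsample(lacerta_blocks, lacerta_thin)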

We can now use the block CV folds to tune and assess the models (to keep computations fast, we will only explore 3 combinations of hyperparameters per model; this is far too few in real life!):

set.seed(1234567)
lacerta_models <-
  lacerta_models %>%
  workflow_map("tune_grid",
    resamples = lacerta_cv, grid = 3,
    metrics = sdm_metric_set(), verbose = TRUE
  )
#> i    No tuning parameters. `fit_resamples()` will be attempted
#> i 1 of 4 resampling: default_glm
#> ✔ 1 of 4 resampling: default_glm (328ms)
#> i 2 of 4 tuning:     default_rf
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> ✔ 2 of 4 tuning:     default_rf (1.2s)
#> i 3 of 4 tuning:     default_gbm
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> ✔ 3 of 4 tuning:     default_gbm (6.6s)
#> i 4 of 4 tuning:     default_maxent
#> ✔ 4 of 4 tuning:     default_maxent (2.6s)

Note that workflow_set correctly detects that we have no tuning parameters for glm. We can have a look at the performance of our models with:

autoplot(lacerta_models)

Now let’s create an ensemble, selecting the best set of parameters for each model (this is only relevant for the models with tunable hyperparameters, as there was nothing to tune for the glm). We will use the Boyce continuous index as our metric to choose the best random forest, boosted trees and maxent models. When members are added to an ensemble, they are automatically fitted to the full training dataset, and are thus ready to make predictions.

lacerta_ensemble <- simple_ensemble() %>%
  add_member(lacerta_models, metric = "boyce_cont")
lacerta_ensemble
#> A simple_ensemble of models
#> 
#> Members:
#> • default_glm
#> • default_rf
#> • default_gbm
#> • default_maxent
#> 
#> Available metrics:
#> • boyce_cont
#> • roc_auc
#> • tss_max
#> 
#> Metric used to tune workflows:
#> • boyce_cont

And visualise it:

autoplot(lacerta_ensemble)

A tabular form of the model metrics can be obtained with:

lacerta_ensemble %>% collect_metrics()
#> # A tibble: 12 × 5
#>    wflow_id       .metric     mean std_err     n
#>    <chr>          <chr>      <dbl>   <dbl> <int>
#>  1 default_glm    boyce_cont 0.633 0.133       5
#>  2 default_glm    roc_auc    0.963 0.00580     5
#>  3 default_glm    tss_max    0.866 0.0257      5
#>  4 default_rf     boyce_cont 0.596 0.133       5
#>  5 default_rf     roc_auc    0.980 0.00684     5
#>  6 default_rf     tss_max    0.899 0.0256      5
#>  7 default_gbm    boyce_cont 0.789 0.0906      5
#>  8 default_gbm    roc_auc    0.975 0.0105      5
#>  9 default_gbm    tss_max    0.892 0.0332      5
#> 10 default_maxent boyce_cont 0.802 0.0676      5
#> 11 default_maxent roc_auc    0.988 0.00430     5
#> 12 default_maxent tss_max    0.899 0.0247      5

Projecting to the present

We can now make predictions with this ensemble (using the default option of taking the mean of the predictions from each model).

prediction_present <- predict_raster(lacerta_ensemble, climate_present)
ggplot() +
  geom_spatraster(data = prediction_present, aes(fill = mean)) +
  scale_fill_terrain_c() +
  # plot presences used in the model
  geom_sf(data = lacerta_thin %>% filter(class == "presence"))

We can subset the ensemble to only use the best models, based on the Boyce continuous index, by setting a minimum threshold of 0.8 for that metric. We will also take the median of the available model predictions (instead of the mean, which is the default). The plot does not change much (the models are quite consistent).

prediction_present_boyce <- predict_raster(lacerta_ensemble, climate_present,
  metric_thresh = c("boyce_cont", 0.8),
  fun = "median"
)
ggplot() +
  geom_spatraster(data = prediction_present_boyce, aes(fill = median)) +
  scale_fill_terrain_c() +
  geom_sf(data = lacerta_thin %>% filter(class == "presence"))

Sometimes, it is desirable to have binary predictions (presence vs absence), rather than the probability of occurrence. To do so, we first need to calibrate the threshold used to convert probabilities into classes (in this case, we optimise the TSS):

lacerta_ensemble <- calib_class_thresh(lacerta_ensemble,
  class_thresh = "tss_max"
)

And now we can generate binary predictions for our region:

prediction_present_binary <- predict_raster(lacerta_ensemble,
  climate_present,
  type = "class",
  class_thresh = c("tss_max")
)
ggplot() +
  geom_spatraster(data = prediction_present_binary, aes(fill = binary_mean)) +
  geom_sf(data = lacerta_thin %>% filter(class == "presence"))

Projecting to the future

WorldClim has a wide selection of projections for the future based on different models and Shared Socio-economic Pathways (SSP). Type help("WorldClim_2.1") for a full list. We will use predictions based on the “HadGEM3-GC31-LL” model for SSP 245 (intermediate greenhouse gas emissions) at the same resolution as the present-day data (10 arc-minutes). We first download the data:

download_dataset("WorldClim_2.1_HadGEM3-GC31-LL_ssp245_10m")

Let’s see what times are available:

get_time_ce_steps("WorldClim_2.1_HadGEM3-GC31-LL_ssp245_10m")
#> [1] 2030 2050 2070 2090

We will predict for 2090, the most distant time point available.

Let’s now check the available variables:

get_vars_for_dataset("WorldClim_2.1_HadGEM3-GC31-LL_ssp245_10m")
#>  [1] "bio01" "bio02" "bio03" "bio04" "bio05" "bio06" "bio07" "bio08" "bio09"
#> [10] "bio10" "bio11" "bio12" "bio13" "bio14" "bio15" "bio16" "bio17" "bio18"
#> [19] "bio19"

Note that future predictions do not include altitude (as that does not change with time), so if we needed it, we would have to copy it over from the present-day dataset. However, altitude is not in the set of uncorrelated variables we selected earlier, so we do not need it here; a sketch of the copy-over is shown after the next code block.

climate_future <- pastclim::region_slice(
  time_ce = 2090,
  bio_variables = vars_uncor,
  data = "WorldClim_2.1_HadGEM3-GC31-LL_ssp245_10m",
  crop = iberia_poly
)
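
Had altitude been among our predictors, a minimal sketch of that copy-over (illustrative only, not needed for our variable set):

# illustrative only: slice altitude from the present-day dataset
# and append it to the future climate stack
altitude_present <- pastclim::region_slice(
  time_ce = 1985,
  bio_variables = "altitude",
  data = "WorldClim_2.1_10m",
  crop = iberia_poly
)
climate_future <- c(climate_future, altitude_present)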

And predict using the ensemble:

prediction_future <- predict_raster(lacerta_ensemble, climate_future)

ggplot() +
  geom_spatraster(data = prediction_future, aes(fill = mean)) +
  scale_fill_terrain_c()

Visualising the contribution of individual variables

It is sometimes of interest to understand the relative contribution of individual variables to the prediction. This is a complex task, especially if there are interactions among variables. For simpler, linear models, it is possible to obtain marginal response curves (which show the effect of a variable whilst keeping all other variables at their mean) using step_profile() from the recipes package. We use step_profile() to define a new recipe, which we can then bake to generate the appropriate dataset to make the marginal prediction. We can then plot the predictions against the values of the variable of interest. For example, to investigate the contribution of bio05, we would:

bio05_prof <- lacerta_rec %>%
  step_profile(-bio05, profile = vars(bio05)) %>%
  prep(training = lacerta_thin)

bio05_data <- bake(bio05_prof, new_data = NULL)

bio05_data <- bio05_data %>%
  mutate(
    pred = predict(lacerta_ensemble, bio05_data)$mean
  )

ggplot(bio05_data, aes(x = bio05, y = pred)) +
  geom_point(alpha = .5, cex = 1)

It is also possible to use DALEX to explore tidysdm models; see more details in the tidymodels additions article.
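
As a minimal hedged sketch (assuming the DALEX package is installed; see that article for the full treatment):

# wrap the ensemble in a DALEX explainer and compute
# permutation-based variable importance
explainer_lacerta_ens <- explain_tidysdm(lacerta_ensemble)
vip_ensemble <- DALEX::model_parts(explainer = explainer_lacerta_ens)
plot(vip_ensemble)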

Repeated ensembles

The steps of thinning and sampling pseudo-absences can have a big impact on the performance of SDMs. As these steps are stochastic, it is good practice to explore their effect by repeating them, and then creating ensembles of models over these repeats. In tidysdm, it is possible to create repeat_ensembles. We start by creating a list of simple_ensembles, by looping through the SDM pipeline. We will just use two fast models to speed up the process.

# empty object to store the simple ensembles that we will create
ensemble_list <- list()
set.seed(123) # make sure you set the seed OUTSIDE the loop
for (i_repeat in 1:3) {
  # thin the data
  lacerta_thin_rep <- thin_by_cell(lacerta, raster = climate_present)
  lacerta_thin_rep <- thin_by_dist(lacerta_thin_rep, dist_min = 20000)
  # sample pseudo-absences
  lacerta_thin_rep <- sample_pseudoabs(lacerta_thin_rep,
    n = 3 * nrow(lacerta_thin_rep),
    raster = climate_present,
    method = c("dist_min", 50000)
  )
  # get climate
  lacerta_thin_rep <- lacerta_thin_rep %>%
    bind_cols(terra::extract(climate_present, lacerta_thin_rep, ID = FALSE))
  # create folds
  lacerta_thin_rep_cv <- spatial_block_cv(lacerta_thin_rep, v = 5)
  # create a recipe
  lacerta_thin_rep_rec <- recipe(lacerta_thin_rep, formula = class ~ .)
  # create a workflow_set
  lacerta_thin_rep_models <-
    # create the workflow_set
    workflow_set(
      preproc = list(default = lacerta_thin_rep_rec),
      models = list(
        # the standard glm specs
        glm = sdm_spec_glm(),
        # maxent specs with tuning
        maxent = sdm_spec_maxent()
      ),
      # make all combinations of preproc and models,
      cross = TRUE
    ) %>%
    # tweak controls to store information needed later to create the ensemble
    option_add(control = control_ensemble_grid())

  # train the model
  lacerta_thin_rep_models <-
    lacerta_thin_rep_models %>%
    workflow_map("tune_grid",
      resamples = lacerta_thin_rep_cv, grid = 10,
      metrics = sdm_metric_set(), verbose = TRUE
    )
  # make a simple ensemble and add it to the list
  ensemble_list[[i_repeat]] <- simple_ensemble() %>%
    add_member(lacerta_thin_rep_models, metric = "boyce_cont")
}
#> i    No tuning parameters. `fit_resamples()` will be attempted
#> i 1 of 2 resampling: default_glm
#> ✔ 1 of 2 resampling: default_glm (296ms)
#> i 2 of 2 tuning:     default_maxent
#> ✔ 2 of 2 tuning:     default_maxent (8.1s)
#> i    No tuning parameters. `fit_resamples()` will be attempted
#> i 1 of 2 resampling: default_glm
#> ✔ 1 of 2 resampling: default_glm (298ms)
#> i 2 of 2 tuning:     default_maxent
#> ✔ 2 of 2 tuning:     default_maxent (8.1s)
#> i    No tuning parameters. `fit_resamples()` will be attempted
#> i 1 of 2 resampling: default_glm
#> ✔ 1 of 2 resampling: default_glm (287ms)
#> i 2 of 2 tuning:     default_maxent
#> ✔ 2 of 2 tuning:     default_maxent (7.9s)

Now we can create a repeat_ensemble from the list:

lacerta_rep_ens <- repeat_ensemble() %>% add_repeat(ensemble_list)
lacerta_rep_ens
#> A repeat_ensemble of models
#> 
#> Number of repeats:
#> • 3
#> 
#> Members:
#> • default_glm
#> • default_maxent
#> 
#> Available metrics:
#> • boyce_cont
#> • roc_auc
#> • tss_max
#> 
#> Metric used to tune workflows:
#> • boyce_cont

We can summarise the goodness of fit of the models for each repeat with collect_metrics(), but there is no autoplot() method for repeat_ensemble objects.
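
For example (output not shown):

lacerta_rep_ens %>% collect_metrics()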

We can then predict in the usual way (we will take the mean and median of all models):

lacerta_rep_ens <- predict_raster(lacerta_rep_ens, climate_present,
  fun = c("mean", "median")
)
ggplot() +
  geom_spatraster(data = lacerta_rep_ens, aes(fill = median)) +
  scale_fill_terrain_c()