UBC Botanical Garden is advancing action for adaptation through the development of community-based adaptation plans with the Sustainable Communities Field School. Ensuring the long-term preservation and sustainable utilization of apple genetic diversity across Canada is of paramount importance. This collaborative initiative aims to inventory and understand existing apple genetic diversity within the country. Additionally, it seeks to investigate how the suitability of habitats for apples may be affected by changing climatic conditions and emerging challenges. In response to these potential impacts, the project aims to identify urgent and longer-term adaptation strategies necessary for the continued resilience of apple populations in Canada.
This work is associated with an in-progress manuscript on Malus crop wild relatives (CWR) that are native to Canada, including M. coronaria (Sweet Crabapple) and M. fusca (Pacific Crabapple).
The following analysis was written in R, a free programming language and software environment primarily used for statistical computing and graphics. Visit https://www.r-project.org/ to get started with R. Also recommended is RStudio, an integrated development environment (IDE) for R. It provides a user-friendly interface for writing code, viewing plots and data, managing files, and installing packages. Once R is installed, visit https://www.rstudio.com/products/rstudio/download/ to download RStudio.
The species distribution modeling in the following analysis also requires that you have Java (another free programming language) installed. Visit https://www.oracle.com/ca-en/java/technologies/downloads/ and install the Java Development Kit (JDK).
Please note: The following is a reproducible example, and is not intended to be an introductory tutorial on SDM. The markdown is written for someone with intermediate to advanced experience in R.
The following is a reproducible workflow for the development of species distribution models (SDMs), using presence/background occurrence data for the M. coronaria and M. fusca crabapple tree species. This data will be explained in more detail below. Occurrence data will be downloaded from the open-source Global Biodiversity Information Facility (GBIF). There are several types of SDMs; in this workflow we use the widely adopted MaxEnt (Maximum Entropy), a machine learning algorithm that can handle presence/background data. The outputs of this analysis are models of predicted habitat suitability of fundamental niches based on environmental (climate) data. These models predict suitable habitat under historical (1970-2000) climate conditions and under future climate conditions projected under two Shared Socioeconomic Pathways (SSP2-4.5 and SSP5-8.5). This is essential for understanding how these species may respond to changes in climate, so that they can be better managed, sampled, or conserved to preserve their genetic diversity for future generations.
See the following for an introduction and description of MaxEnt: Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E., & Yates, C. J. (2011). A statistical explanation of MaxEnt for ecologists. Diversity and Distributions, 17(1), 43-57. https://doi.org/10.1111/j.1472-4642.2010.00725.x.
Here is a brief overview of the steps and associated scripts for each part of the analysis.
Step | Script | Purpose |
---|---|---|
1 | gbif_occ.R | Prepare data request and download from GBIF |
2 | occ_clean.R | Clean occurrence (presence) data points |
3 | occ_thin.R | Thin occurrence points to a single observation per predictor (raster) cell |
4 | malus_bg.R | Download ecoregions, sample background points, predictor cluster, 3D Kernel density plots |
5 | malus_sdm.R | Climate predictor data, Maxent modeling and tuning, habitat suitability predictions |
6 | sdm_plot.R | Produce publication quality plots of SDM predictions |
7 | malus_gap | Conservation gap analysis to assess the level of conservation of species |
IMPORTANT NOTE: This workflow and code differ slightly from the code published in the https://github.com/TerrellRoulston/malus repository, and should be treated as an example rather than a standard for how to perform species distribution modeling. One notable difference is that Roulston et al. (in prep) limited the number of predictor variables to a list of well-performing variables, rather than using them all as in this workflow.
The Global Biodiversity Information Facility (GBIF) is a valuable resource for SDM due to its extensive collection of biodiversity data from around the world. There are several sources of occurrences, such as research and herbarium collections and community science observations. Note that this data is presence-only, meaning there is no information about the absence of species. Also, all occurrences are research grade, meaning the identifications are either expert-verified or agreed upon by several community members.
NOTE: You need to have a GBIF user profile, visit https://www.gbif.org/ to sign up!
Load libraries
library(tidyverse) # data management, grammar
library(rgbif) # access GBIF data
Initiate user profile information. Note that I have redacted my own information.
# GBIF user info
user='REDACTED'
pwd='REDACTED'
email='REDACTED'
Navigate to https://www.gbif.org/ and search your taxon of interest to get the taxon key identifier. In this case the taxon keys are: M. coronaria = 3001166 and M. fusca = 3001080.
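If you prefer to look the keys up programmatically rather than through the website, a minimal sketch using the rgbif backbone-matching helper is shown below (the returned usageKey values should match the keys above, but verify them against the GBIF website):
# Optional sketch: look up GBIF taxon keys from within R
# name_backbone() matches a name against the GBIF backbone taxonomy
name_backbone(name = "Malus coronaria")$usageKey # should return 3001166
name_backbone(name = "Malus fusca")$usageKey     # should return 3001080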
For the purposes of this workflow I will only show the code for one of the species. Code is just replicated for the other species.
# M. coronaria download
taxonKey <- 3001166
basisOfRecord <- c('PRESERVED_SPECIMEN', 'HUMAN_OBSERVATION', 'OCCURRENCE') # excluded living specimens and material samples (germplasm)
hasCoordinates <- TRUE # limit to records with coordinates
NOTE: We do not limit the year of observation in our data collection. There are reasons why you might; however, for the purposes of this analysis it is reasonable to include all observations, because the main reason historical occurrences may be missing today (or, say, post-1970) is more likely habitat destruction than changes in climate affecting realized niches. Furthermore, crabapple trees are slow growing, and migrate even more slowly, so their distributions likely lag behind changes in their fundamental niches. This highlights the importance of understanding the ecology of your species of interest.
Prepare a download request for the GBIF occurrence data. There are several formats; I chose .csv because it is familiar and easy to work with.
# Download data
# Use 'pred()' if there is a single argument, or 'pred_in()' if there are multiple
down_code = rgbif::occ_download(
pred("taxonKey", taxonKey),
pred_in("basisOfRecord", basisOfRecord),
pred("hasCoordinate", hasCoordinates),
format = "SIMPLE_CSV",
user=user, pwd=pwd, email=email)
Download and save the GBIF occurrence data as a .csv.
getwd() # check your working directory (wd)
setwd("./occ_data/") # set wd to a location where you want to save the csv file.
download_coronaria <- occ_download_get(down_code[1], overwrite = TRUE)
# extract csv from zipped folder and save as a clearly named csv in Excel or equivalent.
gbif_cor <- read.csv(file = "occ_coronaria.csv") # load named csv data
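If you would rather avoid unzipping and renaming the file by hand, a small alternative sketch is to read the table straight from the downloaded archive with rgbif and re-save it under the file name used throughout this workflow:
# Alternative sketch: import the occurrence table directly from the downloaded zip
gbif_cor <- occ_download_import(download_coronaria)
write.csv(gbif_cor, file = "occ_coronaria.csv", row.names = FALSE) # keep a 'raw' copy on disk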
NOTE: It is good practice to keep this .csv as a 'raw' data file, and to manipulate it within the R environment in the following workflow rather than making changes to the raw file.
Now we have downloaded all observations from GBIF for M. coronaria and M. fusca that fit a set of defined criteria suitable for our research question. In this case the criteria are limited to species-level taxonomic resolution and to records from preserved herbarium specimens, human observations (i.e. iNaturalist) and research collections documenting species occurrences, all of which are geo-referenced (i.e. have geographic coordinates). Also note that we do not limit the geographic location of occurrences, yet. At the time of downloading this data (02/14/2024), M. coronaria had 1373 records and M. fusca had 2282 records.
As alluded to above, we now need to make some more choices about which species occurrences are appropriate to include and from where. Thus, we must clean the species occurrence data. To do this we are going to visually inspect the species occurrences as well as take advantage of R packages such as CoordinateCleaner to assist in removing spurious occurrences: replicates, records with poor georeferencing (such as those located at the centre of a country or in a body of water), etc.
Load Libraries
library(tidyverse) # grammar, data management
library(CoordinateCleaner) # helpful functions to clean data
library(terra) # working with vector/raster data
library(geodata) # download basemaps
library(scales) # alpha adjust colours
We begin with loading the saved .csv files of species occurrences from the last script.
# set wd
getwd()
setwd("../occ_data/") # use relative path to open from folder where .csv is saved
# load occurrence csv files
gbif_cor <- read.csv(file = "occ_coronaria.csv") # load coronaria data
gbif_fusca <- read.csv(file = "occ_fusca.csv") # load fusca data
Now we want to initiate a data.frame and select columns that are useful from the .csv files. At the same time as we build this data.frame, we are also going to use several functions to filter occurrences out of the 'cleaned' data set. For more information on the functions used, see the library documentation of CoordinateCleaner. Also see https://data-blog.gbif.org/post/gbif-filtering-guide/ for a great blog post published by GBIF on post-processing GBIF downloads.
# filter M. coronaria data
occ_cor <- gbif_cor %>%
filter(countryCode %in% c('US', 'CA')) %>% #limit to CA and US
filter(!is.na(decimalLongitude)) %>% # remove records w/o coords
filter(coordinateUncertaintyInMeters < 30000 | is.na(coordinateUncertaintyInMeters)) %>%
cc_cen(buffer = 2000) %>% # remove records within 2km of country centroids
cc_inst(buffer = 2000) %>% # remove records within 2km of herbariums, botanical gardens, and other institutions
cc_sea() %>%
distinct(decimalLatitude, decimalLongitude, speciesKey, datasetKey, .keep_all = T) %>%
filter(decimalLongitude >= -100) %>% # remove some records on west coast, two from bot gardens
filter(!(decimalLatitude < 35 & decimalLongitude < -86)) %>% # remove records from Texas, Oklahoma, Louisiana
filter(!(decimalLatitude > 45)) %>% # remove record from New Brunswick
filter(!(decimalLongitude < -98)) %>% # remove iNat record from south Kansas, northern Kansas record verified by taxonomist
dplyr::select(species, countryCode, decimalLatitude,
decimalLongitude, coordinateUncertaintyInMeters, year, basisOfRecord
)
# Note it is helpful to plot the occurrences below, and then add more conditions to clean inaccurate points
# Pay special attention to points at the edge of the range of occurrences, as these are most likely to be suspicious
# as well as influence the model in strange ways, in comparison to inaccurate points well within the other occurrences
After filtering the occurrences we are left with 978 occurrences for M. coronaria and 1197 occurrences for M. fusca.
You will notice that I have added a few manual filters in the code above, using the filter function and the ! operator, to remove records that were verified to be inaccurate. For cleaning these inaccurate points, it is useful to begin by checking occurrences at the edges of the species' range, or any other outlier points. These edge points are the most likely to negatively influence the model if they are not true presence points. I personally find using a combination of the GBIF map and table view of observations, plus Google Maps, to verify the truth of these questionable points to be a good approach.
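As a rough starting point for that manual review, a sketch like the one below can flag records sitting at the extremes of the coordinate range (the quantile cutoffs are arbitrary and only meant to surface candidates for inspection):
# Sketch: list range-edge records worth double-checking by hand
occ_cor %>%
  filter(decimalLatitude >= quantile(decimalLatitude, 0.99) |
           decimalLongitude <= quantile(decimalLongitude, 0.01)) %>%
  dplyr::select(species, decimalLatitude, decimalLongitude, year, basisOfRecord)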
NOTE: I am using pipe operators (%>%) from the tidyverse ecosystem to ease data processing. A pipe sequentially applies each function to the data.frame, so be careful about the order in which you specify functions.
Now we will save the data.frame object as a .Rdata file. .Rdata files are the best way to store/load data within the R environment. We will routinely save intermediate data files to help speed up the downstream workflow, so that results can be stored rather than having to be recomputed every time the R session is closed (INCLUDING WHEN IT CRASHES UNEXPECTEDLY).
getwd() # check what directory you are in
setwd('../occ_data/')
saveRDS(occ_cor, file = "occ_cor.Rdata") # name the file with the .Rdata file extension
Now when revisiting an analysis you can reload the saved cleaned data frame rather than re-running the cleaning functions which are slow computations.
setwd('./occ_data/')
occ_cor <- readRDS(file = "occ_cor.Rdata") # read in the saved .Rdata object
head(occ_cor) # take a peek at your data frame
## species countryCode decimalLatitude decimalLongitude
## 1 Malus coronaria US 43.08833 -84.72617
## 2 Malus coronaria US 43.61555 -84.24722
## 3 Malus coronaria US 40.89888 -74.70638
## 4 Malus coronaria US 41.83333 -88.08333
## 5 Malus coronaria US 42.97083 -82.42500
## 6 Malus coronaria US 35.58000 -82.55583
## coordinateUncertaintyInMeters year basisOfRecord
## 1 NA 2013 HUMAN_OBSERVATION
## 2 20000 1931 PRESERVED_SPECIMEN
## 3 3 1945 PRESERVED_SPECIMEN
## 4 NA 1894 PRESERVED_SPECIMEN
## 5 5000 1892 PRESERVED_SPECIMEN
## 6 20000 NA PRESERVED_SPECIMEN
Earlier I mentioned that we are not limiting the years in which species occurrences were recorded; however, it is still important to verify that this decision is not adding 'weird' bias to the data. To do this I am going to split the occurrences into three data frames: pre-1970, post-1970, and those without dates. This date is commonly associated with when significant changes in global climate began. If occurrences from pre-1970 are vastly different in geography, then we may feel uncomfortable including them in the analysis. Similarly, those without any date should also be verified.
# compare pre/post 1970 and nd occurrences
occ_cor_pre <- occ_cor %>% # pre 1970
filter(year < 1970)
occ_cor_post <- occ_cor %>% # post 1970
filter(year >= 1970)
occ_cor_nd <- occ_cor %>%
filter(year %in% NA)
The best way to compare these pre/post/no-date occurrences is by visualizing them to detect any differences in spatial coordinates. To do this we need to utilize the terra and geodata packages. The terra package is useful for working with raster data and will frequently be used throughout this workflow. The geodata package will also be used frequently, and allows easy download of geographic boundaries from the Global Administrative Areas database (GADM). For now, plotting will be done quickly using base R graphics and terra. At the end we will produce publication-quality maps.
First download the basemaps, and put them in a location to access them again later to avoid having to redownload them. These files are large, so consider not pushing them to GitHub if you are working in a repo.
setwd('../occ_data/')
# download/load maps
us_map <- gadm(country = 'USA', level = 1, resolution = 2,
path = "../occ_data/base_maps") #USA basemap w. States
ca_map <- gadm(country = 'CA', level = 1, resolution = 2,
path = '../occ_data/base_maps') #Canada basemap w. Provinces
canUS_map <- rbind(us_map, ca_map) #combine US and Canada vector map
Now plot the occurrence points. Remember that you might have to go back and add additional filters to the code above, as I have done, for example, by removing records from Texas, Oklahoma, Louisiana, and New Brunswick.
# M coronaria
plot(canUS_map, xlim = c(-100, -60), ylim = c(25, 50))
# plot Malus coronaria occurrences
# pre-1970
points(occ_cor_pre$decimalLongitude, occ_cor_pre$decimalLatitude, pch = 16,
col = alpha("red", 0.2))
# post-1970
points(occ_cor_post$decimalLongitude, occ_cor_post$decimalLatitude, pch = 16,
col = alpha("blue", 0.2))
# No date
points(occ_cor_nd$decimalLongitude, occ_cor_nd$decimalLatitude, pch = 16,
col = alpha('black', 0.2))
legend(x= -75,
y = 33,
title = 'M. coronaria',
legend = c('Pre-1970 (n=412)', 'Post-1970 (n=514)', 'No Date (n=53)'),
fill = c('red', 'blue', 'black'))
dev.off() # close graphics plot window
NOTE: Plotting geographic coordinates can be tricky, and often you will have to play around with the plot limits and legend location to get it to something that is visually appealing.
Inspecting the plot of M. coronaria occurrences, there is a large amount of overlap between occurrences from pre- and post-1970, thus we feel comfortable keeping all the data in the analysis, no matter the year. It is important that you evaluate this for your own species and ask whether it fits your research question.
After visually inspecting the occurrence locations, verifying that there are no duplicates, and no occurrences from strange locations (such as records from the west coast, when dealing with an east coast species), you can save your occurrence data.frame (e.g. occ_cor) for downstream analysis.
So we have downloaded and cleaned the occurrence data for M. coronaria and M. fusca; however, we still need to do more processing before we can move on to producing the SDM. We must now deal with sampling bias in the occurrences. Sampling bias is an issue because, with presence-only data, it is hard to distinguish whether species are being detected because a location is truly preferred habitat, or whether those are just the locations that have been sufficiently sampled. This issue can be (partially) addressed by thinning records, so that multiple records from the same area are represented by only one (or sometimes a few) occurrence sample. This method is imperfect, but it does start to deal with some biases in the spatial representation of the data.
There are several methods for thinning occurrence records, some more sophisticated than others. A simple approach, and the one used in this workflow, is to randomly sample a single occurrence per grid cell of the predictor raster (in this case, the climate variables).
# Load libraries
library(tidyverse) # Grammar and data management
library(terra) # Working with spatial data
library(geodata) # Basemaps and climate data
Start by loading the cleaned species occurrence data saved in the .Rdata file we made in the occ_clean script.
setwd("../occ_data/")
occ_cor <- readRDS(file = "occ_cor.Rdata")
When working with spatial data it is important to make sure that different data sources share the same coordinate reference system (a.k.a. projection). We are going to be using climatic data from WorldClim as our predictor raster layers, which is projected using the common WGS84 projection. In order to make sure that the coordinates of occurrences line up accurately within the grid cells of the WorldClim data, we are going to create a SpatVector object using terra::vect and specify the crs argument so that it knows to project as WGS84.
occ_cor <- vect(occ_cor, geom = c('decimalLongitude', 'decimalLatitude'),
crs = "+proj=longlat +datum=WGS84")
Now we are going to download the WorldClim predictor rasters. WorldClim is a global data set of bioclimatic variables commonly used in ecological research, particularly in SDM workflows. Variables in this data set represent a comprehensive set of climatic conditions, including temperature seasonality, precipitation in the wettest/driest months, and temperature extremes. See https://www.worldclim.org/data/bioclim.html for a complete list of all 19 bioclimatic variables included.
NOTE: There are three resolutions: 10, 5 and 2.5 arcminutes. I chose to use 2.5 arcminutes (and generally recommend it) as it is the finest of these resolutions, which gives greater accuracy of environmental conditions, at the cost of increased computation time. For reference, 2.5 arcminutes is about 3.3 km east-west at 45 degrees latitude. Pick whichever resolution you think is best for your use.
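For a rough sense of what those numbers mean on the ground, the conversion is simple arithmetic (assuming roughly 111.3 km per degree of latitude):
# Back-of-the-envelope cell size for a 2.5 arcminute raster
res_deg <- 2.5 / 60                     # 2.5 arcminutes expressed in degrees
res_deg * 111.3                         # ~4.6 km north-south (and east-west at the equator)
res_deg * 111.3 * cos(45 * pi / 180)    # ~3.3 km east-west at 45 degrees latitude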
IMPORTANT NOTE: These downloads are large (several thousand MB), so if you are working in a GitHub repository make sure to add these files to the .gitignore file so that copies are not pushed to the repo.
I like to create a separate folder for the WorldClim data; it makes the workflow easier to follow and more reproducible. Create a folder within the head of the project.
setwd("../wclim_data/")
# Note: DO NOT PUSH wclim data
wclim <- worldclim_global(var = 'bio', res = 2.5, version = '2.1', path = "../wclim_data/")
NOTE: Climate data is available at several resolutions (sizes of grid cells) ranging from 10 down to 0.5 arcminutes. In this workflow we opted to go with 2.5 arcminutes (approx. 5 km at the equator), as it provides a compromise between fine-scale cell size and practical computation time.
# plot a raster layer to check it downloaded properly
plot(wclim$wc2.1_2.5m_bio_1, main = expression(atop('WorldClim Bioclimatic Predictor 1', 'Mean Annual Temperature (1970-2000)')))
Now that we have vectorized occurrence data and rasterized predictor variables, it is time to thin the occurrence data. We will use the terra::spatSample() function to take a random sample of a single occurrence per raster cell.
set.seed(1337) # set random generator seed to get reproducible results
# M. coronaria thinning
occ_cor <- spatSample(occ_cor, size = 1,
strata = wclim) # sample one occurrence from each climatic cell
And like last time, we will save this thinned data as a .Rdata file for further downstream analysis. NOTE: I give this file a slightly different name, so that I know it contains the thinned points.
setwd("../occ_data/")
saveRDS(occ_cor, file = 'occThin_cor.Rdata')
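A quick optional check, sketched under the assumption that the earlier save steps used the same file names and working directory, is to compare how many records survive the thinning:
# Optional check: how many records did thinning remove?
nrow(readRDS("occ_cor.Rdata")) # cleaned (unthinned) records
nrow(occ_cor)                  # records after one-per-cell thinning (SpatVector)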
In the Analysis Overview I mentioned in passing that this is a presence/background SDM workflow. Presence/background data sets are cases in which occurrence data is presence-only (like on GBIF), meaning there is no information about the absence of species. This is an important distinction because if we are going to appropriately interpret results, we need to understand the model assumptions. In the presence/background workflow, background (bg) points are randomly sampled from the environment surrounding and overlapping known presence points (more discussion on bg sampling below). This effectively samples the distribution of the 'background' environment across your study extent. The MaxEnt modeling workflow uses this background data to make inferences about absence points, using Bayes' rule. Thus this workflow produces results similar to what you would get from modeling a presence/absence data set (but if presence/absence data is available, use it!).
See the following for more on background points: Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E., & Yates, C. J. (2011). A statistical explanation of MaxEnt for ecologists. Diversity and distributions, 17(1), 43-57. https://doi.org/10.1111/j.1472-4642.2010.00725.x.
So what exactly is the background area, and how can you define a study extent? Should it simply be the convex hull around occurrences? Or should it be all of North America? This depends entirely on the spatial distribution and scale of your data, the model assumptions you are willing to live with, and ecological first principles based on the life history of your key species and your research question. NOTE: Any decision you make will create assumptions and introduce spatial biases into the analysis, and therefore affect the interpretation and predictions of your SDMs.
See the following citation for in depth discussion on defining study extent: Barve, N., Barve, V., Jiménez-Valverde, A., Lira-Noriega, A., Maher, S. P., Peterson, A. T., … & Villalobos, F. (2011). The crucial role of the accessible area in ecological niche modeling and species distribution modeling. Ecological modelling, 222(11), 1810-1819. https://doi.org/10.1016/j.ecolmodel.2011.02.011.
One approach to defining study extent, identified by Barve et al. (2011), is limiting the background to biotic regions (hereafter, ecoregions). In short, this method allows for a compromise between the real-world biology of species and model tractability. In this workflow I am using North American ecoregions, as defined by an international collaboration (between Canada, Mexico and the United States) known as the Commission for Environmental Cooperation. Ecoregions are defined at three levels (I, II and III), or spatial scales, and are openly available at https://www.epa.gov/eco-research/ecoregions-north-america. Choose an appropriate scale for your research question.
In this workflow I opted to go with the Level-II ecoregions, as this level is a compromise between the scale at which sets of environmental conditions and biogeography are grouped together and coverage of relatively large spatial areas that span North America.
# Load libraries
library(tidyverse)
library(terra)
library(predicts)
library(geodata)
library(ENMTools)
library(plotly) # 3D surface Kernel bivariate plots
library(MASS)
Due to the large size of the ecoregion files, I have saved a local copy to my machine and will not push these files to the GitHub repo.
ecoNA <- vect(x = "C:/Users/terre/Documents/Acadia/Malus Project/maps/eco regions/na_cec_eco_l2/", layer = 'NA_CEC_Eco_Level2')
ecoNA <- project(ecoNA, 'WGS84') # project ecoregion vector to same coords ref as basemap
NOTE: I project the ecoregion vector to the same CRS as the climate predictor rasters.
Let’s take a look at the ecoregions (plotted in red), over top of North America.
# download/load maps
getwd()
setwd('../occ_data/')
us_map <- gadm(country = 'USA', level = 1, resolution = 2,
path = "../occ_data/base_maps") #USA basemap w. States
ca_map <- gadm(country = 'CA', level = 1, resolution = 2,
path = '../occ_data/base_maps') #Canada basemap w. Provinces
canUS_map <- rbind(us_map, ca_map) #combine US and Canada vector map
# plot basemap
plot(canUS_map, xlim = c(-180, -50))
# plot ecoregions
lines(ecoNA, col = 'red')
dev.off()
Now let's load our thinned occurrences from the occ_thin.R script.
# load occurrence data
setwd("../occ_data/")
occThin_cor <- readRDS(file = 'occThin_cor.Rdata')
Now that we have the ecoregions and occurrence points loaded, let's load the WorldClim predictor rasters. We want to mask and crop these predictor layers to the ecoregions. This will define the study extent, and subsequently allow us to sample the background.
First we need to know which of the Level-II ecoregions are known to have species occurrences. To do this we will make use of the function terra::extract. NOTE: It could take a few minutes for the extraction to complete.
# M. coronaria ecoregions
# extract ecoregion polygon that contain M. coronaria occurrence points
eco_cor <- extract(ecoNA, occThin_cor) # extract what polygons contained points
# return vector of eco region codes of the polygons that contain occurrences
eco_cor_code <- eco_cor$NA_L2CODE %>% unique()
eco_cor_code <- eco_cor_code[eco_cor_code != '0.0'] #remove the 'water' '0.0' ecoregion
# CODES: "8.1" "8.2" "5.3" "8.4" "8.3" "9.4" "8.5" "5.2"
# View the legend on the epa website for the names of these ecoregions.
Now that we have extracted which ecoregion codes contain at least one species occurrence, we can subset the ecoregion vector to only those selected above.
ecoNA_cor <- terra::subset(ecoNA, ecoNA$NA_L2CODE %in% eco_cor_code) # subset ecoregion spat vector by the codes
plot(ecoNA_cor) # plot the subsetted M. coronaria ecoregions
points(occThin_cor, pch = 3, col = 'red') # plot M. coronaria points
And as always, we want to save this intermediate object as a .Rdata file, so that we do not need to repeat the long computational step of extracting the ecoregions.
setwd('../occ_data/eco_regions')
saveRDS(ecoNA_cor, file = 'ecoNA_cor.Rdata')
Now we can use these ecoregions to crop our predictor climatic rasters. Again, we are doing this so that we can define the study extent and sample the environmental background. Make sure to set the mask argument equal to TRUE.
# crop+mask extent of WorldClim data to the Malus ecoregions
wclim_cor <- terra::crop(wclim, ecoNA_cor, mask = T)
# Save cropped wclim data for downsteam SDM workflow
saveRDS(wclim_cor, file = 'wclim_cor.Rdata')
Now we want to randomly sample background points, and store the associated predictor values as SpatVectors. Alternatively you could also use SpatRasters if you please, but I found SpatVectors easier to work with.
IMPORTANT NOTE: How many background points should be sampled? There is no set default, although it is common in some studies to see 5000 or 10000 background points. We made the decision to sample 20000 points due to the very large spatial extent of this study. With this many points we can feel confident that the background environment is adequately sampled. Note that this sampling is random, and has not yet addressed any spatial biases that may arise when modeling backgrounds (this is foreshadowing).
set.seed(1337) # set a seed to ensure reproducible results
# NOTE: Set arguments as.raster = T to return raster
# OR as.points to return spatvector = T to return spatvector
# M. coronaria background
# SpatVector
# Note upped bg points from 5000 to 20000 to be more suitable to better reflect a mean probability of presence 1 - a/2
cor_bg_vec <- spatSample(wclim_cor, 20000, 'random', na.rm = T, as.points = T) #ignore NA values
plot(wclim_cor[[1]])
points(cor_bg_vec, cex = 0.01)
We can also measure how many km^2 these ecoregions comprise, to get an idea of how many samples are taken per km^2.
expanse(wclim_cor[[1]], unit = 'km') # total area of raster in km^2
# 5683684 km^2
20000/5683684
# ~0.0035 samples/km^2
And, you guessed it, we want to save these SpatVectors of background points for the downstream workflow.
# Save background SpatVectors
setwd("../occ_data/")
saveRDS(cor_bg_vec, file = 'cor_bg_vec.Rdata')
You may also want to extract the actual predictor values and put them into a data.frame for both the occurrence points and background points. To do this you can again make use of the terra::extract function like above, and the terra::values function to get the values from the bg SpatVectors we already created. I show both as they can be useful, but I mainly use the SpatVectors downstream.
# Extracting presence-background raster values
cor_predvals <- extract(wclim_cor, occThin_cor) # M. coronaria
cor_predvals <- cor_predvals[-1] # drop ID column
cor_bgvals <- values(cor_bg_vec) # Extract raster values for bg points
Now we can create and save a data.frame which may come in handy later.
cor_pb <- c(rep(1, nrow(cor_predvals)), rep(0, nrow(cor_bgvals))) #binary (0,1) presence or background string
# combine presence and background data frames for SDM
cor_sdmData <- data.frame(cbind(cor_pb, rbind(cor_predvals, cor_bgvals)))
setwd("../occ_data/")
saveRDS(cor_sdmData, file = 'cor_sdmData.Rdata')
So we have now successfully defined a study extent using ecoregions that contain occurrences, and cropped a predictor raster so that background points could be sampled from it. Then, using these background and occurrence points, we extracted the bioclimatic values from the predictor raster.
Using these values in a data frame (e.g. cor_sdmData) we can complete several collinearity checks. These checks allow us to understand how the climatic variables relate to one another. It is important to note that in a classical statistical approach it is considered good practice to limit collinear variables, to prevent overfitting of a model. This step is referred to as variable selection. With that said, MaxEnt is not a linear modeling approach, and thus does not need to follow assumptions about predictor collinearity. On top of that, MaxEnt is known to be robust to the inclusion of many predictors.
HOWEVER, in the final versions of the SDMs published by Roulston et al., we do subset six bioclimatic variables that are most likely to be predictive of tree species habitat suitability: Bio1 (Tmean), Bio4 (Tvar), Bio10 (Twarmq), Bio11 (Tcoldq), Bio15 (Pvar) and Bio16 (Pwetq). Ultimately these differences can, and did, impact the results of the SDMs, so it is important to consider this when modeling yourself.
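For illustration, a subset like that could be taken with terra::subset. This is only a sketch, assuming the layer names follow the geodata naming convention used elsewhere in this workflow; double-check names(wclim_cor) and the variable definitions on the WorldClim site before relying on it.
# Sketch: restrict the predictor stack to a short list of bioclimatic variables
keep_vars <- c("wc2.1_2.5m_bio_1",  # Bio1  - mean annual temperature
               "wc2.1_2.5m_bio_4",  # Bio4  - temperature seasonality
               "wc2.1_2.5m_bio_10", # Bio10 - mean temperature of warmest quarter
               "wc2.1_2.5m_bio_11", # Bio11 - mean temperature of coldest quarter
               "wc2.1_2.5m_bio_15", # Bio15 - precipitation seasonality
               "wc2.1_2.5m_bio_16") # Bio16 - precipitation of wettest quarter
wclim_cor_sub <- terra::subset(wclim_cor, keep_vars)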
We can start by simply plotting the relationship between all predictor pairs using base R graphics and the pairs function. This can get messy quickly with so many variables, so it is a bit hard to read. It may take a minute to plot…
# M. coronaria
pairs(cor_sdmData[,-1]) # drop the first column of 0/1
Let's use a more helpful way to detect collinearity of predictor variables: a dendrogram, which visualizes the correlation matrix. NOTE: This only considers two-dimensional covariation, and cannot detect higher-dimensional correlations.
# Dendrogram cluster of predictor collinearity
threshold <- 0.7 # set the threshold for colinearity
cor_cors <- raster.cor.matrix(wclim_cor) # pearson correlation
cor_dist <- as.dist(1 - abs(cor_cors)) # calculate distance of predictors
cor_clust <- hclust(cor_dist, method = 'single') # calculate cluster dendrogram
cor_groups <- cutree(cor_clust, h = 1 - threshold) #calculate groupings of variables
plot(cor_clust, hang = -1, main = "M. coronaria Predictors")
rect.hclust(cor_clust, h = 1 - threshold)
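One simple, hedged way to act on those groupings, if you did want to reduce the predictor set, is to keep a single representative variable per correlation cluster; a sketch:
# Sketch: keep the first variable encountered in each correlation cluster
rep_vars <- names(cor_groups)[!duplicated(cor_groups)]
wclim_cor_reduced <- terra::subset(wclim_cor, rep_vars)
names(wclim_cor_reduced) # the retained, less-correlated predictors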
We can also visualize the predictor covariate values to understand how they inform the model about the probability of a true presence. This involves visualizing the distribution of the values at sampled points, presence and background, which allows us to get a peek into how the model constraints come about.
To do this we can compare the density of covariates from occurrence points and background points. I like to visualize this density using kernel density plots. This type of plot allows you to visualize three dimensions: in this case two of the bioclimatic predictor variables, and the density of points sampled (a pseudo-probability). We will build two separate plots to visualize the density of occurrence points and of background points.
First we will create new data frames for each of the climatic variables (in this case mean annual temperature, and mean annual precipitation), so that we can plot them below.
# Predictor Kernel Density Plots
# It is helpful to visualize two predictors pairs of
# Presence points and background points
cor_occ.temp <- cor_sdmData %>% filter(cor_pb == 1) %>% # Presence points
dplyr::select(wc2.1_2.5m_bio_1) %>% # Mean annual temp
drop_na() %>%
unlist()
cor_bg.temp <- cor_sdmData %>% filter(cor_pb == 0) %>% # Background points
dplyr::select(wc2.1_2.5m_bio_1) %>% # Mean annual temp
drop_na() %>%
unlist()
cor_occ.perc <- cor_sdmData %>% filter(cor_pb == 1) %>% # Presence points
dplyr::select(wc2.1_2.5m_bio_12) %>% # Annual precipitation
drop_na() %>%
unlist()
cor_bg.perc <- cor_sdmData %>% filter(cor_pb == 0) %>% # Background points
dplyr::select(wc2.1_2.5m_bio_12) %>% # Annual precipitation
drop_na() %>%
unlist()
Now that all the data is prepared, we will build the kernel bivariate grid using the function MASS::kde2d. This function takes the temperature values as the x coordinate and the precipitation values as the y coordinate, and estimates the z coordinate: the density of occurrences (a pseudo-probability).
cor_occ.3d <- kde2d(cor_occ.temp, cor_occ.perc)
cor_bg.3d <- kde2d(cor_bg.temp, cor_bg.perc)
And plot the results using the plot_ly function from the plotly package. This builds interactive 3D plots, which in this case can be used to visualize the kernel surfaces.
plot_cor.occ_3d <- plot_ly(x=cor_occ.3d$x, y=cor_occ.3d$y, z=cor_occ.3d$z) %>%
add_surface() %>%
layout(scene = list(xaxis = list(title = 'Mean Annual Temp (C)', autotick = F, nticks = 5, tickvals = list(0,5,10,15,20)),
yaxis = list(title = 'Mean Annual Precip. (mm)', tick0=0, tick1=2000, dtick=200),
zaxis = list(title = 'Kernel Density', tick0=0, tick1=0.001, dtick=0.0002)),
title = list(text = "<i>M. coronaria<i> Occurrence Points",
y = 0.95, x = 0.5,
xanchor = 'center',
yanchor = 'top'))
plot_cor.occ_3d # run to view
plot_cor.bg_3d <- plot_ly(x=cor_bg.3d$x, y=cor_bg.3d$y, z=cor_bg.3d$z) %>%
add_surface() %>%
layout(scene = list(xaxis = list(title = 'Mean Annual Temp (C)', tick0=0, tick1=20, dtick=5),
yaxis = list(title = 'Mean Annual Precip. (mm)', tick0=0, tick1=2000, dtick=200),
zaxis = list(title = 'Kernel Density')),
title = list(text = "<i>M. coronaria<i> Background Points",
y = 0.95, x = 0.5,
xanchor = 'center',
yanchor = 'top'))
plot_cor.bg_3d
In this example I only visualize two of the 19 different bioclimatic variables. You may choose to visualize more for your own purposes. Remember this is only a visual comparison, but it is valuable for understanding what the model is 'seeing'; of course the model uses the multidimensional space of all covariates, not just the two shown here.
So now we have (almost) all we need to finally build the SDMs. We have the presence (occurrence) points and background points. We also have the climatic predictor rasters (WorldClim bioclimatic variables). What we don't have is the future bioclimatic data, which we will use with the results of the SDM to make predictions about future suitable habitat under different climatic scenarios.
library(tidyverse) # Grammar and data management
library(terra)# Spatial Data package
library(predicts) # SDM package
library(geodata) # basemaps
library(rJava) # MaxEnt models are dependent on the JDK
library(ENMeval) # Another modeling package, useful for data partitioning (Checkerboarding)
library(raster) # RasterStack dependency (a now-deprecated package)
library(ecospat) # Useful spatial ecology tools
library(parallel) # speed up computation by running in parallel
library(doParallel) # added functionality to parallel
Let’s start by loading in the thinned occurrence points and background points, along with some base maps of Canada, USA, and Mexico. (Note: In the end I decided not to use these boundaries for plotting, but you may want to for your purposes so I left the chunk here).
# Download/load basemaps
us_map <- gadm(country = 'USA', level = 1, resolution = 2,
path = "../occ_data/base_maps") #USA basemap w. States
ca_map <- gadm(country = 'CA', level = 1, resolution = 2,
path = '../occ_data/base_maps') # Canada basemap w. Provinces
mex_map <-gadm(country = 'MX', level = 1, resolution = 2,
path = '../occ_data/base_maps') # Mexico basemap w. States
canUSMex_map <- rbind(us_map, ca_map, mex_map) # Combine Mexico, US and Canada vector map
NA_ext <- ext(-180, -30, 18, 85) # Set spatial extent of analysis to NA in Western Hemisphere
canUSMex_map <- crop(canUSMex_map, NA_ext) # crop to Western Hemisphere
plot(canUSMex_map) # plot basemap
We are also going to load a shapefile that outlines the Great Lakes. This helps make the figures prettier, and we can also use the vector layer to mask the predictor layers so that no predictions are made over the Great Lakes (Malus cannot swim…). You can download the shapefiles from the URL below.
# Great Lakes shapefiles for making pretty maps
# Shape files downloaded from the USGS (https://www.sciencebase.gov/catalog/item/530f8a0ee4b0e7e46bd300dd)
great_lakes <- vect('C:/Users/terre/Documents/Acadia/Malus Project/maps/great lakes/combined great lakes/')
Now let's download the future bioclimatic data. For this we will need to specify a few additional arguments compared to when we downloaded the historical climate data. We will again use the geodata package, but instead of worldclim_global we will use cmip6_world.
This function will download data for projected future climates from Coupled Model Intercomparison Project Phase 6 (CMIP6) generated from Global Climate Models (GCMs). CMIP6 projections are based on the Shared Socio-economic Pathway (SSP) scenarios, included in the Intergovernmental Panel on Climate Change (IPCC) Assessment Report (AR6).
There are several SSPs, four of which are available through geodata: "126", "245", "370" and "585", for SSP1-2.6, SSP2-4.5, SSP3-7.0 and SSP5-8.5. Data are also available for three time periods: "2021-2040", "2041-2060", and "2061-2080"; or early century (2030), mid-century (2050) and late century (2070). In this workflow I will model all three timeframes, and I chose two SSPs, SSP2-4.5 and SSP5-8.5. The former represents a 'middle of the road' scenario with high climate adaptation and low climate mitigation; the latter is a 'worst case' or 'business as usual' scenario in which global development continues down an unsustainable route with increased fossil fuel reliance. It is worth noting that SSP5-8.5 is the current tracking scenario: https://news.stanford.edu/2023/01/30/ai-predicts-global-warming-will-exceed-1-5-degrees-2030s/.
The final decision to make is which CMIP6 climate model you are going to use. There are several models available through geodata, all of which have their individual strengths and weaknesses. For the purposes of this study, as I am primarily concerned with Malus habitat in Canada, I chose to use the Canadian Earth System Model version 5 (CanESM5). This model is still a global model, but it is commonly employed for climate research in Canada. Read more here: https://gmd.copernicus.org/articles/12/4823/2019/gmd-12-4823-2019.html
NOTE: Climate projections are just that: projections. These models are useful, but can be flawed for a host of reasons. Recently it has come to the attention of climate modelers that many of the models included in CMIP6 are 'too hot', meaning the rates of warming may be too high, although this is not true for every region. Read more here: https://www.science.org/content/article/use-too-hot-climate-models-exaggerates-impacts-global-warming.
# Historical climate 1970-2000
wclim <- geodata::worldclim_global(var = 'bio',
res = 2.5,
version = '2.1',
path = "../wclim_data/") %>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
# SSP (Shared Socio-economic Pathway) 2-4.5
# middle of the road projection, high climate adaptation, low climate mitigation
ssp245_2030 <- cmip6_world(model = "CanESM5",
ssp = "245",
time = "2021-2040",
var = "bioc",
res = 2.5,
path = "../wclim_data/") %>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
ssp245_2050 <- cmip6_world(model = "CanESM5",
ssp = "245",
time = "2041-2060",
var = "bioc",
res = 2.5,
path = "../wclim_data/") %>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
ssp245_2070 <- cmip6_world(model = "CanESM5",
ssp = "245",
time = "2061-2080",
var = "bioc",
res = 2.5,
path = "../wclim_data/") %>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
# SSP 5-8.5
# low regard for environmental sustainability, increased fossil fuel reliance; this is the current tracking projection
ssp585_2030 <- cmip6_world(model = "CanESM5",
ssp = "585",
time = "2021-2040",
var = "bioc",
res = 2.5,
path = "../wclim_data/") %>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
ssp585_2050 <- cmip6_world(model = "CanESM5",
ssp = "585",
time = "2041-2060",
var = "bioc",
res = 2.5,
path = "../wclim_data/") %>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
ssp585_2070 <- cmip6_world(model = "CanESM5",
ssp = "585",
time = "2061-2080",
var = "bioc",
res = 2.5,
path = "../wclim_data/")%>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
NOTE: Just like with the previous climate data, DO NOT push these large files to a github repo due to the size of the files.
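Since the six cmip6_world calls above differ only in their ssp and time arguments, you could wrap the repeated steps in a small convenience function; a sketch (the helper name is my own, not part of any package):
# Sketch: helper to avoid repeating the download/crop/mask steps per scenario
get_cmip6 <- function(ssp, time) {
  cmip6_world(model = "CanESM5", ssp = ssp, time = time,
              var = "bioc", res = 2.5, path = "../wclim_data/") %>%
    crop(NA_ext) %>%                  # crop raster to North America
    mask(great_lakes, inverse = TRUE) # cut out the Great Lakes
}
# e.g. ssp245_2030 <- get_cmip6("245", "2021-2040")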
Now that we have the future climate data we need to do a bit more prep work to get ready for the SDMs. First we should load in the cropped climate rasters from the background sampling script. I am doing this so I can visualize a spatial sampling method called checkerboarding (more details to come below).
# These Rasters are useful for sampling spatial checkerboards
# and making habitat suitability predictions (Historical and under future SSPs climate scenarios)
setwd('../wclim_data')
# Historical (1970-2000)
wclim_cor <- readRDS(file = 'wclim_cor.Rdata')
wclim_cor_stack <- raster::stack(wclim_cor) # convert SpatRaster to RasterStack for the ENMeval checkerboarding dependency
wclim_fus <- readRDS(file = 'wclim_fus.Rdata')
wclim_fus_stack <- raster::stack(wclim_fus) # convert SpatRaster to RasterStack for the ENMeval checkerboarding dependency
NOTE: The stack function is from the deprecated raster package, so it may not work in future versions of ENMeval (the package used for checkerboarding and SDM evaluation). The closest equivalent in the more up-to-date terra package is the c() method, which can be used to combine SpatRasters, but it does not produce the RasterStack object signature required here.
Then, for ease of making predictions downstream, we will extract the names of each bioclimatic raster layer so that we can rename the layers of the projected climate rasters. This is so that when making predictions with our SDM downstream, the layer names correspond.
climate_predictors <- names(wclim_cor) # extract climate predictor names, to rename layers in the rasters below
# This is important to do for making predictions once the SDMs have been made on future climate data
# Note that the names of the layers still correspond to the same environmental variables
# Future SSPs
# Do not need to create RasterStacks
# SSP 245
names(ssp245_2030) <- climate_predictors #rename raster layers for downstream analysis
names(ssp245_2050) <- climate_predictors
names(ssp245_2070) <- climate_predictors
# SSP 585
names(ssp585_2030) <- climate_predictors #rename raster layers for downstream analysis
names(ssp585_2050) <- climate_predictors
names(ssp585_2070) <- climate_predictors
So now we have all our data prepped, and it's time to start getting things sorted for the SDM. In the next few pieces of code I am going to show you how to manually partition data into training and testing groups. There are several reasons for doing this, the most important being evaluating the performance and generalization of the model. This is called cross-validation, whereby the species occurrence data is partitioned into training data that gets fed into the model, and testing data that is withheld and used to evaluate how well the trained model fits the 'new' testing occurrences. There are several different ways to partition data, all of which have pros/cons.
The simplest method is random k-fold, where data is equally binned into groups ('folds'). However, this can be flawed, as it can preserve bias in the training and testing data. It is preferable to use spatial partitioning to reduce spatial autocorrelation and preserve spatial heterogeneity, which can help reduce model overfitting. This creates partitions that are much more ecologically relevant, so that the model can learn how the species is spatially organized, which in turn makes the predictions of the model more generalizable for future climate predictions.
There are several different methods to spatially partition occurrence data. A common approach is to overlay a square grid and take a checkerboard pattern of occurrences, where occurrences are partitioned depending on which square they fall in. The method I will demonstrate completes 4-fold cross-validation, where four partitions are taken: three are used for training and one for testing. As I mentioned, I will show you how to do this manually, but the MaxEnt function we use later will do this partitioning for us, as well as cycle through and withhold each group.
Read more about spatial partitioning and checkerboarding in the following two articles.
Radosavljevic, A., & Anderson, R. P. (2014). Making better Maxent models of species distributions: complexity, overfitting and evaluation. Journal of biogeography, 41(4), 629-643. https://doi.org/10.1111/jbi.12227.
Muscarella, R., Galante, P. J., Soley‐Guardia, M., Boria, R. A., Kass, J. M., Uriarte, M., & Anderson, R. P. (2014). ENM eval: An R package for conducting spatially independent evaluations and estimating optimal model complexity for Maxent ecological niche models. Methods in ecology and evolution, 5(11), 1198-1205. https://doi.org/10.1111/2041-210X.12261.
Let’s get started with the partitions.
# Spatial partitioning preparation
occ_cor_coords <- as.data.frame(geom(occThin_cor)[,3:4]) # extract longitude, latitude from occurrence points
bg_cor_coords <- as.data.frame(geom(cor_bg_vec)[,3:4]) # extract longitude, latitude from background points
occ_fus_coords <- as.data.frame(geom(occThin_fus)[,3:4]) # extract longitude, latitude from occurrence points
bg_fus_coords <- as.data.frame(geom(fus_bg_vec)[,3:4]) # extract longitude, latitude from background points
The geom function will get the geometry of the occurrences from the SpatVectors and store them in data frames.
# Spatial Checkerboard partitioning of occurrence and background points cross-fold validation
set.seed(1337)
agg_factor <- c(9,9) # this defines how many adjacent cells are aggregated into the same checkerboard 'square'
# I visually tested different aggregations and found 9,9 was the best fit given the spatial extent
Setting a random generator seed will ensure that randomized partitions yield the same results consistently. The aggregation (agg) factor determines the scale at which the occurrences are aggregated into grid cells; the first value specifies the finer-scale aggregation and the second the coarser scale used by the hierarchical checkerboard.
NOTE: We are using the get.checkerboard2 function to get the 4-fold cross-validation. This function is convenient, but there is no reason why you could not do more partitions if you wanted to. There are four required arguments: 1. the species occurrence coordinates, 2. the background point coordinates, 3. a predictor RasterStack that is cropped to the extent of the background, and 4. the aggregation factor defined above. Then we can use the evalplot.grps function to visualize the partitions.
cor_cb <- get.checkerboard2(occs = occ_cor_coords,
bg = bg_cor_coords,
envs = wclim_cor_stack,
aggregation.factor = agg_factor
)
evalplot.grps(pts = occ_cor_coords, pts.grp = cor_cb$occs.grp, envs = wclim_cor_stack, pts.size = .75) # plot the checkerboard partitions of occurrence points
evalplot.grps(pts = bg_cor_coords, pts.grp = cor_cb$bg.grp, envs = wclim_cor_stack, pts.size = .75) # plot the checkerboard partitions of occurrence bg points
# plotting the background points may return an error after checkerboard sampling; this can happen because the aggregation factor removes some observations
# It is fine, as we do not strictly need to visualize the groups. If you want to, you must reconcile the difference in observations.
First let's look at the species occurrence partitions. It is hard to see the checkerboard pattern here.
Now let's look at the background points. As you can see, the checkerboarding is much clearer. I also show how adjusting the aggregation factor changes the scale at which the partitions are taken.
(Background partition plots are shown for aggregation factors of 3,3; 6,6; 9,9 (which I chose to use); and 12,12.)
Okay, now that we have played around with understanding how background point partitioning works, let's dive into developing our MaxEnt model. We are going to use parallel processing to utilize multiple processors (CPUs) simultaneously. This will increase the amount of available memory and significantly decrease the computation time. If you are using a Windows computer (like me), then you only have one type of parallel processing available to you, known as 'socket' clusters. Read more here: https://www.r-bloggers.com/2019/06/parallel-r-socket-or-fork/#:~:text=As%20a%20result%2C%20it%20runs,40%25%20faster%20than%20the%20socket.
First let’s take a look at the code chunk and then I will break each argument down.
# Build Species Distribution Model using MaxEnt from the <ENMeval> package
# Run prediction in a parallel using 'socket' clusters to help speed up computation
# <ENMeval> implements parallel functions natively
# But load the <parallel> library for additional functions like <detectCores()>
cn <- detectCores(logical = F) # logical = F, is number of physical RAM cores in your computer
set.seed(1337)
cor_maxent <- ENMevaluate(occ = occ_cor_coords, # occurrence records
envs = wclim_cor, # environment from background training area
n.bg = 20000, # 20000 bg points
tune.args =
list(rm = seq(0.5, 8, 0.5),
fc = c("L", "LQ", "H",
"LQH", "LQHP", "LQHPT")),
partition.settings =
list(aggregation.factor = c(9, 9), gridSampleN = 20000), # 9,9 agg
partitions = 'checkerboard2',
parallel = TRUE,
numCores = cn - 1, # leave one core available for other apps
parallelType = "doParallel", # use doParrallel on Windows - socket cluster
algorithm = 'maxent.jar')
NOTE: The cn object is the number of physical CPU cores your computer has. The detectCores function requires that you load the parallel library. Also, make sure to set the random generator seed to achieve consistent results.
Breakdown of the ENMevaluate function: The first argument it expects is a data.frame of all species occurrences with two columns, longitude and latitude, in that specific order. Like in the example we did above, we want to set n.bg = 20,000 background points to get a comprehensive sample. Next, the envs argument is a SpatRaster of bioclimatic predictor data cropped to the background extent (or ecoregions); we can reuse the cropped raster we produced in the malus_bg.R script. Finally, we can adjust, or tune, the arguments of the MaxEnt algorithm with tune.args.
- rm stands for "regularization multiplier", which affects how focused or closely-fitted the output distribution is. Smaller values can lead to model overfitting, so it is good to test different values.
- fc stands for "feature class". This corresponds to a mathematical transformation of the different covariates used in the model to allow complex relationships to be modeled: L = linear, Q = quadratic, H = hinge, P = product and T = threshold.
- partition.settings is where to control the aggregation factor we played around with above. Also make sure to set the number of grid cells included in the partitioning equal to the number of background points sampled; it is 10,000 by default.
- partitions selects which cross-validation method will be used. Like above, we are going to use the 4-fold checkerboard partitioning. The benefit of doing this in the ENMevaluate function is that the model will partition and then cross-validate across all groupings, meaning it rotates which data is used for training and which for testing. It will then produce a continuous Boyce Index (cbi) correlating each set of testing data against the model fits.
- parallel: set this equal to TRUE to ensure parallel processing is enabled. If it is not enabled you will likely get a runtime or memory error, especially with complex models.
- numCores is where we specify how many cores RStudio can simultaneously access. It is good practice to leave one core free to run other processes on your computer.
- parallelType: set to "doParallel" to run a socket cluster. This is the only type available on Windows. The other method, "doSNOW", will run a fork cluster, which can decrease computational time, but is only available on Mac and Linux.
- algorithm selects which SDM algorithm you want to use. In this case I am using the maxent.jar algorithm, which is a version of MaxEnt implemented in the dismo package using Java. Another version of MaxEnt is also available, implemented in native R with the maxnet package. It is worth noting that there are subtle differences between them. Personally, I find the memory usage with maxnet to be much greater than with maxent.jar, which means longer computational time. The benefit of maxnet is that you do not have to install the JDK, which can be a challenge.
NOTE: The above model took 39 minutes and 32.8 seconds to compute. My laptop has a 3.2 GHz AMD Ryzen 7 5800H processor with eight cores. Depending on your own PC specs, model complexity (including the number of different tuning settings), and study extent, your modeling could take more or less time.
And just like all the previous long computations, we want to save a .Rdata file so we can reload the model if needed. NOTE: These files are large (>200 MB); DO NOT PUSH them to git.
# Save the MaxEnt model so you do not have to waste time re-running the model
setwd('../sdm_output')
saveRDS(cor_maxent, file = 'cor_maxent.Rdata') # save
cor_maxent <- readRDS(file = 'cor_maxent.Rdata') # load
The next step, after running several model combinations (96 to be precise), is to select the 'best' model (with particular tuning arguments). A common approach is to use the Akaike information criterion (AIC), specifically delta AICc, which is a corrected AIC value expressed as the relative difference in AICc values of each model. Models with delta AICc values less than 2 are considered to be equally good. In my case, there was only a single model with delta AICc less than 2, the best performing model. Typically, to break ties between well-performing models, the least complex model (meaning the simplest fc) and/or the model with the lowest rm value is chosen.
# Note that maxent results provide Continuous Boyce Index (cbi)
best_cor_maxent <- subset(cor_maxent@results, delta.AICc == 0) # selects the best performing model based on delta AICc - returns data frame object
mod.best_cor_maxent <- eval.models(cor_maxent)[[best_cor_maxent$tune.args]] # extracts the best model - returns MaxEnt object
So now that we have selected our best performing MaxEnt model, we can start to make some predictions! We are going to use the `terra` package to create a `SpatRaster` of habitat suitability scores. First, let's make some predictions using the historical WorldClim data. In this case we want to use the uncropped WorldClim `SpatRaster`, to make predictions across North America; why will become clearer once we get to the future climate change predictions.
# Wclim is the historical climatic conditions (1970-2000)
cor_pred_hist <- terra::predict(wclim, mod.best_cor_maxent, cores = cn - 1)
plot(cor_pred_hist)
points(occThin_cor, cex = 0.05)
We have made our first prediction of the fundamental niche of M. coronaria! But there are a few more things to consider before moving on.
First, how confident can we be in these predictions? Metrics like the Area Under the Curve (AUC) and the Continuous Boyce Index can help us, but they have their own flaws, so it is important to think critically about the results of your predictions. Note that `ENMevaluate` and the MaxEnt algorithm automatically compute several indices, including separate indices calculated on the training data and on the validation (testing) data; a quick way to inspect them is shown below.
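A small sketch for peeking at those metrics (`eval.results()` is the ENMeval accessor for the same `@results` slot used in the model-selection step above; `@results.partitions` holds the per-fold values):
# Summary metrics for every tuning combination (training and validation AUC, CBI, AICc, delta AICc, ...)
eval.results(cor_maxent)        # equivalent to cor_maxent@results
# The same metrics reported per checkerboard fold
cor_maxent@results.partitions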
The `ecospat.boyce` function uses a moving window that forms classes spanning 1/10th of the suitability range. The Boyce Index is then calculated as a Spearman rank correlation between the predicted-to-expected (P/E) ratio and the habitat suitability scores. What we expect to see is a steady increase in P/E with increasing suitability scores.
# Evaluate predictions using Boyce Index
# the number of true presences should decline with suitability groups 100-91, 90-81, etc.
# First extract suitability values for the background and presence points, make sure to omit NA values
corPred_bg_val <- terra::extract(cor_pred_hist, bg_cor_coords)$lyr1 %>%
na.omit()
corPred_val_na <- terra::extract(cor_pred_hist, occ_cor_coords)$lyr1 %>%
na.omit()
# Evaluate predictions using Boyce Index
ecospat.boyce(fit = corPred_bg_val, # vector of predicted habitat suitability of bg points
              obs = corPred_val_na, # vector of predicted habitat suitability at the presence points
              nclass = 0,
              PEplot = TRUE,
              method = 'spearman')
Second, is it best to leave these suitability scores as a continuous gradient of habitat suitability? I (and many others) find maps with colour gradients illegible, or at least less informative at a glance. Some take the approach of creating a binary threshold, where habitat suitability becomes a simple yes/no. There are several methods to determine this binary threshold, which can be applied using the `predicts` package. A flaw of this approach is that it is less tangible what these thresholds mean in lay terms, and biology is never as simple as YES/NO.
With that said (foreshadowing), these binary thresholds are very useful for completing gap analyses. A gap analysis is a natural next step after completing an SDM, used to understand how much of a species' suitable habitat is covered by protected areas (PAs).
cor_pa <- predicts::pa_evaluate(p = occ_cor_coords_mat, a = bg_cor_coords_mat, model = cor_maxent, x = wclim_cor)
cor_threshold <- predicts::threshold(cor_pa)
cor_hist_habitat <- cor_pred_hist > cor_threshold$max_spec_sens #the threshold at which the sum of the sensitivity (true positive rate) and specificity (true negative rate) is highest
However, I think it is more practical to bin suitability scores into several categories. Let's say three categories: "High suitability", "Moderate suitability" and "Low suitability". These categories are determined using quantiles, where suitability scores in the lowest percentiles are considered low or moderately suitable, and higher scores are considered highly suitable. In this case I use the 1st, 10th, and 50th percentiles to classify these categories. This means that in the low habitat category we are only omitting 1% of the predicted suitable values, compared to the 50th percentile, where we are omitting 50% of the predicted values. It is important to understand that this approach also has flaws, but typically the 10th percentile is considered an acceptable minimum threshold.
corPred_val <- terra::extract(cor_pred_hist, occ_cor_coords)$lyr1
corPred_threshold_1 <- quantile(corPred_val, 0.01, na.rm = T) # Low suitability
corPred_threshold_10 <- quantile(corPred_val, 0.1, na.rm = T) # Moderate suitability
corPred_threshold_50 <- quantile(corPred_val, 0.5, na.rm = T) # High suitability
# Save suitability thresholds
setwd('../sdm_output/thresholds')
saveRDS(corPred_threshold_1, file = 'corPred_threshold_1.Rdata')
saveRDS(corPred_threshold_10, file = 'corPred_threshold_10.Rdata')
saveRDS(corPred_threshold_50, file = 'corPred_threshold_50.Rdata')
Now let’s visualize the results! As mentioned previously, we will tidy things up downstream to make publication quality plots. But, this is a great start!
legend_labs <- c('Low Suitability', 'Moderate Suitability', 'High Suitability')
# brewer.pal(3, "YlOrBr") # Brewer palettes are helpful for mapping colour gradients
fill_cols <- c("#FFF7BC", "#FEC44F", "#D95F0E")
par(mar = c(5, 5, 5, 5))
terra::plot(cor_pred_hist > corPred_threshold_1, col = c('#E8E8E8', '#FFF7BC'), legend = F, xlim = c(-100, -50), ylim = c(30, 60), main = expression(atop(italic('Malus coronaria'), " Historical Suitability (1970-2000)")), background = 'lightskyblue1')
terra::plot(cor_pred_hist > corPred_threshold_10, col = c(NA, '#FEC44F'), add = T, legend = F)
terra::plot(cor_pred_hist > corPred_threshold_50, col = c(NA, '#D95F0E'), add = T, legend = F)
terra::plot(canUSMex_map, add = T)
points(occThin_cor, col = 'black', cex = 0.75, pch = 4)
legend(x = -75, y = 30, xpd = NA, inset = c(5, 0),
title = 'Habitat Suitability',
legend = legend_labs,
fill = fill_cols)
Now we can move on to making predictions for the future climate scenarios. We will use the same threshold values as the historical prediction to keep the interpretation consistent. As always, make sure to save the predictions so they can be reloaded without re-running.
# Future Climate predictions
# SSP 245
cor_pred_ssp245_30 <- terra::predict(ssp245_2030, mod.best_cor_maxent, cores = cn - 1)
cor_pred_ssp245_50 <- terra::predict(ssp245_2050, mod.best_cor_maxent, cores = cn - 1)
cor_pred_ssp245_70 <- terra::predict(ssp245_2070, mod.best_cor_maxent, cores = cn - 1)
# SSP585
cor_pred_ssp585_30 <- terra::predict(ssp585_2030, mod.best_cor_maxent, cores = cn - 1)
cor_pred_ssp585_50 <- terra::predict(ssp585_2050, mod.best_cor_maxent, cores = cn - 1)
cor_pred_ssp585_70 <- terra::predict(ssp585_2070, mod.best_cor_maxent, cores = cn - 1)
# Save M cor. SDM predictions
setwd('../sdm_output')
saveRDS(cor_pred_hist, file = 'cor_pred_hist.Rdata')
saveRDS(cor_pred_ssp245_30, file = 'cor_pred_ssp245_30.Rdata')
saveRDS(cor_pred_ssp245_50, file = 'cor_pred_ssp245_50.Rdata')
saveRDS(cor_pred_ssp245_70, file = 'cor_pred_ssp245_70.Rdata')
saveRDS(cor_pred_ssp585_30, file = 'cor_pred_ssp585_30.Rdata')
saveRDS(cor_pred_ssp585_50, file = 'cor_pred_ssp585_50.Rdata')
saveRDS(cor_pred_ssp585_70, file = 'cor_pred_ssp585_70.Rdata')
For the purposes of this tutorial I am only going to show the results for one of the future climate predictions, SSP5-8.5 for 2020-2040 (the projection we are currently tracking). The code is the same as for the historical suitability plot above; a sketch with the layer swapped in is shown below.
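# Same plotting approach as the historical map, with the SSP5-8.5 2020-2040 layer swapped in
terra::plot(cor_pred_ssp585_30 > corPred_threshold_1, col = c('#E8E8E8', '#FFF7BC'), legend = F, xlim = c(-100, -50), ylim = c(30, 60), main = expression(atop(italic('Malus coronaria'), " SSP5-8.5 Suitability (2020-2040)")), background = 'lightskyblue1')
terra::plot(cor_pred_ssp585_30 > corPred_threshold_10, col = c(NA, '#FEC44F'), add = T, legend = F)
terra::plot(cor_pred_ssp585_30 > corPred_threshold_50, col = c(NA, '#D95F0E'), add = T, legend = F)
terra::plot(canUSMex_map, add = T)
points(occThin_cor, col = 'black', cex = 0.75, pch = 4)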
We can see a clear northward shift in the fundamental niche of M. coronaria. At the same time, the climate appears to remain favourable for M. coronaria in Southern Ontario, with an expansion into Quebec and the Maritimes. Also pay attention to areas where fundamental habitat is being lost, especially in the US. Remember that these are simply predictions that carry a lot of assumptions.
Now that we have made all our predictions and saved them, along with the thresholds, as `.Rdata` files, let's move on to tidying up the plots to prepare them for a results section and publication! Also, if you have made it this far, take a minute to appreciate just how much work has been accomplished.
One of my favourite, but most time consuming, steps in an SDM workflow is producing figures. Designing a plot that effectively displays results while remaining simple and aesthetically pleasing can take a lot of time and fine tuning. Not to mention that plotting spatial data can be difficult, to say the least. With that said, once you do nail a plot it is so rewarding!
It is important to note that there are several different approaches and packages you can use to plot spatial vectors and rasters. One of the most popular is `ggplot2` (a.k.a. ggplot), which is highly maintained and polished. It can be a bit of a learning curve, but once you get the hang of it, it is THE way to make (most) plots. Not to mention there are endless packages that add functionality to `ggplot2` and can do a lot of cool things.
Now, with that said, for the purposes of my publication and workflow I will be using the plotting functionality built into `terra`. As `terra` is a newer package, there is less support and documentation for it in `ggplot2`, but a package called `tidyterra` that bridges the two is well into development (see the small sketch below). Also, the `terra::plot` function builds off the base R graphics package, which can have its own advantages.
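If you do want to stay in `ggplot2`, a minimal `tidyterra` sketch looks something like the following. It is purely illustrative (I do not use it below) and assumes a `SpatRaster` such as the saved `cor_pred_hist` layer.
library(ggplot2)
library(tidyterra)
# Plot a SpatRaster with ggplot2 via tidyterra
ggplot() +
  geom_spatraster(data = cor_pred_hist) +
  scale_fill_viridis_c(na.value = NA) +
  theme_minimal()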
Now, what exactly goes into a 'publication quality' plot? There is a long list of things that separate a good figure from a bad one. I will go over a few decisions I have made, but I will not begin to scratch the surface. I highly recommend taking a look at this YouTube video by the Riffomonas Project on creating publication quality figures (https://youtu.be/3y7pJYURspA?si=NsdrJU9MxnmNF6b_). The creator, Dr. Pat Schloss, is a microbiologist at the University of Michigan. He has an immense experience base in computational statistics, and a talent for simply explaining complex code. Check out his catalog of videos and his website!! I take a lot of inspiration from his philosophy on open access statistics and coding, and I have certainly learned a lot from him. Okay, that is enough about my statistics crush…
For a `terra`-specific introduction to plotting, check out this great introduction to making simple maps with `terra` by my collaborator Dr. Tyler Smith (https://plantarum.ca/2023/02/13/terra-maps/). He has recently updated the post to keep up with all the new `terra` functionality. Also check out the rest of his blog!
As always, let’s start by loading the raster layers and thresholds we saved from the SDM outputs.
NOTE: Due to the Lambert projection, the x and y axes now use a projected coordinate system rather than standard geographic coordinates (latitude and longitude), so the axis limits must be set differently. If you still want to plot lat/lon coordinates, you need to project a graticule, which I cover below.
Let’s begin…
# Load occurrences and raster/vectors
# Occurrence Points in SpatVectors
getwd()
setwd("../occ_data/")
occThin_cor <- readRDS(file = 'occThin_cor.Rdata') # M. coronaria
occThin_fus <- readRDS(file = 'occThin_fus.Rdata') # M. fusca
# Great Lakes shapefiles for making pretty maps
# Shape files downloaded from the USGS (https://www.sciencebase.gov/catalog/item/530f8a0ee4b0e7e46bd300dd)
great_lakes <- vect('C:/Users/terre/Documents/Acadia/Malus Project/maps/great lakes/combined great lakes/')
While plotting the SDM results in the previous script, I realized that the `gadm` country boundaries were making the results of the SDM really hard to see along the coast. This was especially true for Malus fusca on the west coast of British Columbia, where there are many inlets and bays (see the example of a preliminary map below). Because of this, I chose to scrap the admin boundaries that outline the entirety of Canada and the US, and instead download a shapefile of just the border.
# Canada/US Border for showing the international line. Gadm admin boundaries trace the entire country, vs this is just the border
# Much easier to see the SDM results along coastlines where tracing obscures the data
# Downloaded from https://koordinates.com/layer/111012-canada-and-us-border/
can_us_border <- vect('C:/Users/terre/Documents/Acadia/Malus Project/maps/can_us border')
A small tweak I made to the vector was to remove the border lines drawn in the ocean on both coasts, purely for aesthetic reasons. Thankfully, each segment of the vector is clearly named and easy to drop.
can_us_border$SectionEng # access variable names of each segment
# Two line segments are in the water and are not needed in this case, lets remove them to make the maps look prettier
segments_to_remove <- c("Gulf of Maine", "Straits of Georgia and Juan de Fuca")
can_us_border <- can_us_border[!can_us_border$SectionEng %in% segments_to_remove, ]
Up until now, all the maps produced above have used the World Geodetic System 1984 (a.k.a. WGS84) geographic coordinate system, the de facto standard for most web apps (and the basis of the pseudo-Mercator projection they use). With that said, it is not very nice for producing map figures because it distorts the shape and size of areas. So we will be using the Lambert Conformal Conic projection instead. This projection is ideal for mapping Canada and the US because its east-west orientation minimizes distortion across these large, elongated regions, providing a more accurate representation of geographic features and distances. It is also pretty much the standard projection in every Canadian geography textbook, and it is the official mapping projection of the Government of Canada.
For an introduction to projections and 'Coordinate Reference Systems' (CRS) in R, visit https://rspatial.org/raster/spatial/6-crs.html.
projLam <- "+proj=lcc +lat_1=49 +lat_2=77 +lat_0=49 +lon_0=-95 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs"
As I mentioned above, when using a projection in R other than WGS84, you will need to produce graticules to show latitude and longitude on the new coordinate system. There is a function in `terra` to do this. I am not going to use it for my plots, but it is good to know how to do it!
# Graticules
g.cor <- terra::graticule(
  lon = seq(-110, -55, by = 10), # Longitude from 110°W to 55°W
  lat = seq(30, 55, by = 10),    # Latitude from 30°N to 55°N
  crs = projLam
)
So now we need to get all of the spatial data we are going to project on the same coordinate system (The Lambert). This can take a few minutes to run!
# Great lakes
great_lakes.lcc <- project(great_lakes, projLam)
# Can/US border
can_us_border.lcc <- project(can_us_border, projLam)
# Occurrences
occThin_cor.lcc <- project(occThin_cor, projLam)
occThin_fus.lcc <- project(occThin_fus, projLam)
# Habitat suitability
cor_pred_hist.lcc <- project(cor_pred_hist, projLam)
cor_pred_ssp245_30.lcc <- project(cor_pred_ssp245_30, projLam)
cor_pred_ssp245_50.lcc <- project(cor_pred_ssp245_50, projLam)
cor_pred_ssp245_70.lcc <- project(cor_pred_ssp245_70, projLam)
cor_pred_ssp585_30.lcc <- project(cor_pred_ssp585_30, projLam)
cor_pred_ssp585_50.lcc <- project(cor_pred_ssp585_50, projLam)
cor_pred_ssp585_70.lcc <- project(cor_pred_ssp585_70, projLam)
Ok we have done all the prep work we needed to do, now let’s make some figures!
# We need three colours that clearly show the threshold categories
brewer.pal(3, "YlOrBr") # Brewer palettes are helpful for mapping colour gradients
legend_labs <- c('Low Suitability', 'Moderate Suitability', 'High Suitability')
fill_cols <- c("#FFF7BC", "#FEC44F", "#D95F0E")
# Then we need to determine some plot limits. Unfortunately it is a bit more tricky compared to the Mercator projection
# It can take some trial and error
cor.xlim <- c(-5*10^5, 3.1*10^6)
cor.ylim <- c(-2*10^6, 2*10^6)
I want to make a figure with multiple panels. You could absolutely do this in R, BUT sometimes the simpler solution can be easier. In this case I am going to produce all the elements/panels of the figure and then assemble them in PowerPoint! To start, I am going to produce a legend.
# Plot a legend that can be saved on its own
plot(NULL ,xaxt='n',yaxt='n',bty='n',ylab='',xlab='', xlim=0:1, ylim=0:1)
legend('center', xpd = NA, title = c(as.expression(bquote(bold('Habitat Suitability')))), legend = legend_labs, fill = fill_cols, cex = 3)
# Plot historical distribution
terra::plot(cor_pred_hist.lcc > corPred_threshold_1, col = c('#E8E8E8', '#FFF7BC'),
background = 'lightskyblue1',
legend = F,
xlim = cor.xlim, ylim = cor.ylim,
main = " Historical Suitability (1970-2000)",
cex.main = 3,
axes = F,
box = T,
mar = c(5, 5, 5, 5))
terra::plot(cor_pred_hist.lcc > corPred_threshold_10, col = c(NA, '#FEC44F'), add = T, legend = F)
terra::plot(cor_pred_hist.lcc > corPred_threshold_50, col = c(NA, '#D95F0E'), add = T, legend = F)
terra::plot(can_us_border.lcc, add = T)
#terra::plot(g.cor, add = T, retro = T, col = '#999999',lab.lat = c(2,3,4), lab.lon = c(2,3,4,5), lab.cex = 1.5, mar = c(5,5,5,5), lab.loc = 1, off.lat = -0.01) # Code for adding a graticule to the plot.
Now, I have repeated the plotting process for all time periods and both SSPs. I exported each panel (one way to do this is sketched below), brought them into PowerPoint, and produced the following figure!
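For reference, one way to export each panel at a fixed size before assembling them; the file name and dimensions are placeholders.
# Export the historical panel to a PNG for assembly in PowerPoint
png('cor_hist_panel.png', width = 2400, height = 2000, res = 300)
terra::plot(cor_pred_hist.lcc > corPred_threshold_1, col = c('#E8E8E8', '#FFF7BC'),
            background = 'lightskyblue1', legend = F,
            xlim = cor.xlim, ylim = cor.ylim,
            main = " Historical Suitability (1970-2000)", cex.main = 3,
            axes = F, box = T, mar = c(5, 5, 5, 5))
terra::plot(cor_pred_hist.lcc > corPred_threshold_10, col = c(NA, '#FEC44F'), add = T, legend = F)
terra::plot(cor_pred_hist.lcc > corPred_threshold_50, col = c(NA, '#D95F0E'), add = T, legend = F)
terra::plot(can_us_border.lcc, add = T)
dev.off()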
Now let’s break down some of the decisions I made when designing the map.
I chose a sequential Brewer palette because light colours stress lower values and darker colours stress greater values, which makes sense for habitat suitability going from low to high. The shades of brown are quick and easy to distinguish (but are not colour blind friendly). They are also soft, aesthetically pleasing hues.
The light grey background of North America is easily recognizable, but it is not so dark that it takes the reader's attention away from the SDM results.
As previously mentioned, earlier iterations had administrative boundaries for each province/state, but they made the plots very busy. The question I asked myself was: is the clutter worth the information being presented, or is that information even relevant at all? No, not really! BUT it is important to show the international border!
The ocean could have remained white, but it is a bit more clear what is water if we make it blue. The light blue matches the soft browns and grey. It also distinguishes the background of the map from the rest of the figure.
The arrangement of the historical plot followed by the SSPs creates a sensible hierarchy, with each year as a column and each SSP as a row. It is more natural to read left to right, following the time progression, than to compare up and down.
The legend is large and clear.
The species is in the top left so it is the first thing the reader sees, with an accompanying picture that draws the reader in and reminds them which species they are looking at. The same picture is used in an earlier figure introducing the species, to create continuity.
Each panel is labeled with letters so that they can be easily referenced in-text.
Overall the plot is minimal, nothing is excessive, and the whole plot is soft and easy to look at. (At least in my humble opinion).
The natural next step after completing an SDM is a gap analysis.
A gap analysis is a method in conservation biology for assessing the effectiveness of existing protected areas (PAs) and ex situ collections in conserving biodiversity. The analysis identifies where the species (or ecosystem) of interest is protected or not, and to what extent it is represented within those protected areas. There are many different approaches and software packages for completing this analysis, but as we have come this far we will stick with R. Dan Carver et al. 2021 (https://nsojournals.onlinelibrary.wiley.com/doi/full/10.1111/ecog.05430) published a great package for completing gap analyses in R, but several of its dependencies have since been deprecated, making the package inoperable. With that said, we can still use their equations to complete the analysis ourselves.
Unfortunately, for the Malus species I am working with there are simply not enough ex situ data in Canada to complete the ex situ analysis. Thankfully, we do have enough data for an in situ analysis.
library(tidyverse) # Grammar and data management
library(terra) # Spatial Data package
library(geodata) # basemaps
library(ggpubr) # arrange ggplot figures
# The following gap analysis is referencing Carver et al. (2021)
# https://nsojournals.onlinelibrary.wiley.com/doi/full/10.1111/ecog.05430
# For the purposes of the analysis we are restricting the study area to just Canada
To begin, we need spatial data for the protected areas in Canada. The Canadian government releases data on all its protected and conserved areas, including terrestrial and marine areas as well as other effective area-based conservation measures (OECMs). In this analysis I am only considering PAs, although for your purposes you may want to include OECMs. Link to the Canadian Protected and Conserved Areas Database (CPCAD): https://www.canada.ca/en/environment-climate-change/services/national-wildlife-areas/protected-conserved-areas-database.html.
Read more about Canadian OECMs here: https://www.canada.ca/en/environment-climate-change/services/nature-legacy/other-effective-area-based-measures.html.
# Protected area and OECM data from the Canadian Protected and Conserved Areas Database (CPCAD)
# Download the CPCAD geodatabase and vectorize using terra
pro_area <- terra::vect("C:/Users/terre/Documents/Acadia/Malus Project/maps/canada_pa/ProtectedConservedArea_2023/ProtectedConservedArea_2023/ProtectedConservedArea_2023.gdb")
pro_area <- pro_area[pro_area$BIOME == 'T'] # filter only the terrestrial protected areas and OECMs
pro_area <- pro_area[pro_area$PA_OECM_DF == '1'] # filter only protected areas (remove OECMs)
pro_area <- project(pro_area, 'WGS84') # project the PAs to be the same CRS as the SDM layers
NOTE: The CPCAD data uses a different projection than the one our saved spatial layers are in, so make sure to reproject it.
Here we can see all the terrestrial protected areas in Canada as of Sept. 2024. Note that this does not include OECMs.
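A quick sketch to reproduce that overview map (the styling is arbitrary):
# Quick look at the filtered terrestrial protected areas
terra::plot(pro_area, col = 'darkgreen', border = NA, main = 'CPCAD terrestrial protected areas')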
Now let’s load in all the necessary SDM layers, species occurrences and base maps.
# I am using the complete suitability rasters and thresholds separately
# Could also load pre-categorized layers as well
# M. coronaria
getwd()
setwd('../sdm_output/')
cor_pred_hist <- readRDS(file = 'cor_pred_hist.Rdata')
cor_pred_ssp245_30 <- readRDS(file = 'cor_pred_ssp245_30.Rdata')
cor_pred_ssp245_50 <- readRDS(file = 'cor_pred_ssp245_50.Rdata')
cor_pred_ssp245_70 <- readRDS(file = 'cor_pred_ssp245_70.Rdata')
cor_pred_ssp585_30 <- readRDS(file = 'cor_pred_ssp585_30.Rdata')
cor_pred_ssp585_50 <- readRDS(file = 'cor_pred_ssp585_50.Rdata')
cor_pred_ssp585_70 <- readRDS(file = 'cor_pred_ssp585_70.Rdata')
# Thresholds
# M. coronaria
setwd('../sdm_output/thresholds')
corPred_threshold_1 <- readRDS(file = 'corPred_threshold_1.Rdata')
corPred_threshold_10 <- readRDS(file = 'corPred_threshold_10.Rdata')
corPred_threshold_50 <- readRDS(file = 'corPred_threshold_50.Rdata')
# Load a Canada Admin boundary
getwd()
setwd('../occ_data')
ca_bound <- gadm(country = 'CA', level = 0, resolution = 1,
path = '../occ_data/base_maps')
NOTE: I am following Carver et al. (2021). I will not explain in much detail what each metric represents, so go check out the paper!
ANOTHER NOTE: One major difference between this workflow and Carver et al. (2021) is that I am using the projected SSPs to look at the future values of these metrics, assuming the PAs stay the same as they are today.
The first gap analysis statistic we want to calculate is the sampling representativeness score in situ (SRSin). To do this, we will crop and mask the predicted SDM layers to Canada's borders, and then crop and mask the result again to only the protected areas (PAs).
I will note that the way I am coding this is probably not the most efficient or clean, but it will get the job done! Let's hope Dan can get the GapAnalysisR package back to full strength. In the meantime, this workflow serves as an example of how to complete these calculations manually on your own.
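As a roadmap, here is how the four in situ metrics from Carver et al. (2021) are assembled in the rest of this script (comments only, summarizing the calculations that follow):
# SRSin = 100 * (occurrences in suitable habitat within PAs) /
#               (occurrences in suitable habitat across Canada)
# GRSin = 100 * (suitable area within PAs, km^2) / (total suitable area in Canada, km^2)
# ERSin = 100 * (ecoregions containing suitable habitat within PAs) /
#               (ecoregions containing suitable habitat across Canada)
# FCSin = (SRSin + GRSin + ERSin) / 3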
# Mask and crop SDM layer to Canada
# M. coronaria
# Historical
can_cor_pred_hist <- crop(cor_pred_hist, ca_bound, mask = T)
# SSP245
can_cor_pred_ssp245_30 <- crop(cor_pred_ssp245_30, ca_bound, mask = T)
can_cor_pred_ssp245_50 <- crop(cor_pred_ssp245_50, ca_bound, mask = T)
can_cor_pred_ssp245_70 <- crop(cor_pred_ssp245_70, ca_bound, mask = T)
# SSP585
can_cor_pred_ssp585_30 <- crop(cor_pred_ssp585_30, ca_bound, mask = T)
can_cor_pred_ssp585_50 <- crop(cor_pred_ssp585_50, ca_bound, mask = T)
can_cor_pred_ssp585_70 <- crop(cor_pred_ssp585_70, ca_bound, mask = T)
# Mask Suitable Habitat rasters to the PA polygons
#M. coronaria
#Historical
pa_cor_pred_hist <- crop(can_cor_pred_hist, pro_area, mask = T)
# SSP245
pa_cor_pred_ssp245_30 <- crop(can_cor_pred_ssp245_30, pro_area, mask = T)
pa_cor_pred_ssp245_50 <- crop(can_cor_pred_ssp245_50, pro_area, mask = T)
pa_cor_pred_ssp245_70 <- crop(can_cor_pred_ssp245_70, pro_area, mask = T)
# SSP585
pa_cor_pred_ssp585_30 <- crop(can_cor_pred_ssp585_30, pro_area, mask = T)
pa_cor_pred_ssp585_50 <- crop(can_cor_pred_ssp585_50, pro_area, mask = T)
pa_cor_pred_ssp585_70 <- crop(can_cor_pred_ssp585_70, pro_area, mask = T)
So now that the layers are properly cropped and masked to only the areas we want to consider, we need to extract all the true presences. As you will remember from above, these are the cases where occurrences fall within an area predicted as suitable. Then we will do the same for the true positive occurrences that fall within PAs.
# Extract occurrence points that fall within the SDM prediction
# M. coronaria
# Historical
index_cor_pos_hist_low <- extract(can_cor_pred_hist > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_hist_mod <- extract(can_cor_pred_hist > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_hist_high <- extract(can_cor_pred_hist > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
# SSP245
index_cor_pos_ssp245_30_low <- extract(can_cor_pred_ssp245_30 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_ssp245_30_mod <- extract(can_cor_pred_ssp245_30 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp245_30_high <- extract(can_cor_pred_ssp245_30 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp245_50_low <- extract(can_cor_pred_ssp245_50 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_ssp245_50_mod <- extract(can_cor_pred_ssp245_50 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp245_50_high <- extract(can_cor_pred_ssp245_50 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp245_70_low <- extract(can_cor_pred_ssp245_70 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_ssp245_70_mod <- extract(can_cor_pred_ssp245_70 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp245_70_high <- extract(can_cor_pred_ssp245_70 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
# SSP585
index_cor_pos_ssp585_30_low <- extract(can_cor_pred_ssp585_30 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_ssp585_30_mod <- extract(can_cor_pred_ssp585_30 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp585_30_high <- extract(can_cor_pred_ssp585_30 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp585_50_low <- extract(can_cor_pred_ssp585_50 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_ssp585_50_mod <- extract(can_cor_pred_ssp585_50 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp585_50_high <- extract(can_cor_pred_ssp585_50 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp585_70_low <- extract(can_cor_pred_ssp585_70 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_ssp585_70_mod <- extract(can_cor_pred_ssp585_70 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp585_70_high <- extract(can_cor_pred_ssp585_70 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
# Extract the occurrences that fall within the PAs
# M. coronaria
# Historical
index_pa_cor_pos_hist_low <- extract(pa_cor_pred_hist > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_hist_mod <- extract(pa_cor_pred_hist > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_hist_high <- extract(pa_cor_pred_hist > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
# SSP245
index_pa_cor_pos_ssp245_30_low <- extract(pa_cor_pred_ssp245_30 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_ssp245_30_mod <- extract(pa_cor_pred_ssp245_30 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp245_30_high <- extract(pa_cor_pred_ssp245_30 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp245_50_low <- extract(pa_cor_pred_ssp245_50 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_ssp245_50_mod <- extract(pa_cor_pred_ssp245_50 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp245_50_high <- extract(pa_cor_pred_ssp245_50 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp245_70_low <- extract(pa_cor_pred_ssp245_70 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_ssp245_70_mod <- extract(pa_cor_pred_ssp245_70 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp245_70_high <- extract(pa_cor_pred_ssp245_70 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
# SSP585
index_pa_cor_pos_ssp585_30_low <- extract(pa_cor_pred_ssp585_30 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_ssp585_30_mod <- extract(pa_cor_pred_ssp585_30 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp585_30_high <- extract(pa_cor_pred_ssp585_30 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp585_50_low <- extract(pa_cor_pred_ssp585_50 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_ssp585_50_mod <- extract(pa_cor_pred_ssp585_50 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp585_50_high <- extract(pa_cor_pred_ssp585_50 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp585_70_low <- extract(pa_cor_pred_ssp585_70 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_ssp585_70_mod <- extract(pa_cor_pred_ssp585_70 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp585_70_high <- extract(pa_cor_pred_ssp585_70 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
Note: We need to filter the extracted layers so that we only keep rows where 'lyr1' is TRUE.
Now we have a list of all the true positive occurrences for the entire predicted suitable area in Canada, and of those that fall within PAs. So now we want to count how many of each there are and calculate the SRSin metric.
# return count of the number of occurrences
# positive occurrences (overall)
# list of obj names from above
occ_names <- list(
index_cor_pos_hist_low,
index_cor_pos_hist_mod,
index_cor_pos_hist_high,
index_cor_pos_ssp245_30_low,
index_cor_pos_ssp245_30_mod,
index_cor_pos_ssp245_30_high,
index_cor_pos_ssp245_50_low,
index_cor_pos_ssp245_50_mod,
index_cor_pos_ssp245_50_high,
index_cor_pos_ssp245_70_low,
index_cor_pos_ssp245_70_mod,
index_cor_pos_ssp245_70_high,
index_cor_pos_ssp585_30_low,
index_cor_pos_ssp585_30_mod,
index_cor_pos_ssp585_30_high,
index_cor_pos_ssp585_50_low,
index_cor_pos_ssp585_50_mod,
index_cor_pos_ssp585_50_high,
index_cor_pos_ssp585_70_low,
index_cor_pos_ssp585_70_mod,
index_cor_pos_ssp585_70_high)
# return counts in a vector
SRSin_occ <- occ_names %>%
lapply(., nrow) %>%
unlist()
# positive occurrences within PAs
pa_names <- list(
index_pa_cor_pos_hist_low,
index_pa_cor_pos_hist_mod,
index_pa_cor_pos_hist_high,
index_pa_cor_pos_ssp245_30_low,
index_pa_cor_pos_ssp245_30_mod,
index_pa_cor_pos_ssp245_30_high,
index_pa_cor_pos_ssp245_50_low,
index_pa_cor_pos_ssp245_50_mod,
index_pa_cor_pos_ssp245_50_high,
index_pa_cor_pos_ssp245_70_low,
index_pa_cor_pos_ssp245_70_mod,
index_pa_cor_pos_ssp245_70_high,
index_pa_cor_pos_ssp585_30_low,
index_pa_cor_pos_ssp585_30_mod,
index_pa_cor_pos_ssp585_30_high,
index_pa_cor_pos_ssp585_50_low,
index_pa_cor_pos_ssp585_50_mod,
index_pa_cor_pos_ssp585_50_high,
index_pa_cor_pos_ssp585_70_low,
index_pa_cor_pos_ssp585_70_mod,
index_pa_cor_pos_ssp585_70_high)
# return counts in a vector
SRSin_pa <- pa_names %>%
lapply(., nrow) %>%
unlist()
# Calculate SRSin score
SRSin_score <- ((SRSin_pa/SRSin_occ) * 100) %>% replace_na(., 0) # some values are NAs as there are 0 in the denominator
NOTE: Some NA values are produced in the SRSin calculation because, for some future SDM predictions, no predicted suitable area (total area) overlaps any of the occurrence points, giving a zero denominator. So we set those cases to zero here.
Now we want to create a data frame to store all the results in. We should also be thinking about how to make it easy on ourselves to plot these data with `ggplot2` downstream.
in_situ <- data.frame(
species = c(rep('Malus coronaria', times = 21)),
ssp = c(rep('historical', times = 3), rep(245, times = 9), rep(585, times = 9)),
suitability = rep(c('low', 'moderate', 'high'), times = 7),
period = c(rep(2000, times = 3), rep(2030, times = 3), rep(2050, times = 3), rep(2070, times = 3),
rep(2030, times = 3), rep(2050, times = 3), rep(2070, times = 3)),
SRSin = SRSin_score
)
print(in_situ)
Now let's calculate the geographical representativeness score in situ (GRSin). This is the total area of PAs containing suitable habitat divided by the total area the SDM predicted as suitable habitat. To do that, we need to calculate the area of the total suitable habitat and the area of the protected land containing suitable habitat.
# Overal suitable area
# create a list of rasters to calculate total suitable area for each threshold, time period and SSP
suitable_area_rasts <- list(
can_cor_pred_hist > corPred_threshold_1,
can_cor_pred_hist > corPred_threshold_10,
can_cor_pred_hist > corPred_threshold_50,
can_cor_pred_ssp245_30 > corPred_threshold_1,
can_cor_pred_ssp245_30 > corPred_threshold_10,
can_cor_pred_ssp245_30 > corPred_threshold_50,
can_cor_pred_ssp245_50 > corPred_threshold_1,
can_cor_pred_ssp245_50 > corPred_threshold_10,
can_cor_pred_ssp245_50 > corPred_threshold_50,
can_cor_pred_ssp245_70 > corPred_threshold_1,
can_cor_pred_ssp245_70 > corPred_threshold_10,
can_cor_pred_ssp245_70 > corPred_threshold_50,
can_cor_pred_ssp585_30 > corPred_threshold_1,
can_cor_pred_ssp585_30 > corPred_threshold_10,
can_cor_pred_ssp585_30 > corPred_threshold_50,
can_cor_pred_ssp585_50 > corPred_threshold_1,
can_cor_pred_ssp585_50 > corPred_threshold_10,
can_cor_pred_ssp585_50 > corPred_threshold_50,
can_cor_pred_ssp585_70 > corPred_threshold_1,
can_cor_pred_ssp585_70 > corPred_threshold_10,
can_cor_pred_ssp585_70 > corPred_threshold_50)
# the following will return a vector of the total suitable habitat in km^2
# lapply allows you to iterate a function over a list
# terra::expanse calculates the area of the raster cells, grouped by value
# I then use an anonymous function to create a dataframe of the listed values so that
# they can be filtered such that only TRUE suitable areas (value = 1) are selected
# and then the area is returned using pull (extracts a single column) for all cases that match value = 1
# finally I unlist the temporary df so that the areas are returned in a vector
total_area <- lapply(suitable_area_rasts, terra::expanse, byValue = T, unit = 'km') %>%
lapply(., function(df) {
df %>% filter(value == 1) %>% pull(area)
}) %>%
unlist()
And we simply repeat what’s above for protected areas.
pa_area_rasts <- list(
pa_cor_pred_hist > corPred_threshold_1,
pa_cor_pred_hist > corPred_threshold_10,
pa_cor_pred_hist > corPred_threshold_50,
pa_cor_pred_ssp245_30 > corPred_threshold_1,
pa_cor_pred_ssp245_30 > corPred_threshold_10,
pa_cor_pred_ssp245_30 > corPred_threshold_50,
pa_cor_pred_ssp245_50 > corPred_threshold_1,
pa_cor_pred_ssp245_50 > corPred_threshold_10,
pa_cor_pred_ssp245_50 > corPred_threshold_50,
pa_cor_pred_ssp245_70 > corPred_threshold_1,
pa_cor_pred_ssp245_70 > corPred_threshold_10,
pa_cor_pred_ssp245_70 > corPred_threshold_50,
pa_cor_pred_ssp585_30 > corPred_threshold_1,
pa_cor_pred_ssp585_30 > corPred_threshold_10,
pa_cor_pred_ssp585_30 > corPred_threshold_50,
pa_cor_pred_ssp585_50 > corPred_threshold_1,
pa_cor_pred_ssp585_50 > corPred_threshold_10,
pa_cor_pred_ssp585_50 > corPred_threshold_50,
pa_cor_pred_ssp585_70 > corPred_threshold_1,
pa_cor_pred_ssp585_70 > corPred_threshold_10,
pa_cor_pred_ssp585_70 > corPred_threshold_50)
pa_area <- lapply(pa_area_rasts, terra::expanse, byValue = T, unit = 'km') %>%
lapply(., function(df) {
df %>% filter(value == 1) %>% pull(area)
}) %>%
unlist()
Now we should have everything we need to make the GRSin calculation.
# calculate the GRSin score
GRSin_score <- (pa_area/total_area) * 100
# add it to the data frame from above
in_situ <- in_situ %>% mutate(GRSin = GRSin_score)
The next metric to calculate is the ecological representativeness score in situ (ERSin).
NOTE: Another deviation from Carver et al. (2021) is that we will be using a different set of ecoregions. They use ecoregion data from the Nature Conservancy Geospatial Atlas (https://geospatial.tnc.org/datasets/7b7fb9d945544d41b3e7a91494c42930_0), whereas we will use the EPA ecoregion data from the background-selection script (https://www.epa.gov/eco-research/ecoregions-north-america), sticking with Level II as used above.
# Download NA Ecoregion shapefile from: https://www.epa.gov/eco-research/ecoregions-north-america
# Load shapefile from local files
ecoNA <- vect(x = "C:/Users/terre/Documents/Acadia/Malus Project/maps/eco regions/na_cec_eco_l2/", layer = 'NA_CEC_Eco_Level2')
ecoNA <- project(ecoNA, 'WGS84') # project ecoregion vector to same coords ref as basemap
So now let's take the list of rasters we made for the GRSin calculation and reuse it here. We want to convert the suitable habitat rasters to points so that we can extract the names of the ecoregions from the ecoregion vector. To do that we will use `lapply()` to convert all the rasters to points, while also defining an anonymous function to keep only the 'true' layer. Remember, we only want points for the TRUE cases where habitat is in fact suitable.
# suitable habitat eco regions
suitable_area_points <- lapply(suitable_area_rasts, terra::as.points) %>%
lapply(., function (x) {
x[x$lyr1 == 1, ]
})
# protected area ecoregions
pa_area_points <- lapply(pa_area_rasts, terra::as.points) %>%
lapply(., function (x) {
x[x$lyr1 == 1, ]
})
IMPORTANT NOTE: This next step is computationally expensive and will take several hours to complete. In my case it took a total of 17 hours to run this chunk. The time can be reduced by using fewer input spatial points (i.e. the SDM rasters converted to points): fewer suitability thresholds, fewer time periods and/or fewer SSPs.
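Before committing to the full run, it can be worth timing the extraction for a single layer to get a rough per-layer estimate (a sketch; the layers differ in the number of points, so scale the estimate with caution).
# Time the ecoregion extraction for just the first suitability layer
single_layer_time <- system.time(
  terra::extract(x = ecoNA, y = suitable_area_points[[1]])
)
print(single_layer_time)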
start <- Sys.time()
suitable_area_eco <- lapply(suitable_area_points, function(point) {
terra::extract(x = ecoNA, y = point)
})
print( Sys.time() - start )
# took 12.93733 hours to complete
start <- Sys.time()
pa_area_eco <- lapply(pa_area_points, function(point) {
terra::extract(x = ecoNA, y = point)
})
print( Sys.time() - start )
# took 3.679289 hours to complete
We especially want to save these intermediate objects once the extractions are complete.
# SAVE/LOAD ecoregion results
getwd()
setwd('../gap_analysis')
saveRDS(suitable_area_eco, file = 'suitable_area_eco.Rdata')
saveRDS(pa_area_eco, file = 'pa_area_eco.Rdata')
suitable_area_eco <- readRDS(file = 'suitable_area_eco.Rdata')
pa_area_eco <- readRDS('pa_area_eco.Rdata')
So now we want to filter the data to find the number of ecoregions represented in each layer supplied.
# return unique names and number of eco regions
suit_area_eco_num <- suitable_area_eco %>%
bind_rows(.id = 'source_list') %>% # bind the lists of lists together
unnest(cols = c(NA_L2NAME)) %>% # unnest the lists into columns of a dataframe
distinct(source_list, NA_L2CODE) %>% # return only the unique eco regions grouped by source list (aka the raster layers)
filter(!NA_L2CODE %in% c(NA, '0.0')) %>% # remove NA and 0.0 (water) values
add_count(source_list, name = 'num_eco') %>% # count the number of nrow for each source list (or number of eco regions)
distinct(source_list, num_eco) # return the number of eco regions per source list
pa_area_eco_num <- pa_area_eco %>%
bind_rows(.id = 'source_list') %>% # bind the lists of lists together
unnest(cols = c(NA_L2NAME)) %>% # unnest the lists into columns of a dataframe
distinct(source_list, NA_L2CODE) %>% # return only the unique eco regions grouped by source list (aka the raster layers)
filter(!NA_L2CODE %in% c(NA, '0.0')) %>% # remove NA and 0.0 (water) values
add_count(source_list, name = 'num_eco') %>% # count the number of nrow for each source list (or number of eco regions)
distinct(source_list, num_eco) # return the number of eco regions per source list
And now we should have all the numbers needed to calculate the ERSin metric.
# Now calculate the ERSin score
ERSin_score <- (pa_area_eco_num$num_eco / suit_area_eco_num$num_eco) * 100
# Mutate the ERSin score to the master df
in_situ <- in_situ %>% mutate(ERSin = ERSin_score)
So now that we have the SRSin, GRSin and ERSin scores, we can calculate the final conservation score in situ (FCSin). This is a very simple calculation to do in R.
# Calculate the final conservation score for in situ conservation
in_situ <- in_situ %>% mutate(FCSin = ((SRSin + GRSin + ERSin)/3))
Now that we have all the calculations complete, it is time to visualize them! I am only going to show one example plot, so just replicate it for the rest of the data.
# bar plots
fill_cols <- c("#EDF8B1", "#7FCDBB", "#2C7FB8")
# M. coronaria
cor_srs_combined <- in_situ %>% filter(species == 'Malus coronaria' & suitability == 'high') %>%
ggplot(aes(x = period, y = SRSin, fill = ssp)) +
geom_col(position = position_dodge()) +
scale_x_continuous(limits = c(1990, 2080),
breaks = c(2000, 2030, 2050, 2070),
labels = c("Historical", 2030, 2050, 2070)) +
scale_y_continuous(limits = c(0, 105),
breaks = c(0, 20, 40, 60, 80, 100),
expand = c(0,0)) +
theme_classic() +
scale_fill_manual(values = fill_cols,
breaks = c('historical', '245', '585'),
labels = c('Historical', 'SSP245', 'SSP585')) +
theme(text = element_text(size = 30, colour = 'black'),
axis.text = element_text(colour = 'black'),
axis.title.x=element_blank(),
legend.title = element_blank())
# ditto
# arrange plots using <ggpubr>
ggarrange(cor_srs_combined, cor_grs_combined, cor_ers_combined, cor_fcs_combined,
nrow = 1, ncol = 4,
legend = "top",
common.legend = T)
And this is the result!