UBC Botanical Garden is advancing action for adaptation through the development of community-based adaptation plans with the Sustainable Communities Field School. Ensuring the long-term preservation and sustainable utilization of apple genetic diversity across Canada is of paramount importance. This collaborative initiative aims to inventory and understand existing apple genetic diversity within the country. Additionally, it seeks to investigate how the suitability of habitats for apples may be affected by changing climatic conditions and emerging challenges. In response to these potential impacts, the project aims to identify urgent and longer-term adaptation strategies necessary for the continued resilience of apple populations in Canada.
This work is associated with an in-progress manuscript on Malus crop wild relatives (CWR) that are native to Canada, including M. coronaria (Sweet Crabapple) and M. fusca (Pacific Crabapple).
The following analysis was written in R, a free programming language and software environment primarily used for statistical computing and graphics. Visit https://www.r-project.org/ to get started with R. Also recommended is RStudio, an integrated development environment (IDE) for R. It provides a user-friendly interface for writing code, viewing plots and data, managing files, and installing packages. Once R is installed, visit https://www.rstudio.com/products/rstudio/download/ to download RStudio.
The species distribution modeling in the following analysis also requires that you have Java (another free programming language) installed. Visit https://www.oracle.com/ca-en/java/technologies/downloads/ and install the Java Development Kit (JDK).
Please note: The following is a reproducible example, and is not intended to be an introductory tutorial on SDM. The markdown is written for someone with intermediate to advanced experience in R.
The following is a reproducible workflow for the development of species distribution models (SDMs), using presence/background occurrence data for the M. coronaria and M. fusca crabapple tree species. This data will be explained in more detail below. Occurrence data will be downloaded from the open-source Global Biodiversity Information Facility (GBIF). There are several types of SDMs; in this workflow we use the widely adopted MaxEnt (Maximum Entropy), a machine learning algorithm that can handle presence/background data. The outputs of this analysis are models of predicted habitat suitability of fundamental niches based on environmental (climate) data. These models predict suitable habitat under historical (1970-2000) climate conditions and under future climate conditions projected under two Shared Socioeconomic Pathways (SSP2-4.5 and SSP5-8.5). This is essential for understanding how these species may respond to changes in climate, so that they can be better managed, sampled, or conserved to preserve their genetic diversity for future generations.
See the following for an introduction and description of MaxEnt: Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E., & Yates, C. J. (2011). A statistical explanation of MaxEnt for ecologists. Diversity and Distributions, 17(1), 43-57. https://doi.org/10.1111/j.1472-4642.2010.00725.x.
Here is a brief overview of the steps and associated scripts for each part of the analysis.
Step | Script | Purpose |
---|---|---|
1 | gbif_occ.R | Prepare data request and download from GBIF |
2 | occ_clean.R | Clean occurrence (presence) data points |
3 | occ_thin.R | Thin occurrence points to a single observation per predictor (raster) cell |
4 | malus_bg.R | Download ecoregions, sample background points, predictor cluster, 3D Kernel density plots |
5 | malus_sdm.R | Climate predictor data, Maxent modeling and tuning, habitat suitability predictions |
6 | sdm_plot.R | Produce publication quality plots of SDM predictions |
7 | malus_gap | Conservation gap analysis to assess the level of conservation of species |
IMPORTANT NOTE: This workflow and code differ slightly from the code published in the https://github.com/TerrellRoulston/malus repository, and should be treated as an example rather than a standard for how to perform species distribution modeling. One notable difference is that Roulston et al. (in prep) limited the number of predictor variables to a list of well-performing variables, rather than using them all as in this workflow.
The Global Biodiversity Information Facility (GBIF) is a valuable resource for SDM due to its extensive collection of biodiversity data from around the world. There are several sources of occurrences, such as research and herbarium collections and community science observations. Note that this data is presence-only, meaning there is no information about the absence of species. Also, all occurrences are research grade, meaning the identifications are either expert-verified or agreed upon by several community members.
NOTE: You need to have a GBIF user profile, visit https://www.gbif.org/ to sign up!
Load libraries
library(tidyverse) # data management, grammar
library(rgbif) # access GBIF data
Initiate user profile information. Note that I have redacted my own information.
# GBIF user info
user='REDACTED'
pwd='REDACTED'
email='REDACTED'
Navigate to https://www.gbif.org/ and search your taxon of interest to get the taxon key identifier. In this case the taxon keys are: M. coronaria = 3001166 and M. fusca = 3001080.
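If you prefer to look the keys up programmatically rather than through the website, a minimal sketch using the rgbif backbone-matching helper is shown below (the returned usageKey values should match the keys above, but verify them against the GBIF website):
# Optional sketch: look up GBIF taxon keys from within R
# name_backbone() matches a name against the GBIF backbone taxonomy
name_backbone(name = "Malus coronaria")$usageKey # should return 3001166
name_backbone(name = "Malus fusca")$usageKey     # should return 3001080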
For the purposes of this workflow I will only show the code for one of the species. Code is just replicated for the other species.
# M. coronaria download
taxonKey <- 3001166
basisOfRecord <- c('PRESERVED_SPECIMEN', 'HUMAN_OBSERVATION', 'OCCURRENCE') # excluded living specimens and material samples (germplasm)
hasCoordinates <- TRUE # limit to records with coordinates
NOTE: We do not limit the year of observation in our data collection. There are reasons why you might; however, for the purposes of this analysis it is reasonable to include all observations, because the main reason historical occurrences may be missing today (or, say, post-1970) is more likely habitat destruction than changes in climate affecting realized niches. Furthermore, crabapple trees are slow growing, and migrate even more slowly, so their distributions likely lag behind changes in their fundamental niches. This highlights the importance of understanding the ecology of your species of interest.
Prepare a download request for the GBIF occurrence data. There are several formats; I chose .csv because it is familiar and easy to work with.
# Download data
# Use 'pred()' if there is a single argument, or 'pred_in()' if there are multiple
down_code = rgbif::occ_download(
pred("taxonKey", taxonKey),
pred_in("basisOfRecord", basisOfRecord),
pred("hasCoordinate", hasCoordinates),
format = "SIMPLE_CSV",
user=user, pwd=pwd, email=email)
Download and save the GBIF occurrence data as a .csv.
getwd() # check your working directory (wd)
setwd("./occ_data/") # set wd to a location where you want to save the csv file.
download_coronaria <- occ_download_get(down_code[1], overwrite = TRUE)
# extract csv from zipped folder and save as a clearly named csv in Excel or equivalent.
gbif_cor <- read.csv(file = "occ_coronaria.csv") # load named csv data
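If you would rather avoid unzipping and renaming the file by hand, a small alternative sketch is to read the table straight from the downloaded archive with rgbif and re-save it under the file name used throughout this workflow:
# Alternative sketch: import the occurrence table directly from the downloaded zip
gbif_cor <- occ_download_import(download_coronaria)
write.csv(gbif_cor, file = "occ_coronaria.csv", row.names = FALSE) # keep a 'raw' copy on disk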
NOTE: It is good practice to keep this .csv as a 'raw' data file, and to manipulate it within the R environment in the following workflow rather than making changes to the raw file.
Now we have downloaded all observations from GBIF for M. coronaria and M. fusca that fit a set of defined criteria suitable for our research question. In this case the criteria are limited to species-level taxonomic resolution and to records from preserved herbarium specimens, human observations (i.e. iNaturalist) and research collections documenting species occurrences, all of which are geo-referenced (i.e. have geographic coordinates). Also note that we do not limit the geographic location of occurrences, yet. At the time of downloading this data (02/14/2024), M. coronaria had 1373 records and M. fusca had 2282 records.
As alluded to above, we now need to make some more choices about which species occurrences are appropriate to include and from where. Thus, we must clean the species occurrence data. To do this we are going to visually inspect the species occurrences as well as take advantage of R packages such as CoordinateCleaner to assist in removing spurious occurrences: replicates, records with poor georeferencing (such as those located at the centre of a country or in a body of water), etc.
Load Libraries
library(tidyverse) # grammar, data management
library(CoordinateCleaner) # helpful functions to clean data
library(terra) # working with vector/raster data
library(geodata) # download basemaps
library(scales) # alpha adjust colours
We begin with loading the saved .csv files of species occurrences from the last script.
# set wd
getwd()
setwd("../occ_data/") # use relative path to open from folder where .csv is saved
# load occurrence csv files
gbif_cor <- read.csv(file = "occ_coronaria.csv") # load coronaria data
gbif_fusca <- read.csv(file = "occ_fusca.csv") # load fusca data
Now we want to initiate a data.frame and select columns that are useful from the .csv files. At the same time as we build this data.frame, we are also going to use several functions to filter occurrences out of the 'cleaned' data set. For more information on the functions used, see the library documentation of CoordinateCleaner. Also see https://data-blog.gbif.org/post/gbif-filtering-guide/ for a great blog post published by GBIF on post-processing GBIF downloads.
# filter M. coronaria data
occ_cor <- gbif_cor %>%
filter(countryCode %in% c('US', 'CA')) %>% #limit to CA and US
filter(!is.na(decimalLongitude)) %>% # remove records w/o coords
filter(coordinateUncertaintyInMeters < 30000 | is.na(coordinateUncertaintyInMeters)) %>%
cc_cen(buffer = 2000) %>% # remove records within 2km of country centroids
cc_inst(buffer = 2000) %>% # remove records within 2km of herbariums, botanical gardens, and other institutions
cc_sea() %>%
distinct(decimalLatitude, decimalLongitude, speciesKey, datasetKey, .keep_all = T) %>%
filter(decimalLongitude >= -100) %>% # remove some records on west coast, two from bot gardens
filter(!(decimalLatitude < 35 & decimalLongitude < -86)) %>% # remove records from Texas, Oklahoma, Louisiana
filter(!(decimalLatitude > 45)) %>% # remove record from New Brunswick
filter(!(decimalLongitude < -98)) %>% # remove iNat record from south Kansas, northern Kansas record verified by taxonomist
dplyr::select(species, countryCode, decimalLatitude,
decimalLongitude, coordinateUncertaintyInMeters, year, basisOfRecord
)
# Note it is helpful to plot the occurrences below, and then add more conditions to clean inaccurate points
# Pay special attention to points at the edge of the range of occurrences, as these are most likely to be suspicious
# as well as influence the model in strange ways, in comparison to inaccurate points well within the other occurrences
After filtering the occurrences we are left with 978 occurrences for M. coronaria and 1197 occurrences for M. fusca.
You will notice that I have added a few manual filters in the code above, using the filter function and the ! operator, to remove records that were verified to be inaccurate. For cleaning these inaccurate points, it is useful to begin by checking occurrences at the edges of the species' range, or any other outlier points. These edge points are the most likely to negatively influence the model if they are not true presence points. I personally find using a combination of the GBIF map and table view of observations, plus Google Maps, to verify the truth of these questionable points to be a good approach.
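As a rough starting point for that manual review, a sketch like the one below can flag records sitting at the extremes of the coordinate range (the quantile cutoffs are arbitrary and only meant to surface candidates for inspection):
# Sketch: list range-edge records worth double-checking by hand
occ_cor %>%
  filter(decimalLatitude >= quantile(decimalLatitude, 0.99) |
           decimalLongitude <= quantile(decimalLongitude, 0.01)) %>%
  dplyr::select(species, decimalLatitude, decimalLongitude, year, basisOfRecord)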
NOTE: I am using pipe operators (%>%) from the tidyverse ecosystem to ease data processing. A pipe sequentially applies each function to the data.frame, so be careful about the order in which you specify functions.
Now we will save the data.frame object as a .Rdata file. .Rdata files are the best way to store/load data within the R environment. We will routinely save intermediate data files to help speed up the downstream workflow, so that results can be stored rather than having to be recomputed every time the R session is closed (INCLUDING WHEN IT CRASHES UNEXPECTEDLY).
getwd() # check what directory you are in
setwd('../occ_data/')
saveRDS(occ_cor, file = "occ_cor.Rdata") # name the file with the .Rdata file extension
Now when revisiting an analysis you can reload the saved cleaned data frame rather than re-running the cleaning functions which are slow computations.
setwd('./occ_data/')
occ_cor <- readRDS(file = "occ_cor.Rdata") # read in the saved .Rdata object
head(occ_cor) # take a peek at your data frame
## species countryCode decimalLatitude decimalLongitude
## 1 Malus coronaria US 43.08833 -84.72617
## 2 Malus coronaria US 43.61555 -84.24722
## 3 Malus coronaria US 40.89888 -74.70638
## 4 Malus coronaria US 41.83333 -88.08333
## 5 Malus coronaria US 42.97083 -82.42500
## 6 Malus coronaria US 35.58000 -82.55583
## coordinateUncertaintyInMeters year basisOfRecord
## 1 NA 2013 HUMAN_OBSERVATION
## 2 20000 1931 PRESERVED_SPECIMEN
## 3 3 1945 PRESERVED_SPECIMEN
## 4 NA 1894 PRESERVED_SPECIMEN
## 5 5000 1892 PRESERVED_SPECIMEN
## 6 20000 NA PRESERVED_SPECIMEN
Earlier I mentioned that we are not limiting the years in which species occurrences were recorded; however, it is still important to verify that this decision is not adding 'weird' bias to the data. To do this I am going to split the occurrences into three data frames: pre-1970, post-1970, and those without dates. This date is commonly associated with when significant changes in global climate began. If occurrences from pre-1970 are vastly different in geography, then we may feel uncomfortable including them in the analysis. Similarly, those without any date should also be verified.
# compare pre/post 1970 and nd occurrences
occ_cor_pre <- occ_cor %>% # pre 1970
filter(year < 1970)
occ_cor_post <- occ_cor %>% # post 1970
filter(year >= 1970)
occ_cor_nd <- occ_cor %>%
filter(year %in% NA)
The best way to compare these pre/post/no-date occurrences is by visualizing them to detect any differences in spatial coordinates. To do this we need to utilize the terra and geodata packages. The terra package is useful for working with raster data and will frequently be used throughout this workflow. The geodata package will also be used frequently, and allows easy download of geographic boundaries from the Global Administrative Areas database (GADM). For now, plotting will be done quickly using base R graphics and terra. At the end we will produce publication-quality maps.
First download the basemaps, and put them in a location to access them again later to avoid having to redownload them. These files are large, so consider not pushing them to GitHub if you are working in a repo.
setwd('../occ_data/')
# download/load maps
us_map <- gadm(country = 'USA', level = 1, resolution = 2,
path = "../occ_data/base_maps") #USA basemap w. States
ca_map <- gadm(country = 'CA', level = 1, resolution = 2,
path = '../occ_data/base_maps') #Canada basemap w. Provinces
canUS_map <- rbind(us_map, ca_map) #combine US and Canada vector map
Now plot the occurrence points. Remember that you might have to go back and add additional filters to the code above, as I have done, for example, by removing records from Texas, Oklahoma, Louisiana, and New Brunswick.
# M coronaria
plot(canUS_map, xlim = c(-100, -60), ylim = c(25, 50))
# plot Malus coronaria occurrences
# pre-1970
points(occ_cor_pre$decimalLongitude, occ_cor_pre$decimalLatitude, pch = 16,
col = alpha("red", 0.2))
# post-1970
points(occ_cor_post$decimalLongitude, occ_cor_post$decimalLatitude, pch = 16,
col = alpha("blue", 0.2))
# No date
points(occ_cor_nd$decimalLongitude, occ_cor_nd$decimalLatitude, pch = 16,
col = alpha('black', 0.2))
legend(x= -75,
y = 33,
title = 'M. coronaria',
legend = c('Pre-1970 (n=412)', 'Post-1970 (n=514)', 'No Date (n=53)'),
fill = c('red', 'blue', 'black'))
dev.off() # close graphics plot window
NOTE: Plotting geographic coordinates can be tricky, and often you will have to play around with the plot limits and legend location to get it to something that is visually appealing.
Inspecting the plot of M. coronaria occurrences, there is a large amount of overlap between occurrences from pre- and post-1970, thus we feel comfortable keeping all the data in the analysis, no matter the year. It is important that you evaluate this for your own species and ask whether it fits your research question.
After visually inspecting the occurrence locations, verifying that there are no duplicates, and no occurrences from strange locations (such as records from the west coast, when dealing with an east coast species), you can save your occurrence data.frame (e.g. occ_cor) for downstream analysis.
So we have downloaded and cleaned the occurrence data for M. coronaria and M. fusca; however, we still need to do more processing before we can move on to producing the SDM. We must now deal with sampling bias in the occurrences. Sampling bias is an issue because, with presence-only data, it is hard to distinguish whether species are being detected because a location is truly preferred habitat, or whether those are just the locations that have been sufficiently sampled. This issue can be (partially) addressed by thinning records, so that multiple records from the same area are represented by only one (or sometimes a few) occurrence sample. This method is imperfect, but it does start to deal with some biases in the spatial representation of the data.
There are several methods for thinning occurrence records, some more sophisticated than others. A simple approach, and the one used in this workflow, is to randomly sample a single occurrence per grid cell of the predictor raster (in this case, the climate variables).
# Load libraries
library(tidyverse) # Grammar and data management
library(terra) # Working with spatial data
library(geodata) # Basemaps and climate data
Start by loading the cleaned species occurrence data saved in the .Rdata file we made in the occ_clean script.
setwd("../occ_data/")
occ_cor <- readRDS(file = "occ_cor.Rdata")
When working with spatial data it is important to make sure that different data sources share the same coordinate reference system (a.k.a. projection). We are going to be using climatic data from WorldClim as our predictor raster layers, which is projected using the common WGS84 projection. In order to make sure that the coordinates of occurrences line up accurately within the grid cells of the WorldClim data, we are going to create a SpatVector object using terra::vect and specify the crs argument so that it knows to project as WGS84.
occ_cor <- vect(occ_cor, geom = c('decimalLongitude', 'decimalLatitude'),
crs = "+proj=longlat +datum=WGS84")
Now we are going to download the WorldClim predictor rasters. WorldClim is a global data set of bioclimatic variables commonly used in ecological research, particularly in SDM workflows. Variables in this data set represent a comprehensive set of climatic conditions, including temperature seasonality, precipitation in the wettest/driest months, and temperature extremes. See https://www.worldclim.org/data/bioclim.html for a complete list of all 19 bioclimatic variables included.
NOTE: There are three resolutions: 10, 5 and 2.5 arcminutes. I chose to use 2.5 arcminutes (and generally recommend it) as it is the finest of these resolutions, which gives greater accuracy of environmental conditions, at the cost of increased computation time. For reference, 2.5 arcminutes is about 3.3 km east-west at 45 degrees latitude. Pick whichever resolution you think is best for your use.
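For a rough sense of what those numbers mean on the ground, the conversion is simple arithmetic (assuming roughly 111.3 km per degree of latitude):
# Back-of-the-envelope cell size for a 2.5 arcminute raster
res_deg <- 2.5 / 60                     # 2.5 arcminutes expressed in degrees
res_deg * 111.3                         # ~4.6 km north-south (and east-west at the equator)
res_deg * 111.3 * cos(45 * pi / 180)    # ~3.3 km east-west at 45 degrees latitude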
IMPORTANT NOTE: These downloads are large (several thousand MB), so if you are working in a GitHub repository make sure to add these files to the .gitignore file so that copies are not pushed to the repo.
I like to create a separate folder for the WorldClim data; it makes the workflow easier to follow and more reproducible. Create a folder within the head of the project.
setwd("../wclim_data/")
# Note: DO NOT PUSH wclim data
wclim <- worldclim_global(var = 'bio', res = 2.5, version = '2.1', path = "../wclim_data/")
NOTE: Climate data is available at several resolutions (sizes of grid cells) ranging from 10 down to 0.5 arcminutes. In this workflow we opted to go with 2.5 arcminutes (approx. 5 km at the equator), as it provides a compromise between fine-scale cell size and practical computation time.
# plot a raster layer to check it downloaded properly
plot(wclim$wc2.1_2.5m_bio_1, main = expression(atop('WorldClim Bioclimatic Predictor 1', 'Mean Annual Temperature (1970-2000)')))
Now that we have vectorized occurrence data and rasterized predictor variables, it is time to thin the occurrence data. We will use the terra::spatSample() function to take a random sample of a single occurrence per raster cell.
set.seed(1337) # set random generator seed to get reproducible results
# M. coronaria thinning
occ_cor <- spatSample(occ_cor, size = 1,
strata = wclim) # sample one occurrence from each climatic cell
And like last time, we will save this thinned data as a .Rdata file for further downstream analysis. NOTE: I give this file a slightly different name, so that I know it contains the thinned points.
setwd("../occ_data/")
saveRDS(occ_cor, file = 'occThin_cor.Rdata')
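A quick optional check, sketched under the assumption that the earlier save steps used the same file names and working directory, is to compare how many records survive the thinning:
# Optional check: how many records did thinning remove?
nrow(readRDS("occ_cor.Rdata")) # cleaned (unthinned) records
nrow(occ_cor)                  # records after one-per-cell thinning (SpatVector)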
In the Analysis Overview I mentioned in passing that this is a presence/background SDM workflow. Presence/background data sets are cases in which occurrence data is presence-only (like on GBIF), meaning there is no information about the absence of species. This is an important distinction because if we are going to appropriately interpret results, we need to understand the model assumptions. In the presence/background workflow, background (bg) points are randomly sampled from the environment surrounding and overlapping known presence points (more discussion on bg sampling below). This effectively samples the distribution of the 'background' environment across your study extent. The MaxEnt modeling workflow uses this background data to make inferences about absence points, using Bayes' rule. Thus this workflow produces results similar to what you would get from modeling a presence/absence data set (but if presence/absence data is available, use it!).
See the following for more on background points: Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E., & Yates, C. J. (2011). A statistical explanation of MaxEnt for ecologists. Diversity and distributions, 17(1), 43-57. https://doi.org/10.1111/j.1472-4642.2010.00725.x.
So what exactly is the background area, and how can you define a study extent? Should it simply be the convex hull around occurrences? Or should it be all of North America? This depends entirely on the spatial distribution and scale of your data, the model assumptions you are willing to live with, and ecological first principles based on the life history of your key species and your research question. NOTE: Any decision you make will create assumptions and introduce spatial biases into the analysis, and therefore affect the interpretation and predictions of your SDMs.
See the following citation for in depth discussion on defining study extent: Barve, N., Barve, V., Jiménez-Valverde, A., Lira-Noriega, A., Maher, S. P., Peterson, A. T., … & Villalobos, F. (2011). The crucial role of the accessible area in ecological niche modeling and species distribution modeling. Ecological modelling, 222(11), 1810-1819. https://doi.org/10.1016/j.ecolmodel.2011.02.011.
One approach to defining study extent, identified by Barve et al. (2011), is limiting the background to biotic regions (hereafter, ecoregions). In short, this method allows for a compromise between the real-world biology of species and model tractability. In this workflow I am using North American ecoregions, as defined by an international collaboration (between Canada, Mexico and the United States) known as the Commission for Environmental Cooperation. Ecoregions are defined at three levels (I, II and III), or spatial scales, and are openly available at https://www.epa.gov/eco-research/ecoregions-north-america. Choose an appropriate scale for your research question.
In this workflow I opted to go with the Level-II ecoregions, as this level is a compromise between the scale at which sets of environmental conditions and biogeography are grouped together and coverage of relatively large spatial areas that span North America.
# Load libraries
library(tidyverse)
library(terra)
library(predicts)
library(geodata)
library(ENMTools)
library(plotly) # 3D surface Kernel bivariate plots
library(MASS)
Due to the large size of the ecoregion files, I have saved a local copy to my machine and will not push these files to the GitHub repo.
ecoNA <- vect(x = "C:/Users/terre/Documents/Acadia/Malus Project/maps/eco regions/na_cec_eco_l2/", layer = 'NA_CEC_Eco_Level2')
ecoNA <- project(ecoNA, 'WGS84') # project ecoregion vector to same coords ref as basemap
NOTE: I project the ecoregion vector to the same CRS as the climate predictor rasters.
Let’s take a look at the ecoregions (plotted in red), over top of North America.
# download/load maps
getwd()
setwd('../occ_data/')
us_map <- gadm(country = 'USA', level = 1, resolution = 2,
path = "../occ_data/base_maps") #USA basemap w. States
ca_map <- gadm(country = 'CA', level = 1, resolution = 2,
path = '../occ_data/base_maps') #Canada basemap w. Provinces
canUS_map <- rbind(us_map, ca_map) #combine US and Canada vector map
# plot basemap
plot(canUS_map, xlim = c(-180, -50))
# plot ecoregions
lines(ecoNA, col = 'red')
dev.off()
Now let's load our thinned occurrences from the occ_thin.R script.
# load occurrence data
setwd("../occ_data/")
occThin_cor <- readRDS(file = 'occThin_cor.Rdata')
Now that we have the ecoregions and occurrence points loaded, let's load the WorldClim predictor rasters. We want to mask and crop these predictor layers to the ecoregions. This will define the study extent, and subsequently allow us to sample the background.
First we need to know which of the Level-II ecoregions are known to have species occurrences. To do this we will make use of the function terra::extract. NOTE: It could take a few minutes for the extraction to complete.
# M. coronaria ecoregions
# extract ecoregion polygon that contain M. coronaria occurrence points
eco_cor <- extract(ecoNA, occThin_cor) # extract what polygons contained points
# return vector of eco region codes of the polygons that contain occurrences
eco_cor_code <- eco_cor$NA_L2CODE %>% unique()
eco_cor_code <- eco_cor_code[eco_cor_code != '0.0'] #remove the 'water' '0.0' ecoregion
# CODES: "8.1" "8.2" "5.3" "8.4" "8.3" "9.4" "8.5" "5.2"
# View the legend on the epa website for the names of these ecoregions.
Now that we have extracted which ecoregion codes contain at least one species occurrence, we can subset the ecoregion vector to only those selected above.
ecoNA_cor <- terra::subset(ecoNA, ecoNA$NA_L2CODE %in% eco_cor_code) # subset ecoregion spat vector by the codes
plot(ecoNA_cor) # plot the subsetted M. coronaria ecoregions
points(occThin_cor, pch = 3, col = 'red') # plot M. coronaria points
And as always, we want to save this intermediate object as a .Rdata file, so that we do not need to repeat the long computational step of extracting the ecoregions.
setwd('../occ_data/eco_regions')
saveRDS(ecoNA_cor, file = 'ecoNA_cor.Rdata')
Now we can use these ecoregions to crop our predictor climatic rasters. Again, we are doing this so that we can define the study extent and sample the environmental background. Make sure to set the mask argument equal to TRUE.
# crop+mask extent of WorldClim data to the Malus ecoregions
wclim_cor <- terra::crop(wclim, ecoNA_cor, mask = T)
# Save cropped wclim data for downsteam SDM workflow
saveRDS(wclim_cor, file = 'wclim_cor.Rdata')
Now we want to randomly sample background points, and store the associated predictor values as SpatVectors. Alternatively you could also use SpatRasters if you please, but I found SpatVectors easier to work with.
IMPORTANT NOTE: How many background points should be sampled? There is no set default, although it is common in some studies to see 5000 or 10000 background points. We made the decision to sample 20000 points due to the very large spatial extent of this study. With this many points we can feel confident that the background environment is adequately sampled. Note that this sampling is random, and has not yet addressed any spatial biases that may arise when modeling backgrounds (this is foreshadowing).
set.seed(1337) # set a seed to ensure reproducible results
# NOTE: Set arguments as.raster = T to return raster
# OR as.points to return spatvector = T to return spatvector
# M. coronaria background
# SpatVector
# Note upped bg points from 5000 to 20000 to be more suitable to better reflect a mean probability of presence 1 - a/2
cor_bg_vec <- spatSample(wclim_cor, 20000, 'random', na.rm = T, as.points = T) #ignore NA values
plot(wclim_cor[[1]])
points(cor_bg_vec, cex = 0.01)
We can also measure how many km^2 these ecoregions comprise, to get an idea of how many samples are taken per km^2.
expanse(wclim_cor[[1]], unit = 'km') # total area of raster in km^2
# 5683684 km^2
20000/5683684
# ~0.0035 samples/km^2
And, you guessed it, we want to save these SpatVectors of background points for the downstream workflow.
# Save background SpatVectors
setwd("../occ_data/")
saveRDS(cor_bg_vec, file = 'cor_bg_vec.Rdata')
You may also want to extract the actual predictor values and put them into a data.frame for both the occurrence points and background points. To do this you can again make use of the terra::extract function like above, and the terra::values function to get the values from the bg SpatVectors we already created. I show both as they can be useful, but I mainly use the SpatVectors downstream.
# Extracting presence-background raster values
cor_predvals <- extract(wclim_cor, occThin_cor) # M. coronaria
cor_predvals <- cor_predvals[-1] # drop ID column
cor_bgvals <- values(cor_bg_vec) # Extract raster values for bg points
Now we can create and save a data.frame which may come in handy later.
cor_pb <- c(rep(1, nrow(cor_predvals)), rep(0, nrow(cor_bgvals))) #binary (0,1) presence or background string
# combine presence and background data frames for SDM
cor_sdmData <- data.frame(cbind(cor_pb, rbind(cor_predvals, cor_bgvals)))
setwd("../occ_data/")
saveRDS(cor_sdmData, file = 'cor_sdmData.Rdata')
So we have now successfully defined a study extent using ecoregions that contain occurrences, and cropped a predictor raster so that background points could be sampled from it. Then, using these background and occurrence points, we extracted the bioclimatic values from the predictor raster.
Using these values in a data frame (e.g. cor_sdmData) we can complete several collinearity checks. These checks allow us to understand how the climatic variables relate to one another. It is important to note that in a classical statistical approach it is considered good practice to limit collinear variables, to prevent overfitting of a model. This step is referred to as variable selection. With that said, MaxEnt is not a linear modeling approach, and thus does not need to follow assumptions about predictor collinearity. On top of that, MaxEnt is known to be robust to the inclusion of many predictors.
HOWEVER, in the final versions of the SDMs published by Roulston et al., we do subset six bioclimatic variables that are most likely to be predictive of tree species habitat suitability: Bio1 (Tmean), Bio4 (Tvar), Bio10 (Twarmq), Bio11 (Tcoldq), Bio15 (Pvar) and Bio16 (Pwetq). Ultimately these differences can, and did, impact the results of the SDMs, so it is important to consider this when modeling yourself.
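For illustration, a subset like that could be taken with terra::subset. This is only a sketch, assuming the layer names follow the geodata naming convention used elsewhere in this workflow; double-check names(wclim_cor) and the variable definitions on the WorldClim site before relying on it.
# Sketch: restrict the predictor stack to a short list of bioclimatic variables
keep_vars <- c("wc2.1_2.5m_bio_1",  # Bio1  - mean annual temperature
               "wc2.1_2.5m_bio_4",  # Bio4  - temperature seasonality
               "wc2.1_2.5m_bio_10", # Bio10 - mean temperature of warmest quarter
               "wc2.1_2.5m_bio_11", # Bio11 - mean temperature of coldest quarter
               "wc2.1_2.5m_bio_15", # Bio15 - precipitation seasonality
               "wc2.1_2.5m_bio_16") # Bio16 - precipitation of wettest quarter
wclim_cor_sub <- terra::subset(wclim_cor, keep_vars)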
We can start by simply plotting the relationship between all predictor pairs using base R graphics and the pairs function. This can get messy quickly with so many variables, so it is a bit hard to read. It may take a minute to plot…
# M. coronaria
pairs(cor_sdmData[,-1]) # drop the first column of 0/1
Let's use a more helpful way to detect collinearity of predictor variables: a dendrogram, which visualizes the correlation matrix. NOTE: This only considers two-dimensional covariation, and cannot detect higher-dimensional correlations.
# Dendrogram cluster of predictor collinearity
threshold <- 0.7 # set the threshold for colinearity
cor_cors <- raster.cor.matrix(wclim_cor) # pearson correlation
cor_dist <- as.dist(1 - abs(cor_cors)) # calculate distance of predictors
cor_clust <- hclust(cor_dist, method = 'single') # calculate cluster dendrogram
cor_groups <- cutree(cor_clust, h = 1 - threshold) #calculate groupings of variables
plot(cor_clust, hang = -1, main = "M. coronaria Predictors")
rect.hclust(cor_clust, h = 1 - threshold)
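One simple, hedged way to act on those groupings, if you did want to reduce the predictor set, is to keep a single representative variable per correlation cluster; a sketch:
# Sketch: keep the first variable encountered in each correlation cluster
rep_vars <- names(cor_groups)[!duplicated(cor_groups)]
wclim_cor_reduced <- terra::subset(wclim_cor, rep_vars)
names(wclim_cor_reduced) # the retained, less-correlated predictors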
We can also visualize the predictor covariate values to understand how they inform the model about the probability of a true presence. This involves visualizing the distribution of the values at sampled points, presence and background, which allows us to get a peek into how the model constraints come about.
To do this we can compare the density of covariates from occurrence points and background points. I like to visualize this density using kernel density plots. This type of plot allows you to visualize three dimensions: in this case two of the bioclimatic predictor variables, and the density of points sampled (a pseudo-probability). We will build two separate plots to visualize the density of occurrence points and of background points.
First we will create new data frames for each of the climatic variables (in this case mean annual temperature, and mean annual precipitation), so that we can plot them below.
# Predictor Kernel Density Plots
# It is helpful to visualize two predictors pairs of
# Presence points and background points
cor_occ.temp <- cor_sdmData %>% filter(cor_pb == 1) %>% # Presence points
dplyr::select(wc2.1_2.5m_bio_1) %>% # Mean annual temp
drop_na() %>%
unlist()
cor_bg.temp <- cor_sdmData %>% filter(cor_pb == 0) %>% # Background points
dplyr::select(wc2.1_2.5m_bio_1) %>% # Mean annual temp
drop_na() %>%
unlist()
cor_occ.perc <- cor_sdmData %>% filter(cor_pb == 1) %>% # Presence points
dplyr::select(wc2.1_2.5m_bio_12) %>% # Annual precipitation
drop_na() %>%
unlist()
cor_bg.perc <- cor_sdmData %>% filter(cor_pb == 0) %>% # Background points
dplyr::select(wc2.1_2.5m_bio_12) %>% # Annual precipitation
drop_na() %>%
unlist()
Now that all the data is prepared, we will build the kernel bivariate grid using the function MASS::kde2d. This function takes the temperature values as the x coordinate and the precipitation values as the y coordinate, and estimates the z coordinate: the density of occurrences (a pseudo-probability).
cor_occ.3d <- kde2d(cor_occ.temp, cor_occ.perc)
cor_bg.3d <- kde2d(cor_bg.temp, cor_bg.perc)
And plot the results using the plot_ly function from the plotly package. This builds interactive 3D plots, which in this case can be used to visualize the kernel surfaces.
plot_cor.occ_3d <- plot_ly(x=cor_occ.3d$x, y=cor_occ.3d$y, z=cor_occ.3d$z) %>%
add_surface() %>%
layout(scene = list(xaxis = list(title = 'Mean Annual Temp (C)', autotick = F, nticks = 5, tickvals = list(0,5,10,15,20)),
yaxis = list(title = 'Mean Annual Precip. (mm)', tick0=0, tick1=2000, dtick=200),
zaxis = list(title = 'Kernel Density', tick0=0, tick1=0.001, dtick=0.0002)),
title = list(text = "<i>M. coronaria<i> Occurrence Points",
y = 0.95, x = 0.5,
xanchor = 'center',
yanchor = 'top'))
plot_cor.occ_3d # run to view
plot_cor.bg_3d <- plot_ly(x=cor_bg.3d$x, y=cor_bg.3d$y, z=cor_bg.3d$z) %>%
add_surface() %>%
layout(scene = list(xaxis = list(title = 'Mean Annual Temp (C)', tick0=0, tick1=20, dtick=5),
yaxis = list(title = 'Mean Annual Precip. (mm)', tick0=0, tick1=2000, dtick=200),
zaxis = list(title = 'Kernel Density')),
title = list(text = "<i>M. coronaria<i> Background Points",
y = 0.95, x = 0.5,
xanchor = 'center',
yanchor = 'top'))
plot_cor.bg_3d
In this example I only visualize two of the 19 different bioclimatic variables. You may choose to visualize more for your own purposes. Remember this is only a visual comparison, but it is valuable for understanding what the model is 'seeing'; of course the model uses the multidimensional space of all covariates, not just the two shown here.
So now we have (almost) all we need to finally build the SDMs. We have the presence (occurrence) points and background points. We also have the climatic predictor rasters (WorldClim bioclimatic variables). What we don't have is the future bioclimatic data, which we will use with the results of the SDM to make predictions about future suitable habitat under different climatic scenarios.
library(tidyverse) # Grammar and data management
library(terra)# Spatial Data package
library(predicts) # SDM package
library(geodata) # basemaps
library(rJava) # MaxEnt models are dependent on the JDK
library(ENMeval) # Another modeling package, useful for data partitioning (Checkerboarding)
library(raster) # RasterStack dependency (a now-deprecated package)
library(ecospat) # Useful spatial ecology tools
library(parallel) # speed up computation by running in parallel
library(doParallel) # added functionality to parallel
Let’s start by loading in the thinned occurrence points and background points, along with some base maps of Canada, USA, and Mexico. (Note: In the end I decided not to use these boundaries for plotting, but you may want to for your purposes so I left the chunk here).
# Download/load basemaps
us_map <- gadm(country = 'USA', level = 1, resolution = 2,
path = "../occ_data/base_maps") #USA basemap w. States
ca_map <- gadm(country = 'CA', level = 1, resolution = 2,
path = '../occ_data/base_maps') # Canada basemap w. Provinces
mex_map <-gadm(country = 'MX', level = 1, resolution = 2,
path = '../occ_data/base_maps') # Mexico basemap w. States
canUSMex_map <- rbind(us_map, ca_map, mex_map) # Combine Mexico, US and Canada vector map
NA_ext <- ext(-180, -30, 18, 85) # Set spatial extent of analysis to NA in Western Hemisphere
canUSMex_map <- crop(canUSMex_map, NA_ext) # crop to Western Hemisphere
plot(canUSMex_map) # plot basemap
We are also going to load a shapefile that outlines the Great Lakes. This helps make the figures prettier, and we can also use the vector layer to mask the predictor layers so that no predictions are made over the Great Lakes (Malus cannot swim…). You can download the shapefiles from the URL below.
# Great Lakes shapefiles for making pretty maps
# Shape files downloaded from the USGS (https://www.sciencebase.gov/catalog/item/530f8a0ee4b0e7e46bd300dd)
great_lakes <- vect('C:/Users/terre/Documents/Acadia/Malus Project/maps/great lakes/combined great lakes/')
Now let's download the future bioclimatic data. For this we will need to specify a few additional arguments compared to when we downloaded the historical climate data. We will again use the geodata package, but instead of worldclim_global we will use cmip6_world.
This function will download data for projected future climates from Coupled Model Intercomparison Project Phase 6 (CMIP6) generated from Global Climate Models (GCMs). CMIP6 projections are based on the Shared Socio-economic Pathway (SSP) scenarios, included in the Intergovernmental Panel on Climate Change (IPCC) Assessment Report (AR6).
There are several SSPs, four of which are available through geodata: "126", "245", "370" and "585", for SSP1-2.6, SSP2-4.5, SSP3-7.0 and SSP5-8.5. Data are also available for three time periods: "2021-2040", "2041-2060", and "2061-2080"; or early century (2030), mid-century (2050) and late century (2070). In this workflow I will model all three timeframes, and I chose two SSPs, SSP2-4.5 and SSP5-8.5. The former represents a 'middle of the road' scenario with high climate adaptation and low climate mitigation; the latter is a 'worst case' or 'business as usual' scenario in which global development continues down an unsustainable route with increased fossil fuel reliance. It is worth noting that SSP5-8.5 is the current tracking scenario: https://news.stanford.edu/2023/01/30/ai-predicts-global-warming-will-exceed-1-5-degrees-2030s/.
The final decision to make is which CMIP6 climate model you are going to use. There are several models available through geodata, all of which have their individual strengths and weaknesses. For the purposes of this study, as I am primarily concerned with Malus habitat in Canada, I chose to use the Canadian Earth System Model version 5 (CanESM5). This model is still a global model, but it is commonly employed for climate research in Canada. Read more here: https://gmd.copernicus.org/articles/12/4823/2019/gmd-12-4823-2019.html
NOTE: Climate projections are just that: projections. These models are useful, but can be flawed for a host of reasons. Recently it has come to the attention of climate modelers that many of the models included in CMIP6 are 'too hot', meaning the rates of warming may be too high, although this is not true for every region. Read more here: https://www.science.org/content/article/use-too-hot-climate-models-exaggerates-impacts-global-warming.
# Historical climate 1970-2000
wclim <- geodata::worldclim_global(var = 'bio',
res = 2.5,
version = '2.1',
path = "../wclim_data/") %>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
# SSP (Shared Socio-economic Pathway) 2-4.5
# middle of the road projection, high climate adaptation, low climate mitigation
ssp245_2030 <- cmip6_world(model = "CanESM5",
ssp = "245",
time = "2021-2040",
var = "bioc",
res = 2.5,
path = "../wclim_data/") %>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
ssp245_2050 <- cmip6_world(model = "CanESM5",
ssp = "245",
time = "2041-2060",
var = "bioc",
res = 2.5,
path = "../wclim_data/") %>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
ssp245_2070 <- cmip6_world(model = "CanESM5",
ssp = "245",
time = "2061-2080",
var = "bioc",
res = 2.5,
path = "../wclim_data/") %>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
# SSP 5-8.5
# low regard for environmental sustainability, increased fossil fuel reliance; this is the current tracking projection
ssp585_2030 <- cmip6_world(model = "CanESM5",
ssp = "585",
time = "2021-2040",
var = "bioc",
res = 2.5,
path = "../wclim_data/") %>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
ssp585_2050 <- cmip6_world(model = "CanESM5",
ssp = "585",
time = "2041-2060",
var = "bioc",
res = 2.5,
path = "../wclim_data/") %>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
ssp585_2070 <- cmip6_world(model = "CanESM5",
ssp = "585",
time = "2061-2080",
var = "bioc",
res = 2.5,
path = "../wclim_data/")%>%
crop(NA_ext) %>% #crop raster to NA
mask(great_lakes, inverse = T) # cut out the great lakes
NOTE: Just like with the previous climate data, DO NOT push these large files to a github repo due to the size of the files.
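Since the six cmip6_world calls above differ only in their ssp and time arguments, you could wrap the repeated steps in a small convenience function; a sketch (the helper name is my own, not part of any package):
# Sketch: helper to avoid repeating the download/crop/mask steps per scenario
get_cmip6 <- function(ssp, time) {
  cmip6_world(model = "CanESM5", ssp = ssp, time = time,
              var = "bioc", res = 2.5, path = "../wclim_data/") %>%
    crop(NA_ext) %>%                  # crop raster to North America
    mask(great_lakes, inverse = TRUE) # cut out the Great Lakes
}
# e.g. ssp245_2030 <- get_cmip6("245", "2021-2040")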
Now that we have the future climate data we need to do a bit more prep work to get ready for the SDMs. First we should load in the cropped climate rasters from the background sampling script. I am doing this so I can visualize a spatial sampling method called checkerboarding (more details to come below).
# These Rasters are useful for sampling spatial checkerboards
# and making habitat suitability predictions (Historical and under future SSPs climate scenarios)
setwd('../wclim_data')
# Historical (1970-2000)
wclim_cor <- readRDS(file = 'wclim_cor.Rdata')
wclim_cor_stack <- raster::stack(wclim_cor) # convert SpatRaster to RasterStack for the ENMeval checkerboarding dependency
wclim_fus <- readRDS(file = 'wclim_fus.Rdata')
wclim_fus_stack <- raster::stack(wclim_fus) # convert SpatRaster to RasterStack for the ENMeval checkerboarding dependency
NOTE: The stack function is from the deprecated raster package, so it may not work in future versions of ENMeval (the package used for checkerboarding and SDM evaluation). The closest equivalent in the more up-to-date terra package is the c() method, which can be used to combine SpatRasters, but it does not produce the RasterStack object signature required here.
Then, for ease of making predictions downstream, we will extract the names of each bioclimatic raster layer so that we can rename the layers of the projected climate rasters. This is so that when making predictions with our SDM downstream, the layer names correspond.
climate_predictors <- names(wclim_cor) # extract climate predictor names, to rename layers in the rasters below
# This is important to do for making predictions once the SDMs have been made on future climate data
# Note that the names of the layers still correspond to the same environmental variables
# Future SSPs
# Do not need to create RasterStacks
# SSP 245
names(ssp245_2030) <- climate_predictors #rename raster layers for downstream analysis
names(ssp245_2050) <- climate_predictors
names(ssp245_2070) <- climate_predictors
# SSP 585
names(ssp585_2030) <- climate_predictors #rename raster layers for downstream analysis
names(ssp585_2050) <- climate_predictors
names(ssp585_2070) <- climate_predictors
So now we have all our data prepped, and it's time to start getting things sorted for the SDM. In the next few pieces of code I am going to show you how to manually partition data into training and testing groups. There are several reasons for doing this, the most important being evaluating the performance and generalization of the model. This is called cross-validation, whereby the species occurrence data is partitioned into training data that gets fed into the model, and testing data that is withheld and used to evaluate how well the trained model fits the 'new' testing occurrences. There are several different ways to partition data, all of which have pros/cons.
The simplest method is random k-fold, where data is equally binned into groups ('folds'). However, this can be flawed, as it can preserve bias in the training and testing data. It is preferable to use spatial partitioning to reduce spatial autocorrelation and preserve spatial heterogeneity, which can help reduce model overfitting. This creates partitions that are much more ecologically relevant, so that the model can learn how the species is spatially organized, which in turn makes the predictions of the model more generalizable for future climate predictions.
There are several different methods to spatially partition occurrence data. A common approach is to overlay a square grid and take a checkerboard pattern of occurrences, where occurrences are partitioned depending on which square they fall in. The method I will demonstrate completes 4-fold cross-validation, where four partitions are taken: three are used for training and one for testing. As I mentioned, I will show you how to do this manually, but the MaxEnt function we use later will do this partitioning for us, as well as cycle through and withhold each group.
Read more about spatial partitioning and checkerboarding in the following two articles.
Radosavljevic, A., & Anderson, R. P. (2014). Making better Maxent models of species distributions: complexity, overfitting and evaluation. Journal of biogeography, 41(4), 629-643. https://doi.org/10.1111/jbi.12227.
Muscarella, R., Galante, P. J., Soley‐Guardia, M., Boria, R. A., Kass, J. M., Uriarte, M., & Anderson, R. P. (2014). ENM eval: An R package for conducting spatially independent evaluations and estimating optimal model complexity for Maxent ecological niche models. Methods in ecology and evolution, 5(11), 1198-1205. https://doi.org/10.1111/2041-210X.12261.
Let’s get started with the partitions.
# Spatial partitioning preparation
occ_cor_coords <- as.data.frame(geom(occThin_cor)[,3:4]) # extract longitude, latitude from occurrence points
bg_cor_coords <- as.data.frame(geom(cor_bg_vec)[,3:4]) # extract longitude, latitude from background points
occ_fus_coords <- as.data.frame(geom(occThin_fus)[,3:4]) # extract longitude, latitude from occurrence points
bg_fus_coords <- as.data.frame(geom(fus_bg_vec)[,3:4]) # extract longitude, latitude from background points
The geom function will get the geometry of the occurrences from the SpatVectors and store them in data frames.
# Spatial Checkerboard partitioning of occurrence and background points cross-fold validation
set.seed(1337)
agg_factor <- c(9,9) # this defines how many adjacent cells are aggregated into the same checkerboard 'square'
# I visually tested different aggregations and found 9,9 was the best fit given the spatial extent
Setting a random generator seed will ensure that randomized partitions yield the same results consistently. The aggregation (agg) factor determines the scale at which the occurrences are aggregated into grid cells; the first value specifies the finer-scale aggregation and the second the coarser scale used by the hierarchical checkerboard.
NOTE: We are using the get.checkerboard2 function to get the 4-fold cross-validation. This function is convenient, but there is no reason why you could not do more partitions if you wanted to. There are four required arguments: 1. the species occurrence coordinates, 2. the background point coordinates, 3. a predictor RasterStack that is cropped to the extent of the background, and 4. the aggregation factor defined above. Then we can use the evalplot.grps function to visualize the partitions.
cor_cb <- get.checkerboard2(occs = occ_cor_coords,
bg = bg_cor_coords,
envs = wclim_cor_stack,
aggregation.factor = agg_factor
)
evalplot.grps(pts = occ_cor_coords, pts.grp = cor_cb$occs.grp, envs = wclim_cor_stack, pts.size = .75) # plot the checkerboard partitions of occurrence points
evalplot.grps(pts = bg_cor_coords, pts.grp = cor_cb$bg.grp, envs = wclim_cor_stack, pts.size = .75) # plot the checkerboard partitions of occurrence bg points
# plotting the background points may return an error after checkerboard sampling; this can happen because the aggregation factor removes some observations
# It is fine, as we do not strictly need to visualize the groups. If you want to, you must reconcile the difference in observations.
First let's look at the species occurrence partitions. It is hard to see the checkerboard pattern here.
Now let's look at the background points. As you can see, the checkerboarding is much clearer. I also show how adjusting the aggregation factor changes the scale at which the partitions are taken.
(Background partition plots are shown for aggregation factors of 3,3; 6,6; 9,9 (which I chose to use); and 12,12.)
Okay, now that we have played around with understanding how background point partitioning works, let's dive into developing our MaxEnt model. We are going to use parallel processing to utilize multiple processors (CPUs) simultaneously. This will increase the amount of available memory and significantly decrease the computation time. If you are using a Windows computer (like me), then you only have one type of parallel processing available to you, known as 'socket' clusters. Read more here: https://www.r-bloggers.com/2019/06/parallel-r-socket-or-fork/#:~:text=As%20a%20result%2C%20it%20runs,40%25%20faster%20than%20the%20socket.
First let’s take a look at the code chunk and then I will break each argument down.
# Build Species Distribution Model using MaxEnt from the <ENMeval> package
# Run prediction in a parallel using 'socket' clusters to help speed up computation
# <ENMeval> implements parallel functions natively
# But load the <parallel> library for additional functions like <detectCores()>
cn <- detectCores(logical = F) # logical = F, is number of physical RAM cores in your computer
set.seed(1337)
cor_maxent <- ENMevaluate(occ = occ_cor_coords, # occurrence records
envs = wclim_cor, # environment from background training area
n.bg = 20000, # 20000 bg points
tune.args =
list(rm = seq(0.5, 8, 0.5),
fc = c("L", "LQ", "H",
"LQH", "LQHP", "LQHPT")),
partition.settings =
list(aggregation.factor = c(9, 9), gridSampleN = 20000), # 9,9 agg
partitions = 'checkerboard2',
parallel = TRUE,
numCores = cn - 1, # leave one core available for other apps
parallelType = "doParallel", # use doParrallel on Windows - socket cluster
algorithm = 'maxent.jar')
NOTE: The cn object is the number of physical CPU cores your computer has. The detectCores function requires that you load the parallel library. Also, make sure to set the random generator seed to achieve consistent results.
Breakdown of the ENMevaluate function: The first argument it expects is a data.frame of all species occurrences with two columns, longitude and latitude, in that specific order. Like in the example we did above, we want to set n.bg = 20,000 background points to get a comprehensive sample. Next, the envs argument is a SpatRaster of bioclimatic predictor data cropped to the background extent (or ecoregions); we can reuse the cropped raster we produced in the malus_bg.R script. Finally, we can adjust, or tune, the arguments of the MaxEnt algorithm with tune.args.
- rm stands for "regularization multiplier", which affects how focused or closely-fitted the output distribution is. Smaller values can lead to model overfitting, so it is good to test different values.
- fc stands for "feature class". This corresponds to a mathematical transformation of the different covariates used in the model to allow complex relationships to be modeled: L = linear, Q = quadratic, H = hinge, P = product and T = threshold.
- partition.settings is where to control the aggregation factor we played around with above. Also make sure to set the number of grid cells included in the partitioning equal to the number of background points sampled; it is 10,000 by default.
- partitions selects which cross-validation method will be used. Like above, we are going to use the 4-fold checkerboard partitioning. The benefit of doing this in the ENMevaluate function is that the model will partition and then cross-validate across all groupings, meaning it rotates which data is used for training and which for testing. It will then produce a continuous Boyce Index (cbi) correlating each set of testing data against the model fits.
- parallel: set this equal to TRUE to ensure parallel processing is enabled. If it is not enabled you will likely get a runtime or memory error, especially with complex models.
- numCores is where we specify how many cores RStudio can simultaneously access. It is good practice to leave one core free to run other processes on your computer.
- parallelType: set to "doParallel" to run a socket cluster. This is the only type available on Windows. The other method, "doSNOW", will run a fork cluster, which can decrease computational time, but is only available on Mac and Linux.
- algorithm selects which SDM algorithm you want to use. In this case I am using the maxent.jar algorithm, which is a version of MaxEnt implemented in the dismo package using Java. Another version of MaxEnt is also available, implemented in native R with the maxnet package. It is worth noting that there are subtle differences between them. Personally, I find the memory usage with maxnet to be much greater than with maxent.jar, which means longer computational time. The benefit of maxnet is that you do not have to install the JDK, which can be a challenge.
NOTE: The above model took 39 minutes and 32.8 seconds to compute. My laptop has a 3.2 GHz AMD Ryzen 7 5800H processor with eight cores. Depending on your own PC specs, model complexity (including the number of different tuning settings), and study extent, your modeling could take more or less time.
And just like all the previous long computations, we want to save a .Rdata file so we can reload the model if needed. NOTE: These files are large (>200 MB); DO NOT PUSH them to git.
# Save the MaxEnt model so you do not have to waste time re-running the model
setwd('../sdm_output')
saveRDS(cor_maxent, file = 'cor_maxent.Rdata') # save
cor_maxent <- readRDS(file = 'cor_maxent.Rdata') # load
The next step, after running several model combinations (96 to be precise), is to select the 'best' model (with particular tuning arguments). A common approach is to use the Akaike information criterion (AIC), specifically delta AICc, which is a corrected AIC value expressed as the relative difference in AICc values of each model. Models with delta AICc values less than 2 are considered to be equally good. In my case, there was only a single model with delta AICc less than 2, the best performing model. Typically, to break ties between well-performing models, the least complex model (meaning the simplest fc) and/or the model with the lowest rm value is chosen.
# Note that maxent results provide Continuous Boyce Index (cbi)
best_cor_maxent <- subset(cor_maxent@results, delta.AICc == 0) # selects the best performing model based on delta AICc - returns data frame object
mod.best_cor_maxent <- eval.models(cor_maxent)[[best_cor_maxent$tune.args]] # extracts the best model - returns MaxEnt object
So now that we have selected our best performing MaxEnt model, we can start to make some predictions! We are going to use the `terra` package to create a `SpatRaster` of habitat suitability scores. First, let's make some predictions using the historical WorldClim data. In this case we want to use the uncropped WorldClim `SpatRaster`, to make predictions across North America; why will become clearer once we get to the future climate change predictions.
# Wclim is the historical climatic conditions (1970-2000)
cor_pred_hist <- terra::predict(wclim, mod.best_cor_maxent, cores = cn - 1)
plot(cor_pred_hist)
points(occThin_cor, cex = 0.05)
We have made our first prediction of the fundamental niche of M. coronaria! But there are a few more things to consider before moving on.
First, how confident can we be in these predictions? Metrics like the Area Under the Curve (AUC) and the Continuous Boyce Index can help us, but they have their own flaws, so it is important to think critically about the results of your predictions. Note that `ENMevaluate` and the MaxEnt algorithm automatically compute several indices, including separate indices calculated on the training data and on the validation (testing) data; a quick way to inspect them is shown below.
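A small sketch for peeking at those metrics (`eval.results()` is the ENMeval accessor for the same `@results` slot used in the model-selection step above; `@results.partitions` holds the per-fold values):
# Summary metrics for every tuning combination (training and validation AUC, CBI, AICc, delta AICc, ...)
eval.results(cor_maxent)        # equivalent to cor_maxent@results
# The same metrics reported per checkerboard fold
cor_maxent@results.partitions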
The `ecospat.boyce` function uses a moving window that forms classes spanning 1/10th of the suitability range. The Boyce Index is then calculated as a Spearman rank correlation between the predicted-to-expected (P/E) ratio and the habitat suitability scores. What we expect to see is a steady increase in P/E with increasing suitability scores.
# Evaluate predictions using Boyce Index
# the number of true presences should decline with suitability groups 100-91, 90-81, etc.
# First extract suitability values for the background and presence points, make sure to omit NA values
corPred_bg_val <- terra::extract(cor_pred_hist, bg_cor_coords)$lyr1 %>%
na.omit()
corPred_val_na <- terra::extract(cor_pred_hist, occ_cor_coords)$lyr1 %>%
na.omit()
# Evaluate predictions using Boyce Index
ecospat.boyce(fit = corPred_bg_val, # vector of predicted habitat suitability of bg points
              obs = corPred_val_na, # vector of predicted habitat suitability at the presence points
              nclass = 0,
              PEplot = TRUE,
              method = 'spearman')
Second, is it best to leave these suitability scores as a continuous gradient of habitat suitability? I (and many others) find maps with colour gradients illegible, or at least less informative at a glance. Some take the approach of creating a binary threshold, where habitat suitability becomes a simple yes/no. There are several methods to determine this binary threshold, which can be applied using the `predicts` package. A flaw of this approach is that it is less tangible what these thresholds mean in lay terms, and biology is never as simple as YES/NO.
With that said (foreshadowing), these binary thresholds are very useful for completing gap analyses. A gap analysis is a natural next step after completing an SDM, used to understand how much of a species' suitable habitat is covered by protected areas (PAs).
cor_pa <- predicts::pa_evaluate(p = occ_cor_coords_mat, a = bg_cor_coords_mat, model = cor_maxent, x = wclim_cor)
cor_threshold <- predicts::threshold(cor_pa)
cor_hist_habitat <- cor_pred_hist > cor_threshold$max_spec_sens #the threshold at which the sum of the sensitivity (true positive rate) and specificity (true negative rate) is highest
However, I think it is more practical to bin suitability scores into several categories. Let's say three categories: "High suitability", "Moderate suitability" and "Low suitability". These categories are determined using quantiles, where suitability scores in the lowest percentiles are considered low or moderately suitable, and higher scores are considered highly suitable. In this case I use the 1st, 10th, and 50th percentiles to classify these categories. This means that in the low habitat category we are only omitting 1% of the predicted suitable values, compared to the 50th percentile, where we are omitting 50% of the predicted values. It is important to understand that this approach also has flaws, but typically the 10th percentile is considered an acceptable minimum threshold.
corPred_val <- terra::extract(cor_pred_hist, occ_cor_coords)$lyr1
corPred_threshold_1 <- quantile(corPred_val, 0.01, na.rm = T) # Low suitability
corPred_threshold_10 <- quantile(corPred_val, 0.1, na.rm = T) # Moderate suitability
corPred_threshold_50 <- quantile(corPred_val, 0.5, na.rm = T) # High suitability
# Save suitability thresholds
setwd('../sdm_output/thresholds')
saveRDS(corPred_threshold_1, file = 'corPred_threshold_1.Rdata')
saveRDS(corPred_threshold_10, file = 'corPred_threshold_10.Rdata')
saveRDS(corPred_threshold_50, file = 'corPred_threshold_50.Rdata')
Now let’s visualize the results! As mentioned previously, we will tidy things up downstream to make publication quality plots. But, this is a great start!
legend_labs <- c('Low Suitability', 'Moderate Suitability', 'High Suitability')
# brewer.pal(3, "YlOrBr") # Brewer palettes are helpful for mapping colour gradients
fill_cols <- c("#FFF7BC", "#FEC44F", "#D95F0E")
par(mar = c(5, 5, 5, 5))
terra::plot(cor_pred_hist > corPred_threshold_1, col = c('#E8E8E8', '#FFF7BC'), legend = F, xlim = c(-100, -50), ylim = c(30, 60), main = expression(atop(italic('Malus coronaria'), " Historical Suitability (1970-2000)")), background = 'lightskyblue1')
terra::plot(cor_pred_hist > corPred_threshold_10, col = c(NA, '#FEC44F'), add = T, legend = F)
terra::plot(cor_pred_hist > corPred_threshold_50, col = c(NA, '#D95F0E'), add = T, legend = F)
terra::plot(canUSMex_map, add = T)
points(occThin_cor, col = 'black', cex = 0.75, pch = 4)
legend(x = -75, y = 30, xpd = NA, inset = c(5, 0),
title = 'Habitat Suitability',
legend = legend_labs,
fill = fill_cols)
Now we can move on to making predictions for the future climate scenarios. We will use the same threshold values as the historical prediction to keep the interpretation consistent. As always, make sure to save the predictions so they can be reloaded without re-running.
# Future Climate predictions
# SSP 245
cor_pred_ssp245_30 <- terra::predict(ssp245_2030, mod.best_cor_maxent, cores = cn - 1)
cor_pred_ssp245_50 <- terra::predict(ssp245_2050, mod.best_cor_maxent, cores = cn - 1)
cor_pred_ssp245_70 <- terra::predict(ssp245_2070, mod.best_cor_maxent, cores = cn - 1)
# SSP585
cor_pred_ssp585_30 <- terra::predict(ssp585_2030, mod.best_cor_maxent, cores = cn - 1)
cor_pred_ssp585_50 <- terra::predict(ssp585_2050, mod.best_cor_maxent, cores = cn - 1)
cor_pred_ssp585_70 <- terra::predict(ssp585_2070, mod.best_cor_maxent, cores = cn - 1)
# Save M cor. SDM predictions
setwd('../sdm_output')
saveRDS(cor_pred_hist, file = 'cor_pred_hist.Rdata')
saveRDS(cor_pred_ssp245_30, file = 'cor_pred_ssp245_30.Rdata')
saveRDS(cor_pred_ssp245_50, file = 'cor_pred_ssp245_50.Rdata')
saveRDS(cor_pred_ssp245_70, file = 'cor_pred_ssp245_70.Rdata')
saveRDS(cor_pred_ssp585_30, file = 'cor_pred_ssp585_30.Rdata')
saveRDS(cor_pred_ssp585_50, file = 'cor_pred_ssp585_50.Rdata')
saveRDS(cor_pred_ssp585_70, file = 'cor_pred_ssp585_70.Rdata')
For the purposes of this tutorial I am only going to show the results for one of the future climate predictions, SSP5-8.5 for 2020-2040 (the projection we are currently tracking). The code is the same as for the historical suitability plot above; a sketch with the layer swapped in is shown below.
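# Same plotting approach as the historical map, with the SSP5-8.5 2020-2040 layer swapped in
terra::plot(cor_pred_ssp585_30 > corPred_threshold_1, col = c('#E8E8E8', '#FFF7BC'), legend = F, xlim = c(-100, -50), ylim = c(30, 60), main = expression(atop(italic('Malus coronaria'), " SSP5-8.5 Suitability (2020-2040)")), background = 'lightskyblue1')
terra::plot(cor_pred_ssp585_30 > corPred_threshold_10, col = c(NA, '#FEC44F'), add = T, legend = F)
terra::plot(cor_pred_ssp585_30 > corPred_threshold_50, col = c(NA, '#D95F0E'), add = T, legend = F)
terra::plot(canUSMex_map, add = T)
points(occThin_cor, col = 'black', cex = 0.75, pch = 4)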
We can see a clear northward shift in the fundamental niche of M. coronaria. At the same time, the climate appears to remain favourable for M. coronaria in Southern Ontario, with an expansion into Quebec and the Maritimes. Also pay attention to areas where fundamental habitat is being lost, especially in the US. Remember that these are simply predictions that carry a lot of assumptions.
Now that we have made all our predictions and saved them, along with the thresholds, as `.Rdata` files, let's move on to tidying up the plots to prepare them for a results section and publication! Also, if you have made it this far, take a minute to appreciate just how much work has been accomplished.
One of my favourite, but most time consuming, steps in an SDM workflow is producing figures. Designing a plot that effectively displays results while remaining simple and aesthetically pleasing can take a lot of time and fine tuning. Not to mention that plotting spatial data can be difficult, to say the least. With that said, once you do nail a plot it is so rewarding!
It is important to note that there are several different approaches and packages you can use to plot spatial vectors and rasters. One of the most popular is `ggplot2` (a.k.a. ggplot), which is highly maintained and polished. It can be a bit of a learning curve, but once you get the hang of it, it is THE way to make (most) plots. Not to mention there are endless packages that add functionality to `ggplot2` and can do a lot of cool things.
Now, with that said, for the purposes of my publication and workflow I will be using the plotting functionality built into `terra`. As `terra` is a newer package, there is less support and documentation for it in `ggplot2`, but a package called `tidyterra` that bridges the two is well into development (see the small sketch below). Also, the `terra::plot` function builds off the base R graphics package, which can have its own advantages.
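If you do want to stay in `ggplot2`, a minimal `tidyterra` sketch looks something like the following. It is purely illustrative (I do not use it below) and assumes a `SpatRaster` such as the saved `cor_pred_hist` layer.
library(ggplot2)
library(tidyterra)
# Plot a SpatRaster with ggplot2 via tidyterra
ggplot() +
  geom_spatraster(data = cor_pred_hist) +
  scale_fill_viridis_c(na.value = NA) +
  theme_minimal()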
Now, what exactly goes into a 'publication quality' plot? There is a long list of things that separate a good figure from a bad one. I will go over a few decisions I have made, but I will not begin to scratch the surface. I highly recommend taking a look at this YouTube video by the Riffomonas Project on creating publication quality figures (https://youtu.be/3y7pJYURspA?si=NsdrJU9MxnmNF6b_). The creator, Dr. Pat Schloss, is a microbiologist at the University of Michigan. He has an immense experience base in computational statistics, and a talent for simply explaining complex code. Check out his catalog of videos and his website!! I take a lot of inspiration from his philosophy on open access statistics and coding, and I have certainly learned a lot from him. Okay, that is enough about my statistics crush…
For a `terra`-specific introduction to plotting, check out this great introduction to making simple maps with `terra` by my collaborator Dr. Tyler Smith (https://plantarum.ca/2023/02/13/terra-maps/). He has recently updated the post to keep up with all the new `terra` functionality. Also check out the rest of his blog!
As always, let’s start by loading the raster layers and thresholds we saved from the SDM outputs.
NOTE: Due to the Lambert projection, the x and y axes now use a projected coordinate system rather than standard geographic coordinates (latitude and longitude), so the axis limits must be set differently. If you still want to plot lat/lon coordinates, you need to project a graticule, which I cover below.
Let’s begin…
# Load occurrences and raster/vectors
# Occurrence Points in SpatVectors
getwd()
setwd("../occ_data/")
occThin_cor <- readRDS(file = 'occThin_cor.Rdata') # M. coronaria
occThin_fus <- readRDS(file = 'occThin_fus.Rdata') # M. fusca
# Great Lakes shapefiles for making pretty maps
# Shape files downloaded from the USGS (https://www.sciencebase.gov/catalog/item/530f8a0ee4b0e7e46bd300dd)
great_lakes <- vect('C:/Users/terre/Documents/Acadia/Malus Project/maps/great lakes/combined great lakes/')
While plotting the SDM results in the previous script, I realized that the `gadm` country boundaries were making the results of the SDM really hard to see along the coast. This was especially true for Malus fusca on the west coast of British Columbia, where there are many inlets and bays (see the example of a preliminary map below). Because of this, I chose to scrap the admin boundaries that outline the entirety of Canada and the US, and instead download a shapefile of just the border.
# Canada/US Border for showing the international line. Gadm admin boundaries trace the entire country, vs this is just the border
# Much easier to see the SDM results along coastlines where tracing obscures the data
# Downloaded from https://koordinates.com/layer/111012-canada-and-us-border/
can_us_border <- vect('C:/Users/terre/Documents/Acadia/Malus Project/maps/can_us border')
A small tweak I made to the vector was to remove the border lines drawn in the ocean on both coasts, purely for aesthetic reasons. Thankfully, each segment of the vector is clearly named and easy to drop.
can_us_border$SectionEng # access variable names of each segment
# Two line segments are in the water and are not needed in this case, lets remove them to make the maps look prettier
segments_to_remove <- c("Gulf of Maine", "Straits of Georgia and Juan de Fuca")
can_us_border <- can_us_border[!can_us_border$SectionEng %in% segments_to_remove, ]
Up until now, all the maps produced above have used the World Geodetic System 1984 (a.k.a. WGS84) geographic coordinate system, the de facto standard for most web apps (and the basis of the pseudo-Mercator projection they use). With that said, it is not very nice for producing map figures because it distorts the shape and size of areas. So we will be using the Lambert Conformal Conic projection instead. This projection is ideal for mapping Canada and the US because its east-west orientation minimizes distortion across these large, elongated regions, providing a more accurate representation of geographic features and distances. It is also pretty much the standard projection in every Canadian geography textbook, and it is the official mapping projection of the Government of Canada.
For an introduction to projections and 'Coordinate Reference Systems' (CRS) in R, visit https://rspatial.org/raster/spatial/6-crs.html.
projLam <- "+proj=lcc +lat_1=49 +lat_2=77 +lat_0=49 +lon_0=-95 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs"
As I mentioned above, when using a projection in R other than WGS84, you will need to produce graticules to show latitude and longitude on the new coordinate system. There is a function in `terra` to do this. I am not going to use it for my plots, but it is good to know how to do it!
# Graticules
g.cor <- terra::graticule(
  lon = seq(-110, -55, by = 10), # Longitude from 110°W to 55°W
  lat = seq(30, 55, by = 10),    # Latitude from 30°N to 55°N
  crs = projLam
)
So now we need to get all of the spatial data we are going to project on the same coordinate system (The Lambert). This can take a few minutes to run!
# Great lakes
great_lakes.lcc <- project(great_lakes, projLam)
# Can/US border
can_us_border.lcc <- project(can_us_border, projLam)
# Occurrences
occThin_cor.lcc <- project(occThin_cor, projLam)
occThin_fus.lcc <- project(occThin_fus, projLam)
# Habitat suitability
cor_pred_hist.lcc <- project(cor_pred_hist, projLam)
cor_pred_ssp245_30.lcc <- project(cor_pred_ssp245_30, projLam)
cor_pred_ssp245_50.lcc <- project(cor_pred_ssp245_50, projLam)
cor_pred_ssp245_70.lcc <- project(cor_pred_ssp245_70, projLam)
cor_pred_ssp585_30.lcc <- project(cor_pred_ssp585_30, projLam)
cor_pred_ssp585_50.lcc <- project(cor_pred_ssp585_50, projLam)
cor_pred_ssp585_70.lcc <- project(cor_pred_ssp585_70, projLam)
Ok we have done all the prep work we needed to do, now let’s make some figures!
# We need three colours that clearly show the threshold categories
brewer.pal(3, "YlOrBr") # Brewer palettes are helpful for mapping colour gradients
legend_labs <- c('Low Suitability', 'Moderate Suitability', 'High Suitability')
fill_cols <- c("#FFF7BC", "#FEC44F", "#D95F0E")
# Then we need to determine some plot limits. Unfortunately it is a bit more tricky compared to the Mercator projection
# It can take some trial and error
cor.xlim <- c(-5*10^5, 3.1*10^6)
cor.ylim <- c(-2*10^6, 2*10^6)
I want to make a figure with multiple panels. You could absolutely do this in R, BUT sometimes the simpler solution can be easier. In this case I am going to produce all the elements/panels of the figure and then assemble them in PowerPoint! To start, I am going to produce a legend.
# Plot a legend that can be saved on its own
plot(NULL ,xaxt='n',yaxt='n',bty='n',ylab='',xlab='', xlim=0:1, ylim=0:1)
legend('center', xpd = NA, title = c(as.expression(bquote(bold('Habitat Suitability')))), legend = legend_labs, fill = fill_cols, cex = 3)
# Plot historical distribution
terra::plot(cor_pred_hist.lcc > corPred_threshold_1, col = c('#E8E8E8', '#FFF7BC'),
background = 'lightskyblue1',
legend = F,
xlim = cor.xlim, ylim = cor.ylim,
main = " Historical Suitability (1970-2000)",
cex.main = 3,
axes = F,
box = T,
mar = c(5, 5, 5, 5))
terra::plot(cor_pred_hist.lcc > corPred_threshold_10, col = c(NA, '#FEC44F'), add = T, legend = F)
terra::plot(cor_pred_hist.lcc > corPred_threshold_50, col = c(NA, '#D95F0E'), add = T, legend = F)
terra::plot(can_us_border.lcc, add = T)
#terra::plot(g.cor, add = T, retro = T, col = '#999999',lab.lat = c(2,3,4), lab.lon = c(2,3,4,5), lab.cex = 1.5, mar = c(5,5,5,5), lab.loc = 1, off.lat = -0.01) # Code for adding a graticule to the plot.
Now, I have repeated the plotting process for all time periods and both SSPs. I exported each panel (one way to do this is sketched below), brought them into PowerPoint, and produced the following figure!
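For reference, one way to export each panel at a fixed size before assembling them; the file name and dimensions are placeholders.
# Export the historical panel to a PNG for assembly in PowerPoint
png('cor_hist_panel.png', width = 2400, height = 2000, res = 300)
terra::plot(cor_pred_hist.lcc > corPred_threshold_1, col = c('#E8E8E8', '#FFF7BC'),
            background = 'lightskyblue1', legend = F,
            xlim = cor.xlim, ylim = cor.ylim,
            main = " Historical Suitability (1970-2000)", cex.main = 3,
            axes = F, box = T, mar = c(5, 5, 5, 5))
terra::plot(cor_pred_hist.lcc > corPred_threshold_10, col = c(NA, '#FEC44F'), add = T, legend = F)
terra::plot(cor_pred_hist.lcc > corPred_threshold_50, col = c(NA, '#D95F0E'), add = T, legend = F)
terra::plot(can_us_border.lcc, add = T)
dev.off()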
Now let’s break down some of the decisions I made when designing the map.
I chose a sequential Brewer palette because light colours stress lower values and darker colours stress greater values, which makes sense for habitat suitability going from low to high. The shades of brown are quick and easy to distinguish (but are not colour blind friendly). They are also soft, aesthetically pleasing hues.
The light grey background of North America is easily recognizable, but it is not so dark that it takes the reader's attention away from the SDM results.
As previously mentioned, earlier iterations had administrative boundaries for each province/state, but they made the plots very busy. The question I asked myself was: is the clutter worth the information being presented, or is that information even relevant at all? No, not really! BUT it is important to show the international border!
The ocean could have remained white, but it is a bit more clear what is water if we make it blue. The light blue matches the soft browns and grey. It also distinguishes the background of the map from the rest of the figure.
The arrangement of the historical plot followed by the SSPs creates a sensible hierarchy, with each year as a column and each SSP as a row. It is more natural to read left to right, following the time progression, than to compare up and down.
The legend is large and clear.
The species is in the top left so it is the first thing the reader sees, with an accompanying picture that draws the reader in and reminds them which species they are looking at. The same picture is used in an earlier figure introducing the species, to create continuity.
Each panel is labeled with letters so that they can be easily referenced in-text.
Overall the plot is minimal, nothing is excessive, and the whole plot is soft and easy to look at. (At least in my humble opinion).
The natural next step after completing an SDM is a gap analysis.
A gap analysis is a method in conservation biology for assessing the effectiveness of existing protected areas (PAs) and ex situ collections in conserving biodiversity. The analysis identifies where the species (or ecosystem) of interest is protected or not, and to what extent it is represented within those protected areas. There are many different approaches and software packages for completing this analysis, but as we have come this far we will stick with R. Dan Carver et al. 2021 (https://nsojournals.onlinelibrary.wiley.com/doi/full/10.1111/ecog.05430) published a great package for completing gap analyses in R, but several of its dependencies have since been deprecated, making the package inoperable. With that said, we can still use their equations to complete the analysis ourselves.
Unfortunately, for the Malus species I am working with there are simply not enough ex situ data in Canada to complete the ex situ analysis. Thankfully, we do have enough data for an in situ analysis.
library(tidyverse) # Grammar and data management
library(terra) # Spatial Data package
library(geodata) # basemaps
library(ggpubr) # arrange ggplot figures
# The following gap analysis is referencing Carver et al. (2021)
# https://nsojournals.onlinelibrary.wiley.com/doi/full/10.1111/ecog.05430
# For the purposes of the analysis we are restricting the study area to just Canada
To begin, we need spatial data for the protected areas in Canada. The Canadian government releases data on all its protected and conserved areas, including terrestrial and marine areas as well as other effective area-based conservation measures (OECMs). In this analysis I am only considering PAs, although for your purposes you may want to include OECMs. Link to the Canadian Protected and Conserved Areas Database (CPCAD): https://www.canada.ca/en/environment-climate-change/services/national-wildlife-areas/protected-conserved-areas-database.html.
Read more about Canadian OECMs here: https://www.canada.ca/en/environment-climate-change/services/nature-legacy/other-effective-area-based-measures.html.
# Protected area and OECM data from the Canadian Protected and Conserved Areas Database (CPCAD)
# Download the CPCAD geodatabase and vectorize using terra
pro_area <- terra::vect("C:/Users/terre/Documents/Acadia/Malus Project/maps/canada_pa/ProtectedConservedArea_2023/ProtectedConservedArea_2023/ProtectedConservedArea_2023.gdb")
pro_area <- pro_area[pro_area$BIOME == 'T'] # filter only the terrestrial protected areas and OECMs
pro_area <- pro_area[pro_area$PA_OECM_DF == '1'] # filter only protected areas (remove OECMs)
pro_area <- project(pro_area, 'WGS84') # project the PAs to be the same CRS as the SDM layers
NOTE: The CPCAD data uses a different projection than the one our saved spatial layers are in, so make sure to reproject it.
Here we can see all the terrestrial protected areas in Canada as of Sept. 2024. Note that this does not include OECMs.
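A quick sketch to reproduce that overview map (the styling is arbitrary):
# Quick look at the filtered terrestrial protected areas
terra::plot(pro_area, col = 'darkgreen', border = NA, main = 'CPCAD terrestrial protected areas')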
Now let’s load in all the necessary SDM layers, species occurrences and base maps.
# I am using the complete suitability rasters and thresholds separately
# Could also load pre-categorized layers as well
# M. coronaria
getwd()
setwd('../sdm_output/')
cor_pred_hist <- readRDS(file = 'cor_pred_hist.Rdata')
cor_pred_ssp245_30 <- readRDS(file = 'cor_pred_ssp245_30.Rdata')
cor_pred_ssp245_50 <- readRDS(file = 'cor_pred_ssp245_50.Rdata')
cor_pred_ssp245_70 <- readRDS(file = 'cor_pred_ssp245_70.Rdata')
cor_pred_ssp585_30 <- readRDS(file = 'cor_pred_ssp585_30.Rdata')
cor_pred_ssp585_50 <- readRDS(file = 'cor_pred_ssp585_50.Rdata')
cor_pred_ssp585_70 <- readRDS(file = 'cor_pred_ssp585_70.Rdata')
# Thresholds
# M. coronaria
setwd('../sdm_output/thresholds')
corPred_threshold_1 <- readRDS(file = 'corPred_threshold_1.Rdata')
corPred_threshold_10 <- readRDS(file = 'corPred_threshold_10.Rdata')
corPred_threshold_50 <- readRDS(file = 'corPred_threshold_50.Rdata')
# Load a Canada Admin boundary
getwd()
setwd('../occ_data')
ca_bound <- gadm(country = 'CA', level = 0, resolution = 1,
path = '../occ_data/base_maps')
NOTE: I am following Carver et al. (2021). I will not explain in much detail what each metric represents, so go check out the paper!
ANOTHER NOTE: One major difference between this workflow and Carver et al. (2021) is that I am using the projected SSPs to look at the future values of these metrics, assuming the PAs stay the same as they are today.
The first gap analysis statistic we want to calculate is the sampling representativeness score in situ (SRSin). To do this, we will crop and mask the predicted SDM layers to Canada's borders, and then crop and mask the result again to only the protected areas (PAs).
I will note that the way I am coding this is probably not the most efficient or clean, but it will get the job done! Let's hope Dan can get the GapAnalysisR package back to full strength. In the meantime, this workflow serves as an example of how to complete these calculations manually on your own.
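As a roadmap, here is how the four in situ metrics from Carver et al. (2021) are assembled in the rest of this script (comments only, summarizing the calculations that follow):
# SRSin = 100 * (occurrences in suitable habitat within PAs) /
#               (occurrences in suitable habitat across Canada)
# GRSin = 100 * (suitable area within PAs, km^2) / (total suitable area in Canada, km^2)
# ERSin = 100 * (ecoregions containing suitable habitat within PAs) /
#               (ecoregions containing suitable habitat across Canada)
# FCSin = (SRSin + GRSin + ERSin) / 3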
# Mask and crop SDM layer to Canada
# M. coronaria
# Historical
can_cor_pred_hist <- crop(cor_pred_hist, ca_bound, mask = T)
# SSP245
can_cor_pred_ssp245_30 <- crop(cor_pred_ssp245_30, ca_bound, mask = T)
can_cor_pred_ssp245_50 <- crop(cor_pred_ssp245_50, ca_bound, mask = T)
can_cor_pred_ssp245_70 <- crop(cor_pred_ssp245_70, ca_bound, mask = T)
# SSP585
can_cor_pred_ssp585_30 <- crop(cor_pred_ssp585_30, ca_bound, mask = T)
can_cor_pred_ssp585_50 <- crop(cor_pred_ssp585_50, ca_bound, mask = T)
can_cor_pred_ssp585_70 <- crop(cor_pred_ssp585_70, ca_bound, mask = T)
# Mask Suitable Habitat rasters to the PA polygons
#M. coronaria
#Historical
pa_cor_pred_hist <- crop(can_cor_pred_hist, pro_area, mask = T)
# SSP245
pa_cor_pred_ssp245_30 <- crop(can_cor_pred_ssp245_30, pro_area, mask = T)
pa_cor_pred_ssp245_50 <- crop(can_cor_pred_ssp245_50, pro_area, mask = T)
pa_cor_pred_ssp245_70 <- crop(can_cor_pred_ssp245_70, pro_area, mask = T)
# SSP585
pa_cor_pred_ssp585_30 <- crop(can_cor_pred_ssp585_30, pro_area, mask = T)
pa_cor_pred_ssp585_50 <- crop(can_cor_pred_ssp585_50, pro_area, mask = T)
pa_cor_pred_ssp585_70 <- crop(can_cor_pred_ssp585_70, pro_area, mask = T)
So now that the layers are properly cropped and masked to only the areas we want to consider, we need to extract all the true presences. As you will remember from above, these are the cases where occurrences fall within an area predicted as suitable. Then we will do the same for the true positive occurrences that fall within PAs.
# Extract occurrence points that fall within the SDM prediction
# M. coronaria
# Historical
index_cor_pos_hist_low <- extract(can_cor_pred_hist > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_hist_mod <- extract(can_cor_pred_hist > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_hist_high <- extract(can_cor_pred_hist > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
# SSP245
index_cor_pos_ssp245_30_low <- extract(can_cor_pred_ssp245_30 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_ssp245_30_mod <- extract(can_cor_pred_ssp245_30 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp245_30_high <- extract(can_cor_pred_ssp245_30 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp245_50_low <- extract(can_cor_pred_ssp245_50 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_ssp245_50_mod <- extract(can_cor_pred_ssp245_50 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp245_50_high <- extract(can_cor_pred_ssp245_50 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp245_70_low <- extract(can_cor_pred_ssp245_70 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_ssp245_70_mod <- extract(can_cor_pred_ssp245_70 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp245_70_high <- extract(can_cor_pred_ssp245_70 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
# SSP585
index_cor_pos_ssp585_30_low <- extract(can_cor_pred_ssp585_30 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_ssp585_30_mod <- extract(can_cor_pred_ssp585_30 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp585_30_high <- extract(can_cor_pred_ssp585_30 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp585_50_low <- extract(can_cor_pred_ssp585_50 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_ssp585_50_mod <- extract(can_cor_pred_ssp585_50 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp585_50_high <- extract(can_cor_pred_ssp585_50 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp585_70_low <- extract(can_cor_pred_ssp585_70 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_cor_pos_ssp585_70_mod <- extract(can_cor_pred_ssp585_70 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_cor_pos_ssp585_70_high <- extract(can_cor_pred_ssp585_70 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
# Extract the occurrences that fall within the PAs
# M. coronaria
# Historical
index_pa_cor_pos_hist_low <- extract(pa_cor_pred_hist > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_hist_mod <- extract(pa_cor_pred_hist > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_hist_high <- extract(pa_cor_pred_hist > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
# SSP245
index_pa_cor_pos_ssp245_30_low <- extract(pa_cor_pred_ssp245_30 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_ssp245_30_mod <- extract(pa_cor_pred_ssp245_30 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp245_30_high <- extract(pa_cor_pred_ssp245_30 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp245_50_low <- extract(pa_cor_pred_ssp245_50 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_ssp245_50_mod <- extract(pa_cor_pred_ssp245_50 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp245_50_high <- extract(pa_cor_pred_ssp245_50 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp245_70_low <- extract(pa_cor_pred_ssp245_70 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_ssp245_70_mod <- extract(pa_cor_pred_ssp245_70 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp245_70_high <- extract(pa_cor_pred_ssp245_70 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
# SSP585
index_pa_cor_pos_ssp585_30_low <- extract(pa_cor_pred_ssp585_30 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_ssp585_30_mod <- extract(pa_cor_pred_ssp585_30 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp585_30_high <- extract(pa_cor_pred_ssp585_30 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp585_50_low <- extract(pa_cor_pred_ssp585_50 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_ssp585_50_mod <- extract(pa_cor_pred_ssp585_50 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp585_50_high <- extract(pa_cor_pred_ssp585_50 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp585_70_low <- extract(pa_cor_pred_ssp585_70 > corPred_threshold_1, occThin_cor) %>% filter(lyr1 == T) # filter only the true cases
index_pa_cor_pos_ssp585_70_mod <- extract(pa_cor_pred_ssp585_70 > corPred_threshold_10, occThin_cor) %>% filter(lyr1 == T)
index_pa_cor_pos_ssp585_70_high <- extract(pa_cor_pred_ssp585_70 > corPred_threshold_50, occThin_cor) %>% filter(lyr1 == T)
Note: We need to filter the extracted layers so that we only keep rows where 'lyr1' is TRUE.
Now we have a list of all the true positive occurrences for the entire predicted suitable area in Canada, and of those that fall within PAs. So now we want to count how many of each there are and calculate the SRSin metric.
# return count of the number of occurrences
# positive occurrences (overall)
# list of obj names from above
occ_names <- list(
index_cor_pos_hist_low,
index_cor_pos_hist_mod,
index_cor_pos_hist_high,
index_cor_pos_ssp245_30_low,
index_cor_pos_ssp245_30_mod,
index_cor_pos_ssp245_30_high,
index_cor_pos_ssp245_50_low,
index_cor_pos_ssp245_50_mod,
index_cor_pos_ssp245_50_high,
index_cor_pos_ssp245_70_low,
index_cor_pos_ssp245_70_mod,
index_cor_pos_ssp245_70_high,
index_cor_pos_ssp585_30_low,
index_cor_pos_ssp585_30_mod,
index_cor_pos_ssp585_30_high,
index_cor_pos_ssp585_50_low,
index_cor_pos_ssp585_50_mod,
index_cor_pos_ssp585_50_high,
index_cor_pos_ssp585_70_low,
index_cor_pos_ssp585_70_mod,
index_cor_pos_ssp585_70_high)
# return counts in a vector
SRSin_occ <- occ_names %>%
lapply(., nrow) %>%
unlist()
# positive occurrences within PAs
pa_names <- list(
index_pa_cor_pos_hist_low,
index_pa_cor_pos_hist_mod,
index_pa_cor_pos_hist_high,
index_pa_cor_pos_ssp245_30_low,
index_pa_cor_pos_ssp245_30_mod,
index_pa_cor_pos_ssp245_30_high,
index_pa_cor_pos_ssp245_50_low,
index_pa_cor_pos_ssp245_50_mod,
index_pa_cor_pos_ssp245_50_high,
index_pa_cor_pos_ssp245_70_low,
index_pa_cor_pos_ssp245_70_mod,
index_pa_cor_pos_ssp245_70_high,
index_pa_cor_pos_ssp585_30_low,
index_pa_cor_pos_ssp585_30_mod,
index_pa_cor_pos_ssp585_30_high,
index_pa_cor_pos_ssp585_50_low,
index_pa_cor_pos_ssp585_50_mod,
index_pa_cor_pos_ssp585_50_high,
index_pa_cor_pos_ssp585_70_low,
index_pa_cor_pos_ssp585_70_mod,
index_pa_cor_pos_ssp585_70_high)
# return counts in a vector
SRSin_pa <- pa_names %>%
lapply(., nrow) %>%
unlist()
# Calculate SRSin score
SRSin_score <- ((SRSin_pa/SRSin_occ) * 100) %>% replace_na(., 0) # some values are NAs as there are 0 in the denominator
NOTE: Some NA values are produced in the SRSin calculation because, for some future SDM predictions, no predicted suitable area (total area) overlaps any of the occurrence points, giving a zero denominator. So we set those cases to zero here.
Now we want to create a data frame to store all the results in. We should also be thinking about how to make it easy on ourselves to plot these data with `ggplot2` downstream.
in_situ <- data.frame(
species = c(rep('Malus coronaria', times = 21)),
ssp = c(rep('historical', times = 3), rep(245, times = 9), rep(585, times = 9)),
suitability = rep(c('low', 'moderate', 'high'), times = 7),
period = c(rep(2000, times = 3), rep(2030, times = 3), rep(2050, times = 3), rep(2070, times = 3),
rep(2030, times = 3), rep(2050, times = 3), rep(2070, times = 3)),
SRSin = SRSin_score
)
print(in_situ)
Now let's calculate the geographical representativeness score in situ (GRSin). This is the total area of PAs containing suitable habitat divided by the total area the SDM predicted as suitable habitat. To do that, we need to calculate the area of the total suitable habitat and the area of the protected land containing suitable habitat.
# Overal suitable area
# create a list of rasters to calculate total suitable area for each threshold, time period and SSP
suitable_area_rasts <- list(
can_cor_pred_hist > corPred_threshold_1,
can_cor_pred_hist > corPred_threshold_10,
can_cor_pred_hist > corPred_threshold_50,
can_cor_pred_ssp245_30 > corPred_threshold_1,
can_cor_pred_ssp245_30 > corPred_threshold_10,
can_cor_pred_ssp245_30 > corPred_threshold_50,
can_cor_pred_ssp245_50 > corPred_threshold_1,
can_cor_pred_ssp245_50 > corPred_threshold_10,
can_cor_pred_ssp245_50 > corPred_threshold_50,
can_cor_pred_ssp245_70 > corPred_threshold_1,
can_cor_pred_ssp245_70 > corPred_threshold_10,
can_cor_pred_ssp245_70 > corPred_threshold_50,
can_cor_pred_ssp585_30 > corPred_threshold_1,
can_cor_pred_ssp585_30 > corPred_threshold_10,
can_cor_pred_ssp585_30 > corPred_threshold_50,
can_cor_pred_ssp585_50 > corPred_threshold_1,
can_cor_pred_ssp585_50 > corPred_threshold_10,
can_cor_pred_ssp585_50 > corPred_threshold_50,
can_cor_pred_ssp585_70 > corPred_threshold_1,
can_cor_pred_ssp585_70 > corPred_threshold_10,
can_cor_pred_ssp585_70 > corPred_threshold_50)
# the following will return a vector of the total suitable habitat in km^2
# lapply allows you to iterate a function over a list
# terra::expanse calculates the area of the raster cells, grouped by value
# I then use an anonymous function to create a dataframe of the listed values so that
# they can be filtered such that only TRUE suitable areas (value = 1) are selected
# and then the area is returned using pull (extracts a single column) for all cases that match value = 1
# finally I unlist the temporary df so that the areas are returned in a vector
total_area <- lapply(suitable_area_rasts, terra::expanse, byValue = T, unit = 'km') %>%
lapply(., function(df) {
df %>% filter(value == 1) %>% pull(area)
}) %>%
unlist()
And we simply repeat what’s above for protected areas.
pa_area_rasts <- list(
pa_cor_pred_hist > corPred_threshold_1,
pa_cor_pred_hist > corPred_threshold_10,
pa_cor_pred_hist > corPred_threshold_50,
pa_cor_pred_ssp245_30 > corPred_threshold_1,
pa_cor_pred_ssp245_30 > corPred_threshold_10,
pa_cor_pred_ssp245_30 > corPred_threshold_50,
pa_cor_pred_ssp245_50 > corPred_threshold_1,
pa_cor_pred_ssp245_50 > corPred_threshold_10,
pa_cor_pred_ssp245_50 > corPred_threshold_50,
pa_cor_pred_ssp245_70 > corPred_threshold_1,
pa_cor_pred_ssp245_70 > corPred_threshold_10,
pa_cor_pred_ssp245_70 > corPred_threshold_50,
pa_cor_pred_ssp585_30 > corPred_threshold_1,
pa_cor_pred_ssp585_30 > corPred_threshold_10,
pa_cor_pred_ssp585_30 > corPred_threshold_50,
pa_cor_pred_ssp585_50 > corPred_threshold_1,
pa_cor_pred_ssp585_50 > corPred_threshold_10,
pa_cor_pred_ssp585_50 > corPred_threshold_50,
pa_cor_pred_ssp585_70 > corPred_threshold_1,
pa_cor_pred_ssp585_70 > corPred_threshold_10,
pa_cor_pred_ssp585_70 > corPred_threshold_50)
pa_area <- lapply(pa_area_rasts, terra::expanse, byValue = T, unit = 'km') %>%
lapply(., function(df) {
df %>% filter(value == 1) %>% pull(area)
}) %>%
unlist()
Now we should have everything we need to make the GRSin calculation.
# calculate the GRSin score
GRSin_score <- (pa_area/total_area) * 100
# add it to the data frame from above
in_situ <- in_situ %>% mutate(GRSin = GRSin_score)
The next metric to calculate is the ecological representativeness score in situ (ERSin).
NOTE: Another deviation from Carver et al. (2021) is that we will be using a different set of ecoregions. They use ecoregion data from the Nature Conservancy Geospatial Atlas (https://geospatial.tnc.org/datasets/7b7fb9d945544d41b3e7a91494c42930_0), whereas we will use the EPA ecoregion data from the background-selection script (https://www.epa.gov/eco-research/ecoregions-north-america), sticking with Level II as used above.
# Download NA Ecoregion shapefile from: https://www.epa.gov/eco-research/ecoregions-north-america
# Load shapefile from local files
ecoNA <- vect(x = "C:/Users/terre/Documents/Acadia/Malus Project/maps/eco regions/na_cec_eco_l2/", layer = 'NA_CEC_Eco_Level2')
ecoNA <- project(ecoNA, 'WGS84') # project ecoregion vector to same coords ref as basemap
So now let's take the list of rasters we made for the GRSin calculation and reuse it here. We want to convert the suitable habitat rasters to points so that we can extract the names of the ecoregions from the ecoregion vector. To do that we will use `lapply()` to convert all the rasters to points, while also defining an anonymous function to keep only the 'true' layer. Remember, we only want points for the TRUE cases where habitat is in fact suitable.
# suitable habitat eco regions
suitable_area_points <- lapply(suitable_area_rasts, terra::as.points) %>%
lapply(., function (x) {
x[x$lyr1 == 1, ]
})
# protected area ecoregions
pa_area_points <- lapply(pa_area_rasts, terra::as.points) %>%
lapply(., function (x) {
x[x$lyr1 == 1, ]
})
IMPORTANT NOTE: This next step is computationally expensive and will take several hours to complete. In my case it took a total of 17 hours to run this chunk. The time can be reduced by using fewer input spatial points (i.e. the SDM rasters converted to points): fewer suitability thresholds, fewer time periods and/or fewer SSPs.
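Before committing to the full run, it can be worth timing the extraction for a single layer to get a rough per-layer estimate (a sketch; the layers differ in the number of points, so scale the estimate with caution).
# Time the ecoregion extraction for just the first suitability layer
single_layer_time <- system.time(
  terra::extract(x = ecoNA, y = suitable_area_points[[1]])
)
print(single_layer_time)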
start <- Sys.time()
suitable_area_eco <- lapply(suitable_area_points, function(point) {
terra::extract(x = ecoNA, y = point)
})
print( Sys.time() - start )
# took 12.93733 hours to complete
start <- Sys.time()
pa_area_eco <- lapply(pa_area_points, function(point) {
terra::extract(x = ecoNA, y = point)
})
print( Sys.time() - start )
# took 3.679289 hours to complete
We especially want to save these intermediate objects once the extractions are complete.
# SAVE/LOAD ecoregion results
getwd()
setwd('../gap_analysis')
saveRDS(suitable_area_eco, file = 'suitable_area_eco.Rdata')
saveRDS(pa_area_eco, file = 'pa_area_eco.Rdata')
suitable_area_eco <- readRDS(file = 'suitable_area_eco.Rdata')
pa_area_eco <- readRDS('pa_area_eco.Rdata')
So now we want to filter the data to find the number of ecoregions represented in each layer supplied.
# return unique names and number of eco regions
suit_area_eco_num <- suitable_area_eco %>%
bind_rows(.id = 'source_list') %>% # bind the lists of lists together
unnest(cols = c(NA_L2NAME)) %>% # unnest the lists into columns of a dataframe
distinct(source_list, NA_L2CODE) %>% # return only the unique eco regions grouped by source list (aka the raster layers)
filter(!NA_L2CODE %in% c(NA, '0.0')) %>% # remove NA and 0.0 (water) values
add_count(source_list, name = 'num_eco') %>% # count the number of nrow for each source list (or number of eco regions)
distinct(source_list, num_eco) # return the number of eco regions per source list
pa_area_eco_num <- pa_area_eco %>%
bind_rows(.id = 'source_list') %>% # bind the lists of lists together
unnest(cols = c(NA_L2NAME)) %>% # unnest the lists into columns of a dataframe
distinct(source_list, NA_L2CODE) %>% # return only the unique eco regions grouped by source list (aka the raster layers)
filter(!NA_L2CODE %in% c(NA, '0.0')) %>% # remove NA and 0.0 (water) values
add_count(source_list, name = 'num_eco') %>% # count the number of nrow for each source list (or number of eco regions)
distinct(source_list, num_eco) # return the number of eco regions per source list
And now we should have all the numbers needed to calculate the ERSin metric.
# Now calculate the ERSin score
ERSin_score <- (pa_area_eco_num$num_eco / suit_area_eco_num$num_eco) * 100
# Mutate the ERSin score to the master df
in_situ <- in_situ %>% mutate(ERSin = ERSin_score)
So now that we have the SRSin, GRSin and ERSin scores, we can calculate the final conservation score in situ (FCSin). This is a very simple calculation to do in R.
# Calculate the final conservation score for in situ conservation
in_situ <- in_situ %>% mutate(FCSin = ((SRSin + GRSin + ERSin)/3))
Now that we have all the calculations complete, it is time to visualize them! I am only going to show one example plot, so just replicate it for the rest of the data.
# bar plots
fill_cols <- c("#EDF8B1", "#7FCDBB", "#2C7FB8")
# M. coronaria
cor_srs_combined <- in_situ %>% filter(species == 'Malus coronaria' & suitability == 'high') %>%
ggplot(aes(x = period, y = SRSin, fill = ssp)) +
geom_col(position = position_dodge()) +
scale_x_continuous(limits = c(1990, 2080),
breaks = c(2000, 2030, 2050, 2070),
labels = c("Historical", 2030, 2050, 2070)) +
scale_y_continuous(limits = c(0, 105),
breaks = c(0, 20, 40, 60, 80, 100),
expand = c(0,0)) +
theme_classic() +
scale_fill_manual(values = fill_cols,
breaks = c('historical', '245', '585'),
labels = c('Historical', 'SSP245', 'SSP585')) +
theme(text = element_text(size = 30, colour = 'black'),
axis.text = element_text(colour = 'black'),
axis.title.x=element_blank(),
legend.title = element_blank())
# ditto
# arrange plots using <ggpubr>
ggarrange(cor_srs_combined, cor_grs_combined, cor_ers_combined, cor_fcs_combined,
nrow = 1, ncol = 4,
legend = "top",
common.legend = T)
And this is the result!