Multi-scale model assessment with spatialsample

Author: Mike Mahoney

Published: July 13, 2022

Modeling spatially structured data is complicated. In addition to the usual difficulty of statistical modeling, models of spatially structured data may have spatial structure in their errors, with different regions being more or less well-described by a given model. This also means that accuracy metrics for these models might change depending on what spatial scale is being assessed. Only investigating model accuracy at larger aggregation scales, such as when accuracy is only assessed for the entire study area as a whole, might “smooth out” these local differences and present an inaccurate picture of model performance.
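To make that concrete, here’s a toy sketch (simulated data, not the FIA data we’ll use below): a single study-area-wide RMSE can look fine while hiding the fact that a model describes one region much better than another:

library(dplyr)

set.seed(42)
toy <- tibble(
  region = rep(c("north", "south"), each = 50),
  truth = rnorm(100),
  # the model is much noisier in the south than in the north
  estimate = truth + rnorm(100, sd = rep(c(0.2, 1.5), each = 50))
)

# One study-area-wide RMSE averages the two regions together...
sqrt(mean((toy$truth - toy$estimate)^2))

# ...while region-level RMSE makes the difference plain
toy |> 
  group_by(region) |> 
  summarise(rmse = sqrt(mean((truth - estimate)^2)))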

For this reason, a number of researchers (most notably, Riemann et al. (2010)1) have suggested assessing models at multiple scales of spatial aggregation to ensure cross-scale differences in model accuracy are identified and reported. This post walks through how to do that using the new spatialsample package.

Because Riemann et al. were working with data from the US Forest Inventory and Analysis (FIA) program, we’re going to do the same. However, because our main goal is to show how spatialsample can support this type of analysis, we won’t spend a ton of time worrying about any of the quirks of FIA data2 or on feature engineering. Instead, we’re going to use a simple linear model to see if we can predict how much aboveground biomass (“AGB”; all the non-root woody bits) there is in a forest based on how many trees there are. We’ll use all the FIA field data from New York State, USA.

Because we’re mostly interested in assessing our models, I’m going to skip over exactly how I downloaded and wrangled the FIA data for this post. If you’re curious, I’ve hidden the code below:

Code
library(dplyr)

# Download the FIA database for New York over the internet,
# and unzip it into our local directory
#
# This updates annually, which means that this post likely won't
# generate the exact same results after 2022
httr::GET(
  "https://apps.fs.usda.gov/fia/datamart/Databases/SQLite_FIADB_NY.zip",
  httr::write_disk("SQLite_FIADB_NY.zip", overwrite = TRUE)
)

unzip("SQLite_FIADB_NY.zip")

# We're going to work with the database through dplyr's database connections
#
# But first, we need to create a DBI connection to the database and
# load our tables:
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "FIADB_NY.db")
trees <- tbl(con, "TREE")

plots <- tbl(con, "PLOT")

# The FIA database has every measurement ever collected by the program;
# we'll filter to only the most recent survey for each of the plots.
#
# Plots are measured on a rolling 7-year basis, so we'll also cut out any
# plots which might not be remeasured anymore with a call to filter()
plots <- plots |> 
  group_by(PLOT) |> 
  filter(INVYR == max(INVYR, na.rm = TRUE)) |> 
  ungroup() |> 
  filter(INVYR > 2009) |> 
  collect()

copy_to(con, plots, "newest_plots", overwrite = TRUE)
newest_plots <- tbl(con, "newest_plots")

# Now we'll use a filtering join to select only trees measured in the most
# recent sample at each plot
#
# We'll also count how many trees were at each plot,
# sum up their AGB, 
# and save out a few other useful columns like latitude and longitude
plot_measurements <- trees |> 
  right_join(newest_plots, by = c("INVYR", "PLOT")) |> 
  group_by(PLOT) |> 
  summarise(
    yr = mean(INVYR, na.rm = TRUE),
    plot = mean(PLOT, na.rm = TRUE),
    lat = mean(LAT, na.rm = TRUE),
    long = mean(LON, na.rm = TRUE),
    n_trees = n(),
    agb = sum(DRYBIO_AG, na.rm = TRUE)
  ) |> 
  collect() |> 
  mutate(
    # Because of how we joined, `n_trees` is always at least 1 -- 
    # even if there are 0 trees
    n_trees = ifelse(is.na(agb) & n_trees == 1, 0, n_trees),
    agb = ifelse(is.na(agb), 0, agb)
  )

DBI::dbDisconnect(con)

readr::write_csv(plot_measurements, "plot_measurements.csv")

With that pre-processing done, it’s time for us to load our data and turn it into an sf object. We’re going to reproject our data to use a coordinate reference system that the US government tends to use for national data products, like the FIA:

library(sf)

invisible(sf_proj_network(TRUE))

plot_measurements <- readr::read_csv("https://www.mm218.dev/junkdrawer/riemann/plot_measurements.csv") |> 
  st_as_sf(coords = c("long", "lat"), crs = 4326) |> 
  st_transform(5070)
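As a quick sanity check (optional, and not part of the original analysis), we can confirm the transformation took and that our coordinates are now in EPSG:5070:

st_crs(plot_measurements)$epsg
[1] 5070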

And this is what we’re going to go ahead and resample. We want to assess our model’s performance at multiple scales, following the approach in Riemann et al. That means we need to do the following:

  1. Block our study area using multiple sets of regular hexagons of different sizes, and assign our data to the hexagon it falls into within each set.
  2. Perform leave-one-block-out cross-validation with each of those sets, fitting our model to n - 1 of the n hexagons we’ve created and assessing it on the hold-out hexagon.
  3. Calculate model accuracy for each size based on those held-out hexes.

So to get started, we need to block our study area. We can do this using the spatial_block_cv() function from spatialsample. We’ll generate ten different sets of hexagon tiles, using cellsize arguments between 10,000 and 100,000 meters3. The code to do that, and to store all of our resamples in a single tibble, looks like this4:

set.seed(123)
library(dplyr)
library(spatialsample)
cellsize <- seq(10, 100, 10) * 1000

riemann_resamples <- tibble(
  resamples = purrr::map(
    cellsize, 
    \(cs) {
      spatial_block_cv(
        plot_measurements,
        v = Inf,
        cellsize = cs,
        square = FALSE
      )
    }
  ),
  cellsize = cellsize
)
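Because v = Inf, each element of the resamples column is a leave-one-block-out rset with one fold per hexagon that contains data. The number of folds shrinks as the hexagons grow; the 10,000 meter tiling, for instance, gives us 1,541 folds:

nrow(riemann_resamples$resamples[[1]])
[1] 1541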

If we want, we can visualize one (or more) of our resamples, to get a sense of what our tiling looks like:

autoplot(riemann_resamples$resamples[[10]])

And that’s step 1 of the process completed! Now we need to move on to step 2 and actually fit models to each of these resamples. I want to highlight that this is a lot of models, so fitting them all is going to take a while5:

purrr::map_dbl(
  riemann_resamples$resamples,
  nrow
) |> 
  sum()
[1] 2600
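As an aside: if a few thousand model fits sounds slow on your machine, one option (a sketch, assuming you have the doParallel package installed) is to register a parallel backend before fitting. The tune package, which we’ll use to fit our models below, will distribute resamples across workers automatically:

library(doParallel)

# Spin up a cluster, leaving one core free for everything else
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# ...fit models as shown below...

stopCluster(cl)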

With that said, actually fitting those few thousand models is a two-part process. First, we’re going to load the rest of the tidymodels packages and use them to define a workflow (from the workflows package), specifying the formula and model that we want to fit to each resample:

library(tidymodels)

lm_workflow <- workflow() |> 
  add_model(linear_reg()) |> 
  add_formula(agb ~ n_trees)
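Before handing this workflow a few thousand resamples, it can be worth a quick smoke test (not a necessary step, just a habit): fitting the workflow once to the full data set confirms that the formula and model specification work at all:

lm_workflow |> 
  fit(data = plot_measurements)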

Next, we’ll actually apply that workflow a few thousand times! We’ll calculate two metrics for each run of the model: the root-mean-squared error (RMSE) and the mean absolute error (MAE). We can add these metrics as a new column to our resamples using the following:

riemann_resamples <- riemann_resamples |> 
  mutate(
    resampled_outputs = purrr::map(
      resamples, 
      fit_resamples,
      object = lm_workflow,
      metrics = metric_set(
        rmse,
        mae
      )
    )
  )

The riemann_resamples object now includes both our original resamples as well as the accuracy metrics associated with each run of the model. A very cool thing about this approach is that we can now visualize our block-level accuracy metrics with a few lines of code.

For instance, if we wanted to plot block-level RMSE for our largest assessment scale, we could use the following code to “unnest” our nested metric and resample columns:

riemann_resamples$resampled_outputs[[10]] |> 
  mutate(splits = purrr::map(splits, assessment)) |> 
  unnest(.metrics) |> 
  filter(.metric == "rmse") |> 
  unnest(splits) |> 
  st_as_sf() |> 
  ggplot(aes(color = .estimate)) + 
  geom_sf()

We can also go on to the third step of our assessment process, and get our model accuracy metrics for each aggregation scale we investigated. We’ll create a new data frame with only our cellsize variable and the associated model metrics:

riemann_metrics <- riemann_resamples |> 
  transmute(
    cellsize = cellsize,
    resampled_metrics = purrr::map(resampled_outputs, collect_metrics)
  ) |> 
  unnest(resampled_metrics)

head(riemann_metrics)
# A tibble: 6 × 7
  cellsize .metric .estimator  mean     n std_err .config             
     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1    10000 mae     standard   5787.  1541    99.9 Preprocessor1_Model1
2    10000 rmse    standard   6980.  1541   121.  Preprocessor1_Model1
3    20000 mae     standard   5722.   424   130.  Preprocessor1_Model1
4    20000 rmse    standard   7644.   424   169.  Preprocessor1_Model1
5    30000 mae     standard   5637.   205   161.  Preprocessor1_Model1
6    30000 rmse    standard   7725.   205   218.  Preprocessor1_Model1

And just like that, we’ve got a multi-scale assessment of our model’s accuracy! We can then use this to investigate and report how well our model does at different levels of aggregation. For instance, plotting RMSE and MAE against aggregation scale shows that our RMSE increases with aggregation while our MAE decreases. This hints that, as we aggregate our predictions to larger hexagons, more of our model’s overall error is driven by large outliers:

library(ggplot2)

ggplot(riemann_metrics, aes(cellsize, mean, color = .metric)) + 
  geom_line() +
  geom_point() + 
  theme_minimal()

Footnotes

  1. Riemann, R., Wilson, B. T., Lister, A., and Parks, S. 2010. An effective assessment protocol for continuous geospatial datasets of forest characteristics using USFS Forest Inventory and Analysis (FIA) data. Remote Sensing of Environment, 114, pp. 2337-2353. doi: 10.1016/j.rse.2010.05.010.↩︎

  2. Among them: only forested areas are measured, where “forested” means “principally used as forest”, which excludes parks but includes recently clear-cut lands; and plot locations are considered personally identifying information under the 1985 farm bill, so in order not to identify anyone, the coordinates in public data are “fuzzed” by a few miles and approximately 20% of plot coordinates are swapped with other plots in the data set. Which is to say: consult your local forester or ecologist if you want to use FIA data to answer real questions in your own work.↩︎

  3. This value is in meters because our coordinate reference system is in meters. It represents the length of the apothem, from the center of each polygon to the midpoint of each side. We’re using hexagons because Riemann et al. also used hexagons.↩︎

  4. v is Inf because we want to perform leave-one-block-out cross-validation, but we don’t know how many blocks there will be before they’re created. This is the supported way to do leave-one-X-out cross-validation in spatialsample > 0.2.0 (another option is to set v = NULL).↩︎

  5. Linear regression was invented in 1805, ish, long before the Analytical Engine was a twinkle in Babbage’s eye. Whenever I get frustrated at how long fitting multiple models like this takes, I like to take a step back and recognize that I am asking my poor overworked computer to fit roughly as many models as were used in the first ~100 years of the technique’s life.↩︎