Using in-line grouping to fit many models

An alternative to nesting for fitting separate models to multiple groups.
R
Tutorials
Data science
Author
Published

April 1, 2023

Tim Tiefenbach has a post on using dplyr and other tidymodels tools to fit many models in relatively few lines of code. Tim’s post walks through a lot of interesting functions for more complicated model fitting, which I’m not going to talk about at all. What I want to talk about is that I recently realized I don’t do much nesting in R anymore!

The building block in Tim’s post uses the new-ish dplyr::nest_by() to create a nested data frame, then fits models using the new data list column:

dplyr::storms |> 
  dplyr::nest_by(status) |> 
  dplyr::mutate(
    mod = list(lm(wind ~ pressure, data = data))
  )
# A tibble: 9 × 3
# Rowwise:  status
  status                                data mod   
  <fct>                  <list<tibble[,12]>> <list>
1 disturbance                     [146 × 12] <lm>  
2 extratropical                 [2,068 × 12] <lm>  
3 hurricane                     [4,684 × 12] <lm>  
4 other low                     [1,405 × 12] <lm>  
5 subtropical depression          [151 × 12] <lm>  
6 subtropical storm               [292 × 12] <lm>  
7 tropical depression           [3,525 × 12] <lm>  
8 tropical storm                [6,684 × 12] <lm>  
9 tropical wave                   [111 × 12] <lm>  

This is a pattern I used to use all the time1 – my undergrad thesis and eventual first pub from the same were built from strings of nesting and unnesting data and models. But over the years, I’ve realized that you can often get the same results using grouped data frames in the place of nested ones, and have shifted to using dplyr::summarise() and friends instead of nesting:

dplyr::storms |> 
  dplyr::summarise(
    mod = list(lm(wind ~ pressure, data = dplyr::pick(dplyr::everything()))),
    .by = status
  )
# A tibble: 9 × 2
  status                 mod   
  <fct>                  <list>
1 tropical depression    <lm>  
2 tropical storm         <lm>  
3 extratropical          <lm>  
4 hurricane              <lm>  
5 subtropical storm      <lm>  
6 subtropical depression <lm>  
7 disturbance            <lm>  
8 other low              <lm>  
9 tropical wave          <lm>  

Now, the obvious downside of using a grouped data frame instead of a nested one is that future function calls no longer have access to your raw data frame. The slightly less obvious downside is that the output of dplyr::nest_by() is a rowwise data frame, which makes it easy to pass our model objects directly to other functions like broom::tidy():

dplyr::storms |> 
  dplyr::nest_by(status) |> 
  dplyr::mutate(
    mod = list(lm(wind ~ pressure, data = data)),
    res = list(broom::tidy(mod))
  )
# A tibble: 9 × 4
# Rowwise:  status
  status                                data mod    res             
  <fct>                  <list<tibble[,12]>> <list> <list>          
1 disturbance                     [146 × 12] <lm>   <tibble [2 × 5]>
2 extratropical                 [2,068 × 12] <lm>   <tibble [2 × 5]>
3 hurricane                     [4,684 × 12] <lm>   <tibble [2 × 5]>
4 other low                     [1,405 × 12] <lm>   <tibble [2 × 5]>
5 subtropical depression          [151 × 12] <lm>   <tibble [2 × 5]>
6 subtropical storm               [292 × 12] <lm>   <tibble [2 × 5]>
7 tropical depression           [3,525 × 12] <lm>   <tibble [2 × 5]>
8 tropical storm                [6,684 × 12] <lm>   <tibble [2 × 5]>
9 tropical wave                   [111 × 12] <lm>   <tibble [2 × 5]>

The outputs of the grouped data frame method are not rowwise data frames, which means we need to use another function to iterate through each element of mod. I usually use purrr::map() for this:

dplyr::storms |> 
  dplyr::summarise(
    mod = list(lm(wind ~ pressure, data = dplyr::pick(dplyr::everything()))),
    res = purrr::map(mod, broom::glance),
    .by = status
  )
# A tibble: 9 × 3
  status                 mod    res              
  <fct>                  <list> <list>           
1 tropical depression    <lm>   <tibble [1 × 12]>
2 tropical storm         <lm>   <tibble [1 × 12]>
3 extratropical          <lm>   <tibble [1 × 12]>
4 hurricane              <lm>   <tibble [1 × 12]>
5 subtropical storm      <lm>   <tibble [1 × 12]>
6 subtropical depression <lm>   <tibble [1 × 12]>
7 disturbance            <lm>   <tibble [1 × 12]>
8 other low              <lm>   <tibble [1 × 12]>
9 tropical wave          <lm>   <tibble [1 × 12]>

Part of the reason I use purrr for this is that purrr provides plenty of other helper functions for working with list columns; for instance, I tend to use purrr::chuck() to extract model fit statistics:

dplyr::storms |> 
  dplyr::summarise(
    mod = list(lm(wind ~ pressure, data = dplyr::pick(dplyr::everything()))),
    res = purrr::map(mod, broom::glance),
    rsquared = purrr::map_dbl(res, purrr::chuck, "r.squared"),
    .by = status
  )
# A tibble: 9 × 4
  status                 mod    res               rsquared
  <fct>                  <list> <list>               <dbl>
1 tropical depression    <lm>   <tibble [1 × 12]>   0.0369
2 tropical storm         <lm>   <tibble [1 × 12]>   0.485 
3 extratropical          <lm>   <tibble [1 × 12]>   0.631 
4 hurricane              <lm>   <tibble [1 × 12]>   0.807 
5 subtropical storm      <lm>   <tibble [1 × 12]>   0.473 
6 subtropical depression <lm>   <tibble [1 × 12]>   0.214 
7 disturbance            <lm>   <tibble [1 × 12]>   0.299 
8 other low              <lm>   <tibble [1 × 12]>   0.332 
9 tropical wave          <lm>   <tibble [1 × 12]>   0.126 

Anyway – there’s nothing better or worse (as far as I’m concerned) with either nesting or grouping for fitting many models; I just think it’s interesting that my personal style has shifted over time to use much more grouping, and much less nesting.

And as I said at the start, Tim’s post walks through a lot of interesting functions for more complicated model fitting, which I’m not going to talk about at all; here’s the link again if you’re interested.

Footnotes

  1. Using tidyr::nest(), though.↩︎