How rsample keeps memory usage low

Copy-on-modify is pretty neat.
Tags: R, rsample, tidymodels

Published: October 4, 2022

A few months back, I wrote two comments on a GitHub issue explaining a bit of how rsample works under the hood. Specifically, a user asked how rsample keeps the total amount of memory its resamples use relatively low. I’ve sent this GitHub issue to a few people since then, so it felt useful enough to turn into a blog post.1

What’s an rsample?

In case you’ve never used it, rsample is an R package for data resampling – if you need bootstrap resampling, V-fold cross-validation, permutation sampling, and more, rsample is meant for you.2 The majority of these rsample functions return rset objects, which are just jazzed-up tibbles:

set.seed(123)
library(rsample)
library(mlbench)
data(LetterRecognition)

boots <- bootstraps(LetterRecognition, times = 2)
boots
# Bootstrap sampling 
# A tibble: 2 × 2
  splits               id        
  <list>               <chr>     
1 <split [20000/7403]> Bootstrap1
2 <split [20000/7375]> Bootstrap2

Each of our individual resamples is stored as an rsplit object, each of which takes up a row in the splits column. Printing these objects tells us how many rows are in our analysis and assessment sets,3 but hides most of the actual structure of the rsplit object. If we use str() instead, we can see the named elements in each rsplit: data, our original data frame; in_id, the indices of the observations kept “in” our analysis set; out_id, which sometimes4 holds the indices of the observations held “out” to make up our assessment set, but here is NA; and id, a label for the resample:

boots$splits[[1]]
<Analysis/Assess/Total>
<20000/7403/20000>
str(
  boots$splits[[1]]
)
List of 4
 $ data  :'data.frame': 20000 obs. of  17 variables:
  ..$ lettr: Factor w/ 26 levels "A","B","C","D",..: 20 9 4 14 7 19 2 1 10 13 ...
  ..$ x.box: num [1:20000] 2 5 4 7 2 4 4 1 2 11 ...
  ..$ y.box: num [1:20000] 8 12 11 11 1 11 2 1 2 15 ...
  ..$ width: num [1:20000] 3 3 6 6 3 5 5 3 4 13 ...
  ..$ high : num [1:20000] 5 7 8 6 1 8 4 2 4 9 ...
  ..$ onpix: num [1:20000] 1 2 6 3 1 3 4 1 2 7 ...
  ..$ x.bar: num [1:20000] 8 10 10 5 8 8 8 8 10 13 ...
  ..$ y.bar: num [1:20000] 13 5 6 9 6 8 7 2 6 2 ...
  ..$ x2bar: num [1:20000] 0 5 2 4 6 6 6 2 2 6 ...
  ..$ y2bar: num [1:20000] 6 4 6 6 6 9 6 2 6 2 ...
  ..$ xybar: num [1:20000] 6 13 10 4 6 5 7 8 12 12 ...
  ..$ x2ybr: num [1:20000] 10 3 3 4 5 6 6 2 4 1 ...
  ..$ xy2br: num [1:20000] 8 9 7 10 9 6 6 8 8 9 ...
  ..$ x.ege: num [1:20000] 0 2 3 6 1 0 2 1 1 8 ...
  ..$ xegvy: num [1:20000] 8 8 7 10 7 8 8 6 6 1 ...
  ..$ y.ege: num [1:20000] 0 4 3 2 5 9 7 2 1 1 ...
  ..$ yegvx: num [1:20000] 8 10 9 8 10 7 10 7 7 8 ...
 $ in_id : int [1:20000] 18847 18895 2986 1842 3371 11638 4761 6746 16128 2757 ...
 $ out_id: logi NA
 $ id    : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
  ..$ id: chr "Bootstrap1"
 - attr(*, "class")= chr [1:2] "boot_split" "rsplit"
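
Those in_id indices are all rsample needs to materialize the analysis set later on: as I understand it, the data frame is only actually subset when you call an accessor like analysis() or assessment(). A quick sketch, reusing the boots object from above:

```r
split <- boots$splits[[1]]

# analysis() builds the bootstrap sample by indexing data with in_id;
# until this call, no subset of the data exists
nrow(analysis(split))
[1] 20000
length(split$in_id)
[1] 20000
```

A bootstrap analysis set has as many rows as the original data (drawn with replacement), which is exactly the length of in_id.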

The mystery of the missing MBs

So, just looking at this structure, it seems like each rsplit contains a complete copy of our original data. But somehow, to borrow the example from the rsample README, creating a 50-times bootstrap sample doesn’t require 50 times as much memory, but instead only about 2.5x:

lobstr::obj_size(LetterRecognition)
2.64 MB
set.seed(35222)
boots <- bootstraps(LetterRecognition, times = 50)
lobstr::obj_size(boots)
6.69 MB

Even that top-line result is a little misleading, though, because rsample isn’t copying the data to actually create boots. If we look at the object sizes for both the original data and the resamples together, we can see that boots is only contributing ~4 MB:

lobstr::obj_size(LetterRecognition, boots)
6.69 MB
lobstr::obj_sizes(LetterRecognition, boots)
* 2.64 MB
* 4.04 MB

So: what? How?

Copying; modifying

Well, R uses what’s known as copy-on-modify semantics. That means that, when you assign the same data to multiple variables, each of those variables will actually point at the same address in RAM:

LetterRecognition2 <- LetterRecognition

lobstr::obj_addr(LetterRecognition)
[1] "0x5573114c93e0"
lobstr::obj_addr(LetterRecognition2)
[1] "0x5573114c93e0"
identical(
  lobstr::obj_addr(LetterRecognition),
  lobstr::obj_addr(LetterRecognition2)
)
[1] TRUE

This also means that LetterRecognition2 takes up effectively zero additional space in your RAM:

lobstr::obj_size(LetterRecognition, LetterRecognition2)
2.64 MB

And that will stay true up until we modify either of these objects. No copy is made, no additional RAM gets used, until one of the objects is modified.
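
We can watch the copy actually happen by checking addresses before and after a modification. Here’s a minimal sketch with a toy vector (the specific addresses will differ on your machine, but the comparison results won’t):

```r
x <- c(1, 2, 3)
y <- x   # no copy yet: both names point at the same object

identical(lobstr::obj_addr(x), lobstr::obj_addr(y))
[1] TRUE

y[1] <- 100   # the first modification triggers the copy

identical(lobstr::obj_addr(x), lobstr::obj_addr(y))
[1] FALSE
```

Until that assignment into y, the two variables share one object; the moment y is modified, R copies it to a new address so x is left untouched.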

That also means that, right now, LetterRecognition2 is another name for the data stored in each of our rsplits:

identical(
  lobstr::obj_addr(boots$splits[[1]]$data),
  lobstr::obj_addr(LetterRecognition2)
)
[1] TRUE

And if we get rid of LetterRecognition, the object that both LetterRecognition2 and our bootstraps are based on, those objects will still point at the same address,5 and the data slot in boots still won’t take up additional space:

rm(LetterRecognition)
gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  849739 45.4    1358681 72.6  1358681 72.6
Vcells 2362850 18.1    8388608 64.0  8384745 64.0
identical(
  lobstr::obj_addr(boots$splits[[1]]$data),
  lobstr::obj_addr(LetterRecognition2)
)
[1] TRUE
lobstr::obj_sizes(LetterRecognition2, boots$splits[[1]]$data)
* 2.64 MB
*     0 B

So how does rsample keep its objects so small? By not making extra copies of your data where it doesn’t have to. This is how the entire boots table winds up only adding ~1.5x the space of the original data:

lobstr::obj_sizes(LetterRecognition2, boots)
* 2.64 MB
* 4.04 MB

And that’s pretty close to as small as this object could get – that’s just the amount of space required to store the indices (in this case, 20,000 indices per repeat, 50 repeats):

lobstr::obj_size(sample.int(20000 * 50))
4.00 MB

(The 42 kB difference is the attributes we’ve attached to each split – things like its class and ID and so on – but that’s not enough memory to matter for most applications.)

This is also, as it happens, why out_id is set to NA in our bootstrap resamples.6 Because you can figure out which observations to “hold out” for the assessment set from which ones are kept “in” for analysis, rsample doesn’t store those indices for most of its resampling methods.7
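
When out_id is NA, the held-out indices get reconstructed on demand as the complement of in_id; rsample exposes this through its complement() function. A sketch, assuming complement() returns the assessment indices in ascending order:

```r
split <- boots$splits[[1]]

# The assessment set is every row that was never drawn into in_id
out <- setdiff(seq_len(nrow(split$data)), split$in_id)

identical(rsample::complement(split), out)
```

Both approaches should yield the same ascending set of row indices, which is why storing out_id would be redundant for a plain bootstrap.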

And one last thought: if you modified LetterRecognition2, it would no longer point at the same address as the data in our splits. That’s entirely deliberate: once you start messing with your original data, your resampling indices are no longer guaranteed to correspond to the table you used to create them.

LetterRecognition2 <- NA

identical(
  lobstr::obj_addr(boots$splits[[1]]$data),
  lobstr::obj_addr(LetterRecognition2)
)
[1] FALSE

But, as best as possible, rsample will keep the rset small.

lobstr::obj_size(boots)
6.69 MB

Footnotes

  1. Plus, I’ve been writing my candidacy exam for two weeks now, and need an excuse to look at anything else for an hour.↩︎

  2. For what it’s worth, while I’m an author on rsample, I didn’t write any of the rsample features mentioned in this blog post. I believe the rsample-specific details were all written by Max Kuhn. All the copy-on-modify semantics stuff, however, is just part of R and written over the past few decades by R Core.↩︎

  3. “Analysis” maps to “training” while “assessment” maps to “testing”. “Analysis” and “assessment” are purposefully used to avoid confusion over which training and test set are being used.↩︎

  4. We’ll come back to this.↩︎

  5. As of R 4.0, as I understand it.↩︎

  6. Told ya we’d come back to it.↩︎

  7. Now, the package I maintain, spatialsample, does include out_id on its objects relatively often. Most of the time, this is because the objects were created with a non-NULL buffer, so the held-out set isn’t simply “all of the data that isn’t in”; sometimes it’s because I initially always included out_id and haven’t fixed my code to be more efficient yet. PRs welcome!↩︎