How to run R jobs across multiple (local) computers

TIL this extremely, extremely easy thing
Categories: R, Tutorials, Spatial, Data science

Published: March 3, 2023

I’ve got a small homelab going at the moment, where in addition to my daily workstation I’ve also got a NUC, Raspberry Pi, and Synology NAS running on my local network.[1] I primarily use these other machines as always-on servers and storage for things like Telegraf, Influx and Grafana, mealie and paperless, but every so often it’s useful to run a long-running, high-CPU job on the NUC rather than tying up my main workstation. In those situations, I tend to use VS Code’s Remote Server, at least for anything too complex for just ssh’ing into the NUC and running commands in the CLI.

At the moment I’m working with some extremely long-running jobs that will take a few weeks to complete and are blockers for my other work. As a result, I’m not really worried about tying up my main workstation if it means the jobs finish faster; in fact, I’d like to tie up as many computers as possible if it speeds things up at all.

In the past, I’ve tried to manually split up jobs into smaller pieces and run them independently on the different computers. This is a pain. I’ve also shifted to using targets for most analysis projects these days, in order to take advantage of its automatic DAG creation and state-saving. Manual splitting-and-execution really undermines targets’ automatic orchestration abilities, so I’ve needed to find a better way to split workloads across computers.

It turns out that better way is extremely straightforward,[2] and I’m kicking myself for not finding it earlier. Say you’ve got two machines with internal IP addresses of 192.168.1.001 and 192.168.1.002, each with a user some_user who has key-based ssh access to the other machine. If you’re using future, setting a plan to work across both computers takes two function calls:

# One R worker on each machine; `master` is the address the workers use to
# connect back to this session, and `user` is the ssh user on each host
cl <- parallel::makePSOCKcluster(
  c("192.168.1.001", "192.168.1.002"),
  master = "192.168.1.001",
  user = "some_user"
)
# Tell future to dispatch work to the cluster
future::plan(future::cluster, workers = cl)
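A quick way to confirm both machines actually joined the cluster is to ask each worker for its hostname; the furrr call below is just an illustrative sketch of what a future-backed function looks like once the plan is set, not part of my actual job:

# Ask every worker for its hostname; you should see both machines listed
parallel::clusterCall(cl, function() Sys.info()[["nodename"]])

# Any future-backed code now runs across the cluster, e.g. with furrr:
furrr::future_map(1:4, function(i) Sys.info()[["nodename"]])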

And that’s it! Any future-enabled functions you use[3] will be split across your machines. For my targets-based workflow, I just run targets::tar_make_future(workers = 2) to split the jobs up.
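For reference, here’s a minimal sketch of what that looks like in a _targets.R file; the target names and the fit_model() helper are hypothetical stand-ins for whatever your long-running steps are:

# _targets.R (illustrative sketch)
library(targets)

# Set the future plan inside _targets.R so tar_make_future() can use it
cl <- parallel::makePSOCKcluster(
  c("192.168.1.001", "192.168.1.002"),
  master = "192.168.1.001",
  user = "some_user"
)
future::plan(future::cluster, workers = cl)

list(
  # fit_model() is a hypothetical long-running function
  tar_target(model_a, fit_model("a")),
  tar_target(model_b, fit_model("b"))
)

With two independent targets and workers = 2, each machine gets one target to chew on.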

To push things further, you can also run multiple workers on a single machine by listing its address more than once:

# Listing 192.168.1.001 twice starts two worker processes on that machine
cl <- parallel::makePSOCKcluster(
  c("192.168.1.001", "192.168.1.001", "192.168.1.002"),
  master = "192.168.1.001",
  user = "some_user"
)
future::plan(future::cluster, workers = cl)
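If you want roughly one worker per core rather than typing the repeats out by hand, you can build the address vector with rep(); the core counts below are placeholders for whatever your machines actually have:

# Placeholder core counts: 4 workers on the first machine, 8 on the second
workers <- rep(
  c("192.168.1.001", "192.168.1.002"),
  times = c(4, 8)
)
cl <- parallel::makePSOCKcluster(workers, master = "192.168.1.001", user = "some_user")
future::plan(future::cluster, workers = cl)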

You can also set individual jobs to use multiple cores, either by nesting futures or by using other forms of parallelism. For instance, my current job mostly uses terra for a lot of raster predictions, so setting cores = future::availableCores() - 1 inside terra::predict() lets me more or less max out both machines I’m running on.
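As a rough sketch of that last pattern: the raster and model files here are placeholders, but the cores argument is the piece doing the work:

library(terra)

# Placeholder inputs; substitute your own covariate rasters and fitted model
covariates <- rast("covariates.tif")
model <- readRDS("fitted_model.rds")

# terra parallelizes the prediction itself, leaving one core free on the worker's machine
prediction <- predict(
  covariates,
  model,
  cores = future::availableCores() - 1
)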

Footnotes

[1] With external access via tailscale.

[2] For a value of straightforward that includes “maintaining multiple machines with similar-enough R environments, access to shared storage if necessary, and ssh access to each other on a private subnet”.

[3] Highly recommend either future.apply or furrr for ease-of-use, by the way.