Three fun R functions

All the cool kids were jumping off the bridge!
Published

October 24, 2023

Inspired by Maëlle, who was inspired by Yihui, who was inspired by Maëlle (who has a whole series about this), I wanted to share three useful base R functions that I think maybe don’t get enough love. And inspired by Maëlle again, my list here is actually four functions.

sweep()

If you ever need to do math with matrices, then sweep() is going to be your best friend. Say for instance we want to center and scale each column in a matrix. This is a pretty straightforward operation – we need to calculate the mean and standard deviation of each column, subtract the column mean from each observation, and then divide each result by the corresponding column standard deviation.

We can use apply to get our means and standard deviations:

# Generate some fake data in a 10x10 matrix:
x <- matrix(data = rnorm(100), nrow = 10)
# Calculate one mean and sd for each column of our matrix:
col_means <- apply(x, 2, mean)
col_sds <- apply(x, 2, sd)

The subtraction and division are a bit less straightforward. R’s base math operators work element-wise, and when one operand is a shorter vector they recycle it down each column of the matrix. That pairs each mean with a row rather than with its column, which is not what we want:

all.equal(
  (x - col_means) / col_sds,
  scale(x)
)
[1] "Attributes: < Length mismatch: comparison on first 1 components >"
[2] "Mean relative difference: 0.360556"                               
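To see where the values actually land, here's a minimal sketch on a tiny matrix (m and v are names made up for illustration, not part of the example above):

```r
# A 3x3 matrix of zeros makes it easy to see where each vector element lands
m <- matrix(0, nrow = 3, ncol = 3)
v <- c(1, 2, 3)

# The vector is recycled down the columns (R stores matrices column-major),
# so every column becomes 1, 2, 3
recycled <- m + v
recycled
#      [,1] [,2] [,3]
# [1,]    1    1    1
# [2,]    2    2    2
# [3,]    3    3    3

# Transposing twice is one way to pair v with columns instead
by_column <- t(t(m) + v)
by_column[1, ]
# [1] 1 2 3
```

So v[1] went to the first row, not the first column, which is why the naive version disagrees with scale().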

We could replicate our vector ourselves, in order to take advantage of these element-wise operations:

all.equal(
  ((x - matrix(rep(col_means, 10), 10, byrow = TRUE)) / 
    matrix(rep(col_sds, 10), 10, byrow = TRUE)) |> as.vector(),
  scale(x) |> as.vector()
)
[1] TRUE

But that’s silly, especially if we were working with more observations.

A better option is sweep(), which applies an operation between each column of the matrix and the corresponding element of our vector:

# Take every value in our matrix, and subtract its corresponding column mean:
centered <- sweep(
  x = x, 
  MARGIN = 2, # just like in apply()
  STATS = col_means, 
  FUN = "-" # "-" is the default argument -- we don't NEED to provide it here
)

And we can similarly use sweep() to divide each column by its corresponding standard deviation, finishing up our centering and scaling:

# Divide each value by its corresponding column sd:
centered_and_scaled <- sweep(centered, 2, col_sds, "/")

# Works out identically to the built-in scale function:
all.equal(
  as.vector(centered_and_scaled),
  as.vector(scale(x))
)
[1] TRUE
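The same pattern works across rows if you switch MARGIN to 1. As a small sketch with made-up counts (counts and row_totals are illustrative names, not from the example above), this turns each row into proportions:

```r
# Made-up counts: 3 observations (rows) by 2 categories (columns)
counts <- matrix(c(1, 3,
                   2, 2,
                   5, 5), nrow = 3, byrow = TRUE)
row_totals <- rowSums(counts)

# MARGIN = 1 pairs each element of STATS with a row instead of a column
proportions <- sweep(counts, MARGIN = 1, STATS = row_totals, FUN = "/")
rowSums(proportions)
# [1] 1 1 1
```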

Centering and scaling like this is the main way I use sweep(), but there’s no requirement that you use it for math – it works just as well with non-mathematical functions or non-numeric matrices:

letter_mat <- matrix(rep(letters[1:5], 5), 5)
letter_mat
     [,1] [,2] [,3] [,4] [,5]
[1,] "a"  "a"  "a"  "a"  "a" 
[2,] "b"  "b"  "b"  "b"  "b" 
[3,] "c"  "c"  "c"  "c"  "c" 
[4,] "d"  "d"  "d"  "d"  "d" 
[5,] "e"  "e"  "e"  "e"  "e" 
sweep(letter_mat, 2, LETTERS[1:5], paste0)
 [1] "aA" "bA" "cA" "dA" "eA" "aB" "bB" "cB" "dB" "eB" "aC" "bC" "cC" "dC" "eC"
[16] "aD" "bD" "cD" "dD" "eD" "aE" "bE" "cE" "dE" "eE"
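Notice that the result came back as a plain vector rather than a matrix: paste0() drops the dimensions of its inputs. If you want the matrix shape back, one option (a sketch, certainly not the only way) is to copy the dimensions over afterwards:

```r
letter_mat <- matrix(rep(letters[1:5], 5), 5)
pasted <- sweep(letter_mat, 2, LETTERS[1:5], paste0)

# paste0() returned a flat character vector, so copy the dimensions back over
dim(pasted) <- dim(letter_mat)
pasted[, 1]
# [1] "aA" "bA" "cA" "dA" "eA"
```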

reformulate() and DF2formula()

The reformulate() function is a lifesaver if you’re trying to write long or complicated formulas, or multiple formulas generated by some other logic in your code.

The function is pretty straightforward. If you’re trying to make a formula y ~ x + z, provide your predictors as the first argument and your outcome as the second:

reformulate(c("x", "z"), "y")
y ~ x + z

The nice thing is that reformulate accepts vectors as inputs, making it easy to construct a vector of predictors and automatically turn them into a formula:

reformulate(letters, "outcome")
outcome ~ a + b + c + d + e + f + g + h + i + j + k + l + m + 
    n + o + p + q + r + s + t + u + v + w + x + y + z
reformulate(names(Orange), "age")
age ~ Tree + age + circumference

And in particular, this is an excellent alternative to dropping a few columns in order to use outcome ~ . – instead, you can use setdiff() to exclude those columns from your formula:

outcome_variable <- "age"
reformulate(setdiff(names(Orange), outcome_variable), outcome_variable)
age ~ Tree + circumference
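This also makes it easy to generate several formulas from a list of predictor sets and fit a model for each. Here's a rough sketch using the built-in Orange data; the predictor sets and the choice of lm() are just assumptions for illustration:

```r
# Two candidate predictor sets for the same outcome (chosen just for illustration)
predictor_sets <- list("age", c("age", "Tree"))
formulas <- lapply(predictor_sets, reformulate, response = "circumference")
vapply(formulas, deparse, character(1))
# [1] "circumference ~ age"        "circumference ~ age + Tree"

# Fit one linear model per formula against the Orange data frame
fits <- lapply(formulas, function(f) lm(f, data = Orange))

# Compare the fits, here with AIC
sapply(fits, AIC)
```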

Relatedly, the function DF2formula() will automatically turn the column names from a data frame into a formula. The first column will become the outcome variable, and the rest will be used as predictors:

DF2formula(Orange)
Tree ~ age + circumference

To change what column is used as the outcome variable, reorder the columns in your data frame:

DF2formula(Orange[3:1])
circumference ~ age + Tree
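If you'd rather pick the outcome by name than by position, the setdiff() trick from earlier works here too. A minimal sketch (reordered is a made-up variable name):

```r
outcome_variable <- "circumference"

# Put the outcome column first, followed by every other column in its original order
reordered <- Orange[c(outcome_variable, setdiff(names(Orange), outcome_variable))]
DF2formula(reordered)
# circumference ~ Tree + age
```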

str2lang()

Shockingly enough, the str2lang() function turns a string into a language object:

growth_rate <- "circumference / age"
str2lang(growth_rate)
circumference/age
class(str2lang(growth_rate))
[1] "call"

Wooooo!

I think that, to most people, this does not sound immediately useful.¹ But the idea that your code can turn plain text into code at runtime is pretty powerful, and it’s some of the most R-esque nonsense that R has to offer.

For instance, we can use eval() to actually execute the call created by str2lang() in our global environment:

eval(str2lang("2 + 2"))
[1] 4
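Because the result of str2lang() is an ordinary call object, you can also poke at it, or change it, before evaluating anything. A small sketch (growth_call is a name made up for this example):

```r
growth_call <- str2lang("circumference / age")

# A call behaves like a list: the function comes first, then its arguments
as.list(growth_call)

# Swap the division for multiplication before evaluating anything
growth_call[[1]] <- as.name("*")
deparse(growth_call)
# [1] "circumference * age"

# Evaluate the modified call using the columns of Orange as variables
head(eval(growth_call, Orange), 3)
```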

And that string can do anything that a regular R expression can do – assign variables, manage connections, you name it:

eval(str2lang("x <- 3"))
x
[1] 3
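One caveat: str2lang() expects a string that holds exactly one expression. For several statements, its sibling str2expression() takes a character vector and returns an expression vector. Here's a sketch that runs two steps inside a scratch environment (scratch is a made-up name), so nothing in your workspace gets overwritten:

```r
scratch <- new.env()
steps <- str2expression(c("a <- 1", "b <- a + 1"))

# eval() runs each expression in turn inside scratch
eval(steps, envir = scratch)
scratch$b
# [1] 2
```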

We can also use this with with() or local() to execute our code inside of other environments. For instance, if we want to calculate our growth_rate from earlier, we can run that code with the Orange data frame:

with(Orange, eval(str2lang(growth_rate)))
 [1] 0.25423729 0.11983471 0.13102410 0.11454183 0.09748172 0.10349854
 [7] 0.09165613 0.27966102 0.14256198 0.16716867 0.15537849 0.13972380
[13] 0.14795918 0.12831858 0.25423729 0.10537190 0.11295181 0.10756972
[19] 0.09341998 0.10131195 0.08849558 0.27118644 0.12809917 0.16867470
[25] 0.16633466 0.14541024 0.15233236 0.13527181 0.25423729 0.10123967
[31] 0.12198795 0.12450199 0.11535337 0.12682216 0.11188369

This can be a powerful way to “import” code from other sources, for instance if you have a CSV of equations you want to run against a data frame. You want to be careful when using this with untrusted inputs, of course – if your input includes a call to system(), it might wind up wrecking your computer!
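As a rough sketch of that idea, the code below stands in a small data frame for the CSV and checks every name in each equation against an allowlist before evaluating it. The equations table, allowed_functions, and safe_eval() are all hypothetical names, and the allowlist is an illustration rather than a real security boundary:

```r
# Stand-in for a CSV of equations (hypothetical names and formulas)
equations <- data.frame(
  name = c("growth_rate", "log_circumference"),
  equation = c("circumference / age", "log(circumference)"),
  stringsAsFactors = FALSE
)

# Illustrative allowlist of functions and operators; extend it to suit your data
allowed_functions <- c("+", "-", "*", "/", "^", "(", "log", "exp", "sqrt")

# Parse a string, refuse anything outside the allowlist, then evaluate it
# using the data frame's columns as variables
safe_eval <- function(text, data) {
  expr <- str2lang(text)
  unknown <- setdiff(all.names(expr), c(allowed_functions, names(data)))
  if (length(unknown) > 0) {
    stop("Refusing to evaluate; unexpected names: ", paste(unknown, collapse = ", "))
  }
  # baseenv() as the enclosure keeps lookups away from the global workspace
  eval(expr, envir = data, enclos = baseenv())
}

results <- lapply(equations$equation, safe_eval, data = Orange)
names(results) <- equations$name
head(results$growth_rate, 3)

# A string that tries to shell out is rejected before it ever runs
try(safe_eval("system('echo oops')", Orange))
```

Checking names with all.names() is only a first line of defense; for inputs you really don't trust, not evaluating them at all is the safer design.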

Footnotes

  1. I think, to most people, this barely sounds like English.