Introduction to programming with R

Loops in R

Questions

How can I do the same thing multiple times more efficiently in R?

What is vectorization?

Should I use a loop or an apply statement?

Objectives

Compare loops and vectorized operations.
Use the apply family of functions.

Keypoints

Where possible, use vectorized operations instead of for loops to make code faster and more concise.

Use functions such as apply instead of for loops to operate on the values in a data structure.

In R you have multiple options when repeating calculations: vectorized operations, for loops, and apply functions.

This lesson is an extension of Analyzing Multiple Data Sets. In that lesson, we introduced how to run a custom function, analyze, over multiple data files:

analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}

filenames <- list.files(path = "data", pattern = "inflammation.*.csv", full.names = TRUE)

Vectorized Operations

A key difference between R and many other languages is a topic known as vectorization. When you wrote the total function, we mentioned that R already has sum to do this; sum is much faster than the interpreted for loop because sum is coded in C to work with a vector of numbers. Many of R’s functions work this way; the loop is hidden from you in C. Learning to use vectorized operations is a key skill in R.

For example, to add pairs of numbers contained in two vectors

a <- 1:10
b <- 1:10

You could loop over the pairs adding each in turn, but that would be very inefficient in R.

Instead of using i in a to make our loop variable, we use the function seq_along to generate indices for each element a contains.

res <- numeric(length = length(a))
for (i in seq_along(a)) {
  res[i] <- a[i] + b[i]
}
res

 [1]  2  4  6  8 10 12 14 16 18 20

Instead, + is a vectorized function which can operate on entire vectors at once

res2 <- a + b
all.equal(res, res2)

[1] TRUE

Vector Recycling

When performing vector operations in R, it is important to know about recycling. If you perform an operation on two or more vectors of unequal length, R will recycle elements of the shorter vector(s) to match the longest vector. For example:

a <- 1:10
b <- 1:5
a + b

 [1]  2  4  6  8 10  7  9 11 13 15

The elements of a and b are added together starting from the first element of both vectors. When R reaches the end of the shorter vector b, it starts again at the first element of b and continues until it reaches the last element of the longest vector a. This behaviour may seem crazy at first glance, but it is very useful when you want to perform the same operation on every element of a vector. For example, say we want to multiply every element of our vector a by 5:

a <- 1:10
b <- 5
a * b

 [1]  5 10 15 20 25 30 35 40 45 50

Remember there are no scalars in R, so b is actually a vector of length 1; in order to add its value to every element of a, it is recycled to match the length of a.

When the length of the longer object is a multiple of the shorter object length (as in our example above), the recycling occurs silently. When the longer object length is not a multiple of the shorter object length, a warning is given:

a <- 1:10
b <- 1:7
a + b

Warning in a + b: longer object length is not a multiple of shorter object
length

 [1]  2  4  6  8 10 12 14  9 11 13

`for` or `apply`?

A for loop is used to apply the same function calls to a collection of objects. R has a family of functions, the apply family, which can be used in much the same way. You’ve already used one of the family, apply in the first lesson. The apply family members include

apply - apply over the margins of an array (e.g. the rows or columns of a matrix)
lapply - apply over an object and return list
sapply - apply over an object and return a simplified object (an array) if possible
vapply - similar to sapply but you specify the type of object returned by the iterations

Each of these has an argument FUN which takes a function to apply to each element of the object. Instead of looping over filenames and calling analyze, as you did earlier, you could sapply over filenames with FUN = analyze:

sapply(filenames, FUN = analyze)

Deciding whether to use for or one of the apply family is really personal preference. Using an apply family function forces to you encapsulate your operations as a function rather than separate calls with for. for loops are often more natural in some circumstances; for several related operations, a for loop will avoid you having to pass in a lot of extra arguments to your function.

Loops in R Are Slow

No, they are not! If you follow some golden rules:

Don’t use a loop when a vectorized alternative exists
Don’t grow objects (via c, cbind, etc) during the loop - R has to create a new object and copy across the information just to add a new element or row/column
Allocate an object to hold the results and fill it in during the loop

As an example, we’ll create a new version of analyze that will return the mean inflammation per day (column) of each file.

analyze2 <- function(filenames) {
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    res <- apply(fdata, 2, mean)
    if (f == 1) {
      out <- res
    } else {
      # The loop is slowed by this call to cbind that grows the object
      out <- cbind(out, res)
    }
  }
  return(out)
}

system.time(avg2 <- analyze2(filenames))

   user  system elapsed 
  0.076   0.004   0.149

Note how we add a new column to out at each iteration? This is a cardinal sin of writing a for loop in R.

Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results. Then we loop over the files but this time we fill in the fth column of our results matrix out. This time there is no copying/growing for R to deal with.

analyze3 <- function(filenames) {
  out <- matrix(ncol = length(filenames), nrow = 40) ## assuming 40 here from files
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    out[, f] <- apply(fdata, 2, mean)
  }
  return(out)
}

system.time(avg3 <- analyze3(filenames))

   user  system elapsed 
  0.050   0.002   0.054

In this simple example there is little difference in the compute time of analyze2 and analyze3. This is because we are only iterating over 12 files and hence we only incur 12 copy/grow operations. If we were doing this over more files or the data objects we were growing were larger, the penalty for copying/growing would be much larger.

Note that apply handles these memory allocation issues for you, but then you have to write the loop part as a function to pass to apply. At its heart, apply is just a for loop with extra convenience.