8 Basic Automation with R

8.1 Introduction and goals

The dplyr library provides many useful functions, such as select(), filter(), and mutate(), that cover the vast majority of use cases, but real data is messy. Sometimes you may know what to do but not how to do it with existing functions. This tutorial will teach you some fundamental building blocks of scripting so that you can produce the “how” when you know the “what.”

A warning, though: if using dplyr is driving a car, then this tutorial is teaching you how to walk. While more locations are reachable by walking, you will always get there much faster by driving. Look for an existing tool before hacking together your own.

8.2 Prerequisites

Please install and load the following R libraries:

library(tidyverse)
library(kableExtra)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.6     ✔ stringr 1.4.0
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()          masks stats::filter()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag()             masks stats::lag()

8.3 Making a function

In math, we specify functions like so: \(f(x) = 2x\), where the function \(f\) takes the parameter \(x\) and performs the action \(2x\) before handing back the result. In R, the same would look like this…

f <- function(x) {
  return(2 * x)
}

…and it can be used like so.

f(2)
## [1] 4
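
Functions can also take more than one parameter, and a parameter can have a default value. The sketch below is only an illustration: the name scaleBy and its factor parameter are hypothetical and do not appear elsewhere in this tutorial.

```r
# Hypothetical helper: multiply x by a factor that defaults to 2.
scaleBy <- function(x, factor = 2) {
  return(x * factor)
}

doubled <- scaleBy(3)               # uses the default factor
tripled <- scaleBy(3, factor = 3)   # overrides the default
vectorised <- scaleBy(c(1, 2, 3))   # arithmetic in R works element by element

print(doubled)
print(tripled)
print(vectorised)
```

Because arithmetic in R is element-wise, the same function works on a single number or on a whole vector.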

8.4 Decisions with conditionals

An if statement executes code selectively, based on a condition.

doubleIfLessThan <- function(x, threshold) {
  if (x < threshold) {
    return(2 * x)
  } else {
    return(x) # if no condition satisfied
  }
}

print(doubleIfLessThan(12, 100)) # 12 < 100, so doubled 12
## [1] 24
print(doubleIfLessThan(12, 10)) # 12 ≮ 10, so just give back 12
## [1] 12

More than one condition can be used…

doubleOutsideRange <- function(x, lower, upper) {
  if (x < lower) {
    return(2 * x)
  } else if (x > upper) {
    return(2 * x)
  } else {
    return(x)
  }
}

print(doubleOutsideRange(5, 10, 20))
## [1] 10
print(doubleOutsideRange(5, 0, 2))
## [1] 10
print(doubleOutsideRange(5, 0, 10))
## [1] 5

…and conditions can be combined with logical operators.

doubleIfInRange <- function(x, lower, upper) {
  if (x >= lower && x <= upper) { # && is the same as AND
    return(2 * x)
  } else {
    return(x)
  }
}

print(doubleIfInRange(5, 10, 20))
## [1] 5
print(doubleIfInRange(5, 0, 2))
## [1] 5
print(doubleIfInRange(5, 0, 10))
## [1] 10

Operator  Description
&&        AND, both sides must be TRUE
||        OR, at least one side must be TRUE
!         NOT, inverts the following value (TRUE becomes FALSE and FALSE becomes TRUE)
==        EQUALS, left must equal right

Try the following

1==1
## [1] TRUE
1==1 && 2==1
## [1] FALSE
!(2==1) || FALSE
## [1] TRUE
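
The functions above use if and &&, which work on a single TRUE or FALSE value, so they are meant to be called with one number at a time. To apply the same rule to a whole vector, one option is the element-wise & operator together with ifelse(). The following is a minimal sketch; doubleIfInRangeVec is a hypothetical name, not part of the tutorial.

```r
# Hypothetical vectorised variant of doubleIfInRange().
doubleIfInRangeVec <- function(x, lower, upper) {
  inRange <- x >= lower & x <= upper   # & compares element by element
  return(ifelse(inRange, 2 * x, x))    # pick 2 * x where inRange is TRUE, else x
}

result <- doubleIfInRangeVec(c(1, 5, 12), 0, 10)
print(result)
```

Here 1 and 5 fall inside the range and are doubled, while 12 is returned unchanged.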

8.5 Iterating with loops

Suppose we wanted to double each number in a list.

numbers <- read.csv(file="data/numbers.csv", header=TRUE)
numbers
##     n
## 1   1
## 2   2
## 3   3
## 4   4
## 5   5
## 6   6
## 7   7
## 8   8
## 9   9
## 10 10

We can use a for loop. This is how it behaves…

for (i in 1:5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

…and this is how we use it in our context.

for (rowNumber in seq_len(nrow(numbers))) { # seq_len() also copes with zero rows
  n <- numbers[rowNumber, "n"] # get value from row "rowNumber", column "n"
  print(2 * n)
}
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18
## [1] 20
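
Printing is useful for checking a loop, but usually the results need to be kept. The sketch below stores each doubled value in a vector, and then shows the loop-free equivalent for comparison. It assumes a small stand-in data frame in place of data/numbers.csv so that it runs on its own.

```r
# Stand-in for data/numbers.csv so the example is self-contained.
numbers <- data.frame(n = 1:10)

# Pre-allocate a result vector, then fill it one row at a time.
doubled <- numeric(nrow(numbers))
for (rowNumber in seq_len(nrow(numbers))) {
  doubled[rowNumber] <- 2 * numbers[rowNumber, "n"]
}

# The same result without a loop: arithmetic applies to the whole column.
doubledVectorised <- 2 * numbers$n

print(doubled)
print(doubledVectorised)
```

For a simple calculation like this, the vectorised form is shorter and faster; the loop becomes worthwhile when each step depends on more involved logic.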

If the number of repetitions is not known in advance, we can use a while loop instead. It keeps repeating for as long as its condition is TRUE.

n <- 1
while (n < 1000) {
  n <- n * 2
}

n
## [1] 1024
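
To see how many times such a loop actually ran, a counter can be updated inside it. This is a small sketch; the variable name doublings is an illustrative choice.

```r
n <- 1
doublings <- 0
while (n < 1000) {
  n <- n * 2
  doublings <- doublings + 1   # count each pass through the loop
}

print(n)
print(doublings)
```

The loop stops the first time n reaches at least 1000, which happens after ten doublings (2^10 = 1024).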

A loop whose condition never becomes FALSE will repeat forever. If that happens, click the red octagon (🛑) at the top right of your RStudio console to force the script to stop.

while (TRUE) {
  # loop forever
}
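
One defensive habit is to cap the number of iterations so that a mistake cannot hang the script. The sketch below assumes a hypothetical limit of 100; the loop would otherwise never end, because doubling zero never reaches 1000.

```r
n <- 0                 # doubling zero stays at zero, so n < 1000 is always TRUE
maxIterations <- 100   # hypothetical safety limit
iterations <- 0

while (n < 1000) {
  if (iterations >= maxIterations) {
    message("Stopping after ", maxIterations, " iterations")
    break              # leave the loop even though the condition is still TRUE
  }
  n <- n * 2
  iterations <- iterations + 1
}

print(iterations)
```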

8.6 Reading files

Files can be opened and read one line at a time like so.

numbersFileConnection <- file("data/numbers.csv", "r") # r is for read

while (TRUE) {
  line <- readLines(numbersFileConnection, n = 1) # read 1 line
  if (length(line) == 0) { # if no more lines ...
    break # ... exit loop
  }
  print(line)
}
## [1] "n"
## [1] "1"
## [1] "2"
## [1] "3"
## [1] "4"
## [1] "5"
## [1] "6"
## [1] "7"
## [1] "8"
## [1] "9"
## [1] "10"
close(numbersFileConnection)
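
Each line arrives as text, so it has to be converted before doing arithmetic with it. The following sketch sums the numbers while reading line by line. It writes a small temporary file as a stand-in for data/numbers.csv so that it is self-contained; the variable names are illustrative.

```r
# Create a stand-in file with a header and the numbers 1 to 10.
path <- tempfile(fileext = ".csv")
writeLines(c("n", as.character(1:10)), path)

con <- file(path, "r")
header <- readLines(con, n = 1)   # read the header once so it is not summed
total <- 0
while (TRUE) {
  line <- readLines(con, n = 1)
  if (length(line) == 0) {        # no more lines
    break
  }
  total <- total + as.numeric(line)   # convert text to a number first
}
close(con)

print(total)
```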

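The second argument to file() specifies the mode in which the connection is opened, for example "r" for reading or "w" for writing. A connection opened for reading cannot be written to.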

con <- file("data/numbers.csv", "r")
write("some new text", file = con)
## Error in cat(x, file = file, sep = c(rep.int(sep, ncolumns - 1), "\n"), : cannot write to this connection
close(con) # release the read-only connection before reusing the name

Files can also be written to one line at a time like so.

con <- file("data/myNumbers.csv", "w")

write("n", con) # header
for (i in 1:10) {
  line <- toString(i + 10)
  write(line, con)
}

close(con)
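
Opening a file with "w" starts it from empty, so any existing contents are lost. To add lines to the end of an existing file instead, the connection can be opened in append mode, "a". The sketch below uses a temporary file rather than the data folder so that it can be run safely.

```r
path <- tempfile(fileext = ".csv")

con <- file(path, "w")   # "w" creates the file, or empties an existing one
write("n", con)
write("1", con)
close(con)

con <- file(path, "a")   # "a" keeps the existing lines and adds to the end
write("2", con)
close(con)

contents <- readLines(path)
print(contents)
```

Reading the file back shows the header and both numbers, confirming that the second connection did not overwrite the first two lines.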

Check your data folder for the new file.

Reading and writing CSVs is much better done with read.csv() and write.csv(), but suppose you come across this file…

con <- file("data/brokenNumbers.csv","r")
lines <- readLines(con) # with no n given, this reads every line
for (i in 1:10) {
  print(lines[i])
}
## [1] "n, comment, half_n, 2n"
## [1] "1, ofcourse, odd, 0.5, 2"
## [1] "2, 1.0, 4"
## [1] "3, very odd, 1.5, 6"
## [1] "4, 2.0, 8"
## [1] "5, yes, odd, 2.5, 10"
## [1] "6, 3.0, 12"
## [1] "7, certainly odd, 3.5, 14"
## [1] "8, 4.0, 16"
## [1] "9, odd, 4.5, 18"
close(con)

…seems innocent enough, but when trying to load it as a csv…

isOdd <- read.csv(file = "data/brokenNumbers.csv", header = TRUE)
isOdd %>% slice(50:60)
##                          n             comment half_n  X2n
## 50                    25.0                 100          NA
## 51           very much odd                25.5    102   NA
## 52                    26.0                 104          NA
## 53                not even                26.5    106   NA
## 54                    27.0                 108          NA
## 55                   it is  if I were to guess    odd 27.5
##  110                                                    NA
## 56                    28.0                 112          NA
## 57    pretty sure it's odd                28.5    114   NA
## 58                    29.0                 116          NA
## 59                very odd                29.5    118   NA

8.7 Practice 1

…its format is obviously broken. Investigate the file and see if you can fix it. The documentation for strsplit() (base R) and str_split() (stringr) may help.
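
One way to start investigating is to count how many cells each line splits into. The sketch below copies the first ten lines printed earlier into a vector so that it runs on its own; the variable names are illustrative.

```r
# First ten lines of data/brokenNumbers.csv, as printed earlier.
lines <- c(
  "n, comment, half_n, 2n",
  "1, ofcourse, odd, 0.5, 2",
  "2, 1.0, 4",
  "3, very odd, 1.5, 6",
  "4, 2.0, 8",
  "5, yes, odd, 2.5, 10",
  "6, 3.0, 12",
  "7, certainly odd, 3.5, 14",
  "8, 4.0, 16",
  "9, odd, 4.5, 18"
)

cells <- strsplit(lines, ", ", fixed = TRUE)   # one character vector per line
fieldCounts <- lengths(cells)                  # number of cells on each line

print(fieldCounts)
print(table(fieldCounts))
```

The header has four fields, but the data lines have three, four, or five, which suggests that the comment column is sometimes missing and sometimes contains a comma of its own.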

con <- file("data/brokenNumbers.csv", "r")
outCon <- file("data/fixedNumbers.csv", "w")

headers <- readLines(con, n = 1)
write(headers, outCon) # copy the header line across unchanged
while (TRUE) {
  row <- readLines(con, n = 1)
  if (length(row) == 0) {
    break
  }
  
  fixedRow <- ""
  
  # split row into cells
  cells <- str_split(row, ", ")[[1]] # str_split() comes from stringr
  
  # look at cells by index like so
  n <- cells[1]
  twoN <- cells[length(cells)]
  # ...
  
  # join strings using paste(); sep is placed between each pair of arguments
  fixedRow <- paste(n, twoN, sep = ", ") # ...

  write(fixedRow, outCon)
}

close(con)
close(outCon)
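
When joining cells back together, it helps to know that paste() puts a space between its arguments unless told otherwise. The sketch below compares three ways of joining strings; the values are made up for illustration.

```r
spaced <- paste("1", ", ", "2")         # default sep = " " adds extra spaces
joined <- paste("1", "2", sep = ", ")   # sep is placed between the arguments
collapsed <- paste(c("1", "odd", "0.5"), collapse = ", ")   # join one vector

print(spaced)
print(joined)
print(collapsed)
```

The collapse form is convenient when the cells are already in a single vector, as they are after splitting a row.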

8.8 Practice 2

TreeSAPP’s assign function wants a single FASTA file, but all your data has been split into separate bins by a previous taxonomic classifier! See if you can append all of them together into one file.

for (f in list.files("data/dummyData", full.names = TRUE)) { # full.names keeps the folder in each path
  print(f)
  # append files
}
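
One possible approach, sketched under assumptions, is to open a single output connection and copy every line of every bin into it. The folder and file names below are hypothetical stand-ins created in a temporary directory, not the real contents of data/dummyData.

```r
# Build a hypothetical stand-in folder with two tiny FASTA files.
binDir <- file.path(tempdir(), "dummyBins")
dir.create(binDir, showWarnings = FALSE)
writeLines(c(">seq1", "ACGT"), file.path(binDir, "bin1.fasta"))
writeLines(c(">seq2", "TTGA"), file.path(binDir, "bin2.fasta"))

combinedPath <- tempfile(fileext = ".fasta")
out <- file(combinedPath, "w")
for (f in list.files(binDir, full.names = TRUE)) {
  writeLines(readLines(f), out)   # copy every line of this bin to the output
}
close(out)

combined <- readLines(combinedPath)
print(combined)
```

Keeping the output file outside the folder being listed avoids reading the combined file back in as if it were another bin.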