8 Basic Automation with R
8.1 Introduction and goals
The dplyr library provides many useful functions, such as select(), filter(), and mutate(), that cover the vast majority of use cases, but real data is messy. Sometimes you may know what to do but not how to do it with existing functions. This tutorial will teach you some fundamental building blocks of scripting so that you can produce the “how” when you know the “what.”
A warning, though: if using dplyr is driving a car, then this tutorial is teaching you how to walk. While more locations are reachable by walking, you will almost always get there faster by driving. Try to find an existing tool first before hacking together your own.
8.2 Prerequisites
Please download the following:
R libraries
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.6     ✔ stringr 1.4.0
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()          masks stats::filter()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag()             masks stats::lag()
8.3 Making a function
In math, we specify functions like so: \(f(x) = 2x\), where the function \(f\) takes the parameter \(x\) and performs the action \(2x\) before handing back the result. In R, the same would look like this…
f <- function(x) {
  return(2 * x)
}
…and it can be used like so.
f(2)
## [1] 4
8.4 Decisions with conditionals
if allows selective execution based on a condition.
doubleIfLessThan <- function(x, threshold) {
  if (x < threshold) {
    return(2 * x)
  } else {
    return(x) # if no condition satisfied
  }
}
print(doubleIfLessThan(12, 100)) # 12 < 100, so 12 is doubled
## [1] 24
print(doubleIfLessThan(12, 10)) # 12 ≮ 10, so just give back 12
## [1] 12
More than one condition can be used…
doubleOutsideRange <- function(x, lower, upper) {
  if (x < lower) {
    return(2 * x)
  } else if (x > upper) {
    return(2 * x)
  } else {
    return(x)
  }
}
print(doubleOutsideRange(5, 10, 20))
## [1] 10
print(doubleOutsideRange(5, 0, 2))
## [1] 10
print(doubleOutsideRange(5, 0, 10))
## [1] 5
…and conditions can be combined with logical operators
doubleIfInRange <- function(x, lower, upper) {
  if (x >= lower && x <= upper) { # && is the same as AND
    return(2*x)
  } else {
    return(x)
  }
}
print(doubleIfInRange(5, 10, 20))
## [1] 5
print(doubleIfInRange(5, 0, 2))
## [1] 5
print(doubleIfInRange(5, 0, 10))
## [1] 10
| Operator | Description |
|---|---|
| && | AND, both sides must be TRUE |
| || | OR, at least one side must be TRUE |
| ! | NOT, inverts the following (TRUE becomes FALSE and FALSE becomes TRUE) |
| == | EQUALS, left must equal right |
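To see the operators from the table working together, here is a minimal sketch. The function name isBetween() is made up for this example. It also shows the single-character & operator, which is not in the table: & compares vectors element by element, whereas && expects a single TRUE or FALSE on each side.

```r
# isBetween() is a hypothetical helper, not part of any package
isBetween <- function(x, lower, upper) {
  # && needs one TRUE/FALSE on each side, so x should be a single number
  return(x >= lower && x <= upper)
}

inside <- isBetween(5, 0, 10)     # TRUE: 5 is between 0 and 10
outside <- !isBetween(5, 10, 20)  # TRUE: NOT inverts the FALSE result

# & works element by element, so it can test a whole vector at once
values <- c(1, 5, 12)
inRange <- values >= 0 & values <= 10
print(inRange)
```

As a rule of thumb, use && and || inside if conditions, and & and | when working with whole columns or vectors.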
Try the following
1==1
## [1] TRUE
1==1 && 2==1
## [1] FALSE
!(2==1) || FALSE
## [1] TRUE
8.5 Iterating with loops
Suppose we wanted to double each number in a list.
numbers <- read.csv(file="data/numbers.csv", header=TRUE)
numbers
##     n
## 1   1
## 2   2
## 3   3
## 4   4
## 5   5
## 6   6
## 7   7
## 8   8
## 9   9
## 10 10
We can use a for loop. This is how it behaves…
for (i in 1:5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
…and this is how we use it in our context
for (rowNumber in 1:nrow(numbers)) {
  n <- numbers[rowNumber, "n"] # get value from row "rowNumber", column "n"
  print(2 * n)
}
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18
## [1] 20
If the number of repeats is unknown, we can use a while loop. Instead of a fixed range, a condition tells it when to stop.
n <- 1
while (n < 1000) {
  n <- n * 2
}
n
## [1] 1024
Loops can repeat forever. Click the red octagon (🛑) at the top right of your RStudio console to force the script to stop.
while (TRUE) {
  # loop forever
}
8.6 Reading files
Files can be opened and read one line at a time like so.
numbersFileConnection <- file("data/numbers.csv", "r") # r is for read
while (TRUE) {
  line <- readLines(numbersFileConnection, n = 1) # read 1 line
  if (length(line) == 0) { # if no more lines ...
    break # ... exit loop
  }
  print(line)
}
## [1] "n"
## [1] "1"
## [1] "2"
## [1] "3"
## [1] "4"
## [1] "5"
## [1] "6"
## [1] "7"
## [1] "8"
## [1] "9"
## [1] "10"
close(numbersFileConnection)
The second argument to file() sets the mode the connection is opened in: "r" for reading, "w" for writing. Writing to a connection that was opened with "r" fails.
## Error in cat(x, file = file, sep = c(rep.int(sep, ncolumns - 1), "\n"), : cannot write to this connection
Files can also be written to one line at a time like so.
con <- file("data/myNumbers.csv", "w")
write("n", con) # header
for (i in 1:10) {
  line <- toString(i+10)
  write(line, con)
}
close(con)
Check your data folder for the new file.
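The read and write modes described above can be tried out safely on a temporary file. The sketch below is only an illustration: it writes to a throwaway path from tempfile() instead of the data folder, reads the result back with read.csv(), and then uses tryCatch() to capture the error that appears when writing to a connection opened with "r". The variable names are chosen for this example.

```r
# tempfile() gives a throwaway path, so nothing in data/ is touched
path <- tempfile(fileext = ".csv")

# open for writing ("w"), write a header and three values, then close
con <- file(path, "w")
write("n", con)
for (i in 1:3) {
  write(toString(i * 10), con)
}
close(con)

# read the file back to confirm what was written
written <- read.csv(path, header = TRUE)
print(written$n)

# a connection opened with "r" is read-only, so writing to it fails;
# tryCatch() turns the error into a message we can inspect
readOnlyCon <- file(path, "r")
result <- tryCatch({
  write("99", readOnlyCon)
  "write succeeded"
}, error = function(e) {
  conditionMessage(e)
})
close(readOnlyCon)
print(result)
```

Closing every connection you open, as done here with close(), keeps R from running out of available connections in longer scripts.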
Reading and writing CSVs is better done with read.csv() and write.csv(), but suppose you come across this file…
## [1] "n, comment, half_n, 2n"
## [1] "1, ofcourse, odd, 0.5, 2"
## [1] "2, 1.0, 4"
## [1] "3, very odd, 1.5, 6"
## [1] "4, 2.0, 8"
## [1] "5, yes, odd, 2.5, 10"
## [1] "6, 3.0, 12"
## [1] "7, certainly odd, 3.5, 14"
## [1] "8, 4.0, 16"
## [1] "9, odd, 4.5, 18"
close(con)
…seems innocent enough, but when trying to load it as a csv…
##                          n             comment half_n  X2n
## 50                    25.0                 100          NA
## 51           very much odd                25.5    102   NA
## 52                    26.0                 104          NA
## 53                not even                26.5    106   NA
## 54                    27.0                 108          NA
## 55                   it is  if I were to guess    odd 27.5
##  110                                                    NA
## 56                    28.0                 112          NA
## 57    pretty sure it's odd                28.5    114   NA
## 58                    29.0                 116          NA
## 59                very odd                29.5    118   NA
8.7 Practice 1
…its format is obviously broken. Investigate the file and see if you can fix it.
Documentation for strsplit()
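As a warm-up before the exercise, here is a minimal sketch of splitting one line into cells with strsplit() and joining pieces back together with paste(). The row used here is a made-up string shaped like the lines of the broken file, not data read from it.

```r
# a made-up row, similar in shape to the lines of the broken file
row <- "3, very odd, 1.5, 6"

# strsplit() returns a list with one element per input string,
# so [[1]] pulls out the character vector of cells for our single row
cells <- strsplit(row, ", ")[[1]]
print(length(cells))

first <- cells[1]
last <- cells[length(cells)]

# paste() puts a space between pieces by default; sep changes the separator
joined <- paste(first, last, sep = ", ")
print(joined)
```

Counting the cells in each row is one way to tell well-formed rows apart from rows that carry extra pieces.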
con <- file("data/brokenNumbers.csv", "r")
outCon <- file("data/fixedNumbers.csv", "w")
headers <- readLines(con, n = 1)
while (TRUE) {
  row <- readLines(con, n = 1)
  if (length(row) == 0) {
    break
  }
  
  fixedRow <- ""
  
  # split row into cells
  cells <- str_split(row, ", ")[[1]] # str_split() is the stringr counterpart of strsplit()
  
  # look at cells by index like so
  n <- cells[1]
  twoN <- cells[length(cells)]
  # ...
  
  # join strings using paste(); sep sets what goes between the pieces
  fixedRow <- paste(n, twoN, sep = ", ") # ...
  write(fixedRow, outCon)
}
close(con)
close(outCon)
8.8 Practice 2
TreeSAPP’s assign function wants a single fasta file, but all your data is separated by bins from a previous taxonomic classifier! See if you can append all of them together into one file.
for (f in list.files('data/dummyData', full.names = TRUE)) { # full.names keeps the folder in each path
  print(f)
  # append files
}
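One possible approach, sketched on throwaway files so that it runs anywhere: open a single output connection, then loop over the input files and copy each one's lines into it. The folder name dummyBins, the file names, and the sequences below are all invented for this example; swap in data/dummyData and your own output path when working on the exercise.

```r
# set up a throwaway folder with two tiny FASTA-like files (made-up content)
binDir <- file.path(tempdir(), "dummyBins")
dir.create(binDir, showWarnings = FALSE)
writeLines(c(">seq1", "ACGT"), file.path(binDir, "bin1.fasta"))
writeLines(c(">seq2", "TTGA"), file.path(binDir, "bin2.fasta"))

outPath <- file.path(tempdir(), "combined.fasta")
outCon <- file(outPath, "w")

# full.names = TRUE returns paths that include the folder, ready to open
for (f in list.files(binDir, pattern = "\\.fasta$", full.names = TRUE)) {
  lines <- readLines(f)      # read the whole file at once
  writeLines(lines, outCon)  # add its lines to the open output file
}
close(outCon)

combined <- readLines(outPath)
print(combined)
```

Reading each file whole with readLines() is fine for small inputs; for very large files, the line-by-line pattern from the Reading files section uses less memory.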