8 Basic Automation with R
8.1 Introduction and goals
The library dplyr provides many useful functions like select(), filter(), and mutate() that cover the vast majority of use cases, but real data is messy. Sometimes you may know what to do but not how to do it with existing functions. This tutorial will teach you some fundamental building blocks of scripting so that you can produce the “how” when you know the “what.”
A warning, though: if using dplyr is driving a car, then this tutorial is teaching you how to walk. While more locations are reachable by walking, you will always get there much faster by driving. Try to find the tools first before hacking together your own.
8.2 Prerequisites
Please download the following:
R libraries
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.6 ✔ stringr 1.4.0
## ✔ tidyr 1.2.0 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag() masks stats::lag()
8.3 Making a function
In math, we specify functions like so: \(f(x) = 2x\), where the function \(f\) takes the parameter \(x\) and performs the action \(2x\) before handing back the result. In R, the same would look like this…
f <- function(x) {
  return(2 * x)
}
…and it can be used like so.
f(2)
## [1] 4
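Functions can take more than one parameter, and a parameter can be given a default value. The sketch below is only an illustration: the name scaleBy and its factor parameter are made up for this example and do not come from any library.

```r
# Hypothetical example: two parameters, the second with a default value
scaleBy <- function(x, factor = 2) {
  return(factor * x)
}

scaleBy(3)               # no factor given, so the default of 2 is used
scaleBy(3, factor = 10)  # the default is overridden
scaleBy(c(1, 2, 3))      # arithmetic in R applies to every element of a vector
```

The last call works because multiplication in R is element-wise, so the same function handles a single number or a whole vector.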
8.4 Decisions with conditionals
if allows selective execution based on a condition.
doubleIfLessThan <- function(x, threshold) {
  if (x < threshold) {
    return(2 * x)
  } else {
    return(x) # if no condition satisfied
  }
}
print(doubleIfLessThan(12, 100)) # 12 < 100, so doubled 12
## [1] 24
print(doubleIfLessThan(12, 10)) # 12 ≮ 10, so just give back 12
## [1] 12
More than one condition can be used…
doubleOutsideRange <- function(x, lower, upper) {
  if (x < lower) {
    return(2 * x)
  } else if (x > upper) {
    return(2 * x)
  } else {
    return(x)
  }
}
print(doubleOutsideRange(5, 10, 20))
## [1] 10
print(doubleOutsideRange(5, 0, 2))
## [1] 10
print(doubleOutsideRange(5, 0, 10))
## [1] 5
…and conditions can be combined with logical operators.
doubleIfInRange <- function(x, lower, upper) {
  if (x >= lower && x <= upper) { # && is the same as AND
    return(2 * x)
  } else {
    return(x)
  }
}
print(doubleIfInRange(5, 10, 20))
## [1] 5
print(doubleIfInRange(5, 0, 2))
## [1] 5
print(doubleIfInRange(5, 0, 10))
## [1] 10
Operator | Description |
---|---|
&& | AND, both sides must be TRUE |
|| | OR, at least one side must be TRUE |
! | NOT, inverts the following (TRUE becomes FALSE and FALSE becomes TRUE) |
== | EQUALS, left must equal right |
Try the following
1==1
## [1] TRUE
1==1 && 2==1
## [1] FALSE
!(2==1) || FALSE
## [1] TRUE
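As a small illustration of the OR operator, the two identical branches of doubleOutsideRange can be merged into one condition. This is a sketch, not a required rewrite; the name doubleOutsideRange2 is made up here so that it does not overwrite the earlier function.

```r
# Hypothetical variant of doubleOutsideRange using a single OR condition
doubleOutsideRange2 <- function(x, lower, upper) {
  if (x < lower || x > upper) { # || is the same as OR
    return(2 * x)
  } else {
    return(x)
  }
}

print(doubleOutsideRange2(5, 10, 20)) # 5 is below 10, so doubled
print(doubleOutsideRange2(5, 0, 2))   # 5 is above 2, so doubled
print(doubleOutsideRange2(5, 0, 10))  # 5 is inside the range, so unchanged

# && and || compare single values; & and | work element by element on vectors
values <- c(1, 5, 12)
inRange <- values > 2 & values < 10
print(inRange)
```

The three calls give the same results as the original doubleOutsideRange (10, 10, and 5).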
8.5 Iterating with loops
Suppose we wanted to double each number in a list.
numbers <- read.csv(file="data/numbers.csv", header=TRUE)
numbers
## n
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
We can use a for loop. This is how it behaves…
for (i in 1:5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
…and this is how we use it in our context.
for (rowNumber in 1:nrow(numbers)) {
  n <- numbers[rowNumber, "n"] # get value from row "rowNumber", column "n"
  print(2 * n)
}
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18
## [1] 20
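Printing is fine for a demonstration, but usually the results need to be kept. The sketch below stores the doubled values and compares the loop with R's built-in element-wise arithmetic. To keep it runnable on its own, it builds a stand-in data frame instead of reading data/numbers.csv; the variable name doubled is made up for this example.

```r
# Stand-in for data/numbers.csv so the example runs without the file
numbers <- data.frame(n = 1:10)

# Loop version: create an empty result vector, then fill it row by row
doubled <- numeric(nrow(numbers))
for (rowNumber in seq_len(nrow(numbers))) {
  doubled[rowNumber] <- 2 * numbers[rowNumber, "n"]
}

# Vectorized version: the multiplication applies to the whole column at once
numbers$doubled <- 2 * numbers$n

# Both approaches give the same values
print(all(doubled == numbers$doubled))
```

In the spirit of the warning in the introduction, the one-line vectorized version is the "driving" option here; the loop shows what it is doing underneath.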
If the number of repeats is unknown, we can use a while loop. A condition is used instead to tell it when to stop.
n <- 1
while (n < 1000) {
  n <- n * 2
}
n
## [1] 1024
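To see how many times such a loop actually ran, a counter can be updated inside it. A minimal sketch, where the variable name steps is made up for this example:

```r
n <- 1
steps <- 0 # hypothetical counter for the number of repeats
while (n < 1000) {
  n <- n * 2
  steps <- steps + 1
}
print(n)     # first power of 2 that is not below 1000
print(steps) # number of doublings it took to get there
```

Here n ends at 1024 after 10 doublings, since 2^9 = 512 is still below 1000.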
Loops can repeat forever. Click the red octagon (🛑) at the top right of your RStudio console to force the script to stop.
while (TRUE) {
  # loop forever
}
8.6 Reading files
Files can be opened and read one line at a time like so.
numbersFileConnection <- file("data/numbers.csv", "r") # r is for read
while (TRUE) {
  line <- readLines(numbersFileConnection, n = 1) # read 1 line
  if (length(line) == 0) { # if no more lines ...
    break # ... exit loop
  }
  print(line)
}
## [1] "n"
## [1] "1"
## [1] "2"
## [1] "3"
## [1] "4"
## [1] "5"
## [1] "6"
## [1] "7"
## [1] "8"
## [1] "9"
## [1] "10"
close(numbersFileConnection)
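The second argument to file() sets the mode in which the file is opened ("r" for reading, "w" for writing). Writing to a connection that was opened for reading fails with an error like this: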
## Error in cat(x, file = file, sep = c(rep.int(sep, ncolumns - 1), "\n"), : cannot write to this connection
Files can also be written to one line at a time like so.
con <- file("data/myNumbers.csv", "w")
write("n", con) # header
for (i in 1:10) {
  line <- toString(i + 10)
  write(line, con)
}
close(con)
Check your data folder for the new file.
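The open modes can be compared side by side. The sketch below uses a temporary file (created with tempfile()) rather than anything in the data folder, so it is safe to run anywhere; the variable names are made up for this example. Besides "r" and "w", it also uses "a", which appends to the end of an existing file.

```r
path <- tempfile(fileext = ".csv") # scratch file, not part of the tutorial data

con <- file(path, "w") # "w": open for writing, replacing any existing content
write("n", con)
write("1", con)
close(con)

con <- file(path, "a") # "a": open for appending at the end of the file
write("2", con)
close(con)

con <- file(path, "r") # "r": open for reading only
lines <- readLines(con)
# Writing to a read-only connection raises an error; capture its message
result <- tryCatch(
  {
    write("3", con)
    "write succeeded"
  },
  error = function(e) conditionMessage(e)
)
close(con)

print(lines)
print(result)
```

The tryCatch() call is there only so the script keeps running after the failed write and can print the error message.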
Reading and writing csvs are done much better by read.csv() and write.csv(), but suppose you come across this file…
## [1] "n, comment, half_n, 2n"
## [1] "1, ofcourse, odd, 0.5, 2"
## [1] "2, 1.0, 4"
## [1] "3, very odd, 1.5, 6"
## [1] "4, 2.0, 8"
## [1] "5, yes, odd, 2.5, 10"
## [1] "6, 3.0, 12"
## [1] "7, certainly odd, 3.5, 14"
## [1] "8, 4.0, 16"
## [1] "9, odd, 4.5, 18"
close(con)
…seems innocent enough, but when trying to load it as a csv…
## n comment half_n X2n
## 50 25.0 100 NA
## 51 very much odd 25.5 102 NA
## 52 26.0 104 NA
## 53 not even 26.5 106 NA
## 54 27.0 108 NA
## 55 it is if I were to guess odd 27.5
## 110 NA
## 56 28.0 112 NA
## 57 pretty sure it's odd 28.5 114 NA
## 58 29.0 116 NA
## 59 very odd 29.5 118 NA
8.7 Practice 1
…its format is obviously broken. Investigate the file and see if you can fix it.
Documentation for strsplit()
con <- file("data/brokenNumbers.csv", "r")
outCon <- file("data/fixedNumbers.csv", "w")
headers <- readLines(con, n = 1)
while (TRUE) {
  row <- readLines(con, n = 1)
  if (length(row) == 0) {
    break
  }
  fixedRow <- ""
  # split row into cells
  # str_split() is the stringr counterpart of strsplit(), documented above
  cells <- str_split(row, ", ")[[1]]
  # look at cells by index like so
  n <- cells[1]
  twoN <- cells[length(cells)]
  # ...
  # join strings using paste(); sep sets the text placed between them
  fixedRow <- paste(n, twoN, sep = ", ") # ...
  write(fixedRow, outCon)
}
close(con)
close(outCon)
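The splitting and joining steps can be tried on a single line before running them over the whole file. This sketch uses one of the rows printed earlier and base R's strsplit(), so it needs no extra packages; stringr's str_split() returns the same pieces for this input. It is not a full solution to the exercise.

```r
row <- "5, yes, odd, 2.5, 10" # one of the broken rows shown earlier

# fixed = TRUE treats ", " as literal text rather than a regular expression
cells <- strsplit(row, ", ", fixed = TRUE)[[1]]
print(cells)
print(length(cells)) # the comment contained a comma, so there are 5 cells, not 4

n <- cells[1]                # first cell
twoN <- cells[length(cells)] # last cell

# paste() puts a space between its arguments by default; sep overrides that
print(paste(n, twoN))
print(paste(n, twoN, sep = ", "))
```

Counting the cells in each row is one way to spot which rows have a comma hiding inside the comment column.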
8.8 Practice 2
TreeSAPP’s assign function wants a single fasta file, but all your data is separated by bins from a previous taxonomic classifier! See if you can append all of them together into one file.
for (f in list.files('data/dummyData')) {
  print(f)
  # append files
}
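One possible approach is sketched below on made-up data. It creates two tiny fasta files in a temporary folder as a stand-in for data/dummyData, so the folder, file names, and sequences are all hypothetical. Note the full.names = TRUE argument: without it, list.files() returns bare file names that cannot be opened from another working directory.

```r
# Hypothetical stand-in for data/dummyData: two small fasta files in a temp folder
binDir <- file.path(tempdir(), "dummyBins")
dir.create(binDir, showWarnings = FALSE)
writeLines(c(">seq1", "ACGT"), file.path(binDir, "bin1.fasta"))
writeLines(c(">seq2", "TTGA"), file.path(binDir, "bin2.fasta"))

# Open the combined file once, then copy each bin into it in turn
combinedPath <- file.path(tempdir(), "combined.fasta")
outCon <- file(combinedPath, "w")
for (f in list.files(binDir, full.names = TRUE)) {
  binLines <- readLines(f)     # read the whole bin
  writeLines(binLines, outCon) # add it to the end of the combined file
}
close(outCon)

combined <- readLines(combinedPath)
print(combined)
```

Keeping a single output connection open for the whole loop means each bin is written after the previous one, so no append mode is needed.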