R

Introduction

What is R?

Starting with a good old definition from Wikipedia¹:

R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. Created by statisticians Ross Ihaka and Robert Gentleman, R is used among data miners and statisticians for data analysis and developing statistical software.

It may have originated as a language for statistical computing, and it still is used as such, but it is increasingly used as a more general purpose data analysis tool: for things like web scraping, data engineering, and journalism.

It is also a functional programming language: which in layman’s terms means that, in R, you define functions and pass things into them to produce some result.

So, for example, to take a lower case string and make it upper case in R, you pass the string into the toupper() function:

toupper("small")

[1] "SMALL"

This is different from object oriented languages like Ruby, where objects of a certain class have methods (basically functions) built into them. They carry them around, waiting for you to call the method on them.

In Ruby, the same task of converting a string to upper case is done as follows (puts is just the command to print the output):

puts "small".upcase

SMALL

We won’t dwell on the basics of R too long here: it’s better to skip to doing useful stuff and fill in the basics later.

Basics of R

Installing Packages

The basic unit of shareable code in R is a package. Whilst “Base R” (i.e. the functions which are installed by default when you install R) is very powerful, the open source community has added a lot of additional functionality by writing packages. Some of these are so widely used now that new users don’t really distinguish them from Base R.

You install packages using install.packages():

install.packages("dplyr")

Then you can use functions from dplyr (like filter) either by loading dplyr in its entirety at the top of your .R file:

library(dplyr)
filter(mtcars, disp >= 450)

                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4

and/or by using the package::function syntax to specify from which namespace the function is being imported:

dplyr::filter(mtcars, disp >= 450)

                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4

Assignment operator

To create a variable in R, and assign a value to it, you use the assignment operator, <-, like so:

variable_one <- 10

You can also use = in most, but not all, cases – so it’s probably better to use <- generally.
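One place where the two operators behave differently is inside a function call. The sketch below uses a hypothetical helper, show_value(), and a hypothetical variable name, input_value, purely for illustration:

```r
# Hypothetical helper, used only for this illustration: it returns its argument.
show_value <- function(input_value) input_value

# Inside a call, `=` only names the argument; no variable is created.
show_value(input_value = 5)
created_by_equals <- exists("input_value")

# `<-` inside a call assigns in the calling environment, then passes the value on.
show_value(input_value <- 5)
created_by_arrow <- exists("input_value")

created_by_equals  # FALSE, assuming no `input_value` was defined earlier
created_by_arrow   # TRUE
```

Sticking to <- for assignment and = for naming arguments keeps the two jobs visibly separate.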

Mathematical operators

+, -, / (for division) and * (for multiplication) work as you might expect.

%/% performs integer division, returning the whole-number quotient of one number divided by another. For example, since $13 = 4 \times 3 + 1$:

13 %/% 3

[1] 4

%% returns the remainder:

13 %% 3

[1] 1
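As a quick check, the quotient and remainder always recombine into the original number. The following sketch also shows what happens with a negative number, where %/% rounds down rather than towards zero (the variable names are just for illustration):

```r
a <- 13
b <- 3

quotient  <- a %/% b   # 4
remainder <- a %% b    # 1

# The two results always recombine into the original number.
quotient * b + remainder == a   # TRUE

# With a negative number, %/% rounds down (towards minus infinity),
# so the remainder takes the sign of the divisor.
-13 %/% 3   # -5
-13 %% 3    # 2
```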

Relational operators

Relational operators are used to compare two values, returning TRUE or FALSE.

They are:

a <- 1
b <- 10

# LESS THAN
a < b

[1] TRUE

# LESS THAN OR EQUAL TO
a <= b

[1] TRUE

# GREATER THAN
a > b

[1] FALSE

# GREATER THAN OR EQUAL TO
a >= b

[1] FALSE

# EQUAL TO
a == b

[1] FALSE

# NOT EQUAL TO
a != b

[1] TRUE
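Two things are worth knowing when comparing numbers in practice: decimal values are stored approximately, so == can surprise you, and relational operators work on whole vectors at once. A small sketch of both:

```r
x <- 0.1 + 0.2

x == 0.3                   # FALSE: the two doubles differ by a tiny rounding error
isTRUE(all.equal(x, 0.3))  # TRUE: equal within a small tolerance

# Relational operators are vectorised: each element is compared in turn.
c(1, 5, 10) > 4            # FALSE TRUE TRUE
```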

Data Types

R has 6 main data types (though usually you’ll only come across the first 4 of them):

character (like "hello, world!")
numeric (aka double) (real or decimal, like 3 and 3.14)
integer (like 9L: where the L tells R it is an integer specifically)
logical (TRUE or FALSE)
complex (like 4+9i)
raw (which most of the time you really don’t need to worry about)

R builds more complex data types from these basic building blocks: but underneath it all, every data object in R has to be one of the above.

This is done by adding classes to data objects. Classes are beyond the scope of this workshop, but what you need to know is that classes tell R to treat some objects in a different way when you use generic functions on them.

Take the following example:

today <- Sys.Date()
today

[1] "2022-03-22"

Of the 6 data types listed above, today looks most like a character string. You can use typeof() to see what it actually is, and class() to see what class it has been given so that R knows to treat it differently.

typeof(today)

[1] "double"

class(today)

[1] "Date"

You can see the actual data object in all its classless glory using the unclass() function:

unclass(today)

[1] 19073

You’ll see that an object with class Date is just the number of days since 1 January 1970 (the Unix epoch²).
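You can also go the other way and build a Date from a plain number by adding the class yourself. This is only a sketch to show the mechanism (rebuilt is a made-up name); in real code you would normally use as.Date():

```r
# Day zero is the epoch itself.
unclass(as.Date("1970-01-01"))   # 0

# Adding the "Date" class to a plain number makes R treat it as a date.
# 19073 is the value returned by unclass(today) in the example above.
rebuilt <- structure(19073, class = "Date")
format(rebuilt)                  # "2022-03-22"

# Because the underlying data is a number, date arithmetic works in days.
format(rebuilt + 7)              # "2022-03-29"
```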

Vectors and Lists

Vectors

A vector is a data structure which contains a number of data elements of the same basic type.

As a rule, vectors are created using the c() function (short for combine):

vec_1 <- c(1, 2, 3, 4)
length(vec_1)

[1] 4

class(vec_1)

[1] "numeric"

typeof(vec_1)

[1] "double"

is.vector(vec_1)

[1] TRUE

Note

Vectors can only contain data of the same type: if you try to mix types in a vector, all of its elements will be coerced to the most flexible type present. From least to most flexible, the order is:

logical
integer
double
character

So mixing a logical value with a character string gives a character vector:

vec_2 <- c(TRUE, "FALSE")
vec_2

[1] "TRUE"  "FALSE"

typeof(vec_2)

[1] "character"
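To see each step of that coercion order, here is a short sketch which mixes neighbouring types and checks the result with typeof():

```r
typeof(c(TRUE, 2L))      # "integer":   TRUE becomes 1L
typeof(c(2L, 3.5))       # "double":    2L becomes 2
typeof(c(3.5, "four"))   # "character": 3.5 becomes "3.5"

# Logical values coerce to 1 and 0, which makes counting TRUEs easy.
sum(c(TRUE, FALSE, TRUE))   # 2
```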

You can use the c() function to combine (and flatten) vectors together into a single vector too:

vec_3 <- c(
  1,
  c(2, 3),
  c(4, 5, c(6, 7, 8))
)

vec_3

[1] 1 2 3 4 5 6 7 8

This is also a good way to add an element to the end of a vector:

vec_3 <- c(vec_3, 1000)
vec_3

[1]    1    2    3    4    5    6    7    8 1000

Sequences of integers can be created using the : operator:

1:20

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

You can also create vectors with named elements:

vec_4 <- c(
  yes = 1,
  no = 2
)

vec_4

yes  no 
  1   2

Lists

Lists are like vectors, except that their elements do not all have to be of the same type. An element can even be another list, which means lists can be nested.

You create them using the list() function:

list_1 <- list(
  1, 2, "3", list(4, 5, 6, c(7, 8, 9))
)

list_1

[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] "3"

[[4]]
[[4]][[1]]
[1] 4

[[4]][[2]]
[1] 5

[[4]][[3]]
[1] 6

[[4]][[4]]
[1] 7 8 9

class(list_1)

[1] "list"

typeof(list_1)

[1] "list"

As with vectors, you can name the elements in a list:

list_2 <- list(
  yes = 1,
  no = 2,
  maybe = 3
)

list_2

$yes
[1] 1

$no
[1] 2

$maybe
[1] 3

Data Frames

Data frames are a fairly central concept when using R for analysis / data science. A data frame is a two-dimensional structure of rows and columns, kind of like a table in Excel.

There is a built in example data frame in R, called mtcars:

class(mtcars)

[1] "data.frame"

knitr::kable(
  head(mtcars)
)

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0	3	1

Underneath, a data frame is actually a list of vectors of equal length:

typeof(mtcars)

[1] "list"

unclass(head(mtcars))

$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1

$cyl
[1] 6 6 4 6 8 6

$disp
[1] 160 160 108 258 360 225

$hp
[1] 110 110  93 110 175 105

$drat
[1] 3.90 3.90 3.85 3.08 3.15 2.76

$wt
[1] 2.620 2.875 2.320 3.215 3.440 3.460

$qsec
[1] 16.46 17.02 18.61 19.44 17.02 20.22

$vs
[1] 0 0 1 1 0 1

$am
[1] 1 1 1 0 0 0

$gear
[1] 4 4 4 3 3 3

$carb
[1] 4 4 1 1 2 1

attr(,"row.names")
[1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"       
[4] "Hornet 4 Drive"    "Hornet Sportabout" "Valiant"

You can create your own data frame using the data.frame function:

snooker <- data.frame(
  colour = c("red", "yellow", "green", "brown", "blue", "pink", "black"),
  score = 1:7
)

knitr::kable(
  snooker
)

colour	score
red	1
yellow	2
green	3
brown	4
blue	5
pink	6
black	7
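Once you have a data frame, a few base R functions let you check its shape, and you can add a column in the same way you would add a named element to a list. In this sketch, double_score is a made-up column name:

```r
snooker <- data.frame(
  colour = c("red", "yellow", "green", "brown", "blue", "pink", "black"),
  score = 1:7
)

nrow(snooker)    # 7 rows
ncol(snooker)    # 2 columns
names(snooker)   # "colour" "score"

# Adding a column is the same as adding a named element to a list.
# `double_score` is a hypothetical column name used for illustration.
snooker$double_score <- snooker$score * 2
ncol(snooker)    # 3 columns
```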

Subsetting

Unlike many other programming languages, R uses 1-based indexing for vectors: the first element is at position 1, not 0. You can extract elements like so:

some_letters <- c("A", "B", "C", "D")

# First letter:
some_letters[1]

[1] "A"

# Third letter:
some_letters[3]

[1] "C"

With named vectors, you can extract individual elements using the name, like so:

vec_4["yes"]

yes 
  1

You can do the same with lists:

list_2["yes"]

$yes
[1] 1

Important

Subsetting a list in this way actually returns another list. To get the element itself, either double up the square brackets:

list_2[["yes"]]

[1] 1

or use the $ operator:

list_2$yes

[1] 1

The key difference between single [ and double [[ / $ is that the former can be used to select multiple elements:

list_2[c(1, 2)]

$yes
[1] 1

$no
[1] 2

list_2[c("yes", "maybe")]

$yes
[1] 1

$maybe
[1] 3

You can replace values in a vector / list by assigning new values to a subset of it.

For example, to change the first element of some_letters:

some_letters[1] <- "Z"
some_letters

[1] "Z" "B" "C" "D"
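Square brackets accept more than a single position. As a sketch, here are a few other forms of index you will probably come across, using a fresh vector (letters_demo is a made-up name) so the examples above are left untouched:

```r
letters_demo <- c("A", "B", "C", "D")

letters_demo[c(1, 3)]                        # "A" "C": several positions at once
letters_demo[-1]                             # "B" "C" "D": drop the first element
letters_demo[c(TRUE, FALSE, TRUE, FALSE)]    # "A" "C": keep where TRUE
letters_demo[letters_demo != "B"]            # "A" "C" "D": filter by a condition
```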

Subsetting data frames

When subsetting data frames, it is important to remember that they are effectively just a list of vectors, so you can subset a column by name:

mtcars$cyl

 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

or by column number:

mtcars[[2]]

 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

# or
mtcars[,2]

 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

and you can select rows like so:

# for the first row:
mtcars[1,]

          mpg cyl disp  hp drat   wt  qsec vs am gear carb
Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
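Rows and columns can be selected together, and the row index can be a condition. The following sketch reproduces the earlier dplyr::filter() result in base R (big_engines is a made-up name):

```r
# Rows where disp is at least 450, keeping only two of the columns.
big_engines <- mtcars[mtcars$disp >= 450, c("mpg", "disp")]
big_engines

# Single-bracket column selection keeps the data frame structure,
# whereas [[ ]] returns the bare vector.
class(mtcars["cyl"])     # "data.frame"
class(mtcars[["cyl"]])   # "numeric"
```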

Control flow

`if` / `else`

if and else statements are used to run blocks of code only when certain conditions are met:

n <- 4

if (n %% 2 == 0) {
  message(n, " is an even number")
}

4 is an even number

You can add in else if calls to the flow: the execution will exit on the first satisfied condition:

n <- 3
if (n == 2) {
  message("n is 2")
} else if (n == 3) {
  message("n is 3")
} else if (n*2 == 6) {
  # the following will not execute since the sequence exited 
  # above
  message("n times 2 is 6")
}

n is 3

else can be used to run an expression if no previous if or else if conditions were satisfied:

if (FALSE) {
  stop()
} else if (FALSE) {
  stop()
} else if (FALSE) {
  stop()
} else if (FALSE) {
  stop()
} else if (FALSE) {
  stop()
} else {
  message("hello")
}

hello
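if expects a single TRUE or FALSE. When you want to apply a condition to every element of a vector, base R's ifelse() is one option. A minimal sketch (numbers and parity are made-up names):

```r
numbers <- 1:6

# ifelse(test, yes, no) checks each element and picks the matching value.
parity <- ifelse(numbers %% 2 == 0, "even", "odd")
parity   # "odd" "even" "odd" "even" "odd" "even"
```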

`for` loops

for loops iterate over a sequence, executing the code within the block once for each item in the sequence, like so:

for (thing in c("John", "Paul", "George", "Ringo")) {
  message(thing, " is a member of The Beatles")
}

John is a member of The Beatles

Paul is a member of The Beatles

George is a member of The Beatles

Ringo is a member of The Beatles

Sometimes, the index of the item is also needed, along with the item itself. This is achieved as follows:

beatles <- c("John", "Paul", "George", "Ringo")
for (i in seq_along(beatles)) {
  message(i, ". ", beatles[i], " is a member of The Beatles")
}

1. John is a member of The Beatles

2. Paul is a member of The Beatles

3. George is a member of The Beatles

4. Ringo is a member of The Beatles

These loops terminate automatically once they reach the end of the sequence: since a vector cannot have infinite length, you don’t need to worry about an infinite loop.
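It is worth using seq_along() rather than 1:length(x) for the index, because the two behave differently when the vector is empty. A small sketch of the difference (empty and iterations are made-up names):

```r
empty <- character(0)

# 1:length(empty) is 1:0, which counts down: 1, 0. A loop over it would run twice.
1:length(empty)      # 1 0

# seq_along(empty) is empty, so the loop body is never run.
seq_along(empty)     # integer(0)

iterations <- 0
for (i in seq_along(empty)) {
  iterations <- iterations + 1
}
iterations           # 0
```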

`while` loops

while loops continue to execute as long as the given condition is true:

i <- 0
while (i <= 3) {
  message("The value of i is ", i)
  i <- i + 1
}

The value of i is 0

The value of i is 1

The value of i is 2

The value of i is 3

Since this loop will continue as long as the value of i is less than or equal to 3, it is imperative that the body of the loop increments i. If it doesn’t, the loop will run infinitely.

`repeat` loops

repeat loops are very similar to while loops, except there is no preceding condition which is tested on the way in to decide whether to execute the expression within the block: you need to use break to manually exit the loop.

i <- 0
repeat {
  message("The value of i is ", i)
  i <- i + 1
  if (i > 3) break
}

The value of i is 0

The value of i is 1

The value of i is 2

The value of i is 3

You can also use next to skip the rest of the current iteration under certain conditions - for example, for odd values of i:

i <- 0
repeat {
  i <- i + 1
  if (i %% 2 == 1) next
  if (i > 10) break
  message("The value of i is ", i)
}

The value of i is 2

The value of i is 4

The value of i is 6

The value of i is 8

The value of i is 10

Creating Functions

As a rule, code which is repeated numerous times in your scripts, or which you run regularly with different inputs, or which you want a way to test easily, should be extracted into a function.

Functions are (generally) defined in R using function:

new_func <- function() {
  message("hello")
}

You then call a function by executing it with () at the end:

new_func()

hello

You can read much, much more about how functions work by reading the Functions chapter in Advanced R by Hadley Wickham, but for now it’s worth noting that functions have 3 elements:

formals, or arguments
body
environment

The formals are what go inside the brackets when defining the function. Variables which may be different each time the function is called would be fed in via these arguments. For example:

doubler <- function(n) {
  n * 2
}

doubler(10)

[1] 20

doubler(40)

[1] 80

The body is just the code which will be executed: this sits between the curly brackets. The value of the last expression evaluated in the body will be returned by the function (meaning you can assign its value to something if you want).
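Arguments can be given default values, and return() can be used to exit a function early. The sketch below uses a hypothetical function, multiplier(), with a made-up times argument:

```r
# `times` is a hypothetical argument with a default value of 2.
multiplier <- function(n, times = 2) {
  # An explicit return() exits the function early.
  if (times == 0) {
    return(0)
  }
  # Otherwise the value of the last expression is returned.
  n * times
}

multiplier(10)             # 20: uses the default
multiplier(10, times = 3)  # 30
result <- multiplier(10, times = 0)
result                     # 0
```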

It’s a very good idea to add argument validation towards the top of your function’s body as well:

doubler <- function(n) {
  if (!is.numeric(n)) stop("n must be a number")
  n * 2
}

doubler("ten")

Error in doubler("ten"): n must be a number
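If you are calling a function which might throw an error like this, tryCatch() lets you deal with it rather than have your script halt. A minimal sketch, using a hypothetical stricter variant called safe_doubler():

```r
safe_doubler <- function(n) {
  # Reject anything that is not a single number.
  if (!is.numeric(n) || length(n) != 1) {
    stop("n must be a single number")
  }
  n * 2
}

# tryCatch() lets the caller handle the error instead of halting the script.
outcome <- tryCatch(
  safe_doubler("ten"),
  error = function(e) conditionMessage(e)
)
outcome   # "n must be a single number"
```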

To understand more about environments, recommended reading is the section about lexical scoping in Advanced R. At this point, it’s useful to mention that a function will try to use the “most locally” defined variable with the same name, looking first at its own environment, and if it does not find it there, in the environment in which the function was defined (not the one from which it was called).

Also, the call-time environments of functions are transient: variables which are created on execution will not generally continue to exist once the function exits.

x <- 10
y <- 20

random_function <- function() {
  x <- 100
  z <- 30
  
  x + y + z
}

random_function()

[1] 150

x

[1] 10

y

[1] 20

z

Error in eval(expr, envir, enclos): object 'z' not found
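Because R uses lexical scoping, a variable which is not found inside a function is looked up where the function was defined, not where it was called from. A minimal sketch of this, with made-up names (greeting, show_greeting() and caller()):

```r
greeting <- "hello from the global environment"

# show_greeting() has no `greeting` of its own, so R looks in the
# environment where the function was defined.
show_greeting <- function() {
  greeting
}

caller <- function() {
  # This local variable is not visible to show_greeting().
  greeting <- "hello from inside caller()"
  show_greeting()
}

caller()   # "hello from the global environment"
```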

Functional Programming (`apply` functions)

A functional is any function that takes a function as an input and returns a vector as output.

Here is a simple example:

numbers <- c(1, 2, 3, 4, 5, 6)

some_functional <- function(f) {
  f(numbers)
}

some_functional(mean)

[1] 3.5

some_functional(range)

[1] 1 6

some_functional(sum)

[1] 21

They are most often used to perform the same action on every element of a list or vector. Base R has the lapply family of functions to do this:

some_names <- c("bob", "jane", "eric")

lapply(some_names, toupper)

[[1]]
[1] "BOB"

[[2]]
[1] "JANE"

[[3]]
[1] "ERIC"

The first argument is the vector (or list) you want to iterate over, and the second argument is the function to apply to each element. As you can see, lapply returns a list.

Sometimes, the functions you supply to lapply have additional arguments which you’d like to specify. You can provide additional, named, arguments in the call to lapply:

some_numbers <- list(
  c(1, 4, NA),
  c(10, 10, 200),
  c(9, 8, NA, 100, 542)
)

lapply(some_numbers, sum)

[[1]]
[1] NA

[[2]]
[1] 220

[[3]]
[1] NA

lapply(some_numbers, sum, na.rm = TRUE)

[[1]]
[1] 5

[[2]]
[1] 220

[[3]]
[1] 659
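When every result is a single value, a plain vector is often handier than a list. Two other members of the same family, sapply() and vapply(), can be sketched as follows (totals and missing_counts are made-up names):

```r
some_numbers <- list(
  c(1, 4, NA),
  c(10, 10, 200),
  c(9, 8, NA, 100, 542)
)

# sapply() simplifies the result to a vector where it can.
sapply(some_numbers, sum, na.rm = TRUE)   # 5 220 659

# vapply() does the same, but checks each result against a template
# (here: exactly one number per element).
totals <- vapply(some_numbers, sum, numeric(1), na.rm = TRUE)

# An anonymous function can be supplied inline.
missing_counts <- vapply(some_numbers, function(x) sum(is.na(x)), integer(1))
missing_counts   # 1 0 1
```

vapply() is the stricter of the two: it fails loudly if a result does not match the template, which tends to be what you want inside a function or package.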

Useful Data Sciencey tasks

Having covered a few of the basics of R, it is worthwhile running through some practical tasks which are useful day-to-day.

Reading / writing `csv` files

`read.csv`

countries <- read.csv(
  "https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv"
)

knitr::kable(
  head(countries)
)

name	alpha.2	alpha.3	country.code	iso_3166.2	region	sub.region	region.code	sub.region.code	intermediate.region.code
Afghanistan	AF	AFG	4	ISO 3166-2:AF	Asia	Southern Asia	142	34	NA
Åland Islands	AX	ALA	248	ISO 3166-2:AX	Europe	Northern Europe	150	154	NA
Albania	AL	ALB	8	ISO 3166-2:AL	Europe	Southern Europe	150	39	NA
Algeria	DZ	DZA	12	ISO 3166-2:DZ	Africa	Northern Africa	2	15	NA
American Samoa	AS	ASM	16	ISO 3166-2:AS	Oceania	Polynesia	9	61	NA
Andorra	AD	AND	20	ISO 3166-2:AD	Europe	Southern Europe	150	39	NA

`write.csv`

write.csv(countries, "path/to/file.csv")
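To try a full round trip without touching your own files, you can write to a temporary path and read it straight back in. This is only a sketch: the small data frame and the names csv_path and snooker_again are made up for illustration.

```r
snooker <- data.frame(
  colour = c("red", "yellow", "green"),
  score = 1:3
)

# tempfile() gives a throwaway path, so nothing in your project is overwritten.
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE stops write.csv() adding an extra column of row numbers.
write.csv(snooker, csv_path, row.names = FALSE)

snooker_again <- read.csv(csv_path)
snooker_again
```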

Querying APIs

At the moment, I would recommend using the httr package for HTTP requests.

Important

The httr2 package is currently in development and will likely replace httr. This package has a very different API to httr, so don’t get too attached to this example.

url <- "https://httpbin.org/anything?filter=everything&goal=show%20how%20to%20api%20request"

req <- httr::GET(url)

content <- httr::content(req)

content

$args
$args$filter
[1] "everything"

$args$goal
[1] "show how to api request"


$data
[1] ""

$files
named list()

$form
named list()

$headers
$headers$Accept
[1] "application/json, text/xml, application/xml, */*"

$headers$`Accept-Encoding`
[1] "deflate, gzip"

$headers$Host
[1] "httpbin.org"

$headers$`User-Agent`
[1] "libcurl/7.77.0 r-curl/4.3.2 httr/1.4.2"

$headers$`X-Amzn-Trace-Id`
[1] "Root=1-623a3dfe-2805beab6222f3b1458c29af"


$json
NULL

$method
[1] "GET"

$url
[1] "https://httpbin.org/anything?filter=everything&goal=show how to api request"
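Notice that the goal parameter was sent as show%20how%20to%20api%20request but came back with spaces: query strings are percent-encoded, and %20 stands for a space. If you need to build a URL like this yourself, base R's utils::URLencode() and utils::URLdecode() can help. A small sketch (goal, encoded and base_url are made-up names):

```r
goal <- "show how to api request"

encoded <- utils::URLencode(goal)
encoded                      # "show%20how%20to%20api%20request"

utils::URLdecode(encoded)    # "show how to api request"

# Build the full URL from its parts; `base_url` is the endpoint used above.
base_url <- "https://httpbin.org/anything"
paste0(base_url, "?filter=everything&goal=", encoded)
```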

Web scraping

The rvest package is commonly used for web scraping:

url <- "https://www.scrapethissite.com/pages/simple/"

page <- rvest::read_html(x = url)

elements <- rvest::html_elements(x = page, css = "h3.country-name")

head(
  rvest::html_text2(elements)
)

[1] "Andorra"              "United Arab Emirates" "Afghanistan"         
[4] "Antigua and Barbuda"  "Anguilla"             "Albania"

Package Development

It is a good idea to bundle useful, reusable code (especially that which you plan to share with others, and subject to testing) as a package.

Creating R packages is a big topic — too big to cover in a single session — but you can read more about it in Hadley Wickham and Jenny Bryan’s book R Packages. It leans heavily on the usethis package, a set of utilities created to simplify the setup of projects / packages.

Testing

Unit testing is crucial when creating reliable packages. A little time up front writing out expectations as to what a function will return in a number of different circumstances saves you having to manually check that it does before you share your project. It also means that you can automate your tests: ensuring that it is impossible to publish work which does not demonstrably do what it set out to do.

The testthat package is commonly used for unit testing in R. The R Packages book has a section on how to get up and running with tests on R packages.
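To give a flavour of what such tests look like, here is a minimal sketch using testthat's test_that(), expect_equal() and expect_error() on a version of the doubler() function from earlier. It assumes testthat is installed; in a real package, the tests would live in their own files under tests/testthat/ rather than alongside the function.

```r
library(testthat)

doubler <- function(n) {
  if (!is.numeric(n)) stop("n must be a number")
  n * 2
}

# Each test_that() block groups related expectations under a description.
test_that("doubler() doubles numbers", {
  expect_equal(doubler(10), 20)
  expect_equal(doubler(c(1, 2)), c(2, 4))
})

test_that("doubler() rejects non-numeric input", {
  # The second argument is a pattern the error message must match.
  expect_error(doubler("ten"), "n must be a number")
})
```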

Getting help

To read the documentation for a function, type ?name_of_function into the R console and hit enter.

For example, to get the help documentation for read.table, you would type:

?read.table

read.table                package:utils                R Documentation

Data Input

Description:

     Reads a file in table format and creates a data frame from it,
     with cases corresponding to lines and variables to fields in the
     file.

Usage:

     read.table(file, header = FALSE, sep = "", quote = "\"'",
                dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
                row.names, col.names, as.is = !stringsAsFactors,
                na.strings = "NA", colClasses = NA, nrows = -1,
                skip = 0, check.names = TRUE, fill = !blank.lines.skip,
                strip.white = FALSE, blank.lines.skip = TRUE,
                comment.char = "#",
                allowEscapes = FALSE, flush = FALSE,
                stringsAsFactors = FALSE,
                fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)