Starting with a good old definition from Wikipedia [1]:
R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. Created by statisticians Ross Ihaka and Robert Gentleman, R is used among data miners and statisticians for data analysis and developing statistical software.
It may have originated as a language for statistical computing, and it still is used as such, but it is increasingly used as a more general purpose data analysis tool: for things like web scraping, data engineering, and journalism.
It is also a functional programming language, which in layman’s terms means that, in R, you define functions and pass things into them to produce some result.
So, for example, to take a lower case string and make it upper case in R, you pass the string into the toupper() function:
toupper("small")
[1] "SMALL"
This is different from object oriented languages like Ruby, where objects of a certain class have methods (basically functions) built into them. They carry them around, waiting for you to call the method on them.
In Ruby, the same task of upper-casing a string is done as follows (puts is just the command to print the output):
puts "small".upcase
SMALL
We won’t dwell on the basics of R too long here: it’s better to skip to doing useful stuff and fill in the basics later.
The basic unit of shareable code in R is a package. Whilst “Base R” (i.e. the functions which are installed by default when you install R) is very powerful, the open source community has added a lot of additional functionality by writing packages. Some of these are so widely used now that new users don’t really distinguish them from Base R.
You install packages using install.packages():
install.packages("dplyr")
Then you can use functions from dplyr (like filter) either by loading dplyr in its entirety at the top of your .R file:
library(dplyr)
filter(mtcars, disp >= 450)
mpg cyl disp hp drat wt qsec vs am gear carb
Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
and/or by using the package::function syntax to specify which namespace the function comes from:
dplyr::filter(mtcars, disp >= 450)
mpg cyl disp hp drat wt qsec vs am gear carb
Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
To create a variable in R, and assign a value to it, you use the assignment operator, <-, like so:
variable_one <- 10
You can also use = in most, but not all, cases, so it’s probably better to use <- generally.
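One place where the two operators behave differently is inside a function call. The following is a minimal sketch (the variable names are made up for illustration): at the top level = and <- both assign, but within the parentheses of a call = names an argument, whereas <- still performs an assignment.

```r
# At the top level, `=` assigns just like `<-`
values_a = c(1, 2, 3)
values_b <- c(1, 2, 3)
identical(values_a, values_b)

# Inside a call, `=` matches the argument called `x`; no variable is created
result <- mean(x = c(10, 20, 30))

# Inside a call, `<-` assigns first and then passes the value on
result_2 <- mean(values_c <- c(10, 20, 30))
values_c
```

Sticking to <- for assignment and = for arguments keeps the two jobs visually distinct.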
+, -, / (for division) and * (for multiplication) work as you might expect.
%/% performs integer division: it returns the whole-number quotient when one number is divided by another. For example, since \(13 = 4 \times 3 + 1\):
13 %/% 3
[1] 4
%% returns the remainder:
13 %% 3
[1] 1
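As a quick check of how the two operators fit together, here is a small sketch (variable names are illustrative): multiplying the quotient by the divisor and adding the remainder should give back the original number, including for negative inputs, where R rounds the quotient down.

```r
a <- 13
b <- 3

quotient <- a %/% b    # integer division
remainder <- a %% b    # what is left over

# Recombining the two parts gives back the original number
quotient * b + remainder == a

# With a negative number the quotient is rounded down (towards -Inf),
# so the remainder takes the sign of the divisor
neg_quotient <- (-13) %/% 3
neg_remainder <- (-13) %% 3
neg_quotient * 3 + neg_remainder == -13
```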
Relational operators are used to compare two values, returning TRUE or FALSE.
They are:
a <- 1
b <- 10
# LESS THAN
a < b
[1] TRUE
# LESS THAN OR EQUAL TO
a <= b
[1] TRUE
# GREATER THAN
a > b
[1] FALSE
# GREATER THAN OR EQUAL TO
a >= b
[1] FALSE
# EQUAL TO
a == b
[1] FALSE
# NOT EQUAL TO
a != b
[1] TRUE
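One thing to be aware of when using == on decimal numbers: doubles are stored in binary, so two values that look equal may differ slightly. A small sketch (the names lhs and rhs are illustrative) using base R's all.equal(), which compares within a tolerance:

```r
lhs <- 0.1 + 0.2
rhs <- 0.3

# Exact comparison: the two doubles differ in their last bits
lhs == rhs

# Comparison within a small numerical tolerance
isTRUE(all.equal(lhs, rhs))
```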
R has 6 main data types (though usually you’ll only come across the first 4 of them):
character (e.g. "hello, world!")
double (e.g. 3 and 3.14)
integer (e.g. 9L: where the L tells R it is an integer specifically)
logical (TRUE or FALSE)
complex (e.g. 4+9i)
raw (raw bytes)
R builds more complex data types from these basic building blocks: but underneath it all, every data object in R has to be one of the above.
This is done by adding classes to data objects. Classes are beyond the scope of this workshop, but what you need to know is that classes tell R to treat some objects in a different way when you use generic functions on them.
Take the following example:
today <- Sys.Date()
today
[1] "2022-03-22"
Of the 6 data types listed above, today looks most like a character string. You can use typeof() to see what it actually is, and class() to see what class it has been given so that R knows to treat it differently.
typeof(today)
[1] "double"
class(today)
[1] "Date"
You can see the actual data object in all its classless glory using the unclass() function:
unclass(today)
[1] 19073
You’ll see that an object with class Date is just the number of days since 1 January 1970 (the Unix epoch [2]).
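To see this working in the other direction, the sketch below rebuilds the date from the raw number printed above (19073). The variable names are illustrative; the only functions used are base R's as.Date(), format(), typeof() and class().

```r
# 19073 is the number of days printed by unclass(today) above
days_since_epoch <- 19073

# Adding a number of days to a Date gives another Date
rebuilt <- as.Date("1970-01-01") + days_since_epoch
format(rebuilt)

# The underlying type is still double; only the class attribute differs
typeof(rebuilt)
class(rebuilt)
```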
A vector is a data structure which contains a number of data elements of the same basic type.
As a rule, vectors are created using the c() function (short for combine):
vec_1 <- c(1, 2, 3, 4)
length(vec_1)
[1] 4
class(vec_1)
[1] "numeric"
typeof(vec_1)
[1] "double"
is.vector(vec_1)
[1] TRUE
vec_2 <- c(TRUE, "FALSE")
vec_2
[1] "TRUE" "FALSE"
typeof(vec_2)
[1] "character"
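vec_2 ended up as a character vector because every element of a vector must share one type, so c() converts the elements to the most flexible type present. A short sketch of that conversion order (logical, then integer, then double, then character); the variable names are illustrative:

```r
# logical + integer -> integer (TRUE becomes 1L)
mixed_1 <- c(TRUE, 2L)
typeof(mixed_1)

# integer + double -> double
mixed_2 <- c(2L, 3.5)
typeof(mixed_2)

# double + character -> character
mixed_3 <- c(3.5, "a")
typeof(mixed_3)
```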
You can use the c() function to combine (and flatten) vectors together into a single vector too:
vec_3 <- c(
  1,
  c(2, 3),
  c(4, 5, c(6, 7, 8))
)
vec_3
[1] 1 2 3 4 5 6 7 8
This is also a good way to add an element to the end of a vector:
vec_3 <- c(vec_3, 1000)
vec_3
[1] 1 2 3 4 5 6 7 8 1000
Sequences of integers can be created using the : operator:
1:20
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
You can also create vectors with named elements:
vec_4 <- c(
  yes = 1,
  no = 2
)
vec_4
yes no
1 2
Lists are like vectors, except that their elements do not all have to be of the same type. Elements can even be lists themselves, which means lists can be nested.
You create them using the list() function:
list_1 <- list(
  1, 2, "3", list(4, 5, 6, c(7, 8, 9))
)
list_1
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] "3"
[[4]]
[[4]][[1]]
[1] 4
[[4]][[2]]
[1] 5
[[4]][[3]]
[1] 6
[[4]][[4]]
[1] 7 8 9
class(list_1)
[1] "list"
typeof(list_1)
[1] "list"
As with vectors, you can name the elements in a list:
list_2 <- list(
  yes = 1,
  no = 2,
  maybe = 3
)
list_2
$yes
[1] 1
$no
[1] 2
$maybe
[1] 3
Data frames are a fairly central concept when using R for analysis / data science. They are 2-dimensional tables of rows and columns, kind of like a table in Excel.
There is a built in example data frame in R, called mtcars:
class(mtcars)
[1] "data.frame"
knitr::kable(
  head(mtcars)
)
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
As you can see, a data frame is actually a list of vectors:
typeof(mtcars)
[1] "list"
unclass(head(mtcars))
$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1
$cyl
[1] 6 6 4 6 8 6
$disp
[1] 160 160 108 258 360 225
$hp
[1] 110 110 93 110 175 105
$drat
[1] 3.90 3.90 3.85 3.08 3.15 2.76
$wt
[1] 2.620 2.875 2.320 3.215 3.440 3.460
$qsec
[1] 16.46 17.02 18.61 19.44 17.02 20.22
$vs
[1] 0 0 1 1 0 1
$am
[1] 1 1 1 0 0 0
$gear
[1] 4 4 4 3 3 3
$carb
[1] 4 4 1 1 2 1
attr(,"row.names")
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
You can create your own data frame using the data.frame() function:
snooker <- data.frame(
  colour = c("red", "yellow", "green", "brown", "blue", "pink", "black"),
  score = 1:7
)
knitr::kable(
  snooker
)
colour | score |
---|---|
red | 1 |
yellow | 2 |
green | 3 |
brown | 4 |
blue | 5 |
pink | 6 |
black | 7 |
Unlike many other programming languages, R uses 1-based indexing for vectors: the first element is at position 1, not 0. You can extract elements like so:
some_letters <- c("A", "B", "C", "D")
# First letter:
some_letters[1]
[1] "A"
# Third letter:
some_letters[3]
[1] "C"
With named vectors, you can extract individual elements using the name, like so:
vec_4["yes"]
yes
1
You can do the same with lists:
list_2["yes"]
$yes
[1] 1
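Note that single brackets on a list return a (smaller) list, as the $yes line in the output above shows. To get at the element itself, double brackets or $ can be used. A brief sketch, redefining list_2 so that it runs on its own:

```r
list_2 <- list(yes = 1, no = 2, maybe = 3)

# Single brackets keep the container: the result is still a list
class(list_2["yes"])

# Double brackets (or $) extract the element itself
class(list_2[["yes"]])
list_2$yes
```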
You can replace values in a vector / list by assigning to a subset of it.
For example, to change the first element of some_letters:
some_letters[1] <- "Z"
some_letters
[1] "Z" "B" "C" "D"
When subsetting data frames, it is important to remember that they are effectively just a list of vectors, so you can subset a column by name:
mtcars$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
or by column number:
mtcars[[2]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
# or
mtcars[, 2]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
and you can select rows like so:
# for the first row:
mtcars[1, ]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
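Rows can also be selected with a logical vector instead of a position, which gives a base R equivalent of the dplyr filter shown earlier. This is a sketch only; big_engine is an illustrative name.

```r
# A logical vector with one TRUE/FALSE per row
big_engine <- mtcars$disp >= 450

# Used in the row position, it keeps only the matching rows
mtcars[big_engine, ]

# Rows and columns can be chosen together
mtcars[big_engine, c("mpg", "disp")]
```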
if / else
if and else statements are used to run blocks of code only when certain conditions are met:
n <- 4
if (n %% 2 == 0) {
  message(n, " is an even number")
}
4 is an even number
You can add else if branches to the flow. Execution exits at the first satisfied condition:
n <- 3
if (n == 2) {
  message("n is 2")
} else if (n == 3) {
  message("n is 3")
} else if (n * 2 == 6) {
  # the following will not execute since the sequence exited
  # above
  message("n times 2 is 6")
}
n is 3
else can be used to run an expression if no previous if or else if conditions were satisfied:
if (FALSE) {
  stop()
} else if (FALSE) {
  stop()
} else if (FALSE) {
  stop()
} else if (FALSE) {
  stop()
} else if (FALSE) {
  stop()
} else {
  message("hello")
}
hello
for loops
for loops iterate over a sequence, executing the code within the block once for each item in the sequence, like so:
for (thing in c("John", "Paul", "George", "Ringo")) {
message(thing, " is a member of The Beatles")
}
John is a member of The Beatles
Paul is a member of The Beatles
George is a member of The Beatles
Ringo is a member of The Beatles
Sometimes, the index of the item is also needed, along with the item itself. This is achieved as follows:
beatles <- c("John", "Paul", "George", "Ringo")
for (i in seq_along(beatles)) {
  message(i, ". ", beatles[i], " is a member of The Beatles")
}
1. John is a member of The Beatles
2. Paul is a member of The Beatles
3. George is a member of The Beatles
4. Ringo is a member of The Beatles
These loops terminate automatically once they reach the end of the sequence: since a vector cannot have infinite length, you don’t need to worry about an infinite loop.
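A note on why seq_along() is used for the index rather than 1:length(x): the two differ when the vector is empty. A minimal sketch (empty and count are illustrative names):

```r
empty <- character(0)

# seq_along() gives a zero-length sequence, so a loop over it never runs
seq_along(empty)

# 1:length(empty) is 1:0, which counts down from 1 to 0
1:length(empty)

count <- 0
for (i in seq_along(empty)) {
  count <- count + 1
}
count
```

With 1:length(empty), the loop body would run twice, once for 1 and once for 0, even though there is nothing to iterate over.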
while loops
while loops continue to execute as long as the given condition is true:
i <- 0
while (i <= 3) {
  message("The value of i is ", i)
  i <- i + 1
}
The value of i is 0
The value of i is 1
The value of i is 2
The value of i is 3
Since this loop will continue as long as the value of i is less than or equal to 3, it is imperative that the body of the loop increments i. If it doesn’t, the loop will run infinitely.
repeat loops
repeat loops are very similar to while loops, except that no condition is tested on the way in to decide whether to execute the block: you need to use break to exit the loop manually.
i <- 0
repeat {
  message("The value of i is ", i)
  i <- i + 1
  if (i > 3) break
}
The value of i is 0
The value of i is 1
The value of i is 2
The value of i is 3
You can also use next to skip the rest of an iteration in certain conditions, for example for odd values of i:
i <- 0
repeat {
  i <- i + 1
  if (i %% 2 == 1) next
  if (i > 10) break
  message("The value of i is ", i)
}
The value of i is 2
The value of i is 4
The value of i is 6
The value of i is 8
The value of i is 10
As a rule, code which is repeated numerous times in your scripts, or which you run regularly with different inputs, or which you want a way to test easily, should be extracted into a function.
Functions are (generally) defined in R using function:
new_func <- function() {
  message("hello")
}
You then call a function by executing it with () at the end:
new_func()
hello
You can read much, much more about how functions work by reading the Functions chapter in Advanced R by Hadley Wickham, but for now it’s worth noting that functions have 3 elements:
formals, or arguments
body
environment
The formals are what go inside the brackets when defining the function. Variables which may be different each time the function is called would be fed in via these arguments. For example:
doubler <- function(n) {
  n * 2
}
doubler(10)
[1] 20
doubler(40)
[1] 80
The body is just the code which will be executed: this sits between the curly brackets. The value of the last expression evaluated in the body is returned by the function (meaning you can assign it to something if you want).
It’s a very good idea to add argument validation towards the top of your function’s body as well:
doubler <- function(n) {
  if (is.character(n)) stop("n must be a number")
  n * 2
}
doubler("ten")
Error in doubler("ten"): n must be a number
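The check above only rejects character input. A slightly stricter variant, sketched here under the hypothetical name doubler_strict, tests for what is wanted (a number) rather than for one thing that is not, so that logical values and NULL are rejected too:

```r
# doubler_strict is a hypothetical name used for illustration
doubler_strict <- function(n) {
  # Reject anything that is not numeric (character, logical, NULL, ...)
  if (!is.numeric(n)) stop("n must be a number")
  n * 2
}

doubler_strict(10)

# tryCatch() captures the error message so the script can carry on
tryCatch(
  doubler_strict(TRUE),
  error = function(e) conditionMessage(e)
)
```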
To understand more about environments, recommended reading is the section about lexical scoping in Advanced R. At this point, it’s useful to mention that a function will try to use the “most locally” defined variable with the same name, looking first in its own environment and, if it does not find it there, in the environment in which the function was defined.
Also, the call-time environments of functions are transient: variables which are created on execution will not generally continue to exist once the function exits.
x <- 10
y <- 20
random_function <- function() {
  x <- 100
  z <- 30
  x + y + z
}
random_function()
[1] 150
x
[1] 10
y
[1] 20
z
Error in eval(expr, envir, enclos): object 'z' not found
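Lexical scoping means that a variable which is not found inside a function is looked up where the function was defined, not where it happens to be called from. The sketch below illustrates this with made-up names (label, show_label and caller):

```r
label <- "global"

show_label <- function() {
  # `label` is not defined here, so R looks in the environment
  # where show_label() was defined (the global environment)
  label
}

caller <- function() {
  # This `label` is local to caller() and invisible to show_label()
  label <- "caller"
  show_label()
}

result <- caller()
result
```

Here result holds "global": the local label inside caller() plays no part in the lookup.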
Functionals (apply functions)
A functional is any function that takes a function as an input and returns a vector as output.
Here is a simple example:
numbers <- c(1, 2, 3, 4, 5, 6)
some_functional <- function(f) {
  f(numbers)
}
some_functional(mean)
[1] 3.5
some_functional(range)
[1] 1 6
some_functional(sum)
[1] 21
They are most often used to perform the same action on every element of a list or vector. Base R has the lapply family of functions to do this:
some_names <- c("bob", "jane", "eric")
lapply(some_names, toupper)
[[1]]
[1] "BOB"
[[2]]
[1] "JANE"
[[3]]
[1] "ERIC"
The first argument is the vector (or list) to iterate over, and the second argument is the function to apply to each element. As you can see, lapply returns a list.
Sometimes, the functions you supply to lapply have additional arguments which you’d like to specify. You can provide additional, named, arguments in the call to lapply:
some_numbers <- list(
  c(1, 4, NA),
  c(10, 10, 200),
  c(9, 8, NA, 100, 542)
)
lapply(some_numbers, sum)
[[1]]
[1] NA
[[2]]
[1] 220
[[3]]
[1] NA
lapply(some_numbers, sum, na.rm = TRUE)
[[1]]
[1] 5
[[2]]
[1] 220
[[3]]
[1] 659
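When every result is a single value, a plain vector is often more convenient than a list. Two other members of the same base R family, sapply and vapply, can be used for that. A sketch, redefining some_numbers so it runs on its own:

```r
some_numbers <- list(
  c(1, 4, NA),
  c(10, 10, 200),
  c(9, 8, NA, 100, 542)
)

# sapply() simplifies the result to a vector where it can
totals_s <- sapply(some_numbers, sum, na.rm = TRUE)

# vapply() does the same, but checks each result against a template:
# here, exactly one numeric value per element
totals_v <- vapply(some_numbers, sum, FUN.VALUE = numeric(1), na.rm = TRUE)
totals_v
```

vapply is the safer choice inside functions, since it raises an error if a result does not match the template instead of silently changing the shape of the output.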
Having covered a few of the basics of R, it is worthwhile running through some practical tasks which are useful day-to-day.
csv files
read.csv
countries <- read.csv(
  "https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv"
)
knitr::kable(
  head(countries)
)
name | alpha.2 | alpha.3 | country.code | iso_3166.2 | region | sub.region | intermediate.region | region.code | sub.region.code | intermediate.region.code |
---|---|---|---|---|---|---|---|---|---|---|
Afghanistan | AF | AFG | 4 | ISO 3166-2:AF | Asia | Southern Asia | | 142 | 34 | NA |
Åland Islands | AX | ALA | 248 | ISO 3166-2:AX | Europe | Northern Europe | | 150 | 154 | NA |
Albania | AL | ALB | 8 | ISO 3166-2:AL | Europe | Southern Europe | | 150 | 39 | NA |
Algeria | DZ | DZA | 12 | ISO 3166-2:DZ | Africa | Northern Africa | | 2 | 15 | NA |
American Samoa | AS | ASM | 16 | ISO 3166-2:AS | Oceania | Polynesia | | 9 | 61 | NA |
Andorra | AD | AND | 20 | ISO 3166-2:AD | Europe | Southern Europe | | 150 | 39 | NA |
write.csv
write.csv(countries, "path/to/file.csv")
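By default, write.csv also writes the row names as an extra first column, which then comes back as a column of its own when the file is read again. The sketch below uses a small made-up data frame (scores) and a temporary file to show a round trip with row.names = FALSE:

```r
# Hypothetical example data, written to a temporary file
scores <- data.frame(colour = c("red", "yellow"), score = 1:2)
path <- tempfile(fileext = ".csv")

# row.names = FALSE leaves the row names out of the file
write.csv(scores, path, row.names = FALSE)

# Reading the file back gives the same two columns
round_trip <- read.csv(path)
names(round_trip)
```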
At the moment, I would recommend using the httr package for HTTP requests.
url <- "https://httpbin.org/anything?filter=everything&goal=show%20how%20to%20api%20request"
req <- httr::GET(url)
content <- httr::content(req)
content
$args
$args$filter
[1] "everything"
$args$goal
[1] "show how to api request"
$data
[1] ""
$files
named list()
$form
named list()
$headers
$headers$Accept
[1] "application/json, text/xml, application/xml, */*"
$headers$`Accept-Encoding`
[1] "deflate, gzip"
$headers$Host
[1] "httpbin.org"
$headers$`User-Agent`
[1] "libcurl/7.77.0 r-curl/4.3.2 httr/1.4.2"
$headers$`X-Amzn-Trace-Id`
[1] "Root=1-623a3dfe-2805beab6222f3b1458c29af"
$json
NULL
$method
[1] "GET"
$url
[1] "https://httpbin.org/anything?filter=everything&goal=show how to api request"
The rvest package is commonly used for web scraping:
url <- "https://www.scrapethissite.com/pages/simple/"
page <- rvest::read_html(x = url)
elements <- rvest::html_elements(x = page, css = "h3.country-name")
head(
  rvest::html_text2(elements)
)
[1] "Andorra" "United Arab Emirates" "Afghanistan"
[4] "Antigua and Barbuda" "Anguilla" "Albania"
It is a good idea to bundle useful, reusable code (especially code which you plan to share with others and to test) as a package.
Creating R packages is a big topic, too big to cover in a single session, but you can read more about it in Hadley Wickham and Jenny Bryan’s book R Packages. It leans heavily on the usethis package, a set of utilities created to simplify the setup of projects / packages.
Unit testing is crucial when creating reliable packages. A little time up front writing out expectations as to what a function will return in a number of different circumstances saves you having to manually check that it does before you share your project. It also means that you can automate your tests: ensuring that it is impossible to publish work which does not demonstrably do what it set out to do.
The testthat package is widely used for this. The R Packages book has a section on how to get up and running with tests on R packages.
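To give a flavour of what such expectations look like, here is a minimal sketch of a testthat test for the doubler function defined earlier. It assumes the testthat package is installed (the code is skipped otherwise), and the test description is made up for illustration.

```r
# The function under test, repeated here so the example runs on its own
doubler <- function(n) {
  if (is.character(n)) stop("n must be a number")
  n * 2
}

# Only run the test if testthat is available
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("doubler doubles numbers and rejects strings", {
    testthat::expect_equal(doubler(10), 20)
    testthat::expect_error(doubler("ten"), "n must be a number")
  })
}
```

In a package, tests like this would normally live in their own files rather than next to the function definition.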
To read the documentation for a function, type ?name_of_function into the R console and hit enter.
For example, to get the help documentation for read.table, you would type:
?read.table
read.table package:utils R Documentation
Data Input
Description:
Reads a file in table format and creates a data frame from it,
with cases corresponding to lines and variables to fields in the
file.
Usage:
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = FALSE,
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)