This is an introduction to R designed for participants with no programming experience.
On a University device:
On a personal device:
tidyverse and click install.
R is a programming language as well as the software that runs R code written in that language.
RStudio is separate software which provides an interface that makes it more convenient to interact with R and write R code.
It’s good practice to organise files relating to research projects into their own project folder. A well-organised project is easier to navigate, more reproducible, and easier to share with others.
Let’s create a basic project directory outside of RStudio now to mimic the kind of project we might already have going. For the purposes of this lesson we will create our project folder on the desktop, but you will likely have your project folders elsewhere (such as in a research drive).
Create the following folders:
r-workshop
├── R
├── data
└── fig
RStudio provides a “Projects” feature that makes it easier to work on individual projects in R. We will create an R project using the project folder we just created.
One of the benefits of using RStudio projects is that they automatically set the working directory to the top-level folder for the project (r-workshop here). The working directory is the folder where R looks for files to read in and where it saves outputs. R views the location of all files in the project as being relative to the working directory. You may come across scripts that include something like setwd("/Users/YourUserName/MyRProject"), which directly sets a working directory. But such a script is unlikely to run on anyone else’s computer, since that specific directory is probably unique to the person who wrote the script. Using RStudio Projects means we don’t have to deal with manually setting the working directory, and it means that R projects are fully self-contained.
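As a minimal sketch of what "relative to the working directory" means, the lines below check where R is currently looking and build a relative path to the workshop data file. The object name data_path is only an illustration, and file.exists() will return TRUE only once the file has actually been placed in the data folder.

```r
# Where is R currently looking for files? Inside an RStudio project this
# should be the project's top-level folder (r-workshop in this lesson).
getwd()

# Build a path relative to the working directory instead of hard-coding
# a full path such as /Users/YourUserName/...
data_path <- file.path("data", "surveys_complete.csv")
data_path

# TRUE only if the file really exists at that relative location
file.exists(data_path)
```

Because the path starts from the project folder rather than from a specific user's home directory, the same line should work for anyone who opens the project.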
Next time you open RStudio, you can click that 3D cube icon, and you will see options to open recent projects, like the one you just made.
Let’s move the workshop data
file into the data folder now. Next we need to unzip the file so
that the 5 files inside are now sitting in the data folder. You should
see a file called surveys_complete.csv which is the data we
will be using for the workshop. You should also see this same data file
split into 4 separate spreadsheet files, which we’ll use later in the
workshop to demonstrate how to read multiple files in at once.
The essence of programming is writing down instructions for the computer to follow, and then telling the computer to follow those instructions. We write the instructions in code, which is a common language that is understood by the computer and humans (after some practice). We call these instructions commands, and we tell the computer to follow the instructions by running (also called executing) the commands.
Console vs. script
You can run commands directly in the R console, or you can write them into an R script. It may help to think of working in the console vs. working in a script as something like cooking. The console is like making up a new recipe, but not writing anything down. You can carry out a series of steps and produce a final dish at the end. However, because you didn’t write anything down, it’s harder to figure out exactly what you did, and in what order.
Writing a script is like taking notes while cooking. You can tweak and edit the recipe all you want, you can come back in 6 months and try it again, and you don’t have to try to remember what went well and what didn’t. To cook the recipe again we can hit a single button in RStudio to run an entire script.
You can also leave comments for yourself or others to read in your script. Lines that start with # are considered comments and will not be interpreted as R code.
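A short illustration of comments in a script follows. The temperature conversion is a made-up example, not part of the workshop data; the point is only that R skips everything after a # on a line.

```r
# Convert a temperature from Fahrenheit to Celsius.
# Lines like these two are ignored by R; they are notes for the reader.
(98.6 - 32) * 5 / 9  # a comment can also follow code on the same line
```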
Console
Script
Most of our R code should be saved in a script, but sometimes we need to do things that we don’t need to preserve in our script. For example, we shouldn’t put commands to install packages in our scripts, but we do need to load the packages we use at the start of a script.
R as a calculator
Go ahead and type a math equation into the console and hit enter. Note how the result is printed immediately. Now type the equation into a new script and hit enter. Note how the cursor moves to a new line. To run the equation in the script, click anywhere on the line of code and either click the ‘Run’ button on the top right of the script, or use Cmd+Enter (Mac) or Ctrl+Enter (Windows) to send the equation to the console.
From now on, use a script to write and execute the code in this workshop.
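If you are not sure what to type, here are a few expressions you could try. The numbers are arbitrary; run each line with Cmd+Enter or Ctrl+Enter and watch the result appear in the console.

```r
3 + 5        # addition
12 / 7       # division
2^3          # exponent: 2 to the power of 3
(3 + 5) * 2  # brackets control the order of operations
```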
Assignment
In R, we can assign values to objects by using the
assignment arrow <- (shortcut Alt+- on Windows/Linux and
Cmd+- on Mac). Whatever is on the right of the arrow will be assigned to
whatever is on the left. To create an object we provide a name for the
object and assign a value.
For example, to assign the number 50 to an object called weight:
weight <- 50
weight
## [1] 50
We can do calculations with objects and store the results in other objects:
weight_lb <- weight * 2.2
weight_lb
## [1] 110
We can overwrite the value of an existing object by assigning to it again:
weight <- 100
Even though we’ve changed the value of weight, the value of weight_lb does not change, unless we run the conversion to pounds again.
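The sequence below sketches this behaviour from start to finish, reusing the weight and weight_lb objects from above. R stores the result of a calculation, not the formula, so weight_lb only changes when the conversion line is run again.

```r
weight <- 50
weight_lb <- weight * 2.2   # calculated once, using weight = 50

weight <- 100               # overwrite weight
weight_lb                   # still holds the result of the earlier calculation

weight_lb <- weight * 2.2   # re-run the conversion to update it
weight_lb
```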
When naming objects:
Functions
Functions are the workhorse of many programming languages, including R. Functions usually take one or more inputs, do something with the input, and create an output. We can think of a function as a piece of ‘canned code’ because it is made up of one or many lines of code in the background.
Inputs are called arguments, and if they’re optional they’re called options. Understanding which function to use to achieve a particular task comes with time and experience. To understand which arguments or options can be used we can look at the documentation provided for the function.
Let’s look at the round() function as an example. To learn more about a function, you can type a ? in front of the name of the function, which will bring up the documentation for that function:
?round
## starting httpd help server ... done
We can see that round() takes x, a numeric vector, and an option called digits to indicate how many decimal places to round to. x is required, so we need to provide it. We’ll come back to vectors later, but for now we can think of vectors as one or more values:
round(x = 3.1412)
## [1] 3
Note we get back 3. So the default behaviour is to round to whole numbers. If we wanted to specify differently, we could supply the digits option:
round(x = 3.1412, digits = 2)
## [1] 3.14
When specifying arguments in the order expected by the function we don’t have to name them:
round(3.1412, 2)
## [1] 3.14
But it makes your code clearer to name arguments, especially when you’re learning.
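One practical reason naming helps, sketched with the same round() call: named arguments can be written in any order, whereas unnamed arguments are matched purely by position, so swapping them silently changes what is being rounded.

```r
# Named arguments can be given in any order
round(digits = 2, x = 3.1412)

# Unnamed arguments are matched by position: here 2 is treated as x
# and 3.1412 as digits, which is probably not what was intended
round(2, 3.1412)

# args() prints a function's arguments and their default values
args(round)
```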
We’re going to create some plots using functions from the
ggplot2 package. R packages usually contain functions
related to a specific activity, which in this case is data
visualisation. Packages often include documentation which outlines how
the functions work, and sometimes they include other things like demo
datasets.
We’ve already installed the ggplot2 package. It was
installed as part of the tidyverse package, which is
actually a popular collection of packages developed by the same group of
people, that work well together.
Each time we start a new RStudio session, we need to load the
packages that we want to make use of in our script, and to do this we
use library().
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
After loading packages, the next thing we often need to do is to load
some data into R. To load our spreadsheet in we use the
read_csv function from the readr package, and
we’ll store it in an object called surveys.
surveys <- read_csv(file = "data/surveys_complete.csv")
## Rows: 16878 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): species_id, sex, genus, species, taxa, plot_type
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
You’ll notice some useful information printed in the console,
including the number of rows and columns, and the number of different
column types. chr refers to character, or columns
containing text values, while dbl refers to double, another
name for numeric values.
Now we can see surveys in our environment pane and we
can click on it to take a look.
This dataset is taken from the Portal Project, a long-term study from
Portal, Arizona, in the Chihuahuan Desert, which measures various
features of desert-dwelling mammals.
To get a bit more information about any object in R, we can look at its structure:
str(surveys)
## spc_tbl_ [16,878 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ record_id : num [1:16878] 1 2 3 4 5 6 7 8 9 10 ...
## $ month : num [1:16878] 7 7 7 7 7 7 7 7 7 7 ...
## $ day : num [1:16878] 16 16 16 16 16 16 16 16 16 16 ...
## $ year : num [1:16878] 1977 1977 1977 1977 1977 ...
## $ plot_id : num [1:16878] 2 3 2 7 3 1 2 1 1 6 ...
## $ species_id : chr [1:16878] "NL" "NL" "DM" "DM" ...
## $ sex : chr [1:16878] "M" "M" "F" "M" ...
## $ hindfoot_length: num [1:16878] 32 33 37 36 35 14 NA 37 34 20 ...
## $ weight : num [1:16878] NA NA NA NA NA NA NA NA NA NA ...
## $ genus : chr [1:16878] "Neotoma" "Neotoma" "Dipodomys" "Dipodomys" ...
## $ species : chr [1:16878] "albigula" "albigula" "merriami" "merriami" ...
## $ taxa : chr [1:16878] "Rodent" "Rodent" "Rodent" "Rodent" ...
## $ plot_type : chr [1:16878] "Control" "Long-term Krat Exclosure" "Control" "Rodent Exclosure" ...
## - attr(*, "spec")=
## .. cols(
## .. record_id = col_double(),
## .. month = col_double(),
## .. day = col_double(),
## .. year = col_double(),
## .. plot_id = col_double(),
## .. species_id = col_character(),
## .. sex = col_character(),
## .. hindfoot_length = col_double(),
## .. weight = col_double(),
## .. genus = col_character(),
## .. species = col_character(),
## .. taxa = col_character(),
## .. plot_type = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
The gg in ggplot2 stands for “grammar of graphics”, and the package uses consistent vocabulary to create plots of widely varying types. Therefore, we only need small changes to our code if the underlying data changes or we decide to make a box plot instead of a scatter plot. This approach helps you create publication-quality plots with minimal adjusting and tweaking.
ggplot plots are built step by step by adding new layers, which allows for extensive flexibility and customization of plots. We’re going to use the following template to build our plots:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
We need to replace everything inside the <> brackets, along with the brackets themselves.
First we need to specify the data we want to use. Next, we can see
that the ggplot() call has a function nested within it, aes(), which
relates to aesthetic mappings. This is where we specify which parts of
the data should be mapped to different axes. Finally, we need to specify
the geometry we want to use, or in other words, the plot type. The
geometry function we use is added to the end with a +
sign.
Let’s go through step by step:
ggplot(data = surveys)
We get a blank plot because we haven’t told ggplot() which variables we want to correspond to parts of the plot.
ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length))
Now we’ve got a plot with x and y axes corresponding to variables from surveys. However, we haven’t specified how we want the data to be displayed. We do this using geom_ functions, which specify the type of geometry we want, such as points, lines, or bars. We can add a geom_point() layer to our plot by using the + sign. We indent onto a new line to make it easier to read, and we have to end the first line with the + sign.
ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point()
## Warning: Removed 3081 rows containing missing values or values outside the scale range
## (`geom_point()`).
You may notice a warning that missing values were removed. If a variable necessary to make the plot is missing from a given row of data (in this case, hindfoot_length or weight), that observation can’t be plotted. ggplot2 warns us that this has happened.
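To see how such a count could arise, here is a small sketch using a made-up data frame called toy (not the workshop data). A row is dropped from a scatter plot if either of the two plotted variables is missing, so we count the rows where at least one of them is NA.

```r
# A small, made-up data frame with some missing values (NA)
toy <- data.frame(
  weight          = c(40, NA, 48, 35, NA),
  hindfoot_length = c(35, 36, NA, 33, NA)
)

# A row cannot be plotted if either variable is missing
row_incomplete <- is.na(toy$weight) | is.na(toy$hindfoot_length)

# Number of rows that would be dropped from the plot
sum(row_incomplete)
```

The same check could be run on the weight and hindfoot_length columns of surveys to see which rows are affected.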
Another common type of message is an error, which means R can’t execute your command. This is commonly due to misspelling, or missing punctuation such as brackets or commas.
We’ve learned how to create visualisations from the
surveys data, but what actually is surveys? R
commonly stores tabular data in data.frames, and that is how the
surveys data is stored. It’s useful to understand how R
thinks about, represents, and stores data in order for us to have a
productive working relationship with R.
We can check what surveys is by using the class()
function:
class(surveys)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
This tells us a smaller amount of information than when we used
str(), namely that surveys is ultimately a
data.frame.
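As a rough sketch of what a data.frame is, the example below builds a tiny one by hand. The object toy and its values are invented for illustration; the idea is that a data frame is a table whose columns are named vectors of equal length, each with its own type.

```r
# A tiny, made-up data frame: each column is a vector of the same length
toy <- data.frame(
  species_id = c("NL", "DM", "PF"),
  weight     = c(NA, 40, 8)
)

class(toy)         # what kind of object is it?
nrow(toy)          # number of rows
ncol(toy)          # number of columns
names(toy)         # column names
class(toy$weight)  # the $ operator pulls out a single column
```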
When we’re looking at large or unfamiliar datasets it’s useful to do some quick checks to get an idea of the kind of data we’re working with. For example, we can look at the first and last 6 rows of our data:
head(surveys)
## # A tibble: 6 × 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 2 7 16 1977 3 NL M 33 NA
## 3 3 7 16 1977 2 DM F 37 NA
## 4 4 7 16 1977 7 DM M 36 NA
## 5 5 7 16 1977 3 DM M 35 NA
## 6 6 7 16 1977 1 PF M 14 NA
## # ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
tail(surveys)
## # A tibble: 6 × 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 16873 12 5 1989 8 DO M 37 51
## 2 16874 12 5 1989 16 RM F 18 15
## 3 16875 12 5 1989 5 RM M 17 9
## 4 16876 12 5 1989 4 DM M 37 31
## 5 16877 12 5 1989 11 DM M 37 50
## 6 16878 12 5 1989 8 DM F 37 42
## # ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
We used these functions without naming the argument because the first
argument of head() and tail() is x, the data object we want to feed in.
We can still name it, but sometimes our code is just as clear without
doing so.
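head() and tail() also have an optional n argument that controls how many rows are returned (the default is 6). The sketch below uses mtcars, a small dataset that ships with R, purely so that the example runs without the workshop files; the same calls should work on surveys.

```r
# First 3 rows of a dataset that ships with R
head(mtcars, n = 3)

# Last 2 rows
tail(mtcars, n = 2)

# Number of rows and columns in the full dataset
dim(mtcars)
```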
We can get a useful summary of each numeric variable (column) by
using summary():
summary(surveys)
## record_id month day year plot_id
## Min. : 1 Min. : 1.000 Min. : 1.0 Min. :1977 Min. : 1.00
## 1st Qu.: 4220 1st Qu.: 3.000 1st Qu.: 9.0 1st Qu.:1981 1st Qu.: 5.00
## Median : 8440 Median : 6.000 Median :15.0 Median :1983 Median :11.00
## Mean : 8440 Mean : 6.382 Mean :15.6 Mean :1984 Mean :11.47
## 3rd Qu.:12659 3rd Qu.: 9.000 3rd Qu.:23.0 3rd Qu.:1987 3rd Qu.:17.00
## Max. :16878 Max. :12.000 Max. :31.0 Max. :1989 Max. :24.00
##
## species_id sex hindfoot_length weight
## Length:16878 Length:16878 Min. : 6.00 Min. : 4.00
## Class :character Class :character 1st Qu.:21.00 1st Qu.: 24.00
## Mode :character Mode :character Median :35.00 Median : 42.00
## Mean :31.98 Mean : 53.22
## 3rd Qu.:37.00 3rd Qu.: 53.00
## Max. :70.00 Max. :278.00
## NA's :2733 NA's :1692
## genus species taxa plot_type
## Length:16878 Length:16878 Length:16878 Length:16878
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
surveys_new <- read_csv(file = file_paths, id = "source_file")
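The call above relies on an object called file_paths, which is not defined in this section. One way such a vector of paths could be built is with list.files(), sketched here using two made-up CSV files in a temporary folder so that the example is self-contained. The folder name demo_data and the file names are assumptions for illustration, not the workshop's actual files.

```r
# Create two small CSV files in a temporary folder
# (folder and file names are invented for this example)
demo_dir <- file.path(tempdir(), "demo_data")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("record_id,weight", "1,40"), file.path(demo_dir, "part_1.csv"))
writeLines(c("record_id,weight", "2,48"), file.path(demo_dir, "part_2.csv"))

# Collect the full path of every .csv file in that folder
file_paths <- list.files(path = demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Show just the file names
basename(file_paths)
```

A character vector like this is the kind of input read_csv() accepts for its file argument, and the id argument then records which file each row came from.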