Data Analysis and Visualisation in R

This is an introduction to R designed for participants with no programming experience.

Set up

Install R & RStudio

On a University device:

Search ‘R project’ in the Software Center / Self Service and install.
Search RStudio in the Software Center / Self Service and install.

On a personal device:

Install R and then RStudio directly.

Install required R package

In RStudio, click on Tools → Install Packages.
Type tidyverse and click install.
Wait for the background process to finish (can take a couple of minutes).

Know how to check for updates

Keep R up to date by checking Software Center or CRAN every now and then.
Keep RStudio up to date by checking Software Center or following the automatic prompts in outdated versions.
Keep packages up to date by clicking Tools → ‘Check for Package Updates’ every now and then.

Download the data

Download the data file and save it somewhere convenient.

Introduction to R & RStudio

What are R & RStudio?

R is a programming language as well as the software that runs R code written in that language.

RStudio is separate software which provides an interface that makes it more convenient to interact with R and write R code.

Why learn R?

Programming is more efficient than pointing and clicking

Instead of remembering what you clicked on and in what order, writing out R code and saving it in a script file makes it far easier to see what you did and why.
If your data changes, or you have another similar dataset, a script can be rerun in seconds.

R is great for reproducibility

By publishing your data and code, you ensure that your results are transparent, verifiable, and can be built on by others.

R can do just about anything

R was originally written with statistics in mind, but it has been expanded and extended over the years into a fully fledged programming language capable of performing all kinds of tasks.
R works on data of all shapes and sizes, and there are community-developed packages to make it easier to perform all kinds of analyses and workflows.

R produces famously good graphics

With R you can create professional-looking plots and figures to submit to journals with your manuscripts.

The R community is welcoming and helpful

R is popular among researchers so you’ll find large and welcoming communities of fellow R users in places like Stack Overflow or the Posit Forums (Posit is the company which makes RStudio).

Navigating RStudio

RStudio can help with many advanced tasks, but we’re going to be focussing on the core features.

In the default layout there are 4 panes, clockwise from top left:

Top left: Source pane. Displays scripts, data, and other files.
Top right: Environment pane. Displays the objects in your current R session.
Bottom right: Files pane. Displays a file explorer, any plots you create, and help information.
Bottom left: Console. Where commands are executed by R.

RStudio Interface

These 4 panes contain most of the features you need to start working with R.

We will also change a couple of settings by selecting Tools → Global Options:


Ensure the settings under ‘R sessions’ are unticked, and the settings under ‘Workspace’ are unticked and set to ‘Never’. This ensures R starts a fresh session each time you open it and helps to prevent confusion, especially when learning.

Other RStudio features we will see as the lesson progresses:

Keyboard shortcuts
Autocompletion
Syntax highlighting

Getting Set Up in RStudio

It’s good practice to organise files relating to research projects into their own project folder. A well-organised project is easier to navigate, more reproducible, and easier to share with others.

Let’s create a basic project directory outside of RStudio now to mimic the kind of project we might already have going. For the purposes of this lesson we will create our project folder on the desktop, but you will likely have your project folders elsewhere (such as in a research drive).

Create the following folders:

r-workshop
├── R
├── data
└── fig

RStudio provides a “Projects” feature that makes it easier to work on individual projects in R. We will create an R project using the project folder we just created.

In the top right, you will see a blue 3D cube and the words “Project: (None)”. Click on this icon.
Click New Project from the dropdown menu.
Click ‘Existing folder’.
Click ‘Browse’ and find the ‘r-workshop’ folder on your desktop, and click ‘Create’.
The project will open in RStudio, the files pane will open to the project folder, and you’ll notice a new file called ‘r-workshop.Rproj’ in the r-workshop folder.

One of the benefits of using RStudio projects is that they automatically set the working directory to the top-level folder for the project (r-workshop here). The working directory is the folder where R looks for files to read in, and where it saves outputs by default. R treats the location of every file in the project as relative to the working directory. You may come across scripts that include something like setwd("/Users/YourUserName/MyRProject"), which sets the working directory directly. Such a script is unlikely to run on anyone else’s computer, since that specific directory is probably unique to the person who wrote the script. Using RStudio Projects means we don’t have to set the working directory manually, and it keeps R projects self-contained.
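As a rough illustration of relative paths (a sketch, assuming the r-workshop project is open and contains a data folder), getwd() reports the current working directory and file.path() builds a path relative to it:

```r
# Print the current working directory. Inside an RStudio project this is
# the top-level project folder (r-workshop in this lesson).
getwd()

# Build a path relative to the working directory.
data_path <- file.path("data", "surveys_complete.csv")
data_path

# TRUE if a file exists at that relative location, FALSE otherwise.
file.exists(data_path)
```

Because the path never mentions a user name or drive, the same line works on any computer where the project folder has the same layout.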

Next time you open RStudio, you can click that 3D cube icon, and you will see options to open recent projects, like the one you just made.

Let’s move the workshop data file into the data folder now. Next we need to unzip the file so that the 5 files inside are now sitting in the data folder. You should see a file called surveys_complete.csv, which is the data we will be using for the workshop. You should also see this same data file split into 4 separate spreadsheet files, which we’ll use later in the workshop to demonstrate how to read multiple files in at once.

Working in R & RStudio

The essence of programming is writing down instructions for the computer to follow, and then telling the computer to follow those instructions. We write the instructions in code, which is a common language that is understood by the computer and humans (after some practice). We call these instructions commands, and we tell the computer to follow the instructions by running (also called executing) the commands.

Console vs. script

You can run commands directly in the R console, or you can write them into an R script. It may help to think of working in the console vs. working in a script as something like cooking. The console is like making up a new recipe, but not writing anything down. You can carry out a series of steps and produce a final dish at the end. However, because you didn’t write anything down, it’s harder to figure out exactly what you did, and in what order.

Writing a script is like taking notes while cooking. You can tweak and edit the recipe all you want, you can come back in 6 months and try it again, and you don’t have to try to remember what went well and what didn’t. To cook the recipe again we can hit a single button in RStudio to run an entire script.

You can also leave comments for yourself or others to read in your script. Lines that start with # are considered comments and will not be interpreted as R code.
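A small sketch of what comments look like in a script (the numbers are only for illustration):

```r
# Lines starting with # are comments: R skips them entirely.
# Convert 50 kg to pounds (1 kg is roughly 2.2 lb).
50 * 2.2

60 * 2.2  # a comment can also follow code on the same line
```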

Console

Code is run/executed
Commands are typed at the prompt > symbol
R runs the commands when you press enter
What you write here is lost when your RStudio session ends

Script

A record of commands to send to R
A plain text file with a .R extension
File → New File → R Script, or Shift+Cmd+N (Mac) or Ctrl+Shift+N (Windows)
Regularly save scripts to avoid losing your work
R code in a script can be sent to the console with Cmd+Enter (Mac) or Ctrl+Enter (Windows)
Leave comments by starting lines with #
Insert sections with Shift+Cmd+R (Mac) or Ctrl+Shift+R (Windows)

Most of our R code should be saved in a script, but sometimes we need to do things that we don’t need to preserve in our script. For example, we shouldn’t put commands to install packages in our scripts, but we do need to load the packages we use at the start of a script.

R Fundamentals

R as a calculator

Go ahead and type a math equation into the console and hit enter. Note how the result is printed immediately. Now type the equation into a new script and hit enter. Note how the cursor moves to a new line. To run the equation in the script, click anywhere on the line of code and either click the ‘Run’ button on the top right of the script, or use Cmd+Enter (Mac) or Ctrl+Enter (Windows) to send the equation to the console.
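If you are not sure what to type, here are a few calculations to try. This is only a sketch of the basic arithmetic operators; any numbers will do:

```r
3 + 5        # addition
12 / 4       # division
(3 + 5) * 2  # brackets control the order of operations
2^3          # exponent: 2 to the power of 3
```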

From now on, use a script to write and execute the code in this workshop.

Assignment

In R, we can assign values to objects by using the assignment arrow <- (shortcut Alt+- on Windows/Linux and Cmd+- on Mac). Whatever is on the right of the arrow will be assigned to whatever is on the left. To create an object we provide a name for the object and assign a value.

For example, to assign the number 50 to an object called weight:

weight <- 50
weight

## [1] 50

We can do calculations with objects and store the results in other objects:

weight_lb <- weight * 2.2
weight_lb

## [1] 110

We can overwrite an object by assigning a new value to the same name:

weight <- 100

Even though we’ve changed the value of weight, the value of weight_lb does not change, unless we run the conversion to pounds again.
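The full sequence can be sketched as follows, reusing the weight and weight_lb objects from above:

```r
weight <- 50
weight_lb <- weight * 2.2   # weight_lb is calculated once, here

weight <- 100               # overwrite weight
weight_lb                   # unchanged: still based on the old weight

weight_lb <- weight * 2.2   # rerun the conversion
weight_lb                   # now reflects the new weight
```

R stores the result of a calculation, not the calculation itself, so objects are only updated when the line that creates them is run again.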

When naming objects:

use nouns
be descriptive but concise
use a consistent code style (like lower_snake)

Functions

Functions are the workhorse of many programming languages, including R. Functions usually take one or more inputs, do something with the input, and create an output. We can think of a function as a piece of ‘canned code’ because they are made up of one or many lines of code in the background.

Inputs are called arguments, and if they’re optional they’re called options. Understanding which function to use to achieve a particular task comes with time and experience. To understand which arguments or options can be used we can look at the documentation provided for the function.

Let’s look at the round() function as an example. To learn more about a function, you can type a ? in front of the name of the function, which will bring up the documentation for that function:

?round

## starting httpd help server ... done

We can see that round() takes x, a numeric vector, and an option called digits to indicate how many decimal places to round to. x is required, so we need to provide it (note the lower case: R is case-sensitive). We’ll come back to vectors later, but for now we can think of vectors as one or more values:

round(x = 3.1412)

## [1] 3

Note we get back 3. So the default behaviour is to round to whole numbers. If we wanted to specify differently, we could supply the digits option:

round(x = 3.1412, digits = 2)

## [1] 3.14

When specifying arguments in the order expected by the function we don’t have to name them:

round(3.1412, 2)

## [1] 3.14

But it makes your code clearer to name arguments, especially when you’re learning.
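To make the idea of an option with a default value concrete, here is a minimal sketch comparing a call that leaves digits out with one that writes the default out in full. The object names default_result and explicit_result are made up for this example:

```r
# digits is optional: when it is left out, round() uses its default of 0.
default_result <- round(x = 3.1412)
explicit_result <- round(x = 3.1412, digits = 0)

# Both calls return the same value, so this comparison is TRUE.
identical(default_result, explicit_result)
```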

Data Visualisation with ggplot2

We’re going to create some plots using functions from the ggplot2 package. R packages usually contain functions related to a specific activity, which in this case is data visualisation. Packages often include documentation which outlines how the functions work, and sometimes they include other things like demo datasets.

We’ve already installed the ggplot2 package. It was installed as part of the tidyverse package, which is actually a popular collection of packages developed by the same group of people, that work well together.

Reading In Data

Each time we start a new RStudio session, we need to load the packages that we want to make use of in our script, and to do this we use library().

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

After loading packages, the next thing we often need to do is to load some data into R. To load our spreadsheet we use the read_csv() function from the readr package, and we’ll store it in an object called surveys.

surveys <- read_csv(file = "data/surveys_complete.csv")

## Rows: 16878 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): species_id, sex, genus, species, taxa, plot_type
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

You’ll notice some useful information printed in the console, including the number of rows and columns, and the number of different column types. chr refers to character, or columns containing text values, while dbl refers to double, another name for numeric values.
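The column types can also be checked directly. The sketch below uses a tiny invented CSV rather than the workshop file so that it runs on its own; it assumes readr 2.0 or later, where literal text can be passed to read_csv() by wrapping it in I(). The object names csv_text and tiny are made up for this example:

```r
library(readr)

# A tiny made-up CSV, passed as literal text with I() so the example
# runs without any file on disk. The values are invented.
csv_text <- I("species_id,weight\nNL,40\nDM,NA\nPF,8\n")

# show_col_types = FALSE suppresses the column specification message.
tiny <- read_csv(file = csv_text, show_col_types = FALSE)

typeof(tiny$species_id)  # "character", shown as chr
typeof(tiny$weight)      # "double", shown as dbl; the NA entry is read as missing
```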

Now we can see surveys in our environment pane and we can click on it to take a look.

This dataset is taken from the Portal Project, a long-term study from Portal, Arizona, in the Chihuahuan Desert, which measures various features of desert-dwelling mammals.

To get a bit more information about any object in R, we can look at its structure:

str(surveys)

## spc_tbl_ [16,878 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ record_id      : num [1:16878] 1 2 3 4 5 6 7 8 9 10 ...
##  $ month          : num [1:16878] 7 7 7 7 7 7 7 7 7 7 ...
##  $ day            : num [1:16878] 16 16 16 16 16 16 16 16 16 16 ...
##  $ year           : num [1:16878] 1977 1977 1977 1977 1977 ...
##  $ plot_id        : num [1:16878] 2 3 2 7 3 1 2 1 1 6 ...
##  $ species_id     : chr [1:16878] "NL" "NL" "DM" "DM" ...
##  $ sex            : chr [1:16878] "M" "M" "F" "M" ...
##  $ hindfoot_length: num [1:16878] 32 33 37 36 35 14 NA 37 34 20 ...
##  $ weight         : num [1:16878] NA NA NA NA NA NA NA NA NA NA ...
##  $ genus          : chr [1:16878] "Neotoma" "Neotoma" "Dipodomys" "Dipodomys" ...
##  $ species        : chr [1:16878] "albigula" "albigula" "merriami" "merriami" ...
##  $ taxa           : chr [1:16878] "Rodent" "Rodent" "Rodent" "Rodent" ...
##  $ plot_type      : chr [1:16878] "Control" "Long-term Krat Exclosure" "Control" "Rodent Exclosure" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   record_id = col_double(),
##   ..   month = col_double(),
##   ..   day = col_double(),
##   ..   year = col_double(),
##   ..   plot_id = col_double(),
##   ..   species_id = col_character(),
##   ..   sex = col_character(),
##   ..   hindfoot_length = col_double(),
##   ..   weight = col_double(),
##   ..   genus = col_character(),
##   ..   species = col_character(),
##   ..   taxa = col_character(),
##   ..   plot_type = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Plotting with ggplot2

The gg in ggplot2 stands for “grammar of graphics”, and the package uses consistent vocabulary to create plots of widely varying types. Therefore, we only need small changes to our code if the underlying data changes or we decide to make a box plot instead of a scatter plot. This approach helps you create publication-quality plots with minimal adjusting and tweaking.

ggplot plots are built step by step by adding new layers, which allows for extensive flexibility and customization of plots. We’re going to use the following template to build our plots:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()

We need to change everything inside <>.

First we need to specify the data we want to use. Next, we can see that the ggplot() call has a function nested within it, aes(), which relates to aesthetic mappings. This is where we specify which parts of the data should be mapped to different axes. Finally, we need to specify the geometry we want to use, or in other words, the plot type. The geometry function we use is added to the end with a + sign.

Let’s go through step by step:

ggplot(data = surveys)

We get a blank plot because we haven’t told ggplot() which variables we want to correspond to parts of the plot.

ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length))

Now we’ve got a plot with x and y axes corresponding to variables from surveys. However, we haven’t specified how we want the data to be displayed. We do this using geom_ functions, which specify the type of geometry we want, such as points, lines, or bars. We can add a geom_point() layer to our plot by using the + sign. We indent onto a new line to make it easier to read, and we have to end the first line with the + sign.

ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point()

## Warning: Removed 3081 rows containing missing values or values outside the scale range
## (`geom_point()`).

You may notice a warning that missing values were removed. If a variable necessary to make the plot is missing from a given row of data (in this case, hindfoot_length or weight), that observation can’t be plotted. ggplot2 warns us that this has happened.
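If you already know about the missing values and don’t want the warning, geom_point() accepts na.rm = TRUE. The sketch below uses a small invented data frame called toy (not the workshop data) so that it runs on its own:

```r
library(ggplot2)

# Invented measurements; NA marks a missing value.
toy <- data.frame(
  weight = c(40, 48, NA, 10, 8),
  hindfoot_length = c(35, 37, 36, NA, 15)
)

# na.rm = TRUE drops incomplete rows silently instead of warning.
p <- ggplot(data = toy, mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point(na.rm = TRUE)
p
```

Silencing the warning also hides how many rows were dropped, so it is probably best left on while you are still getting to know a dataset.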

Another common type of message is an error, which means R can’t execute your command. This is commonly due to misspelling, or missing punctuation such as brackets or commas.
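As an illustration of a typical error, the sketch below calls rnd(), a deliberate misspelling of round(). It is wrapped in tryCatch(), a base R function not otherwise covered in this lesson, only so that the example can capture the error message instead of stopping:

```r
# rnd() does not exist, so this call fails. Typing rnd(3.1412) directly
# would stop with the same message that is captured here.
msg <- tryCatch(
  rnd(3.1412),
  error = function(e) conditionMessage(e)
)

# The error message names the function R could not find.
msg
```

Reading the message closely usually points to the problem, in this case a function name that R does not recognise.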

Exploring & Understanding Data

We’ve learned how to create visualisations from the surveys data, but what actually is surveys? R commonly stores tabular data in data.frames, and that is how the surveys data is stored. It’s useful to understand how R thinks about, represents, and stores data in order for us to have a productive working relationship with R.

We can check what surveys is by using the class() function:

class(surveys)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

This tells us a smaller amount of information than when we used str(), namely that surveys is ultimately a data.frame.

When we’re looking at large or unfamiliar datasets it’s useful to do some quick checks to get an idea of the kind of data we’re working with. For example, we can look at the first and last 6 rows of our data:

head(surveys)

## # A tibble: 6 × 13
##   record_id month   day  year plot_id species_id sex   hindfoot_length weight
##       <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
## 1         1     7    16  1977       2 NL         M                  32     NA
## 2         2     7    16  1977       3 NL         M                  33     NA
## 3         3     7    16  1977       2 DM         F                  37     NA
## 4         4     7    16  1977       7 DM         M                  36     NA
## 5         5     7    16  1977       3 DM         M                  35     NA
## 6         6     7    16  1977       1 PF         M                  14     NA
## # ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>

tail(surveys)

## # A tibble: 6 × 13
##   record_id month   day  year plot_id species_id sex   hindfoot_length weight
##       <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
## 1     16873    12     5  1989       8 DO         M                  37     51
## 2     16874    12     5  1989      16 RM         F                  18     15
## 3     16875    12     5  1989       5 RM         M                  17      9
## 4     16876    12     5  1989       4 DM         M                  37     31
## 5     16877    12     5  1989      11 DM         M                  37     50
## 6     16878    12     5  1989       8 DM         F                  37     42
## # ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>

We used these functions without naming the argument because the data object we want to feed in, x, is the first argument of both head() and tail(), and the only one we have to supply. We can still name it, but sometimes our code is just as clear without doing so.
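head() and tail() also accept an optional n argument that sets how many rows to return. The sketch below uses mtcars, a dataset built into R, purely as a stand-in for surveys so that it runs without the workshop file:

```r
# mtcars ships with R; it stands in for surveys so this runs anywhere.
head(x = mtcars, n = 3)   # first 3 rows
tail(x = mtcars, n = 3)   # last 3 rows

nrow(mtcars)  # number of rows
ncol(mtcars)  # number of columns
```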

We can get a quick summary of each variable (column) by using summary(), which is most informative for numeric variables:

summary(surveys)

##    record_id         month             day            year         plot_id     
##  Min.   :    1   Min.   : 1.000   Min.   : 1.0   Min.   :1977   Min.   : 1.00  
##  1st Qu.: 4220   1st Qu.: 3.000   1st Qu.: 9.0   1st Qu.:1981   1st Qu.: 5.00  
##  Median : 8440   Median : 6.000   Median :15.0   Median :1983   Median :11.00  
##  Mean   : 8440   Mean   : 6.382   Mean   :15.6   Mean   :1984   Mean   :11.47  
##  3rd Qu.:12659   3rd Qu.: 9.000   3rd Qu.:23.0   3rd Qu.:1987   3rd Qu.:17.00  
##  Max.   :16878   Max.   :12.000   Max.   :31.0   Max.   :1989   Max.   :24.00  
##                                                                                
##   species_id            sex            hindfoot_length     weight      
##  Length:16878       Length:16878       Min.   : 6.00   Min.   :  4.00  
##  Class :character   Class :character   1st Qu.:21.00   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Median :35.00   Median : 42.00  
##                                        Mean   :31.98   Mean   : 53.22  
##                                        3rd Qu.:37.00   3rd Qu.: 53.00  
##                                        Max.   :70.00   Max.   :278.00  
##                                        NA's   :2733    NA's   :1692    
##     genus             species              taxa            plot_type        
##  Length:16878       Length:16878       Length:16878       Length:16878      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##

Working with Data

surveys_new <- read_csv(file = file_paths, id = "source_file")
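The text shown here does not define file_paths. A minimal sketch, assuming it is a character vector of paths to CSV files, is given below. The temporary folder, the file names part_1.csv and part_2.csv, and their contents are invented so that the example runs on its own; in the workshop the split files sit in the data folder instead.

```r
library(readr)

# Write two small made-up CSV files to a temporary folder so the example
# is self-contained. The file names and values are hypothetical.
tmp_dir <- file.path(tempdir(), "multi-file-demo")
dir.create(tmp_dir, showWarnings = FALSE)
write_csv(data.frame(record_id = 1:2, weight = c(40, 48)),
          file.path(tmp_dir, "part_1.csv"))
write_csv(data.frame(record_id = 3:4, weight = c(10, 8)),
          file.path(tmp_dir, "part_2.csv"))

# Collect the full paths of every .csv file in the folder.
file_paths <- list.files(path = tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read all files into one table. The id argument adds a column, here
# called source_file, recording which file each row came from.
surveys_new <- read_csv(file = file_paths, id = "source_file",
                        show_col_types = FALSE)
surveys_new
```

Reading several files in one call only works when the files share the same columns, which is the case when one dataset has been split into parts.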
