class(surveys)[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
We’ve learned how to create visualisations from the surveys data, but what actually is surveys? R commonly stores tabular data in data.frames, and that is how the surveys data is stored. It’s useful to understand how R thinks about, represents, and stores data in order for us to have a productive working relationship with R.
We can check what surveys is by using the class() function:
This tells us a smaller amount of information than when we used str(), namely that surveys is ultimately a data.frame.
When we’re looking at large or unfamiliar datasets it’s useful to do some quick checks to get an idea of the kind of data we’re working with. For example, we can look at the first and last 6 rows of our data:
# A tibble: 6 × 13
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 4 7 16 1977 7 DM M 36 NA
5 5 7 16 1977 3 DM M 35 NA
6 6 7 16 1977 1 PF M 14 NA
# ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
# A tibble: 6 × 13
record_id month day year plot_id species_id sex hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 16873 12 5 1989 8 DO M 37 51
2 16874 12 5 1989 16 RM F 18 15
3 16875 12 5 1989 5 RM M 17 9
4 16876 12 5 1989 4 DM M 37 31
5 16877 12 5 1989 11 DM M 37 50
6 16878 12 5 1989 8 DM F 37 42
# ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
We can get a useful summary of each numeric variable (column) by using summary():
record_id month day year plot_id
Min. : 1 Min. : 1.000 Min. : 1.0 Min. :1977 Min. : 1.00
1st Qu.: 4220 1st Qu.: 3.000 1st Qu.: 9.0 1st Qu.:1981 1st Qu.: 5.00
Median : 8440 Median : 6.000 Median :15.0 Median :1983 Median :11.00
Mean : 8440 Mean : 6.382 Mean :15.6 Mean :1984 Mean :11.47
3rd Qu.:12659 3rd Qu.: 9.000 3rd Qu.:23.0 3rd Qu.:1987 3rd Qu.:17.00
Max. :16878 Max. :12.000 Max. :31.0 Max. :1989 Max. :24.00
species_id sex hindfoot_length weight
Length:16878 Length:16878 Min. : 6.00 Min. : 4.00
Class :character Class :character 1st Qu.:21.00 1st Qu.: 24.00
Mode :character Mode :character Median :35.00 Median : 42.00
Mean :31.98 Mean : 53.22
3rd Qu.:37.00 3rd Qu.: 53.00
Max. :70.00 Max. :278.00
NA's :2733 NA's :1692
genus species taxa plot_type
Length:16878 Length:16878 Length:16878 Length:16878
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Let’s take another look at the structure of surveys using str(). Because we’ve changed a column to factor R will print a second table showing underlying attributes, but we’ll hide that for now.
spc_tbl_ [16,878 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ record_id : num [1:16878] 1 2 3 4 5 6 7 8 9 10 ...
$ month : num [1:16878] 7 7 7 7 7 7 7 7 7 7 ...
$ day : num [1:16878] 16 16 16 16 16 16 16 16 16 16 ...
$ year : num [1:16878] 1977 1977 1977 1977 1977 ...
$ plot_id : num [1:16878] 2 3 2 7 3 1 2 1 1 6 ...
$ species_id : chr [1:16878] "NL" "NL" "DM" "DM" ...
$ sex : chr [1:16878] "M" "M" "F" "M" ...
$ hindfoot_length: num [1:16878] 32 33 37 36 35 14 NA 37 34 20 ...
$ weight : num [1:16878] NA NA NA NA NA NA NA NA NA NA ...
$ genus : chr [1:16878] "Neotoma" "Neotoma" "Dipodomys" "Dipodomys" ...
$ species : chr [1:16878] "albigula" "albigula" "merriami" "merriami" ...
$ taxa : chr [1:16878] "Rodent" "Rodent" "Rodent" "Rodent" ...
$ plot_type : chr [1:16878] "Control" "Long-term Krat Exclosure" "Control" "Rodent Exclosure" ...
The $ in front of each variable acts as an important operator in R. We can use it to select different columns within a data frame. If we type the name of a data frame followed by a $, RStudio will offer tab completion to select a variable within the data from a list by navigating up and down with the arrow keys and hitting enter, or by clicking on the variable you want.
[1] 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977
[16] 1977 1977 1977 1977 1977
[ reached 'max' / getOption("max.print") -- omitted 16858 entries ]
The year column we just printed is referred to as a vector by R. While data.frames are made up of columns and rows, vectors are 1-dimensional series of values. Each column in a data.frame is actually a vector. Vectors are the basic building blocks of all data in R.
There are 4 main types of vectors (also known as atomic vectors):
“character” for strings of characters, like our genus or sex columns. Each entry in a character vector is wrapped in quotes. In other programming languages, this type of data may be referred to as “strings”.
“integer” for integers (whole numbers). You may sometimes see integers represented like 2L or 20L. The L indicates to R that it is an integer, instead of the next data type, “numeric”.
“numeric”, aka “double”, vectors can contain numbers including decimals. Other languages may refer to these as “float” or “floating point” numbers.
“logical” for TRUE and FALSE, which can also be represented as T and F. In other contexts, these may be referred to as “Boolean” data. Note these values are not quoted as R recognises them as logical values.
Vectors can only be of a single type. Since each column in a data.frame is a vector, this means an accidental character following a number, like ‘29i’, can change the type of the whole vector. If you’re working with messy data it’s useful to check the vector types of each column to ensure they are all what you are expecting, otherwise it could indicate a typo in one of the values.
To create a vector from scratch, we can use the c() function, putting values inside, separated by commas.
The values get printed out in the console. To store this vector so we can continue to work with it, we need to assign it to an object.
We see that num is a numeric vector.
Let’s try making a character vector.
Remember that each entry, like “apple”, needs to be surrounded by quotes, and entries are separated with commas. If you do something like “apple, pear, grape”, you will have only a single entry containing that whole string.
Finally, let’s make a logical vector.
There is an important distinction between 0 and a missing value. A 0 means a 0 was observed, whereas a missing value means there was no observation recorded. R represents missing data as NA, without quotes, in vectors of any type.
Let’s make a numeric vector with an NA value
R doesn’t make assumptions about how you want to handle missing data, so if we pass this vector to a numeric function like min() to get the minimum value, it won’t know what to do, so it returns NA.
This is a very good thing, since we won’t accidentally forget to consider our missing data. If we decide to exclude our missing values, many basic mathematical functions have an argument to remove them.
A common use of vectors is to pass them as arguments in functions. The median() function calculates the median of a a set of values. We also need to set na.rm = TRUE, since there are NA values in the weight column.
Or we could pass this column to hist() to generate a quick histogram of weights, which is useful to check the skew of the data.
Sometimes we want to generate vectors ourselves, and there are a few functions that can help us to do this.
First, putting : between two numbers will generate a vector of integers starting with the first number and ending with the last.
The seq() function allows you to generate similar sequences, but changing by any amount.
Finally, the rep() function allows you to repeat a value, or even a whole vector, as many times as you want, and works with any type of vector.
Let’s talk about factors because they can be challenging to work with.
In general, it’s best to leave your categorical data as a character (or numeric) vector until you need to use a factor. You might need a factor because:
We can tell R we want a specific column to be a factor by passing the column to the factor() function, and reassigning the result back into the data.frame.
Let’s look at the current ordering of the data.
Warning: Removed 1692 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Females are plotted before males.
Next we’ll use some functions from the forcats package, which we’ve already loaded when we loaded the tidyverse package.
We can change the order of the levels (this can be useful to order the levels on plots).
We can confirm this in our plot.
Warning: Removed 1692 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Missing values (NA) are not considered factor levels, so if we want to explicitly have them show up when working with factors we can recode them to a factor level.
[1] "M" "F" "unknown"
We can change the names of the factors too.
[1] "male" "female" "unknown"
We can also convert a factor back to a character vector.