Working With Data

Reading in multiple files

Sometimes data is split over multiple spreadsheet files. For example, the data we’ve unzipped into the data_rawfolder contains both a complete spreadsheet we’ve been working with so far, and also 4 individual spreadsheets which represent parts of that whole spreadsheet (split by collection period).

It’s useful to know how to read multiple spreadsheet files into R at once and combine them together into a single data.frame. Rather than repeating what we did previously 4 times, we can take advantage of how the read_csv() function works to do this in only a couple of steps.

First, let’s remove the current surveys data.frame from R.

rm(surveys)

Next we get the list of files we want to read in by using list.files(). Then we use the id argument from read_csv() to specify that we want to read in multiple files and create a new column in our data called source to state which file the data came from.

files <- list.files(path = "data_raw", pattern = "part", full.names = TRUE)

surveys <- read_csv(files, id = "source")

Rows: 16878 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): species_id, sex, genus, species, taxa, plot_type
dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We can see all 4 files were instantly read in, combined together into a single data.frame, and there is now a new column called source which states which file each row of data came from (which could be useful if the filename has data inside it like a date or collector code).

Why aren’t we using Excel files?

You may have noticed we’re using only .csv files, which are plain text spreadsheet files. We’re doing this for a number of reasons:

Using a plain text data format means the data can be opened, modified, and saved by a wide range of software - even your operating system’s text editor. This increases the chance the data will be useable into the future when proprietary software like Excel may not exist any more.
Using a plain text format also removes any extraneous formatting that may have been entered when using something like Excel. Highlighting, notes in the margin, merged cells, and multiple tables in a single sheet may mean the file cannot be opened by software like R, and are all examples of things that should be handled differently to preserve data integrity (for example keeping metadata and notes ina a README instead).
Working with Excel files is possible in R but it requires installing and loading another package (popular options include readxl/writexl and openxlsx), which increases the dependencies in your code. In order to read multiple files in using these packages it’s necessary to use more complicated methods.

Manipulating data

One of the most important skills for working with data in R is the ability to manipulate, modify, and reshape data. The dplyr and tidyr packages in the tidyverse provide a series of powerful functions for many common data manipulation tasks.

`select()`

Narrows down a data.frame to specific columns. To use the select() function, the first argument is the name of the data.frame, and the rest of the arguments are unquoted names of the columns you want.

select(surveys, plot_id, species_id, hindfoot_length)

# A tibble: 16,878 × 3
   plot_id species_id hindfoot_length
     <dbl> <chr>                <dbl>
 1       2 NL                      32
 2       3 NL                      33
 3       2 DM                      37
 4       7 DM                      36
 5       3 DM                      35
 6       1 PF                      14
 7       2 PE                      NA
 8       1 DM                      37
 9       1 DM                      34
10       6 PF                      20
# ℹ 16,868 more rows

The columns are arranged in the order we specified inside select().

To select all columns except specific columns, put a - in front of the column you want to exclude.

select(surveys, -record_id, -year)

# A tibble: 16,878 × 12
   source      month   day plot_id species_id sex   hindfoot_length weight genus
   <chr>       <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl> <chr>
 1 data_raw/s…     7    16       2 NL         M                  32     NA Neot…
 2 data_raw/s…     7    16       3 NL         M                  33     NA Neot…
 3 data_raw/s…     7    16       2 DM         F                  37     NA Dipo…
 4 data_raw/s…     7    16       7 DM         M                  36     NA Dipo…
 5 data_raw/s…     7    16       3 DM         M                  35     NA Dipo…
 6 data_raw/s…     7    16       1 PF         M                  14     NA Pero…
 7 data_raw/s…     7    16       2 PE         F                  NA     NA Pero…
 8 data_raw/s…     7    16       1 DM         M                  37     NA Dipo…
 9 data_raw/s…     7    16       1 DM         F                  34     NA Dipo…
10 data_raw/s…     7    16       6 PF         F                  20     NA Pero…
# ℹ 16,868 more rows
# ℹ 3 more variables: species <chr>, taxa <chr>, plot_type <chr>

select() also works with numeric vectors for the order of the columns. To select the 3rd, 4th, 5th, and 10th columns, we could run the following code.

select(surveys, c(3:5, 10))

# A tibble: 16,878 × 4
   month   day  year weight
   <dbl> <dbl> <dbl>  <dbl>
 1     7    16  1977     NA
 2     7    16  1977     NA
 3     7    16  1977     NA
 4     7    16  1977     NA
 5     7    16  1977     NA
 6     7    16  1977     NA
 7     7    16  1977     NA
 8     7    16  1977     NA
 9     7    16  1977     NA
10     7    16  1977     NA
# ℹ 16,868 more rows

In general it’s better to be explicit and name the columns because this is clearer. However, it can be useful if you have a data.frame with many columns and you don’t want to type out too many names.

Finally, you can select columns based on whether they match a certain criteria by using the where() function. If we want all numeric columns, we can ask to select all the columns where the class is numeric.

select(surveys, where(is.numeric))

# A tibble: 16,878 × 7
   record_id month   day  year plot_id hindfoot_length weight
       <dbl> <dbl> <dbl> <dbl>   <dbl>           <dbl>  <dbl>
 1         1     7    16  1977       2              32     NA
 2         2     7    16  1977       3              33     NA
 3         3     7    16  1977       2              37     NA
 4         4     7    16  1977       7              36     NA
 5         5     7    16  1977       3              35     NA
 6         6     7    16  1977       1              14     NA
 7         7     7    16  1977       2              NA     NA
 8         8     7    16  1977       1              37     NA
 9         9     7    16  1977       1              34     NA
10        10     7    16  1977       6              20     NA
# ℹ 16,868 more rows

Instead of giving names or positions of columns, we instead pass the where() function with the name of another function inside it, in this case is.numeric(), and we get all the columns for which that function returns TRUE.

We omit the opening and closing brackets from as.numeric() here because we’re passing it as an instruction to be used later by the outer function, rather than running it immediately.

We can use this technique to select any columns that have any NA values in them too.

select(surveys, where(anyNA))

# A tibble: 16,878 × 7
   species_id sex   hindfoot_length weight genus       species  taxa  
   <chr>      <chr>           <dbl>  <dbl> <chr>       <chr>    <chr> 
 1 NL         M                  32     NA Neotoma     albigula Rodent
 2 NL         M                  33     NA Neotoma     albigula Rodent
 3 DM         F                  37     NA Dipodomys   merriami Rodent
 4 DM         M                  36     NA Dipodomys   merriami Rodent
 5 DM         M                  35     NA Dipodomys   merriami Rodent
 6 PF         M                  14     NA Perognathus flavus   Rodent
 7 PE         F                  NA     NA Peromyscus  eremicus Rodent
 8 DM         M                  37     NA Dipodomys   merriami Rodent
 9 DM         F                  34     NA Dipodomys   merriami Rodent
10 PF         F                  20     NA Perognathus flavus   Rodent
# ℹ 16,868 more rows

`filter()`

Narrows down a data.frame to specific rows based on certain critera. To get all the rows where the value of year is equal to 1985, we would run the following.

filter(surveys, year == 1985)

# A tibble: 1,438 × 14
   source   record_id month   day  year plot_id species_id sex   hindfoot_length
   <chr>        <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>
 1 data_ra…      9790     1    19  1985      16 RM         F                  16
 2 data_ra…      9791     1    19  1985      17 OT         F                  20
 3 data_ra…      9792     1    19  1985       6 DO         M                  35
 4 data_ra…      9793     1    19  1985      12 DO         F                  35
 5 data_ra…      9794     1    19  1985      24 RM         M                  16
 6 data_ra…      9795     1    19  1985      12 DO         M                  34
 7 data_ra…      9796     1    19  1985       6 DM         F                  37
 8 data_ra…      9797     1    19  1985      14 DM         M                  36
 9 data_ra…      9798     1    19  1985       6 DM         F                  36
10 data_ra…      9799     1    19  1985      19 RM         M                  16
# ℹ 1,428 more rows
# ℹ 5 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
#   plot_type <chr>

The == sign means “is equal to”. There are several other operators we can use: - > (greater than) - >= (greater than or equal to) - < (less than) - <= (less than or equal to) - != (not equal to)

Another useful operator is %in%, which checks if a specified set of values are present in a vector. For example, we can get rows with specific species_id values.

filter(surveys, species_id %in% c("RM", "DO"))

# A tibble: 2,835 × 14
   source   record_id month   day  year plot_id species_id sex   hindfoot_length
   <chr>        <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>
 1 data_ra…        68     8    19  1977       8 DO         F                  32
 2 data_ra…       292    10    17  1977       3 DO         F                  36
 3 data_ra…       294    10    17  1977       3 DO         F                  37
 4 data_ra…       311    10    17  1977      19 RM         M                  18
 5 data_ra…       317    10    17  1977      17 DO         F                  32
 6 data_ra…       323    10    17  1977      17 DO         F                  33
 7 data_ra…       337    10    18  1977       8 DO         F                  35
 8 data_ra…       356    11    12  1977       1 DO         F                  32
 9 data_ra…       378    11    12  1977       1 DO         M                  33
10 data_ra…       397    11    13  1977      17 RM         F                  16
# ℹ 2,825 more rows
# ℹ 5 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
#   plot_type <chr>

We can also use multiple conditions in one filter() statement. Here we will get rows with a year less than or equal to 1988 and whose hindfoot length values are not NA. The ! before the is.na() function means “not”.

filter(surveys, year <= 1988 & !is.na(hindfoot_length))

# A tibble: 12,779 × 14
   source   record_id month   day  year plot_id species_id sex   hindfoot_length
   <chr>        <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>
 1 data_ra…         1     7    16  1977       2 NL         M                  32
 2 data_ra…         2     7    16  1977       3 NL         M                  33
 3 data_ra…         3     7    16  1977       2 DM         F                  37
 4 data_ra…         4     7    16  1977       7 DM         M                  36
 5 data_ra…         5     7    16  1977       3 DM         M                  35
 6 data_ra…         6     7    16  1977       1 PF         M                  14
 7 data_ra…         8     7    16  1977       1 DM         M                  37
 8 data_ra…         9     7    16  1977       1 DM         F                  34
 9 data_ra…        10     7    16  1977       6 PF         F                  20
10 data_ra…        11     7    16  1977       5 DS         F                  53
# ℹ 12,769 more rows
# ℹ 5 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
#   plot_type <chr>

The pipe `|>`

What happens if we want to both select() and filter() our data? We have a couple of options.

First, we could use nested functions:

filter(select(surveys, -day), month >= 7)

# A tibble: 8,244 × 13
   source  record_id month  year plot_id species_id sex   hindfoot_length weight
   <chr>       <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
 1 data_r…         1     7  1977       2 NL         M                  32     NA
 2 data_r…         2     7  1977       3 NL         M                  33     NA
 3 data_r…         3     7  1977       2 DM         F                  37     NA
 4 data_r…         4     7  1977       7 DM         M                  36     NA
 5 data_r…         5     7  1977       3 DM         M                  35     NA
 6 data_r…         6     7  1977       1 PF         M                  14     NA
 7 data_r…         7     7  1977       2 PE         F                  NA     NA
 8 data_r…         8     7  1977       1 DM         M                  37     NA
 9 data_r…         9     7  1977       1 DM         F                  34     NA
10 data_r…        10     7  1977       6 PF         F                  20     NA
# ℹ 8,234 more rows
# ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>

R will evaluate statements from the inside out. First, select() will operate on the surveys data.frame, removing the column day. The resulting data.frame is then used as the first argument for filter(), which selects rows with a month greater than or equal to 7.

Nested functions can be very difficult to read with only a few functions, and nearly impossible when many functions are done at once.

An alternative approach is to create intermediate objects.

surveys_noday <- select(surveys, -day)
filter(surveys_noday, month >= 7)

# A tibble: 8,244 × 13
   source  record_id month  year plot_id species_id sex   hindfoot_length weight
   <chr>       <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
 1 data_r…         1     7  1977       2 NL         M                  32     NA
 2 data_r…         2     7  1977       3 NL         M                  33     NA
 3 data_r…         3     7  1977       2 DM         F                  37     NA
 4 data_r…         4     7  1977       7 DM         M                  36     NA
 5 data_r…         5     7  1977       3 DM         M                  35     NA
 6 data_r…         6     7  1977       1 PF         M                  14     NA
 7 data_r…         7     7  1977       2 PE         F                  NA     NA
 8 data_r…         8     7  1977       1 DM         M                  37     NA
 9 data_r…         9     7  1977       1 DM         F                  34     NA
10 data_r…        10     7  1977       6 PF         F                  20     NA
# ℹ 8,234 more rows
# ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>

This approach is easier to read, since we can see the steps in order, but after enough steps, we are left with a cluttered mess of intermediate objects, often with confusing names.

An elegant solution to this problem is an operator called the pipe, which looks like |>. You can insert it by using the keyboard shortcut Shift+Cmd+M (Mac) or Shift+Ctrl+M (Windows). Here’s how you could use a pipe to select and filter in one step:

surveys |> 
  select(-day) |>  
  filter(month >= 7)

# A tibble: 8,244 × 13
   source  record_id month  year plot_id species_id sex   hindfoot_length weight
   <chr>       <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
 1 data_r…         1     7  1977       2 NL         M                  32     NA
 2 data_r…         2     7  1977       3 NL         M                  33     NA
 3 data_r…         3     7  1977       2 DM         F                  37     NA
 4 data_r…         4     7  1977       7 DM         M                  36     NA
 5 data_r…         5     7  1977       3 DM         M                  35     NA
 6 data_r…         6     7  1977       1 PF         M                  14     NA
 7 data_r…         7     7  1977       2 PE         F                  NA     NA
 8 data_r…         8     7  1977       1 DM         M                  37     NA
 9 data_r…         9     7  1977       1 DM         F                  34     NA
10 data_r…        10     7  1977       6 PF         F                  20     NA
# ℹ 8,234 more rows
# ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>

The pipe takes the thing on the lefthand side and inserts it as the first argument of the function on the righthand side. By putting each of our functions onto a new line, we can build a nice, readable pipeline.

It can be useful to think of this as a little assembly line for our data. It starts at the top and gets piped into a select() function, and it comes out modified somewhat. It then gets sent into the filter() function, where it is further modified, and then the final product gets printed out to our console. It can also be helpful to think of |> as meaning “and then”. Since many tidyverse functions have verbs for names, a pipeline can be read like a sentence.

If we want to store this final product as an object, we use an assignment arrow at the start.

surveys_sub <- surveys |>  
  select(-day) |> 
  filter(month >= 7)

# Can also be written as 

surveys_sub <- 
  surveys |>  
  select(-day) |> 
  filter(month >= 7)

A good approach is to build a pipeline step by step prior to assignment. You add functions to the pipeline as you go, with the results printing in the console for you to view. Once you’re satisfied with your final result, go back and add the assignment arrow statement at the start. This approach is very interactive, allowing you to see the results of each step as you build the pipeline, and produces nicely readable code.

You may also see the older style pipe in code online: %>%. This was widely used before a native pipe |> was introduced in R 4.1.0 in May 2021. The two pipes are quite similar but there are a few differences we won’t get into here.

New columns with `mutate()`

Creating new columns based on existing columns is a common task during data analysis. For example, we could add a new column that has the weight in kilograms instead of grams.

surveys |> 
  mutate(weight_kg = weight / 1000)

# A tibble: 16,878 × 15
   source   record_id month   day  year plot_id species_id sex   hindfoot_length
   <chr>        <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>
 1 data_ra…         1     7    16  1977       2 NL         M                  32
 2 data_ra…         2     7    16  1977       3 NL         M                  33
 3 data_ra…         3     7    16  1977       2 DM         F                  37
 4 data_ra…         4     7    16  1977       7 DM         M                  36
 5 data_ra…         5     7    16  1977       3 DM         M                  35
 6 data_ra…         6     7    16  1977       1 PF         M                  14
 7 data_ra…         7     7    16  1977       2 PE         F                  NA
 8 data_ra…         8     7    16  1977       1 DM         M                  37
 9 data_ra…         9     7    16  1977       1 DM         F                  34
10 data_ra…        10     7    16  1977       6 PF         F                  20
# ℹ 16,868 more rows
# ℹ 6 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
#   plot_type <chr>, weight_kg <dbl>

You can create multiple columns in one mutate() call, and they will get created in the order you write them. This means you can even reference the first new column in the second new column:

surveys |> 
  mutate(weight_kg = weight / 1000,
         weight_lbs = weight_kg * 2.2)

# A tibble: 16,878 × 16
   source   record_id month   day  year plot_id species_id sex   hindfoot_length
   <chr>        <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>
 1 data_ra…         1     7    16  1977       2 NL         M                  32
 2 data_ra…         2     7    16  1977       3 NL         M                  33
 3 data_ra…         3     7    16  1977       2 DM         F                  37
 4 data_ra…         4     7    16  1977       7 DM         M                  36
 5 data_ra…         5     7    16  1977       3 DM         M                  35
 6 data_ra…         6     7    16  1977       1 PF         M                  14
 7 data_ra…         7     7    16  1977       2 PE         F                  NA
 8 data_ra…         8     7    16  1977       1 DM         M                  37
 9 data_ra…         9     7    16  1977       1 DM         F                  34
10 data_ra…        10     7    16  1977       6 PF         F                  20
# ℹ 16,868 more rows
# ℹ 7 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
#   plot_type <chr>, weight_kg <dbl>, weight_lbs <dbl>

We can also use multiple columns to create a single column. For example, it’s often good practice to keep the components of a date in separate columns until necessary, as in the surveys dataset. This is because programs like Excel automatically convert and change data interpreted as dates, and this can introduce errors. However, now that we are working in R, we can safely put together a date column.

To put the data in multiple columns together we can use the paste() function, which takes the items to paste together and sep, which is the character used to separate the items.

surveys |> 
  mutate(date = paste(year, month, day, sep = "-"))

# A tibble: 16,878 × 15
   source   record_id month   day  year plot_id species_id sex   hindfoot_length
   <chr>        <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>
 1 data_ra…         1     7    16  1977       2 NL         M                  32
 2 data_ra…         2     7    16  1977       3 NL         M                  33
 3 data_ra…         3     7    16  1977       2 DM         F                  37
 4 data_ra…         4     7    16  1977       7 DM         M                  36
 5 data_ra…         5     7    16  1977       3 DM         M                  35
 6 data_ra…         6     7    16  1977       1 PF         M                  14
 7 data_ra…         7     7    16  1977       2 PE         F                  NA
 8 data_ra…         8     7    16  1977       1 DM         M                  37
 9 data_ra…         9     7    16  1977       1 DM         F                  34
10 data_ra…        10     7    16  1977       6 PF         F                  20
# ℹ 16,868 more rows
# ℹ 6 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
#   plot_type <chr>, date <chr>

Since our new column gets moved all the way to the end, it doesn’t end up printing out. We can use the relocate() function to put it after our year column.

surveys |>  
  mutate(date = paste(year, month, day, sep = "-")) |>  
  relocate(date, .after = year)

# A tibble: 16,878 × 15
   source             record_id month   day  year date  plot_id species_id sex  
   <chr>                  <dbl> <dbl> <dbl> <dbl> <chr>   <dbl> <chr>      <chr>
 1 data_raw/surveys_…         1     7    16  1977 1977…       2 NL         M    
 2 data_raw/surveys_…         2     7    16  1977 1977…       3 NL         M    
 3 data_raw/surveys_…         3     7    16  1977 1977…       2 DM         F    
 4 data_raw/surveys_…         4     7    16  1977 1977…       7 DM         M    
 5 data_raw/surveys_…         5     7    16  1977 1977…       3 DM         M    
 6 data_raw/surveys_…         6     7    16  1977 1977…       1 PF         M    
 7 data_raw/surveys_…         7     7    16  1977 1977…       2 PE         F    
 8 data_raw/surveys_…         8     7    16  1977 1977…       1 DM         M    
 9 data_raw/surveys_…         9     7    16  1977 1977…       1 DM         F    
10 data_raw/surveys_…        10     7    16  1977 1977…       6 PF         F    
# ℹ 16,868 more rows
# ℹ 6 more variables: hindfoot_length <dbl>, weight <dbl>, genus <chr>,
#   species <chr>, taxa <chr>, plot_type <chr>

Now we can see that we have a character column that contains our date string. However, R doesn’t yet recognise it as a proper date column. Dates are a particular type of variable with a defined, ordered scale. To turn this column into a proper date, we will use a function from the tidyverse’s lubridate package, which has lots of useful functions for working with dates. The function ymd() will parse a date string that has the order year-month-day.

surveys |> 
  mutate(date = paste(year, month, day, sep = "-"),
         date = ymd(date)) |> 
  relocate(date, .after = year)

# A tibble: 16,878 × 15
   source        record_id month   day  year date       plot_id species_id sex  
   <chr>             <dbl> <dbl> <dbl> <dbl> <date>       <dbl> <chr>      <chr>
 1 data_raw/sur…         1     7    16  1977 1977-07-16       2 NL         M    
 2 data_raw/sur…         2     7    16  1977 1977-07-16       3 NL         M    
 3 data_raw/sur…         3     7    16  1977 1977-07-16       2 DM         F    
 4 data_raw/sur…         4     7    16  1977 1977-07-16       7 DM         M    
 5 data_raw/sur…         5     7    16  1977 1977-07-16       3 DM         M    
 6 data_raw/sur…         6     7    16  1977 1977-07-16       1 PF         M    
 7 data_raw/sur…         7     7    16  1977 1977-07-16       2 PE         F    
 8 data_raw/sur…         8     7    16  1977 1977-07-16       1 DM         M    
 9 data_raw/sur…         9     7    16  1977 1977-07-16       1 DM         F    
10 data_raw/sur…        10     7    16  1977 1977-07-16       6 PF         F    
# ℹ 16,868 more rows
# ℹ 6 more variables: hindfoot_length <dbl>, weight <dbl>, genus <chr>,
#   species <chr>, taxa <chr>, plot_type <chr>

We could also simplify this operation by nesting the functions so that the result of paste() is then provided to ymd().

surveys |> 
  mutate(date = ymd(paste(year, month, day, sep = "-"))) |>  
  relocate(date, .after = year)

# A tibble: 16,878 × 15
   source        record_id month   day  year date       plot_id species_id sex  
   <chr>             <dbl> <dbl> <dbl> <dbl> <date>       <dbl> <chr>      <chr>
 1 data_raw/sur…         1     7    16  1977 1977-07-16       2 NL         M    
 2 data_raw/sur…         2     7    16  1977 1977-07-16       3 NL         M    
 3 data_raw/sur…         3     7    16  1977 1977-07-16       2 DM         F    
 4 data_raw/sur…         4     7    16  1977 1977-07-16       7 DM         M    
 5 data_raw/sur…         5     7    16  1977 1977-07-16       3 DM         M    
 6 data_raw/sur…         6     7    16  1977 1977-07-16       1 PF         M    
 7 data_raw/sur…         7     7    16  1977 1977-07-16       2 PE         F    
 8 data_raw/sur…         8     7    16  1977 1977-07-16       1 DM         M    
 9 data_raw/sur…         9     7    16  1977 1977-07-16       1 DM         F    
10 data_raw/sur…        10     7    16  1977 1977-07-16       6 PF         F    
# ℹ 16,868 more rows
# ℹ 6 more variables: hindfoot_length <dbl>, weight <dbl>, genus <chr>,
#   species <chr>, taxa <chr>, plot_type <chr>

The split-apply-combine approach

Many data analysis tasks can be achieved using the split-apply-combine approach: split the data into groups, apply some analysis to each group, and combine the results in some way. dplyr has a few convenient functions to enable this approach, mainly group_by() and summarise().

group_by() takes a data.frame and the name of one or more columns with categorical values that define the groups. summarize() then collapses each group into a one-row summary of the group, giving you back a data.frame with one row per group. The syntax for summarize() is similar to mutate(), where you define new columns based on values of other columns. Let’s try calculating the mean weight of all our animals by sex.

surveys |> 
  group_by(sex) |> 
  summarise(mean_weight = mean(weight, na.rm = T))

# A tibble: 3 × 2
  sex   mean_weight
  <chr>       <dbl>
1 F            53.1
2 M            53.2
3 <NA>         74.0

You can see that the mean weight for males is slightly higher than for females, but that animals whose sex is unknown have much higher weights. This is probably due to small sample size, but we should check to be sure. Like mutate(), we can define multiple columns in one summarise() call. The function n() will count the number of rows in each group.

surveys |> 
  group_by(sex) |>  
  summarise(mean_weight = mean(weight, na.rm = T),
            n = n())

# A tibble: 3 × 3
  sex   mean_weight     n
  <chr>       <dbl> <int>
1 F            53.1  7318
2 M            53.2  8260
3 <NA>         74.0  1300

You will often want to create groups based on multiple columns. For example, we might be interested in the mean weight of every species + sex combination. All we have to do is add another column to our group_by() call.

surveys |> 
  group_by(species_id, sex) |> 
  summarise(mean_weight = mean(weight, na.rm = T),
            n = n())

`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.

# A tibble: 67 × 4
# Groups:   species_id [36]
   species_id sex   mean_weight     n
   <chr>      <chr>       <dbl> <int>
 1 AB         <NA>        NaN     223
 2 AH         <NA>        NaN     136
 3 BA         M             7       3
 4 CB         <NA>        NaN      23
 5 CM         <NA>        NaN      13
 6 CQ         <NA>        NaN      16
 7 CS         <NA>        NaN       1
 8 CV         <NA>        NaN       1
 9 DM         F            40.7  2522
10 DM         M            44.0  3108
# ℹ 57 more rows

Our resulting data.frame is much larger, since we have a greater number of groups. We also see a strange value showing up in our mean_weight column: NaN. This stands for “Not a Number”, and it often results from trying to do an operation on a vector with zero entries. How can a vector have zero entries? If a particular group (like the AB species ID + NA sex group) has only NA values for weight, then the na.rm = T argument in mean() will remove all the values prior to calculating the mean. The result will be a value of NaN. Since we are not particularly interested in these values, let’s add a step to our pipeline to remove rows where weight is NA before doing any other steps. This means that any groups with only NA values will disappear from our data.frame before we formally create the groups with group_by().

surveys |> 
  filter(!is.na(weight)) |> 
  group_by(species_id, sex) |> 
  summarise(mean_weight = mean(weight),
            n = n())

`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.

# A tibble: 46 × 4
# Groups:   species_id [18]
   species_id sex   mean_weight     n
   <chr>      <chr>       <dbl> <int>
 1 BA         M             7       3
 2 DM         F            40.7  2460
 3 DM         M            44.0  3013
 4 DM         <NA>         37       8
 5 DO         F            48.4   679
 6 DO         M            49.3   748
 7 DO         <NA>         44       1
 8 DS         F           118.   1055
 9 DS         M           123.   1184
10 DS         <NA>        121.     16
# ℹ 36 more rows

That looks better!

It’s often useful to take a look at the results in some order, like the lowest mean weight to highest. We can use the arrange() function for that.

surveys |>  
  filter(!is.na(weight)) |>  
  group_by(species_id, sex) |>  
  summarise(mean_weight = mean(weight),
            n = n()) |>  
  arrange(mean_weight)

`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.

# A tibble: 46 × 4
# Groups:   species_id [18]
   species_id sex   mean_weight     n
   <chr>      <chr>       <dbl> <int>
 1 PF         <NA>         6        2
 2 BA         M            7        3
 3 PF         F            7.09   215
 4 PF         M            7.10   296
 5 RM         M            9.92   678
 6 RM         <NA>        10.4      7
 7 RM         F           10.7    629
 8 RF         M           12.4     16
 9 RF         F           13.7     46
10 PP         <NA>        15        2
# ℹ 36 more rows

If we want to reverse the order, we can wrap the column name in desc():

surveys |>  
  filter(!is.na(weight)) |>  
  group_by(species_id, sex) |>  
  summarise(mean_weight = mean(weight),
            n = n()) |>  
  arrange(desc(mean_weight))

`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.

# A tibble: 46 × 4
# Groups:   species_id [18]
   species_id sex   mean_weight     n
   <chr>      <chr>       <dbl> <int>
 1 NL         M           168.    355
 2 NL         <NA>        164.      9
 3 NL         F           151.    460
 4 SS         M           130       1
 5 DS         M           123.   1184
 6 DS         <NA>        121.     16
 7 DS         F           118.   1055
 8 SH         F            79.2    61
 9 SH         M            67.6    34
10 SF         F            58.3     3
# ℹ 36 more rows

You may have seen several messages saying summarise() has grouped output by 'species_id'. You can override using the .groups argument.

These are warning you that your resulting data.frame has retained some group structure, which means any subsequent operations on that data.frame will happen at the group level. If you look at the resulting data.frame printed out in your console, you will see these lines:

# A tibble: 46 × 4
# Groups:   species_id [18]

They tell us we have a data.frame with 46 rows, 4 columns, and a group variable species_id, for which there are 18 groups.

This can lead to confusing results if you forget about the grouping and want to carry out operations on the whole data.frame, not by group. Therefore, it is a good habit to remove the groups at the end of a pipeline containing group_by():

surveys |> 
  filter(!is.na(weight)) |>  
  group_by(species_id, sex) |> 
  summarize(mean_weight = mean(weight),
            n = n()) |> 
  arrange(desc(mean_weight)) |>  
  ungroup()

`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.

# A tibble: 46 × 4
   species_id sex   mean_weight     n
   <chr>      <chr>       <dbl> <int>
 1 NL         M           168.    355
 2 NL         <NA>        164.      9
 3 NL         F           151.    460
 4 SS         M           130       1
 5 DS         M           123.   1184
 6 DS         <NA>        121.     16
 7 DS         F           118.   1055
 8 SH         F            79.2    61
 9 SH         M            67.6    34
10 SF         F            58.3     3
# ℹ 36 more rows

While it is common that you will want to get the one-row-per-group summary that summarise() provides, there are times where you want to calculate a per-group value but keep all the rows in your data.frame. For example, we might want to know the mean weight for each species ID + sex combination, and then we might want to know how far from that mean value each observation in the group is. For this, we can use group_by() and mutate() together:

surveys |> 
  filter(!is.na(weight)) |> 
  group_by(species_id, sex) |> 
  mutate(mean_weight = mean(weight),
            weight_diff = weight - mean_weight)

# A tibble: 15,186 × 16
# Groups:   species_id, sex [46]
   source   record_id month   day  year plot_id species_id sex   hindfoot_length
   <chr>        <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>
 1 data_ra…        63     8    19  1977       3 DM         M                  35
 2 data_ra…        64     8    19  1977       7 DM         M                  37
 3 data_ra…        65     8    19  1977       4 DM         F                  34
 4 data_ra…        66     8    19  1977       4 DM         F                  35
 5 data_ra…        67     8    19  1977       7 DM         M                  35
 6 data_ra…        68     8    19  1977       8 DO         F                  32
 7 data_ra…        69     8    19  1977       2 PF         M                  15
 8 data_ra…        70     8    19  1977       3 OX         F                  21
 9 data_ra…        71     8    19  1977       7 DM         F                  36
10 data_ra…        74     8    19  1977       8 PF         M                  12
# ℹ 15,176 more rows
# ℹ 7 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
#   plot_type <chr>, mean_weight <dbl>, weight_diff <dbl>

Since we get all our columns back, the new columns are at the very end and don’t print out in the console. Let’s use select() to just look at the columns of interest. Inside select() we can use the contains() function to get any column containing the word “weight” in the name:

surveys |>  
  filter(!is.na(weight)) |>  
  group_by(species_id, sex) |> 
  mutate(mean_weight = mean(weight),
            weight_diff = weight - mean_weight) |> 
  select(species_id, sex, contains("weight"))

# A tibble: 15,186 × 5
# Groups:   species_id, sex [46]
   species_id sex   weight mean_weight weight_diff
   <chr>      <chr>  <dbl>       <dbl>       <dbl>
 1 DM         M         40       44.0      -4.00  
 2 DM         M         48       44.0       4.00  
 3 DM         F         29       40.7     -11.7   
 4 DM         F         46       40.7       5.28  
 5 DM         M         36       44.0      -8.00  
 6 DO         F         52       48.4       3.63  
 7 PF         M          8        7.10      0.902 
 8 OX         F         22       21         1     
 9 DM         F         35       40.7      -5.72  
10 PF         M          7        7.10     -0.0980
# ℹ 15,176 more rows

What happens with the group_by() + mutate() combination is similar to using summarise(): for each group, the mean weight is calculated. However, instead of reporting only one row per group, the mean weight for each group is added to each row in that group. For each row in a group (like DM species ID + male sex), you will see the same value in mean_weight.

Reshaping data

It’s possible to record the same data in multiple formats in a spreadsheet. For example, data on points scored by 3 teams during a series of basketball games could be formatted in a wide format like this:

Team	Points	Assists	Rebounds
A	88	12	22
B	91	17	28
C	99	24	30

Or it could be formatted like this in a longer format:

Team	Variable	Value
A	Points	88
A	Assists	12
A	Rebounds	22
B	Points	91
B	Assists	17
B	Rebounds	28
C	Points	99
C	Assists	24
C	Rebounds	30

Wide format data does not contain any repeated values in first column, whereas longer format data does. Many datasets in the real world are recorded in a wide format because people generally find it easier to interpret and more intuitive to record it that way.

In general, when visualising multiple variables in a plot you need data in long format in order for the software to create the plot.

Back to our data. Let’s say we are interested in comparing the mean weights of each species across our different plot_types. We can begin this process using the group_by() + summarise() approach:

sp_by_plot <- surveys |>  
  filter(!is.na(weight)) |>  
  group_by(species_id, plot_id) |>  
  summarise(mean_weight = mean(weight)) |> 
  arrange(species_id, plot_id)

`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.

sp_by_plot

# A tibble: 300 × 3
# Groups:   species_id [18]
   species_id plot_id mean_weight
   <chr>        <dbl>       <dbl>
 1 BA               3         8  
 2 BA              21         6.5
 3 DM               1        42.7
 4 DM               2        42.6
 5 DM               3        41.2
 6 DM               4        41.9
 7 DM               5        42.6
 8 DM               6        42.1
 9 DM               7        43.2
10 DM               8        43.4
# ℹ 290 more rows

It’s a bit difficult to compare values across plots when the data is in long format.

The tidyr package in the tidyverse has a pair of functions that allow you to reshape data by pivoting it: pivot_wider() and pivot_longer().

pivot_wider() will make the data wider, which means increasing the number of columns and reducing the number of rows.

pivot_longer() will do the opposite, reducing the number of columns and increasing the number of rows.

In this case, it might be nice to create a data.frame where each species has its own row, and each plot has its own column containing the mean weight for a given species. We will use pivot_wider() to reshape our data in this way. It takes 3 arguments:

the name of the data.frame
names_from: which column should be used to generate the names of the new columns?
values_from: which column should be used to fill in the values of the new columns?

In our case, we want the new columns to be named from our plot_id column, with the values coming from the mean_weight column. We can pipe our data.frame right into pivot_wider() and add those two arguments.

Any columns not used for names_from or values_from will not be pivoted.

sp_by_plot_wide <- sp_by_plot |>  
  pivot_wider(names_from = plot_id, 
              values_from = mean_weight)

sp_by_plot_wide

# A tibble: 18 × 25
# Groups:   species_id [18]
   species_id    `3`   `21`    `1`    `2`    `4`   `5`    `6`   `7`    `8`
   <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>
 1 BA           8      6.5   NA     NA     NA     NA    NA     NA    NA   
 2 DM          41.2   41.5   42.7   42.6   41.9   42.6  42.1   43.2  43.4 
 3 DO          42.7   NA     50.1   50.3   46.8   50.4  49.0   52    49.2 
 4 DS         128.    NA    129.   125.   118.   111.  114.   126.  128.  
 5 NL         171.   136.   154.   171.   164.   192.  176.   170.  134.  
 6 OL          32.1   28.6   35.5   34     33.0   32.6  31.8   NA    30.3 
 7 OT          24.1   24.1   23.7   24.9   26.5   23.6  23.5   22    24.1 
 8 OX          22     NA     NA     22     NA     20    NA     NA    NA   
 9 PE          22.7   19.6   21.6   22.0   NA     21    21.6   22.8  19.4 
10 PF           7.12   7.23   6.57   6.89   6.75   7.5   7.54   7     6.78
11 PH          28     31     NA     NA     NA     29    NA     NA    NA   
12 PM          20.1   23.6   23.7   23.9   NA     23.7  22.3   23.4  23   
13 PP          17.1   13.6   14.3   16.4   14.8   19.8  16.8   NA    13.9 
14 RF          14.8   17     NA     16     NA     14    12.1   13    NA   
15 RM          10.3    9.89  10.9   10.6   10.4   10.8  10.6   10.7   9   
16 SF          NA     49     NA     NA     NA     NA    NA     NA    NA   
17 SH          76.0   79.9   NA     88     NA     82.7  NA     NA    NA   
18 SS          NA     NA     NA     NA     NA     NA    NA     NA    NA   
# ℹ 15 more variables: `9` <dbl>, `10` <dbl>, `11` <dbl>, `12` <dbl>,
#   `13` <dbl>, `14` <dbl>, `15` <dbl>, `16` <dbl>, `17` <dbl>, `18` <dbl>,
#   `19` <dbl>, `20` <dbl>, `22` <dbl>, `23` <dbl>, `24` <dbl>

We have a new column for each plot_id value. There is one old column left in the data.frame: species_id. It wasn’t used in pivot_wider(), so it stays, and now contains a single entry for each unique species_id value.

Finally, a lot of NAs have appeared. Some species aren’t found in every plot, but because a data.frame has to have a value in every row and every column, an NA is inserted.

We can now pivot this table to a long format with pivot_longer(). It takes 3 arguments:

cols for the columns you want to pivot
names_to for the name of the new column which will contain the old column names
values_to for the name of the new column which will contain the old values

We want to pivot all the columns except species_id, and we will use PLOT for the new column of plot IDs, and MEAN_WT for the new column of mean weight values.

sp_by_plot_wide |>  
  pivot_longer(cols = -species_id, names_to = "PLOT", values_to = "MEAN_WT")

# A tibble: 432 × 3
# Groups:   species_id [18]
   species_id PLOT  MEAN_WT
   <chr>      <chr>   <dbl>
 1 BA         3         8  
 2 BA         21        6.5
 3 BA         1        NA  
 4 BA         2        NA  
 5 BA         4        NA  
 6 BA         5        NA  
 7 BA         6        NA  
 8 BA         7        NA  
 9 BA         8        NA  
10 BA         9        NA  
# ℹ 422 more rows

One thing you will notice is that all those NA values that got generated when we pivoted wider. However, we can filter those out, which gets us back to the same data as sp_by_plot, before we pivoted it wider.

sp_by_plot_wide |> 
  pivot_longer(cols = -species_id, names_to = "PLOT", values_to = "MEAN_WT") |> 
  filter(!is.na(MEAN_WT))

# A tibble: 300 × 3
# Groups:   species_id [18]
   species_id PLOT  MEAN_WT
   <chr>      <chr>   <dbl>
 1 BA         3         8  
 2 BA         21        6.5
 3 DM         3        41.2
 4 DM         21       41.5
 5 DM         1        42.7
 6 DM         2        42.6
 7 DM         4        41.9
 8 DM         5        42.6
 9 DM         6        42.1
10 DM         7        43.2
# ℹ 290 more rows

Exporting data

Let’s say we want to export the wide version of our table so we can submit it with a manuscript to include in a publication.

First, we might want to modify the names of the columns, since right now they are bare numbers, which aren’t very informative. Luckily, pivot_wider() has an argument names_prefix which will allow us to add “plot_” to the start of each column.

sp_by_plot <-   sp_by_plot |> 
  pivot_wider(names_from = plot_id, values_from = mean_weight,
              names_prefix = "plot_")

Now we can save this data.frame to a CSV using the write_csv() function from the readr package. The first argument is the name of the data.frame, and the second is the path to the new file we want to create, including the file extension .csv.

write_csv(sp_by_plot, "figures/table_survey_sp.csv")

Reading in multiple files

Manipulating data

select()

filter()

The pipe |>