rm(surveys)Working With Data
Reading in multiple files
Sometimes data is split over multiple spreadsheet files. For example, the data we’ve unzipped into the data_rawfolder contains both a complete spreadsheet we’ve been working with so far, and also 4 individual spreadsheets which represent parts of that whole spreadsheet (split by collection period).
It’s useful to know how to read multiple spreadsheet files into R at once and combine them together into a single data.frame. Rather than repeating what we did previously 4 times, we can take advantage of how the read_csv() function works to do this in only a couple of steps.
First, let’s remove the current surveys data.frame from R.
Next we get the list of files we want to read in by using list.files(). Then we use the id argument from read_csv() to specify that we want to read in multiple files and create a new column in our data called source to state which file the data came from.
files <- list.files(path = "data_raw", pattern = "part", full.names = TRUE)
surveys <- read_csv(files, id = "source")Rows: 16878 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): species_id, sex, genus, species, taxa, plot_type
dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can see all 4 files were instantly read in, combined together into a single data.frame, and there is now a new column called source which states which file each row of data came from (which could be useful if the filename has data inside it like a date or collector code).
Why aren’t we using Excel files?
You may have noticed we’re using only .csv files, which are plain text spreadsheet files. We’re doing this for a number of reasons:
- Using a plain text data format means the data can be opened, modified, and saved by a wide range of software - even your operating system’s text editor. This increases the chance the data will be useable into the future when proprietary software like Excel may not exist any more.
- Using a plain text format also removes any extraneous formatting that may have been entered when using something like Excel. Highlighting, notes in the margin, merged cells, and multiple tables in a single sheet may mean the file cannot be opened by software like R, and are all examples of things that should be handled differently to preserve data integrity (for example keeping metadata and notes ina a README instead).
- Working with Excel files is possible in R but it requires installing and loading another package (popular options include readxl/writexl and openxlsx), which increases the dependencies in your code. In order to read multiple files in using these packages it’s necessary to use more complicated methods.
Manipulating data
One of the most important skills for working with data in R is the ability to manipulate, modify, and reshape data. The dplyr and tidyr packages in the tidyverse provide a series of powerful functions for many common data manipulation tasks.
select()
Narrows down a data.frame to specific columns. To use the select() function, the first argument is the name of the data.frame, and the rest of the arguments are unquoted names of the columns you want.
select(surveys, plot_id, species_id, hindfoot_length)# A tibble: 16,878 × 3
plot_id species_id hindfoot_length
<dbl> <chr> <dbl>
1 2 NL 32
2 3 NL 33
3 2 DM 37
4 7 DM 36
5 3 DM 35
6 1 PF 14
7 2 PE NA
8 1 DM 37
9 1 DM 34
10 6 PF 20
# ℹ 16,868 more rows
The columns are arranged in the order we specified inside select().
To select all columns except specific columns, put a - in front of the column you want to exclude.
select(surveys, -record_id, -year)# A tibble: 16,878 × 12
source month day plot_id species_id sex hindfoot_length weight genus
<chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 data_raw/s… 7 16 2 NL M 32 NA Neot…
2 data_raw/s… 7 16 3 NL M 33 NA Neot…
3 data_raw/s… 7 16 2 DM F 37 NA Dipo…
4 data_raw/s… 7 16 7 DM M 36 NA Dipo…
5 data_raw/s… 7 16 3 DM M 35 NA Dipo…
6 data_raw/s… 7 16 1 PF M 14 NA Pero…
7 data_raw/s… 7 16 2 PE F NA NA Pero…
8 data_raw/s… 7 16 1 DM M 37 NA Dipo…
9 data_raw/s… 7 16 1 DM F 34 NA Dipo…
10 data_raw/s… 7 16 6 PF F 20 NA Pero…
# ℹ 16,868 more rows
# ℹ 3 more variables: species <chr>, taxa <chr>, plot_type <chr>
select() also works with numeric vectors for the order of the columns. To select the 3rd, 4th, 5th, and 10th columns, we could run the following code.
select(surveys, c(3:5, 10))# A tibble: 16,878 × 4
month day year weight
<dbl> <dbl> <dbl> <dbl>
1 7 16 1977 NA
2 7 16 1977 NA
3 7 16 1977 NA
4 7 16 1977 NA
5 7 16 1977 NA
6 7 16 1977 NA
7 7 16 1977 NA
8 7 16 1977 NA
9 7 16 1977 NA
10 7 16 1977 NA
# ℹ 16,868 more rows
In general it’s better to be explicit and name the columns because this is clearer. However, it can be useful if you have a data.frame with many columns and you don’t want to type out too many names.
Finally, you can select columns based on whether they match a certain criteria by using the where() function. If we want all numeric columns, we can ask to select all the columns where the class is numeric.
select(surveys, where(is.numeric))# A tibble: 16,878 × 7
record_id month day year plot_id hindfoot_length weight
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 7 16 1977 2 32 NA
2 2 7 16 1977 3 33 NA
3 3 7 16 1977 2 37 NA
4 4 7 16 1977 7 36 NA
5 5 7 16 1977 3 35 NA
6 6 7 16 1977 1 14 NA
7 7 7 16 1977 2 NA NA
8 8 7 16 1977 1 37 NA
9 9 7 16 1977 1 34 NA
10 10 7 16 1977 6 20 NA
# ℹ 16,868 more rows
Instead of giving names or positions of columns, we instead pass the where() function with the name of another function inside it, in this case is.numeric(), and we get all the columns for which that function returns TRUE.
We omit the opening and closing brackets from as.numeric() here because we’re passing it as an instruction to be used later by the outer function, rather than running it immediately.
We can use this technique to select any columns that have any NA values in them too.
select(surveys, where(anyNA))# A tibble: 16,878 × 7
species_id sex hindfoot_length weight genus species taxa
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 NL M 32 NA Neotoma albigula Rodent
2 NL M 33 NA Neotoma albigula Rodent
3 DM F 37 NA Dipodomys merriami Rodent
4 DM M 36 NA Dipodomys merriami Rodent
5 DM M 35 NA Dipodomys merriami Rodent
6 PF M 14 NA Perognathus flavus Rodent
7 PE F NA NA Peromyscus eremicus Rodent
8 DM M 37 NA Dipodomys merriami Rodent
9 DM F 34 NA Dipodomys merriami Rodent
10 PF F 20 NA Perognathus flavus Rodent
# ℹ 16,868 more rows
filter()
Narrows down a data.frame to specific rows based on certain critera. To get all the rows where the value of year is equal to 1985, we would run the following.
filter(surveys, year == 1985)# A tibble: 1,438 × 14
source record_id month day year plot_id species_id sex hindfoot_length
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 data_ra… 9790 1 19 1985 16 RM F 16
2 data_ra… 9791 1 19 1985 17 OT F 20
3 data_ra… 9792 1 19 1985 6 DO M 35
4 data_ra… 9793 1 19 1985 12 DO F 35
5 data_ra… 9794 1 19 1985 24 RM M 16
6 data_ra… 9795 1 19 1985 12 DO M 34
7 data_ra… 9796 1 19 1985 6 DM F 37
8 data_ra… 9797 1 19 1985 14 DM M 36
9 data_ra… 9798 1 19 1985 6 DM F 36
10 data_ra… 9799 1 19 1985 19 RM M 16
# ℹ 1,428 more rows
# ℹ 5 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
# plot_type <chr>
The == sign means “is equal to”. There are several other operators we can use: - > (greater than) - >= (greater than or equal to) - < (less than) - <= (less than or equal to) - != (not equal to)
Another useful operator is %in%, which checks if a specified set of values are present in a vector. For example, we can get rows with specific species_id values.
filter(surveys, species_id %in% c("RM", "DO"))# A tibble: 2,835 × 14
source record_id month day year plot_id species_id sex hindfoot_length
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 data_ra… 68 8 19 1977 8 DO F 32
2 data_ra… 292 10 17 1977 3 DO F 36
3 data_ra… 294 10 17 1977 3 DO F 37
4 data_ra… 311 10 17 1977 19 RM M 18
5 data_ra… 317 10 17 1977 17 DO F 32
6 data_ra… 323 10 17 1977 17 DO F 33
7 data_ra… 337 10 18 1977 8 DO F 35
8 data_ra… 356 11 12 1977 1 DO F 32
9 data_ra… 378 11 12 1977 1 DO M 33
10 data_ra… 397 11 13 1977 17 RM F 16
# ℹ 2,825 more rows
# ℹ 5 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
# plot_type <chr>
We can also use multiple conditions in one filter() statement. Here we will get rows with a year less than or equal to 1988 and whose hindfoot length values are not NA. The ! before the is.na() function means “not”.
filter(surveys, year <= 1988 & !is.na(hindfoot_length))# A tibble: 12,779 × 14
source record_id month day year plot_id species_id sex hindfoot_length
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 data_ra… 1 7 16 1977 2 NL M 32
2 data_ra… 2 7 16 1977 3 NL M 33
3 data_ra… 3 7 16 1977 2 DM F 37
4 data_ra… 4 7 16 1977 7 DM M 36
5 data_ra… 5 7 16 1977 3 DM M 35
6 data_ra… 6 7 16 1977 1 PF M 14
7 data_ra… 8 7 16 1977 1 DM M 37
8 data_ra… 9 7 16 1977 1 DM F 34
9 data_ra… 10 7 16 1977 6 PF F 20
10 data_ra… 11 7 16 1977 5 DS F 53
# ℹ 12,769 more rows
# ℹ 5 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
# plot_type <chr>
The pipe |>
What happens if we want to both select() and filter() our data? We have a couple of options.
First, we could use nested functions:
filter(select(surveys, -day), month >= 7)# A tibble: 8,244 × 13
source record_id month year plot_id species_id sex hindfoot_length weight
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 data_r… 1 7 1977 2 NL M 32 NA
2 data_r… 2 7 1977 3 NL M 33 NA
3 data_r… 3 7 1977 2 DM F 37 NA
4 data_r… 4 7 1977 7 DM M 36 NA
5 data_r… 5 7 1977 3 DM M 35 NA
6 data_r… 6 7 1977 1 PF M 14 NA
7 data_r… 7 7 1977 2 PE F NA NA
8 data_r… 8 7 1977 1 DM M 37 NA
9 data_r… 9 7 1977 1 DM F 34 NA
10 data_r… 10 7 1977 6 PF F 20 NA
# ℹ 8,234 more rows
# ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
R will evaluate statements from the inside out. First, select() will operate on the surveys data.frame, removing the column day. The resulting data.frame is then used as the first argument for filter(), which selects rows with a month greater than or equal to 7.
Nested functions can be very difficult to read with only a few functions, and nearly impossible when many functions are done at once.
An alternative approach is to create intermediate objects.
surveys_noday <- select(surveys, -day)
filter(surveys_noday, month >= 7)# A tibble: 8,244 × 13
source record_id month year plot_id species_id sex hindfoot_length weight
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 data_r… 1 7 1977 2 NL M 32 NA
2 data_r… 2 7 1977 3 NL M 33 NA
3 data_r… 3 7 1977 2 DM F 37 NA
4 data_r… 4 7 1977 7 DM M 36 NA
5 data_r… 5 7 1977 3 DM M 35 NA
6 data_r… 6 7 1977 1 PF M 14 NA
7 data_r… 7 7 1977 2 PE F NA NA
8 data_r… 8 7 1977 1 DM M 37 NA
9 data_r… 9 7 1977 1 DM F 34 NA
10 data_r… 10 7 1977 6 PF F 20 NA
# ℹ 8,234 more rows
# ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
This approach is easier to read, since we can see the steps in order, but after enough steps, we are left with a cluttered mess of intermediate objects, often with confusing names.
An elegant solution to this problem is an operator called the pipe, which looks like |>. You can insert it by using the keyboard shortcut Shift+Cmd+M (Mac) or Shift+Ctrl+M (Windows). Here’s how you could use a pipe to select and filter in one step:
surveys |>
select(-day) |>
filter(month >= 7)# A tibble: 8,244 × 13
source record_id month year plot_id species_id sex hindfoot_length weight
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 data_r… 1 7 1977 2 NL M 32 NA
2 data_r… 2 7 1977 3 NL M 33 NA
3 data_r… 3 7 1977 2 DM F 37 NA
4 data_r… 4 7 1977 7 DM M 36 NA
5 data_r… 5 7 1977 3 DM M 35 NA
6 data_r… 6 7 1977 1 PF M 14 NA
7 data_r… 7 7 1977 2 PE F NA NA
8 data_r… 8 7 1977 1 DM M 37 NA
9 data_r… 9 7 1977 1 DM F 34 NA
10 data_r… 10 7 1977 6 PF F 20 NA
# ℹ 8,234 more rows
# ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
The pipe takes the thing on the lefthand side and inserts it as the first argument of the function on the righthand side. By putting each of our functions onto a new line, we can build a nice, readable pipeline.
It can be useful to think of this as a little assembly line for our data. It starts at the top and gets piped into a select() function, and it comes out modified somewhat. It then gets sent into the filter() function, where it is further modified, and then the final product gets printed out to our console. It can also be helpful to think of |> as meaning “and then”. Since many tidyverse functions have verbs for names, a pipeline can be read like a sentence.
If we want to store this final product as an object, we use an assignment arrow at the start.
surveys_sub <- surveys |>
select(-day) |>
filter(month >= 7)
# Can also be written as
surveys_sub <-
surveys |>
select(-day) |>
filter(month >= 7)A good approach is to build a pipeline step by step prior to assignment. You add functions to the pipeline as you go, with the results printing in the console for you to view. Once you’re satisfied with your final result, go back and add the assignment arrow statement at the start. This approach is very interactive, allowing you to see the results of each step as you build the pipeline, and produces nicely readable code.
You may also see the older style pipe in code online: %>%. This was widely used before a native pipe |> was introduced in R 4.1.0 in May 2021. The two pipes are quite similar but there are a few differences we won’t get into here.
New columns with mutate()
Creating new columns based on existing columns is a common task during data analysis. For example, we could add a new column that has the weight in kilograms instead of grams.
surveys |>
mutate(weight_kg = weight / 1000)# A tibble: 16,878 × 15
source record_id month day year plot_id species_id sex hindfoot_length
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 data_ra… 1 7 16 1977 2 NL M 32
2 data_ra… 2 7 16 1977 3 NL M 33
3 data_ra… 3 7 16 1977 2 DM F 37
4 data_ra… 4 7 16 1977 7 DM M 36
5 data_ra… 5 7 16 1977 3 DM M 35
6 data_ra… 6 7 16 1977 1 PF M 14
7 data_ra… 7 7 16 1977 2 PE F NA
8 data_ra… 8 7 16 1977 1 DM M 37
9 data_ra… 9 7 16 1977 1 DM F 34
10 data_ra… 10 7 16 1977 6 PF F 20
# ℹ 16,868 more rows
# ℹ 6 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
# plot_type <chr>, weight_kg <dbl>
You can create multiple columns in one mutate() call, and they will get created in the order you write them. This means you can even reference the first new column in the second new column:
surveys |>
mutate(weight_kg = weight / 1000,
weight_lbs = weight_kg * 2.2)# A tibble: 16,878 × 16
source record_id month day year plot_id species_id sex hindfoot_length
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 data_ra… 1 7 16 1977 2 NL M 32
2 data_ra… 2 7 16 1977 3 NL M 33
3 data_ra… 3 7 16 1977 2 DM F 37
4 data_ra… 4 7 16 1977 7 DM M 36
5 data_ra… 5 7 16 1977 3 DM M 35
6 data_ra… 6 7 16 1977 1 PF M 14
7 data_ra… 7 7 16 1977 2 PE F NA
8 data_ra… 8 7 16 1977 1 DM M 37
9 data_ra… 9 7 16 1977 1 DM F 34
10 data_ra… 10 7 16 1977 6 PF F 20
# ℹ 16,868 more rows
# ℹ 7 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
# plot_type <chr>, weight_kg <dbl>, weight_lbs <dbl>
We can also use multiple columns to create a single column. For example, it’s often good practice to keep the components of a date in separate columns until necessary, as in the surveys dataset. This is because programs like Excel automatically convert and change data interpreted as dates, and this can introduce errors. However, now that we are working in R, we can safely put together a date column.
To put the data in multiple columns together we can use the paste() function, which takes the items to paste together and sep, which is the character used to separate the items.
surveys |>
mutate(date = paste(year, month, day, sep = "-"))# A tibble: 16,878 × 15
source record_id month day year plot_id species_id sex hindfoot_length
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 data_ra… 1 7 16 1977 2 NL M 32
2 data_ra… 2 7 16 1977 3 NL M 33
3 data_ra… 3 7 16 1977 2 DM F 37
4 data_ra… 4 7 16 1977 7 DM M 36
5 data_ra… 5 7 16 1977 3 DM M 35
6 data_ra… 6 7 16 1977 1 PF M 14
7 data_ra… 7 7 16 1977 2 PE F NA
8 data_ra… 8 7 16 1977 1 DM M 37
9 data_ra… 9 7 16 1977 1 DM F 34
10 data_ra… 10 7 16 1977 6 PF F 20
# ℹ 16,868 more rows
# ℹ 6 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
# plot_type <chr>, date <chr>
Since our new column gets moved all the way to the end, it doesn’t end up printing out. We can use the relocate() function to put it after our year column.
surveys |>
mutate(date = paste(year, month, day, sep = "-")) |>
relocate(date, .after = year)# A tibble: 16,878 × 15
source record_id month day year date plot_id species_id sex
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr>
1 data_raw/surveys_… 1 7 16 1977 1977… 2 NL M
2 data_raw/surveys_… 2 7 16 1977 1977… 3 NL M
3 data_raw/surveys_… 3 7 16 1977 1977… 2 DM F
4 data_raw/surveys_… 4 7 16 1977 1977… 7 DM M
5 data_raw/surveys_… 5 7 16 1977 1977… 3 DM M
6 data_raw/surveys_… 6 7 16 1977 1977… 1 PF M
7 data_raw/surveys_… 7 7 16 1977 1977… 2 PE F
8 data_raw/surveys_… 8 7 16 1977 1977… 1 DM M
9 data_raw/surveys_… 9 7 16 1977 1977… 1 DM F
10 data_raw/surveys_… 10 7 16 1977 1977… 6 PF F
# ℹ 16,868 more rows
# ℹ 6 more variables: hindfoot_length <dbl>, weight <dbl>, genus <chr>,
# species <chr>, taxa <chr>, plot_type <chr>
Now we can see that we have a character column that contains our date string. However, R doesn’t yet recognise it as a proper date column. Dates are a particular type of variable with a defined, ordered scale. To turn this column into a proper date, we will use a function from the tidyverse’s lubridate package, which has lots of useful functions for working with dates. The function ymd() will parse a date string that has the order year-month-day.
surveys |>
mutate(date = paste(year, month, day, sep = "-"),
date = ymd(date)) |>
relocate(date, .after = year)# A tibble: 16,878 × 15
source record_id month day year date plot_id species_id sex
<chr> <dbl> <dbl> <dbl> <dbl> <date> <dbl> <chr> <chr>
1 data_raw/sur… 1 7 16 1977 1977-07-16 2 NL M
2 data_raw/sur… 2 7 16 1977 1977-07-16 3 NL M
3 data_raw/sur… 3 7 16 1977 1977-07-16 2 DM F
4 data_raw/sur… 4 7 16 1977 1977-07-16 7 DM M
5 data_raw/sur… 5 7 16 1977 1977-07-16 3 DM M
6 data_raw/sur… 6 7 16 1977 1977-07-16 1 PF M
7 data_raw/sur… 7 7 16 1977 1977-07-16 2 PE F
8 data_raw/sur… 8 7 16 1977 1977-07-16 1 DM M
9 data_raw/sur… 9 7 16 1977 1977-07-16 1 DM F
10 data_raw/sur… 10 7 16 1977 1977-07-16 6 PF F
# ℹ 16,868 more rows
# ℹ 6 more variables: hindfoot_length <dbl>, weight <dbl>, genus <chr>,
# species <chr>, taxa <chr>, plot_type <chr>
We could also simplify this operation by nesting the functions so that the result of paste() is then provided to ymd().
surveys |>
mutate(date = ymd(paste(year, month, day, sep = "-"))) |>
relocate(date, .after = year)# A tibble: 16,878 × 15
source record_id month day year date plot_id species_id sex
<chr> <dbl> <dbl> <dbl> <dbl> <date> <dbl> <chr> <chr>
1 data_raw/sur… 1 7 16 1977 1977-07-16 2 NL M
2 data_raw/sur… 2 7 16 1977 1977-07-16 3 NL M
3 data_raw/sur… 3 7 16 1977 1977-07-16 2 DM F
4 data_raw/sur… 4 7 16 1977 1977-07-16 7 DM M
5 data_raw/sur… 5 7 16 1977 1977-07-16 3 DM M
6 data_raw/sur… 6 7 16 1977 1977-07-16 1 PF M
7 data_raw/sur… 7 7 16 1977 1977-07-16 2 PE F
8 data_raw/sur… 8 7 16 1977 1977-07-16 1 DM M
9 data_raw/sur… 9 7 16 1977 1977-07-16 1 DM F
10 data_raw/sur… 10 7 16 1977 1977-07-16 6 PF F
# ℹ 16,868 more rows
# ℹ 6 more variables: hindfoot_length <dbl>, weight <dbl>, genus <chr>,
# species <chr>, taxa <chr>, plot_type <chr>
The split-apply-combine approach
Many data analysis tasks can be achieved using the split-apply-combine approach: split the data into groups, apply some analysis to each group, and combine the results in some way. dplyr has a few convenient functions to enable this approach, mainly group_by() and summarise().
group_by() takes a data.frame and the name of one or more columns with categorical values that define the groups. summarize() then collapses each group into a one-row summary of the group, giving you back a data.frame with one row per group. The syntax for summarize() is similar to mutate(), where you define new columns based on values of other columns. Let’s try calculating the mean weight of all our animals by sex.
surveys |>
group_by(sex) |>
summarise(mean_weight = mean(weight, na.rm = T))# A tibble: 3 × 2
sex mean_weight
<chr> <dbl>
1 F 53.1
2 M 53.2
3 <NA> 74.0
You can see that the mean weight for males is slightly higher than for females, but that animals whose sex is unknown have much higher weights. This is probably due to small sample size, but we should check to be sure. Like mutate(), we can define multiple columns in one summarise() call. The function n() will count the number of rows in each group.
surveys |>
group_by(sex) |>
summarise(mean_weight = mean(weight, na.rm = T),
n = n())# A tibble: 3 × 3
sex mean_weight n
<chr> <dbl> <int>
1 F 53.1 7318
2 M 53.2 8260
3 <NA> 74.0 1300
You will often want to create groups based on multiple columns. For example, we might be interested in the mean weight of every species + sex combination. All we have to do is add another column to our group_by() call.
surveys |>
group_by(species_id, sex) |>
summarise(mean_weight = mean(weight, na.rm = T),
n = n())`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.
# A tibble: 67 × 4
# Groups: species_id [36]
species_id sex mean_weight n
<chr> <chr> <dbl> <int>
1 AB <NA> NaN 223
2 AH <NA> NaN 136
3 BA M 7 3
4 CB <NA> NaN 23
5 CM <NA> NaN 13
6 CQ <NA> NaN 16
7 CS <NA> NaN 1
8 CV <NA> NaN 1
9 DM F 40.7 2522
10 DM M 44.0 3108
# ℹ 57 more rows
Our resulting data.frame is much larger, since we have a greater number of groups. We also see a strange value showing up in our mean_weight column: NaN. This stands for “Not a Number”, and it often results from trying to do an operation on a vector with zero entries. How can a vector have zero entries? If a particular group (like the AB species ID + NA sex group) has only NA values for weight, then the na.rm = T argument in mean() will remove all the values prior to calculating the mean. The result will be a value of NaN. Since we are not particularly interested in these values, let’s add a step to our pipeline to remove rows where weight is NA before doing any other steps. This means that any groups with only NA values will disappear from our data.frame before we formally create the groups with group_by().
surveys |>
filter(!is.na(weight)) |>
group_by(species_id, sex) |>
summarise(mean_weight = mean(weight),
n = n())`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.
# A tibble: 46 × 4
# Groups: species_id [18]
species_id sex mean_weight n
<chr> <chr> <dbl> <int>
1 BA M 7 3
2 DM F 40.7 2460
3 DM M 44.0 3013
4 DM <NA> 37 8
5 DO F 48.4 679
6 DO M 49.3 748
7 DO <NA> 44 1
8 DS F 118. 1055
9 DS M 123. 1184
10 DS <NA> 121. 16
# ℹ 36 more rows
That looks better!
It’s often useful to take a look at the results in some order, like the lowest mean weight to highest. We can use the arrange() function for that.
surveys |>
filter(!is.na(weight)) |>
group_by(species_id, sex) |>
summarise(mean_weight = mean(weight),
n = n()) |>
arrange(mean_weight)`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.
# A tibble: 46 × 4
# Groups: species_id [18]
species_id sex mean_weight n
<chr> <chr> <dbl> <int>
1 PF <NA> 6 2
2 BA M 7 3
3 PF F 7.09 215
4 PF M 7.10 296
5 RM M 9.92 678
6 RM <NA> 10.4 7
7 RM F 10.7 629
8 RF M 12.4 16
9 RF F 13.7 46
10 PP <NA> 15 2
# ℹ 36 more rows
If we want to reverse the order, we can wrap the column name in desc():
surveys |>
filter(!is.na(weight)) |>
group_by(species_id, sex) |>
summarise(mean_weight = mean(weight),
n = n()) |>
arrange(desc(mean_weight))`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.
# A tibble: 46 × 4
# Groups: species_id [18]
species_id sex mean_weight n
<chr> <chr> <dbl> <int>
1 NL M 168. 355
2 NL <NA> 164. 9
3 NL F 151. 460
4 SS M 130 1
5 DS M 123. 1184
6 DS <NA> 121. 16
7 DS F 118. 1055
8 SH F 79.2 61
9 SH M 67.6 34
10 SF F 58.3 3
# ℹ 36 more rows
You may have seen several messages saying summarise() has grouped output by 'species_id'. You can override using the .groups argument.
These are warning you that your resulting data.frame has retained some group structure, which means any subsequent operations on that data.frame will happen at the group level. If you look at the resulting data.frame printed out in your console, you will see these lines:
# A tibble: 46 × 4
# Groups: species_id [18]
They tell us we have a data.frame with 46 rows, 4 columns, and a group variable species_id, for which there are 18 groups.
This can lead to confusing results if you forget about the grouping and want to carry out operations on the whole data.frame, not by group. Therefore, it is a good habit to remove the groups at the end of a pipeline containing group_by():
surveys |>
filter(!is.na(weight)) |>
group_by(species_id, sex) |>
summarize(mean_weight = mean(weight),
n = n()) |>
arrange(desc(mean_weight)) |>
ungroup()`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.
# A tibble: 46 × 4
species_id sex mean_weight n
<chr> <chr> <dbl> <int>
1 NL M 168. 355
2 NL <NA> 164. 9
3 NL F 151. 460
4 SS M 130 1
5 DS M 123. 1184
6 DS <NA> 121. 16
7 DS F 118. 1055
8 SH F 79.2 61
9 SH M 67.6 34
10 SF F 58.3 3
# ℹ 36 more rows
While it is common that you will want to get the one-row-per-group summary that summarise() provides, there are times where you want to calculate a per-group value but keep all the rows in your data.frame. For example, we might want to know the mean weight for each species ID + sex combination, and then we might want to know how far from that mean value each observation in the group is. For this, we can use group_by() and mutate() together:
surveys |>
filter(!is.na(weight)) |>
group_by(species_id, sex) |>
mutate(mean_weight = mean(weight),
weight_diff = weight - mean_weight)# A tibble: 15,186 × 16
# Groups: species_id, sex [46]
source record_id month day year plot_id species_id sex hindfoot_length
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 data_ra… 63 8 19 1977 3 DM M 35
2 data_ra… 64 8 19 1977 7 DM M 37
3 data_ra… 65 8 19 1977 4 DM F 34
4 data_ra… 66 8 19 1977 4 DM F 35
5 data_ra… 67 8 19 1977 7 DM M 35
6 data_ra… 68 8 19 1977 8 DO F 32
7 data_ra… 69 8 19 1977 2 PF M 15
8 data_ra… 70 8 19 1977 3 OX F 21
9 data_ra… 71 8 19 1977 7 DM F 36
10 data_ra… 74 8 19 1977 8 PF M 12
# ℹ 15,176 more rows
# ℹ 7 more variables: weight <dbl>, genus <chr>, species <chr>, taxa <chr>,
# plot_type <chr>, mean_weight <dbl>, weight_diff <dbl>
Since we get all our columns back, the new columns are at the very end and don’t print out in the console. Let’s use select() to just look at the columns of interest. Inside select() we can use the contains() function to get any column containing the word “weight” in the name:
surveys |>
filter(!is.na(weight)) |>
group_by(species_id, sex) |>
mutate(mean_weight = mean(weight),
weight_diff = weight - mean_weight) |>
select(species_id, sex, contains("weight"))# A tibble: 15,186 × 5
# Groups: species_id, sex [46]
species_id sex weight mean_weight weight_diff
<chr> <chr> <dbl> <dbl> <dbl>
1 DM M 40 44.0 -4.00
2 DM M 48 44.0 4.00
3 DM F 29 40.7 -11.7
4 DM F 46 40.7 5.28
5 DM M 36 44.0 -8.00
6 DO F 52 48.4 3.63
7 PF M 8 7.10 0.902
8 OX F 22 21 1
9 DM F 35 40.7 -5.72
10 PF M 7 7.10 -0.0980
# ℹ 15,176 more rows
What happens with the group_by() + mutate() combination is similar to using summarise(): for each group, the mean weight is calculated. However, instead of reporting only one row per group, the mean weight for each group is added to each row in that group. For each row in a group (like DM species ID + male sex), you will see the same value in mean_weight.
Reshaping data
It’s possible to record the same data in multiple formats in a spreadsheet. For example, data on points scored by 3 teams during a series of basketball games could be formatted in a wide format like this:
| Team | Points | Assists | Rebounds |
|---|---|---|---|
| A | 88 | 12 | 22 |
| B | 91 | 17 | 28 |
| C | 99 | 24 | 30 |
Or it could be formatted like this in a longer format:
| Team | Variable | Value | ||
|---|---|---|---|---|
| A | Points | 88 | ||
| A | Assists | 12 | ||
| A | Rebounds | 22 | ||
| B | Points | 91 | ||
| B | Assists | 17 | ||
| B | Rebounds | 28 | ||
| C | Points | 99 | ||
| C | Assists | 24 | ||
| C | Rebounds | 30 |
Wide format data does not contain any repeated values in first column, whereas longer format data does. Many datasets in the real world are recorded in a wide format because people generally find it easier to interpret and more intuitive to record it that way.
In general, when visualising multiple variables in a plot you need data in long format in order for the software to create the plot.
Back to our data. Let’s say we are interested in comparing the mean weights of each species across our different plot_types. We can begin this process using the group_by() + summarise() approach:
sp_by_plot <- surveys |>
filter(!is.na(weight)) |>
group_by(species_id, plot_id) |>
summarise(mean_weight = mean(weight)) |>
arrange(species_id, plot_id)`summarise()` has grouped output by 'species_id'. You can override using the
`.groups` argument.
sp_by_plot# A tibble: 300 × 3
# Groups: species_id [18]
species_id plot_id mean_weight
<chr> <dbl> <dbl>
1 BA 3 8
2 BA 21 6.5
3 DM 1 42.7
4 DM 2 42.6
5 DM 3 41.2
6 DM 4 41.9
7 DM 5 42.6
8 DM 6 42.1
9 DM 7 43.2
10 DM 8 43.4
# ℹ 290 more rows
It’s a bit difficult to compare values across plots when the data is in long format.
The tidyr package in the tidyverse has a pair of functions that allow you to reshape data by pivoting it: pivot_wider() and pivot_longer().
pivot_wider() will make the data wider, which means increasing the number of columns and reducing the number of rows.
pivot_longer() will do the opposite, reducing the number of columns and increasing the number of rows.
In this case, it might be nice to create a data.frame where each species has its own row, and each plot has its own column containing the mean weight for a given species. We will use pivot_wider() to reshape our data in this way. It takes 3 arguments:
- the name of the data.frame
- names_from: which column should be used to generate the names of the new columns?
- values_from: which column should be used to fill in the values of the new columns?
In our case, we want the new columns to be named from our plot_id column, with the values coming from the mean_weight column. We can pipe our data.frame right into pivot_wider() and add those two arguments.
Any columns not used for names_from or values_from will not be pivoted.
sp_by_plot_wide <- sp_by_plot |>
pivot_wider(names_from = plot_id,
values_from = mean_weight)
sp_by_plot_wide# A tibble: 18 × 25
# Groups: species_id [18]
species_id `3` `21` `1` `2` `4` `5` `6` `7` `8`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BA 8 6.5 NA NA NA NA NA NA NA
2 DM 41.2 41.5 42.7 42.6 41.9 42.6 42.1 43.2 43.4
3 DO 42.7 NA 50.1 50.3 46.8 50.4 49.0 52 49.2
4 DS 128. NA 129. 125. 118. 111. 114. 126. 128.
5 NL 171. 136. 154. 171. 164. 192. 176. 170. 134.
6 OL 32.1 28.6 35.5 34 33.0 32.6 31.8 NA 30.3
7 OT 24.1 24.1 23.7 24.9 26.5 23.6 23.5 22 24.1
8 OX 22 NA NA 22 NA 20 NA NA NA
9 PE 22.7 19.6 21.6 22.0 NA 21 21.6 22.8 19.4
10 PF 7.12 7.23 6.57 6.89 6.75 7.5 7.54 7 6.78
11 PH 28 31 NA NA NA 29 NA NA NA
12 PM 20.1 23.6 23.7 23.9 NA 23.7 22.3 23.4 23
13 PP 17.1 13.6 14.3 16.4 14.8 19.8 16.8 NA 13.9
14 RF 14.8 17 NA 16 NA 14 12.1 13 NA
15 RM 10.3 9.89 10.9 10.6 10.4 10.8 10.6 10.7 9
16 SF NA 49 NA NA NA NA NA NA NA
17 SH 76.0 79.9 NA 88 NA 82.7 NA NA NA
18 SS NA NA NA NA NA NA NA NA NA
# ℹ 15 more variables: `9` <dbl>, `10` <dbl>, `11` <dbl>, `12` <dbl>,
# `13` <dbl>, `14` <dbl>, `15` <dbl>, `16` <dbl>, `17` <dbl>, `18` <dbl>,
# `19` <dbl>, `20` <dbl>, `22` <dbl>, `23` <dbl>, `24` <dbl>
We have a new column for each plot_id value. There is one old column left in the data.frame: species_id. It wasn’t used in pivot_wider(), so it stays, and now contains a single entry for each unique species_id value.
Finally, a lot of NAs have appeared. Some species aren’t found in every plot, but because a data.frame has to have a value in every row and every column, an NA is inserted.
We can now pivot this table to a long format with pivot_longer(). It takes 3 arguments:
- cols for the columns you want to pivot
- names_to for the name of the new column which will contain the old column names
- values_to for the name of the new column which will contain the old values
We want to pivot all the columns except species_id, and we will use PLOT for the new column of plot IDs, and MEAN_WT for the new column of mean weight values.
sp_by_plot_wide |>
pivot_longer(cols = -species_id, names_to = "PLOT", values_to = "MEAN_WT")# A tibble: 432 × 3
# Groups: species_id [18]
species_id PLOT MEAN_WT
<chr> <chr> <dbl>
1 BA 3 8
2 BA 21 6.5
3 BA 1 NA
4 BA 2 NA
5 BA 4 NA
6 BA 5 NA
7 BA 6 NA
8 BA 7 NA
9 BA 8 NA
10 BA 9 NA
# ℹ 422 more rows
One thing you will notice is that all those NA values that got generated when we pivoted wider. However, we can filter those out, which gets us back to the same data as sp_by_plot, before we pivoted it wider.
sp_by_plot_wide |>
pivot_longer(cols = -species_id, names_to = "PLOT", values_to = "MEAN_WT") |>
filter(!is.na(MEAN_WT))# A tibble: 300 × 3
# Groups: species_id [18]
species_id PLOT MEAN_WT
<chr> <chr> <dbl>
1 BA 3 8
2 BA 21 6.5
3 DM 3 41.2
4 DM 21 41.5
5 DM 1 42.7
6 DM 2 42.6
7 DM 4 41.9
8 DM 5 42.6
9 DM 6 42.1
10 DM 7 43.2
# ℹ 290 more rows
Exporting data
Let’s say we want to export the wide version of our table so we can submit it with a manuscript to include in a publication.
First, we might want to modify the names of the columns, since right now they are bare numbers, which aren’t very informative. Luckily, pivot_wider() has an argument names_prefix which will allow us to add “plot_” to the start of each column.
sp_by_plot <- sp_by_plot |>
pivot_wider(names_from = plot_id, values_from = mean_weight,
names_prefix = "plot_")Now we can save this data.frame to a CSV using the write_csv() function from the readr package. The first argument is the name of the data.frame, and the second is the path to the new file we want to create, including the file extension .csv.
write_csv(sp_by_plot, "figures/table_survey_sp.csv")