Lesson 1 Getting Started with dplyr
This lesson and the next one will introduce you to some powerful tools for exploratory data analysis, using the gapminder
dataset as an example. Simple descriptive statistics can be extremely powerful. When you read a newspaper columnist arguing that \(X\) is to blame for the massive change in \(Y\), how often do you take a look at the data to see if \(Y\) is really changing at all? Or if it's actually changing in the opposite direction? These are easy wins and we should make the most of them, particularly given that many common misconceptions are predictable.
Today we'll focus on basic data manipulation and summary statistics using the dplyr
package, part of the Tidyverse family of R packages. We'll make heavy use of dplyr
and other Tidyverse packages throughout these lessons. This lesson will only scratch the surface of dplyr
, although we'll also pick up a few more dplyr
tricks in later lessons as well. A good resource for learning more is R for Data Science. The dplyr cheat sheet is also extremely helpful.
So what is dplyr
and why should you use it? There are three popular toolkits for data manipulation in R. The first is so-called "base R." This simply means the built-in commands like subset()
and operators like []
, $
, and so on. You need to know base R and you'll end up using it frequently, but it's simply not a great tool for complicated data manipulation. If you wrangle your data using base R commands, your code will be complicated, hard to read, and hard to maintain. The two main alternatives are dplyr
and data.table
. As you might expect, opinions differ on which of these is the "better" choice. Having used and taught both packages, here's my summary of the pros and cons. The data.table
package is extremely fast. If you routinely work with very large datasets (a million rows or more), it's worth learning data.table
. But data.table
syntax is a bit arcane, and can off-putting to newcomers. The key advantage of dplyr
is that it's easy to learn and its syntax is extremely intuitive and easy to read. It also has many helpful features that data.table
lacks. Given the importance of replicability and open science, this tips the balance in favor of dplyr
for this book.
1.1 Package Installation
Before we can get started, you'll need to install two packages: dplyr
, and gapminder
. To do this, you can either click on the "Packages" tab in RStudio or use the command install.packages()
at the R console, e.g.
install.packages('dplyr')
install.packages('gapminder')
You only need to install a package once, but you need to load it every time you want to use it.
To load a package, use the library()
command, e.g.
library(dplyr)
library(gapminder)
Now we're ready to go! I could tell you all about the data contained in gapminder
, but an important part of this book is helping you to become self-sufficient. So instead I will leave this as an exercise for the reader!
1.1.1 Exercise
After loading gapminder
enter the command ?gapminder
in the R console to view the R help file for this dataset. Read the documentation you find there to answer the following questions:
- How many rows and columns does
gapminder
contain? - What information is contained in each row and column?
- What is the source of the data?
1.2 What is a tibble?
The dplyr
package uses a special operator called the pipe, written %>%
to chain together commands called verbs that act on objects called tibbles. This probably sounds complicated, but it's easier that it looks. Over the next few sections we'll slowly unpack the preceding sentence by looking at a number of simple examples. But first things first: what is a tibble? Let's see what happens if we display the gapminder
dataset:
gapminder
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows
If you're used to working with dataframes in R, this may surprise you. Rather than trying to print all nrow(gapminder)
rows on the screen, R helpfully shows us a useful summary of the information contained in gapminder
. This is because gapminder
is not a dataframe; it's a tibble, often abbreviated tbl. For our purposes, all you really need to know about tibbles is that they are souped up versions of R dataframes that are designed to work seamlessly with dplyr
. To learn more see the chapter "Tibbles" in R for Data Science.
1.3 Filter Rows with filter
We're ready to learn our first dplyr
verb: filter
selects rows.
Here's an example:
%>%
gapminder filter(year == 2007)
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
## 7 Austria Europe 2007 79.8 8199783 36126.
## 8 Bahrain Asia 2007 75.6 708573 29796.
## 9 Bangladesh Asia 2007 64.1 150448339 1391.
## 10 Belgium Europe 2007 79.4 10392226 33693.
## # … with 132 more rows
Compare the results of running this command to what we got when we typed gapminder
into the console above.
Rather than displaying the whole dataset, now R is only showing us the 142 rows for which the column year
has a value of 2007
.
So how does this work? The pipe operator %>%
"pipes" the tibble gapminder
into the function filter()
. The argument year == 2007
tells filter()
that it should find all the rows such that the logical condition year == 2007
is TRUE
.
Oh no! Have we accidentally deleted all of the other rows of gapminder
?
No: we haven't made any changes to gapminder
at all.
If you don't believe me try entering gapminder
at the console.
All that this command does is display a subset of gapminder
.
If we wanted to store the result of running this command, we'd need to assign it to a variable, for example
<- gapminder %>%
gapminder2007 filter(year == 2007)
gapminder2007
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
## 7 Austria Europe 2007 79.8 8199783 36126.
## 8 Bahrain Asia 2007 75.6 708573 29796.
## 9 Bangladesh Asia 2007 64.1 150448339 1391.
## 10 Belgium Europe 2007 79.4 10392226 33693.
## # … with 132 more rows
We can also use filter
to subset on two or more variables.
For example, here we display data for the US in 2007:
%>%
gapminder filter(year == 2007, country == 'United States')
## # A tibble: 1 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 United States Americas 2007 78.2 301139947 42952.
Notice that I always put a linebreak after the pipe operator %>%
in my code examples. This isn't required to make the code run, but it's a good habit. As we start to write longer and more complicated dplyr
commands, linebreaks will make it much easier to understand how our code works.
1.3.1 Exercises
- What is the difference between
x = 3
andx == 3
in R?
The first assigns the value 3
to the variable x
; the second tests whether x
is equal to 3
and returns either TRUE
or FALSE
.
- Write code that uses
filter
to choose the subset ofgapminder
for whichyear
is 2002.
%>% filter(year == 2002) gapminder
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2002 42.1 25268405 727.
## 2 Albania Europe 2002 75.7 3508512 4604.
## 3 Algeria Africa 2002 71.0 31287142 5288.
## 4 Angola Africa 2002 41.0 10866106 2773.
## 5 Argentina Americas 2002 74.3 38331121 8798.
## 6 Australia Oceania 2002 80.4 19546792 30688.
## 7 Austria Europe 2002 79.0 8148312 32418.
## 8 Bahrain Asia 2002 74.8 656397 23404.
## 9 Bangladesh Asia 2002 62.0 135656790 1136.
## 10 Belgium Europe 2002 78.3 10311970 30486.
## # … with 132 more rows
- When I displayed data for the US in 2007, I put quotes around
United States
but not aroundyear
. Why?
This is because year
contains numeric data while country
contains character data, aka string data.
- If you instead try to choose the subset with
year
equal to 2005, something will go wrong. Try it. What happens and why?
If you go back to the help file for gapminder
you'll see that it only contains data for every fifth year. The year 2005 isn't in our dataset so dplyr
will display an empty tibble:
%>%
gapminder filter(year == 2005)
## # A tibble: 0 × 6
## # … with 6 variables: country <fct>, continent <fct>, year <int>,
## # lifeExp <dbl>, pop <int>, gdpPercap <dbl>
- Write code that stores the data for Asian countries in a tibble called
gapminder_asia
. Then display this tibble.
<- gapminder %>%
gapminder_asia filter(continent == 'Asia')
gapminder_asia
## # A tibble: 396 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 386 more rows
- Which country had the higher life expectancy in 1977: Ireland or Brazil? Which had the higher GDP per capita?
%>%
gapminder filter(year == 1977, country %in% c('Ireland', 'Brazil'))
## # A tibble: 2 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Brazil Americas 1977 61.5 114313951 6660.
## 2 Ireland Europe 1977 72.0 3271900 11151.
1.4 Sort data with arrange
Suppose we wanted to sort gapminder
by gdpPercap
.
To do this we can use the arrange
command along with the pipe %>%
as follows:
%>%
gapminder arrange(gdpPercap)
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Congo, Dem. Rep. Africa 2002 45.0 55379852 241.
## 2 Congo, Dem. Rep. Africa 2007 46.5 64606759 278.
## 3 Lesotho Africa 1952 42.1 748747 299.
## 4 Guinea-Bissau Africa 1952 32.5 580653 300.
## 5 Congo, Dem. Rep. Africa 1997 42.6 47798986 312.
## 6 Eritrea Africa 1952 35.9 1438760 329.
## 7 Myanmar Asia 1952 36.3 20092996 331
## 8 Lesotho Africa 1957 45.0 813338 336.
## 9 Burundi Africa 1952 39.0 2445618 339.
## 10 Eritrea Africa 1957 38.0 1542611 344.
## # … with 1,694 more rows
The logic is very similar to what we saw above for filter
. We pipe the tibble gapminder
into the function arrange()
. The argument gdpPercap
tells arrange()
that we want to sort by GDP per capita. Note that by default arrange()
sorts in ascending order. If we want to sort in descending order, we use the function desc()
as follows:
%>% arrange(desc(gdpPercap)) gapminder
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Kuwait Asia 1957 58.0 212846 113523.
## 2 Kuwait Asia 1972 67.7 841934 109348.
## 3 Kuwait Asia 1952 55.6 160000 108382.
## 4 Kuwait Asia 1962 60.5 358266 95458.
## 5 Kuwait Asia 1967 64.6 575003 80895.
## 6 Kuwait Asia 1977 69.3 1140357 59265.
## 7 Norway Europe 2007 80.2 4627926 49357.
## 8 Kuwait Asia 2007 77.6 2505559 47307.
## 9 Singapore Asia 2007 80.0 4553009 47143.
## 10 Norway Europe 2002 79.0 4535591 44684.
## # … with 1,694 more rows
1.4.1 Exercises
- What is the lowest life expectancy in the
gapminder
dataset? Which country and year does it correspond to?
%>%
gapminder arrange(lifeExp)
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Rwanda Africa 1992 23.6 7290203 737.
## 2 Afghanistan Asia 1952 28.8 8425333 779.
## 3 Gambia Africa 1952 30 284320 485.
## 4 Angola Africa 1952 30.0 4232095 3521.
## 5 Sierra Leone Africa 1952 30.3 2143249 880.
## 6 Afghanistan Asia 1957 30.3 9240934 821.
## 7 Cambodia Asia 1977 31.2 6978607 525.
## 8 Mozambique Africa 1952 31.3 6446316 469.
## 9 Sierra Leone Africa 1957 31.6 2295678 1004.
## 10 Burkina Faso Africa 1952 32.0 4469979 543.
## # … with 1,694 more rows
- What is the highest life expectancy in the
gapminder
dataset? Which country and year does it correspond to?
%>%
gapminder arrange(desc(lifeExp))
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Japan Asia 2007 82.6 127467972 31656.
## 2 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 3 Japan Asia 2002 82 127065841 28605.
## 4 Iceland Europe 2007 81.8 301931 36181.
## 5 Switzerland Europe 2007 81.7 7554661 37506.
## 6 Hong Kong, China Asia 2002 81.5 6762476 30209.
## 7 Australia Oceania 2007 81.2 20434176 34435.
## 8 Spain Europe 2007 80.9 40448191 28821.
## 9 Sweden Europe 2007 80.9 9031088 33860.
## 10 Israel Asia 2007 80.7 6426679 25523.
## # … with 1,694 more rows
1.5 Choose columns with select
We use the select
verb to choose a subset of columns.
For example, to display only pop
, country
, and year
, we would write
%>%
gapminder select(pop, country, year)
## # A tibble: 1,704 × 3
## pop country year
## <int> <fct> <int>
## 1 8425333 Afghanistan 1952
## 2 9240934 Afghanistan 1957
## 3 10267083 Afghanistan 1962
## 4 11537966 Afghanistan 1967
## 5 13079460 Afghanistan 1972
## 6 14880372 Afghanistan 1977
## 7 12881816 Afghanistan 1982
## 8 13867957 Afghanistan 1987
## 9 16317921 Afghanistan 1992
## 10 22227415 Afghanistan 1997
## # … with 1,694 more rows
Now suppose that we wanted to select every column except pop
.
Here's one way to do it:
%>%
gapminder select(country, continent, year, lifeExp, gdpPercap)
## # A tibble: 1,704 × 5
## country continent year lifeExp gdpPercap
## <fct> <fct> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 779.
## 2 Afghanistan Asia 1957 30.3 821.
## 3 Afghanistan Asia 1962 32.0 853.
## 4 Afghanistan Asia 1967 34.0 836.
## 5 Afghanistan Asia 1972 36.1 740.
## 6 Afghanistan Asia 1977 38.4 786.
## 7 Afghanistan Asia 1982 39.9 978.
## 8 Afghanistan Asia 1987 40.8 852.
## 9 Afghanistan Asia 1992 41.7 649.
## 10 Afghanistan Asia 1997 41.8 635.
## # … with 1,694 more rows
but that takes a lot of typing! If there were more than a handful of columns in our tibble it would be very difficult to deselect a column in this way. Fortunately there's a shortcut: use the minus sign
%>% select(-pop) gapminder
## # A tibble: 1,704 × 5
## country continent year lifeExp gdpPercap
## <fct> <fct> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 779.
## 2 Afghanistan Asia 1957 30.3 821.
## 3 Afghanistan Asia 1962 32.0 853.
## 4 Afghanistan Asia 1967 34.0 836.
## 5 Afghanistan Asia 1972 36.1 740.
## 6 Afghanistan Asia 1977 38.4 786.
## 7 Afghanistan Asia 1982 39.9 978.
## 8 Afghanistan Asia 1987 40.8 852.
## 9 Afghanistan Asia 1992 41.7 649.
## 10 Afghanistan Asia 1997 41.8 635.
## # … with 1,694 more rows
Just as we could when selecting, we can deselect multiple columns by separating their names with a comma:
%>% select(-pop, -year) gapminder
## # A tibble: 1,704 × 4
## country continent lifeExp gdpPercap
## <fct> <fct> <dbl> <dbl>
## 1 Afghanistan Asia 28.8 779.
## 2 Afghanistan Asia 30.3 821.
## 3 Afghanistan Asia 32.0 853.
## 4 Afghanistan Asia 34.0 836.
## 5 Afghanistan Asia 36.1 740.
## 6 Afghanistan Asia 38.4 786.
## 7 Afghanistan Asia 39.9 978.
## 8 Afghanistan Asia 40.8 852.
## 9 Afghanistan Asia 41.7 649.
## 10 Afghanistan Asia 41.8 635.
## # … with 1,694 more rows
It's easy to mix up the dplyr
verbs select
and filter
.
Here's a handy mnemonic: filteR
filters Rows while seleCt
selects Columns.
Suppose we wanted to select only the column pop
from gapminder
.
1.5.1 Exercise
- Select only the columns
year
,lifeExp
, andcountry
ingapminder
.
%>%
gapminder select(year, lifeExp, country)
## # A tibble: 1,704 × 3
## year lifeExp country
## <int> <dbl> <fct>
## 1 1952 28.8 Afghanistan
## 2 1957 30.3 Afghanistan
## 3 1962 32.0 Afghanistan
## 4 1967 34.0 Afghanistan
## 5 1972 36.1 Afghanistan
## 6 1977 38.4 Afghanistan
## 7 1982 39.9 Afghanistan
## 8 1987 40.8 Afghanistan
## 9 1992 41.7 Afghanistan
## 10 1997 41.8 Afghanistan
## # … with 1,694 more rows
- Select all the columns except
year
,lifeExp
, andcountry
ingapminder
.
%>%
gapminder select(-year, -lifeExp, -country)
## # A tibble: 1,704 × 3
## continent pop gdpPercap
## <fct> <int> <dbl>
## 1 Asia 8425333 779.
## 2 Asia 9240934 821.
## 3 Asia 10267083 853.
## 4 Asia 11537966 836.
## 5 Asia 13079460 740.
## 6 Asia 14880372 786.
## 7 Asia 12881816 978.
## 8 Asia 13867957 852.
## 9 Asia 16317921 649.
## 10 Asia 22227415 635.
## # … with 1,694 more rows
1.6 The summarize
verb
Suppose we want to calculate the sample mean of the column lifeExp
in gapminder
.
We can do this using the summarize
verb as follows:
%>% summarize(mean_lifeExp = mean(lifeExp)) gapminder
## # A tibble: 1 × 1
## mean_lifeExp
## <dbl>
## 1 59.5
Note the syntax: within summarize
we have an assignment statement.
In particular, we assign mean(lifeExp)
to the variable mean_lifeExp
.
The key thing to know about summarize
is that it always returns collapses a tibble with many rows into a single row.
When we think about computing a sample mean, this makes sense: we want to summarize the column lifeExp
as a single number.
It doesn't actually make much sense to compute the mean of lifeExp
because this involves averaging over different countries and different years.
Instead let's compute the mean for a single year: 1952:
%>%
gapminder filter(year == 1952) %>%
summarize(mean_lifeExp = mean(lifeExp))
## # A tibble: 1 × 1
## mean_lifeExp
## <dbl>
## 1 49.1
We can use summarize to compute multiple summary statistics for a single variable, the same summary statistic for multiple variables, or both:
%>%
gapminder filter(year == 1952) %>%
summarize(mean_lifeExp = mean(lifeExp),
sd_lifeExp = sd(lifeExp),
mean_pop = mean(pop))
## # A tibble: 1 × 3
## mean_lifeExp sd_lifeExp mean_pop
## <dbl> <dbl> <dbl>
## 1 49.1 12.2 16950402.
Note that if we don't explicitly use an assignment statement, R will make up names for us based on the commands that we used:
%>%
gapminder filter(year == 1952) %>%
summarize(mean(lifeExp), median(lifeExp), max(lifeExp))
## # A tibble: 1 × 3
## `mean(lifeExp)` `median(lifeExp)` `max(lifeExp)`
## <dbl> <dbl> <dbl>
## 1 49.1 45.1 72.7
1.6.1 Exercise
- Use
summarize
to compute the 75th percentile of life expectancy in 1977.
%>%
gapminder filter(year == 1977) %>%
summarize(quantile(lifeExp, 0.75))
## # A tibble: 1 × 1
## `quantile(lifeExp, 0.75)`
## <dbl>
## 1 70.4
- Use
summarize
to compute the 75th percentile of life expectancy among Asian countries in 1977.
%>%
gapminder filter(year == 1977, continent == 'Asia') %>%
summarize(quantile(lifeExp, 0.75))
## # A tibble: 1 × 1
## `quantile(lifeExp, 0.75)`
## <dbl>
## 1 65.9
1.7 The group_by
verb
The true power of summarize
is its ability to compute grouped summary statistics in combination with another dplyr
verb: group_by
.
In essence, group_by
allows us to tell dplyr
that we don't want to work with the whole dataset at once; rather we want to work with particular subsets or groups.
The basic idea is similar to what we've done using filter
in the past.
For example, to calculate mean population (in millions) and mean life expectancy in the year 2007, we could use the following code:
%>%
gapminder filter(year == 2007) %>%
summarize(meanPop = mean(pop) / 1000000, meanLifeExp = mean(lifeExp))
## # A tibble: 1 × 2
## meanPop meanLifeExp
## <dbl> <dbl>
## 1 44.0 67.0
Using group_by
we could do the same thing for all years in the dataset at once:
%>%
gapminder group_by(year) %>%
summarize(meanPop = mean(pop) / 1000000, meanLifeExp = mean(lifeExp))
## # A tibble: 12 × 3
## year meanPop meanLifeExp
## <int> <dbl> <dbl>
## 1 1952 17.0 49.1
## 2 1957 18.8 51.5
## 3 1962 20.4 53.6
## 4 1967 22.7 55.7
## 5 1972 25.2 57.6
## 6 1977 27.7 59.6
## 7 1982 30.2 61.5
## 8 1987 33.0 63.2
## 9 1992 36.0 64.2
## 10 1997 38.8 65.0
## 11 2002 41.5 65.7
## 12 2007 44.0 67.0
Notice what has changed in the second code block: we replaced filter(year == 2007)
with group_by(year)
.
This tells dplyr
that, rather than simply restricting attention to data from 2007, we want to form subsets (groups) of the dataset that correspond to the values of the year
variable.
Whatever comes after group_by
will then be calculated for these subsets.
Here's another example.
Suppose we wanted to calculate mean life expectancy and total population in each continent during the year 2007.
To accomplish this, we can chain together the filter
, group_by
and summarize
verbs as follows:
%>%
gapminder filter(year == 2007) %>%
group_by(continent) %>%
summarize(meanPop = mean(pop) / 1000000, meanLifeExp = mean(lifeExp))
## # A tibble: 5 × 3
## continent meanPop meanLifeExp
## <fct> <dbl> <dbl>
## 1 Africa 17.9 54.8
## 2 Americas 36.0 73.6
## 3 Asia 116. 70.7
## 4 Europe 19.5 77.6
## 5 Oceania 12.3 80.7
We can also use group_by
to subset over multiple variables at once.
For example, to calculate mean life expectancy and total population in each continent separately for every year, we can use the following code:
%>%
gapminder group_by(year, continent) %>%
summarize(meanPop = mean(pop) / 1000000, meanLifeExp = mean(lifeExp))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 60 × 4
## # Groups: year [12]
## year continent meanPop meanLifeExp
## <int> <fct> <dbl> <dbl>
## 1 1952 Africa 4.57 39.1
## 2 1952 Americas 13.8 53.3
## 3 1952 Asia 42.3 46.3
## 4 1952 Europe 13.9 64.4
## 5 1952 Oceania 5.34 69.3
## 6 1957 Africa 5.09 41.3
## 7 1957 Americas 15.5 56.0
## 8 1957 Asia 47.4 49.3
## 9 1957 Europe 14.6 66.7
## 10 1957 Oceania 5.97 70.3
## # … with 50 more rows
1.7.1 Exercise
- Why doesn't the following code work as expected?
%>%
gapminder summarize(meanLifeExp = mean(lifeExp)) %>%
group_by(year)
The steps are carried out in the wrong order: we need to form groups first and then calculate our desired summaries.
- Calculate the median GDP per capita in each continent in 1977.
%>%
gapminder filter(year == 1977) %>%
group_by(continent) %>%
summarize(medGDPc = median(gdpPercap))
## # A tibble: 5 × 2
## continent medGDPc
## <fct> <dbl>
## 1 Africa 1400.
## 2 Americas 6281.
## 3 Asia 3195.
## 4 Europe 14226.
## 5 Oceania 17284.
- Repeat 2. but sort your results in descending order.
%>%
gapminder filter(year == 1977) %>%
group_by(continent) %>%
summarize(medGDPc = median(gdpPercap)) %>%
arrange(desc(medGDPc))
## # A tibble: 5 × 2
## continent medGDPc
## <fct> <dbl>
## 1 Oceania 17284.
## 2 Europe 14226.
## 3 Americas 6281.
## 4 Asia 3195.
## 5 Africa 1400.
- Calculate the mean and standard deviation of life expectancy for separately for each continent in every year after 1977. Sort your results in ascending order by the standard deviation of life expectancy.
%>%
gapminder filter(year > 1977) %>%
group_by(continent, year) %>%
summarize(meanGDPc = mean(gdpPercap), sdGDPc = sd(gdpPercap)) %>%
arrange(sdGDPc)
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
## # A tibble: 30 × 4
## # Groups: continent [5]
## continent year meanGDPc sdGDPc
## <fct> <int> <dbl> <dbl>
## 1 Oceania 1982 18555. 1304.
## 2 Oceania 1987 20448. 2038.
## 3 Africa 1987 2283. 2567.
## 4 Africa 1992 2282. 2644.
## 5 Africa 1997 2379. 2821.
## 6 Africa 2002 2599. 2973.
## 7 Africa 1982 2482. 3243.
## 8 Oceania 1992 20894. 3579.
## 9 Africa 2007 3089. 3618.
## 10 Oceania 1997 24024. 4206.
## # … with 20 more rows
1.8 Understanding the pipe: %>%
Let's revisit the pipe, %>%
, that we've used in the code examples above.
I told you that the command gapminder %>% filter(year == 2007)
"pipes" the tibble gapminder
into the function filter()
.
But what exactly does this mean?
Take a look at the R help file for the dplyr
function filter
.
We see that filter()
takes something called .data
as its first argument.
Moving on to the "Arguments" section of the help file, we see that .data
is "A tbl" i.e. a tibble.
To better understand what this means, let's try using filter
without the pipe:
filter(gapminder, year == 2007, country == 'United States')
## # A tibble: 1 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 United States Americas 2007 78.2 301139947 42952.
Notice that this gives us exactly the same result as
%>%
gapminder filter(year == 2007, country == 'United States')
## # A tibble: 1 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 United States Americas 2007 78.2 301139947 42952.
In other words The pipe is gives us an alternative way of supplying the first argument to a function.
Let's try this with a more familiar R function: mean
.
The first argument of mean
is a vector x
.
So let's try using the pipe to compute the mean of some data:
<- c(1, 5, 2, 7, 2)
x %>% mean x
## [1] 3.4
The pipe supplies a function with its first argument. If we want to specify additional arguments, we need to do so within the function call itself. For example, here's how we could use the pipe to compute the mean after dropping missing observations:
<- c(1, 5, NA, 7, 2)
y %>%
y mean(na.rm = TRUE)
## [1] 3.75
One important note about the pipe: it's not a base R command.
Instead it's a command provided by the package Magrittr
.
(If you're familiar with the Belgian painter Magritte, you may realize that the name of this package is quite witty!)
This package is installed automatically along with dplyr
.
1.8.1 Exercise
- Write R code that uses the pipe to calculate the sample variance of
z <- c(4, 1, 5, NA, 3)
excluding the missing observation from the calculation.
<- c(4, 1, 5, NA, 3)
z %>% var(na.rm = TRUE) z
## [1] 2.916667
- Write code to arrange
gapminder
bylifeExp
, first in increasing order and then in decreasing order, without using the pipe.
arrange(gapminder,lifeExp)
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Rwanda Africa 1992 23.6 7290203 737.
## 2 Afghanistan Asia 1952 28.8 8425333 779.
## 3 Gambia Africa 1952 30 284320 485.
## 4 Angola Africa 1952 30.0 4232095 3521.
## 5 Sierra Leone Africa 1952 30.3 2143249 880.
## 6 Afghanistan Asia 1957 30.3 9240934 821.
## 7 Cambodia Asia 1977 31.2 6978607 525.
## 8 Mozambique Africa 1952 31.3 6446316 469.
## 9 Sierra Leone Africa 1957 31.6 2295678 1004.
## 10 Burkina Faso Africa 1952 32.0 4469979 543.
## # … with 1,694 more rows
arrange(gapminder, desc(lifeExp))
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Japan Asia 2007 82.6 127467972 31656.
## 2 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 3 Japan Asia 2002 82 127065841 28605.
## 4 Iceland Europe 2007 81.8 301931 36181.
## 5 Switzerland Europe 2007 81.7 7554661 37506.
## 6 Hong Kong, China Asia 2002 81.5 6762476 30209.
## 7 Australia Oceania 2007 81.2 20434176 34435.
## 8 Spain Europe 2007 80.9 40448191 28821.
## 9 Sweden Europe 2007 80.9 9031088 33860.
## 10 Israel Asia 2007 80.7 6426679 25523.
## # … with 1,694 more rows
1.9 Chaining commands
In the examples we've looked at so far, the pipe doesn't seem all that useful: it's just an alternative way of specifying the first argument to a function.
The true power and convenience of the pipe only becomes apparent we need to chain a series of commands together.
For example, suppose we wanted to display the 1952 data from gapminder
sorted by gdpPercap
in descending order.
Using the pipe, this is easy:
%>%
gapminder filter(year == 1952) %>%
arrange(desc(gdpPercap))
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Kuwait Asia 1952 55.6 160000 108382.
## 2 Switzerland Europe 1952 69.6 4815000 14734.
## 3 United States Americas 1952 68.4 157553000 13990.
## 4 Canada Americas 1952 68.8 14785584 11367.
## 5 New Zealand Oceania 1952 69.4 1994794 10557.
## 6 Norway Europe 1952 72.7 3327728 10095.
## 7 Australia Oceania 1952 69.1 8691212 10040.
## 8 United Kingdom Europe 1952 69.2 50430000 9980.
## 9 Bahrain Asia 1952 50.9 120447 9867.
## 10 Denmark Europe 1952 70.8 4334000 9692.
## # … with 132 more rows
Notice how I split the commands across multiple lines.
This is good practice: it makes your code much easier to read.
So what's happening when we chain commands in this way?
The first step in the chain gapminder %>% filter(year == 1952)
returns a tibble: the subset of gapminder
for which year
is 1952
.
The next step %>% arrange(gdpPercap)
pipes this new tibble into the function arrange()
, giving us the desired result.
I hope you agree with me that this is pretty intuitive: even if we didn't know anything about dplyr
we could almost figure out what this code is supposed to do.
In stark contrast, let's look at the code we'd have to use if we wanted to accomplish the same task without using the pipe:
arrange(filter(gapminder, year == 1952),
desc(gdpPercap))
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Kuwait Asia 1952 55.6 160000 108382.
## 2 Switzerland Europe 1952 69.6 4815000 14734.
## 3 United States Americas 1952 68.4 157553000 13990.
## 4 Canada Americas 1952 68.8 14785584 11367.
## 5 New Zealand Oceania 1952 69.4 1994794 10557.
## 6 Norway Europe 1952 72.7 3327728 10095.
## 7 Australia Oceania 1952 69.1 8691212 10040.
## 8 United Kingdom Europe 1952 69.2 50430000 9980.
## 9 Bahrain Asia 1952 50.9 120447 9867.
## 10 Denmark Europe 1952 70.8 4334000 9692.
## # … with 132 more rows
There are may reasons why this code is harder to read, but the most important one is that the commands arrange
and filter
have to appear in the code in the opposite of the order in which they are actually being carried out.
This is because parentheses are evaluated from inside to outside.
This is what's great about the pipe: it lets us write our code in a way that accords with the actual order of the steps we want to carry out.
1.9.1 Exercise
- What was the most populous European country in 1992? Write appropriate
dplyr
code using the pipe to display the information you need to answer this question.
%>%
gapminder filter(year == 1992, continent == 'Europe') %>%
arrange(desc(pop))
## # A tibble: 30 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Germany Europe 1992 76.1 80597764 26505.
## 2 Turkey Europe 1992 66.1 58179144 5678.
## 3 United Kingdom Europe 1992 76.4 57866349 22705.
## 4 France Europe 1992 77.5 57374179 24704.
## 5 Italy Europe 1992 77.4 56840847 22014.
## 6 Spain Europe 1992 77.6 39549438 18603.
## 7 Poland Europe 1992 71.0 38370697 7739.
## 8 Romania Europe 1992 69.4 22797027 6598.
## 9 Netherlands Europe 1992 77.4 15174244 26791.
## 10 Hungary Europe 1992 69.2 10348684 10536.
## # … with 20 more rows
- Re-write your code from part 1. without using the pipe.
arrange(filter(gapminder, year == 1992, continent == 'Europe'), desc(pop))
## # A tibble: 30 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Germany Europe 1992 76.1 80597764 26505.
## 2 Turkey Europe 1992 66.1 58179144 5678.
## 3 United Kingdom Europe 1992 76.4 57866349 22705.
## 4 France Europe 1992 77.5 57374179 24704.
## 5 Italy Europe 1992 77.4 56840847 22014.
## 6 Spain Europe 1992 77.6 39549438 18603.
## 7 Poland Europe 1992 71.0 38370697 7739.
## 8 Romania Europe 1992 69.4 22797027 6598.
## 9 Netherlands Europe 1992 77.4 15174244 26791.
## 10 Hungary Europe 1992 69.2 10348684 10536.
## # … with 20 more rows
1.10 Change an existing variable or create a new one with mutate
It's a little hard to read the column pop
in gapminder
since there are so many digits.
Suppose that, instead of raw population, we wanted to display population in millions.
This requires us to pop
by 1000000
, which we can do using the function mutate()
from dplyr
as follows:
%>%
gapminder mutate(pop = pop / 1000000)
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8.43 779.
## 2 Afghanistan Asia 1957 30.3 9.24 821.
## 3 Afghanistan Asia 1962 32.0 10.3 853.
## 4 Afghanistan Asia 1967 34.0 11.5 836.
## 5 Afghanistan Asia 1972 36.1 13.1 740.
## 6 Afghanistan Asia 1977 38.4 14.9 786.
## 7 Afghanistan Asia 1982 39.9 12.9 978.
## 8 Afghanistan Asia 1987 40.8 13.9 852.
## 9 Afghanistan Asia 1992 41.7 16.3 649.
## 10 Afghanistan Asia 1997 41.8 22.2 635.
## # … with 1,694 more rows
Note the syntax here: within mutate()
we have an assignment statement, namely pop = pop / 1000000
.
This tells R to calculate pop / 1000000
and assign the result to pop
, in place of the original variable.
We can also use mutate()
to create a new variable.
The gapminder
dataset doesn't contain overall GDP, only GDP per capita.
To calculate GDP, we need to multiply gdpPercap
by pop
.
But wait!
Didn't we just change pop so it's expressed in millions?
No: we never stored the results of our previous command, we simply displayed them.
Just as I discussed above, unless you overwrite it, the original gapminder
dataset will be unchanged.
With this in mind, we can create the gdp
variable as follows:
%>%
gapminder mutate(gdp = pop * gdpPercap)
## # A tibble: 1,704 × 7
## country continent year lifeExp pop gdpPercap gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
## 7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
## 8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
## 9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
## 10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
## # … with 1,694 more rows
1.10.1 Exercise
- Explain why we used
=
rather than==
in themutate()
examples above.
We used =
because we're carrying out an assignment operation. In contrast ==
tests for equality, returning TRUE
or FALSE
.
- Calculate life expectancy in months and use it to answer the following question: "which country in the Americas had the shortest life expectancy in months in the year 1962?"
%>%
gapminder mutate(lifeExpMonths = 12 * lifeExp) %>%
filter(year == 1962, continent == 'Americas') %>%
arrange(lifeExpMonths)
## # A tibble: 25 × 7
## country continent year lifeExp pop gdpPercap lifeExpMonths
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Bolivia Americas 1962 43.4 3593918 2181. 521.
## 2 Haiti Americas 1962 43.6 3880130 1797. 523.
## 3 Guatemala Americas 1962 47.0 4208858 2750. 563.
## 4 Honduras Americas 1962 48.0 2090162 2291. 576.
## 5 Nicaragua Americas 1962 48.6 1590597 3634. 584.
## 6 Peru Americas 1962 49.1 10516500 4957. 589.
## 7 El Salvador Americas 1962 52.3 2747687 3777. 628.
## 8 Dominican Republic Americas 1962 53.5 3453434 1662. 642.
## 9 Ecuador Americas 1962 54.6 4681707 4086. 656.
## 10 Brazil Americas 1962 55.7 76039390 3337. 668.
## # … with 15 more rows