Lesson 1 Getting Started with dplyr

This lesson and the next one will introduce you to some powerful tools for exploratory data analysis, using the gapminder dataset as an example. Simple descriptive statistics can be extremely powerful. When you read a newspaper columnist arguing that \(X\) is to blame for the massive change in \(Y\), how often do you take a look at the data to see if \(Y\) is really changing at all? Or if it's actually changing in the opposite direction? These are easy wins and we should make the most of them, particularly given that many common misconceptions are predictable.

Today we'll focus on basic data manipulation and summary statistics using the dplyr package, part of the Tidyverse family of R packages. We'll make heavy use of dplyr and other Tidyverse packages throughout these lessons. This lesson will only scratch the surface of dplyr, although we'll also pick up a few more dplyr tricks in later lessons as well. A good resource for learning more is R for Data Science. The dplyr cheat sheet is also extremely helpful.

So what is dplyr and why should you use it? There are three popular toolkits for data manipulation in R. The first is so-called "base R." This simply means the built-in commands like subset() and operators like [], $, and so on. You need to know base R and you'll end up using it frequently, but it's simply not a great tool for complicated data manipulation. If you wrangle your data using base R commands, your code will be complicated, hard to read, and hard to maintain. The two main alternatives are dplyr and data.table. As you might expect, opinions differ on which of these is the "better" choice. Having used and taught both packages, here's my summary of the pros and cons. The data.table package is extremely fast. If you routinely work with very large datasets (a million rows or more), it's worth learning data.table. But data.table syntax is a bit arcane, and can off-putting to newcomers. The key advantage of dplyr is that it's easy to learn and its syntax is extremely intuitive and easy to read. It also has many helpful features that data.table lacks. Given the importance of replicability and open science, this tips the balance in favor of dplyr for this book.

1.1 Package Installation

Before we can get started, you'll need to install two packages: dplyr, and gapminder. To do this, you can either click on the "Packages" tab in RStudio or use the command install.packages() at the R console, e.g.

install.packages('dplyr')
install.packages('gapminder')

You only need to install a package once, but you need to load it every time you want to use it. To load a package, use the library() command, e.g.

library(dplyr)
library(gapminder)

Now we're ready to go! I could tell you all about the data contained in gapminder, but an important part of this book is helping you to become self-sufficient. So instead I will leave this as an exercise for the reader!

1.1.1 Exercise

After loading gapminder enter the command ?gapminder in the R console to view the R help file for this dataset. Read the documentation you find there to answer the following questions:

  1. How many rows and columns does gapminder contain?
  2. What information is contained in each row and column?
  3. What is the source of the data?

1.2 What is a tibble?

The dplyr package uses a special operator called the pipe, written %>% to chain together commands called verbs that act on objects called tibbles. This probably sounds complicated, but it's easier that it looks. Over the next few sections we'll slowly unpack the preceding sentence by looking at a number of simple examples. But first things first: what is a tibble? Let's see what happens if we display the gapminder dataset:

gapminder
## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows

If you're used to working with dataframes in R, this may surprise you. Rather than trying to print all nrow(gapminder) rows on the screen, R helpfully shows us a useful summary of the information contained in gapminder. This is because gapminder is not a dataframe; it's a tibble, often abbreviated tbl. For our purposes, all you really need to know about tibbles is that they are souped up versions of R dataframes that are designed to work seamlessly with dplyr. To learn more see the chapter "Tibbles" in R for Data Science.

1.3 Filter Rows with filter

We're ready to learn our first dplyr verb: filter selects rows. Here's an example:

gapminder %>% 
  filter(year == 2007)
## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # … with 132 more rows

Compare the results of running this command to what we got when we typed gapminder into the console above. Rather than displaying the whole dataset, now R is only showing us the 142 rows for which the column year has a value of 2007. So how does this work? The pipe operator %>% "pipes" the tibble gapminder into the function filter(). The argument year == 2007 tells filter() that it should find all the rows such that the logical condition year == 2007 is TRUE.

Oh no! Have we accidentally deleted all of the other rows of gapminder? No: we haven't made any changes to gapminder at all. If you don't believe me try entering gapminder at the console. All that this command does is display a subset of gapminder. If we wanted to store the result of running this command, we'd need to assign it to a variable, for example

gapminder2007 <- gapminder %>% 
  filter(year == 2007)
gapminder2007
## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # … with 132 more rows

We can also use filter to subset on two or more variables. For example, here we display data for the US in 2007:

gapminder %>% 
  filter(year == 2007, country == 'United States')
## # A tibble: 1 × 6
##   country       continent  year lifeExp       pop gdpPercap
##   <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
## 1 United States Americas   2007    78.2 301139947    42952.

Notice that I always put a linebreak after the pipe operator %>% in my code examples. This isn't required to make the code run, but it's a good habit. As we start to write longer and more complicated dplyr commands, linebreaks will make it much easier to understand how our code works.

1.3.1 Exercises

  1. What is the difference between x = 3 and x == 3 in R?

The first assigns the value 3 to the variable x; the second tests whether x is equal to 3 and returns either TRUE or FALSE.

  1. Write code that uses filter to choose the subset of gapminder for which year is 2002.
gapminder %>% filter(year == 2002)
## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2002    42.1  25268405      727.
##  2 Albania     Europe     2002    75.7   3508512     4604.
##  3 Algeria     Africa     2002    71.0  31287142     5288.
##  4 Angola      Africa     2002    41.0  10866106     2773.
##  5 Argentina   Americas   2002    74.3  38331121     8798.
##  6 Australia   Oceania    2002    80.4  19546792    30688.
##  7 Austria     Europe     2002    79.0   8148312    32418.
##  8 Bahrain     Asia       2002    74.8    656397    23404.
##  9 Bangladesh  Asia       2002    62.0 135656790     1136.
## 10 Belgium     Europe     2002    78.3  10311970    30486.
## # … with 132 more rows
  1. When I displayed data for the US in 2007, I put quotes around United States but not around year. Why?

This is because year contains numeric data while country contains character data, aka string data.

  1. If you instead try to choose the subset with year equal to 2005, something will go wrong. Try it. What happens and why?

If you go back to the help file for gapminder you'll see that it only contains data for every fifth year. The year 2005 isn't in our dataset so dplyr will display an empty tibble:

gapminder %>% 
  filter(year == 2005)
## # A tibble: 0 × 6
## # … with 6 variables: country <fct>, continent <fct>, year <int>,
## #   lifeExp <dbl>, pop <int>, gdpPercap <dbl>
  1. Write code that stores the data for Asian countries in a tibble called gapminder_asia. Then display this tibble.
gapminder_asia <- gapminder %>% 
  filter(continent == 'Asia')
gapminder_asia
## # A tibble: 396 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 386 more rows
  1. Which country had the higher life expectancy in 1977: Ireland or Brazil? Which had the higher GDP per capita?
gapminder %>% 
  filter(year == 1977, country %in% c('Ireland', 'Brazil'))
## # A tibble: 2 × 6
##   country continent  year lifeExp       pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Brazil  Americas   1977    61.5 114313951     6660.
## 2 Ireland Europe     1977    72.0   3271900    11151.

1.4 Sort data with arrange

Suppose we wanted to sort gapminder by gdpPercap. To do this we can use the arrange command along with the pipe %>% as follows:

gapminder %>% 
  arrange(gdpPercap)
## # A tibble: 1,704 × 6
##    country          continent  year lifeExp      pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Congo, Dem. Rep. Africa     2002    45.0 55379852      241.
##  2 Congo, Dem. Rep. Africa     2007    46.5 64606759      278.
##  3 Lesotho          Africa     1952    42.1   748747      299.
##  4 Guinea-Bissau    Africa     1952    32.5   580653      300.
##  5 Congo, Dem. Rep. Africa     1997    42.6 47798986      312.
##  6 Eritrea          Africa     1952    35.9  1438760      329.
##  7 Myanmar          Asia       1952    36.3 20092996      331 
##  8 Lesotho          Africa     1957    45.0   813338      336.
##  9 Burundi          Africa     1952    39.0  2445618      339.
## 10 Eritrea          Africa     1957    38.0  1542611      344.
## # … with 1,694 more rows

The logic is very similar to what we saw above for filter. We pipe the tibble gapminder into the function arrange(). The argument gdpPercap tells arrange() that we want to sort by GDP per capita. Note that by default arrange() sorts in ascending order. If we want to sort in descending order, we use the function desc() as follows:

gapminder %>% arrange(desc(gdpPercap))
## # A tibble: 1,704 × 6
##    country   continent  year lifeExp     pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Kuwait    Asia       1957    58.0  212846   113523.
##  2 Kuwait    Asia       1972    67.7  841934   109348.
##  3 Kuwait    Asia       1952    55.6  160000   108382.
##  4 Kuwait    Asia       1962    60.5  358266    95458.
##  5 Kuwait    Asia       1967    64.6  575003    80895.
##  6 Kuwait    Asia       1977    69.3 1140357    59265.
##  7 Norway    Europe     2007    80.2 4627926    49357.
##  8 Kuwait    Asia       2007    77.6 2505559    47307.
##  9 Singapore Asia       2007    80.0 4553009    47143.
## 10 Norway    Europe     2002    79.0 4535591    44684.
## # … with 1,694 more rows

1.4.1 Exercises

  1. What is the lowest life expectancy in the gapminder dataset? Which country and year does it correspond to?
gapminder %>% 
  arrange(lifeExp)
## # A tibble: 1,704 × 6
##    country      continent  year lifeExp     pop gdpPercap
##    <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Rwanda       Africa     1992    23.6 7290203      737.
##  2 Afghanistan  Asia       1952    28.8 8425333      779.
##  3 Gambia       Africa     1952    30    284320      485.
##  4 Angola       Africa     1952    30.0 4232095     3521.
##  5 Sierra Leone Africa     1952    30.3 2143249      880.
##  6 Afghanistan  Asia       1957    30.3 9240934      821.
##  7 Cambodia     Asia       1977    31.2 6978607      525.
##  8 Mozambique   Africa     1952    31.3 6446316      469.
##  9 Sierra Leone Africa     1957    31.6 2295678     1004.
## 10 Burkina Faso Africa     1952    32.0 4469979      543.
## # … with 1,694 more rows
  1. What is the highest life expectancy in the gapminder dataset? Which country and year does it correspond to?
gapminder %>% 
  arrange(desc(lifeExp))
## # A tibble: 1,704 × 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.
##  3 Japan            Asia       2002    82   127065841    28605.
##  4 Iceland          Europe     2007    81.8    301931    36181.
##  5 Switzerland      Europe     2007    81.7   7554661    37506.
##  6 Hong Kong, China Asia       2002    81.5   6762476    30209.
##  7 Australia        Oceania    2007    81.2  20434176    34435.
##  8 Spain            Europe     2007    80.9  40448191    28821.
##  9 Sweden           Europe     2007    80.9   9031088    33860.
## 10 Israel           Asia       2007    80.7   6426679    25523.
## # … with 1,694 more rows

1.5 Choose columns with select

We use the select verb to choose a subset of columns. For example, to display only pop, country, and year, we would write

gapminder %>% 
  select(pop, country, year)
## # A tibble: 1,704 × 3
##         pop country      year
##       <int> <fct>       <int>
##  1  8425333 Afghanistan  1952
##  2  9240934 Afghanistan  1957
##  3 10267083 Afghanistan  1962
##  4 11537966 Afghanistan  1967
##  5 13079460 Afghanistan  1972
##  6 14880372 Afghanistan  1977
##  7 12881816 Afghanistan  1982
##  8 13867957 Afghanistan  1987
##  9 16317921 Afghanistan  1992
## 10 22227415 Afghanistan  1997
## # … with 1,694 more rows

Now suppose that we wanted to select every column except pop. Here's one way to do it:

gapminder %>% 
  select(country, continent, year, lifeExp, gdpPercap)
## # A tibble: 1,704 × 5
##    country     continent  year lifeExp gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8      779.
##  2 Afghanistan Asia       1957    30.3      821.
##  3 Afghanistan Asia       1962    32.0      853.
##  4 Afghanistan Asia       1967    34.0      836.
##  5 Afghanistan Asia       1972    36.1      740.
##  6 Afghanistan Asia       1977    38.4      786.
##  7 Afghanistan Asia       1982    39.9      978.
##  8 Afghanistan Asia       1987    40.8      852.
##  9 Afghanistan Asia       1992    41.7      649.
## 10 Afghanistan Asia       1997    41.8      635.
## # … with 1,694 more rows

but that takes a lot of typing! If there were more than a handful of columns in our tibble it would be very difficult to deselect a column in this way. Fortunately there's a shortcut: use the minus sign

gapminder %>% select(-pop)
## # A tibble: 1,704 × 5
##    country     continent  year lifeExp gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8      779.
##  2 Afghanistan Asia       1957    30.3      821.
##  3 Afghanistan Asia       1962    32.0      853.
##  4 Afghanistan Asia       1967    34.0      836.
##  5 Afghanistan Asia       1972    36.1      740.
##  6 Afghanistan Asia       1977    38.4      786.
##  7 Afghanistan Asia       1982    39.9      978.
##  8 Afghanistan Asia       1987    40.8      852.
##  9 Afghanistan Asia       1992    41.7      649.
## 10 Afghanistan Asia       1997    41.8      635.
## # … with 1,694 more rows

Just as we could when selecting, we can deselect multiple columns by separating their names with a comma:

gapminder %>% select(-pop, -year)
## # A tibble: 1,704 × 4
##    country     continent lifeExp gdpPercap
##    <fct>       <fct>       <dbl>     <dbl>
##  1 Afghanistan Asia         28.8      779.
##  2 Afghanistan Asia         30.3      821.
##  3 Afghanistan Asia         32.0      853.
##  4 Afghanistan Asia         34.0      836.
##  5 Afghanistan Asia         36.1      740.
##  6 Afghanistan Asia         38.4      786.
##  7 Afghanistan Asia         39.9      978.
##  8 Afghanistan Asia         40.8      852.
##  9 Afghanistan Asia         41.7      649.
## 10 Afghanistan Asia         41.8      635.
## # … with 1,694 more rows

It's easy to mix up the dplyr verbs select and filter. Here's a handy mnemonic: filteR filters Rows while seleCt selects Columns. Suppose we wanted to select only the column pop from gapminder.

1.5.1 Exercise

  1. Select only the columns year, lifeExp, and country in gapminder.
gapminder %>% 
  select(year, lifeExp, country)
## # A tibble: 1,704 × 3
##     year lifeExp country    
##    <int>   <dbl> <fct>      
##  1  1952    28.8 Afghanistan
##  2  1957    30.3 Afghanistan
##  3  1962    32.0 Afghanistan
##  4  1967    34.0 Afghanistan
##  5  1972    36.1 Afghanistan
##  6  1977    38.4 Afghanistan
##  7  1982    39.9 Afghanistan
##  8  1987    40.8 Afghanistan
##  9  1992    41.7 Afghanistan
## 10  1997    41.8 Afghanistan
## # … with 1,694 more rows
  1. Select all the columns except year, lifeExp, and country in gapminder.
gapminder %>% 
  select(-year, -lifeExp, -country)
## # A tibble: 1,704 × 3
##    continent      pop gdpPercap
##    <fct>        <int>     <dbl>
##  1 Asia       8425333      779.
##  2 Asia       9240934      821.
##  3 Asia      10267083      853.
##  4 Asia      11537966      836.
##  5 Asia      13079460      740.
##  6 Asia      14880372      786.
##  7 Asia      12881816      978.
##  8 Asia      13867957      852.
##  9 Asia      16317921      649.
## 10 Asia      22227415      635.
## # … with 1,694 more rows

1.6 The summarize verb

Suppose we want to calculate the sample mean of the column lifeExp in gapminder. We can do this using the summarize verb as follows:

gapminder %>% summarize(mean_lifeExp = mean(lifeExp))
## # A tibble: 1 × 1
##   mean_lifeExp
##          <dbl>
## 1         59.5

Note the syntax: within summarize we have an assignment statement. In particular, we assign mean(lifeExp) to the variable mean_lifeExp. The key thing to know about summarize is that it always returns collapses a tibble with many rows into a single row. When we think about computing a sample mean, this makes sense: we want to summarize the column lifeExp as a single number. It doesn't actually make much sense to compute the mean of lifeExp because this involves averaging over different countries and different years. Instead let's compute the mean for a single year: 1952:

gapminder %>% 
  filter(year == 1952) %>%
  summarize(mean_lifeExp = mean(lifeExp))
## # A tibble: 1 × 1
##   mean_lifeExp
##          <dbl>
## 1         49.1

We can use summarize to compute multiple summary statistics for a single variable, the same summary statistic for multiple variables, or both:

gapminder %>% 
  filter(year == 1952) %>%
  summarize(mean_lifeExp = mean(lifeExp),
            sd_lifeExp = sd(lifeExp), 
            mean_pop = mean(pop))
## # A tibble: 1 × 3
##   mean_lifeExp sd_lifeExp  mean_pop
##          <dbl>      <dbl>     <dbl>
## 1         49.1       12.2 16950402.

Note that if we don't explicitly use an assignment statement, R will make up names for us based on the commands that we used:

gapminder %>% 
  filter(year == 1952) %>%
  summarize(mean(lifeExp), median(lifeExp), max(lifeExp))
## # A tibble: 1 × 3
##   `mean(lifeExp)` `median(lifeExp)` `max(lifeExp)`
##             <dbl>             <dbl>          <dbl>
## 1            49.1              45.1           72.7

1.6.1 Exercise

  1. Use summarize to compute the 75th percentile of life expectancy in 1977.
gapminder %>% 
  filter(year == 1977) %>%
  summarize(quantile(lifeExp, 0.75))
## # A tibble: 1 × 1
##   `quantile(lifeExp, 0.75)`
##                       <dbl>
## 1                      70.4
  1. Use summarize to compute the 75th percentile of life expectancy among Asian countries in 1977.
gapminder %>% 
  filter(year == 1977, continent == 'Asia') %>%
  summarize(quantile(lifeExp, 0.75))
## # A tibble: 1 × 1
##   `quantile(lifeExp, 0.75)`
##                       <dbl>
## 1                      65.9

1.7 The group_by verb

The true power of summarize is its ability to compute grouped summary statistics in combination with another dplyr verb: group_by. In essence, group_by allows us to tell dplyr that we don't want to work with the whole dataset at once; rather we want to work with particular subsets or groups. The basic idea is similar to what we've done using filter in the past. For example, to calculate mean population (in millions) and mean life expectancy in the year 2007, we could use the following code:

gapminder %>%
  filter(year == 2007) %>%
  summarize(meanPop = mean(pop) / 1000000, meanLifeExp = mean(lifeExp))
## # A tibble: 1 × 2
##   meanPop meanLifeExp
##     <dbl>       <dbl>
## 1    44.0        67.0

Using group_by we could do the same thing for all years in the dataset at once:

gapminder %>%
  group_by(year) %>%
  summarize(meanPop = mean(pop) / 1000000, meanLifeExp = mean(lifeExp))
## # A tibble: 12 × 3
##     year meanPop meanLifeExp
##    <int>   <dbl>       <dbl>
##  1  1952    17.0        49.1
##  2  1957    18.8        51.5
##  3  1962    20.4        53.6
##  4  1967    22.7        55.7
##  5  1972    25.2        57.6
##  6  1977    27.7        59.6
##  7  1982    30.2        61.5
##  8  1987    33.0        63.2
##  9  1992    36.0        64.2
## 10  1997    38.8        65.0
## 11  2002    41.5        65.7
## 12  2007    44.0        67.0

Notice what has changed in the second code block: we replaced filter(year == 2007) with group_by(year). This tells dplyr that, rather than simply restricting attention to data from 2007, we want to form subsets (groups) of the dataset that correspond to the values of the year variable. Whatever comes after group_by will then be calculated for these subsets.

Here's another example. Suppose we wanted to calculate mean life expectancy and total population in each continent during the year 2007. To accomplish this, we can chain together the filter, group_by and summarize verbs as follows:

gapminder %>% 
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(meanPop = mean(pop) / 1000000, meanLifeExp = mean(lifeExp))
## # A tibble: 5 × 3
##   continent meanPop meanLifeExp
##   <fct>       <dbl>       <dbl>
## 1 Africa       17.9        54.8
## 2 Americas     36.0        73.6
## 3 Asia        116.         70.7
## 4 Europe       19.5        77.6
## 5 Oceania      12.3        80.7

We can also use group_by to subset over multiple variables at once. For example, to calculate mean life expectancy and total population in each continent separately for every year, we can use the following code:

gapminder %>% 
  group_by(year, continent) %>%
  summarize(meanPop = mean(pop) / 1000000, meanLifeExp = mean(lifeExp))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 60 × 4
## # Groups:   year [12]
##     year continent meanPop meanLifeExp
##    <int> <fct>       <dbl>       <dbl>
##  1  1952 Africa       4.57        39.1
##  2  1952 Americas    13.8         53.3
##  3  1952 Asia        42.3         46.3
##  4  1952 Europe      13.9         64.4
##  5  1952 Oceania      5.34        69.3
##  6  1957 Africa       5.09        41.3
##  7  1957 Americas    15.5         56.0
##  8  1957 Asia        47.4         49.3
##  9  1957 Europe      14.6         66.7
## 10  1957 Oceania      5.97        70.3
## # … with 50 more rows

1.7.1 Exercise

  1. Why doesn't the following code work as expected?
gapminder %>%
  summarize(meanLifeExp = mean(lifeExp)) %>%
  group_by(year)

The steps are carried out in the wrong order: we need to form groups first and then calculate our desired summaries.

  1. Calculate the median GDP per capita in each continent in 1977.
gapminder %>% 
  filter(year == 1977) %>%
  group_by(continent) %>%
  summarize(medGDPc = median(gdpPercap))
## # A tibble: 5 × 2
##   continent medGDPc
##   <fct>       <dbl>
## 1 Africa      1400.
## 2 Americas    6281.
## 3 Asia        3195.
## 4 Europe     14226.
## 5 Oceania    17284.
  1. Repeat 2. but sort your results in descending order.
gapminder %>% 
  filter(year == 1977) %>%
  group_by(continent) %>%
  summarize(medGDPc = median(gdpPercap)) %>%
  arrange(desc(medGDPc))
## # A tibble: 5 × 2
##   continent medGDPc
##   <fct>       <dbl>
## 1 Oceania    17284.
## 2 Europe     14226.
## 3 Americas    6281.
## 4 Asia        3195.
## 5 Africa      1400.
  1. Calculate the mean and standard deviation of life expectancy for separately for each continent in every year after 1977. Sort your results in ascending order by the standard deviation of life expectancy.
gapminder %>% 
  filter(year > 1977) %>%
  group_by(continent, year) %>%
  summarize(meanGDPc = mean(gdpPercap), sdGDPc = sd(gdpPercap)) %>%
  arrange(sdGDPc)
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
## # A tibble: 30 × 4
## # Groups:   continent [5]
##    continent  year meanGDPc sdGDPc
##    <fct>     <int>    <dbl>  <dbl>
##  1 Oceania    1982   18555.  1304.
##  2 Oceania    1987   20448.  2038.
##  3 Africa     1987    2283.  2567.
##  4 Africa     1992    2282.  2644.
##  5 Africa     1997    2379.  2821.
##  6 Africa     2002    2599.  2973.
##  7 Africa     1982    2482.  3243.
##  8 Oceania    1992   20894.  3579.
##  9 Africa     2007    3089.  3618.
## 10 Oceania    1997   24024.  4206.
## # … with 20 more rows

1.8 Understanding the pipe: %>%

Let's revisit the pipe, %>%, that we've used in the code examples above. I told you that the command gapminder %>% filter(year == 2007) "pipes" the tibble gapminder into the function filter(). But what exactly does this mean? Take a look at the R help file for the dplyr function filter. We see that filter() takes something called .data as its first argument. Moving on to the "Arguments" section of the help file, we see that .data is "A tbl" i.e. a tibble. To better understand what this means, let's try using filter without the pipe:

filter(gapminder, year == 2007, country == 'United States')
## # A tibble: 1 × 6
##   country       continent  year lifeExp       pop gdpPercap
##   <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
## 1 United States Americas   2007    78.2 301139947    42952.

Notice that this gives us exactly the same result as

gapminder %>% 
  filter(year == 2007, country == 'United States')
## # A tibble: 1 × 6
##   country       continent  year lifeExp       pop gdpPercap
##   <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
## 1 United States Americas   2007    78.2 301139947    42952.

In other words The pipe is gives us an alternative way of supplying the first argument to a function. Let's try this with a more familiar R function: mean. The first argument of mean is a vector x. So let's try using the pipe to compute the mean of some data:

x <- c(1, 5, 2, 7, 2)
x %>% mean
## [1] 3.4

The pipe supplies a function with its first argument. If we want to specify additional arguments, we need to do so within the function call itself. For example, here's how we could use the pipe to compute the mean after dropping missing observations:

y <- c(1, 5, NA, 7, 2)
y %>% 
  mean(na.rm = TRUE)
## [1] 3.75

One important note about the pipe: it's not a base R command. Instead it's a command provided by the package Magrittr. (If you're familiar with the Belgian painter Magritte, you may realize that the name of this package is quite witty!) This package is installed automatically along with dplyr.

1.8.1 Exercise

  1. Write R code that uses the pipe to calculate the sample variance of z <- c(4, 1, 5, NA, 3) excluding the missing observation from the calculation.
z <- c(4, 1, 5, NA, 3)
z %>% var(na.rm = TRUE)
## [1] 2.916667
  1. Write code to arrange gapminder by lifeExp, first in increasing order and then in decreasing order, without using the pipe.
arrange(gapminder,lifeExp)
## # A tibble: 1,704 × 6
##    country      continent  year lifeExp     pop gdpPercap
##    <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Rwanda       Africa     1992    23.6 7290203      737.
##  2 Afghanistan  Asia       1952    28.8 8425333      779.
##  3 Gambia       Africa     1952    30    284320      485.
##  4 Angola       Africa     1952    30.0 4232095     3521.
##  5 Sierra Leone Africa     1952    30.3 2143249      880.
##  6 Afghanistan  Asia       1957    30.3 9240934      821.
##  7 Cambodia     Asia       1977    31.2 6978607      525.
##  8 Mozambique   Africa     1952    31.3 6446316      469.
##  9 Sierra Leone Africa     1957    31.6 2295678     1004.
## 10 Burkina Faso Africa     1952    32.0 4469979      543.
## # … with 1,694 more rows
arrange(gapminder, desc(lifeExp))
## # A tibble: 1,704 × 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.
##  3 Japan            Asia       2002    82   127065841    28605.
##  4 Iceland          Europe     2007    81.8    301931    36181.
##  5 Switzerland      Europe     2007    81.7   7554661    37506.
##  6 Hong Kong, China Asia       2002    81.5   6762476    30209.
##  7 Australia        Oceania    2007    81.2  20434176    34435.
##  8 Spain            Europe     2007    80.9  40448191    28821.
##  9 Sweden           Europe     2007    80.9   9031088    33860.
## 10 Israel           Asia       2007    80.7   6426679    25523.
## # … with 1,694 more rows

1.9 Chaining commands

In the examples we've looked at so far, the pipe doesn't seem all that useful: it's just an alternative way of specifying the first argument to a function. The true power and convenience of the pipe only becomes apparent we need to chain a series of commands together. For example, suppose we wanted to display the 1952 data from gapminder sorted by gdpPercap in descending order. Using the pipe, this is easy:

gapminder %>% 
  filter(year == 1952) %>% 
  arrange(desc(gdpPercap))
## # A tibble: 142 × 6
##    country        continent  year lifeExp       pop gdpPercap
##    <fct>          <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Kuwait         Asia       1952    55.6    160000   108382.
##  2 Switzerland    Europe     1952    69.6   4815000    14734.
##  3 United States  Americas   1952    68.4 157553000    13990.
##  4 Canada         Americas   1952    68.8  14785584    11367.
##  5 New Zealand    Oceania    1952    69.4   1994794    10557.
##  6 Norway         Europe     1952    72.7   3327728    10095.
##  7 Australia      Oceania    1952    69.1   8691212    10040.
##  8 United Kingdom Europe     1952    69.2  50430000     9980.
##  9 Bahrain        Asia       1952    50.9    120447     9867.
## 10 Denmark        Europe     1952    70.8   4334000     9692.
## # … with 132 more rows

Notice how I split the commands across multiple lines. This is good practice: it makes your code much easier to read. So what's happening when we chain commands in this way? The first step in the chain gapminder %>% filter(year == 1952) returns a tibble: the subset of gapminder for which year is 1952. The next step %>% arrange(gdpPercap) pipes this new tibble into the function arrange(), giving us the desired result. I hope you agree with me that this is pretty intuitive: even if we didn't know anything about dplyr we could almost figure out what this code is supposed to do. In stark contrast, let's look at the code we'd have to use if we wanted to accomplish the same task without using the pipe:

arrange(filter(gapminder, year == 1952), 
        desc(gdpPercap))
## # A tibble: 142 × 6
##    country        continent  year lifeExp       pop gdpPercap
##    <fct>          <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Kuwait         Asia       1952    55.6    160000   108382.
##  2 Switzerland    Europe     1952    69.6   4815000    14734.
##  3 United States  Americas   1952    68.4 157553000    13990.
##  4 Canada         Americas   1952    68.8  14785584    11367.
##  5 New Zealand    Oceania    1952    69.4   1994794    10557.
##  6 Norway         Europe     1952    72.7   3327728    10095.
##  7 Australia      Oceania    1952    69.1   8691212    10040.
##  8 United Kingdom Europe     1952    69.2  50430000     9980.
##  9 Bahrain        Asia       1952    50.9    120447     9867.
## 10 Denmark        Europe     1952    70.8   4334000     9692.
## # … with 132 more rows

There are may reasons why this code is harder to read, but the most important one is that the commands arrange and filter have to appear in the code in the opposite of the order in which they are actually being carried out. This is because parentheses are evaluated from inside to outside. This is what's great about the pipe: it lets us write our code in a way that accords with the actual order of the steps we want to carry out.

1.9.1 Exercise

  1. What was the most populous European country in 1992? Write appropriate dplyr code using the pipe to display the information you need to answer this question.
gapminder %>%
  filter(year == 1992, continent == 'Europe') %>%
  arrange(desc(pop))
## # A tibble: 30 × 6
##    country        continent  year lifeExp      pop gdpPercap
##    <fct>          <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Germany        Europe     1992    76.1 80597764    26505.
##  2 Turkey         Europe     1992    66.1 58179144     5678.
##  3 United Kingdom Europe     1992    76.4 57866349    22705.
##  4 France         Europe     1992    77.5 57374179    24704.
##  5 Italy          Europe     1992    77.4 56840847    22014.
##  6 Spain          Europe     1992    77.6 39549438    18603.
##  7 Poland         Europe     1992    71.0 38370697     7739.
##  8 Romania        Europe     1992    69.4 22797027     6598.
##  9 Netherlands    Europe     1992    77.4 15174244    26791.
## 10 Hungary        Europe     1992    69.2 10348684    10536.
## # … with 20 more rows
  1. Re-write your code from part 1. without using the pipe.
arrange(filter(gapminder, year == 1992, continent == 'Europe'), desc(pop))
## # A tibble: 30 × 6
##    country        continent  year lifeExp      pop gdpPercap
##    <fct>          <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Germany        Europe     1992    76.1 80597764    26505.
##  2 Turkey         Europe     1992    66.1 58179144     5678.
##  3 United Kingdom Europe     1992    76.4 57866349    22705.
##  4 France         Europe     1992    77.5 57374179    24704.
##  5 Italy          Europe     1992    77.4 56840847    22014.
##  6 Spain          Europe     1992    77.6 39549438    18603.
##  7 Poland         Europe     1992    71.0 38370697     7739.
##  8 Romania        Europe     1992    69.4 22797027     6598.
##  9 Netherlands    Europe     1992    77.4 15174244    26791.
## 10 Hungary        Europe     1992    69.2 10348684    10536.
## # … with 20 more rows

1.10 Change an existing variable or create a new one with mutate

It's a little hard to read the column pop in gapminder since there are so many digits. Suppose that, instead of raw population, we wanted to display population in millions. This requires us to pop by 1000000, which we can do using the function mutate() from dplyr as follows:

gapminder %>% 
  mutate(pop = pop / 1000000)
## # A tibble: 1,704 × 6
##    country     continent  year lifeExp   pop gdpPercap
##    <fct>       <fct>     <int>   <dbl> <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8.43      779.
##  2 Afghanistan Asia       1957    30.3  9.24      821.
##  3 Afghanistan Asia       1962    32.0 10.3       853.
##  4 Afghanistan Asia       1967    34.0 11.5       836.
##  5 Afghanistan Asia       1972    36.1 13.1       740.
##  6 Afghanistan Asia       1977    38.4 14.9       786.
##  7 Afghanistan Asia       1982    39.9 12.9       978.
##  8 Afghanistan Asia       1987    40.8 13.9       852.
##  9 Afghanistan Asia       1992    41.7 16.3       649.
## 10 Afghanistan Asia       1997    41.8 22.2       635.
## # … with 1,694 more rows

Note the syntax here: within mutate() we have an assignment statement, namely pop = pop / 1000000. This tells R to calculate pop / 1000000 and assign the result to pop, in place of the original variable.

We can also use mutate() to create a new variable. The gapminder dataset doesn't contain overall GDP, only GDP per capita. To calculate GDP, we need to multiply gdpPercap by pop. But wait! Didn't we just change pop so it's expressed in millions? No: we never stored the results of our previous command, we simply displayed them. Just as I discussed above, unless you overwrite it, the original gapminder dataset will be unchanged. With this in mind, we can create the gdp variable as follows:

gapminder %>% 
  mutate(gdp = pop * gdpPercap)
## # A tibble: 1,704 × 7
##    country     continent  year lifeExp      pop gdpPercap          gdp
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # … with 1,694 more rows

1.10.1 Exercise

  1. Explain why we used = rather than == in the mutate() examples above.

We used = because we're carrying out an assignment operation. In contrast == tests for equality, returning TRUE or FALSE.

  1. Calculate life expectancy in months and use it to answer the following question: "which country in the Americas had the shortest life expectancy in months in the year 1962?"
gapminder %>%
  mutate(lifeExpMonths = 12 * lifeExp) %>%
  filter(year == 1962, continent == 'Americas') %>%
  arrange(lifeExpMonths)
## # A tibble: 25 × 7
##    country            continent  year lifeExp      pop gdpPercap lifeExpMonths
##    <fct>              <fct>     <int>   <dbl>    <int>     <dbl>         <dbl>
##  1 Bolivia            Americas   1962    43.4  3593918     2181.          521.
##  2 Haiti              Americas   1962    43.6  3880130     1797.          523.
##  3 Guatemala          Americas   1962    47.0  4208858     2750.          563.
##  4 Honduras           Americas   1962    48.0  2090162     2291.          576.
##  5 Nicaragua          Americas   1962    48.6  1590597     3634.          584.
##  6 Peru               Americas   1962    49.1 10516500     4957.          589.
##  7 El Salvador        Americas   1962    52.3  2747687     3777.          628.
##  8 Dominican Republic Americas   1962    53.5  3453434     1662.          642.
##  9 Ecuador            Americas   1962    54.6  4681707     4086.          656.
## 10 Brazil             Americas   1962    55.7 76039390     3337.          668.
## # … with 15 more rows