# Lesson 2 Getting Started with `ggplot2`

In this lesson we'll build on your knowledge of `dplyr`

and the `gapminder`

dataset and introduce `ggplot2`

, the R graphics package *par excellence*. Like `dplyr`

, `ggplot2`

is also a part of the Tidyverse family of packages. To install the whole family of packages, use `install.packages('tidyverse')`

. To load all of them at once, use `library(tidyverse)`

. (Of course you can also install and load `ggplot2`

on its own if you prefer.) This lesson is only the tip of the iceberg when it comes to `ggplot2`

. We'll pick up a few more `ggplot2`

tricks in future lessons. For a more comprehensive treatment, see the free online draft of Data Visualization: A Practical Introduction. Another good reference is R for Data Science, and don't forget the `ggplot2`

cheat sheet!

## 2.1 A simple scatterplot using `ggplot2`

We'll start off by constructing a subset of the `gapminder`

dataset that contains information from the year 2007 that we'll use for our plots below.

```
library(gapminder)
library(tidyverse)
<- gapminder %>%
gapminder_2007 filter(year == 2007)
```

It takes some time to grow accustomed to `ggplot2`

syntax, so rather than giving you a lot of detail, we'll examine a series of examples that start off simple and become more complex. The first of these is a simple scatterplot using `gapminder_2007`

.
Each point will correspond to a single country in 2007. Its x-coordinate will be GDP per capita and its y-coordinate will be life expectancy. Here's the code:

```
ggplot(gapminder_2007) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp))
```

We see that GDP per capita is a very strong predictor of life expectancy, although the relationship is non-linear. Notice how I add a linebreak after the `+`

. This is analogous to how I always add a linebreak after the pipe `%>%`

. While it isn't necessary for the code to run correctly, it improves readability.

### 2.1.1 Exercise

- Using my code example as a template, make a scatterplot with
`pop`

on the x-axis and`lifeExp`

on the y-axis using`gapminder_2007`

. Does there appear to be a relationship between population and life expectancy?

There is no clear relationship between population and life expectancy based on the 2007 data:

```
ggplot(gapminder_2007) +
geom_point(mapping = aes(x = pop, y = lifeExp))
```

- Repeat 1. with
`gdpPercap`

on the y-axis.

There is no clear relationship between population and GDP per capita based on the 2007 data:

```
ggplot(gapminder_2007) +
geom_point(mapping = aes(x = pop, y = gdpPercap))
```

## 2.2 Plotting on the log scale

It's fairly common to transform data onto a log scale before carrying out further analysis or plotting. To transform the x-axis to the log base 10 scale, it's as easy as adding a `+ scale_x_log10()`

to the end of our command from above:

```
ggplot(data = gapminder_2007) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp)) +
scale_x_log10()
```

Again: notice how I split the code across multiple lines and ended each of the intermediate lines with the `+`

. This makes things much easier to read.

### 2.2.1 Exercise

- Using my code example as a template, make a scatterplot with the log base 10 of
`pop`

on the x-axis and`lifeExp`

on the y-axis using the`gapminder_2007`

dataset.

```
ggplot(data = gapminder_2007) +
geom_point(mapping = aes(x = pop, y = lifeExp)) +
scale_x_log10()
```

- Suppose that rather than putting the x-axis on the log scale, we wanted to put the
*y-axis*on the log scale. Figure out how to do this, either by clever guesswork or a google search, and then repeat my example with`gdpPercap`

and`lifeExp`

with`gdpPercap`

in levels and`lifeExp`

in logs base 10.

```
ggplot(data = gapminder_2007) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp)) +
scale_y_log10()
```

- Repeat 2. but with
*both*axes on the log scale.

```
ggplot(data = gapminder_2007) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp)) +
scale_x_log10() +
scale_y_log10()
```

## 2.3 The color and size aesthetics

It's time to start unraveling the somewhat mysterious-looking syntax of `ggplot`

.
To make a graph using `ggplot`

we use the following template:

```
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```

replacing `<DATA>`

, `<GEOM_FUNCTION>`

, and `<MAPPINGS>`

to specify what we want to plot and how it should appear. The first part is easy: we replace `<DATA>`

with the dataset we want to plot, for example `gapminder_2007`

in the example from above.
The second part is also fairly straightforward: we replace `<GEOM_FUNCTION>`

with the name of a function that specifies the kind of plot we want to make.
So far we've only seen one example: `geom_point()`

which tells `ggplot`

that we want to make a scatterplot. We'll see more examples in later lessons. For now, I want to focus on the somewhat more complicated-looking `mapping = aes(<MAPPINGS>)`

.

The abbreviation `aes`

is short for *aesthetic* and the code `mapping = aes(<MAPPINGS>)`

defines what is called an *aesthetic mapping*. This is just a fancy way of saying that it tells R how we want our plot to look. The information we need to put in place of `<MAPPINGS>`

depends on what kind of plot we're making. Thus far we've only examined `geom_point()`

which produces a scatterplot. For this kind of plot, the minimum information we need to provide is the location of each point. For example, in our example above we wrote `aes(x = gdpPercap, y = lifeExp)`

to tell R that `gdpPercap`

gives the x-axis location of each point, and `lifeExp`

gives the y-axis location.

When making a scatterplot with `geom_point`

we are not limited to specifying the x and y coordinates of each point; we can also specify the size and color of each point. This gives us a useful way of displaying more than two variables in a two-dimensional plot. We do this using `aes`

. For example, let's use the color of each point to indicate `continent`

```
ggplot(data = gapminder_2007) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
scale_x_log10()
```

Notice how `ggplot`

automatically generates a helpful legend. This plot makes it easy to see at a glance that the European countries in 2007 tend to have high GDP per capita and high life expectancy, while the African countries have the opposite. We can also use the *size* of each point to encode information, e.g. population:

```
ggplot(data = gapminder_2007) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp,
color = continent, size = pop)) +
scale_x_log10()
```

### 2.3.1 Exercise

- Would it make sense to set
`size = continent`

? What about setting`col = pop`

? Explain briefly.

Neither of these makes sense since `continent`

is categorical and `pop`

is continuous: `color`

is useful for categorical variables and `size`

for continuous ones.

- The following code is slightly different from what I've written above. What is different. Try running it. What happens? Explain briefly.

```
ggplot(gapminder_2007) +
geom_point(aes(x = gdpPercap, y = lifeExp)) +
scale_x_log10()
```

It still works! You don't have to explicitly write `data`

or `mapping`

when using `ggplot`

. I only included these above for clarity. In the future I'll leave them out to make my code more succinct.

```
ggplot(gapminder_2007) +
geom_point(aes(x = gdpPercap, y = lifeExp)) +
scale_x_log10()
```

- Create a tibble called
`gapminder_1952`

that contains data from`gapminder`

from 1952.

```
<- gapminder %>%
gapminder_1952 filter(year == 1952)
```

- Use
`gapminder_1952`

from the previous part to create a scatter plot with population on the x-axis, life expectancy on the y-axis, and continent represented by the color of the points. Plot population on the log scale (base 10).

```
ggplot(gapminder_1952) +
geom_point(aes(x = pop, y = lifeExp, color = continent)) +
scale_x_log10()
```

- Suppose that instead of indicating continent using color, you wanted all the points in the plot from 3. to be blue. Consult the chapter "Visualising Data" from R for Data Science to find out how to do this.

When you want color to be a variable from your dataset, put `color = <VARIABLE>`

*inside* of `aes`

; when you simply want to set the colors of all the points, put `color = '<COLOR>'`

*outside* of `aes`

, for example

```
ggplot(gapminder_1952) +
geom_point(aes(x = pop, y = lifeExp), color = 'blue') +
scale_x_log10()
```

## 2.4 Faceting - Plots for Multiple Subsets

Recall our plot of GDP per capita and life expectancy in 2007 from above:

```
<- gapminder %>%
gapminder_2007 filter(year == 2007)
ggplot(gapminder_2007) +
geom_point(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
scale_x_log10()
```

This is an easy way to make a plot for a single year. But what if you wanted to make the same plot for *every year* in the `gapminder`

dataset? It would take a lot of copying-and-pasting of the preceding code chunk to accomplish this. Fortunately there's a much easier way: *faceting*. In `ggplot2`

a *facet* is a subplot that corresponds to a subset of your dataset, for example the year 2007. We'll now use faceting to reproduce the plot from above for all the years in `gapminder`

simultaneously:

```
ggplot(gapminder) +
geom_point(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
scale_x_log10() +
facet_wrap(~ year)
```

Note the syntax here: in a similar way to how we added `scale_x_log10()`

to plot on the log scale, we add `facet_wrap(~ year)`

to facet by `year`

.
The tilde `~`

is important: this has to precede the variable by which you want to facet.

Now that we understand how to produce it, let's take a closer look at this plot.
Notice how this plot allows us to visualize five variables *simultaneously*.
By looking at how the plots change over time, we see a pattern of increasing GDP per capita and life expectancy throughout the world between 1952 and 2007. Notice in particular the dramatic improvements in both variables in the Asian economies.

### 2.4.1 Exercise

- What would happen if I were to run the following code? Explain briefly.

```
ggplot(gapminder_2007) +
geom_point(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
scale_x_log10() +
facet_wrap(~ year)
```

We'll only get one facet since the tibble `gapminder_2007`

only has data for 2007:

```
ggplot(gapminder_2007) +
geom_point(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
scale_x_log10() +
facet_wrap(~ year)
```

- Make a scatterplot with data from
`gapminder`

for the year 1977. Your plot should be faceted by continent with GDP per capita on the log scale on the x-axis, life expectancy on the y-axis, and population indicated by the size of each point.

```
<- gapminder %>%
gapminder_1977 filter(year == 1977)
ggplot(gapminder_1977) +
geom_point(aes(x = gdpPercap, y = lifeExp, size = pop)) +
scale_x_log10() +
facet_wrap(~ continent)
```

- What would happen if you tried to facet by
`pop`

? Explain briefly.

You'll get something crazy if you try this. Population is continuous rather than categorical so every country has a different value for this variable. You'll end up with one plot for every country, containing a single point:

```
# Not run: it takes a long time and looks nasty!
<- gapminder %>%
gapminder_1977 filter(year == 1977)
ggplot(gapminder_1977) +
geom_point(aes(x = gdpPercap, y = lifeExp, color = continent)) +
scale_x_log10() +
facet_wrap(~ pop)
```

## 2.5 Plotting summarized data

By combining `summarize`

and `group_by`

with `ggplot`

, it's easy to make plots of grouped data.
For example, here's how we could plot total world population in millions from 1952 to 2007.
First we construct a tibble which I'll name `by_year`

containing the desired summary statistic grouped by year and display it:

```
<- gapminder %>%
by_year mutate(popMil = pop / 1000000) %>%
group_by(year) %>%
summarize(totalpopMil = sum(popMil))
by_year
```

```
## # A tibble: 12 × 2
## year totalpopMil
## <int> <dbl>
## 1 1952 2407.
## 2 1957 2664.
## 3 1962 2900.
## 4 1967 3217.
## 5 1972 3577.
## 6 1977 3930.
## 7 1982 4289.
## 8 1987 4691.
## 9 1992 5111.
## 10 1997 5515.
## 11 2002 5887.
## 12 2007 6251.
```

Then we make a scatterplot using `ggplot`

:

```
ggplot(by_year) +
geom_point(aes(x = year, y = totalpopMil))
```

Here's a more complicated example where we additionally use color to plot each continent separately:

```
<- gapminder %>%
by_year_continent mutate(popMil = pop / 1000000) %>%
group_by(year, continent) %>%
summarize(totalpopMil = sum(popMil))
by_year
```

```
## # A tibble: 12 × 2
## year totalpopMil
## <int> <dbl>
## 1 1952 2407.
## 2 1957 2664.
## 3 1962 2900.
## 4 1967 3217.
## 5 1972 3577.
## 6 1977 3930.
## 7 1982 4289.
## 8 1987 4691.
## 9 1992 5111.
## 10 1997 5515.
## 11 2002 5887.
## 12 2007 6251.
```

```
ggplot(by_year_continent) +
geom_point(aes(x = year, y = totalpopMil, color = continent))
```

Make sure you understand how the preceding example works before attempting the exercise.

### 2.5.1 Exercise

- What happens if you append
`+ expand_limits(y = 0)`

to the preceding`ggplot`

code? Why might this be helpful in some cases?

The function `expand_limits()`

lets us tweak the limits of our x or y-axis in a `ggplot`

. In this particular example `expand_limits(y = 0)`

ensures that the y-axis begins at zero. Without using this command, `ggplot`

will choose the y-axis on its own so that there is no "empty space" in the plot. Sometimes we may want to override this behavior.

- Make a scatter with average GDP per capita across all countries in
`gapminder`

in the y-axis and`year`

on the x-axis.

```
<- gapminder %>%
by_year group_by(year) %>%
summarize(meanGDPc = mean(gdpPercap))
ggplot(by_year) +
geom_point(aes(x = year, y = meanGDPc))
```

- Repeat 2. broken down by continent, using color to distinguish the points. Put mean GDP per capita on the log scale.

```
<- gapminder %>%
by_year group_by(year, continent) %>%
summarize(meanGDPc = mean(gdpPercap))
```

```
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
```

```
ggplot(by_year) +
geom_point(aes(x = year, y = meanGDPc, color = continent)) +
scale_y_log10()
```

## 2.6 Line plots

Thus far we've only learned how to make one kind of plot with `ggplot`

: a scatterplot, which we constructed using `geom_scatter()`

. Sometimes we want to *connect* the dots in a scatterplot, for example when we're interested in visualizing a trend over time. The resulting plot is called a *line plot*. To make one, simply replace `geom_scatter()`

with `geom_line()`

. For example:

```
<- gapminder %>%
by_year_continent mutate(popMil = pop / 1000000) %>%
group_by(year, continent) %>%
summarize(totalpopMil = sum(popMil))
by_year
```

```
## # A tibble: 60 × 3
## # Groups: year [12]
## year continent meanGDPc
## <int> <fct> <dbl>
## 1 1952 Africa 1253.
## 2 1952 Americas 4079.
## 3 1952 Asia 5195.
## 4 1952 Europe 5661.
## 5 1952 Oceania 10298.
## 6 1957 Africa 1385.
## 7 1957 Americas 4616.
## 8 1957 Asia 5788.
## 9 1957 Europe 6963.
## 10 1957 Oceania 11599.
## # … with 50 more rows
```

```
ggplot(by_year_continent) +
geom_line(aes(x = year, y = totalpopMil, color = continent))
```

### 2.6.1 Exercise

Repeat exercise 5-3 with a line plot rather than a scatterplot.

```
<- gapminder %>%
by_year group_by(year, continent) %>%
summarize(meanGDPc = mean(gdpPercap))
```

```
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
```

```
ggplot(by_year) +
geom_line(aes(x = year, y = meanGDPc, color = continent)) +
scale_y_log10()
```

## 2.7 Bar plots

To make a bar plot, we use `geom_col()`

. Note that the `x`

argument of `aes`

needs to be a *categorical variable* for a bar plot to make sense. Here's a simple example:

```
<- gapminder %>%
by_continent filter(year == 2007) %>%
group_by(continent) %>%
summarize(meanLifeExp = mean(lifeExp))
ggplot(by_continent) +
geom_col(aes(x = continent, y = meanLifeExp))
```

Sometimes we want to turn a bar plot, or some other kind of plot, on its side. This can be particularly helpful if the x-axis labels are very long. To do this, simply add `+ coord_flip()`

to your `ggplot`

command, for example:

```
ggplot(by_continent) +
geom_col(aes(x = continent, y = meanLifeExp)) +
coord_flip()
```

## 2.8 Cleveland Dot Charts

Now that you know how to make a barchart *don't bother*; *dot charts* as described by Cleveland (1984), are a simpler, cleaner and more flexible alternative.

```
ggplot(by_continent) +
geom_point(aes(x = meanLifeExp, y = continent))
```

Unlike the equivalent bar chart from above, this dot chart *restricts* the `meanLifeExp`

axis rather than extending it all the way to zero. This makes sense given that our interest in making this plot is to compare average life expectancy across continents.

Dot charts are typically most informative when *sorted* by the continuous variable, `meanLifeExp`

in our case. A convenient way to achieve this is by using the `fct_reorder()`

function from the `forcats`

package, a member of the Tidyverse. Here's a simple pipeline that does the trick:

```
library(forcats)
%>%
by_continent mutate(continent = fct_reorder(continent, meanLifeExp)) %>%
ggplot() +
geom_point(aes(x = meanLifeExp, y = continent))
```

The first argument of `fct_reorder()`

is the factor whose levels we want to re-order. The second argument is the variable that we'll use to determine the order. In our case, we reorder `continent`

according to `meanLifeExp`

. Another thing worth noticing in the preceding code chunk is the way that I modified `by_continent`

*in place* and piped the result directly into `ggplot()`

. I didn't bother to store this modified version of `by_continent`

or give it a new name, because I knew that I wouldn't need to use it again.

Because dots take up less space than bars, dot charts provide a cleaner way of making comparisons *within* and *between* groups simultaneously. Here's a more complicated example that shows how life expectancy has changed in each continent between 1987 and 2007:

```
%>%
gapminder filter(year %in% c(1987, 2007)) %>%
mutate(year = factor(year)) %>%
group_by(continent, year) %>%
summarize(meanLifeExp = mean(lifeExp)) %>%
ggplot(aes(x = meanLifeExp, y = continent)) +
geom_line(aes(group = continent)) +
geom_point(aes(color = year))
```

### 2.8.1 Exercise

Make a dot chart of GDP per capita in all European countries in the year 2007. Sort the dots so that the country with the *highest* GDP per capita appears a the top and the country with the *lowest* appears at the bottom.

```
%>%
gapminder filter(continent == 'Europe', year == 2007) %>%
mutate(country = fct_reorder(country, gdpPercap)) %>%
ggplot() +
geom_point(aes(x = gdpPercap, y = country))
```

## 2.9 Histograms

To make a `ggplot2`

histogram, we use the function `geom_histogram()`

. Recall that a histogram summarizes a *single* variable at a time by forming non-overlapping bins of equal width and calculating the fraction of observations in each bin. If we choose a different width for the bins, we'll get a different histogram. Here's an example of two different bin widths:

```
<- gapminder %>%
gapminder_2007 filter(year == 2007)
ggplot(gapminder_2007) +
geom_histogram(aes(x = lifeExp), binwidth = 5)
```

```
ggplot(gapminder_2007) +
geom_histogram(aes(x = lifeExp), binwidth = 1)
```

### 2.9.1 Exercise

- All of the examples we've seen that use
`ggplot`

*besides*histograms have involved specifying both`x`

and`y`

within`aes()`

. Why are histograms different?

This is because histograms only depict a single variable while the other plots we've made show two variables at once.

- What happens if you don't specify a bin width in either of my two examples?

If you don't specify a bin width, `ggplot2`

will pick one for you and possibly give you a warning suggesting that you pick a better bin width manually.

- Make a histogram of GDP per capita in 1977. Play around with different bin widths until you find one that gives a good summary of the data.

There's no obvious *right answer* for the bin width, but here's one possibility:

```
<- gapminder %>%
gapminder1977 filter(year == 1977)
ggplot(gapminder_1977) +
geom_histogram(aes(x = gdpPercap), binwidth = 5000)
```

- Repeat 3. but put GDP per capita on the log scale.

You'll need a much smaller bin width when using the log scale, for example:

```
ggplot(gapminder_1977) +
geom_histogram(aes(x = gdpPercap), binwidth = 0.2) +
scale_x_log10()
```

- Compare and contrast the two different histograms you've made.

No right answer: it's a discussion question! But the idea is to see how taking logs gets rid of the huge positive skewness in GDP per capita.

## 2.10 Boxplots

The final kind of `ggplot`

we'll learn about in this lesson is a boxplot, a visualization of the *five-number summary* of a variable: minimum, 25th percentile, median, 75th percentile, and maximum. To make a boxplot in `ggplot`

we use the function `geom_boxplot()`

, for example:

```
ggplot(gapminder_2007) +
geom_boxplot(aes(x = continent, y = lifeExp))
```

Compared to histograms, boxplots provide less detail but allow us to easily compare across groups.

### 2.10.1 Exercise

- What is the meaning of the little "dots" that appear in the boxplot above? Use a Google search to find out what they are and how they are computed.

They are outliers: `ggplot`

considers any observation that is more than 1.5 times the interquartile range away from the "box" to be an outlier, and adds a point to indicate it.

- Use faceting to construct a collection of boxplots, each of which compares log GDP per capita across continents in a given year. Turn your boxplots sideways to make it easier to read the continent labels.

```
ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap)) +
facet_wrap(~ year) +
scale_y_log10() +
coord_flip() +
ggtitle('GDP per Capita by Continent: 1952-2007')
```

- Use a Google search to find out how to add a title to a
`ggplot`

. Use it to add a title to the plot you created in 2.

Use `ggtitle('YOUR TITLE HERE')`

as I did in my solution to 2. above.