Chapter 6 Data Visualization with ggplot2
Main reference for this chapter: R graphics cookbook (https://r-graphics.org/)
In the previous chapter, we learned how to create some basic plots with base R. In this chapter, we will see how to use ggplot for data visualization.
We will use some of the datasets from the package gcookbook, so install it now if you have not already done so (for example, with install.packages("gcookbook")).
Load the packages gcookbook, tidyverse and nycflights13.
library(gcookbook) # contains some datasets for illustration
library(tidyverse) # contains ggplot2 and dplyr
library(nycflights13) # contains the dataset "flights"
6.1 Bar charts
We will start with bar charts. Many of the techniques discussed in this section also carry over to other types of plots.
Recall that there are two types of bar charts:
- Bar chart of values. x-axis: discrete variable, y-axis: numeric data (not necessarily count data)
- Bar chart of counts. x-axis: discrete variable, y-axis: count of cases in the discrete variable
Using ggplot:
- For a bar chart of values, we use geom_col(), which is the same as using geom_bar(stat = "identity").
- For a bar chart of counts, we use geom_bar(), which is the same as using geom_bar(stat = "count"). That is, the default for geom_bar() is to use stat = "count".
Bar chart of values:
pg_mean is a simple dataset with groupwise means of some plant growth data.
Recall the mtcars dataset. Let’s create a bar chart of values for the mean weights grouped by the number of gears. First, we summarize the data using summarize().
by_gear <- group_by(mtcars, gear)
mtcars_wt <- summarize(by_gear, mean_wt_by_gear = mean(wt))
# Alternatively, using %>%
mtcars_wt <- mtcars %>%
group_by(gear) %>%
summarize(mean_wt_by_gear = mean(wt))
Create the bar chart:
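A minimal sketch of this step, assuming the summarized data frame mtcars_wt from above (it is rebuilt here so that the block runs on its own). The object name p is used only for illustration.

```r
library(dplyr)
library(ggplot2)

# rebuild the summary so that this block is self-contained
mtcars_wt <- mtcars %>%
  group_by(gear) %>%
  summarize(mean_wt_by_gear = mean(wt))

# bar chart of values: one bar per gear, bar height = mean weight
p <- ggplot(mtcars_wt, aes(x = gear, y = mean_wt_by_gear)) +
  geom_col()
p
```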
To change the colour of the bars, use fill.
By default, there is no outline around the fill. To add an outline, use colour (or color).
Of course, you can combine the two settings:
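A possible version with both settings combined. The colours "lightblue" and "black" are arbitrary choices for illustration, and mtcars_wt is rebuilt so that the block is self-contained.

```r
library(dplyr)
library(ggplot2)

mtcars_wt <- mtcars %>%
  group_by(gear) %>%
  summarize(mean_wt_by_gear = mean(wt))

# fill sets the inside of the bars, colour sets the outline
p <- ggplot(mtcars_wt, aes(x = gear, y = mean_wt_by_gear)) +
  geom_col(fill = "lightblue", colour = "black")
p
```

Because fill and colour are set outside aes(), they apply to every bar rather than being mapped to a variable.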
Graph with grouped bars
The most basic bar chart of values has one categorical variable on the x-axis and one continuous variable on the y-axis. If you want to include another categorical variable to divide up the data, you can use a graph with grouped bars.
In mtcars, vs represents the engine of the car with 0 = V-shaped and 1 = straight. We can use vs to divide up the data in addition to gear by mapping it to fill. To create a grouped bar chart, set position = "dodge" in geom_col(); otherwise, you will get a stacked bar chart.
# prepare the data
by_gear_vs <- group_by(mtcars, gear, vs)
mtcars_wt2 <- summarize(by_gear_vs, mean_wt = mean(wt))
# convert to factor in the data
mtcars_wt2$vs <- as.factor(mtcars_wt2$vs)
# Alternatively, using %>%
mtcars_wt2 <- mtcars %>%
group_by(gear, vs) %>%
summarize(mean_wt = mean(wt)) %>%
ungroup() %>%
mutate(vs = as.factor(vs))
# plot
ggplot(mtcars_wt2, aes(x = gear, y = mean_wt, fill = vs)) +
geom_col(position = "dodge")
Without position = "dodge", we get a stacked bar chart:
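A sketch of the stacked version, rebuilding mtcars_wt2 so that the block stands alone. Note that the total height of each stacked bar is a sum of group means, so this plot mainly illustrates the stacking behaviour.

```r
library(dplyr)
library(ggplot2)

mtcars_wt2 <- mtcars %>%
  group_by(gear, vs) %>%
  summarize(mean_wt = mean(wt), .groups = "drop") %>%
  mutate(vs = as.factor(vs))

# no position argument: geom_col() stacks the bars by default
p <- ggplot(mtcars_wt2, aes(x = gear, y = mean_wt, fill = vs)) +
  geom_col()
p
```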
You can also convert vs to a factor in the call to ggplot():
# prepare the data
by_gear_vs <- group_by(mtcars, gear, vs)
mtcars_wt3 <- summarize(by_gear_vs, mean_wt = mean(wt))
# plot
ggplot(mtcars_wt3, aes(x = gear, y = mean_wt, fill = factor(vs))) +
geom_col(position = "dodge")
To change the colours of the bars:
ggplot(mtcars_wt2, aes(x = gear, y = mean_wt, fill = vs)) +
geom_col(position = "dodge") +
scale_fill_brewer(palette = "Pastel2")
You can try different palettes.
Using palette = "Oranges":
ggplot(mtcars_wt2, aes(x = gear, y = mean_wt, fill = vs)) +
geom_col(position = "dodge") +
scale_fill_brewer(palette = "Oranges")
Using a manually defined palette:
ggplot(mtcars_wt2, aes(x = gear, y = mean_wt, fill = vs)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("#cc6666", "#66cccc"))
Bar Charts of Counts
Creating a bar chart of counts is very similar to creating a bar chart of values.
Bar chart of the number of cars by gear in mtcars:
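A minimal sketch: with geom_bar() we only map x, and the counts are computed for us.

```r
library(ggplot2)

# bar chart of counts: the height of each bar is the number of cars with that gear
p <- ggplot(mtcars, aes(x = gear)) +
  geom_bar()
p

# the same counts, computed directly for comparison
table(mtcars$gear)
```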
Bar chart of the number of flights in each month in nycflights13:
Controlling the width (by default, width = 0.9):
Bar chart of the number of flights by origin and month:
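One way to draw this chart, as a sketch. Converting month with factor() and using position = "dodge" are choices made here for illustration; a stacked version would simply omit the position argument.

```r
library(ggplot2)
library(nycflights13)

# one bar per (month, origin) pair; bars for the same month are placed side by side
p <- ggplot(flights, aes(x = factor(month), fill = origin)) +
  geom_bar(position = "dodge") +
  labs(x = "month")
p
```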
6.2 Line Graph
Suppose you want to make a line graph of the daily average departure delay in flights. From now on, we will use %>% whenever it is appropriate.
avg_delay <-
flights %>%
group_by(month, day) %>%
summarize(delay = mean(dep_delay, na.rm = TRUE)) %>%
ungroup() %>%
mutate(Time = 1:365)
ggplot(avg_delay, aes(x = Time, y = delay)) +
geom_line()
Labeling the graph:
# notice how we put each argument on its own line when the arguments
# do not all fit on one line
ggplot(avg_delay, aes(x=Time, y=delay)) +
geom_line() +
labs(
y = "Average Delay",
title = "Daily Average Departure Delay of Flights from NYC in 2013"
)
By default, the range of the y-axis of a line graph is just enough to include all the y values in the data. Sometimes, you may want to change the range manually. For example, the range of the y-axis in the following graph does not include 0.
If you want to include 0 in the y range, you can use ylim():
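A small sketch of ylim(). It uses the built-in BOD dataset as a stand-in (an assumption made here so that the block needs no extra packages); the same call works for any line graph whose y values are all non-negative.

```r
library(ggplot2)

# default y range: just wide enough for the data, so 0 is not included
p_default <- ggplot(BOD, aes(x = Time, y = demand)) +
  geom_line()

# force the y-axis to run from 0 to the largest observed value
p_zero <- p_default +
  ylim(0, max(BOD$demand))
p_zero
```

Values outside the limits given to ylim() are dropped, so the upper limit should be at least the maximum of the data.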
Line Graph with multiple lines
Suppose we want to create a line graph showing the monthly average departure delay from the 3 airports in flights.
# prepare the data
flights_delay <- flights %>%
group_by(month, origin) %>%
summarize(delay = mean(dep_delay, na.rm = TRUE)) %>%
ungroup()
Line Graph:
ggplot(flights_delay, aes(x = month, y = delay, color = origin)) +
geom_line() +
scale_x_continuous(breaks = 1:12)
With different line types:
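A sketch that maps origin to linetype instead of color; flights_delay is rebuilt so that the block runs on its own.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

flights_delay <- flights %>%
  group_by(month, origin) %>%
  summarize(delay = mean(dep_delay, na.rm = TRUE), .groups = "drop")

# one line per airport, distinguished by line type
p <- ggplot(flights_delay, aes(x = month, y = delay, linetype = origin)) +
  geom_line() +
  scale_x_continuous(breaks = 1:12)
p
```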
Add the points on top of the lines:
ggplot(flights_delay, aes(x = month, y = delay, linetype = origin, color = origin)) +
geom_line() +
geom_point()
Change the point shapes according to origin:
ggplot(flights_delay, aes(x = month, y = delay,
linetype = origin, color = origin, shape = origin)) +
geom_line() +
geom_point()
To use a single shape for all the points, we can specify the shape in geom_point(). The default shape is shape = 19 and the default size is size = 1.5. fill is only applicable for shape = 21 to 25.
ggplot(flights_delay, aes(x = month, y = delay, linetype = origin, color = origin)) +
geom_line() +
geom_point(shape = 22, size = 3, fill = "white", color = "darkred")
Using another colour palette and changing the size of the lines:
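A possible version, as a sketch. The palette "Set1" and the width 1 are arbitrary choices. The linewidth argument assumes ggplot2 3.4.0 or later; in earlier versions the thickness of lines was set with size.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

flights_delay <- flights %>%
  group_by(month, origin) %>%
  summarize(delay = mean(dep_delay, na.rm = TRUE), .groups = "drop")

p <- ggplot(flights_delay, aes(x = month, y = delay, colour = origin)) +
  geom_line(linewidth = 1) +                # thicker lines
  scale_colour_brewer(palette = "Set1") +   # another colour palette
  scale_x_continuous(breaks = 1:12)
p
```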
6.3 Scatter Plots
Scatter plots are often used to visualize the relationship between two continuous variables. It is also possible to use a scatter plot when either or both variables are discrete.
The dataset heightweight contains sex, age, height and weight of some schoolchildren.
head(heightweight)
## sex ageYear ageMonth heightIn weightLb
## 1 f 11.92 143 56.3 85.0
## 2 f 12.92 155 62.3 105.0
## 3 f 12.75 153 63.3 108.0
## 4 f 13.42 161 59.0 92.0
## 5 f 15.92 191 62.5 112.5
## 6 f 14.25 171 62.5 112.0
To create a basic scatter plot, use geom_point():
You can control the shape, size, and color of the points as illustrated in the last section.
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point(size = 1.5, shape = 4, color = "blue")
If shape is between 21 and 25, you can control the colour inside the points and the outline of the points using fill and color, respectively.
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point(size = 1.5, shape = 22, fill = "red", color = "blue")
Visualizing an additional discrete variable
Suppose you want to use different colours for the points according to different categories of sex.
Suppose you want to use different shapes for the points according to different categories of sex.
You can use colours and shapes at the same time:
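A minimal sketch that maps sex to both aesthetics. Dropping either shape = sex or colour = sex from aes() gives the single-aesthetic versions.

```r
library(ggplot2)
library(gcookbook)  # provides the heightweight dataset

# sex is mapped to both the shape and the colour of the points
p <- ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, colour = sex)) +
  geom_point()
p
```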
You can change the shapes or colours manually:
ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, color = sex)) +
geom_point() +
scale_shape_manual(values = c(21,22)) +
scale_colour_brewer(palette = "Set2")
Visualizing an additional continuous variable
You may map an additional continuous variable to color.
Visualizing two additional discrete variables
Let’s create a new column to indicate whether the child weighs < 100 or >= 100 pounds (this is a discrete variable).
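One way to create such a column, as a sketch. The names heightweight2 and weightgroup match the plotting code that follows; the labels "< 100" and ">= 100" are an assumption made here for illustration.

```r
library(dplyr)
library(gcookbook)  # provides the heightweight dataset

# add a two-level character column based on the weight in pounds
heightweight2 <- heightweight %>%
  mutate(weightgroup = ifelse(weightLb < 100, "< 100", ">= 100"))

# number of children in each group
table(heightweight2$weightgroup)
```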
Now, we can add both sex and weightgroup to the plot in the following way:
ggplot(heightweight2, aes(x = ageYear, y = heightIn, shape = sex, fill = weightgroup)) +
geom_point() +
scale_shape_manual(values = c(21, 24)) +
scale_fill_manual(
values = c("red", "black"),
guide = guide_legend(override.aes = list(shape = 21)) # to change the legend
)
Changing the tick marks, limits and labels of the x-axis and y-axis:
ggplot(heightweight2, aes(x = ageYear, y = heightIn, shape = sex, fill = weightgroup)) +
geom_point() +
scale_shape_manual(values = c(21, 24)) +
scale_fill_manual(
values = c("red", "black"),
guide = guide_legend(override.aes = list(shape = 21)) # to change the legend
) +
scale_x_continuous(name = "Age (Year)", breaks = 11:18, limits = c(11, 18)) +
scale_y_continuous(name = "Height (In)", breaks = seq(50, 70, 5), limits = c(50, 73))
6.3.1 Overplotting
Overplotting refers to the situation when you have a large dataset so that the points in a scatter plot overlap and obscure each other.
# We can create a variable to store the "ggplot"
diamonds_ggplot <- ggplot(diamonds, aes(x = carat, y = price))
diamonds_ggplot +
geom_point()
Possible solutions for overplotting:
- Use smaller points (size)
# with diamonds_ggplot, we do not have to type
# ggplot(diamonds, aes(x = carat, y = price))
diamonds_ggplot +
geom_point(size = 0.1)
- Make the points semitransparent (alpha)
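A sketch of the semitransparent version. The value alpha = 0.1 is an arbitrary choice: roughly ten overlapping points are needed to reach full opacity.

```r
library(ggplot2)  # also provides the diamonds dataset

diamonds_ggplot <- ggplot(diamonds, aes(x = carat, y = price))

# alpha lies between 0 (fully transparent) and 1 (fully opaque)
p <- diamonds_ggplot +
  geom_point(alpha = 0.1)
p
```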
We can see some vertical bands at some values of carats, meaning that diamonds tend to be cut to those sizes.
- Bin the data into rectangles (stat_bin2d)
bins controls the number of bins in the x and y directions. The color of the rectangle indicates how many data points there are in the region.
With bins = 50:
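A minimal sketch of the binned plot with 50 bins in each direction:

```r
library(ggplot2)  # also provides the diamonds dataset

diamonds_ggplot <- ggplot(diamonds, aes(x = carat, y = price))

# each rectangle is filled according to the number of points that fall in it
p <- diamonds_ggplot +
  stat_bin2d(bins = 50)
p
```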
- Overplotting can also occur when the data is discrete on one or both axes.
In the following example, we use the dataset ChickWeight, where Time is a discrete variable.
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
# create a base plot
cw_ggplot <- ggplot(ChickWeight, aes(x = Time, y = weight))
cw_ggplot +
geom_point()
You may randomly jitter the points:
Jittering the points means a small amount of random variation is added to the location of each point. If you only want to jitter in the x-direction:
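A sketch of both variants. The jitter width 0.5 is an arbitrary choice, and setting height = 0 switches off the jitter in the y-direction.

```r
library(ggplot2)

cw_ggplot <- ggplot(ChickWeight, aes(x = Time, y = weight))

# jitter in both directions (default amounts)
p_both <- cw_ggplot +
  geom_jitter()

# jitter in the x-direction only: the weights stay at their true values
p_x <- cw_ggplot +
  geom_jitter(width = 0.5, height = 0)
p_x
```

Since the jitter is random, the plot looks slightly different each time the code is run.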
6.3.2 Labelling points in a scatter plot
We can use annotate() or geom_text_repel() to label points in a scatter plot. For the latter, we have to install the package ggrepel.
We will use the countries dataset in the package gcookbook and visualize the relationship between health expenditures and infant mortality rate. We will consider a subset of the data, keeping only the observations from 2009 for countries with more than \(2,000\) USD of health expenditure per capita:
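A sketch of the subsetting step; the name countries_subset matches the plotting code that follows.

```r
library(dplyr)
library(gcookbook)  # provides the countries dataset

# keep rows from 2009 with health expenditure above 2000 USD per capita;
# rows with a missing healthexp are dropped by filter()
countries_subset <- countries %>%
  filter(Year == 2009, healthexp > 2000)

head(countries_subset)
```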
Using annotate:
# find out the x and y coordinates for the point corresponding to Canada
canada_x <- filter(countries_subset, Name == "Canada")$healthexp
canada_y <- filter(countries_subset, Name == "Canada")$infmortality
ggplot(countries_subset, aes(x = healthexp, y = infmortality)) +
geom_point() +
annotate("text", x = canada_x, y = canada_y + 0.2, label = "Canada")
Label all the points with geom_text_repel():
# to use geom_text_repel, load the package ggrepel
library(ggrepel)
ggplot(countries_subset, aes(x = healthexp, y = infmortality)) +
geom_point() +
geom_text_repel(aes(label = Name), size = 3)
Label all the points with geom_label_repel() (with a box around the label):
6.4 Summarizing Data Distributions
6.4.1 Histogram
A histogram can be used to visualize the distribution of a variable. We will illustrate how to create histograms using the dataset birthwt from the package MASS.
birthwt contains data on 189 birth weights with some covariates of the mothers.
Take a look at the dataset:
head(birthwt)
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 182 2 0 0 0 1 0 2523
## 86 0 33 155 3 0 0 0 0 3 2551
## 87 0 20 105 1 1 0 0 0 1 2557
## 88 0 21 108 1 1 0 0 1 2 2594
## 89 0 18 107 1 1 0 0 1 0 2600
## 91 0 21 124 3 0 0 0 0 0 2622
Basic histogram:
Plot a histogram with density (not frequency):
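A sketch of both the basic histogram and the density version. Setting bins = 30 explicitly (the default value) is a choice made here to avoid the message that geom_histogram() prints when no bin setting is given; after_stat() assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)
data(birthwt, package = "MASS")  # load birthwt without attaching MASS

# basic histogram: the y-axis shows the count in each bin
p_count <- ggplot(birthwt, aes(x = bwt)) +
  geom_histogram(bins = 30)

# density histogram: bar heights are rescaled so that the total area is 1
p_density <- ggplot(birthwt, aes(x = bwt)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30)
p_density
```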
To compare two histograms
- Use facet_grid() to display two histograms in the same plot.
Suppose we group the data according to the smoking status during pregnancy and we want to display the two histograms of the birth weight:
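A minimal sketch with one panel per smoking status. The formula smoke ~ . places the panels in rows; in this version the panels are labelled with the raw values 0 and 1.

```r
library(ggplot2)
data(birthwt, package = "MASS")  # load birthwt without attaching MASS

# one histogram of birth weight for each value of smoke, stacked in rows
p <- ggplot(birthwt, aes(x = bwt)) +
  geom_histogram(bins = 30) +
  facet_grid(smoke ~ .)
p
```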
To change the label, we can change the content of the variable:
# create another dataset
birthwt_mod <- birthwt
birthwt_mod$smoke <- ifelse(birthwt_mod$smoke == 1, "Smoke", "No Smoke")
ggplot(birthwt_mod, aes(x = bwt)) +
geom_histogram() +
facet_grid(smoke ~ .)
Alternatively, we can use recode_factor():
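A sketch using recode_factor() from dplyr. The backticks are needed because the old values 0 and 1 are numbers; the order of the arguments determines the order of the factor levels.

```r
library(dplyr)
data(birthwt, package = "MASS")  # load birthwt without attaching MASS

birthwt_mod <- birthwt
# recode 0/1 into a factor with readable labels
birthwt_mod$smoke <- recode_factor(birthwt_mod$smoke,
                                   `0` = "No Smoke",
                                   `1` = "Smoke")

levels(birthwt_mod$smoke)
```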
- Use the fill aesthetic to put two groups in the same plot with different colors. We need to set position = "identity"; otherwise, the bars will be stacked on top of each other vertically, which is not what we want.
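A sketch of the overlaid histograms. The value alpha = 0.4 is an arbitrary choice that keeps both groups visible where they overlap, and birthwt_mod is rebuilt so that the block stands alone.

```r
library(ggplot2)
data(birthwt, package = "MASS")  # load birthwt without attaching MASS

birthwt_mod <- birthwt
birthwt_mod$smoke <- ifelse(birthwt_mod$smoke == 1, "Smoke", "No Smoke")

# position = "identity": both histograms start at 0 instead of being stacked
p <- ggplot(birthwt_mod, aes(x = bwt, fill = smoke)) +
  geom_histogram(position = "identity", alpha = 0.4, bins = 30)
p
```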
It is also possible to use both facet_grid() and fill when we want to group the data by two discrete variables. We will illustrate this by grouping according to the smoking status and the race. We also add scales = "free" so that the ranges of the y-axes will be adjusted according to the data in each histogram.
# change the name so that the labels can be understood easily
birthwt_mod$race[which(birthwt_mod$race==1)] = "White"
birthwt_mod$race[which(birthwt_mod$race==2)] = "Black"
birthwt_mod$race[which(birthwt_mod$race==3)] = "Other"
ggplot(birthwt_mod, aes(x = bwt, fill = smoke)) +
geom_histogram(position = "identity", alpha = 0.4) +
facet_grid(race ~ ., scales = "free")
Note: the dataset in this example is not large, so grouping by two variables leaves few observations in each group and may not give us a very good understanding of the data.
6.4.2 Kernel Density Estimate
Kernel density estimation is a nonparametric method to estimate the density of the samples. A nonparametric method is one in which we do not impose a parametric model, that is, a model with a finite-dimensional parameter \(\theta \in \mathbb{R}^d\) for some finite \(d\).
Let \(X_1,\ldots,X_n\) be i.i.d. random variables from some distribution with density \(f\), with observed values \(x_1,\ldots,x_n\). The histogram estimate of \(f\) at a point \(x_0\) is
\[ \hat{f}(x_0) = \frac{\text{number of $x_i$ in the bin containing $x_0$}}{n h}, \]
where the bin width is \(h\). As we already know, the histogram will not give a smooth estimate of the density. One may use another method called the kernel density estimator, which can produce a smooth estimate of the density. The kernel density estimator is
\[ \hat{f}_n(x_0) = \frac{1}{nh}\sum^n_{i=1} K \bigg( \frac{x_0 - x_i}{h} \bigg), \]
where \(K\) is a kernel and \(h\) is the bandwidth. For our purposes, a kernel is a non-negative symmetric function such that \(\int^\infty_{-\infty}K(x)dx = 1\) and \(\int^\infty_{-\infty} x K(x)dx =0\). For example,
\[\begin{aligned} \text{the boxcar kernel:} \quad & K(x) = \frac{1}{2}I(|x| \leq 1)\\ \text{the Gaussian kernel:} \quad & K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \\ \text{the Epanechnikov kernel:} \quad & K(x) = \frac{3}{4}(1-x^2)I(|x| \leq 1) \\ \text{the tricube kernel:} \quad & K(x) = \frac{70}{81}(1-|x|^3)^3I(|x| \leq 1), \end{aligned}\]
where \(I(|x| \leq 1) = 1\) if \(|x| \leq 1\) and equals \(0\) otherwise.
Since the kernel is symmetric around \(0\), its value at \((x_0-x_i)/h\) depends only on the distance \(|x_0-x_i|/h\) from \(0\). For the above kernels, the value of the kernel does not increase as we evaluate it at a point further from \(0\). Therefore, data points close to \(x_0\) receive larger weights in the estimate \(\hat{f}_n(x_0)\).
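To make the formula concrete, here is a small sketch that evaluates the kernel density estimator by hand with the Gaussian kernel. The function name kde_gaussian, the toy sample x and the bandwidth h = 1 are all hypothetical choices for illustration; in practice one would use density() or geom_density().

```r
# Gaussian kernel density estimate at a single point x0:
# f_hat(x0) = 1 / (n h) * sum_i K((x0 - x_i) / h), with K = dnorm
kde_gaussian <- function(x0, x, h) {
  sum(dnorm((x0 - x) / h)) / (length(x) * h)
}

x <- c(1, 2, 4, 7)  # hypothetical toy sample
h <- 1              # hypothetical bandwidth

# evaluate the estimate on a grid of points
grid <- seq(-3, 11, by = 0.5)
f_hat <- sapply(grid, kde_gaussian, x = x, h = h)

# at x0 = 2, the observation 2 gets a larger kernel weight than the observation 7
dnorm((2 - 2) / h) > dnorm((2 - 7) / h)
```

Plotting f_hat against grid gives a smooth curve; repeating the computation with a different h shows the effect of the bandwidth.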
The bandwidth will control the smoothness of the estimate: larger bandwidth will result in a smoother curve and smaller bandwidth will result in a noisy and rough curve. We can create a kernel density estimate of the distribution using geom_density().
ggplot(birthwt, aes(x = bwt)) +
geom_density() +
geom_density(adjust = 0.25, color = "red") + # smaller bandwidth -> noisy
geom_density(adjust = 2, color = "blue") # large bandwidth -> smoother
Overlaying a density curve with a histogram
ggplot(birthwt, aes(x = bwt)) +
geom_histogram(aes(y = after_stat(density)), fill = "cornsilk") + # ..density.. is deprecated since ggplot2 3.4.0
geom_density()
Displaying kernel density Estimates from grouped data
To use geom_density() to display kernel density estimates from grouped data, the grouping variable must be a factor or a character vector. Recall that in birthwt_mod, which we created earlier, the smoke variable is a character vector.
With color:
With fill:
ggplot(birthwt_mod, aes(x = bwt, fill = smoke)) +
geom_density(alpha = 0.3) # to control the transparency
With facet_grid():
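A sketch of the faceted version, rebuilding birthwt_mod so that the block is self-contained:

```r
library(ggplot2)
data(birthwt, package = "MASS")  # load birthwt without attaching MASS

birthwt_mod <- birthwt
birthwt_mod$smoke <- ifelse(birthwt_mod$smoke == 1, "Smoke", "No Smoke")

# one density curve per smoking status, each in its own panel
p <- ggplot(birthwt_mod, aes(x = bwt)) +
  geom_density() +
  facet_grid(smoke ~ .)
p
```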
6.5 Saving your plots
There are two types of image files: vector and raster (bitmap).
Raster images are pixel-based. When you zoom in on the image, you can see the individual pixels. Two examples are JPG and PNG files. JPG uses lossy compression, so its image quality is typically lower than that of PNG, which is lossless.
Vector images are constructed using mathematical formulas. You can resize the image without a loss in image quality. When you zoom in on the image, it is still smooth and clear. Two examples are AI and PDF files.
6.5.1 Outputting to pdf vector files
Suppose you want to save the plot from the following code:
# first argument is the file name
# width and height are in inches
pdf("filename.pdf", width = 4, height = 4)
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()
dev.off()
Outputting to a pdf file:
- usually the best option
- usually smaller than bitmap files such as PNG files.
- when you have overplotting (many points on the plot), a PDF file can be much larger than a PNG file.
6.5.2 Outputting to bitmap files
# width and height are in pixels
png("png_plot.png", width = 600, height = 600)
ggplot(mtcars, aes(x=wt,y=mpg)) +
geom_point()
dev.off()
For high-quality print output, it is recommended to use at least 300 ppi (ppi = pixels per inch). Suppose you want to create a 4x4-inch PNG file with 300 ppi:
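A sketch of one way to do this: multiply the size in inches by the resolution to get the size in pixels, and pass the resolution through res. The file name is hypothetical, and the file is written to a temporary directory here only to keep the example tidy. print() is used so that the plot is also drawn when the code runs inside a script or a function.

```r
library(ggplot2)

ppi <- 300
out_file <- file.path(tempdir(), "png_plot_300ppi.png")  # hypothetical file name

# 4 x 4 inches at 300 ppi = 1200 x 1200 pixels
png(out_file, width = 4 * ppi, height = 4 * ppi, res = ppi)
print(
  ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point()
)
invisible(dev.off())
```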
6.6 Axes, appearance
6.7 Summary
6.7.1 Bar charts
- examples of using the pipe %>% together with ggplot
- create bar charts of counts
- create bar charts of values
- change “fill” and “outline” of the bars
- create grouped bar charts
- create stacked bar charts
- convert a variable into a factor in ggplot
- use different colour palettes
- control the width of the bars
6.7.2 Line graphs
- create line graphs
- label the graph
- change the range of y-axis
- create line graphs with multiple lines
- use multiple geoms (geometric objects) (e.g. adding the points on top of the lines)
- change shape, size, fill, outline of points
- change line type
6.7.3 Scatter plot
- create scatter plots
- visualize an additional discrete variable
- visualize an additional continuous variable
- visualize two additional discrete variables
- overplotting (use smaller points, make points semitransparent, bin data into rectangles, jitter the points)
- label points in a scatter plot
6.7.4 Summarizing data distributions
- create histograms (frequency and density)
- compare two histograms (facet_grid(), fill)
- create histograms with two additional discrete variables
- create kernel density estimates
- overlay a density curve with a histogram
- display kernel density estimates from grouped data (color, fill, facet_grid())