Chapter 6 Data Visualization with ggplot2

Main reference for this chapter: R graphics cookbook (https://r-graphics.org/)

Other useful references:

In the previous chapter, we learned how to create some basic plots with base R. In this chapter, we will use ggplot2 for data visualization.

We will use some datasets from the package gcookbook, so install it if needed.

install.packages("gcookbook")

Load the packages gcookbook, tidyverse and nycflights13.

library(gcookbook)    # datasets for illustration
library(tidyverse)    # includes ggplot2 and dplyr
library(nycflights13) # includes the dataset flights

6.1 Bar charts

We start with bar charts. Many ideas in this section (mapping variables to aesthetics, adding layers, changing colours, adding labels) are transferable to other plot types.

Recall that there are two types of bar charts:

Recall there are two types of bar charts:

  1. Bar chart of values: x-axis is a categorical variable, y-axis is a numeric value (for example, mean, median).

  2. Bar chart of counts: x-axis is a categorical variable, y-axis is the number of observations in each category.

In ggplot:

  1. For a bar chart of values, use geom_col(). (This is the same as geom_bar(stat = "identity")

  2. For a bar chart of counts, use geom_bar(). By default it uses stat = "count".

Bar chart of values:

Consider mtcars. We create a bar chart of the mean car weight grouped by the number of gears. Since gear is a category, we convert it to a factor.

mtcars_wt <- mtcars %>% 
  group_by(gear) %>% 
  summarize(mean_wt_by_gear = mean(wt)) %>% 
  ungroup() %>% 
  mutate(gear = factor(gear))

Create the bar chart:

ggplot(mtcars_wt, aes(x = gear, y = mean_wt_by_gear)) +
  geom_col() +
  labs(
    title = "Mean Weight by Number of Gears",
    x = "Number of gears",
    y = "Mean weight (1000 lbs)"
  )

To change the fill colour of bars, use fill.

ggplot(mtcars_wt, aes(x = gear, y = mean_wt_by_gear)) +
  geom_col(fill = "lightblue") +
  labs(
    title = "Mean Weight by Number of Gears",
    x = "Number of gears",
    y = "Mean weight (1000 lbs)"
  )

To add an outline around the bars, use colour (or color).

ggplot(mtcars_wt, aes(x = gear, y = mean_wt_by_gear)) +
  geom_col(color = "red") +
  labs(
    title = "Mean Weight by Number of Gears",
    x = "Number of gears",
    y = "Mean weight (1000 lbs)"
  )

You can combine both:

ggplot(mtcars_wt, aes(x = gear, y = mean_wt_by_gear)) +
  geom_col(fill = "lightblue", color = "red") +
  labs(
    title = "Mean Weight by Number of Gears",
    x = "Number of gears",
    y = "Mean weight (1000 lbs)"
  )

Grouped bars (adding another categorical variable)

A basic bar chart of values uses one categorical variable on the x-axis. If you want to include another categorical variable to divide up the bars, map that variable to fill.

In mtcars, vs represents the engine of the car with 0 = V-shaped and 1 = straight. We can use vc to divide up the data in addition to gear using fill. Since gear and vs are categories, we convert them to factors.

# prepare the data
mtcars_wt2 <- mtcars %>% 
  group_by(gear, vs) %>% 
  summarize(mean_wt = mean(wt)) %>% 
  ungroup() %>% 
  mutate(
    gear = factor(gear),
    vs = factor(vs)
  )

To create a grouped bar chart, set position = "dodge" in geom_col(); otherwise, you will get a stacked bar chart. Because we will demostrate several similar plots with the same labs. We will first define the labels once:

common_labs <- labs(
  title = "Mean Weight by Gears and Engine Shape",
  x = "Gears",
  y = "Mean weight",
  fill = "Engine type"
)
# plot
ggplot(mtcars_wt2, aes(x = gear, y = mean_wt, fill = vs)) +
  geom_col(position = "dodge") +
  common_labs

Without position = "dodge", bars are stacked (this is the default):

ggplot(mtcars_wt2, aes(x = gear, y = mean_wt, fill = vs)) +
  geom_col() +
  common_labs

To change the colours of the bars, you can use scale_fill_brewer():

ggplot(mtcars_wt2, aes(x = gear, y = mean_wt, fill = vs)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Pastel2") +
  common_labs

You can try different palettes:

library(RColorBrewer)
display.brewer.all()

Using palette = "Oranges":

ggplot(mtcars_wt2, aes(x = gear, y = mean_wt, fill = vs)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Oranges") +
  common_labs

Using a manually defined palette:

ggplot(mtcars_wt2, aes(x = gear, y = mean_wt, fill = vs)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("#cc6666", "#66cccc")) +
  common_labs

Bar Charts of Counts

A bar chart of counts is similar, except you only supply the x variable and let ggplot compute counts.

Bar chart of the number of cars by gear in mtcars:

ggplot(mtcars, aes(x = factor(gear))) +
  geom_bar() +
  labs(
    title = "Number of Cars by Gears",
    x = "Number of gears",
    y = "Count"
  )

Bar chart of the number of flights by each month in nycflights13:

ggplot(flights, aes(x = factor(month))) +
  geom_bar(fill = "lightblue") +
  labs(
    title = "Number of Flights by Month",
    x = "Month",
    y = "Count"
  )

Controlling bar width (default is width = 0.9):

ggplot(flights, aes(x = factor(month))) +
  geom_bar(fill = "lightblue", width = 0.5) +
  labs(
    title = "Number of Flights by Month",
    x = "Month",
    y = "Count"
  )

Bar chart of the number of flights by origin and month (grouped bars):

ggplot(flights, aes(x = origin, fill = factor(month))) +
  geom_bar(position = "dodge", color = "black") +
  labs(
    title = "Number of Flights by Origin and Month",
    x = "Origin",
    y = "Count",
    fill = "Month"
  )

6.2 Line Graph

Suppose we want to make a line graph of the daily average departure delay in the flights dataset.

avg_delay <- flights %>% 
  group_by(year, month, day) %>% 
  summarize(delay = mean(dep_delay, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(
    date = as.Date(sprintf("%d-%02d-%02d", year, month, day))
  )

ggplot(avg_delay, aes(x = date, y = delay)) +
  geom_line() +
  labs(
    x = "Date",
    y = "Average departure delay (minutes)",
    title = "Daily Average Departure Delay of Flights from NYC in 2013"
  )

Line Graph with multiple lines

We now compare the average monthly departure delay across the three NYC airports.

# prepare the data
flights_delay <- flights %>%
  group_by(year, month, origin) %>% 
  summarize(delay = mean(dep_delay, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(
    date = as.Date(sprintf("%d-%02d-01", year, month))
  )

Line Graph:

ggplot(flights_delay, aes(x = date, y = delay, color = origin)) +
  geom_line() +
  labs(
    x = "Month",
    y = "Average departure delay (minutes)",
    title = "Average Monthly Departure Delay by Airport"
  )

Different aesthetics can be used to distinguish groups.

Using line type instead of color:

ggplot(flights_delay, aes(x = date, y = delay, linetype = origin)) +
  geom_line()

As in base R, it is often helpful to show both lines and points:

ggplot(
  flights_delay,
  aes(x = date, y = delay, color = origin, linetype = origin)) +
  geom_line() +
  geom_point()

Change point shapes by group:

ggplot(
  flights_delay,
  aes(x = date, y = delay,
      color = origin, linetype = origin, shape = origin)) +
  geom_line() +
  geom_point()

To use a single point shape for all groups, specify it inside geom_point(). The default shape is 16. The fill argument only applies to shapes 21–25.

ggplot(flights_delay, aes(x = date, y = delay, color = origin)) +
  geom_line() +
  geom_point(shape = 22, size = 3, fill = "white", color = "darkred")

Using a different colour palette and thicker lines:

ggplot(flights_delay, aes(x = date, y = delay, color = origin)) +
  geom_line(size = 2) +
  geom_point(shape = 22, size = 3, fill = "white", color = "darkred") +
  scale_colour_brewer(palette = "Set2")

6.3 Scatter Plots

Scatter plots are commonly used to visualize the relationship between two continuous variables. They can also be useful when one or both variables are discrete, especially for exploratory analysis.

The dataset heightweight contains information on sex, age, height, and weight of a group of schoolchildren.

head(heightweight)
##   sex ageYear ageMonth heightIn weightLb
## 1   f   11.92      143     56.3     85.0
## 2   f   12.92      155     62.3    105.0
## 3   f   12.75      153     63.3    108.0
## 4   f   13.42      161     59.0     92.0
## 5   f   15.92      191     62.5    112.5
## 6   f   14.25      171     62.5    112.0

To create a basic scatter plot, use geom_point():

ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point()

You can set the appearance of the points by specifying arguments directly inside geom_point().

ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(size = 1.5, shape = 4, color = "blue")

If shape is between 21 and 25, you can control the interior (fill) and the outline (color) separately.

ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point(size = 1.5, shape = 22, fill = "red", color = "blue")

Visualizing an additional discrete variable

A common use of scatter plots is to show how the relationship differs across groups.

For example, use different colours for different values of sex:

ggplot(heightweight, aes(x = ageYear, y = heightIn, color = sex)) +
  geom_point()

Or use different shapes:

ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex)) +
  geom_point()

You can combine both colour and shape:

ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, color = sex)) +
  geom_point()

To manually control shapes and colours, use scale_shape_manual() and scale_colour_*().

ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, color = sex)) +
  geom_point() +
  scale_shape_manual(values = c(21,22)) +
  scale_colour_brewer(palette = "Set2")

Visualizing an additional continuous variable

You can also map a continuous variable to color. In this case, ggplot automatically uses a gradient.

ggplot(heightweight, aes(x = ageYear, y = heightIn, color = weightLb)) +
  geom_point()

Visualizing two additional discrete variables

Now create a new discrete variable indicating whether a child weighs less than 100 pounds or at least 100 pounds.

heightweight2 <- heightweight %>%
  mutate(weightgroup = ifelse(weightLb < 100, "< 100", ">= 100"))

We can visualize both sex and weightgroup by mapping one to shape and the other to fill.

ggplot(heightweight2, aes(x = ageYear, y = heightIn, shape = sex, fill = weightgroup)) +
  geom_point() +
  scale_shape_manual(values = c(21, 24)) +
  scale_fill_manual(
    values = c("red", "black"),
    guide = guide_legend(override.aes = list(shape = 21)) # to change the legend
  ) 

Controlling axis ticks, limits, and labels

ggplot(heightweight2,
       aes(x = ageYear, y = heightIn,
           shape = sex, fill = weightgroup)) +
  geom_point() +
  scale_shape_manual(values = c(21, 24)) +
  scale_fill_manual(
    values = c("red", "black"),
    guide = guide_legend(override.aes = list(shape = 21))
  ) +
  scale_x_continuous(
    name = "Age (years)",
    breaks = 11:18
  ) +
  scale_y_continuous(
    name = "Height (inches)",
    breaks = seq(50, 70, 5)
  ) +
  # controlling x-axis, y-axis range
  coord_cartesian(
    xlim = c(11, 18),
    ylim = c(50, 73)
  )

6.3.1 Controlling the y-axis range

By default, ggplot chooses the y-axis range to include all observed values. Sometimes you may want to control the range manually.

Raw departure delays for a single busy day. Departure delays have extreme positive outliers, which stretch the y-axis and compress most of the data near 0.

one_day <- flights %>% 
  filter(month == 1, day == 1)

ggplot(one_day, aes(x = sched_dep_time, y = dep_delay)) +
  geom_point(alpha = 0.5) +
  labs(
    x = "Scheduled departure time",
    y = "Departure delay (minutes)",
    title = "Departure Delays on January 1, 2013"
  )

  • Most points clustered near 0

  • A few extreme delays (hundreds of minutes)

Zoom in:

ggplot(one_day, aes(x = sched_dep_time, y = dep_delay)) +
  geom_point(alpha = 0.5) +
  coord_cartesian(ylim = c(-20, 100)) +
  labs(
    title = "Departure Delays on January 1, 2013 (Zoomed In)"
  )

6.3.2 Overplotting

Overplotting occurs when a dataset is large enough that many points in a scatter plot overlap, making it difficult to see the underlying structure or density of the data.

# Create a ggplot object to reuse
diamonds_ggplot <- ggplot(diamonds, aes(x = carat, y = price))

diamonds_ggplot +
  geom_point()

In this plot, many points lie on top of each other, obscuring patterns in the data.

Common solutions to overplotting:

  1. Use smaller points (size): Reducing the point size decreases overlap.
diamonds_ggplot +
  geom_point(size = 0.1)

  1. Use transparency (alpha): Making points partially transparent allows regions with many overlapping points
diamonds_ggplot +
  geom_point(alpha = 0.05, size = 0.1)  # alpha ranges from 0 (transparent) to 1

From this plot, we can see vertical bands at certain carat values, indicating that diamonds tend to be cut at preferred sizes.

  1. Bin the data into rectangles (stat_bin2d): For very large datasets, it can be more effective to summarize density directly.

stat_bin2d() divides the plane into rectangular bins (similar to a 2D histogram). The color of each rectangle indicates how many observations fall in that region.

# By default, bins = 30
diamonds_ggplot +
  stat_bin2d(bins = 30)

Using a larger number of bins gives finer resolution:

diamonds_ggplot +
  stat_bin2d(bins = 50) +
  scale_fill_gradient(low = "lightblue", high = "red")

  1. Overplotting with discrete variables. Overplotting can also occur when one or both variables take only a few discrete values.

In the dataset ChickWeight, the variable Time is discrete.

head(ChickWeight)
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
# Base plot
cw_ggplot <- ggplot(ChickWeight, aes(x = Time, y = weight))

cw_ggplot +
  geom_point()

Because many observations share the same x-value, points overlap heavily.

A common solution is to jitter the points, adding a small amount of random noise to their positions.

cw_ggplot +
  geom_point(position = "jitter")

If you only want to jitter in one direction (for example, horizontally):

cw_ggplot +
  geom_point(position = position_jitter(width = 0.5, height = 0))

Jittering helps reveal how many observations occur at each discrete value without changing the underlying data.

6.3.3 Labelling points in a scatter plot

In scatter plots, we sometimes want to label individual points to identify specific observations. In ggplot, this can be done using annotate() or functions from the package ggrepel.

We will use the countries dataset from the package gcookbook to visualize the relationship between health expenditure per capita and infant mortality rate.

We focus on data from 2009 and countries with health expenditure greater than \(2{,}000\) USD per capita.

countries_subset <- countries %>% 
  filter(Year == 2009, healthexp > 2000)

Labeling a specific point using annotate()

annotate() is useful when you want to label one or a few specific points.

Here we label the point corresponding to Canada.

# extract coordinates for Canada
canada_x <- filter(countries_subset, Name == "Canada")$healthexp
canada_y <- filter(countries_subset, Name == "Canada")$infmortality

ggplot(countries_subset, aes(x = healthexp, y = infmortality)) + 
  geom_point() +
  annotate(
    "text",
    x = canada_x,
    y = canada_y + 0.2,
    label = "Canada"
  )

The small vertical offset (+ 0.2) prevents the label from overlapping with the point.

Labeling many points using geom_text_repel()

When labeling many points, text labels often overlap and become unreadable. The package ggrepel automatically adjusts label positions to avoid overlap.

# load the ggrepel package
library(ggrepel)

ggplot(countries_subset, aes(x = healthexp, y = infmortality)) + 
  geom_point() +
  geom_text_repel(aes(label = Name), size = 3)

Labels with boxes: geom_label_repel()

geom_label_repel() works similarly to geom_text_repel(), but draws a box around each label.

# geom_label_repel also depends on the package ggrepel
ggplot(countries_subset, aes(x = healthexp, y = infmortality)) + 
  geom_point() +
  geom_label_repel(aes(label = Name), size = 3)

For more examples and customization options, see https://ggrepel.slowkow.com/articles/examples.html

6.4 Summarizing Data Distributions

6.4.1 Histogram

A histogram is used to visualize the distribution of a single continuous variable.

We illustrate histograms using the dataset birthwt from the package MASS.

library(MASS)

The dataset birthwt contains birth weights of 189 babies, along with several covariates describing the mothers.

Take a look at the data:

head(birthwt)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

Basic histogram:

ggplot(birthwt, aes(x = bwt)) +
  geom_histogram(bins = 30)

By default, ggplot chooses the number of bins automatically (30). Different choices of bins can change the appearance of the histogram.

Histogram with density instead of frequency

Sometimes it is useful to plot density rather than counts, especially when comparing distributions.

ggplot(birthwt, aes(x = bwt)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30)

Comparing histograms

  1. Use facet_grid().

We first compare birth weight distributions by smoking status during pregnancy.

ggplot(birthwt, aes(x = bwt)) +
  geom_histogram(bins = 30) +
  facet_grid(smoke ~ .)

To improve the labels, we recode the smoking variable.

birthwt_mod <- birthwt %>%
  mutate(smoke = ifelse(smoke == 1, "Smoke", "No Smoke"))

ggplot(birthwt_mod, aes(x = bwt)) +
  geom_histogram(bins = 30) +
  facet_grid(smoke ~ .)

Alternatively, we can use recode_factor:

birthwt_mod$smoke <- recode_factor(
  birthwt_mod$smoke,
  "0" = "No Smoke",
  "1" = "Smoke"
)

Faceting is often the clearest way to compare distributions across groups.

  1. Overlaying histograms using fill().

Another approach is to overlay histograms using different colors.

ggplot(birthwt_mod, aes(x = bwt, fill = smoke)) +
  geom_histogram(
    position = "identity",
    alpha = 0.4,
    bins = 30
  )

Here:

  • position = "identity" prevents stacking,

  • alpha adds transparency to reveal overlap.

This approach emphasizes differences but can become cluttered when many groups are present.

Grouping by two discrete variables

We now group the data by smoking status and race.

birthwt_mod <- birthwt_mod %>%
  mutate(
    race = recode(
      race,
      `1` = "White",
      `2` = "Black",
      `3` = "Other"
    )
  )

ggplot(birthwt_mod, aes(x = bwt, fill = smoke)) +
  geom_histogram(position = "identity", alpha = 0.4, bins = 30) +
  facet_grid(race ~ ., scales = "free")

Using scales = "free" allows each panel to adjust its y-axis range independently.

Remark: Because the dataset is relatively small, grouping by two variables may result in noisy histograms.

Another example: overlapping densities

The following example illustrates overlapping distributions using simulated data.

y1 <- rnorm(1000, 1)
y2 <- rnorm(1000, 2)
df <- tibble(y1, y2)

df_long <- tibble(
  value = c(y1, y2),
  group = rep(c("y1", "y2"), each = 1000)
)

ggplot(df_long, aes(x = value, fill = group)) +
  geom_histogram(
    aes(y = after_stat(density)),
    position = "identity",
    alpha = 0.5,
    bins = 30
  )

When comparing distributions, it is usually clearer to reshape the data into a long format and map the group to fill, rather than layering multiple geoms.

6.4.2 Kernel Density Estimate

Kernel density estimation is a nonparametric method for estimating the density of a sample. Here, nonparametric means that we do not impose a parametric model. A parametric model typically involves a finite-dimensional parameter \(\theta \in \mathbb{R}^d\) for some finite \(d\).

Let \(X_1,\ldots,X_n\) be i.i.d. random variables from a distribution with density \(f\). A histogram-based density estimate at a point \(x_0\) can be written as \[\begin{equation*} \hat{f}(x_0) = \frac{\text{number of $x_i$ in the bin containing $x_0$}}{n h}, \end{equation*}\] where the bin width is \(h\). A histogram does not produce a smooth estimate of the density. A commonly used smooth alternative is the kernel density estimator \[\begin{equation*} \hat{f}_n(x_0) = \frac{1}{nh}\sum^n_{i=1} K \bigg( \frac{x_0 - x_i}{h} \bigg), \end{equation*}\] where \(K\) is a kernel and \(h\) is the bandwidth.

For our purposes, a kernel is a non-negative symmetric function such that \(\int^\infty_{-\infty}K(x)dx = 1\) and \(\int^\infty_{-\infty} x K(x)dx =0\). Examples include: \[\begin{eqnarray*} \text{the boxcar kernel:} && K(x) = \frac{1}{2}I(|x| \leq 1)\\ \text{the Gaussian kernel:} && K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \\ \text{the Epanechnikov kernel:} && K(x) = \frac{3}{4}(1-x^2)I(|x| \leq 1) \\ \text{the tricube kernel:} && K(x) = \frac{70}{81}(1-|x|^3)^3I(|x| \leq 1), \end{eqnarray*}\] where \(I(|x| \leq 1) = 1\) if \(|x| \leq 1\) and equals \(0\) otherwise.

Because the kernel is symmetric around 0, the magnitude \((x_0-x_i)/h\) represents how far \(x_i\) is from \(x_0\) in units of the bandwidth. For the kernels above, the kernel value is smaller when the argument is farther from 0, so observations closer to \(x_0\) contribute larger weights in estimating \(\hat{f}_n(x_0)\).

The bandwidth controls smoothness: a larger bandwidth produces a smoother curve, while a smaller bandwidth produces a rougher curve.

We can create a kernel density estimate using geom_density(). The argument adjust multiplies the default bandwidth (so smaller adjust corresponds to a smaller bandwidth).

ggplot(birthwt, aes(x = bwt)) +
  geom_density() +
  geom_density(adjust = 0.25, color = "red") +  # smaller bandwidth -> rougher
  geom_density(adjust = 2, color = "blue")      # larger bandwidth -> smoother

Overlaying a density curve with a histogram

ggplot(birthwt, aes(x = bwt)) +
  geom_histogram(fill = "cornsilk", aes(y = after_stat(density))) +
  geom_density()

Displaying kernel density Estimates from grouped data

To use geom_density() to display kernel density estimates from grouped data, the grouping variable must be a factor or a character vector. Recall that in birthwt_mod that we created earlier, the smoke variable is a character vector.

With color:

ggplot(birthwt_mod, aes(x = bwt, color = smoke)) +
  geom_density()

With fill:

ggplot(birthwt_mod, aes(x = bwt, fill = smoke)) +
  geom_density(alpha = 0.3) # to control the transparency 

With facet_grid():

ggplot(birthwt_mod, aes(x = bwt)) +
  geom_density() +
  facet_grid(smoke ~ .)

Another example:

y1 <- rnorm(1000, 1)
y2 <- rnorm(1000, 2)
df_long <- tibble(
  value = c(y1, y2),
  group = rep(c("y1", "y2"), each = 1000)
)

ggplot(df_long, aes(x = value, fill = group)) +
  geom_density(alpha = 0.5)

6.4.3 Boxplot

A boxplot provides a compact summary of the distribution of a numerical variable using quantiles. Unlike histograms and density plots, boxplots do not show the full shape of the distribution, but they are very effective for comparing distributions across groups.

A single boxplot

We first create a boxplot of birth weights.

ggplot(birthwt, aes(x = "", y = bwt)) +
  geom_boxplot() +
  labs(
    x = "",
    y = "Birth weight",
    title = "Boxplot of Birth Weights"
  )

Here, the x-axis label is left empty since there is only one group.

Boxplots by group

Boxplots are especially useful for comparing distributions across groups.

Below, we compare birth weights by smoking status during pregnancy.

ggplot(birthwt_mod, aes(x = smoke, y = bwt)) +
  geom_boxplot() +
  labs(
    x = "Smoking status",
    y = "Birth weight",
    title = "Birth Weight by Smoking Status"
  )

From this plot, we can compare:

  • medians across groups,

  • variability (box height),

  • presence of outliers.

Horizontal boxplots

As with other plots, boxplots can be flipped using coord_flip().

ggplot(birthwt_mod, aes(x = smoke, y = bwt)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    x = "Smoking status",
    y = "Birth weight",
    title = "Birth Weight by Smoking Status"
  )

Example using flights

ggplot(flights, aes(x = factor(month), y = arr_delay)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    x = "Month",
    y = "Arrival delay (minutes)",
    title = "Arrival Delays by Month"
  )

6.5 Saving your plots

There are two main types of image files: vector and raster (bitmap).

Raster images

Raster images are pixel-based. When you zoom in, individual pixels become visible. Common raster formats include PNG and JPEG (JPG).

  • PNG files are lossless and generally higher quality.
  • JPEG files are lossy and typically smaller in size.

Vector images

Vector images are constructed using mathematical formulas rather than pixels. They can be resized without loss of quality, and remain smooth when zoomed in. Common vector formats include PDF and AI.

Practical notes

  • Use PDF for reports, papers, and slides.

  • Use PNG for web output or when vector files become too large.

6.5.1 Outputting to PDF (vector format)

Suppose you want to save the following plot:

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

Using a graphics device:

# filename is the output file
# width and height are in inches
pdf("filename.pdf", width = 4, height = 4)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

dev.off()

By default, the output file is saved in R’s current working directory.

PDF output is often a good choice because:

  • it preserves vector quality,

  • file sizes are usually small,

  • it is ideal for reports.

Remark: When a plot contains many points (severe overplotting), a PDF file can become larger than a PNG file because every point is stored as a vector object.

6.5.2 Outputting to bitmap formats (PNG)

# width and height are in pixels
png("png_plot.png", width = 600, height = 600)

ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point()

dev.off()

For high-quality print output, a resolution of at least 300 ppi (pixels per inch) is recommended.

To create a 4 × 4 inch PNG at 300 ppi:

ppi <- 300
png("png_plot.png",
    width = 4 * ppi,
    height = 4 * ppi,
    res = ppi)

ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point()

dev.off()

6.6 Axes, appearance

6.6.1 Swapping X- and Y-axes

Swapping the axes is useful when category labels are long or when a horizontal layout is easier to read. This can be done using coord_flip().

library(ggpubr)  # to use ggarrange

plot1 <- ggplot(heightweight, aes(x = sex, y = heightIn)) +
  geom_boxplot()

plot2 <- plot1 +
  coord_flip()

ggarrange(plot1, plot2)

6.6.2 Setting the range of a continuous axis

In ggplot, xlim() and ylim() remove data outside the specified range. To zoom in without removing data, use coord_cartesian().

hw_plot <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point()


hw_plot_zoom <- hw_plot +
  coord_cartesian(
    xlim = c(14, 18),
    ylim = c(60, 70)
  )

ggarrange(hw_plot, hw_plot_zoom)

6.6.3 Changing the text of axis labels

Axis labels can be changed using labs(). Line breaks can be added using \n.

hw_plot <- ggplot(heightweight,
                  aes(x = ageYear, y = heightIn, colour = sex)) +
  geom_point()

hw_plot +
  labs(
    x = "Age\n(years)",
    y = "Height (inches)"
  )

6.6.4 Adding Titles

You can add a title using labs(title = ...).

hw_plot <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
  geom_point()

hw_plot1 <- hw_plot +
  labs(title = "Age and Height of Schoolchildren")

hw_plot2 <- hw_plot +
  labs(title = "Age and Height\nof Schoolchildren")

ggarrange(hw_plot1, hw_plot2)

6.6.5 Adding Subtitle

Subtitles can be added using the subtitle argument in labs().

hw_plot +
  labs(
    title = "Age and Height of Schoolchildren",
    subtitle = "11.5 to 17.5 years old"
  )

6.6.6 Using Themes

Themes control the non-data appearance of a plot, such as background, gridlines, and fonts.

hw_plot_grey <- hw_plot +
  theme_grey()      # default theme

hw_plot_classic <- hw_plot +
  theme_classic()

hw_plot_bw <- hw_plot +
  theme_bw()

hw_plot_minimal <- hw_plot +
  theme_minimal()

ggarrange(hw_plot_grey,
          hw_plot_classic,
          hw_plot_bw,
          hw_plot_minimal)

6.7 Summary

6.7.1 Bar charts

  1. examples of using pipe %>% together with ggplot
  2. create bar charts of counts
  3. create bar charts of values
  4. change “fill” and “outline” of the bars
  5. create grouped bar charts
  6. create stacked bar charts
  7. convert a variable into factor in ggplot
  8. use different colour palette
  9. control the width of the bars

6.7.2 Line graphs

  1. create line graphs
  2. label the graph
  3. change the range of y-axis
  4. create line graphs with multiple lines
  5. use multiple geoms (geometric objects) (e.g. additing the points on top of the lines)
  6. change shape, size, fill, outline of points
  7. change line type

6.7.3 Scatter plot

  1. create scatter plots
  2. visualize an additional discrete variable
  3. visualize an additional continuous variable
  4. visualize two additional discrete variables
  5. overplotting (use smaller points, make points semitransparent, bin data into rectangels, jitter the points)
  6. label points in a scatter plot

6.7.4 Summarizing data distributions

  1. create histograms (frequency and density)
  2. compare two histograms (facet_grid(), fill())
  3. create histograms with two additional discrete variables
  4. create kernel density estimates
  5. overlay a density curve with a histogram
  6. display kernel density estimates from grouped data (color, fill, facet_grid)
  7. create boxplots

6.7.5 Saving your plots

  1. output to pdf vector files
  2. output to bitmap files

6.7.6 Axes, appearance

  1. swapping x- and y-axes
  2. setting the range of a continuous axis
  3. change the text of axis labels
  4. adding title
  5. adding subtitle
  6. using themes