Introduction to ggplot2 - I

Lecture 1

Dr. Greg Chism

University of Arizona
INFO 526 - Fall 2023

Warm up

Announcements

  • Reading Quiz #1 is due Monday, by 3:30pm.

  • A note on readings for next week: Some of it is review so feel free to skim those parts.

Overview

In this lecture, we will:

  • Explore the grammar of graphics

  • Map data to aesthetics

  • Understand layer components

  • Interpret ggplot2 documentation

  • Create a layered plot

  • Introduce function and syntax of visual elements

The grammar of graphics

What is a grammar?

“The fundamental principles or rules of an art or science” - Oxford English Dictionary

  • Reveal composition of complicated graphics

  • Strong foundation for understanding a range of graphics

  • Guide for well-formed or correct graphics

Note

See “The Grammar of Graphics” by Leland Wilkinson (2005) and “A Layered Grammar of Graphics” by Hadley Wickham (2010)

Layered grammar of graphics

Cartoon image of three Chinstrap, Gentoo, and Adelie penguins.

ggplot2 builds complex plots iteratively, one layer at a time.

  • What are the necessary components of a plot?

  • What are necessary components of a layer?

Components of a plot

A plot contains:

  • Data and aesthetic mapping

  • Layer(s) containing geometric object(s) and statistical transformation(s)

  • Scales

  • Coordinate system

  • (Optional) facets or themes

Components of a layer

A layer contains:

  • Data with aesthetic mapping

  • A statistical transformation, or stat

  • A geometric object, or geom

  • A position adjustment

Mapping data to aesthetics

What data inputs are needed?

Data can be added to either the entire ggplot object or a particular layer.

Input data must be a data frame in ‘tidy’ format:

  • every column is a variable

  • every row is an observation

  • every cell is a single value

Note

See “Tidy Data” by Wickham (2014) and the associated vignette
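As a rough illustration of the three tidy rules, the sketch below reshapes a small wide table into tidy form. The wide layout and the `measurement` column are hypothetical, invented for this example; the bill lengths are the six values from the example dataset in these slides, and `pivot_longer()` comes from the tidyr package.

```r
library(tidyr)

# Hypothetical "wide" table: species are spread across columns,
# so the variable "species" is hidden in the column names (not tidy).
wide <- data.frame(
  measurement = c("first", "second"),
  Adelie      = c(39.1, 39.5),
  Gentoo      = c(46.7, 43.3),
  Chinstrap   = c(46.1, 51.3)
)

# Tidy version: one column per variable, one row per observation.
tidy <- pivot_longer(
  wide,
  cols      = c(Adelie, Gentoo, Chinstrap),
  names_to  = "species",
  values_to = "bill_length_mm"
)

tidy
```

In the tidy version, `species` can be mapped directly to an aesthetic such as color, which the wide layout does not allow.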

Example dataset - raw

# A tibble: 6 × 4
  species   bill_length_mm bill_depth_mm body_mass_g
  <fct>              <dbl>         <dbl>       <int>
1 Adelie              39.1          18.7        3750
2 Adelie              39.5          17.4        3800
3 Gentoo              46.7          15.3        5200
4 Gentoo              43.3          13.4        4400
5 Chinstrap           46.1          18.2        3250
6 Chinstrap           51.3          18.2        3750

Example dataset - mapped

# A tibble: 6 × 4
  Color         x     y  Size
  <fct>     <dbl> <dbl> <int>
1 Adelie     39.1  18.7  3750
2 Adelie     39.5  17.4  3800
3 Gentoo     46.7  15.3  5200
4 Gentoo     43.3  13.4  4400
5 Chinstrap  46.1  18.2  3250
6 Chinstrap  51.3  18.2  3750
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm, 
                     color = species)) +
  geom_point()
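The mapped table above (Color, x, y, Size) suggests a mapping along the following lines. The exact call behind that slide is not shown, so this is a sketch under that assumption; `penguins_small` is a hypothetical name for the six example rows, typed in by hand.

```r
library(ggplot2)

# The six rows from the raw example table
penguins_small <- data.frame(
  species = factor(c("Adelie", "Adelie", "Gentoo",
                     "Gentoo", "Chinstrap", "Chinstrap")),
  bill_length_mm = c(39.1, 39.5, 46.7, 43.3, 46.1, 51.3),
  bill_depth_mm  = c(18.7, 17.4, 15.3, 13.4, 18.2, 18.2),
  body_mass_g    = c(3750L, 3800L, 5200L, 4400L, 3250L, 3750L)
)

# Each column is mapped to one aesthetic, as in the mapped table
p_mapped <- ggplot(penguins_small,
                   aes(x = bill_length_mm,
                       y = bill_depth_mm,
                       color = species,
                       size = body_mass_g)) +
  geom_point()

p_mapped
```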

Where to specify aesthetics?

  • Can be supplied to initial ggplot() call, in individual layers, or a combo

  • ggplot() data and aesthetics are inherited, but can be overridden

ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm, 
                     color = species)) +
  geom_point()
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(aes(color = species))
ggplot() +
  geom_point(data = penguins,
             aes(x = body_mass_g, y = flipper_length_mm, color = species))
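To illustrate the "can be overridden" point, here is a minimal sketch in which one layer drops an inherited aesthetic by setting it to NULL inside its own aes(). It assumes `penguins` comes from the palmerpenguins package, which these slides use but never load explicitly.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

p_override <- ggplot(penguins, aes(x = body_mass_g,
                                   y = flipper_length_mm,
                                   color = species)) +
  # Inherits x, y, and color from ggplot()
  geom_point() +
  # Overrides the inherited color mapping for this layer only,
  # so a single trend line is fitted to all species together
  geom_smooth(aes(color = NULL), method = "lm", se = FALSE)

p_override
```

Passing `inherit.aes = FALSE` to a layer is the blunter alternative: the layer then ignores all plot-level aesthetics and needs its own complete mapping.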

Inheritance of aesthetics by layers

ggplot(penguins, aes(x = body_mass_g, 
                     y = flipper_length_mm, 
                     color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) 
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(penguins, aes(x = body_mass_g, 
                     y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", 
              se = FALSE) 
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
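The difference between the two plots can be checked without looking at the figures. The sketch below rebuilds both and counts the groups that reach the smoothing layer, using `layer_data()`; the object names are hypothetical and `penguins` is assumed to come from palmerpenguins.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

# Color mapped in ggplot(): inherited by both layers
p_global <- ggplot(penguins, aes(x = body_mass_g,
                                 y = flipper_length_mm,
                                 color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Color mapped in geom_point() only: not seen by geom_smooth()
p_local <- ggplot(penguins, aes(x = body_mass_g,
                                y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", se = FALSE)

# layer_data(plot, i) returns the data computed for the i-th layer
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))

c(global = n_lines_global, local = n_lines_local)
```

With the global mapping, one line is fitted per species; with the local mapping, the smoother sees no grouping variable and fits a single line.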

Mapping aesthetics to constants

A quoted constant inside aes() is treated as data, so ggplot2 creates a legend entry for it on the fly.

ggplot(penguins, 
       aes(x = body_mass_g,
           color = species)) +
  geom_point(aes(y = bill_length_mm, 
                 shape = "Length")) +
  geom_point(aes(y = bill_depth_mm, 
                 shape = "Depth")) +
  ylab("Bill dimensions (mm)") +
  labs(shape = "dimension")
Warning: Removed 2 rows containing missing values (`geom_point()`).
Removed 2 rows containing missing values (`geom_point()`).
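A related point is the difference between mapping a constant (inside aes()) and setting one (outside aes()). The sketch below uses a tiny hypothetical data frame, with values taken from the example table, to show that a mapped "blue" is treated as a category label while a set "blue" is used as the actual color.

```r
library(ggplot2)

# Small hypothetical data frame for illustration
df <- data.frame(x = c(39.1, 39.5, 46.7),
                 y = c(18.7, 17.4, 15.3))

# Mapped: "blue" is a one-level variable, so it gets a legend
# and a color chosen by the default discrete color scale
p_mapped_const <- ggplot(df, aes(x, y)) +
  geom_point(aes(color = "blue"))

# Set: the points are drawn in blue, with no legend
p_set_const <- ggplot(df, aes(x, y)) +
  geom_point(color = "blue")

# Inspect the colors that each layer actually uses
layer_data(p_mapped_const)$colour
layer_data(p_set_const)$colour
```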

Customizing layers

Under the hood with layer()

A layer contains:

  • Data with aesthetic mapping

  • A statistical transformation, or stat

  • A geometric object, or geom

  • A position adjustment

ggplot() +
  geom_point()
ggplot() +
  layer(mapping = NULL,
        data = NULL,
        geom = "point",
        stat = "identity",
        position = "identity")

Note

All geom_*() or stat_*() calls are customized shortcuts for the layer() function.

The expediency of defaults

  • Defining each of the components of a layer or whole graphic can be tiresome

  • ggplot2 has a hierarchy of defaults

  • So you can make a graph in 2 lines of code!
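As a sketch of what the defaults fill in, the two-line plot below never names a stat, position, coordinate system, or facet, yet the resulting object contains all of them. `penguins` is assumed to come from palmerpenguins, and the fields inspected here (`$layers`, `$coordinates`, `$facet`) are internal details of the plot object that may differ between ggplot2 versions.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

# The two-line plot
p_short <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point()

# Defaults that ggplot2 supplied on our behalf
class(p_short$layers[[1]]$stat)[1]      # stat used by geom_point()
class(p_short$layers[[1]]$position)[1]  # position adjustment
class(p_short$coordinates)[1]           # coordinate system
class(p_short$facet)[1]                 # faceting specification
```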

Warning: Removed 2 rows containing missing values (`geom_point()`).

The short way and the long way

Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot() +
  geom_point(data = penguins,
             mapping = aes(x = body_mass_g,
                           y = flipper_length_mm))
ggplot() +
  layer(data = penguins,
        mapping = aes(
          x = body_mass_g,
          y = flipper_length_mm),
        geom = "point", 
        stat = "identity",
        position = "identity") +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian()

stat_* vs. geom_*

“Every geom has a default statistic, and every statistic has a default geom.” - Wickham (2010)

  • stat_* transforms the data
    • By computing or summarizing from original input dataset
    • Returns a new dataset that can be mapped to aesthetics
  • geom_* controls the type of plot rendered

Tip

When in doubt, check the documentation
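Besides the help pages, the default pairing can be inspected on the layer objects themselves. This is a small sketch; the `$stat` and `$geom` fields are part of the layer object rather than a documented user interface, so treat it as an exploration aid.

```r
library(ggplot2)

bar_layer   <- geom_bar()    # a geom with its default stat
count_layer <- stat_count()  # a stat with its default geom

class(bar_layer$stat)[1]    # default stat of geom_bar()
class(count_layer$geom)[1]  # default geom of stat_count()
```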

Two ways to plot counts (categorical)

stat_count() and geom_bar() are equivalent

ggplot(data = penguins, 
       mapping = aes(x = species, 
                     fill = sex)) +
  stat_count()

ggplot(data = penguins, 
       mapping = aes(x = species, 
                     fill = sex)) +
  geom_bar()

Two ways to plot density (continuous)

stat_density() and geom_density() are not equivalent

ggplot(data = penguins, 
       mapping = aes(x = body_mass_g, 
                     fill = species)) +
  stat_density(alpha = 0.5)
Warning: Removed 2 rows containing non-finite values (`stat_density()`).

ggplot(data = penguins, 
       mapping = aes(x = body_mass_g, 
                     fill = species)) +
  geom_density(alpha = 0.5)
Warning: Removed 2 rows containing non-finite values (`stat_density()`).
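One way to see why the two density plots differ is to compare the defaults of the two layers. The sketch below inspects internal layer fields, which may change between ggplot2 versions, and then builds a stat_density() call with the position set explicitly; `penguins` is assumed to come from palmerpenguins.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

# stat_density() draws areas and stacks them by default
class(stat_density()$geom)[1]
class(stat_density()$position)[1]

# geom_density() draws overlapping curves (no stacking)
class(geom_density()$position)[1]

# Setting the position explicitly gives overlapping densities,
# closer to (though not styled identically to) geom_density()
p_density <- ggplot(penguins, aes(x = body_mass_g, fill = species)) +
  stat_density(position = "identity", alpha = 0.5)

p_density
```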

When to use which?

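In general, use geom_*() unless you are trying to do one of the following, illustrated in order by the three examples below:

  • Override the default stat of a geom

  • Use a variable computed by the stat, via after_stat()

  • Emphasize the statistical transformation rather than the geometry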

penguins %>%
  count(species) %>%
  ggplot(aes(x = species, y = n)) +
  geom_bar(stat = "identity")

ggplot(penguins, aes(x = species, 
                     y = after_stat(prop),
                     group = 1)) +
  geom_bar()

ggplot(penguins) +
  stat_summary(aes(x = species,
                   y = body_mass_g),
               fun.min = min,
               fun.max = max,
               fun = mean)
Warning: Removed 2 rows containing non-finite values (`stat_summary()`).
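The variables that a stat computes, such as `prop` in the second example, can be listed with `layer_data()`. A minimal sketch, assuming `penguins` comes from palmerpenguins:

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

p_bar <- ggplot(penguins, aes(x = species)) +
  geom_bar()

# Data after stat_count() has run: one row per bar
bar_data <- layer_data(p_bar)

# `count` and `prop` are computed by the stat and can be
# referenced inside aes() with after_stat()
bar_data[, c("x", "count", "prop")]
```

The "Computed variables" section of each stat's help page lists the same names.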

A panoply of layer options!

Track all geom and stat options

Exercise

For each of the following problems, suggest a useful geom:

  1. Display how a variable has changed over time
  2. Show the detailed distribution of a single variable
  3. Focus attention on one portion of a large dataset
  4. Draw a map
  5. Label outlying points

Position adjustment options

ggplot(data = penguins, mapping = aes(x = species, fill = sex)) +
  geom_bar(position = "stack")

ggplot(data = penguins, mapping = aes(x = species, fill = sex)) +
  geom_bar(position = "fill")

ggplot(data = penguins, mapping = aes(x = species, 
                     fill = sex)) +
  geom_bar(position = "dodge")

Position adjustment options

ggplot(data = penguins, mapping = aes(x = species, y = body_mass_g, color = sex)) +
  geom_point(position = "identity")
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(data = penguins, mapping = aes(x = species, y = body_mass_g, color = sex)) +
  geom_point(position = "jitter")
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(data = penguins, mapping = aes(x = species, y = body_mass_g, color = sex)) +
  geom_point(position = position_jitterdodge())
Warning: Removed 2 rows containing missing values (`geom_point()`).
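Position adjustments given as strings use default settings. Calling the position function directly exposes its arguments, as in this sketch, where the jitter is limited to the horizontal direction and made reproducible with a seed. The specific width and seed values are arbitrary choices for illustration, and `penguins` is assumed to come from palmerpenguins.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

p_jitter <- ggplot(penguins, aes(x = species, y = body_mass_g, color = sex)) +
  geom_point(
    position = position_jitter(
      width  = 0.2,  # horizontal spread
      height = 0,    # no vertical noise, so body mass is not distorted
      seed   = 526   # same random offsets every time the plot is built
    )
  )

p_jitter
```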

Position adjustment limitations

Not every position adjustment suits every geom. For example, boxplots and error bars can’t be stacked.

Exercise

  • What properties must a geom possess to be stackable?

  • What properties must a geom possess to be dodgeable?

Code-along exercise

Recreating a layered plot


Exercise

What are the two layers in this plot? What data went into each?

Adjusting visual elements

Scales and guides

  • Each scale is a function that translates data space (in data units) into aesthetic space (e.g., pixels)

  • A guide (axis or legend) is the inverse function: it converts visual properties back to data

Labeled ggplot figure indicating similarity between axes and legends

Are axes and legends equivalent?

Scale specification

Every aesthetic in a plot is associated with exactly one scale.

ggplot(penguins, 
       aes(x = body_mass_g,
           y = flipper_length_mm)) +
  geom_point(aes(color = species))
ggplot(penguins, 
       aes(x = body_mass_g,
           y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  scale_x_continuous() + 
  scale_y_continuous() + 
  scale_colour_discrete()

Scale function names are made of 3 pieces separated by “_”:

  1. scale

  2. the name of the primary aesthetic (color, shape, x)

  3. the name of the scale (discrete, continuous, brewer)
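The naming pattern can be read off a few concrete calls. In the sketch below, the axis titles, breaks, and the "Dark2" palette are arbitrary choices for illustration, and `penguins` is assumed to come from palmerpenguins.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

p_scales <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  # scale + x + continuous
  scale_x_continuous(name = "Body mass (g)",
                     breaks = seq(3000, 6000, by = 1000)) +
  # scale + y + continuous
  scale_y_continuous(name = "Flipper length (mm)") +
  # scale + colour + brewer
  scale_colour_brewer(palette = "Dark2")

p_scales
```

Each of the three aesthetics in the plot (x, y, colour) now has exactly one explicitly specified scale.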

What does a coordinate system do?

Coordinate systems have 2 primary roles:

  1. Combine the x and y position aesthetics to produce a 2-dimensional position on the plot

  2. In coordination with faceting (optional), draw axes and panel backgrounds

Types of coordinate systems

Linear:

  • coord_cartesian(): common default

  • coord_flip(): x and y axes flipped

  • coord_fixed(): fixed aspect ratio

Non-linear:

  • coord_map()/coord_quickmap()/coord_sf(): map projections, x and y become longitude and latitude

  • coord_polar(): polar coordinates, x and y become angle and radius

  • coord_trans(): apply transformations
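Because the coordinate system is a separate component, it can be swapped without touching the data, mapping, or layers. A minimal sketch using two of the coordinate systems listed above, with hypothetical object names and `penguins` assumed to come from palmerpenguins:

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

# One base plot; coord_cartesian() is used by default
p_base <- ggplot(penguins, aes(x = species)) +
  geom_bar()

# Linear: swap the x and y axes to get horizontal bars
p_flip <- p_base + coord_flip()

# Non-linear: x becomes the angle and the count becomes the radius
p_polar <- p_base + coord_polar()

p_flip
p_polar
```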

Faceting

Creates small multiples to show different subsets:

  • facet_null(): default

  • facet_wrap(): “wraps” a 1d ribbon of panels into 2d

  • facet_grid(): 2d grid of panels defined by row and column

Comparison of facet_wrap and facet_grid organization
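The two faceting functions can be compared on the same base plot. This is a sketch with hypothetical object names; `penguins` is assumed to come from palmerpenguins, and because `sex` has missing values the grid may include an extra row of panels for NA.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

p_base <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point()

# 1d ribbon of panels (one per species), wrapped into 2d as needed
p_wrap <- p_base + facet_wrap(vars(species))

# 2d grid: rows defined by sex, columns defined by species
p_grid <- p_base + facet_grid(rows = vars(sex), cols = vars(species))

p_wrap
p_grid
```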

Keeping points of reference

Exercise

Recreate the figure below. How would you get the gray points to show up on all facets?

Warning: Removed 6 rows containing missing values (`geom_point()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

Theming

Controls non-data elements of plots (e.g., to match a style guide).

  1. Theme elements specify the non-data elements you can control: plot.title, legend.position

  2. Each element has an element function to describe its visual properties: element_text(), element_blank()

  3. The theme() function allows overriding of the default theme: theme(legend.title = element_blank())
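The three pieces fit together as in the sketch below: theme elements on the left, element functions or values on the right, all passed to theme(). The plot title is a hypothetical label added so that `plot.title` has something to style, and `penguins` is assumed to come from palmerpenguins.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of `penguins`

p_theme <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  labs(title = "Penguin size") +  # hypothetical title for illustration
  theme(
    plot.title      = element_text(face = "bold"),  # style a text element
    legend.position = "bottom",                     # move the legend
    legend.title    = element_blank()               # remove the legend title
  )

p_theme
```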

Complete themes

ggplot(penguins, 
       aes(x = body_mass_g,
           y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  theme_bw()
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(penguins, 
       aes(x = body_mass_g,
           y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  theme_minimal()
Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot(penguins, 
       aes(x = body_mass_g,
           y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  theme_classic()
Warning: Removed 2 rows containing missing values (`geom_point()`).

Further resources

  • Penguin artwork by @allison_horst

  • Hadley Wickham’s “A layered grammar of graphics” (2010)

  • Hadley Wickham’s “ggplot2: Elegant Graphics for Data Analysis, 3rd edition”, now available online

  • “R for Data Science”, by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, especially chapters 2, 10, and 12