Data wrangling - III

Lecture 8

Dr. Greg Chism

University of Arizona
INFO 526 - Fall 2023

Warm up

Announcements

  • RQ 3 is due Wednesday
  • Project 1 reviews will be returned to you by Wednesday

Setup

# load packages
library(countdown)
library(tidyverse)
library(glue)
library(lubridate)
library(scales)
library(ggthemes)
library(gt)
library(palmerpenguins)
library(openintro)
library(ggrepel)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))

# set width of code output
options(width = 65)

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7,        # 7" width
  fig.asp = 0.618,      # the golden ratio
  fig.retina = 3,       # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300             # higher dpi, sharper image
)

Missing values I

Is it ok to suppress the following warning? Or should you update your code to eliminate it?

df <- tibble(
  x = c(1, 2, 3, NA, 3),
  y = c(5, NA, 10, 0, 5)
)
ggplot(df, aes(x = x, y = y)) +
  geom_point(size = 3)
Warning: Removed 2 rows containing missing values
(`geom_point()`).
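
One way to resolve the warning rather than suppress it is to drop the incomplete rows explicitly before plotting, so the decision is visible in the code. This is a minimal sketch, assuming that dropping those rows is appropriate for the analysis; `df_complete`, `n_dropped`, and `p_complete` are names introduced here for illustration.

```r
library(tidyverse)

df <- tibble(
  x = c(1, 2, 3, NA, 3),
  y = c(5, NA, 10, 0, 5)
)

# keep only rows where both plotted variables are observed
df_complete <- df |>
  drop_na(x, y)

# number of rows dropped; this is the count the warning reports
n_dropped <- nrow(df) - nrow(df_complete)

# no warning here: nothing is left for geom_point() to remove
p_complete <- ggplot(df_complete, aes(x = x, y = y)) +
  geom_point(size = 3)
p_complete
```

If the missing values are expected and simply should not be reported, `geom_point(na.rm = TRUE)` removes them silently instead.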

Missing values II

set.seed(1234)
df <- tibble(x = rnorm(100))
p <- ggplot(df, aes(x = x)) +
  geom_boxplot()
p

df |>
  summarize(med_x = median(x))
# A tibble: 1 × 1
   med_x
   <dbl>
1 -0.385

Missing values II

Is it ok to suppress the following warning? Or should you update your code to eliminate it?

p + xlim(0, 2)
Warning: Removed 69 rows containing non-finite values
(`stat_boxplot()`).

Missing values II

Is it ok to suppress the following warning? Or should you update your code to eliminate it?

p + scale_x_continuous(limits = c(0, 2))
Warning: Removed 69 rows containing non-finite values
(`stat_boxplot()`).

Missing values II

Why doesn’t the following generate a warning?

p + coord_cartesian(xlim = c(0, 2))
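
The difference can be checked without plotting. The sketch below counts the observations outside the 0 to 2 range (the ones a scale limit discards before the boxplot statistics are computed) and compares the median of all the data with the median of the retained data. `limits_check` and its column names are introduced here for illustration.

```r
library(tidyverse)

set.seed(1234)
df <- tibble(x = rnorm(100))

limits_check <- df |>
  summarize(
    # values a scale limit of c(0, 2) discards before stat_boxplot() runs
    n_outside = sum(x < 0 | x > 2),
    # what coord_cartesian() summarizes: all 100 values
    med_all = median(x),
    # what xlim() / scale_x_continuous(limits = ) summarizes: retained values only
    med_kept = median(x[x >= 0 & x <= 2])
  )

limits_check
```

`n_outside` should agree with the row count in the warnings above. The two medians differ because scale limits change the data the boxplot is computed from, while `coord_cartesian()` only changes the visible region.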

Coordinate systems

Coordinate systems: purpose

  • Combine the two position aesthetics (x and y) to produce a 2d position on the plot:
    • linear coordinate system: horizontal and vertical coordinates
    • polar coordinate system: angle and radius
    • maps: latitude and longitude
  • In coordination with the faceter, draw axes and panel backgrounds

Coordinate systems: types

  1. Linear coordinate systems: preserve the shape of geoms
  • coord_cartesian(): the default Cartesian coordinate system, where the 2d position of an element is given by the combination of the x and y positions.
  • coord_fixed(): Cartesian coordinate system with a fixed aspect ratio. (useful only in limited circumstances)
  2. Non-linear coordinate systems: can change the shapes of geoms, so a straight line may no longer be straight and the shortest path between two points may no longer be a straight line.
  • coord_trans(): Apply arbitrary transformations to x and y positions, after the data has been processed by the stat
  • coord_polar(): Polar coordinates
  • coord_sf(): Map projections
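
To see how a non-linear coordinate system changes shapes, the sketch below fits a straight line and then applies a log transformation in two ways: through the scale (before the stat runs) and through `coord_trans()` (after the stat runs). The plot names are introduced here for illustration, and the log10 transformation is only an example choice.

```r
library(tidyverse)
library(palmerpenguins)

p_base <- ggplot(
  drop_na(penguins, flipper_length_mm, body_mass_g),
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# scale transformation: the data are transformed first, then the line is fit,
# so the fitted line stays straight on the plot
p_scale <- p_base + scale_y_log10()

# coordinate transformation: the line is fit on the original data,
# then the whole plot is bent, so the fitted line is drawn as a curve
p_coord <- p_base + coord_trans(y = "log10")
```

Printing `p_scale` and `p_coord` side by side shows the same points with differently shaped fitted lines.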

Setting limits: what the plots say

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() + geom_smooth() +
  labs(title = "Plot 1")

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() + geom_smooth() +
  scale_x_continuous(limits = c(190, 220)) + scale_y_continuous(limits = c(4000, 5000)) +
  labs(title = "Plot 2")

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() + geom_smooth() +
  xlim(190, 220) + ylim(4000, 5000) +
  labs(title = "Plot 3")

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() + geom_smooth() +
  coord_cartesian(xlim = c(190,220), ylim = c(4000, 5000)) +
  labs(title = "Plot 4")

Setting limits: what the warnings say

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() + geom_smooth() +
  labs(title = "Plot 1")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values
(`geom_point()`).
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() + geom_smooth() +
  scale_x_continuous(limits = c(190, 220)) + scale_y_continuous(limits = c(4000, 5000)) +
  labs(title = "Plot 2")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 235 rows containing non-finite values
(`stat_smooth()`).
Warning: Removed 235 rows containing missing values
(`geom_point()`).
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() + geom_smooth() +
  xlim(190, 220) + ylim(4000, 5000) +
  labs(title = "Plot 3")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 235 rows containing non-finite values
(`stat_smooth()`).
Warning: Removed 235 rows containing missing values
(`geom_point()`).
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() + geom_smooth() +
  coord_cartesian(xlim = c(190,220), ylim = c(4000, 5000)) +
  labs(title = "Plot 4")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values
(`geom_point()`).

Setting limits

  • Setting scale limits: Any data outside the limits is thrown away
    • scale_*_continuous(), limits argument
    • xlim() and ylim()
  • Setting coordinate system limits: Use all the data, but only display a small region of the plot (zooming in)
    • coord_cartesian(), xlim and ylim arguments
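
The row counts in the warnings can be reproduced directly from the data. The sketch below counts how many penguins have a missing x or y value and how many fall inside both limits used in Plots 2 to 4. `limit_counts` and its column names are introduced here for illustration.

```r
library(tidyverse)
library(palmerpenguins)

limit_counts <- penguins |>
  summarize(
    n_total = n(),
    # rows with a missing x or y value: dropped by every plot, including Plot 1
    n_missing = sum(is.na(flipper_length_mm) | is.na(body_mass_g)),
    # rows inside both limits: the only ones kept when limits are set on the scales
    n_inside = sum(
      between(flipper_length_mm, 190, 220) &
        between(body_mass_g, 4000, 5000),
      na.rm = TRUE
    ),
    # rows removed under scale limits: missing values plus out-of-range values
    n_removed = n_total - n_inside
  )

limit_counts
```

`n_missing` corresponds to the warnings for Plots 1 and 4, and `n_removed` should agree with the counts reported for Plots 2 and 3.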

Fixing aspect ratio with coord_fixed()

Useful when having an aspect ratio of 1 makes sense, e.g. scores on two tests (reading and writing) on the same scale (0 to 100 points)

ggplot(hsb2, aes(x = read, y = write)) +
  geom_point() + geom_smooth(method = "lm") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  labs(title = "Not fixed")

ggplot(hsb2, aes(x = read, y = write)) +
  geom_point() + geom_smooth(method = "lm") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  coord_fixed() +
  labs(title = "Fixed")

Pie charts and bullseye charts with coord_polar()

ggplot(penguins, aes(x = 1, fill = species)) +
  geom_bar() +
  labs(title = "Stacked bar chart")

ggplot(penguins, aes(x = 1, fill = species)) +
  geom_bar() +
  coord_polar(theta = "y") +
  labs(title = "Pie chart")

ggplot(penguins, aes(x = 1, fill = species)) +
  geom_bar() +
  coord_polar(theta = "x") +
  labs(title = "Bullseye chart")

aside: about pie charts…

Pie charts

What do you know about pie charts and data visualization best practices? Love ’em or lose ’em?

Pie charts: when to love ’em, when to lose ’em

For categorical variables with few levels, pie charts can work well, and so can bar charts

loans |>
  ggplot(aes(x = homeownership, fill = homeownership)) +
  geom_bar(show.legend = FALSE) +
  scale_fill_openintro("hot") +
  labs(x = "Homeownership", y = "Count")

Pie charts: when to love ’em, when to lose ’em

For categorical variables with many levels, pie charts are difficult to read; a bar chart is easier to compare

loans |>
  ggplot(aes(x = grade, fill = grade)) +
  geom_bar(show.legend = FALSE) +
  scale_fill_openintro("cool") +
  labs(x = "Loan grade", y = "Count")

Bringing together multiple data frames

Scenario 2

We…

  • have multiple data frames
  • want to bring them together so we can plot them

professions <- read_csv("data/professions.csv")
dates <- read_csv("data/dates.csv")
works <- read_csv("data/works.csv")

10 women in science who changed the world

name
Ada Lovelace
Marie Curie
Janaki Ammal
Chien-Shiung Wu
Katherine Johnson
Rosalind Franklin
Vera Rubin
Gladys West
Flossie Wong-Staal
Jennifer Doudna

Inputs

professions
# A tibble: 10 × 2
   name               profession                        
   <chr>              <chr>                             
 1 Ada Lovelace       Mathematician                     
 2 Marie Curie        Physicist and Chemist             
 3 Janaki Ammal       Botanist                          
 4 Chien-Shiung Wu    Physicist                         
 5 Katherine Johnson  Mathematician                     
 6 Rosalind Franklin  Chemist                           
 7 Vera Rubin         Astronomer                        
 8 Gladys West        Mathematician                     
 9 Flossie Wong-Staal Virologist and Molecular Biologist
10 Jennifer Doudna    Biochemist                        
dates
# A tibble: 8 × 3
  name               birth_year death_year
  <chr>                   <dbl>      <dbl>
1 Janaki Ammal             1897       1984
2 Chien-Shiung Wu          1912       1997
3 Katherine Johnson        1918       2020
4 Rosalind Franklin        1920       1958
5 Vera Rubin               1928       2016
6 Gladys West              1930         NA
7 Flossie Wong-Staal       1947         NA
8 Jennifer Doudna          1964         NA
works
# A tibble: 9 × 2
  name               known_for                                   
  <chr>              <chr>                                       
1 Ada Lovelace       first computer algorithm                    
2 Marie Curie        theory of radioactivity,  first woman Nobel…
3 Janaki Ammal       hybrid species, biodiversity protection     
4 Chien-Shiung Wu    experiment overturning theory of parity     
5 Katherine Johnson  orbital mechanics critical to sending first…
6 Vera Rubin         existence of dark matter                    
7 Gladys West        mathematical modeling of the shape of the E…
8 Flossie Wong-Staal first to clone HIV and map its genes, which…
9 Jennifer Doudna    one of the primary developers of CRISPR     

Desired output

# A tibble: 10 × 5
   name               profession  birth_year death_year known_for
   <chr>              <chr>            <dbl>      <dbl> <chr>    
 1 Ada Lovelace       Mathematic…         NA         NA first co…
 2 Marie Curie        Physicist …         NA         NA theory o…
 3 Janaki Ammal       Botanist          1897       1984 hybrid s…
 4 Chien-Shiung Wu    Physicist         1912       1997 experime…
 5 Katherine Johnson  Mathematic…       1918       2020 orbital …
 6 Rosalind Franklin  Chemist           1920       1958 <NA>     
 7 Vera Rubin         Astronomer        1928       2016 existenc…
 8 Gladys West        Mathematic…       1930         NA mathemat…
 9 Flossie Wong-Staal Virologist…       1947         NA first to…
10 Jennifer Doudna    Biochemist        1964         NA one of t…

Inputs, reminder

names(professions)
[1] "name"       "profession"
names(dates)
[1] "name"       "birth_year" "death_year"
names(works)
[1] "name"      "known_for"
nrow(professions)
[1] 10
nrow(dates)
[1] 8
nrow(works)
[1] 9

Joining data frames

something_join(x, y)
  • left_join(): all rows from x
  • right_join(): all rows from y
  • full_join(): all rows from both x and y
  • semi_join(): all rows from x where there are matching values in y, keeping just columns from x
  • inner_join(): all rows from x where there are matching values in y, returning every combination when there are multiple matches
  • anti_join(): all rows from x where there are no matching values in y, never duplicating rows of x
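
The difference between returning every combination of matches and never duplicating rows is easiest to see with a repeated key. This is a minimal sketch using hypothetical tables `x_dup` and `y_dup`, which are not part of the lecture data; `relationship = "many-to-many"` is set only to tell dplyr that the repeated matches are intentional.

```r
library(tidyverse)

# hypothetical tables with a repeated key (id 1 appears twice in each)
x_dup <- tibble(id = c(1, 1, 2), value_x = c("x1a", "x1b", "x2"))
y_dup <- tibble(id = c(1, 1, 3), value_y = c("y1a", "y1b", "y3"))

# mutating join: every matching pair is returned, so id 1 yields 2 x 2 = 4 rows
inner_dup <- inner_join(
  x_dup, y_dup,
  by = join_by(id),
  relationship = "many-to-many"
)

# filtering joins: rows of x are kept or dropped, never duplicated,
# and no columns from y are added
semi_dup <- semi_join(x_dup, y_dup, by = join_by(id)) # both id 1 rows
anti_dup <- anti_join(x_dup, y_dup, by = join_by(id)) # the id 2 row
```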

Setup

For the next few slides…

x <- tibble(
  id = c(1, 2, 3),
  value_x = c("x1", "x2", "x3")
  )

x
# A tibble: 3 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     
3     3 x3     
y <- tibble(
  id = c(1, 2, 4),
  value_y = c("y1", "y2", "y4")
  )

y
# A tibble: 3 × 2
     id value_y
  <dbl> <chr>  
1     1 y1     
2     2 y2     
3     4 y4     

left_join()

left_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   

left_join()

professions |>
  left_join(dates)
Joining with `by = join_by(name)`
# A tibble: 10 × 4
   name               profession            birth_year death_year
   <chr>              <chr>                      <dbl>      <dbl>
 1 Ada Lovelace       Mathematician                 NA         NA
 2 Marie Curie        Physicist and Chemist         NA         NA
 3 Janaki Ammal       Botanist                    1897       1984
 4 Chien-Shiung Wu    Physicist                   1912       1997
 5 Katherine Johnson  Mathematician               1918       2020
 6 Rosalind Franklin  Chemist                     1920       1958
 7 Vera Rubin         Astronomer                  1928       2016
 8 Gladys West        Mathematician               1930         NA
 9 Flossie Wong-Staal Virologist and Molec…       1947         NA
10 Jennifer Doudna    Biochemist                  1964         NA

right_join()

right_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     4 <NA>    y4     

right_join()

professions |>
  right_join(dates)
Joining with `by = join_by(name)`
# A tibble: 8 × 4
  name               profession             birth_year death_year
  <chr>              <chr>                       <dbl>      <dbl>
1 Janaki Ammal       Botanist                     1897       1984
2 Chien-Shiung Wu    Physicist                    1912       1997
3 Katherine Johnson  Mathematician                1918       2020
4 Rosalind Franklin  Chemist                      1920       1958
5 Vera Rubin         Astronomer                   1928       2016
6 Gladys West        Mathematician                1930         NA
7 Flossie Wong-Staal Virologist and Molecu…       1947         NA
8 Jennifer Doudna    Biochemist                   1964         NA

full_join()

full_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 4 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   
4     4 <NA>    y4     

full_join()

dates |>
  full_join(works)
Joining with `by = join_by(name)`
# A tibble: 10 × 4
   name               birth_year death_year known_for            
   <chr>                   <dbl>      <dbl> <chr>                
 1 Janaki Ammal             1897       1984 hybrid species, biod…
 2 Chien-Shiung Wu          1912       1997 experiment overturni…
 3 Katherine Johnson        1918       2020 orbital mechanics cr…
 4 Rosalind Franklin        1920       1958 <NA>                 
 5 Vera Rubin               1928       2016 existence of dark ma…
 6 Gladys West              1930         NA mathematical modelin…
 7 Flossie Wong-Staal       1947         NA first to clone HIV a…
 8 Jennifer Doudna          1964         NA one of the primary d…
 9 Ada Lovelace               NA         NA first computer algor…
10 Marie Curie                NA         NA theory of radioactiv…

inner_join()

inner_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 2 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     

inner_join()

dates |>
  inner_join(works)
Joining with `by = join_by(name)`
# A tibble: 7 × 4
  name               birth_year death_year known_for             
  <chr>                   <dbl>      <dbl> <chr>                 
1 Janaki Ammal             1897       1984 hybrid species, biodi…
2 Chien-Shiung Wu          1912       1997 experiment overturnin…
3 Katherine Johnson        1918       2020 orbital mechanics cri…
4 Vera Rubin               1928       2016 existence of dark mat…
5 Gladys West              1930         NA mathematical modeling…
6 Flossie Wong-Staal       1947         NA first to clone HIV an…
7 Jennifer Doudna          1964         NA one of the primary de…

semi_join()

semi_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 2 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     

semi_join()

dates |>
  semi_join(works)
Joining with `by = join_by(name)`
# A tibble: 7 × 3
  name               birth_year death_year
  <chr>                   <dbl>      <dbl>
1 Janaki Ammal             1897       1984
2 Chien-Shiung Wu          1912       1997
3 Katherine Johnson        1918       2020
4 Vera Rubin               1928       2016
5 Gladys West              1930         NA
6 Flossie Wong-Staal       1947         NA
7 Jennifer Doudna          1964         NA

anti_join()

anti_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 1 × 2
     id value_x
  <dbl> <chr>  
1     3 x3     

anti_join()

dates |>
  anti_join(works)
Joining with `by = join_by(name)`
# A tibble: 1 × 3
  name              birth_year death_year
  <chr>                  <dbl>      <dbl>
1 Rosalind Franklin       1920       1958

Putting it all together

scientists <- professions |>
  left_join(dates) |>
  left_join(works)
Joining with `by = join_by(name)`
Joining with `by = join_by(name)`
scientists
# A tibble: 10 × 5
   name               profession  birth_year death_year known_for
   <chr>              <chr>            <dbl>      <dbl> <chr>    
 1 Ada Lovelace       Mathematic…         NA         NA first co…
 2 Marie Curie        Physicist …         NA         NA theory o…
 3 Janaki Ammal       Botanist          1897       1984 hybrid s…
 4 Chien-Shiung Wu    Physicist         1912       1997 experime…
 5 Katherine Johnson  Mathematic…       1918       2020 orbital …
 6 Rosalind Franklin  Chemist           1920       1958 <NA>     
 7 Vera Rubin         Astronomer        1928       2016 existenc…
 8 Gladys West        Mathematic…       1930         NA mathemat…
 9 Flossie Wong-Staal Virologist…       1947         NA first to…
10 Jennifer Doudna    Biochemist        1964         NA one of t…

*_join() functions

  • From dplyr
  • Incredibly useful for bringing datasets with common information (e.g., unique identifier) together
  • Use the by argument when the names of the columns containing the common information are not the same across datasets
  • Always check that the numbers of rows and columns of the resulting dataset make sense
  • Refer to the two-table verbs vignette when needed
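
A minimal sketch of the `by` argument and the row-count check, assuming a case where the key column has different names in the two tables. The tables `professions_toy` and `births_toy` and the column name `scientist` are hypothetical and introduced only for this example.

```r
library(tidyverse)

# hypothetical inputs: the key is called `name` in one table and `scientist` in the other
professions_toy <- tibble(
  name = c("Ada Lovelace", "Marie Curie"),
  profession = c("Mathematician", "Physicist and Chemist")
)
births_toy <- tibble(
  scientist = c("Ada Lovelace", "Marie Curie"),
  birth_year = c(1815, 1867)
)

# spell out which column in x matches which column in y
joined_toy <- professions_toy |>
  left_join(births_toy, by = join_by(name == scientist))

# a left join on a unique key should keep the row count of x
stopifnot(nrow(joined_toy) == nrow(professions_toy))

joined_toy
```

Supplying `by` explicitly also silences the "Joining with ..." message, which documents the join key in the code itself.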

Visualizing joined data

But first…

What is the plot in the previous slide called?

Livecoding

Reveal below for code developed during live coding session.

  • Transform
scientists_longer <- scientists |>
  mutate(
    birth_year = case_when(
      name == "Ada Lovelace" ~ 1815,
      name == "Marie Curie" ~ 1867,
      TRUE ~ birth_year
    ),
    death_year = case_when(
      name == "Ada Lovelace" ~ 1852,
      name == "Marie Curie" ~ 1934,
      name == "Flossie Wong-Staal" ~ 2020,
      TRUE ~ death_year
    ),
    status = if_else(is.na(death_year), "alive", "deceased"),
    death_year = if_else(is.na(death_year), 2021, death_year),
    known_for = if_else(
      name == "Rosalind Franklin",
      "understanding of the molecular structures of DNA",
      known_for
    )
  ) |>
  pivot_longer(
    cols = contains("year"),
    names_to = "year_type",
    values_to = "year"
  ) |>
  # flag the placeholder end year (2021) used for scientists who are still alive
  mutate(death_year_fake = year == 2021)
  • Plot
ggplot(scientists_longer, 
       aes(x = year, y = fct_reorder(name, as.numeric(factor(profession))), group = name, color = profession)) +
  geom_point(aes(shape = death_year_fake), show.legend = FALSE) +
  geom_line(aes(linetype = status), show.legend = FALSE) +
  scale_shape_manual(values = c("circle", NA)) +
  scale_linetype_manual(values = c("dashed", "solid")) +
  scale_color_colorblind() +
  scale_x_continuous(expand = c(0.01, 0), breaks = seq(1820, 2020, 50)) +
  geom_text(aes(y = name, label = known_for), x = 2030, show.legend = FALSE, hjust = 0) +
  geom_text(aes(label = profession), x = 1809, y = Inf, hjust = 1, vjust = 1, show.legend = FALSE) +
  coord_cartesian(clip = "off") +
  labs(
    x = "Year", y = NULL,
    title = "10 women in science who changed the world",
    caption = "Source: Discover magazine"
  ) +
  facet_grid(profession ~ ., scales = "free_y", space = "free_y", switch = "x") +
  theme(
    plot.margin = unit(c(1, 23, 1, 4), "lines"),
    plot.title.position = "plot",
    plot.caption.position = "plot",
    plot.caption = element_text(hjust = 2), # manual hack
    strip.background = element_blank(),
    strip.text = element_blank(),
    axis.title.x = element_text(hjust = 0),
    panel.background = element_rect(fill = "#f0f0f0", color = "white"),
    panel.grid.major = element_line(color = "white", linewidth = 0.5)
  )
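
The Transform step above reshapes one row per scientist into one row per scientist and year type, which is what lets `geom_line()` connect birth and death years. The sketch below shows the same pattern on a hypothetical two-row extract, `toy`, and uses `.default` in `case_when()` as an alternative to `TRUE ~`. It assumes dplyr 1.1.0 or later and is not the code used in the live session.

```r
library(tidyverse)

# hypothetical two-row extract with the same column layout as `scientists`
toy <- tibble(
  name = c("Ada Lovelace", "Gladys West"),
  birth_year = c(NA, 1930),
  death_year = c(NA_real_, NA_real_)
)

toy_longer <- toy |>
  mutate(
    # fill in known values; keep the existing value for everyone else
    birth_year = case_when(name == "Ada Lovelace" ~ 1815, .default = birth_year),
    death_year = case_when(name == "Ada Lovelace" ~ 1852, .default = death_year),
    status = if_else(is.na(death_year), "alive", "deceased")
  ) |>
  # one row per (name, year_type) pair instead of one row per name
  pivot_longer(
    cols = contains("year"),
    names_to = "year_type",
    values_to = "year"
  )

toy_longer
```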