Stockholm University, VT2022

Computer Lab 9

Everything in here is optional for the course. I just want to give you a quick and basic introduction to ggplot2.

ggplot2

ggplot2 is probably the most popular plotting engine for R. You can easily create high-quality plots, and tweak them in great detail (this can get harder though). We are going to go through the very basics of creating plots, and how to efficiently structure your data for plotting.

There is a free e-book by the lead developer of ggplot2. The book features hands-on examples and exercises to learn more. I am going to briefly go through some basic and useful features.

The Basics

  • start with a simulated AR(1) process \(y_t\) of length 100
  • we have to provide data to ggplot() as a data.frame
library(ggplot2)
library(tidyr)

set.seed(314)

n <- 100

df <- data.frame(
  time = 1:n, 
  y = as.numeric(arima.sim(list(ar = 0.6), n))
  )
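  • to see what arima.sim() is doing, here is a minimal sketch of the AR(1) recursion written out by hand
    • the names phi, e, y_manual and df_manual are made up for this illustration, and starting the series at the first shock is a simplification
    • the numbers will not match the ones from arima.sim(), which draws its shocks differently and discards a burn-in period

```r
# Hand-rolled AR(1): y_t = phi * y_{t-1} + e_t, with e_t ~ N(0, 1).
# phi = 0.6 matches the ar coefficient used above.
set.seed(314)

n   <- 100
phi <- 0.6
e   <- rnorm(n)            # the shocks

y_manual    <- numeric(n)
y_manual[1] <- e[1]        # simplification: no burn-in period
for (i in 2:n) {
  y_manual[i] <- phi * y_manual[i - 1] + e[i]
}

# Same layout as df: a time index and one value column.
df_manual <- data.frame(time = 1:n, y = y_manual)
head(df_manual)
```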
  • ggplot() needs two things:
    • data
    • a mapping, i.e. how to visualize the data
The Mapping
  • provided through the aes() function (short for aesthetic)
  • takes basic components like what to put on the axes (the variables for x- and y-axis)
  • and more involved components like group and color (we will get to these later)
  • let’s create our first plot:
    • the plots are built additively: you initiate the plot, and then add layers to it
    • only initiating the plot does nothing but set up the coordinate system (check the axes limits, they are reasonable)
ggplot(df, aes(x = time, y = y))

  • to add the data, we must specify how we want to display it
  • this is a time series, so a line makes sense: we add the layer geom_line()
ggplot(df, aes(x = time, y = y)) +
  geom_line()

  • but we could have just as well used points:
ggplot(df, aes(x = time, y = y)) +
  geom_point()

  • we can also do unreasonable things, like plotting it as an area
ggplot(df, aes(x = time, y = y)) +
  geom_area()

  • let’s get back to the line, and adjust the plot a bit
  • to add axis labels and a title, use labs()
  • to change the look, we can add a theme_*()
    • default themes are listed here, there are many additional packages, just google
ggplot(df, aes(x = time, y = y)) +
  geom_line() +
  labs(
    x = "t", y = "value", 
    title = "Simulated AR(1)", 
    subtitle = "using arima.sim()"
    ) +
  theme_minimal()

  • lots of options inside the labs() function (even I don’t know all)
  • if you want a particular look, just google “ggplot how to do X”, there’s loads of guides and answered questions online
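  • because plots are built additively, every intermediate step can be stored in an object and extended later
    • a small sketch of this idea; the toy data and the object names (toy, base_plot, p_line, ...) are invented for the example

```r
library(ggplot2)

# Toy data so the sketch runs on its own (white noise, not the AR(1) above).
set.seed(1)
toy <- data.frame(time = 1:50, y = rnorm(50))

# The initiated plot is an ordinary R object ...
base_plot <- ggplot(toy, aes(x = time, y = y))

# ... and each `+` returns a new plot with one more component.
p_line   <- base_plot + geom_line()
p_both   <- p_line + geom_point(size = 1)
p_styled <- p_both + labs(x = "t", y = "value") + theme_minimal()

p_styled  # printing the object draws the plot
```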
More series
  • let’s simulate another series and plot both
  • this time an MA(1) process, call it \(z_t\)
df$z <- as.numeric(arima.sim(list(ma = 0.8), n))
  • thinking the “base R” way, we would add one line per series, something like this
ggplot(df, aes(x = time)) +
  geom_line(aes(y = y)) +
  geom_line(aes(y = z), color = "red")

  • note that we can “split up” the specification of the aes()
    • both geoms use the same data (specified in the initiation)
    • the x-axis is time for both plots, but the y-axis has different values
  • we can of course also combine different geoms for the two series
ggplot(df, aes(x = time)) +
  geom_line(aes(y = y)) +
  geom_point(aes(y = z), color = "red")

  • but in general, this is inefficient:
    • we know we have two time series, and essentially want to plot them the same way
    • how can we tell ggplot that we have two distinct variables that should get the same treatment? store the variable name as a variable itself!
The Data
  • usually we work with “wide” data
time  unemp         infl
2000  \(u_{2000}\)  \(i_{2000}\)
2001  \(u_{2001}\)  \(i_{2001}\)
  • variables are stored per column, the number of rows equals the number of observations
  • natural for specifying regressions, looking at interactions, etc.
  • but not efficient for plotting!
    • when we want to plot both series in the same figure, we need to give both names
  • the solution is to “melt” the “wide” data into a “long” format
    • imagine putting a candle below the data.frame and it dripping into a long format
time  variable  value
2000  infl      \(i_{2000}\)
2001  infl      \(i_{2001}\)
2000  unemp     \(u_{2000}\)
2001  unemp     \(u_{2001}\)
  • this is a less memory-/space-efficient way of storing the data, but very efficient for plotting
  • we do this either using data.table::melt(), or tidyr::pivot_longer()
p_df <- pivot_longer(
  data = df,
  cols = 2:3
)
  • compare:
head(df)
##   time          y          z
## 1    1 -1.9267948  0.4950472
## 2    2 -1.9947799 -0.1915507
## 3    3 -1.8873678 -2.0188212
## 4    4 -3.2309729 -1.8502690
## 5    5 -2.0913551  0.3169173
## 6    6  0.4895028  0.8033967
head(p_df)
## # A tibble: 6 × 3
##    time name   value
##   <int> <chr>  <dbl>
## 1     1 y     -1.93 
## 2     1 z      0.495
## 3     2 y     -1.99 
## 4     2 z     -0.192
## 5     3 y     -1.89 
## 6     3 z     -2.02
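  • to make the reshaping step concrete, the same kind of long table can be built by hand in base R
    • this is only a sketch on a tiny made-up table (wide and long are hypothetical names), not how pivot_longer() works internally

```r
# A tiny wide table (made-up numbers) standing in for df.
wide <- data.frame(
  time = 1:3,
  y    = c(0.1, 0.2, 0.3),
  z    = c(1.1, 1.2, 1.3)
)

# Long format by hand: repeat the time index once per series, store the
# column name as its own variable, and stack the values.
long <- data.frame(
  time  = rep(wide$time, times = 2),
  name  = rep(c("y", "z"), each = nrow(wide)),
  value = c(wide$y, wide$z)
)

# Sort by time, then series, to get the same row order as shown in p_df.
long <- long[order(long$time, long$name), ]
long
```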
  • we transformed the data by storing the variable name/identifier as a variable itself, and the values as a separate variable
  • now ggplot knows how to efficiently handle this data, if we tell it where the variable identifiers are
    • this is where group and color come into the aesthetic mapping
ggplot(p_df, aes(x = time, y = value)) +
  geom_line()

  • like this, geom_line() treats all observations as one series and connects them in order of time, so the line zigzags between y and z
  • we have to specify the group
ggplot(p_df, aes(x = time, y = value, group = name)) +
  geom_line()

  • the series are separated, but difficult to distinguish, let’s give them color
ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line()

  • by default, ggplot picks evenly spaced hues, so the colors depend on the number of unique values in name
  • we have precise control using scale_color_manual()
    • with this you can supply your own color palette, or assign particular colors to particular values
  • we have broad control and access to ready-made color palettes using scale_color_brewer()
    • a visualization of some palettes is here
  • you may need to install the RColorBrewer package
ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2")
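  • for hand-picked colors, one option is scale_color_manual() with a named vector of colors
    • a minimal sketch on made-up data; toy_long, p_manual and the two color names are just choices for this example

```r
library(ggplot2)

# Made-up long data with two series named y and z.
set.seed(1)
toy_long <- data.frame(
  time  = rep(1:50, times = 2),
  name  = rep(c("y", "z"), each = 50),
  value = rnorm(100)
)

p_manual <- ggplot(
  toy_long,
  aes(x = time, y = value, group = name, color = name)
) +
  geom_line() +
  scale_color_manual(
    values = c(y = "steelblue", z = "firebrick"),  # named: value -> color
    name   = "Series"                              # legend title
  )
p_manual
```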

ggplot can do calculations
  • ggplot can do more than draw lines, it can do basic statistics
  • let’s look at the distribution of values of the processes
    • this is not necessarily a sensible exercise, but we do it for illustration
    • remember boxplots? a violin plot is a prettier version of them
  • here we overlay both
    • we specify the baseline aes() parameters in the initiation (the axes are the same for violin and box plots)
    • but we set the fill only for the violin plot
    • also we adjust a graphical parameter outside aes() for the boxplot: the opacity alpha
ggplot(p_df, aes(x = name, y = value, group = name)) +
  geom_violin(aes(fill = name)) +
  geom_boxplot(alpha = 0.3)

  • the coloring is supplied in different arguments to aes(), depending on the geom
  • if we plot areas, color refers to the color of the edge, fill to the inside
  • when we want to adjust the color palette when using fill, we have to use scale_fill_brewer()
ggplot(p_df, aes(x = name, y = value, group = name)) +
  geom_violin(aes(fill = name)) +
  geom_boxplot(alpha = 0.3) +
  scale_fill_brewer(
    palette = "Accent",
    name = "Process",
    breaks = c("y", "z"),
    labels = c("AR(1)", "MA(1)")
    ) +
  labs(x = NULL) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
    )
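  • the box of a boxplot summarizes the quartiles of each group, which ggplot computes for us
    • as a rough check, the same three numbers can be computed by hand in base R; this sketch uses made-up data (toy_long, box_stats are hypothetical names) and leaves out the whiskers and outliers

```r
# Made-up long data with two series named y and z.
set.seed(1)
toy_long <- data.frame(
  name  = rep(c("y", "z"), each = 50),
  value = rnorm(100)
)

# One column per series; rows are lower quartile, median, upper quartile.
box_stats <- sapply(
  split(toy_long$value, toy_long$name),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)
box_stats
```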

  • there are often multiple ways of adjusting all those little details
  • sometimes I also lose track; google is your friend!
  • another cool feature: density plots (kernel density estimates)
    • always keep in mind what you want on which axis!
ggplot(p_df, aes(x = value, group = name, color = name)) +
  geom_density()

  • a powerful feature is facet_wrap()
    • splits up the plot by groups
    • the main argument is a formula (like in lm()) naming the variable(s) to split by
    • here we want one panel per series, so we specify name ~ .; the dot is only a placeholder for “no second variable”, and ~ name gives the same result
ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line() +
  facet_wrap(name ~ .)

  • we can also do trendlines using geom_smooth(), e.g.
ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line() +
  geom_smooth(method = "lm") +
  facet_wrap(name ~ .)
## `geom_smooth()` using formula 'y ~ x'
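  • facet_wrap() also takes layout arguments such as ncol and scales
    • a small sketch on made-up data where the two series live on very different scales; toy_long and p_facets are names invented here

```r
library(ggplot2)

# Made-up long data; z is deliberately on a much larger scale than y.
set.seed(1)
toy_long <- data.frame(
  time  = rep(1:50, times = 2),
  name  = rep(c("y", "z"), each = 50),
  value = c(rnorm(50), 10 * rnorm(50))
)

p_facets <- ggplot(toy_long, aes(x = time, y = value, color = name)) +
  geom_line() +
  # one column of panels, each panel with its own y-axis range
  facet_wrap(~ name, ncol = 1, scales = "free_y")
p_facets
```

  • with scales = "free_y" the panels are no longer directly comparable by eye, so whether this is a good idea depends on what the figure should show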

  • a lot to explore; there is a huge range of functions out there

Different Data

  • let’s use the ChickWeight dataset
    • weight of chickens over time on different diets
df <- ChickWeight

head(df)
## Grouped Data: weight ~ Time | Chick
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
  • this data is kind of long already:
    • we have the “value”: weight
    • the running variable Time
    • and two group identifiers: Chick and Diet
  • let’s plot the weight of chickens over time
ggplot(df, aes(x = Time, y = weight, group = Chick, color = Chick)) +
  geom_line() +
  theme(legend.position = "none")

  • remember splitting the aes()?
  • we specify the basic coordinate system in the initiation
  • then we specify the group to plot by, and the color of the lines for geom_line()
  • and we also want boxplots for every period, so we group the geom_boxplot() by Time
ggplot(df, aes(x = Time, y = weight)) +
  geom_line(aes(group = Chick, color = Diet), alpha = 0.5) +
  geom_boxplot(aes(group = Time))

  • we can also split this up into individual plots by Diet, here with a log-scaled y-axis via scale_y_log10()
ggplot(df, aes(x = Time, y = weight)) +
  geom_line(aes(group = Chick), alpha = 0.5) +
  geom_boxplot(aes(group = Time)) +
  facet_wrap(Diet ~ .) +
  scale_y_log10()

Exporting Plots

  • ggplot makes exporting plots more convenient than base R: a single ggsave() call replaces opening and closing a graphics device
  • we store plots as an object
p <- ggplot(df, aes(x = Time, y = weight)) +
  geom_line(aes(group = Chick), alpha = 0.5) +
  geom_boxplot(aes(group = Time)) +
  facet_wrap(Diet ~ .)
  • to create the figure, just call the variable
p

  • to save, use ggsave()
    • by default it saves the last plot, but a more fool-proof way is to supply the plot as an object
ggsave(
  filename = "our_first_export.pdf",
  plot = p,
  device = "pdf",
  width = 20,
  height = 10,
  units = "cm"
)
  • sidenote:
    • I recommend you always export figures as PDF (especially when using them in a TeX or Word document)
    • PDFs are vector graphics, not raster graphics (like PNG or JPEG): you can zoom in as far as you like and the figure does not pixelate
    • also the file size is usually smaller

Takeaways

  • ggplot is powerful, but sometimes it’s tricky to get everything we want
  • there are often multiple ways of achieving the same goal, don’t get confused by different approaches online
  • having the data in the correct format is key!
    • think about what you want to plot, and how to group
    • you usually create several different data.frames for different plots
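  • as an illustration of the last point, here is a sketch that builds a second, aggregated data.frame from ChickWeight just for one plot
    • the choice of aggregate() and the names chicks, avg and p_avg are assumptions made for this example; there are many other ways to summarize the data

```r
library(ggplot2)

# Drop the extra "grouped data" classes and keep a plain data.frame.
chicks <- as.data.frame(ChickWeight)

# A second, smaller data.frame: mean weight per diet and measurement time.
avg <- aggregate(weight ~ Time + Diet, data = chicks, FUN = mean)
head(avg)

# One line per diet, drawn from the aggregated data.
p_avg <- ggplot(avg, aes(x = Time, y = weight, group = Diet, color = Diet)) +
  geom_line() +
  labs(y = "mean weight")
p_avg
```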
