Econometrics 3b

Computer Lab 9

Everything in here is optional to the course. I just want to give you a quick and basic introduction to ggplot2.

`ggplot2`

ggplot2 is probably the most popular plotting engine for R. You can easily create high-quality plots, and tweak them in great detail (this can get harder though). We are going to go through the very basics of creating plots, and how to efficiently structure your data for plotting.

There is a free e-book by the lead developer of ggplot2, Hadley Wickham: “ggplot2: Elegant Graphics for Data Analysis”. The book features hands-on examples and exercises to learn more. I am going to briefly go through some basic and useful features.

The Basics

start with a simulated AR(1) process \(y_t\) of length 100
we have to provide data to ggplot() as a data.frame

library(ggplot2)
library(tidyr)

set.seed(314)

n <- 100

df <- data.frame(
  time = 1:n, 
  y = as.numeric(arima.sim(list(ar = 0.6), n))
  )
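
To make explicit what is being simulated, the AR(1) recursion \(y_t = 0.6\, y_{t-1} + \varepsilon_t\) can be sketched by hand. This is only an illustration: arima.sim() additionally discards a burn-in period, so the numbers will not match those in df. The names phi, eps, y_manual and df_manual are made up for this example.

```r
set.seed(314)

n   <- 100
phi <- 0.6                 # AR coefficient, as in arima.sim(list(ar = 0.6), n)
eps <- rnorm(n)            # standard normal innovations

# build the series recursively: y_t = phi * y_{t-1} + eps_t
y_manual <- numeric(n)
y_manual[1] <- eps[1]
for (t in 2:n) {
  y_manual[t] <- phi * y_manual[t - 1] + eps[t]
}

# ggplot() wants a data.frame, so we store time and values as columns
df_manual <- data.frame(time = 1:n, y = y_manual)
str(df_manual)
```

The resulting df_manual has the same two columns as df, so it could be passed to ggplot() in exactly the same way.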

ggplot() needs two things:
- data
- a mapping, i.e. how to visualize the data

The Mapping

provided through the aes() function (short for aesthetic)
takes basic components like what to put on the axes (the variables for x- and y-axis)
and more involved components like group and color (we will get to these later)
let’s create our first plot:
- the plots are built additively: you initiate the plot, and then add layers to it
- only initiating the plot does nothing but set up the coordinate system (check the axes limits, they are reasonable)

ggplot(df, aes(x = time, y = y))

to add the data, we must specify how we want to display it
this is a time series, so a line makes sense: we add the layer geom_line()

ggplot(df, aes(x = time, y = y)) +
  geom_line()
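
One way to see the “additive” construction is to store the intermediate steps and count their layers. This is a small sketch; the object names p_base, p_line and p_both are chosen just for this example.

```r
library(ggplot2)

set.seed(314)
n <- 100
df <- data.frame(time = 1:n, y = as.numeric(arima.sim(list(ar = 0.6), n)))

# step 1: initiate the plot (data + mapping, no layers yet)
p_base <- ggplot(df, aes(x = time, y = y))
length(p_base$layers)

# step 2: add a layer; p_base itself is left unchanged
p_line <- p_base + geom_line()
length(p_line$layers)

# step 3: layers stack in the order they are added
p_both <- p_line + geom_point()
length(p_both$layers)
```

Because every step returns a new plot object, a common base can be reused with different geoms.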

but we could have just as well used points:

ggplot(df, aes(x = time, y = y)) +
  geom_point()

we can also do unreasonable things, like plotting it as an area

ggplot(df, aes(x = time, y = y)) +
  geom_area()

let’s get back to the line, and adjust the plot a bit
to add axis labels and a title, use labs()
to change the look, we can add a theme_*()
- the built-in themes are listed on the help page ?theme_minimal; many additional packages offer more, just google

ggplot(df, aes(x = time, y = y)) +
  geom_line() +
  labs(
    x = "t", y = "value", 
    title = "Simulated AR(1)", 
    subtitle = "using arima.sim()"
    ) +
  theme_minimal()

lots of options inside the labs() function (even I don’t know all)
if you want a particular look, just google “ggplot how to do X”, there’s loads of guides and answered questions online

More series

let’s simulate another series and plot both
this time an MA(1) process, call it \(z_t\)

df$z <- as.numeric(arima.sim(list(ma = 0.8), n))

a “base R”-style way of plotting both, with one call per series, would be something like this

ggplot(df, aes(x = time)) +
  geom_line(aes(y = y)) +
  geom_line(aes(y = z), color = "red")

note that we can “split up” the specification of the aes()
- both geoms use the same data (specified in the initiation)
- the x-axis is time for both plots, but the y-axis has different values
we can of course also combine different geoms for the two series

ggplot(df, aes(x = time)) +
  geom_line(aes(y = y)) +
  geom_point(aes(y = z), color = "red")

but in general, this is inefficient:
- we know we have two time series, and want to essentially plot them the same way
- how can we tell ggplot that we have two variables, that are distinct, but we want the same treatment? store the variable name as a variable itself!

The Data

usually we work with “wide” data

time	unemp	infl
2000	\(u_{2000}\)	\(i_{2000}\)
2001	\(u_{2001}\)	\(i_{2001}\)

variables are stored per column, the number of rows equals the number of observations
natural for specifying regressions, looking at interactions, etc.
but not efficient for plotting!
- when we want to plot both series in the same figure, we need to give both names
the solution is to “melt” the “wide” data into a “long” format
- imagine putting a candle below the data.frame and it dripping into a long format

time	variable	value
2000	infl	\(i_{2000}\)
2001	infl	\(i_{2001}\)
2000	unemp	\(u_{2000}\)
2001	unemp	\(u_{2001}\)

this is a less memory-/space-efficient way of storing the data, but very efficient for plotting
we do this either using data.table::melt(), or tidyr::pivot_longer()

p_df <- pivot_longer(
  data = df,
  cols = 2:3
)

compare:

head(df)

##   time          y          z
## 1    1 -1.9267948  0.4950472
## 2    2 -1.9947799 -0.1915507
## 3    3 -1.8873678 -2.0188212
## 4    4 -3.2309729 -1.8502690
## 5    5 -2.0913551  0.3169173
## 6    6  0.4895028  0.8033967

head(p_df)

## # A tibble: 6 × 3
##    time name   value
##   <int> <chr>  <dbl>
## 1     1 y     -1.93 
## 2     1 z      0.495
## 3     2 y     -1.99 
## 4     2 z     -0.192
## 5     3 y     -1.89 
## 6     3 z     -2.02
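
The same reshaping can be sketched for the unemployment/inflation table from before. The numbers are invented purely for illustration, and the object names wide, long_manual and long_tidyr are made up. The arguments names_to and values_to of pivot_longer() are used here only to give the new columns explicit names.

```r
library(tidyr)

# hypothetical numbers, only to have something to reshape
wide <- data.frame(
  time  = c(2000, 2001),
  unemp = c(5.1, 4.8),
  infl  = c(2.0, 2.4)
)

# "melting" by hand: stack one block of rows per variable
long_manual <- rbind(
  data.frame(time = wide$time, variable = "infl",  value = wide$infl),
  data.frame(time = wide$time, variable = "unemp", value = wide$unemp)
)

# the same with tidyr, naming the identifier and value columns explicitly
long_tidyr <- pivot_longer(
  data = wide,
  cols = c(unemp, infl),
  names_to = "variable",
  values_to = "value"
)

long_manual
long_tidyr

# one row per (time, variable) pair: rows = observations x variables
nrow(long_tidyr) == nrow(wide) * 2
```

Both versions contain the same information; only the row order differs.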

we transformed the data by storing the variable name/identifier as a variable itself, and the values as a separate variable
now ggplot knows how to efficiently handle this data, if we tell it where the variable identifiers are
- this is where group and color come in in the aesthetic

ggplot(p_df, aes(x = time, y = value)) +
  geom_line()

like this, geom_line() connects all points in order of time and jumps back and forth between the two series, which looks weird
we have to specify the group

ggplot(p_df, aes(x = time, y = value, group = name)) +
  geom_line()
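
To see what the group aesthetic changes, one can inspect the data as ggplot2 processes it with layer_data(). The following is a rough sketch; p_ungrouped and p_grouped are names chosen for this example.

```r
library(ggplot2)
library(tidyr)

set.seed(314)
n <- 100
df <- data.frame(
  time = 1:n,
  y = as.numeric(arima.sim(list(ar = 0.6), n)),
  z = as.numeric(arima.sim(list(ma = 0.8), n))
)
p_df <- pivot_longer(df, cols = c(y, z))

p_ungrouped <- ggplot(p_df, aes(x = time, y = value)) +
  geom_line()
p_grouped <- ggplot(p_df, aes(x = time, y = value, group = name)) +
  geom_line()

# layer_data() returns the data of a layer after ggplot2 has processed it
unique(layer_data(p_ungrouped)$group)  # a single group: one line through all points
unique(layer_data(p_grouped)$group)    # one group per series: two separate lines
```

geom_line() draws one line per group, which is why the ungrouped version zig-zags between the series.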

the series are separated, but difficult to distinguish, let’s give them color

ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line()

by default, ggplot picks evenly spaced hues around the color wheel, depending on the number of unique values in name
we have precise control using scale_color_discrete() or scale_color_manual()
- with this you can supply your own color palette, or specify particular values
we have broad control and access to ready-made color palettes using scale_color_brewer()
- RColorBrewer::display.brewer.all() draws an overview of the available palettes
you may need to install the RColorBrewer package

ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2")
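
If particular colors are wanted instead of a ready-made palette, one option is scale_color_manual(). This is a minimal sketch; the two colors and the name p_manual are arbitrary choices for illustration.

```r
library(ggplot2)
library(tidyr)

set.seed(314)
n <- 100
df <- data.frame(
  time = 1:n,
  y = as.numeric(arima.sim(list(ar = 0.6), n)),
  z = as.numeric(arima.sim(list(ma = 0.8), n))
)
p_df <- pivot_longer(df, cols = c(y, z))

p_manual <- ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line() +
  scale_color_manual(
    values = c(y = "black", z = "firebrick"),  # named vector: series -> color
    name = "Process",
    breaks = c("y", "z"),
    labels = c("AR(1)", "MA(1)")
  )
p_manual

# the colors that were actually assigned to the lines
unique(layer_data(p_manual)$colour)
```

Naming the elements of values ties each color to a series, so the assignment does not depend on the order of the data.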

`ggplot` can do calculations

ggplot can do more than draw lines, it can do basic statistics
let’s look at the distribution of values of the processes
- this is not necessarily a sensible exercise, but we do it for illustration
- remember boxplots? a violin plot is a prettier version of it
here we overlay both
- we specify the baseline aes() parameters in the initiation (the axes are the same for violin and box plots)
- but we set the fill only for the violin plot
- also we adjust a graphical parameter outside aes() for the boxplot: the opacity alpha

ggplot(p_df, aes(x = name, y = value, group = name)) +
  geom_violin(aes(fill = name)) +
  geom_boxplot(alpha = 0.3)
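
To check that ggplot2 really computes statistics here, the numbers behind the boxplot can be pulled out with layer_data() and compared with a calculation by hand. This is only a sketch; p_box and box_stats are names made up for this example.

```r
library(ggplot2)
library(tidyr)

set.seed(314)
n <- 100
df <- data.frame(
  time = 1:n,
  y = as.numeric(arima.sim(list(ar = 0.6), n)),
  z = as.numeric(arima.sim(list(ma = 0.8), n))
)
p_df <- pivot_longer(df, cols = c(y, z))

p_box <- ggplot(p_df, aes(x = name, y = value, group = name)) +
  geom_boxplot()

# the five numbers ggplot2 computed for each box (one row per group)
box_stats <- layer_data(p_box)
box_stats[, c("ymin", "lower", "middle", "upper", "ymax")]

# the "middle" line of each box is the group median
tapply(p_df$value, p_df$name, median)
```

The same idea works for other geoms that summarise the data before drawing.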

the coloring is supplied in different arguments to aes(), depending on the geom
if we plot areas, color refers to the color of the edge, fill to the inside
when we want to adjust the color palette when using fill, we have to use scale_fill_brewer()

ggplot(p_df, aes(x = name, y = value, group = name)) +
  geom_violin(aes(fill = name)) +
  geom_boxplot(alpha = 0.3) +
  scale_fill_brewer(
    palette = "Accent",
    name = "Process",
    breaks = c("y", "z"),
    labels = c("AR(1)", "MA(1)")
    ) +
  labs(x = NULL) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
    )

there are often multiple ways of adjusting all those little details
sometimes I also lose focus; google is your friend!
another cool feature: density plots (i.e. kernel density estimates)
- always keep in mind what you want on which axis!

ggplot(p_df, aes(x = value, group = name, color = name)) +
  geom_density()

a powerful feature is facet_wrap()
- splits up the plot by groups
- the main argument is a formula (like in lm()) specifying the combinations we want
- here we want to split the plot by group, so we specify name ~ .; the dot is just a placeholder for “no variable on this side”, so this is equivalent to ~ name

ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line() +
  facet_wrap(name ~ .)
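
A variation on the faceting above, as a sketch: the one-sided formula ~ name gives the same split, and the arguments ncol and scales of facet_wrap() control the layout and whether panels share axes. The name p_facet is made up for this example.

```r
library(ggplot2)
library(tidyr)

set.seed(314)
n <- 100
df <- data.frame(
  time = 1:n,
  y = as.numeric(arima.sim(list(ar = 0.6), n)),
  z = as.numeric(arima.sim(list(ma = 0.8), n))
)
p_df <- pivot_longer(df, cols = c(y, z))

p_facet <- ggplot(p_df, aes(x = time, y = value, color = name)) +
  geom_line() +
  facet_wrap(~ name, ncol = 1, scales = "free_y")  # stacked panels, separate y-axes
p_facet

# each row of the processed data is assigned to a panel
unique(layer_data(p_facet)$PANEL)
```

Stacking the panels in one column keeps the time axes aligned, which is often convenient for time series.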

we can also do trendlines using geom_smooth(), e.g.

ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line() +
  geom_smooth(method = "lm") +
  facet_wrap(name ~ .)

## `geom_smooth()` using formula 'y ~ x'
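
The trendline is again something ggplot2 calculates for us: with method = "lm" it fits a linear regression of the y-variable on the x-variable within each group. The sketch below compares the fitted line for the AR(1) series with a regression run by hand. The names p_smooth, smooth_y, fit_y and pred_y are made up for this example, and the formula is passed explicitly only to avoid the message above.

```r
library(ggplot2)
library(tidyr)

set.seed(314)
n <- 100
df <- data.frame(
  time = 1:n,
  y = as.numeric(arima.sim(list(ar = 0.6), n)),
  z = as.numeric(arima.sim(list(ma = 0.8), n))
)
p_df <- pivot_longer(df, cols = c(y, z))

p_smooth <- ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line() +
  geom_smooth(method = "lm", formula = y ~ x)

# fitted values of the trendline as computed by ggplot2
# (layer 2 is geom_smooth(); group 1 is the first series, "y")
smooth_y <- subset(layer_data(p_smooth, 2), group == 1)

# the same regression by hand, evaluated at the same x-values
fit_y <- lm(value ~ time, data = subset(p_df, name == "y"))
pred_y <- predict(fit_y, newdata = data.frame(time = smooth_y$x))

all.equal(unname(pred_y), smooth_y$y)
```

If the regression itself is of interest (coefficients, standard errors), it is usually cleaner to run lm() directly and use ggplot2 only for display.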

a lot to explore; there is a huge range of functions out there

Different Data

let’s use the ChickWeight dataset
- weight of chickens over time on different diets

df <- ChickWeight

head(df)

## Grouped Data: weight ~ Time | Chick
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

this data is kind of long already:
- we have the “value”: weight
- the running variable Time
- and two group identifiers: Chick and Diet
let’s plot the weight of chickens over time

ggplot(df, aes(x = Time, y = weight, group = Chick, color = Chick)) +
  geom_line() +
  theme(legend.position = "none")

remember splitting the aes()?
we specify the basic coordinate system in the initiation
then we specify the group to plot by, and the color of the lines for geom_line()
and we also want boxplots for every period, so we group the geom_boxplot() by Time

ggplot(df, aes(x = Time, y = weight)) +
  geom_line(aes(group = Chick, color = Diet), alpha = 0.5) +
  geom_boxplot(aes(group = Time))

we can also split this up into individual plots by Diet, here with a log-scaled y-axis via scale_y_log10()

ggplot(df, aes(x = Time, y = weight)) +
  geom_line(aes(group = Chick), alpha = 0.5) +
  geom_boxplot(aes(group = Time)) +
  facet_wrap(Diet ~ .) +
  scale_y_log10()

Exporting Plots

ggplot offers a superior way of exporting plots compared to base R
we store plots as an object

p <- ggplot(df, aes(x = Time, y = weight)) +
  geom_line(aes(group = Chick), alpha = 0.5) +
  geom_boxplot(aes(group = Time)) +
  facet_wrap(Diet ~ .)

to draw the figure, just call the variable: typing p in the console prints the plot

to save, use ggsave()
- by default it saves the last plot, but a more fool-proof way is to supply the plot as an object

ggsave(
  filename = "our_first_export.pdf",
  plot = p,
  device = "pdf",
  width = 20,
  height = 10,
  units = "cm"
)
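
A small variation for trying this out without cluttering the working directory: write to a temporary folder and check that the file exists. This is a sketch; the file name chick_weight.pdf and the object names are arbitrary.

```r
library(ggplot2)

chicks <- as.data.frame(ChickWeight)

p <- ggplot(chicks, aes(x = Time, y = weight)) +
  geom_line(aes(group = Chick), alpha = 0.5)

# build a path inside R's temporary directory
out_pdf <- file.path(tempdir(), "chick_weight.pdf")

# without a device argument, ggsave() picks the format from the file extension
ggsave(filename = out_pdf, plot = p, width = 20, height = 10, units = "cm")

file.exists(out_pdf)
```

Setting width, height and units explicitly makes the export reproducible, independent of the size of the plotting window.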

sidenote:
- I recommend you always export figures as PDF (especially when using them in a TeX or Word document)
- PDFs are vector graphics, not raster graphics (like a PNG or JPEG): you can zoom in as far as you like, the figure does not pixelate
- also the filesize is usually smaller

Takeaways

ggplot is powerful, but sometimes it’s tricky to get everything we want
there are often multiple ways of achieving the same goal, don’t get confused by different approaches online
having the data in the correct format is key!
- think about what you want to plot, and how to group
- you usually create several different data.frames for different plots
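
As a sketch of the last point, here is a second data.frame built from ChickWeight just for one plot: the mean weight per diet and measurement time. The name mean_df is made up for this example, and aggregate() is only one of several ways to compute group means.

```r
library(ggplot2)

chicks <- as.data.frame(ChickWeight)

# a smaller data.frame made just for this plot:
# one row per combination of Time and Diet, holding the mean weight
mean_df <- aggregate(weight ~ Time + Diet, data = chicks, FUN = mean)
head(mean_df)

# one line per diet instead of one line per chick
ggplot(mean_df, aes(x = Time, y = weight, group = Diet, color = Diet)) +
  geom_line() +
  labs(y = "mean weight")
```

The plotting code stays short because the grouping decision was already made when preparing the data.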

Econometrics 3b

Thore Petersen¹

10.05.2022
