Stockholm University, VT2022
Everything in here is optional to the course. I just want to give you
a quick and basic introduction to ggplot2.
ggplot2

ggplot2 is probably the most popular plotting engine for
R. You can easily create high-quality plots and tweak them
in great detail (though this can get harder). We are going to go through
the very basics of creating plots, and how to structure your
data efficiently for plotting.
There is a free
e-book by the lead developer of ggplot2. The book
features hands-on examples and exercises to learn more. I am going to
briefly go through some basic and useful features.
ggplot() takes its data as a data.frame:

library(ggplot2)
library(tidyr)
set.seed(314)
n <- 100
df <- data.frame(
  time = 1:n,
  y = as.numeric(arima.sim(list(ar = 0.6), n))
)
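ggplot() needs two things: the data, and a mapping created with the
aes() function (short for aesthetic). Besides x and y, the mapping can
also set group and color (we will get to these later).

ggplot(df, aes(x = time, y = y))

On its own this only draws an empty canvas. We add a layer with a geom,
e.g. geom_line():

ggplot(df, aes(x = time, y = y)) +
  geom_line()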

ggplot(df, aes(x = time, y = y)) +
  geom_point()

ggplot(df, aes(x = time, y = y)) +
  geom_area()
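
To make the two ingredients concrete, here is a minimal sketch that stores a plot in an object and inspects it. The names toy, p and p2 are made up for illustration, and the data are not the simulated series from above.

```r
library(ggplot2)

# Hypothetical toy data, only for illustration
toy <- data.frame(time = 1:5, y = c(2, 4, 3, 5, 4))

# Data and mapping alone give an empty canvas with axes
p <- ggplot(toy, aes(x = time, y = y))
length(p$layers)    # number of layers so far

# Each geom_*() call adds one layer on top
p2 <- p + geom_line() + geom_point()
length(p2$layers)

# The mapping is stored with the plot
names(p2$mapping)
```

Because the plot is an ordinary R object, we can keep adding layers to it with + before printing it.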
Labels are set with labs(), and the overall look with one of the
theme_*() functions:

ggplot(df, aes(x = time, y = y)) +
  geom_line() +
  labs(
    x = "t", y = "value",
    title = "Simulated AR(1)",
    subtitle = "using arima.sim()"
  ) +
  theme_minimal()
There are many more options in the labs() function (even I
don’t know them all).

Now we add a second series to the data:

df$z <- as.numeric(arima.sim(list(ma = 0.8), n))

The “base R” way of plotting both would be
something like this:

ggplot(df, aes(x = time)) +
  geom_line(aes(y = y)) +
  geom_line(aes(y = z), color = "red")
Each geom can have its own aes(). Both geoms use the same data
(specified in the initiation). We use time for both plots, but the
y-axis has different values. We can even use different geoms for the
two series:

ggplot(df, aes(x = time)) +
  geom_line(aes(y = y)) +
  geom_point(aes(y = z), color = "red")
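
The way a layer combines its own aes() with the one from the initiation can be checked with layer_data(), which returns the data a layer actually draws. This is only a sketch on a hypothetical toy data.frame:

```r
library(ggplot2)

# Hypothetical toy data, only for illustration
toy <- data.frame(time = 1:4, y = c(1, 3, 2, 4), z = c(4, 2, 3, 1))

p <- ggplot(toy, aes(x = time)) +          # x is shared by all layers
  geom_line(aes(y = y)) +                  # layer 1 adds its own y
  geom_point(aes(y = z), color = "red")    # layer 2 uses a different y

y_line  <- layer_data(p, 1)$y   # values drawn by geom_line()
y_point <- layer_data(p, 2)$y   # values drawn by geom_point()
y_line
y_point
```

Note that color = "red" sits outside aes(): it is a fixed setting for that layer, not a mapping from a column.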
How do we tell ggplot that we have two variables that
are distinct, but should get the same treatment? Store the variable name as
a variable itself! We start from a wide format:

| time | unemp | infl |
|---|---|---|
| 2000 | \(u_{2000}\) | \(i_{2000}\) |
| 2001 | \(u_{2001}\) | \(i_{2001}\) |
Think of the wide data.frame being reshaped into a long format:

| time | variable | value |
|---|---|---|
| 2000 | infl | \(i_{2000}\) |
| 2001 | infl | \(i_{2001}\) |
| 2000 | unemp | \(u_{2000}\) |
| 2001 | unemp | \(u_{2001}\) |
In R this can be done with data.table::melt(), or
tidyr::pivot_longer():

p_df <- pivot_longer(
  data = df,
  cols = 2:3
)
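
The unemp/infl table from above can be reshaped in the same way. The sketch below uses made-up numbers (they are not real unemployment or inflation figures) and names the new columns explicitly with names_to and values_to instead of relying on the defaults name and value.

```r
library(tidyr)

# Hypothetical numbers, only to show the reshaping
wide <- data.frame(
  time  = c(2000, 2001),
  unemp = c(5.6, 5.8),
  infl  = c(0.9, 2.4)
)

long <- pivot_longer(
  data = wide,
  cols = c(unemp, infl),    # columns to stack on top of each other
  names_to = "variable",    # new column holding the old column names
  values_to = "value"       # new column holding the cell values
)
long

# pivot_wider() reverses the operation
back <- pivot_wider(long, names_from = variable, values_from = value)
back
```

Selecting columns by name rather than by position (cols = 2:3) is a design choice: it keeps working if the column order changes.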
head(df)
## time y z
## 1 1 -1.9267948 0.4950472
## 2 2 -1.9947799 -0.1915507
## 3 3 -1.8873678 -2.0188212
## 4 4 -3.2309729 -1.8502690
## 5 5 -2.0913551 0.3169173
## 6 6 0.4895028 0.8033967
head(p_df)
## # A tibble: 6 × 3
## time name value
## <int> <chr> <dbl>
## 1 1 y -1.93
## 2 1 z 0.495
## 3 2 y -1.99
## 4 2 z -0.192
## 5 3 y -1.89
## 6 3 z -2.02
ggplot knows how to efficiently handle this data,
if we tell it where the variable identifiers are.
This is where group and color come in in
the aesthetic. Without group, geom_line() connects all points as a
single series:

ggplot(p_df, aes(x = time, y = value)) +
  geom_line()

With group, each series gets its own line:

ggplot(p_df, aes(x = time, y = value, group = name)) +
  geom_line()

Adding color makes the series distinguishable:

ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line()
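
What group does can also be checked without looking at the picture. The following sketch, on a hypothetical toy_long data.frame, counts how many separate lines each plot draws, using the group column that layer_data() returns.

```r
library(ggplot2)

# Hypothetical long-format toy data, only for illustration
toy_long <- data.frame(
  time  = rep(1:3, each = 2),
  name  = rep(c("y", "z"), times = 3),
  value = c(1, 5, 2, 4, 3, 6)
)

p_one <- ggplot(toy_long, aes(x = time, y = value)) +
  geom_line()
p_two <- ggplot(toy_long, aes(x = time, y = value, group = name)) +
  geom_line()

# Number of separate lines each plot draws
n_one <- length(unique(layer_data(p_one)$group))
n_two <- length(unique(layer_data(p_two)$group))
n_one
n_two
```

Mapping a discrete variable to color also splits the data into groups, so group = name is not strictly needed once color = name is set. Spelling it out does no harm and makes the intent explicit.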
ggplot assigns colors to maximize contrast,
depending on the number of unique values in name. The defaults can be
changed with scale_color_discrete(), or with scale_color_brewer(), which
uses the palettes of the RColorBrewer package:

ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2")
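
To see which colors a palette contains, or to pick the colors ourselves, something like the following sketch can be used. scale_color_manual() is not used elsewhere in these notes; it is shown here as one alternative, and toy_long is a hypothetical data.frame.

```r
library(ggplot2)

# Colors of the "Dark2" palette when three colors are requested
dark2 <- RColorBrewer::brewer.pal(n = 3, name = "Dark2")
dark2

# Hypothetical long-format toy data, only for illustration
toy_long <- data.frame(
  time  = rep(1:3, times = 2),
  name  = rep(c("y", "z"), each = 3),
  value = c(1, 2, 3, 3, 1, 2)
)

# scale_color_manual() assigns one color per value of name
p <- ggplot(toy_long, aes(x = time, y = value, color = name)) +
  geom_line() +
  scale_color_manual(values = c(y = "black", z = "red"))

# Colors that end up in the drawn layer
used <- unique(layer_data(p)$colour)
used
```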
ggplot can do calculations

ggplot can do more than draw lines; it can also compute basic
statistics. We set the shared aes() parameters in the
initiation (the axes are the same for violin and box plots). We set one
parameter outside aes() for the boxplot: the opacity alpha.

ggplot(p_df, aes(x = name, y = value, group = name)) +
  geom_violin(aes(fill = name)) +
  geom_boxplot(alpha = 0.3)
There are further parameters for aes(), depending on the geom. Here,
color refers to the color of the edge, fill to the inside. Since we
mapped fill, we have to use scale_fill_brewer():

ggplot(p_df, aes(x = name, y = value, group = name)) +
  geom_violin(aes(fill = name)) +
  geom_boxplot(alpha = 0.3) +
  scale_fill_brewer(
    palette = "Accent",
    name = "Process",
    breaks = c("y", "z"),
    labels = c("AR(1)", "MA(1)")
  ) +
  labs(x = NULL) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )
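
The statistics behind a boxplot can be pulled out of the plot and compared with a direct calculation. This is a sketch on simulated toy data (not the AR and MA series from above); the column names ymin, lower, middle, upper and ymax are the ones layer_data() returns for geom_boxplot().

```r
library(ggplot2)

# Hypothetical toy data, only for illustration
set.seed(1)
toy_long <- data.frame(
  name  = rep(c("y", "z"), each = 50),
  value = c(rnorm(50), rnorm(50, mean = 2))
)

p <- ggplot(toy_long, aes(x = name, y = value)) +
  geom_boxplot()

# Summary statistics ggplot computed for the two boxes
box <- layer_data(p)[, c("ymin", "lower", "middle", "upper", "ymax")]
box

# The middle line of each box is the group median
med <- tapply(toy_long$value, toy_long$name, median)
med
```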

Another example is geom_density(), which estimates the density of each
series:

ggplot(p_df, aes(x = value, group = name, color = name)) +
  geom_density()
facet_wrap()

facet_wrap() splits a plot into panels. It takes a formula (as in lm())
specifying the combinations we want. Here we use name ~ ., i.e. one panel
per value of name (the dot stands for no further faceting variable):

ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line() +
  facet_wrap(name ~ .)
We can add a fitted trend to each panel with geom_smooth(),
e.g.

ggplot(p_df, aes(x = time, y = value, group = name, color = name)) +
  geom_line() +
  geom_smooth(method = "lm") +
  facet_wrap(name ~ .)
## `geom_smooth()` using formula 'y ~ x'
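
What geom_smooth(method = "lm") computes can be compared with a regression fitted by hand. The sketch below uses simulated toy data and assumes a simple linear trend; passing formula = y ~ x explicitly also avoids the message shown above.

```r
library(ggplot2)

# Hypothetical toy data with a linear trend, only for illustration
set.seed(2)
toy <- data.frame(time = 1:30)
toy$value <- 1 + 0.5 * toy$time + rnorm(30)

p <- ggplot(toy, aes(x = time, y = value)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Fitted line from geom_smooth() (layer 2), evaluated on a grid of x values
smooth <- layer_data(p, 2)
head(smooth[, c("x", "y", "ymin", "ymax")])

# The same line from lm(), predicted at the same x values
fit <- lm(value ~ time, data = toy)
by_hand <- as.numeric(predict(fit, newdata = data.frame(time = smooth$x)))
head(by_hand)
```

The columns ymin and ymax hold the confidence band that is drawn as the grey ribbon.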
ChickWeight dataset
df <- ChickWeight
head(df)
## Grouped Data: weight ~ Time | Chick
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
The variables are weight, Time, Chick and
Diet. We start with one line per chick:

ggplot(df, aes(x = Time, y = weight, group = Chick, color = Chick)) +
  geom_line() +
  theme(legend.position = "none")
What goes into which aes()? geom_line() is grouped by Chick,
geom_boxplot() by Time:

ggplot(df, aes(x = Time, y = weight)) +
  geom_line(aes(group = Chick, color = Diet), alpha = 0.5) +
  geom_boxplot(aes(group = Time))
We can also use one panel per Diet:

ggplot(df, aes(x = Time, y = weight)) +
  geom_line(aes(group = Chick), alpha = 0.5) +
  geom_boxplot(aes(group = Time)) +
  facet_wrap(Diet ~ .) +
  scale_y_log10()
ggplot offers a more convenient way of exporting plots
than base R. First, we store the plot in an object:

p <- ggplot(df, aes(x = Time, y = weight)) +
  geom_line(aes(group = Chick), alpha = 0.5) +
  geom_boxplot(aes(group = Time)) +
  facet_wrap(Diet ~ .)
p
Then we write it to a file with ggsave():

ggsave(
  filename = "our_first_export.pdf",
  plot = p,
  device = "pdf",
  width = 20,
  height = 10,
  units = "cm"
)
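
When experimenting, it can be handy to export to a temporary file first, so that nothing in the working directory is overwritten. A minimal sketch, with out_pdf as a made-up name and a simplified version of the plot:

```r
library(ggplot2)

p <- ggplot(ChickWeight, aes(x = Time, y = weight)) +
  geom_line(aes(group = Chick), alpha = 0.5)

# Temporary file path ending in .pdf
out_pdf <- tempfile(fileext = ".pdf")

# Without a device argument, ggsave() picks the device from the file extension
ggsave(filename = out_pdf, plot = p, width = 20, height = 10, units = "cm")

file.exists(out_pdf)
```

Fixing width, height and units in the call makes the exported figure reproducible, independent of the size of the plot window.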
ggplot is powerful, but sometimes it’s tricky to get
everything we want. It can help to prepare separate data.frames for
different plots.
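
Separate data.frames can also be combined within one plot, because every geom accepts its own data argument. The following sketch overlays the mean weight per diet on the individual chick trajectories; raw and avg are made-up names, and computing the means with aggregate() is just one possible choice.

```r
library(ggplot2)

# Raw data for the thin lines, one row per chick and time point
raw <- as.data.frame(ChickWeight)

# Second data.frame: mean weight per time point and diet
avg <- aggregate(weight ~ Time + Diet, data = raw, FUN = mean)

p <- ggplot(raw, aes(x = Time, y = weight)) +
  geom_line(aes(group = Chick), alpha = 0.3) +    # uses raw
  geom_line(data = avg, aes(color = Diet))        # uses avg instead

# Rows drawn by the second layer come from avg, not from raw
n_avg_layer <- nrow(layer_data(p, 2))
n_avg_layer
```

The second layer still inherits x = Time and y = weight from the initiation, which works because avg has columns with the same names.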