Stockholm University, VT2022

Computer Lab 1: R Intro

Resources (all available for free)

  • “An Introduction to R” ebook that starts from scratch and includes videos (for beginner R users)
  • introduction by the developers of R (for R beginners with other programming experience)
  • “R for Data Science” ebook for a hands-on introduction to data science in R with a particular focus on packages from the tidyverse (for beginner and intermediate R users)
  • “Efficient R Programming” ebook on how to efficiently write code, and how to write efficient code (for intermediate and advanced R users)
  • “R Packages” ebook on how to write and publish R packages (for intermediate and advanced R users)
  • advanced tips and tricks with a nerdy presentation
  • “Advanced R” ebook that goes deep, yet is easy to follow

What is R?

  • open-source programming language developed for statistical analysis, computation, and visualization
  • R compared to Stata, SPSS, or SAS
    • has no point-and-click interface
    • has a steeper learning curve
    • is more difficult to use if you “just want to run regressions”
    • is much more flexible (useful when cleaning data, or when using more involved estimation procedures)
    • has a vastly superior online support community and more learning resources (almost any question you might have is probably already answered on Stack Overflow)
    • is a “fully fledged” programming language, i.e. can do more
  • R compared to Matlab, Octave, or Julia
    • is more focussed on statistical analysis (i.e. it is easier to do)
    • has better data-handling capabilities
    • is typically slower
  • R compared to Python
    • is more focussed on statistical analysis (i.e. it is easier to do)
    • has better out-of-the-box data-handling capabilities
    • is less of an all-round language

R and RStudio

  • R is the language in the background, RStudio is “optional”
  • RStudio is an IDE (integrated development environment), and a very good one at that (de facto standard)
  • use it to write R scripts, (interactively) execute code, look at plots, and look at data
  • note that correct R code can run without RStudio

Download R here

Download RStudio here

Using R

RStudio

  • default window arrangement
    • left is console, the current R instance
    • top right is environment (current variables), history (previously executed commands)
    • bottom right is very important: help, plots, and more
  • working with projects is highly recommended
    • forces you to keep code and data structured
    • idea is to keep everything related to a “project” in a single folder and subfolders
      • both scripts and data
      • allows you to use relative file paths
      • reproducible by just sharing that folder
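
A minimal sketch of what a relative path looks like inside a project. The folder name data and the file name inflation.csv are hypothetical and only serve as an illustration.

```r
# Build a path relative to the project folder (both names are hypothetical)
data_path <- file.path("data", "inflation.csv")
data_path

# Where is R currently looking, and does the file exist there?
getwd()
file.exists(data_path)
```

Since the path never mentions a drive or a user name, the same script should work for anyone who receives the whole project folder.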

Overpowered calculator

1 + 1
## [1] 2
2 - 1
## [1] 1
2 * 5
## [1] 10
4.4 / 2
## [1] 2.2
2^5
## [1] 32
log(10)
## [1] 2.302585
exp(2)
## [1] 7.389056

Assignments

x <- 2
x
## [1] 2
3 * x
## [1] 6
log(x)
## [1] 0.6931472
  • assignment possible using <- or =
    • R convention is to use <-
    • = also used for keyword arguments in function calls (more later)
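
A short sketch of the two roles, using only base R functions: <- creates an object, while = inside a call names an argument.

```r
# <- assigns: this creates an object called x
x <- c(4, 8, 15)

# = inside a function call names an argument
round(2.71828, digits = 2)

# Named arguments can be given in any order
seq(by = 2, to = 9, from = 1)
```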

Data Modes

  • the three most important data modes: numeric, character, and logical
y <- "tjena"
z <- TRUE
  • mode determines what can be done with the object
try(log(y))
## Error in log(y) : non-numeric argument to mathematical function
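
A small sketch of how to inspect the mode of an object before using it, and how to convert explicitly where that makes sense. It uses only base R.

```r
y <- "tjena"
z <- TRUE

mode(3.5)   # "numeric"
mode(y)     # "character"
mode(z)     # "logical"

# Check the mode before doing arithmetic
is.numeric(y)

# A character string that looks like a number can be converted explicitly
as.numeric("2.5") + 1
```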

Data Structures

  • four important data structures: vector, matrix, data.frame, and list
    • special cases:
      • a scalar is a length-1 vector
      • a matrix is a 2-dimensional array
      • a data.frame is a special list
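
The special cases above can be checked directly in the console. The object names s, m, and d are made up for this sketch.

```r
# A scalar is a vector of length 1
s <- 2
is.vector(s)
length(s)

# A matrix is an array with two dimensions
m <- matrix(1:4, nrow = 2)
is.array(m)
dim(m)

# A data.frame is a list (of columns)
d <- data.frame(a = 1:2, b = c("tja", "tjena"))
is.list(d)
```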
vector and matrix
  • can contain only elements of one mode
vec1 <- c(1, 2)
vec2 <- c("tja", "tjena")
vec3 <- c(TRUE, FALSE)

matr1 <- matrix(c(1, 2, 3, 4), nrow = 2)
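
Because a vector holds only one mode, mixing modes in c() does not raise an error; R silently converts all elements to the most general mode. A short sketch (the name mixed is made up):

```r
# Numbers and logicals are turned into character strings
mixed <- c(1, "tja", TRUE)
mixed
mode(mixed)

# Without a character element, TRUE becomes 1
c(1, TRUE)
```

This is worth keeping in mind when a column of numbers unexpectedly shows up as text.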
list
  • can contain anything (even more lists)
list1 <- list(2, "tja", TRUE, list("bla bla"))
  • list is very versatile and flexible, but you have to be careful
  • more or less irrelevant for us now
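
For later reference, a sketch of how list elements are accessed. The second list, list2, and its element names are made up for illustration.

```r
list1 <- list(2, "tja", TRUE, list("bla bla"))

# [[ ]] extracts a single element
list1[[2]]

# [ ] returns a (shorter) list
class(list1[2])

# Nested lists are reached by chaining [[ ]]
list1[[4]][[1]]

# Elements can also be named and accessed with $
list2 <- list(number = 2, greeting = "tja")
list2$greeting
```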
data.frame
  • workhorse for data analysis
  • like a matrix, but data modes can differ between columns
df1 <- data.frame(x = c(1, 2), y = c("tja", "tjena"))
df2 <- data.frame(x = vec1, y = vec2)

df1
##   x     y
## 1 1   tja
## 2 2 tjena
df2
##   x     y
## 1 1   tja
## 2 2 tjena
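
A sketch of how to look inside a data.frame. The argument stringsAsFactors = FALSE is added here as an assumption so that the result is the same on older R versions; it has been the default since R 4.0.0.

```r
df1 <- data.frame(x = c(1, 2), y = c("tja", "tjena"),
                  stringsAsFactors = FALSE)

# One column as a vector
df1$x

# Structure: one line per column, with its type
str(df1)

# Each column has its own mode
sapply(df1, mode)
```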
Subsetting
  • access individual elements, or ranges of elements using []
  • R is one-indexed: first element of vector is x[1] (unlike e.g. Python, where it is x[0])
# shorthand to create a vector containing 1, 2, ..., 10
x <- 1:10

x[3]
## [1] 3
x[5:8]
## [1] 5 6 7 8
  • matrix subsetting similar, use [row, column]
y <- matrix(1:4, nrow = 2, byrow = TRUE)

y[1, 1]
## [1] 1
y[, 2]
## [1] 2 4
  • note that the last command returns a “directionless” vector; we extract a column of the matrix, but get a “standard” vector (i.e. not row or column)
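
If the column should stay a matrix, the drop argument of [ can be set to FALSE. A minimal sketch (the names col2 and col2_matrix are made up):

```r
y <- matrix(1:4, nrow = 2, byrow = TRUE)

# Default: the result is simplified to a plain vector
col2 <- y[, 2]
dim(col2)          # NULL: a plain vector has no dimensions

# drop = FALSE keeps the result as a 2 x 1 matrix
col2_matrix <- y[, 2, drop = FALSE]
dim(col2_matrix)   # 2 1
```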

Time Series

  • R comes with built-in time series capabilities
x <- ts(1:12)
class(x)
## [1] "ts"
# above is kind of useless, could have just as well used x <- 1:12

# useful feature is to add date and frequency info
x <- ts(1:12, start = c(2022, 1), frequency = 12)

# note x-axis label
plot(x)

  • using ts() not strictly necessary, but useful for understanding data and easier plotting
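
A sketch of what the date information makes possible beyond plotting, using functions from the stats package that ships with R:

```r
x <- ts(1:12, start = c(2022, 1), frequency = 12)

# Query the date information stored in the object
start(x)
end(x)
frequency(x)

# window() extracts a sub-period by date instead of by position
window(x, start = c(2022, 4), end = c(2022, 6))
```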

Functions

  • every operation is a function
  • performs an action, or sequence of actions
  • has a name, and arguments
  • takes objects as inputs
  • e.g. c() takes any number of objects (for instance two scalars) as inputs, and combines/concatenates them into one vector
  • all functions have help-pages, describing what the function does, input arguments, and output
help(c)
?c
  • examples
x <- seq(1, 5, by = 1)
sum(x)
## [1] 15
mean(x)
## [1] 3
r <- rnorm(10)
  • functions have default values, look at help page of rnorm
  • functions are organized in packages
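
A sketch of how default values behave. According to its help page, rnorm has the defaults mean = 0 and sd = 1; the function scale_by and its argument k are hypothetical and only illustrate how a default is declared.

```r
# rnorm(n, mean = 0, sd = 1): only n has to be supplied
set.seed(1)
r1 <- rnorm(5)

set.seed(1)
r2 <- rnorm(5, mean = 0, sd = 1)

# Spelling out the defaults gives the same draws
identical(r1, r2)

# A user-written function with a default argument (hypothetical example)
scale_by <- function(x, k = 2) {
  x * k
}
scale_by(1:3)
scale_by(1:3, k = 10)
```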

Packages

  • look at base package in RStudio
  • a great advantage of open-source languages like R or Python is the huge universe of user-written packages
  • anyone can write and publish a package
  • CRAN (Comprehensive R Archive Network) is the central repository
Is a package “good”?
  • when googling for how to do things in R, sometimes you find very particular packages that promise to help you
    • but is a package “good” (i.e. bug-free and does what it says)?
    • google “cran [package name]”, and click the first link
    • e.g. “cran dynlm” points us to here
    • check the package’s metadata:
      • version 0.3-6: be alert, early version of the package
      • published 2019: OK
      • author Achim Zeileis with an email address at r-project.org: good sign
    • another example: “cran did” here
      • advanced version number
      • published very recently
      • authored by one of the developers of the estimator himself -> good sign
    • beware of low version numbers, old publication dates, and random authors: the code might contain bugs
    • CRAN runs technical checks of the code before publication (i.e. do the functions run without errors/warnings), but does not check whether the numbers are correct
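
For a package that is already installed, the same metadata can be read from within R. A sketch using the stats package, which ships with R, as a stand-in for any package name:

```r
# Version of an installed package
packageVersion("stats")

# Selected fields of the package's DESCRIPTION file
desc <- packageDescription("stats")
desc$Package
desc$Version

# How the authors ask to be cited
citation("stats")
```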

Console and Scripts

  • essentially two modes of operation: interactive with the console, or “organized” with scripts
  • R scripts are text documents (like .txt, .csv, .py, .do, …) with file extension .R
  • write your code sequentially in the script, and execute either all at once from the command line, or line-by-line in RStudio

Hands-On

Installing Michael’s TS package

  • open RStudio, create project (new directory, call it “Lab-1” or something)
    • remember the location
  • download .tar.gz file from athena, put it into the folder just created for the project
  • go to “Console” in RStudio, execute list.files()
    • should show "TS_1.0-2022.1.tar.gz"
  • execute
install.packages("urca")
install.packages("vars")
install.packages("TS_1.0-2022.1.tar.gz", type = "source", repos = NULL)
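
One way to check that all three installations worked, sketched with a small helper whose name missing_packages is hypothetical. It relies on requireNamespace(), which returns FALSE instead of stopping when a package cannot be loaded.

```r
# Hypothetical helper: which of the given packages cannot be loaded?
missing_packages <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

# An empty result, character(0), means everything is installed
missing_packages(c("urca", "vars", "TS"))
```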

Writing our first script

  • create new R script, save it in the project directory
  • add comment at top of file describing what this is
  • load TS package
  • load KPIF data
  • calculate average inflation
# Calculate average inflation rate in Sweden

library(TS)

data("KPIF")

print("The average inflation rate in Sweden is")

mean(KPIF)
  • save file
  • in RStudio go to bottom left, “Terminal”
  • type Rscript "[filename].R", enter
  • the script is then run in batch mode
    • Michael requires this for all assignments and fails you if it does not work, so test this!
    • reasons for it not to work (on Michael’s computer):
      • you use setwd() with absolute paths in the script (specify all paths relative to project directory)
      • the order of commands is wrong (need to load the package first, before we can load the data)
      • a simple coding error

More?

  • RStudio with integrated tutorials through learnr package
    • I have not used it, so I cannot judge it
    • introduces the tidyverse, which is a particular “style” of using R
    • access in RStudio in top-right
  • swirl package
    • interactive in the R console
    • that’s how I learned the basics
install.packages("swirl")
library(swirl)
swirl()
