Stockholm University, VT2022

Computer Lab 8

Getting data and APIs

  • APIs are a convenience feature of some data providers
  • the existence of an API should not be the deciding factor when choosing data sources!
  • some example packages for R (far from comprehensive; google “[name of data source] r api” or similar to see if one exists for your chosen data provider):
    • wbstats and WDI: World Bank
    • eurostat: Eurostat
    • fredr: Federal Reserve Economic Data
    • imfr: International Monetary Fund
    • pxweb: statistical agencies with PXWeb APIs, e.g. Statistics Sweden
  • all APIs have idiosyncratic syntax and ways of working
  • most have good documentation:
    • either online
    • or in a vignette
      • type vignette("pxweb"), or go to packages → pxweb → “User guides, package vignettes and other documentation.”
    • or examples in the help files
  • most APIs will return data in an R-friendly format (often data.frames) and require minimal cleaning

Best practice

  • when using an API, you are just connecting to the data provider’s server via the internet through a particular interface
  • it’s best practice to write a separate script to download the data more or less interactively (remember lab 3), and save it
    • save it either as a text file (.csv, .tsv, .txt, …)
    • or as a compressed R data file:
      • saveRDS() for a single object (e.g. a single data.frame or mts); format is .rds
      • save() for multiple objects (e.g. multiple data.frames, and some lists); format is .rdata
  • then write a separate script to clean the data
    • cleaning can involve some trial and error, so you want to keep the raw data untouched
    • save the clean data (typically using saveRDS() or save())
  • and then another separate script for the actual analysis
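
The workflow above can be sketched as follows. This is only a minimal illustration, not a prescribed layout: the data.frame `raw` stands in for whatever an API call would return, and the file names and the use of tempdir() are placeholders.

```r
# Stand-in for data returned by an API call (hypothetical values)
raw <- data.frame(
  time  = c("2000Q1", "2000Q2", "2000Q3"),
  value = c(1.5, NA, 2.5)
)

# Script 1: save the raw download untouched
raw_path <- file.path(tempdir(), "raw_data.rds")  # placeholder path
saveRDS(raw, raw_path)

# Script 2: read the raw file, clean, and save the result separately
clean <- readRDS(raw_path)
clean <- clean[!is.na(clean$value), ]             # example cleaning step
clean_path <- file.path(tempdir(), "clean_data.rds")
saveRDS(clean, clean_path)

# save() stores several objects at once; load() restores them by name
both_path <- file.path(tempdir(), "all_data.rdata")
save(raw, clean, file = both_path)
restored <- new.env()
load(both_path, envir = restored)
ls(restored)
```

The analysis script would then start from readRDS(clean_path) and never touch the raw file.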

Cleaning time

(Pun intended)

Cleaning the time variable is a critical step in every data preparation. Since you will do this with every project that has data with a time dimension, let’s spend some time on this. Cleaning other variables is usually more of a case-by-case business.

  • API data often has the time variable (if frequency is higher than yearly) stored as something like
    • 2000Q1
    • 2000qtr1
    • 2000m01
    • 2000 January
    • 2000-03-21
  • when working with “bigger” datasets it is often better to store them as data.frames, and turn them into ts or mts objects only immediately before running time series functions
  • so we want to turn these into workable time variables to store in a date column in our data.frame

Splitting strings

  • simplest way: separate them into year and quarter/month/week
    • use function substr() to extract parts of the combined string by position, and then use as.numeric()
    • possible, but would not recommend
d <- "2000Q1"

# characters 1-4 hold the year, character 6 holds the quarter
y <- substr(d, start = 1, stop = 4) |> as.numeric()
q <- substr(d, start = 6, stop = 6) |> as.numeric()
  • does not work with month names!
  • pretty generic: R does not know that we want to turn the characters into numbers representing dates
  • errors may creep in unnoticed if the format is not consistent across the data
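
To illustrate the last point, here is a small sketch of how position-based extraction can go wrong without any warning. The second string, "2000M11", is a made-up example of a monthly code that ended up in a quarterly column.

```r
d <- c("2000Q1", "2000M11")   # second entry is a hypothetical stray monthly code

y <- as.numeric(substr(d, start = 1, stop = 4))
q <- as.numeric(substr(d, start = 6, stop = 6))

y
q   # both entries come out as 1: month 11 is silently read as quarter 1
```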

Matching patterns

  • a more sophisticated and robust way is to match patterns
  • the zoo package offers the classes yearqtr and yearmon to store quarterly and monthly dates
  • extract them from the date as character using pattern matching
    • idea: the year is always a 4-digit number, the quarter always a 1-digit number, months have several standard naming conventions, etc.
    • tell R where those are within the string, and it will extract them into a standardized time variable
    • codes are e.g. %Y for 4-digit years, %m for 2-digit month, %B for written out months (English)
      • see ?strptime for a list
Quarterly
library(zoo)

dq1 <- "2000Q2"
dq2 <- "2000qtr2"
dq3 <- "2000 Quarter 2"

dateq1 <- as.yearqtr(dq1, format = "%YQ%q")
dateq2 <- as.yearqtr(dq2, "%Yqtr%q")
dateq3 <- as.yearqtr(dq3, "%Y Quarter %q")

dateq1
## [1] "2000 Q2"
dateq2
## [1] "2000 Q2"
dateq3
## [1] "2000 Q2"
  • zoo stores the time index internally like base R does: 2000 Q1 as 2000.00; 2000 Q2 as 2000.25; …
as.numeric(dateq3)
## [1] 2000.25
  • essentially like the time index of a ts object, but stored as a variable
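
Because the numeric value matches the ts time index, a yearqtr column can be used directly when turning a data.frame column into a ts object just before the analysis. A minimal sketch, with made-up values and hypothetical column names `time` and `gdp`:

```r
library(zoo)

# Hypothetical quarterly data as it might come from an API
df <- data.frame(
  time = c("2000Q2", "2000Q3", "2000Q4", "2001Q1"),
  gdp  = c(100, 101, 103, 102)
)

# as.yearqtr() is vectorized, so the whole column is converted at once
df$date <- as.yearqtr(df$time, format = "%YQ%q")

# Convert to ts only when a time series function needs it
# (assumes rows are sorted by date and there are no gaps)
gdp_ts <- ts(df$gdp, start = as.numeric(df$date[1]), frequency = 4)

# The ts time index lines up with the numeric value of the date column
all.equal(as.numeric(time(gdp_ts)), as.numeric(df$date))
```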
Monthly
dm1 <- "2000m03"
dm2 <- "2000 March"

datem1 <- as.yearmon(dm1, "%Ym%m")
datem2 <- as.yearmon(dm2, "%Y %B")

datem1
## [1] "Mar 2000"
datem2
## [1] "Mar 2000"
  • again, internally this is stored like the base R ts index for monthly data:
as.numeric(datem2)
## [1] 2000.167
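
The same conversion works on a whole column. The sketch below also uses zoo's as.Date() method for yearmon objects, which is handy when a function expects a proper Date. The data and column names are invented for illustration.

```r
library(zoo)

df <- data.frame(
  time = c("2000m03", "2000m01", "2000m02"),
  cpi  = c(101.2, 100.0, 100.5)   # hypothetical values
)

df$date <- as.yearmon(df$time, "%Ym%m")

# yearmon values sort chronologically
df <- df[order(df$date), ]

# Convert to Date: first day of the month by default, last day with frac = 1
as.Date(df$date)
as.Date(df$date, frac = 1)
```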
Separated time variables
y1 <- 2000
m1 <- 11

datec1 <- as.yearmon(
  x = paste(y1, m1, sep = " "), 
  format = "%Y %m"
  )

datec1
## [1] "Nov 2000"
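
When year and month arrive as two numeric columns, the same idea applies column-wise. The sketch below also shows an arithmetic alternative that relies on the year + (month - 1)/12 representation used by yearmon. The values are made up.

```r
library(zoo)

df <- data.frame(
  year  = c(2000, 2000, 2001),
  month = c(11, 12, 1)
)

# Option 1: paste the columns together and parse with a format string
df$date <- as.yearmon(paste(df$year, df$month), format = "%Y %m")

# Option 2: build the numeric index directly
date_num <- as.yearmon(df$year + (df$month - 1) / 12)

# Both routes should print the same months
format(df$date, "%Y-%m")
format(date_num, "%Y-%m")
```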
Daily
  • this one is implemented in base R; dates are stored as objects of class Date
    • how to store dates is a rabbit hole, google POSIXct if you are interested
  • with a format string, as.Date() easily deals with non-standard ways of writing dates (like the American month/day/year order, or the year-day-month string below)
dd1 <- "2000-24-12"

dated1 <- as.Date(dd1, format = "%Y-%d-%m")

dated1
## [1] "2000-12-24"
  • by default, R prints dates as Year-Month-Day (or "%Y-%m-%d"; the ISO 8601 format)
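
A few more base R behaviours are worth knowing when cleaning daily data. The sketch below parses an American-style string, shows that a string which does not match the format yields NA rather than an error, and looks at the number stored behind a Date. The input strings are invented examples.

```r
# American month/day/year order
us <- as.Date("12/24/2000", format = "%m/%d/%Y")
us

# A string that does not match the format becomes NA (no error is raised),
# so checking for NAs after parsing is a cheap sanity check
bad <- as.Date("24.12.2000", format = "%Y-%m-%d")
is.na(bad)

# Internally a Date is the number of days since 1970-01-01
as.numeric(as.Date("1970-01-01"))
as.numeric(us)

# which makes date arithmetic straightforward
us + 7
us - as.Date("2000-01-01")
```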
Sidenote
  • the same formatting codes work in both directions: for parsing strings into dates (as above) and for printing dates as strings with format()
format(datem1, "%Y.%m")
## [1] "2000.03"
format(datem2, "%b %y")
## [1] "Mar 00"
format(datem2, "%b %Y")
## [1] "Mar 2000"
format(datem2, "%B %Y")
## [1] "March 2000"
format(dated1, "%A, %d %B %Y")
## [1] "Sunday, 24 December 2000"
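
Since the codes work in both directions, a date can be written out in one layout and read back with the same format string. A small base R sketch, using numeric-only codes so that the result does not depend on the language settings of the R session (codes such as %B and %A produce month and weekday names in the current locale):

```r
d   <- as.Date("2000-12-24")
fmt <- "%d.%m.%Y"

s <- format(d, fmt)                # Date -> character
s

back <- as.Date(s, format = fmt)   # character -> Date
back == d
```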
