Econometrics 3b

Computer Lab 8

Getting data and APIs

APIs are a convenience feature of some data providers
the existence of an API should not be the deciding factor when choosing data sources!
some example packages for R (far from comprehensive; google “[name of data source] r api” or similar to see if one exists for your chosen data provider):
- wbstats and WDI: World Bank
- eurostat: Eurostat
- fredr: Federal Reserve Economic Data
- imfr: International Monetary Fund
- pxweb: PX-Web APIs, used by a number of national statistical agencies (e.g. Statistics Sweden)
all APIs have idiosyncratic syntax and ways of working
most have good documentation:
- either online
- or in a vignette
  - type vignette("pxweb"), or go to packages → pxweb → “User guides, package vignettes and other documentation.”
- or examples in the help files
most APIs return data in an R-friendly format (often data.frames) that requires minimal cleaning
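A minimal sketch of what such a download can look like, here using the WDI package from the list above. The function name download_gdp, the country, the years, and the choice of indicator are illustrative assumptions; check the package documentation before relying on them. The call is wrapped in a function so that nothing is downloaded until you run it yourself.

```r
# Sketch only: needs the WDI package and an internet connection.
# download_gdp() is a hypothetical helper, not part of any package.
download_gdp <- function(country = "SE", start = 2000, end = 2020) {
  if (!requireNamespace("WDI", quietly = TRUE)) {
    stop("Package 'WDI' is not installed; run install.packages(\"WDI\") first.")
  }
  WDI::WDI(
    country   = country,           # ISO-2 country code (assumed example)
    indicator = "NY.GDP.MKTP.KD",  # GDP in constant US dollars (assumed example)
    start     = start,
    end       = end
  )
}

# Interactive use (not run here, because it needs network access):
# gdp_raw <- download_gdp()
# str(gdp_raw)
```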

Best practice

when using an API, you are just connecting to the data provider’s server via the internet through a particular interface
it’s best practice to write a separate script to download the data more or less interactively (remember lab 3), and save it
- save it either as a text file (.csv, .tsv, .txt, …)
- or as a compressed R data file:
  - saveRDS() for a single object (e.g. a single data.frame or mts); format is .rds
  - save() for multiple objects (e.g. multiple data.frames, and some lists); format is .rdata
then write a separate script to clean the data
- cleaning usually involves some trial and error, so you want to keep the raw data untouched
- save the clean data (typically using saveRDS() or save())
and then another separate script for the actual analysis
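The three-script workflow can be sketched as follows. The data.frame stands in for an API download (the numbers are made up), and the file names and the temporary directory are hypothetical; in a real project each step would live in its own script and the files in a project folder.

```r
# Stand-in for the result of an API download (made-up numbers, illustration only)
raw <- data.frame(
  time  = c("2000Q1", "2000Q2", "2000Q3", "2000Q4", "2001Q1"),
  value = c(100, 101, NA, 103, 102),
  stringsAsFactors = FALSE
)

# Hypothetical file locations; a temporary directory keeps the example self-contained
data_dir   <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
raw_file   <- file.path(data_dir, "gdp_raw.rds")
clean_file <- file.path(data_dir, "gdp_clean.rds")

# Script 1 (download): save the raw data once, exactly as received
saveRDS(raw, raw_file)

# Script 2 (cleaning): read the raw file, clean, and save under a different name
clean <- readRDS(raw_file)
clean <- clean[!is.na(clean$value), ]  # example cleaning step: drop missing values
saveRDS(clean, clean_file)

# Script 3 (analysis): start from the clean file only
dat <- readRDS(clean_file)
```

If a cleaning step turns out to be wrong, only script 2 has to be rerun; the raw file is never overwritten.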

Cleaning time

(Pun intended)

Cleaning the time variable is a critical step in data preparation. Since you will do this in every project whose data has a time dimension, let’s spend some time on it. Cleaning other variables is usually more of a case-by-case business.

API data often has the time variable (if frequency is higher than yearly) stored as something like
- 2000Q1
- 2000qtr1
- 2000m01
- 2000 January
- 2000-03-21
when working with “bigger” datasets it is often better to store them as data.frames, and turn them into ts or mts objects only immediately before running time series functions
so we want to turn these into workable time variables to store in a date column in our data.frame
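A small sketch of this idea, using made-up quarterly numbers and hypothetical column names. The data stays in a data.frame for all data work, and the ts object is created only when a time series function needs it.

```r
# Made-up quarterly data kept in a data.frame (values are illustrative only)
df <- data.frame(
  year    = rep(2000:2001, each = 4),
  quarter = rep(1:4, times = 2),
  gdp     = c(100, 101, 103, 102, 104, 106, 107, 109)
)

# Data work (subsetting, merging, new columns) is convenient in the data.frame
df$growth <- c(NA, diff(log(df$gdp)))

# Convert to ts only right before the time series function needs it.
# This assumes the rows are sorted by time and have no gaps.
gdp_ts <- ts(df$gdp, start = c(df$year[1], df$quarter[1]), frequency = 4)

acf(gdp_ts, plot = FALSE)
```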

Splitting strings

simplest way: separate them into year and quarter/month/week
- use function substr() to extract parts of the combined string by position, and then use as.numeric()
- possible, but would not recommend

d <- "2000Q1"

# R strings are 1-indexed: characters 1-4 hold the year, character 6 the quarter
y <- substr(d, start = 1, stop = 4) |> as.numeric()
q <- substr(d, start = 6, stop = 6) |> as.numeric()

does not work with month names!
purely generic string handling: R does not know that we want to turn the characters into numbers representing dates
errors may creep in unnoticed if the format is not consistent in the data
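To illustrate how such errors can go unnoticed, here is a small sketch with three made-up monthly strings. Extracting by fixed position returns a number for the first two without any complaint, but none of the three results is the correct month.

```r
d <- c("2000m03", "2000m11", "2000 m03")

# Fixed positions: characters 1-4 are the year, character 6 is taken as the month
y <- as.numeric(substr(d, start = 1, stop = 4))
# suppressWarnings() hides the coercion warning for the third string,
# much like a warning easily gets lost in a long script
m <- suppressWarnings(as.numeric(substr(d, start = 6, stop = 6)))

data.frame(d, y, m)
# m is 0, 1 and NA: March becomes 0, November becomes 1,
# and only the third string produces a visible NA
```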

Matching patterns

more sophisticated and robust way is to match patterns
the zoo package offers the classes yearqtr and yearmon to store quarterly and monthly dates
extract them from the date as character using pattern matching
- idea: the year is always a 4-digit number, the quarter always a 1-digit number, months have several standard naming conventions, etc.
- tell R where those are within the string, and it will extract them into a standardized time variable
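- codes are e.g. %Y for 4-digit years, %m for 2-digit months, %B for written-out month names (in the current locale, here English), and %q for the quarter (zoo only)
  - see ?strptime for a list of the standard codes, and ?as.yearqtr for %q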

Quarterly

library(zoo)

dq1 <- "2000Q2"
dq2 <- "2000qtr2"
dq3 <- "2000 Quarter 2"

dateq1 <- as.yearqtr(dq1, format = "%YQ%q")
dateq2 <- as.yearqtr(dq2, "%Yqtr%q")
dateq3 <- as.yearqtr(dq3, "%Y Quarter %q")

dateq1

## [1] "2000 Q2"

dateq2

## [1] "2000 Q2"

dateq3

## [1] "2000 Q2"

zoo stores the time index internally like base R does for ts objects, as year plus fraction of the year: 2000 Q1 → 2000.00; 2000 Q2 → 2000.25; …

as.numeric(dateq3)

## [1] 2000.25

essentially like the time index of a ts object, but stored as a variable
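A sketch of how this looks for a whole column, using a made-up data.frame with hypothetical column names. It also compares the stored values with the time index of a ts object built from the same data.

```r
library(zoo)

# Made-up raw data with a character time column (values are illustrative only)
df <- data.frame(
  time  = c("2000Q1", "2000Q2", "2000Q3", "2000Q4"),
  value = c(100, 101, 103, 102),
  stringsAsFactors = FALSE
)

# as.yearqtr() is vectorised, so the whole column is converted in one call
df$date <- as.yearqtr(df$time, format = "%YQ%q")

# The numeric representation is year + (quarter - 1) / 4
as.numeric(df$date)

# A ts object started at the first date has the same numbers as its time index
value_ts <- ts(df$value, start = as.numeric(df$date[1]), frequency = 4)
all.equal(as.numeric(time(value_ts)), as.numeric(df$date))
```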

Monthly

dm1 <- "2000m03"
dm2 <- "2000 March"

datem1 <- as.yearmon(dm1, "%Ym%m")
datem2 <- as.yearmon(dm2, "%Y %B")

datem1

## [1] "Mar 2000"

datem2

## [1] "Mar 2000"

again, internally this is stored like the base R ts index for monthly data:

as.numeric(datem2)

## [1] 2000.167
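The number above is year + (month - 1) / 12. A short sketch of what follows from this representation, assuming zoo is loaded: adding 1/12 moves the date forward by one month, and the values line up with the time index of a monthly ts object.

```r
library(zoo)

datem <- as.yearmon("2000m03", "%Ym%m")

# Internal value: year + (month - 1) / 12
all.equal(as.numeric(datem), 2000 + (3 - 1) / 12)

# Adding 1/12 moves forward one month, adding 1 a full year
next_month <- datem + 1 / 12
next_year  <- datem + 1

# Compare with the time index of a monthly ts starting in March 2000
x <- ts(1:3, start = c(2000, 3), frequency = 12)
time(x)
```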

Separated time variables

y1 <- 2000
m1 <- 11

datec1 <- as.yearmon(
  x = paste(y1, m1, sep = " "), 
  format = "%Y %m"
  )

datec1

## [1] "Nov 2000"
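The same approach works on whole columns. The sketch below uses a made-up data.frame with hypothetical column names, and adds a second option that skips the string step by building the numeric representation directly.

```r
library(zoo)

df <- data.frame(
  year  = c(2000, 2000, 2001),
  month = c(11, 12, 1)
)

# Option 1: paste the columns together and parse the result, as above
df$date1 <- as.yearmon(paste(df$year, df$month), format = "%Y %m")

# Option 2: build the numeric representation directly: year + (month - 1) / 12
df$date2 <- as.yearmon(df$year + (df$month - 1) / 12)

df
```

Both columns should show Nov 2000, Dec 2000 and Jan 2001. For quarterly data the analogous calls would use as.yearqtr() with "%Y %q" and (quarter - 1) / 4.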

Daily

this one is implemented in base R; dates are stored as objects of class Date
- how to store dates is a rabbit hole, google POSIXct if you are interested
the format argument lets you deal with non-standard ways of writing dates (like the American month-first order, or the year-day-month order below)

dd1 <- "2000-24-12"

dated1 <- as.Date(dd1, format = "%Y-%d-%m")

dated1

## [1] "2000-12-24"

by default, R prints dates as Year-Month-Day (or "%Y-%m-%d"; the ISO 8601 format)
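A few more cases in base R, as a sketch with made-up strings. Note in particular that a format that does not match the string gives NA instead of an error, so it is worth checking for NAs after converting a whole column.

```r
# American style: month/day/year
us <- as.Date("12/24/2000", format = "%m/%d/%Y")

# German style: day.month.year
de <- as.Date("24.12.2000", format = "%d.%m.%Y")

# Both describe the same day
us == de

# A non-matching format yields NA rather than an error
dd     <- c("2000-12-24", "24.12.2000")
parsed <- as.Date(dd, format = "%Y-%m-%d")
anyNA(parsed)

# Internally, a Date is the number of days since 1970-01-01
as.numeric(as.Date("1970-01-02"))
```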

Sidenote

these weird formatting codes work in both directions: format() uses them to turn a date back into a string

format(datem1, "%Y.%m")

## [1] "2000.03"

format(datem2, "%b %y")

## [1] "Mar 00"

format(datem2, "%b %Y")

## [1] "Mar 2000"

format(datem2, "%B %Y")

## [1] "March 2000"

format(dated1, "%A, %d %B %Y")

## [1] "Sunday, 24 December 2000"

Thore Petersen¹

04.05.2022
