Stockholm University, VT2022

Computer Labs 2 and 3: Exercises

Exercise 1

Write a commented script that does the following:

  1. Create two normally distributed realizations of length 100, both with mean one, but one with variance one and the other with variance two.
  2. Test whether sample means are different from zero.
  3. Test whether the sample means are the same or not.
  4. Run the script in batch mode, so that a result file is created.

Solution

Let us start by creating a new project for this lab.

a)

Create two normally distributed realizations of length 100, both with mean one, but one with variance one and the other with variance two.

  • two normally distributed realizations of length 100 = two samples of random draws from a Normal/Gaussian distribution

  • computers cannot be “truly” random; they only execute whatever we ask them to

  • we speak of “pseudorandomness” when generating “random” numbers on a computer

  • any sequence or sample of such numbers seems random (i.e. it satisfies certain statistical properties), but is in fact deterministic

  • random numbers are useful for many modelling tasks, and you may need a lot of them (millions, if not more, to solve some models)

    • R’s default algorithm (Mersenne Twister) has a period of \(2^{19937} - 1\) (that’s a lot of numbers)
  • nice side effect of the numbers being deterministic: models are (theoretically) perfectly reproducible

  • how it works:

    • choose algorithm to generate numbers (good default choices in R)
    • set the “seed”, i.e. the starting point of the sequence
    • given that seed, the number generator will always produce the same sequence of “random” numbers (see the small sketch after this list)
    • read more by typing ?Random in R
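A minimal sketch of this reproducibility (using an arbitrary seed): re-setting the same seed reproduces exactly the same draws.

set.seed(123)
a <- rnorm(3)
set.seed(123)
b <- rnorm(3)
identical(a, b)  # TRUE: same seed, same "random" numbers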

So let’s generate the samples.

set.seed(2)

n <- 100
mu <- 1

sigx <- 1
sigy <- 2

x <- rnorm(n = n, mean = mu, sd = sqrt(sigx))
y <- rnorm(n, mu, sqrt(sigy))
  • good practice to declare all parameters (length, mean, and variances) in one place, preferably at the beginning of the script
  • call the function rnorm() to generate/draw the samples
    • note that we can use the sqrt() function inside the call: it is evaluated first, and the result is passed as the sd argument

Let’s have a look at the random samples

plot(x)

plot(y)

Looks pretty random, but both centered around 1, as expected.

Look at histograms (we overlay the theoretical density):

  • probability = TRUE puts densities (rather than counts) on the \(y\)-axis, so the histogram is on the same scale as the theoretical density
  • curve() takes a function of x and overlays it on the previous plot if add = TRUE
hist(x, probability = TRUE)
curve(dnorm(x, mean = mu, sd = sqrt(sigx)), from = -2, to = 4, add = TRUE)

hist(y, probability = TRUE)
curve(dnorm(x, mean = mu, sd = sqrt(sigy)), from = -3, to = 5, add = TRUE)

b)

Test whether sample means are different from zero.

  • this is done with one of the simplest tests: the \(t\)-test
t.test(x, mu = 0)
## 
##  One Sample t-test
## 
## data:  x
## t = 8.3547, df = 99, p-value = 4.148e-13
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.7390951 1.1995086
## sample estimates:
## mean of x 
## 0.9693018
  • interpretation of output
    • t = 8.3547: value of the \(t\)-statistic (\(\frac{\bar{x}}{\hat{\sigma} / \sqrt{n}}\))
    • df = 99: degrees of freedom of the \(t\)-distribution
    • p-value = 4.148e-13: probability mass of the \(t_{\nu = 99}\) distribution beyond \(\pm 8.3547\) (two-sided; a by-hand check follows after this list)
    • conclusion: we reject the Null at any reasonable significance level
      • \(H_0: \mu = 0\) and \(H_1: \mu \neq 0\)
      • we work with the alternative, i.e. that the true mean is not zero
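A quick by-hand check (just a sketch) recomputes the \(t\)-statistic and the two-sided \(p\)-value for x:

tstat <- mean(x) / (sd(x) / sqrt(length(x)))
tstat  # should match t = 8.3547 above
2 * pt(abs(tstat), df = length(x) - 1, lower.tail = FALSE)  # two-sided p-value

Now the same test for the second sample: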
t.test(y, mu = 0)
## 
##  One Sample t-test
## 
## data:  y
## t = 7.4727, df = 99, p-value = 3.175e-11
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.764817 1.317814
## sample estimates:
## mean of x 
##  1.041316
  • same interpretation
  • but note that the \(p\)-value is much larger (albeit still tiny)
    • the second sample has a greater variance, so the test is less “sure” of the result
c)

Test whether the sample means are the same or not.

  • testing for equality of sample means also done using a \(t\)-test
  • implemented in R using the same function, but now with different arguments
    • in b) we supply one vector (the sample) and specify mu = 0
    • R “understands” we want to test the mean of that single sample
    • now supply two vectors, and nothing else
    • default is then to compare the means of those two
t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -0.39716, df = 191.71, p-value = 0.6917
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4296611  0.2856333
## sample estimates:
## mean of x mean of y 
## 0.9693018 1.0413158
  • interpretation similar again
  • but now p-value = 0.6917
    • much larger than before
    • \(H_0: \mu_x = \mu_y\) and \(H_1: \mu_x \neq \mu_y\)
    • we can NOT reject the Null, i.e. we keep working under the assumption that both processes have the same true mean (reassuring, since we know this is in fact true)
    • note that R defaults to the Welch test, which does not assume equal variances; a pooled-variance variant is sketched below
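As an aside, a sketch of the classic pooled-variance version of the test (which assumes equal variances, so it is arguably less appropriate here, since we generated the two samples with different variances):

t.test(x, y, var.equal = TRUE)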
d)

Run the script in batch mode, so that a result file is created.

  • comment out the plotting stuff, so that we are left with
# Econometrics 3b
# Exercise 1

# parameters
set.seed(2)

n <- 100
mu <- 1

sigx <- 1
sigy <- 2

# generate samples
x <- rnorm(n = n, mean = mu, sd = sqrt(sigx))
y <- rnorm(n, mu, sqrt(sigy))

# testing
t.test(x, mu = 0)
t.test(y, mu = 0)
t.test(x, y)
  • save as, for example, exercise-1.R
  • execute in Terminal/Console:
    • R CMD BATCH is more flexible than Rscript from last time (an Rscript alternative is sketched at the end of this part)
    • --vanilla runs R with a clean default configuration: no saved/restored workspace, no user profile files, etc.
    • "exercise-1.R" is our script, followed by a space and "output.txt", the file to which all output will be written (the quotes can be omitted if the file names contain no spaces)
R CMD BATCH --vanilla "exercise-1.R" "output.txt"
  • look in Finder/Explorer for the file output.txt
    • it now contains the results
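As an alternative sketch using Rscript (mentioned above), the printed output can be redirected to a file; note that, unlike R CMD BATCH, Rscript does not echo the commands themselves, only their output:

Rscript --vanilla exercise-1.R > output.txt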

Exercise 2

Write a commented script that estimates \(\Delta \text{gdpjp}_t = \phi_0 + \phi_1 \Delta \text{gdpjp}_{t - 1} + e_t\) with OLS, where \(\text{gdpjp}\) is found in the data set Macrodat in package Ecdat. Test estimated residuals for autocorrelation. Run the script in batch mode so that a result file is created.

  1. Repeat with the variable \(\Delta \text{Irates1}\), but now use both lm() and dynlm().
  2. Repeat but estimate the covariance matrix with the Newey-West estimator.
  3. Download the 12-month inflation rate according to the KPIF from SCB using the package pxweb. Write a script that documents the source of the data and saves the data. Estimate a deterministic trend model and compare the results with the HP-filter and with Hamilton’s solution.

Solution

  • time series regressions with the dynlm package
  • load data from the package
    • check Environment, now Macrodat is listed (check its class)
  • specify formula for regression
    • dynlm defines functions like d() and L() for inside formulae
library(dynlm)
library(portes)
library(lmtest)
library(sandwich)

data(Macrodat, package = "Ecdat")
  • workspace now contains Macrodat, check it out
    • head() prints the first elements of any object
head(Macrodat)
##             lhur    punew fyff fygm3 fygt1   exruk   gdpjp
## 1959 Q1 5.833333 28.99333 2.80  2.80  3.61 281.271 14.3195
## 1959 Q2 5.100000 29.04333 3.39  3.21  4.07 281.254 14.8032
## 1959 Q3 5.266667 29.19333 3.76  4.04  5.00 280.370 15.5764
## 1959 Q4 5.600000 29.37000 3.99  4.49  5.14 279.845 15.4222
## 1960 Q1 5.133333 29.39667 3.84  3.31  4.02 280.586 16.5126
## 1960 Q2 5.233333 29.57333 3.32  2.46  3.36 280.268 16.5398
class(Macrodat)
## [1] "mts" "ts"
  • mts and ts: a multivariate time series
    • essentially a matrix whose columns are the individual ts objects
  • matrix columns can have names
colnames(Macrodat)
## [1] "lhur"  "punew" "fyff"  "fygm3" "fygt1" "exruk" "gdpjp"
  • this tells us which column holds which variable
    • we can use these names to extract individual series and when specifying our regressions (a small example follows)
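For instance, a single series can be pulled out by its column name (just a quick look, not needed for the regression itself):

plot(Macrodat[, "gdpjp"])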

Now to specify the model.

formula1 <- d(gdpjp) ~ L(d(gdpjp))
model1 <- dynlm(formula1, data = Macrodat)
  • remember lists?
    • model1 is now a named list containing all the results of the regression
  • most model outputs have a nice summary:
summary(model1)
## 
## Time series regression with "ts" data:
## Start = 1959(3), End = 1999(2)
## 
## Call:
## dynlm(formula = formula1, data = Macrodat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8825 -0.3535 -0.0066  0.4351  2.3532 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.64608    0.07645   8.451 1.79e-14 ***
## L(d(gdpjp)) -0.01733    0.07966  -0.218    0.828    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.723 on 158 degrees of freedom
##   (0 observations deleted due to missingness)
## Multiple R-squared:  0.0002995,  Adjusted R-squared:  -0.006028 
## F-statistic: 0.04733 on 1 and 158 DF,  p-value: 0.8281

Having a look at residuals is always a good idea.

  • note that the class of residuals(model1) is ts
    • use built-in functionality to plot
plot(residuals(model1))

  • does not look perfectly random: some autocorrelation, some “swings”
  • there are clearly some periods of larger variance
    • be careful: if there is still some “structure” in the residuals, the model does not capture everything!

Test for autocorrelation

  • first, just look at the autocorrelation function acf()
acf(residuals(model1))

  • visually there is some autocorrelation
  • formal test is the Ljung-Box test
    • implemented in portes library
    • go back to beginning of script and load library
  • in the call:
    • the residuals() function extracts the time series of residuals
    • lags: the lag orders at which we test (here 1 up to \(q\))
    • order: the number of estimated parameters in the model (intercept + slope), which is subtracted from the degrees of freedom of the test
  • to determine the lag order (lecture 1, slide 90)
    • rule of thumb: use up to \(q = \lfloor 0.75 \cdot T^{1/3} \rfloor\) lags
    • use the nobs() function (short for number of observations) to get \(T\)
q <- floor(0.75 * nobs(model1)^(1/3))
m <- length(coef(model1))

LjungBox(residuals(model1), lags = 1:q, order = m)
##  lags    statistic df      p-value
##     1  0.001115598  0           NA
##     2  3.156573290  0 0.0000000000
##     3 10.280790212  1 0.0013442236
##     4 17.395024245  2 0.0001670008
  • check a few different values of \(q\), to see whether the conclusion is sensitive to this choice (a sketch follows after this list)
    • do this especially in cases in which the \(p\)-values are only marginally significant
  • interpretation:
    • \(p\)-value is much smaller than 0.01 for all lags
    • the Null is no autocorrelation, we reject that
    • working hypothesis: there is autocorrelation in the residuals
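For example, one could re-run the test over a longer, arbitrarily chosen lag range (a sketch; 1:8 is just an illustrative choice):

LjungBox(residuals(model1), lags = 1:8, order = m)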
a)

Repeat with the variable \(\Delta \text{Irates1}\), but now use both lm() and dynlm().

Load the data

data(Irates, package = "Ecdat")
  • start with the “normal” lm()
    • note that d() and L() are functions internal to dynlm
    • the base R functions diff() and lag() look like their counterparts, so let us naively try those
model2_lm <- lm(diff(r1) ~ lag(diff(r1)), data = Irates)

summary(model2_lm)
## Warning in summary.lm(model2_lm): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = diff(r1) ~ lag(diff(r1)), data = Irates)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.311e-14  1.710e-17  2.400e-17  3.170e-17  5.717e-16 
## 
## Coefficients:
##                 Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)   -8.439e-18  2.491e-17 -3.390e-01    0.735    
## lag(diff(r1))  1.000e+00  4.106e-17  2.435e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.733e-16 on 528 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 5.931e+32 on 1 and 528 DF,  p-value: < 2.2e-16

Alarm bells must ring now!

  • R prints a warning message “essentially perfect fit”
  • look at coefficient estimates:
    • coefficient of lagged variable is 1, and highly significant
    • intercept is essentially zero
    • look at residuals
hist(residuals(model2_lm))

  • essentially all zero
  • the implied model is \(\Delta \text{Irates1}_t = 0 + 1 \cdot \Delta \text{Irates1}_{t - 1} + e_t\), with \(e_t \approx 0\) for almost all \(t\), so \(\Delta \text{Irates1}_t = \Delta \text{Irates1}_{t - 1}\): the series would be linear in time
  • is that true? That would be a boring time series
plot(Irates[, "r1"])

  • not very linear, so there is something wrong with the regression!
    • the culprit is base R’s lag(): for a ts object it only shifts the time stamps, not the underlying values, and lm() ignores the time stamps, so we effectively regressed diff(r1) on an identical copy of itself (a quick check follows below)
  • now estimate using dynlm(), which is designed for time series (“dynamic”) regressions
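Before re-running with dynlm(), a minimal check (a sketch) confirms the diagnosis:

dr1 <- diff(Irates[, "r1"])
identical(as.numeric(dr1), as.numeric(lag(dr1)))  # TRUE: same values, only the time index differs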
model2 <- dynlm(d(r1) ~ L(d(r1)), data = Irates)

summary(model2)
## 
## Time series regression with "ts" data:
## Start = 1947(2), End = 1991(2)
## 
## Call:
## dynlm(formula = d(r1) ~ L(d(r1)), data = Irates)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7172 -0.1668  0.0169  0.2111  2.7171 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.009889   0.026439   0.374    0.709
## L(d(r1))    0.021999   0.043559   0.505    0.614
## 
## Residual standard error: 0.608 on 527 degrees of freedom
## Multiple R-squared:  0.0004838,  Adjusted R-squared:  -0.001413 
## F-statistic: 0.2551 on 1 and 527 DF,  p-value: 0.6137

Check the same diagnostics

plot(residuals(model2))

acf(residuals(model2))

  • looks more reasonable
  • test for autocorrelation in residuals again
q <- floor(0.75 * nobs(model2)^(1/3))
m <- length(coef(model2))

LjungBox(residuals(model2), lags = 1:q, order = m)
##  lags    statistic df     p-value
##     1 1.041116e-04  0          NA
##     2 1.644574e-01  0 0.000000000
##     3 7.238289e+00  1 0.007136495
##     4 1.017624e+01  2 0.006169598
##     5 1.027875e+01  3 0.016339371
##     6 1.269716e+01  4 0.012854346
  • the \(p\)-values are below 5% at all lags (where they are defined)
    • we again reject the Null hypothesis of no autocorrelation
b)

Repeat but estimate the covariance matrix with the Newey-West estimator.

  • lm() and dynlm() by default report standard errors that are not robust to heteroskedasticity or autocorrelation
    • they assume that the variance of the residuals is constant, which is questionable looking at the plot of the residuals
    • they assume that the residuals are white noise, i.e. random and not autocorrelated
  • there are heteroskedasticity and autocorrelation consistent (HAC) covariance estimators; the Newey-West estimator is a popular one
    • implemented in the sandwich library
    • use the lmtest library to recompute the coefficient tests for our model
    • go back to the beginning of the script and load those packages

First reproduce the default non-robust standard errors, then estimate the robust version:

coeftest(model2)
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0098888  0.0264394   0.374   0.7085
## L(d(r1))    0.0219992  0.0435593   0.505   0.6137
coeftest(model2, vcov. = NeweyWest(model2))
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0098888  0.0253651  0.3899   0.6968
## L(d(r1))    0.0219992  0.0910067  0.2417   0.8091
  • point estimates are identical, only variance (and thus standard error) is affected
  • variance of coefficient of lagged difference increases: it is a weaker predictor than implied by the basic regression
  • see lecture 1, slides 108ff for details on the usage of NeweyWest() (the covariance matrix itself is sketched below)
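If the covariance matrix itself is of interest, it can be inspected directly (a sketch; the square roots of its diagonal are the robust standard errors reported above):

V_nw <- NeweyWest(model2)
V_nw
sqrt(diag(V_nw))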
c)

Download the 12-month inflation rate according to the KPIF from SCB using the package pxweb. Write a script that documents the source of the data and saves the data. Estimate a deterministic trend model and compare the results with the HP-filter and with Hamilton’s solution.

  • this exercise has two parts:
    1. download data from within R
    2. estimate models using that data
  • the two parts are fundamentally different (one requires interaction, the other does not), so it makes sense to split them into two scripts
  • first, script to download and save data

We are now going to use R interactively. The script should look something like this (and nothing more):

# Econometrics 3b
# Exercise 2c
# Download data

library(pxweb)

dl <- pxweb_interactive()
# sequence to get to desired data:
# 1 1 2 1 18 1 3 1 3 * y n y y n

kpif <- dl$data

# head(kpif, 20)

# first year of data is missing (NA), remove it
kpif <- na.exclude(kpif)

write.csv(kpif, file = "kpif.csv", row.names = FALSE)

# end of script

Execute this one directly in RStudio. It requires interaction, so it makes no sense to run it in batch mode.

Now for the non-interactive/“normal” part. Open a new script, and save it.

library(dynlm)
library(mFilter)

df <- read.csv("kpif.csv")
head(df)
##     månad KPIF..12.månadsförändring..1987.100
## 1 1988M01                                 4.5
## 2 1988M02                                 4.9
## 3 1988M03                                 5.2
## 4 1988M04                                 5.9
## 5 1988M05                                 6.0
## 6 1988M06                                 6.4
  • the data is in a data.frame, but we want to work with time series tools, so convert it to a ts object
infl <- ts(df[, 2], start = c(1988, 1), frequency = 12)

plot(infl)

  • now estimate a deterministic trend model (a linear time trend, to be specific)
    • essentially we fit a straight line, interpreted as the long-run component, to the data
    • easily done with dynlm and its trend() function
model_dt <- dynlm(infl ~ trend(infl))
  • what does that look like?
    • use ts.plot() to plot multiple timeseries in the same figure
    • fitted() function returns the fitted values of the model, i.e. the model’s prediction
    • the lty = c(1, 2) argument specifies the line types of the series: 1 solid, 2 dashed, … (see ?par)
ts.plot(infl, fitted(model_dt), lty = c(1, 2))

  • linear trend in inflation rate makes little sense economically
  • something more sophisticated: Hodrick-Prescott filter
    • implemented in mFilter package
    • idea is to decompose the series into cyclical and trend components, and filter out some noise
model_hp <- hpfilter(infl)

ts.plot(infl, fitted(model_hp), lty = c(1, 2))
legend(
  "topright",
  legend = c("KPIF", "HP-filtered trend"),
  lty = c(1, 2)
)

  • looks much more reasonable
  • key is the smoothing parameter
    • see lecture 2, slide 25
    • there are rule-of-thumb values for different data frequencies, but we can play around:
model_hp_quarter <- hpfilter(infl, freq = 1600, type = "lambda")

ts.plot(infl, fitted(model_hp_quarter), lty = c(1, 2))
legend(
  "topright",
  # legend = c("KPIF", "HP-filtered trend, lambda=1600"),
  legend = c("KPIF", expression(list(plain("HP-filtered trend"), lambda == 1600))),
  lty = c(1, 2)
)

  • much closer to the data, less of a “trend” as we understand it
model_hp_year <- hpfilter(infl, freq = 6.25, type = "lambda")

ts.plot(infl, fitted(model_hp_year), lty = c(1, 2))
legend(
  "topright",
  # legend = c("KPIF", "HP-filtered trend, lambda=6.25"),
  legend = c("KPIF", expression(list(plain("HP-filtered trend"), lambda == 6.25))),
  lty = c(1, 2)
)

  • \(\lambda\) “penalizes” movement of the trend component: the larger \(\lambda\), the smoother the trend
    • for monthly data the rule of thumb is \(\lambda = 129600\) (sketched below)
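For completeness, a sketch with the monthly rule-of-thumb value:

model_hp_month <- hpfilter(infl, freq = 129600, type = "lambda")

ts.plot(infl, fitted(model_hp_month), lty = c(1, 2))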
  • the Hodrick-Prescott filter has been criticized as somewhat ad hoc
  • an alternative was suggested by Hamilton (2017)
    • lecture 2, slide 36
    • model is

\[ x_{t + h} = \beta_0 + \beta_1 x_t + \beta_2 x_{t - 1} + \beta_3 x_{t - 2} + \beta_4 x_{t - 3} + \epsilon_{t + h} \]

  • where \(h\) is a two-year horizon
  • we have monthly data, so two years means \(h = 24\)
  • relabelling the time index (\(t + h \to t\)), the equation above is equivalent to

\[ x_t = \beta_0 + \beta_1 x_{t - 24} + \beta_2 x_{t - 25} + \beta_3 x_{t - 26} + \beta_4 x_{t - 27} + \epsilon_t \]

model_hamilton <- dynlm(infl ~ L(infl, 24:27))
# shorthand for:
# model_hamilton <- dynlm(infl ~ L(infl, 24) + L(infl, 25) + L(infl, 26) + L(infl, 27))

ts.plot(infl, fitted(model_hamilton), lty = c(1, 2))
legend(
  "topright",
  legend = c("KPIF", "Hamilton's trend"),
  lty = c(1, 2)
)

  • still looks rather noisy
  • better? Who knows; it depends on the application (the implied cyclical component is sketched below)
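In Hamilton’s approach the cyclical component is simply the regression residual, so it can be inspected directly (a sketch):

plot(residuals(model_hamilton))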

Compare all:

  • specify line width to make dashed/dotted lines better visible
ts.plot(
  infl, fitted(model_dt), fitted(model_hp), fitted(model_hamilton),
  lty = 1:4,
  col = 1:4,
  lwd = c(1, 1, 2, 2)
)
legend(
  "topright",
  legend = c("KPIF", "Deterministic trend", "HP-filtered trend", "Hamilton's trend"),
  lty = 1:4,
  col = 1:4
)

