Stockholm University, VT2022

Computer Labs 2 and 3: Exercises

Exercise 1

Write a commented script that does the following:

  1. Create two normally distributed realizations of length 100, both with mean one, but one with variance one and the other with variance two.
  2. Test whether sample means are different from zero.
  3. Test whether the sample means are the same or not.
  4. Run the script in batch mode, so that a result file is created.

Solution

Let us start by creating a new project for this lab.

a)

Create two normally distributed realizations of length 100, both with mean one, but one with variance one and the other with variance two.

  • two normally distributed realizations of length 100 = two samples of random draws from a Normal/Gaussian distribution

  • computers cannot be “truly” random; they only execute whatever we ask them to

  • we speak of “pseudorandomness” when generating “random” numbers on a computer

  • any sequence or sample of such numbers seems random (i.e. it satisfies certain statistical properties), but is in fact deterministic

  • random numbers are useful for many modelling tasks, and you may need a lot of them (millions, if not more, to solve some models)

    • R’s default algorithm (Mersenne Twister) has a period of \(2^{19937} - 1\) (that’s a lot of numbers)
  • nice side effect of the numbers being deterministic: models are (theoretically) perfectly reproducible

  • how it works:

    • choose algorithm to generate numbers (good default choices in R)
    • set the “seed”, i.e. the starting point of the sequence
    • given that seed, the number generator will always produce the same sequence of “random” numbers (see the small sketch after this list)
    • read more by typing ?Random in R
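A minimal sketch of this reproducibility (using an arbitrary seed): re-setting the same seed reproduces exactly the same draws.

set.seed(123)
a <- rnorm(3)
set.seed(123)
b <- rnorm(3)
identical(a, b)  # TRUE: same seed, same "random" numbers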

So let’s generate the samples.

set.seed(2)

n <- 100
mu <- 1

sigx <- 1
sigy <- 2

x <- rnorm(n = n, mean = mu, sd = sqrt(sigx))
y <- rnorm(n, mu, sqrt(sigy))
  • good practice to declare all parameters (length, mean, and variances) in one place, preferably at the beginning of the script
  • call the function rnorm() to generate/draw the samples
    • note that we can use the sqrt() function inside the call: it is evaluated first, and the result is passed as the sd argument

Let’s have a look at the random samples

plot(x)

plot(y)

Looks pretty random, but both centered around 1, as expected.

Look at histograms (we overlay the theoretical density):

  • probability = TRUE puts densities (rather than counts) on the \(y\)-axis, so the histogram is on the same scale as the theoretical density
  • curve() takes a function of x and overlays it on the previous plot if add = TRUE
hist(x, probability = TRUE)
curve(dnorm(x, mean = mu, sd = sqrt(sigx)), from = -2, to = 4, add = TRUE)

hist(y, probability = TRUE)
curve(dnorm(x, mean = mu, sd = sqrt(sigy)), from = -3, to = 5, add = TRUE)

b)

Test whether sample means are different from zero.

  • this is done with one of the simplest tests: the \(t\)-test
t.test(x, mu = 0)
## 
##  One Sample t-test
## 
## data:  x
## t = 8.3547, df = 99, p-value = 4.148e-13
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.7390951 1.1995086
## sample estimates:
## mean of x 
## 0.9693018
  • interpretation of output
    • t = 8.3547: value of the \(t\)-statistic (\(\frac{\bar{x}}{\hat{\sigma} / \sqrt{n}}\))
    • df = 99: degrees of freedom of the \(t\)-distribution
    • p-value = 4.148e-13: probability mass of the \(t_{\nu = 99}\) distribution beyond \(\pm 8.3547\) (two-sided; a by-hand check follows after this list)
    • conclusion: we reject the Null at any reasonable significance level
      • \(H_0: \mu = 0\) and \(H_1: \mu \neq 0\)
      • we work with the alternative, i.e. that the true mean is not zero
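A quick by-hand check (just a sketch) recomputes the \(t\)-statistic and the two-sided \(p\)-value for x:

tstat <- mean(x) / (sd(x) / sqrt(length(x)))
tstat  # should match t = 8.3547 above
2 * pt(abs(tstat), df = length(x) - 1, lower.tail = FALSE)  # two-sided p-value

Now the same test for the second sample: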
t.test(y, mu = 0)
## 
##  One Sample t-test
## 
## data:  y
## t = 7.4727, df = 99, p-value = 3.175e-11
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.764817 1.317814
## sample estimates:
## mean of x 
##  1.041316
  • same interpretation
  • but note that the \(p\)-value is much larger (albeit still tiny)
    • the second sample has a greater variance, so the test is less “sure” of the result
c)

Test whether the sample means are the same or not.

  • testing for equality of sample means also done using a \(t\)-test
  • implemented in R using the same function, but now with different arguments
    • in b) we supply one vector (the sample) and specify mu = 0
    • R “understands” we want to test the mean of that single sample
    • now supply two vectors, and nothing else
    • default is then to compare the means of those two
t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -0.39716, df = 191.71, p-value = 0.6917
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4296611  0.2856333
## sample estimates:
## mean of x mean of y 
## 0.9693018 1.0413158
  • interpretation similar again
  • but now p-value = 0.6917
    • much larger than before
    • \(H_0: \mu_x = \mu_y\) and \(H_1: \mu_x \neq \mu_y\)
    • we can NOT reject the Null, i.e. we keep working under the assumption that both processes have the same true mean (reassuring, since we know this is in fact true)
    • note that R defaults to the Welch test, which does not assume equal variances; a pooled-variance variant is sketched below
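As an aside, a sketch of the classic pooled-variance version of the test (which assumes equal variances, so it is arguably less appropriate here, since we generated the two samples with different variances):

t.test(x, y, var.equal = TRUE)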
d)

Run the script in batch mode, so that a result file is created.

  • comment out the plotting stuff, so that we are left with
# Econometrics 3b
# Exercise 1

# parameters
set.seed(2)

n <- 100
mu <- 1

sigx <- 1
sigy <- 2

# generate samples
x <- rnorm(n = n, mean = mu, sd = sqrt(sigx))
y <- rnorm(n, mu, sqrt(sigy))

# testing
t.test(x, mu = 0)
t.test(y, mu = 0)
t.test(x, y)
  • save as, for example, exercise-1.R
  • execute in Terminal/Console:
    • R CMD BATCH is more flexible than Rscript from last time (an Rscript alternative is sketched at the end of this part)
    • --vanilla runs R with a clean default configuration: no saved/restored workspace, no user profile files, etc.
    • "exercise-1.R" is our script, followed by a space and "output.txt", the file to which all output will be written (the quotes can be omitted if the file names contain no spaces)
R CMD BATCH --vanilla "exercise-1.R" "output.txt"
  • look in Finder/Explorer for the file output.txt
    • it now contains the results
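As an alternative sketch using Rscript (mentioned above), the printed output can be redirected to a file; note that, unlike R CMD BATCH, Rscript does not echo the commands themselves, only their output:

Rscript --vanilla exercise-1.R > output.txt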

Exercise 2

Write a commented script that estimates \(\Delta \text{gdpjp}_t = \phi_0 + \phi_1 \Delta \text{gdpjp}_{t - 1} + e_t\) with OLS, where \(\text{gdpjp}\) is found in the data set Macrodat in package Ecdat. Test estimated residuals for autocorrelation. Run the script in batch mode so that a result file is created.

  1. Repeat with the variable \(\Delta \text{Irates1}\), but now use both lm() and dynlm().
  2. Repeat but estimate the covariance matrix with the Newey-West estimator.
  3. Download the 12-month inflation rate according to the KPIF from SCB using the package pxweb. Write a script that documents the source of the data and saves the data. Estimate a deterministic trend model and compare the results with the HP-filter and with Hamilton’s solution.

Solution

  • time series regressions with the dynlm package
  • load data from the package
    • check Environment, now Macrodat is listed (check its class)
  • specify formula for regression
    • dynlm defines functions like d() and L() for inside formulae
library(dynlm)
library(portes)
library(lmtest)
library(sandwich)

data(Macrodat, package = "Ecdat")
  • workspace now contains Macrodat, check it out
    • head() prints the first elements of any object
head(Macrodat)
##             lhur    punew fyff fygm3 fygt1   exruk   gdpjp
## 1959 Q1 5.833333 28.99333 2.80  2.80  3.61 281.271 14.3195
## 1959 Q2 5.100000 29.04333 3.39  3.21  4.07 281.254 14.8032
## 1959 Q3 5.266667 29.19333 3.76  4.04  5.00 280.370 15.5764
## 1959 Q4 5.600000 29.37000 3.99  4.49  5.14 279.845 15.4222
## 1960 Q1 5.133333 29.39667 3.84  3.31  4.02 280.586 16.5126
## 1960 Q2 5.233333 29.57333 3.32  2.46  3.36 280.268 16.5398
class(Macrodat)
## [1] "mts" "ts"
  • mts and ts: a multivariate time series
    • essentially a matrix whose columns are the individual ts objects
  • matrix columns can have names
colnames(Macrodat)
## [1] "lhur"  "punew" "fyff"  "fygm3" "fygt1" "exruk" "gdpjp"
  • this tells us which column holds which variable
    • we can use these names to extract individual series and when specifying our regressions (a small example follows)
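For instance, a single series can be pulled out by its column name (just a quick look, not needed for the regression itself):

plot(Macrodat[, "gdpjp"])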

Now to specify the model.

formula1 <- d(gdpjp) ~ L(d(gdpjp))
model1 <- dynlm(formula1, data = Macrodat)
  • remember lists?
    • model1 is now a named list containing all the results of the regression
  • most model outputs have a nice summary:
summary(model1)
## 
## Time series regression with "ts" data:
## Start = 1959(3), End = 1999(2)
## 
## Call:
## dynlm(formula = formula1, data = Macrodat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8825 -0.3535 -0.0066  0.4351  2.3532 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.64608    0.07645   8.451 1.79e-14 ***
## L(d(gdpjp)) -0.01733    0.07966  -0.218    0.828    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.723 on 158 degrees of freedom
##   (0 observations deleted due to missingness)
## Multiple R-squared:  0.0002995,  Adjusted R-squared:  -0.006028 
## F-statistic: 0.04733 on 1 and 158 DF,  p-value: 0.8281

Having a look at residuals is always a good idea.

  • note that the class of residuals(model1) is ts
    • use built-in functionality to plot
plot(residuals(model1))

  • does not look perfectly random: some autocorrelation, some “swings”
  • there are clearly some periods of larger variance
    • be careful: if there is still some “structure” in the residuals, the model does not capture everything!

Test for autocorrelation

  • first, just look at the autocorrelation function acf()
acf(residuals(model1))

  • visually there is some autocorrelation
  • formal test is the Ljung-Box test
    • implemented in portes library
    • go back to beginning of script and load library
  • in the call:
    • the residuals() function extracts the time series of residuals
    • lags: the lag orders at which we test (here 1 up to \(q\))
    • order: the number of estimated parameters in the model (intercept + slope), which is subtracted from the degrees of freedom of the test
  • to determine the lag order (lecture 1, slide 90)
    • rule of thumb: use up to \(q = \lfloor 0.75 \cdot T^{1/3} \rfloor\) lags
    • use the nobs() function (short for number of observations) to get \(T\)
q <- floor(0.75 * nobs(model1)^(1/3))
m <- length(coef(model1))

LjungBox(residuals(model1), lags = 1:q, order = m)
##  lags    statistic df      p-value
##     1  0.001115598  0           NA
##     2  3.156573290  0 0.0000000000
##     3 10.280790212  1 0.0013442236
##     4 17.395024245  2 0.0001670008
  • check a few different values of \(q\), to see whether the conclusion is sensitive to this choice (a sketch follows after this list)
    • do this especially in cases in which the \(p\)-values are only marginally significant
  • interpretation:
    • \(p\)-value is much smaller than 0.01 for all lags
    • the Null is no autocorrelation, we reject that
    • working hypothesis: there is autocorrelation in the residuals
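For example, one could re-run the test over a longer, arbitrarily chosen lag range (a sketch; 1:8 is just an illustrative choice):

LjungBox(residuals(model1), lags = 1:8, order = m)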
a)

Repeat with the variable \(\Delta \text{Irates1}\), but now use both lm() and dynlm().

Load the data

data(Irates, package = "Ecdat")
  • start with the “normal” lm()
    • note that d() and L() are functions internal to dynlm
    • the base R functions diff() and lag() look like their counterparts, so let us naively try those
model2_lm <- lm(diff(r1) ~ lag(diff(r1)), data = Irates)

summary(model2_lm)
## Warning in summary.lm(model2_lm): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = diff(r1) ~ lag(diff(r1)), data = Irates)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.311e-14  1.710e-17  2.400e-17  3.170e-17  5.717e-16 
## 
## Coefficients:
##                 Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)   -8.439e-18  2.491e-17 -3.390e-01    0.735    
## lag(diff(r1))  1.000e+00  4.106e-17  2.435e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.733e-16 on 528 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 5.931e+32 on 1 and 528 DF,  p-value: < 2.2e-16

Alarm bells must ring now!

  • R prints a warning message “essentially perfect fit”
  • look at coefficient estimates:
    • coefficient of lagged variable is 1, and highly significant
    • intercept is essentially zero
    • look at residuals
hist(residuals(model2_lm))

  • essentially all zero
  • the implied model is \(\Delta \text{Irates1}_t = 0 + 1 \cdot \Delta \text{Irates1}_{t - 1} + e_t\), with \(e_t \approx 0\) for almost all \(t\), so \(\Delta \text{Irates1}_t = \Delta \text{Irates1}_{t - 1}\): the series would be linear in time
  • is that true? That would be a boring time series
plot(Irates[, "r1"])

  • not very linear, so there is something wrong with the regression!
    • the culprit is base R’s lag(): for a ts object it only shifts the time stamps, not the underlying values, and lm() ignores the time stamps, so we effectively regressed diff(r1) on an identical copy of itself (a quick check follows below)
  • now estimate using dynlm(), which is designed for time series (“dynamic”) regressions
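Before re-running with dynlm(), a minimal check (a sketch) confirms the diagnosis:

dr1 <- diff(Irates[, "r1"])
identical(as.numeric(dr1), as.numeric(lag(dr1)))  # TRUE: same values, only the time index differs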
model2 <- dynlm(d(r1) ~ L(d(r1)), data = Irates)

summary(model2)
## 
## Time series regression with "ts" data:
## Start = 1947(2), End = 1991(2)
## 
## Call:
## dynlm(formula = d(r1) ~ L(d(r1)), data = Irates)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7172 -0.1668  0.0169  0.2111  2.7171 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.009889   0.026439   0.374    0.709
## L(d(r1))    0.021999   0.043559   0.505    0.614
## 
## Residual standard error: 0.608 on 527 degrees of freedom
## Multiple R-squared:  0.0004838,  Adjusted R-squared:  -0.001413 
## F-statistic: 0.2551 on 1 and 527 DF,  p-value: 0.6137

Check the same diagnostics

plot(residuals(model2))

acf(residuals(model2))

  • looks more reasonable
  • test for autocorrelation in residuals again
q <- floor(0.75 * nobs(model2)^(1/3))
m <- length(coef(model2))

LjungBox(residuals(model2), lags = 1:q, order = m)
##  lags    statistic df     p-value
##     1 1.041116e-04  0          NA
##     2 1.644574e-01  0 0.000000000
##     3 7.238289e+00  1 0.007136495
##     4 1.017624e+01  2 0.006169598
##     5 1.027875e+01  3 0.016339371
##     6 1.269716e+01  4 0.012854346
  • the \(p\)-values are below 5% at all lags (where they are defined)
    • we again reject the Null hypothesis of no autocorrelation
b)

Repeat but estimate the covariance matrix with the Newey-West estimator.

  • lm() and dynlm() by default report standard errors that are not robust to heteroskedasticity or autocorrelation
    • they assume that the variance of the residuals is constant, which is questionable looking at the plot of the residuals
    • they assume that the residuals are white noise, i.e. random and not autocorrelated
  • there are heteroskedasticity and autocorrelation consistent (HAC) covariance estimators; the Newey-West estimator is a popular one
    • implemented in the sandwich library
    • use the lmtest library to recompute the coefficient tests for our model
    • go back to the beginning of the script and load those packages

First reproduce the default non-robust standard errors, then estimate the robust version:

coeftest(model2)
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0098888  0.0264394   0.374   0.7085
## L(d(r1))    0.0219992  0.0435593   0.505   0.6137
coeftest(model2, vcov. = NeweyWest(model2))
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0098888  0.0253651  0.3899   0.6968
## L(d(r1))    0.0219992  0.0910067  0.2417   0.8091
  • point estimates are identical, only variance (and thus standard error) is affected
  • variance of coefficient of lagged difference increases: it is a weaker predictor than implied by the basic regression
  • see lecture 1, slides 108ff for details on the usage of NeweyWest() (the covariance matrix itself is sketched below)
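If the covariance matrix itself is of interest, it can be inspected directly (a sketch; the square roots of its diagonal are the robust standard errors reported above):

V_nw <- NeweyWest(model2)
V_nw
sqrt(diag(V_nw))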
c)

Download the 12-month inflation rate according to the KPIF from SCB using the package pxweb. Write a script that documents the source of the data and saves the data. Estimate a deterministic trend model and compare the results with the HP-filter and with Hamilton’s solution.

  • this exercise has two parts:
    1. download data from within R
    2. estimate models using that data
  • the two parts are fundamentally different (one requires interaction, the other does not), so it makes sense to split them into two scripts
  • first, script to download and save data

We are now going to use R interactively. The script should look something like this (and nothing more):

# Econometrics 3b
# Exercise 2c
# Download data

library(pxweb)

dl <- pxweb_interactive()
# sequence to get to desired data:
# 1 1 2 1 18 1 3 1 3 * y n y y n

kpif <- dl$data

# head(kpif, 20)

# first year of data is missing (NA), remove it
kpif <- na.exclude(kpif)

write.csv(kpif, file = "kpif.csv", row.names = FALSE)

# end of script

Execute this one directly in RStudio. It requires interaction, so it makes no sense to run it in batch mode.

Now for the non-interactive/“normal” part. Open a new script, and save it.

library(dynlm)
library(mFilter)

df <- read.csv("kpif.csv")
head(df)
##     månad KPIF..12.månadsförändring..1987.100
## 1 1988M01                                 4.5
## 2 1988M02                                 4.9
## 3 1988M03                                 5.2
## 4 1988M04                                 5.9
## 5 1988M05                                 6.0
## 6 1988M06                                 6.4
  • the data is in a data.frame, but we want to work with time series tools, so convert it to a ts object
infl <- ts(df[, 2], start = c(1988, 1), frequency = 12)

plot(infl)

  • now estimate a deterministic trend model (a linear time trend, to be specific)
    • essentially we fit a straight line, interpreted as the long-run component, to the data
    • easily done with dynlm and its trend() function
model_dt <- dynlm(infl ~ trend(infl))
  • what does that look like?
    • use ts.plot() to plot multiple timeseries in the same figure
    • fitted() function returns the fitted values of the model, i.e. the model’s prediction
    • the lty = c(1, 2) argument specifies the line types of the series: 1 solid, 2 dashed, … (see ?par)
ts.plot(infl, fitted(model_dt), lty = c(1, 2))

  • linear trend in inflation rate makes little sense economically
  • something more sophisticated: Hodrick-Prescott filter
    • implemented in mFilter package
    • idea is to decompose the series into cyclical and trend components, and filter out some noise
model_hp <- hpfilter(infl)

ts.plot(infl, fitted(model_hp), lty = c(1, 2))
legend(
  "topright",
  legend = c("KPIF", "HP-filtered trend"),
  lty = c(1, 2)
)

  • looks much more reasonable
  • key is the smoothing parameter
    • see lecture 2, slide 25
    • there are rule-of-thumb values for different data frequencies, but we can play around:
model_hp_quarter <- hpfilter(infl, freq = 1600, type = "lambda")

ts.plot(infl, fitted(model_hp_quarter), lty = c(1, 2))
legend(
  "topright",
  # legend = c("KPIF", "HP-filtered trend, lambda=1600"),
  legend = c("KPIF", expression(list(plain("HP-filtered trend"), lambda == 1600))),
  lty = c(1, 2)
)

  • much closer to the data, less of a “trend” as we understand it
model_hp_year <- hpfilter(infl, freq = 6.25, type = "lambda")

ts.plot(infl, fitted(model_hp_year), lty = c(1, 2))
legend(
  "topright",
  # legend = c("KPIF", "HP-filtered trend, lambda=6.25"),
  legend = c("KPIF", expression(list(plain("HP-filtered trend"), lambda == 6.25))),
  lty = c(1, 2)
)

  • \(\lambda\) “penalizes” movement of the trend component: the larger \(\lambda\), the smoother the trend
    • for monthly data the rule of thumb is \(\lambda = 129600\) (sketched below)
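For completeness, a sketch with the monthly rule-of-thumb value:

model_hp_month <- hpfilter(infl, freq = 129600, type = "lambda")

ts.plot(infl, fitted(model_hp_month), lty = c(1, 2))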
  • the Hodrick-Prescott filter has been criticized as somewhat ad hoc
  • an alternative was suggested by Hamilton (2017)
    • lecture 2, slide 36
    • model is

\[ x_{t + h} = \beta_0 + \beta_1 x_t + \beta_2 x_{t - 1} + \beta_3 x_{t - 2} + \beta_4 x_{t - 3} + \epsilon_{t + h} \]

  • where \(h\) is a two-year horizon
  • we have monthly data, so two years means \(h = 24\)
  • relabelling the time index (\(t + h \to t\)), the equation above is equivalent to

\[ x_t = \beta_0 + \beta_1 x_{t - 24} + \beta_2 x_{t - 25} + \beta_3 x_{t - 26} + \beta_4 x_{t - 27} + \epsilon_t \]

model_hamilton <- dynlm(infl ~ L(infl, 24:27))
# shorthand for:
# model_hamilton <- dynlm(infl ~ L(infl, 24) + L(infl, 25) + L(infl, 26) + L(infl, 27))

ts.plot(infl, fitted(model_hamilton), lty = c(1, 2))
legend(
  "topright",
  legend = c("KPIF", "Hamilton's trend"),
  lty = c(1, 2)
)

  • still looks rather noisy
  • better? Who knows; it depends on the application (the implied cyclical component is sketched below)
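In Hamilton’s approach the cyclical component is simply the regression residual, so it can be inspected directly (a sketch):

plot(residuals(model_hamilton))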

Compare all:

  • specify line width to make dashed/dotted lines better visible
ts.plot(
  infl, fitted(model_dt), fitted(model_hp), fitted(model_hamilton),
  lty = 1:4,
  col = 1:4,
  lwd = c(1, 1, 2, 2)
)
legend(
  "topright",
  legend = c("KPIF", "Deterministic trend", "HP-filtered trend", "Hamilton's trend"),
  lty = 1:4,
  col = 1:4
)

