Generating lag/lead variables
A few days ago, a friend asked me whether R has a function to generate lag/lead variables in a data.frame, or anything similar to _n in Stata. He wanted to use it to clean up his dataset in R.
From the Stata help manual: _n contains the number of the current observation.
Here’s an example to illustrate what _n does:
set obs 10
generate x = _n
generate x_lag1 = x[_n-1]
generate x_lead1 = x[_n+1]
The data generated would be:
x = {1,2,3,4,5,6,7,8,9,10}
x_lag1 = {NA,1,2,3,4,5,6,7,8,9}
x_lead1 = {2,3,4,5,6,7,8,9,10,NA}
The key feature is that each new vector has the same length as the original, so it can be combined with the original vector or with other generated vectors.
One application is to create a moving-average (MA) series. This is just an example; in practice it is better to use a function from a time-series package for that.
generate x_ma_1 = (x[_n-1] + x[_n]) / 2
I googled for a while and found basically two types of methods to generate lag/lead variables in R (reference):
1> Functions that generate a shorter vector (e.g. embed(), or running() in gtools)
2> Functions in ts, zoo, xts, dynlm, dlm.
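As an illustration of the first type, here is a minimal sketch using embed() from base R. The result has fewer rows than the input, so it cannot be assigned straight back into a data.frame column without padding. The variable names are only for this example.

```r
x <- 1:10

# embed(x, 2) puts x[t] in column 1 and x[t-1] in column 2;
# the first observation has no lag, so that row is dropped
e <- embed(x, 2)
dim(e)            # 9 rows, 2 columns: one row shorter than x

# padding with NA by hand restores the original length
x_lag1 <- c(NA, e[, 2])
length(x_lag1) == length(x)
```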
However, neither approach solved his problem, so I wrote a “shift” function to do the task:
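shift <- function(x, shift_by) {
  stopifnot(is.numeric(shift_by))
  stopifnot(is.numeric(x))

  # vectorized over shift_by: returns a matrix with one column per shift
  if (length(shift_by) > 1)
    return(sapply(shift_by, shift, x = x))

  out <- NULL
  abs_shift_by <- abs(shift_by)
  # positive shift_by = lead (pad the end with NA),
  # negative shift_by = lag (pad the start with NA)
  if (shift_by > 0)
    out <- c(tail(x, -abs_shift_by), rep(NA, abs_shift_by))
  else if (shift_by < 0)
    out <- c(rep(NA, abs_shift_by), head(x, -abs_shift_by))
  else
    out <- x
  out
}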
# Example
d <- data.frame(x = 1:10)
# generate lead variable
d$df_lead2 <- shift(d$x, 2)
# generate lag variable
d$df_lag2 <- shift(d$x, -2)
> d
    x df_lead2 df_lag2
1   1        3      NA
2   2        4      NA
3   3        5       1
4   4        6       2
5   5        7       3
6   6        8       4
7   7        9       5
8   8       10       6
9   9       NA       7
10 10       NA       8
# shift_by is vectorized
> shift(d$x, -2:2)
      [,1] [,2] [,3] [,4] [,5]
 [1,]   NA   NA    1    2    3
 [2,]   NA    1    2    3    4
 [3,]    1    2    3    4    5
 [4,]    2    3    4    5    6
 [5,]    3    4    5    6    7
 [6,]    4    5    6    7    8
 [7,]    5    6    7    8    9
 [8,]    6    7    8    9   10
 [9,]    7    8    9   10   NA
[10,]    8    9   10   NA   NA
# Test
library(testthat)
expect_that(shift(1:10, 2), is_identical_to(c(3:10, NA, NA)))
expect_that(shift(1:10, -2), is_identical_to(c(NA, NA, 1:8)))
expect_that(shift(1:10, 0), is_identical_to(1:10))
expect_that(shift(1:10, 1:2), is_identical_to(cbind(c(2:10, NA), c(3:10, NA, NA))))
Notice that the result depends on how the data.frame is sorted.
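To make the sorting point concrete, here is a small sketch with made-up data. It uses a compact copy of shift() that handles a single shift_by, so the block runs on its own. The column names year, x_lag_unsorted and x_lag1 are only for illustration.

```r
# compact copy of shift() for one shift_by value
shift <- function(x, shift_by) {
  k <- abs(shift_by)
  if (shift_by > 0) c(tail(x, -k), rep(NA, k))
  else if (shift_by < 0) c(rep(NA, k), head(x, -k))
  else x
}

# rows are deliberately out of time order
d <- data.frame(year = c(2003, 2001, 2002), x = c(30, 10, 20))

# shifting the unsorted data pairs 2001 with the value from 2003
d$x_lag_unsorted <- shift(d$x, -1)

# sort by time first, then shift
d <- d[order(d$year), ]
d$x_lag1 <- shift(d$x, -1)
d
```

After sorting, x_lag1 holds the previous year's value, while x_lag_unsorted still carries the values taken from whatever row happened to sit above.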
Coming to R from Stata, I was really disappointed at how difficult it was to include lags/leads in ad-hoc analysis. I’m sure this would be of use to a great many R beginners. IMO, I’d like to see something like this in r-core. I’d personally prefer it as two functions though, one named lag() and another lead(), since I find the negative number in the second parameter a little strange.
That said, excellent work. It even works when put straight into a regression, so one can quickly test the relationship between y, x, and a lag of x with lm(y ~ x + shift(x,-1)).
what about:
lag <- function(x, k) {
  if (k >= 0) {
    return(c(rep(NA, k), x[1:(length(x) - k)]))
  } else {
    return(c(x[(1 - k):length(x)], rep(NA, -k)))
  }
}
@ Andrew:
Like you, I also came from Stata. My feeling is that Stata is more convenient than R for some operations (like cleaning economics data). This is because Stata is designed for those purposes, while R serves a more general purpose. I think R can do as well as Stata if there’s a package for those tasks.
Another reason is that Stata has only one data.frame. Every command you type applies to that data.frame directly, so you don’t have to worry about anything else. If you only have one data.frame, Stata’s environment is very good for that. However, it will drive you crazy if you want to do something more complicated (just imagine your data spread across several data.frames).
R, on the other hand, stores everything in objects. It may be hard for people without a programming background to understand what is going on. Simple tasks are a bit more difficult, but complicated tasks are much easier.
I think it’s fine that r-core does not have those features, but it would be good to have a package that implements similar features to Stata’s. I would like to try that if I have more ideas and time in the future.
Last, if you want to have lag and lead separately, you can create:
lag <- function(x,lag) { shift(x,-lag) }
lead <- function(x,lead) { shift(x, lead) }
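One caveat: a function named lag defined in the workspace shadows stats::lag(), which is what gets called on ts objects. A sketch that avoids the clash uses different names. Here lag_by() and lead_by() are hypothetical names chosen for this example, and shift() is a compact copy that handles a single shift_by.

```r
# compact copy of shift() for one shift_by value
shift <- function(x, shift_by) {
  k <- abs(shift_by)
  if (shift_by > 0) c(tail(x, -k), rep(NA, k))
  else if (shift_by < 0) c(rep(NA, k), head(x, -k))
  else x
}

# hypothetical wrapper names that do not mask stats::lag()
lag_by  <- function(x, k = 1) shift(x, -k)
lead_by <- function(x, k = 1) shift(x, k)

x <- 1:10
lag_by(x, 1)    # NA followed by 1..9
lead_by(x, 1)   # 2..10 followed by NA

# the two-period moving average from the Stata example
x_ma_1 <- (lag_by(x, 1) + x) / 2
```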
A way to do this with built-in R functions lag() and ts.union():
# create a time series variable
y1 <- ts(1:10)
# create lead variable
y1.lead <- lag(y1, k=2)
# create lag variable
y1.lag <- lag(y1, k=-2)
# combine the time series variables
ts.union(y1, y1.lead, y1.lag)
Time Series:
Start = -1
End = 12
Frequency = 1
   y1 y1.lead y1.lag
-1 NA       1     NA
 0 NA       2     NA
 1  1       3     NA
 2  2       4     NA
 3  3       5      1
 4  4       6      2
 5  5       7      3
 6  6       8      4
 7  7       9      5
 8  8      10      6
 9  9      NA      7
10 10      NA      8
11 NA      NA      9
12 NA      NA     10
Hello,
I read your blog only after I had found my own way of lagging variables in a panel. To share it with other users who have panel data they want to lag, my solution is below.
In particular, I have panel data (‘year’ gives the time dimension and ‘ID’ the cross section).
The variable ‘myvariable’ is the one I want to lag. All variables are saved in the data.frame ‘mydata’.
# first construct an index
sort1 <- paste(mydata$year, mydata$ID)     # ID can be a character, year must be numeric
sort2 <- paste(mydata$year - 1, mydata$ID)
index_lag <- match(sort2, sort1)
rm(sort1, sort2)   # we don't need them anymore
# note: this stores the change from the previous year;
# mydata$myvariable[index_lag] on its own is the lagged value
mydata$myvariable.Lag <- mydata$myvariable - mydata$myvariable[index_lag]
rm(index_lag)
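A small worked sketch of this match() approach on made-up panel data may help. The data values and the column names myvariable.lag1 and myvariable.diff1 are assumptions for illustration only.

```r
# toy panel: unit "b" has no observation for 2001
mydata <- data.frame(
  ID         = c("a", "a", "a", "b", "b"),
  year       = c(2000, 2001, 2002, 2000, 2002),
  myvariable = c(1, 2, 4, 10, 30)
)

# key of each row, and key of the row one year earlier for the same ID
key_now   <- paste(mydata$year, mydata$ID)
key_prev  <- paste(mydata$year - 1, mydata$ID)
index_lag <- match(key_prev, key_now)   # NA where no earlier row exists

# lagged value, and the change from the previous year
mydata$myvariable.lag1  <- mydata$myvariable[index_lag]
mydata$myvariable.diff1 <- mydata$myvariable - mydata$myvariable.lag1
mydata
```

Because rows are matched by key rather than by position, this does not depend on how the data.frame is sorted, and a gap in the years (unit "b" above) gives NA instead of silently using the wrong row.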