Generating lag/lead variables

A few days ago, a friend asked me whether R has a function to generate lag/lead variables in a data.frame, or something similar to _n in Stata. He wanted to use it to clean up his dataset in R.

From the Stata help manual: _n contains the number of the current observation.
Here’s an example to illustrate what _n does:

set obs 10
generate x = _n
generate x_lag1 = x[_n-1]
generate x_lead1 = x[_n+1]

The generated data would be:
x = {1,2,3,4,5,6,7,8,9,10}
x_lag1 = {NA,1,2,3,4,5,6,7,8,9}
x_lead1 = {2,3,4,5,6,7,8,9,10,NA}

The key feature is that the new vector has the same length as the original vector, so it can be combined with the original vector or with other generated vectors.

One application is to create a moving-average (MA) series (just an example; it is better to use a function from a time-series package for that):
generate x_ma_1 = (x[_n-1] + x[_n]) / 2
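
For comparison, here is a minimal base R sketch of the same padded vectors, using only head() and tail(). The variable names simply mirror the Stata example above; this illustrates the idea and is not a general-purpose function.

```r
x <- 1:10

# lag by one: drop the last value, pad the front with NA
x_lag1 <- c(NA, head(x, -1))

# lead by one: drop the first value, pad the end with NA
x_lead1 <- c(tail(x, -1), NA)

# two-period moving average; the first element is NA because x_lag1[1] is NA
x_ma_1 <- (x_lag1 + x) / 2
```

All three vectors have length 10, so they can be stored as columns next to x.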

I googled for a while. Basically, there are two types of methods for generating lag/lead variables in R:

1> Functions that generate a shorter vector (e.g. embed(), or running() in gtools).
2> Functions in ts, zoo, xts, dynlm, dlm.
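
To see why the first type does not fit, here is a small sketch with embed() from base R. It returns one row fewer than the input for each lag, so the result cannot be assigned directly as a column of the original data.frame.

```r
x <- 1:10

# embed(x, 2) puts x[t] in column 1 and x[t-1] in column 2
e <- embed(x, 2)

dim(e)   # 9 rows and 2 columns: one row fewer than length(x)

# The lagged column has 9 values, so assigning it to a
# 10-row data.frame would fail with a length mismatch.
x_lag_short <- e[, 2]
length(x_lag_short)
```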

However, neither solution solves his problem. So I wrote a “shift” function to do the task:

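shift <- function(x, shift_by) {
	stopifnot(is.numeric(shift_by))
	stopifnot(is.numeric(x))

	# vectorized over shift_by: returns one column per shift value
	if (length(shift_by) > 1)
		return(sapply(shift_by, shift, x = x))

	# cap the shift at length(x) so the result is never longer than x
	abs_shift_by <- min(abs(shift_by), length(x))

	# positive shift_by gives a lead (NA at the end),
	# negative shift_by gives a lag (NA at the front)
	if (shift_by > 0)
		out <- c(tail(x, -abs_shift_by), rep(NA, abs_shift_by))
	else if (shift_by < 0)
		out <- c(rep(NA, abs_shift_by), head(x, -abs_shift_by))
	else
		out <- x
	out
}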
# Example
d<-data.frame(x=1:10)
#generate lead variable
d$df_lead2<-shift(d$x,2)
#generate lag variable
d$df_lag2<-shift(d$x,-2)

> d
    x df_lead2 df_lag2
1   1        3      NA
2   2        4      NA
3   3        5       1
4   4        6       2
5   5        7       3
6   6        8       4
7   7        9       5
8   8       10       6
9   9       NA       7
10 10       NA       8

# shift_by is vectorized
> shift(d$x,-2:2)
      [,1] [,2] [,3] [,4] [,5]
 [1,]   NA   NA    1    2    3
 [2,]   NA    1    2    3    4
 [3,]    1    2    3    4    5
 [4,]    2    3    4    5    6
 [5,]    3    4    5    6    7
 [6,]    4    5    6    7    8
 [7,]    5    6    7    8    9
 [8,]    6    7    8    9   10
 [9,]    7    8    9   10   NA
[10,]    8    9   10   NA   NA
# Test
library(testthat)
expect_identical(shift(1:10, 2), c(3:10, NA, NA))
expect_identical(shift(1:10, -2), c(NA, NA, 1:8))
expect_identical(shift(1:10, 0), 1:10)
expect_identical(shift(1:10, 1:2), cbind(c(2:10, NA), c(3:10, NA, NA)))

Note that shift() works on row order, so the result depends on how the data.frame is sorted.
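
A small sketch of why the ordering matters. The data and the column names (year, x_lag_naive, x_lag1) are made up for illustration, and a one-step lag is written inline with head() so the block runs on its own.

```r
# Hypothetical data: observations stored out of time order
d <- data.frame(year = c(2003, 2001, 2002), x = c(30, 10, 20))

# A naive lag follows the row order, not the time order
d$x_lag_naive <- c(NA, head(d$x, -1))

# Sort by the time variable first, then lag
d <- d[order(d$year), ]
d$x_lag1 <- c(NA, head(d$x, -1))

d
```

After sorting, x_lag1 holds the previous year's value, while x_lag_naive still reflects the original row order.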

Categories: Custom Function
  1. March 12, 2012 at 12:39

    Coming to R from Stata, I was really disappointed at how difficult it was to include lags/leads in ad-hoc analysis. I’m sure this would be of use to a great many R beginners. IMO, I’d like to see something like this in r-core. I’d personally prefer it to be two functions, though: one named lag() and another named lead(), since I find the negative number in the second parameter a little strange.

    That said, excellent work. It even works put right into a regression so one can quickly test the relationship between y, x, and a lag of x with lm(y ~ x + shift(x,-1)).

  2. Mr.Ed.
    March 12, 2012 at 17:30

    what about:

    lag <- function(x, k) {
      if (k >= 0) {
        return(c(rep(NA, k), x[1:(length(x) - k)]))
      } else {
        return(c(x[(1 - k):length(x)], rep(NA, -k)))
      }
    }

  3. March 12, 2012 at 19:22

    @ Andrew:

    Like you, I also came from Stata. My feeling is that Stata is more convenient than R for some operations (like cleaning economics data). This is because Stata is designed for those purposes, while R serves a more general purpose. I think R can do as well as Stata if there’s a package for those tasks.

    Another reason is that Stata has only one data.frame. Every command you type applies to that data.frame directly, so you don’t have to worry about anything else. If you only have one data.frame, Stata’s environment is very good for that. However, it will drive you crazy if you want to do something more complicated (just imagine your data spread over several data.frames).

    R, however, stores everything in objects. It may be hard for people without a programming background to understand what is going on. Simple tasks are a bit more difficult, but complicated tasks are much easier.

    I think it’s fine that r-core does not have those features, but it would be good to have a package that implements similar features to Stata’s. I would like to try that if I have more ideas and time in the future.

    Lastly, if you want to have lag and lead separately, you can create:
    lag <- function(x,lag) { shift(x,-lag) }
    lead <- function(x,lead) { shift(x, lead) }

  4. RP
    March 14, 2012 at 07:49

    A way to do this with built-in R functions lag() and ts.union():

    # create a time series variable
    y1 <- ts(1:10)

    # create lead variable
    y1.lead <- lag(y1, k=2)

    # create lag variable
    y1.lag <- lag(y1, k=-2)

    # combine the time series variables
    ts.union(y1, y1.lead, y1.lag)
    Time Series:
    Start = -1
    End = 12
    Frequency = 1
       y1 y1.lead y1.lag
    -1 NA       1     NA
     0 NA       2     NA
     1  1       3     NA
     2  2       4     NA
     3  3       5      1
     4  4       6      2
     5  5       7      3
     6  6       8      4
     7  7       9      5
     8  8      10      6
     9  9      NA      7
    10 10      NA      8
    11 NA      NA      9
    12 NA      NA     10

  5. February 11, 2013 at 03:59

    Magnificent web site. Plenty of useful information here. I am sending it to some friends and also sharing it on Delicious. And certainly, thanks for your sweat!

  6. Uli
    February 14, 2013 at 10:41

    Hello,

    I read your blog only after I had found my way of lagging variables in a panel. To share it with other users who have panel data they want to lag, my solution is below.

    In particular, I have panel data (‘year’ giving the time dimension and ‘ID’ the cross section).
    The variable ‘myvariable’ is the one I want to lag. All variables are saved in the data.frame ‘mydata’.

    # first construct an index
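    sort1 <- paste(mydata$year, mydata$ID) # ID can be a character, year must be numeric
    sort2 <- paste(mydata$year - 1, mydata$ID)
    index_lag <- match(sort2, sort1) # row of the same ID in the previous year, NA if absent
    rm(sort1, sort2) # we don't need them anymore

    # note: as written, this stores the change from the previous year;
    # mydata$myvariable[index_lag] on its own is the lagged value itself
    mydata$myvariable.Lag <- mydata$myvariable - mydata$myvariable[index_lag]
    rm(index_lag)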

  7. tamago
    August 22, 2013 at 02:18

    thank you!! it’s really helpful

  8. October 2, 2013 at 23:01

    TszKin Julian,

    Great post and your blog is very professional, congrats.

    I invite you to my blogs.

    Regards, Sergio

  9. xHiE
    October 2, 2013 at 23:02

    TszKin Julian,

    Great post, very helpful.

    Regards, Sergio
