## A handy concatenation operator

It may be useful for you to define a concatenation operator for characters. Sometimes, I find this is more intuitive and handy than using `paste0 `

or `paste`

. Also, it makes your code look better when you have nested paste, e.g.`paste0("Y~",paste0("z",1:3, "*x",1:3,collapse="+")`

. The drawback is that it may reduce the readability of your code to other R user, since it is a self define function.(i guess it should be fine, cuz it is really intuitive. Also other scripting language also has similar concatenation operator)

"%+%" <- function(...){ paste0(...,sep="") } > "hello" %+% "world" [1] "helloworld" "hello" %+% "world" %+% 1:3 [1] "helloworld1" "helloworld2" "helloworld3"

Generating formula:

"Y~" %+% paste0("z",1:3, "*x",1:3,collapse="+") [1] "Y~z1*x1+z2*x2+z3*x3"

## Generating a lag/lead variables

A few days ago, my friend asked me is there any function in R to generate lag/lead variables in a data.frame or did similar thing as _n in stata. He would like to use that to clean-up his dataset in R.

In stata help manual: _n contains the number of the current observation.

Here’s an example to illustrate what _n does:

set obs 10

generate x = _n

generate x_lag1 = x[_n-1]

generate x_lead1 = x[_n+1]

The data generated would be :

x = {1,2,3,4,5,6,7,8,9,10}

x_lag1 = {NA,1,2,3,4,5,6,7,8,9}

x_lead1 = {1,2,3,4,5,6,7,8,9,NA}

The key feature is the new vector has the same length as the original vector, so we can use it with the original vector or other generated vector.

One application is to create a MA series (just an example, it is better to use function in any time-series packages to do that)

generate x_ma_1 = (x[_n-1] + x[_n]) / 2

I googled a while for that, basically there’re two types of method to generate lag/lead variables in R:(reference)

1> Function that generate a shorter vector (e.g. embed(), running() in gtools

2> Function in ts, zoo, xts, dynlm,dlm.

However, both solutions do not solve his problem. Then I wrote a “shift” function to do the task:

shift<-function(x,shift_by){ stopifnot(is.numeric(shift_by)) stopifnot(is.numeric(x)) if (length(shift_by)>1) return(sapply(shift_by,shift, x=x)) out<-NULL abs_shift_by=abs(shift_by) if (shift_by > 0 ) out<-c(tail(x,-abs_shift_by),rep(NA,abs_shift_by)) else if (shift_by < 0 ) out<-c(rep(NA,abs_shift_by), head(x,-abs_shift_by)) else out<-x out }

# Example d<-data.frame(x=1:15) #generate lead variable d$df_lead2<-shift(d$x,2) #generate lag variable d$df_lag2<-shift(d$x,-2) > d x df_lead2 df_lag2 1 1 3 NA 2 2 4 NA 3 3 5 1 4 4 6 2 5 5 7 3 6 6 8 4 7 7 9 5 8 8 10 6 9 9 NA 7 10 10 NA 8 # shift_by is vectorized d$df_lead2 shift(d$x,-2:2) [,1] [,2] [,3] [,4] [,5] [1,] NA NA 1 2 3 [2,] NA 1 2 3 4 [3,] 1 2 3 4 5 [4,] 2 3 4 5 6 [5,] 3 4 5 6 7 [6,] 4 5 6 7 8 [7,] 5 6 7 8 9 [8,] 6 7 8 9 10 [9,] 7 8 9 10 NA [10,] 8 9 10 NA NA

# Test library(testthat) expect_that(shift(1:10,2),is_identical_to(c(3:10,NA,NA))) expect_that(shift(1:10,-2), is_identical_to(c(NA,NA,1:8))) expect_that(shift(1:10,0), is_identical_to(1:10)) expect_that(shift(1:10,0), is_identical_to(1:10)) expect_that(shift(1:10,1:2), is_identical_to(cbind(c(2:10,NA),c(3:10,NA,NA))))

Notice that the result depends on how the data.frame is sorted.

## A shortcut function for install.packages() and library()

I enjoy trying difference kind of R packages. Since I have more than 1 computers (1 at home, 1 at office and a laptop)

it is troublesome to check whether I have installed some new packages for each computer. Therefore i wrote a function to load and install packages at once. If the package does not exist, then the it will be downloaded from CRAN and be loaded it.

packages<-function(x, repos="http://cran.r-project.org", ...){ x <- deparse(substitute(x)) if (!require(x,character.only=TRUE)){ install.packages(pkgs=x, repos=repos, ...) require(x,character.only=TRUE) } } packages(Hmisc)

Thanks richierocks for the suggestion of using deparse(substitute(x)) in the code.

richierocks’s

## A quick way to do row repeat and col repeat (rep.row, rep.col)

Today I worked on a simulation program which require me to create a matrix by repeating the vector n times (both by row and by col).

Even the task is extremely simple and only take 1 line to finish(10sec), I have to think about should the argument in rep be each or times and should the argument in matrix is nrow or ncol. It distracted me from the original task i am working on.

Just now, I wrote a function rep.row and rep.col to do what I really want to do. Next time, i don’t have to worry about how to use the matrix and rep command to repeat an vector to form a matrix!

Code

rep.row<-function(x,n){ matrix(rep(x,each=n),nrow=n) } rep.col<-function(x,n){ matrix(rep(x,each=n), ncol=n, byrow=TRUE) }

x is the vector to be repeated and n is the number of replication. Example:

> rep.row(1:3,5) [,1] [,2] [,3] [1,] 1 2 3 [2,] 1 2 3 [3,] 1 2 3 [4,] 1 2 3 [5,] 1 2 3 > rep.col(1:3,5) [,1] [,2] [,3] [,4] [,5] [1,] 1 1 2 2 3 [2,] 1 1 2 3 3 [3,] 1 2 2 3 3

I am sure it should appear in some packages, but it would be faster for me to write it out than find it out!