## Construct a unique index from two integers (pairing function)

Recently, I needed to construct a unique index from two integers. The best solution I found is a pairing function.

A pairing function is a one-to-one and onto function that maps a pair of non-negative integers to a single non-negative integer. The one used here is the Cantor pairing function with the roles of `x` and `y` swapped. It is defined as follows, together with its inverse `unpair`:

    library(foreach)

    pair <- function(x, y) {
      0.5 * (x + y) * (x + y + 1) + x
    }
    unpair <- function(z) {
      w <- floor((sqrt(8 * z + 1) - 1) / 2)
      t <- w * (w + 1) / 2
      cbind(z - t, w - z + t)
    }
    foreach(i = 0:4, .combine = rbind) %do% {
      x <- 0:i
      y <- i:0
      key <- pair(x, y)
      unpair_key <- unpair(key)
      cbind(x, y, key = key, unpair_key = unpair_key)
    }
          x y key x y
     [1,] 0 0   0 0 0
     [2,] 0 1   1 0 1
     [3,] 1 0   2 1 0
     [4,] 0 2   3 0 2
     [5,] 1 1   4 1 1
     [6,] 2 0   5 2 0
     [7,] 0 3   6 0 3
     [8,] 1 2   7 1 2
     [9,] 2 1   8 2 1
    [10,] 3 0   9 3 0
    [11,] 0 4  10 0 4
    [12,] 1 3  11 1 3
    [13,] 2 2  12 2 2
    [14,] 3 1  13 3 1
    [15,] 4 0  14 4 0
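As a quick sanity check, the sketch below verifies on a small grid that the keys are unique and that `unpair` recovers the inputs. The 0 to 50 range and the variable names are my own choices, not from the post. Because `unpair` relies on a floating-point `sqrt`, the round trip may stop being exact for very large keys.

```r
# Pairing function and its inverse, as defined in the post
pair <- function(x, y) 0.5 * (x + y) * (x + y + 1) + x
unpair <- function(z) {
  w <- floor((sqrt(8 * z + 1) - 1) / 2)  # w recovers x + y
  t <- w * (w + 1) / 2                   # key of the pair (0, w)
  cbind(x = z - t, y = w - z + t)
}

# Hypothetical test grid: all pairs with 0 <= x, y <= 50
grid <- expand.grid(x = 0:50, y = 0:50)
key  <- pair(grid$x, grid$y)
back <- unpair(key)

no_collisions <- anyDuplicated(key) == 0
round_trip_ok <- all(back[, "x"] == grid$x) && all(back[, "y"] == grid$y)
```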

If the ordering of `x` and `y` is not important, we can swap `x` and `y` whenever `x > y`. The function is then no longer one-to-one on ordered pairs, so we cannot recover the original order of `x` and `y` from `z`: `unpair` always returns the smaller value first.

    library(compiler)

    pair <- cmpfun(function(x, y, ordering_matter = TRUE) {
      if (ordering_matter) {
        return(0.5 * (x + y) * (x + y + 1) + x)
      } else {
        swap <- x > y
        return(0.5 * (x + y) * (x + y + 1) + (x * !swap) + (y * swap))
      }
    })
    foreach(i = 0:4, .combine = rbind) %do% {
      x <- 0:i
      y <- i:0
      key <- pair(x, y, ordering_matter = FALSE)
      unpair_key <- unpair(key)
      cbind(x, y, key = key, unpair_key = unpair_key)
    }
          x y key x y
     [1,] 0 0   0 0 0
     [2,] 0 1   1 0 1
     [3,] 1 0   1 0 1
     [4,] 0 2   3 0 2
     [5,] 1 1   4 1 1
     [6,] 2 0   3 0 2
     [7,] 0 3   6 0 3
     [8,] 1 2   7 1 2
     [9,] 2 1   7 1 2
    [10,] 3 0   6 0 3
    [11,] 0 4  10 0 4
    [12,] 1 3  11 1 3
    [13,] 2 2  12 2 2
    [14,] 3 1  11 1 3
    [15,] 4 0  10 0 4

If we have more than two integers, we can apply the Pairing function in a nested manner.

    nestedPair <- function(x) {
      ncol_x <- ncol(x)
      if (ncol_x == 1) {
        return(x)
      } else if (ncol_x == 2) {
        return(pair(x[, 1], x[, 2]))
      } else if (ncol_x > 2) {
        return(pair(x[, 1], nestedPair(x[, 2:ncol_x])))
      }
    }
    # order = number of unpair steps, i.e. number of original columns minus 1
    nestedUnpair <- function(x, order) {
      if (order == 1) {
        return(unpair(x))
      } else if (order > 1) {
        out <- unpair(x)
        return(cbind(out[, 1], nestedUnpair(out[, 2], order - 1)))
      }
    }
    x <- expand.grid(0:2, 0:2, 0:2)
    key <- nestedPair(x)
    unpair_key <- nestedUnpair(key, 2)
    cbind(x = x, key = key, unpair_key = unpair_key)
       x.Var1 x.Var2 x.Var3 key unpair_key.1 unpair_key.2 unpair_key.3
    1       0      0      0   0            0            0            0
    2       1      0      0   2            1            0            0
    3       2      0      0   5            2            0            0
    4       0      1      0   3            0            1            0
    5       1      1      0   7            1            1            0
    6       2      1      0  12            2            1            0
    7       0      2      0  15            0            2            0
    8       1      2      0  22            1            2            0
    9       2      2      0  30            2            2            0
    10      0      0      1   1            0            0            1
    11      1      0      1   4            1            0            1
    12      2      0      1   8            2            0            1
    13      0      1      1  10            0            1            1
    14      1      1      1  16            1            1            1
    15      2      1      1  23            2            1            1
    16      0      2      1  36            0            2            1
    17      1      2      1  46            1            2            1
    18      2      2      1  57            2            2            1
    19      0      0      2   6            0            0            2
    20      1      0      2  11            1            0            2
    21      2      0      2  17            2            0            2
    22      0      1      2  28            0            1            2
    23      1      1      2  37            1            1            2
    24      2      1      2  47            2            1            2
    25      0      2      2  78            0            2            2
    26      1      2      2  92            1            2            2
    27      2      2      2 107            2            2            2
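Nested keys grow quickly, roughly squaring with each extra level, so it may be worth checking that they stay below 2^53, the largest range in which R's doubles represent every integer exactly. The sketch below is only an illustration: `max_nested_key` is a hypothetical helper of mine, and it assumes every integer is at most `n`.

```r
# Pairing function as defined in the post
pair <- function(x, y) 0.5 * (x + y) * (x + y + 1) + x

# Largest key produced when each of k integers is at most n
# (hypothetical helper; nests from the right, like nestedPair)
max_nested_key <- function(n, k) {
  key <- n
  for (i in seq_len(k - 1)) key <- pair(n, key)
  key
}

exact_limit <- 2^53  # doubles represent every integer up to this exactly

max_nested_key(2, 3)                    # 107, the last key in the table above
max_nested_key(1000, 3) < exact_limit   # TRUE
max_nested_key(1e5, 3) < exact_limit    # FALSE
```

Beyond that limit, distinct tuples may no longer be guaranteed distinct keys.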

## A handy concatenation operator

It may be useful to define a concatenation operator for character strings. I sometimes find this more intuitive and handier than `paste0` or `paste`. It also makes your code look better when you have nested paste calls, e.g. `paste0("Y~", paste0("z", 1:3, "*x", 1:3, collapse = "+"))`. The drawback is that it may reduce the readability of your code for other R users, since it is a self-defined operator. (I guess it should be fine, because it is really intuitive, and other scripting languages have a similar concatenation operator.)

    "%+%" <- function(...) {
      paste0(...)
    }
    > "hello" %+% "world"
    [1] "helloworld"
    > "hello" %+% "world" %+% 1:3
    [1] "helloworld1" "helloworld2" "helloworld3"

Generating a formula:

    > "Y~" %+% paste0("z", 1:3, "*x", 1:3, collapse = "+")
    [1] "Y~z1*x1+z2*x2+z3*x3"

## Compute the self-excluded sample mean by group

`egen` (a Stata command) computes a summary statistic by group and stores it in a new variable. For example, suppose the data has three variables, id, time and y, and we want to compute the mean of y for each id and store it as a new variable mean_y.

In Stata, the command would be

    egen mean_y = mean(y), by(id)

In R, this task can be completed with `ave`.

Generate the dataset:

    id <- rep(1:3, each = 3)
    t <- rep(1:3, 3)
    y <- sample(1:5, 9, replace = TRUE)
    my_data <- data.frame(id = id, time = t, y = y)

Original data:

    > my_data
      id time y
    1  1    1 4
    2  1    2 1
    3  1    3 4
    4  2    1 2
    5  2    2 3
    6  2    3 3
    7  3    1 4
    8  3    2 4
    9  3    3 3

    > within(my_data, {mean_y = ave(y,id)} )
      id time y   mean_y
    1  1    1 4 3.000000
    2  1    2 1 3.000000
    3  1    3 4 3.000000
    4  2    1 2 2.666667
    5  2    2 3 2.666667
    6  2    3 3 2.666667
    7  3    1 4 3.666667
    8  3    2 4 3.666667
    9  3    3 3 3.666667

The default summary statistic is the `mean`. However, we can supply a different function through the `FUN` argument. For example, to compute the standard deviation of y by id:

    > within(my_data, {sd_y = ave(y,id,FUN=sd)} )
      id time y      sd_y
    1  1    1 4 1.7320508
    2  1    2 1 1.7320508
    3  1    3 4 1.7320508
    4  2    1 2 0.5773503
    5  2    2 3 0.5773503
    6  2    3 3 0.5773503
    7  3    1 4 0.5773503
    8  3    2 4 0.5773503
    9  3    3 3 0.5773503

Remark: `within` evaluates an expression in an environment created from the data.frame and returns a modified copy of the data.frame (in our case, with the new variable mean_y or sd_y added). The original `my_data` is unchanged unless the result is assigned back.
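To make the remark about `within` concrete, here is a minimal sketch. The data values are copied from the printed example above rather than re-sampled, and `with_mean` is a name I made up for the illustration.

```r
# Data copied from the printed example above
my_data <- data.frame(id   = rep(1:3, each = 3),
                      time = rep(1:3, 3),
                      y    = c(4, 1, 4, 2, 3, 3, 4, 4, 3))

# within() returns a modified copy; my_data itself is untouched
with_mean <- within(my_data, { mean_y <- ave(y, id) })
"mean_y" %in% names(my_data)    # FALSE
"mean_y" %in% names(with_mean)  # TRUE

# Equivalent direct assignment, which does change my_data
my_data$mean_y <- ave(my_data$y, my_data$id)
```

Either form gives the same column; which one reads better is largely a matter of taste.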

Here is another use of `ave`: creating a self-excluded sample mean by group.

Using the same `my_data` as above (variables id, time and y), we want to compute the mean of y for each id, excluding the value of y in the current time period.

First, we need a function to compute the self-excluded summary. It takes a vector and a function (the default is `mean`) as arguments and applies the function to the vector with one element removed at a time. The return value is a vector whose i-th element is `FUN(x[-i])`.

    excludeSelfSummary <- function(x, FUN = mean) {
      sapply(seq_along(x), function(i) FUN(x[-i]))
    }
    > excludeSelfSummary(1:5, mean)
    [1] 3.50 3.25 3.00 2.75 2.50
    > excludeSelfSummary(1:5, min)
    [1] 2 1 1 1 1
    > excludeSelfSummary(1:5, max)
    [1] 5 5 5 5 4
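For the mean specifically, the loop is not strictly needed: the self-excluded mean equals the group total minus the current value, divided by the group size minus one. The sketch below is an alternative of mine, not the post's method; `excludeSelfMean` is a hypothetical name, and it assumes there are no missing values.

```r
# Loop-based version from the post
excludeSelfSummary <- function(x, FUN = mean) {
  sapply(seq_along(x), function(i) FUN(x[-i]))
}

# Closed-form self-excluded mean (hypothetical helper name);
# assumes no missing values; a group of size one yields NaN
excludeSelfMean <- function(x) {
  (sum(x) - x) / (length(x) - 1)
}

excludeSelfMean(1:5)  # 3.50 3.25 3.00 2.75 2.50
```

The general `excludeSelfSummary` is still the one to use for statistics such as `min` or `max`, which have no such shortcut.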

Then we pass `excludeSelfSummary` to `ave` as the `FUN` argument (the new column keeps the name `sd_y` from the earlier example, although it now holds the self-excluded mean).

```
> within(my_data, {sd_y = ave(y,id,FUN=excludeSelfSummary)} )
  id time y sd_y
1  1    1 4  2.5
2  1    2 1  4.0
3  1    3 4  2.5
4  2    1 2  3.0
5  2    2 3  2.5
6  2    3 3  2.5
7  3    1 4  3.5
8  3    2 4  3.5
9  3    3 3  4.0
```

Of course, we could also compute the self-excluded minimum or maximum.

```
> within(my_data, {sd_y = ave(y,id,FUN=function(x) excludeSelfSummary(x,min) )})
  id time y sd_y
1  1    1 4    1
2  1    2 1    4
3  1    3 4    1
4  2    1 2    3
5  2    2 3    2
6  2    3 3    2
7  3    1 4    3
8  3    2 4    3
9  3    3 3    4
```

Categories: data cleaning

## How to do egen (Stata cmd) in R

February 12, 2013

`egen` (a Stata command) computes a summary statistic by group and stores it in a new variable. For example, suppose the data has three variables, id, time and y, and we want to compute the mean of y for each id and store it as a new variable mean_y.

In Stata, the command would be

    egen mean_y = mean(y), by(id)

In R, this task can be completed with `ave`.

Generate the dataset:

    id <- rep(1:3, each = 3)
    t <- rep(1:3, 3)
    y <- sample(1:5, 9, replace = TRUE)
    my_data <- data.frame(id = id, time = t, y = y)

Original data:

    > my_data
      id time y
    1  1    1 4
    2  1    2 1
    3  1    3 4
    4  2    1 2
    5  2    2 3
    6  2    3 3
    7  3    1 4
    8  3    2 4
    9  3    3 3

    > within(my_data, {mean_y = ave(y,id)} )
      id time y   mean_y
    1  1    1 4 3.000000
    2  1    2 1 3.000000
    3  1    3 4 3.000000
    4  2    1 2 2.666667
    5  2    2 3 2.666667
    6  2    3 3 2.666667
    7  3    1 4 3.666667
    8  3    2 4 3.666667
    9  3    3 3 3.666667

The default summary statistic is the `mean`. However, we can supply a different function through the `FUN` argument. For example, to compute the standard deviation of y by id:

    > within(my_data, {sd_y = ave(y,id,FUN=sd)} )
      id time y      sd_y
    1  1    1 4 1.7320508
    2  1    2 1 1.7320508
    3  1    3 4 1.7320508
    4  2    1 2 0.5773503
    5  2    2 3 0.5773503
    6  2    3 3 0.5773503
    7  3    1 4 0.5773503
    8  3    2 4 0.5773503
    9  3    3 3 0.5773503

Remark: `within` evaluates an expression in an environment created from the data.frame and returns a modified copy of the data.frame (in our case, with the new variable mean_y or sd_y added). The original `my_data` is unchanged unless the result is assigned back.
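What makes `ave` a good match for `egen` is that it returns one value per row, whereas `tapply` returns one value per group. The sketch below contrasts the two; the data values are copied from the printed example, and the variable names are mine.

```r
# Data copied from the printed example above
my_data <- data.frame(id   = rep(1:3, each = 3),
                      time = rep(1:3, 3),
                      y    = c(4, 1, 4, 2, 3, 3, 4, 4, 3))

group_means <- tapply(my_data$y, my_data$id, mean)  # one value per id
row_means   <- ave(my_data$y, my_data$id)           # one value per row

length(group_means)  # 3
length(row_means)    # 9

# Looking up each row's id in the group result reproduces ave()
lookup <- as.vector(group_means[as.character(my_data$id)])
```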

Categories: data cleaning, stata

## Generating lag/lead variables

March 11, 2012

A few days ago, a friend asked me whether there is any function in R that generates lag/lead variables in a data.frame, or does something similar to `_n` in Stata. He wanted to use it to clean up his dataset in R.

From the Stata help manual: `_n` contains the number of the current observation.

Here’s an example to illustrate what _n does:

    set obs 10
    generate x = _n
    generate x_lag1 = x[_n-1]
    generate x_lead1 = x[_n+1]

The data generated would be:

    x       = {1,2,3,4,5,6,7,8,9,10}
    x_lag1  = {NA,1,2,3,4,5,6,7,8,9}
    x_lead1 = {2,3,4,5,6,7,8,9,10,NA}

The key feature is that the new vector has the same length as the original vector, so we can combine it with the original vector or with other generated vectors.

One application is to create a moving-average (MA) series (just an example; it is better to use a function from a time-series package for that):

    generate x_ma_1 = (x[_n-1] + x[_n]) / 2

I googled for a while and found that there are basically two types of methods for generating lag/lead variables in R (reference):

1. Functions that generate a shorter vector (e.g. `embed()`, or `running()` in gtools).
2. Functions in ts, zoo, xts, dynlm and dlm.

However, neither solution solves his problem, so I wrote a “shift” function to do the task:

    shift <- function(x, shift_by) {
      stopifnot(is.numeric(shift_by))
      stopifnot(is.numeric(x))

      if (length(shift_by) > 1)
        return(sapply(shift_by, shift, x = x))

      out <- NULL
      abs_shift_by <- abs(shift_by)
      if (shift_by > 0)
        out <- c(tail(x, -abs_shift_by), rep(NA, abs_shift_by))
      else if (shift_by < 0)
        out <- c(rep(NA, abs_shift_by), head(x, -abs_shift_by))
      else
        out <- x
      out
    }

    # Example
    d <- data.frame(x = 1:10)
    # generate lead variable
    d$df_lead2 <- shift(d$x, 2)
    # generate lag variable
    d$df_lag2 <- shift(d$x, -2)
    > d
        x df_lead2 df_lag2
    1   1        3      NA
    2   2        4      NA
    3   3        5       1
    4   4        6       2
    5   5        7       3
    6   6        8       4
    7   7        9       5
    8   8       10       6
    9   9       NA       7
    10 10       NA       8
    # shift_by is vectorized
    > shift(d$x, -2:2)
          [,1] [,2] [,3] [,4] [,5]
     [1,]   NA   NA    1    2    3
     [2,]   NA    1    2    3    4
     [3,]    1    2    3    4    5
     [4,]    2    3    4    5    6
     [5,]    3    4    5    6    7
     [6,]    4    5    6    7    8
     [7,]    5    6    7    8    9
     [8,]    6    7    8    9   10
     [9,]    7    8    9   10   NA
    [10,]    8    9   10   NA   NA
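With `shift` in hand, the Stata moving-average example from earlier can be written in one line. This is a minimal sketch: the `shift` below is a compact single-offset restatement of the post's function so that the block runs on its own.

```r
# Compact restatement of the post's shift() for a single offset
shift <- function(x, shift_by) {
  n <- abs(shift_by)
  if (shift_by > 0) c(tail(x, -n), rep(NA, n))
  else if (shift_by < 0) c(rep(NA, n), head(x, -n))
  else x
}

x <- 1:10
x_ma_1 <- (shift(x, -1) + x) / 2   # same idea as (x[_n-1] + x[_n]) / 2
x_ma_1
# [1]  NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
```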

    # Test
    library(testthat)
    expect_identical(shift(1:10, 2), c(3:10, NA, NA))
    expect_identical(shift(1:10, -2), c(NA, NA, 1:8))
    expect_identical(shift(1:10, 0), 1:10)
    expect_identical(shift(1:10, 1:2), cbind(c(2:10, NA), c(3:10, NA, NA)))

Notice that the result depends on how the data.frame is sorted.
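Because the result depends on row order, panel data should be sorted first and lagged within each group. One way to do that, sketched here under my own assumptions, is to combine `shift` with `ave`. The `panel` data frame and its values are made up for illustration, and `shift` is again a compact single-offset restatement of the post's function.

```r
# Compact restatement of the post's shift() for a single offset
shift <- function(x, shift_by) {
  n <- abs(shift_by)
  if (shift_by > 0) c(tail(x, -n), rep(NA, n))
  else if (shift_by < 0) c(rep(NA, n), head(x, -n))
  else x
}

# Hypothetical panel, deliberately unsorted
panel <- data.frame(id   = c(2, 1, 1, 2, 1, 2),
                    time = c(1, 3, 1, 3, 2, 2),
                    y    = c(20, 13, 11, 23, 12, 22))

# Sort by id and time first, then lag within each id
panel <- panel[order(panel$id, panel$time), ]
panel$y_lag1 <- ave(panel$y, panel$id, FUN = function(v) shift(v, -1))
```

The first observation of each id gets `NA`, so a lag never reaches across groups.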

Categories: Custom Function
## Overhead cost of a function call

October 2, 2011

Recently, I wanted to apply unit testing to my R programs. The first thing I need to do is chop every few lines of code into functions so that I can test each of them.

A question came to mind: what is the overhead cost of a function call? To answer this question, I wrote the following:

    library(rbenchmark)
    library(compiler)
    f <- function(x, y) {
      x + y
    }
    g <- function(x, y) {
      f(x, y)
    }
    cmpf <- cmpfun(f)
    cmpg <- cmpfun(g)
    benchmark(1+2, f(1,2), g(1,2), cmpf(1,2), cmpg(1,2),
              replications = 1000000,
              columns = c("test", "replications", "elapsed", "relative"),
              order = 'relative')
            test replications elapsed relative
    1      1 + 2      1000000    4.00    1.000
    4 cmpf(1, 2)      1000000    4.34    1.085
    2    f(1, 2)      1000000    4.82    1.205
    5 cmpg(1, 2)      1000000    5.44    1.360
    3    g(1, 2)      1000000    5.68    1.420

The result suggests several things:

- The overhead is about 0.82 seconds per 1,000,000 function calls (4.82 - 4.00), i.e. roughly 0.8 microseconds per call.
- If we compile the function, the overhead is about 0.34 seconds per 1,000,000 function calls (4.34 - 4.00).

I don’t know whether that is a huge cost, but I believe the benefit of cleaner code with unit tests must be worth more than that!

Categories: R programming
## Call by reference in R

September 11, 2011

Sometimes it is convenient to use “call by reference evaluation” inside an R function. For example, if you want a function to have multiple return values, you can either return a list of values and split them afterward, or return the values via the arguments.

For some reason (I would like to know it too), R does not support call by reference. The first reason that comes to mind is safety: if a function can modify its arguments, the code is harder to trace and debug (you have to examine the details of each function to find out which one changed your variables). In fact, R effectively passes arguments by reference as long as their values are not changed, and makes a copy only when an argument is modified (copy-on-modify). So we should expect no efficiency gain (at least not a significant one) even if we could do call by reference.
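The visible consequence of this behaviour is that a function can modify its argument freely without affecting the caller. A minimal sketch (`double_first` and the vectors are hypothetical names used only for illustration):

```r
# Hypothetical function that modifies its argument locally
double_first <- function(v) {
  v[1] <- v[1] * 2   # triggers a copy; the caller's vector is untouched
  v
}

original <- c(1, 2, 3)
modified <- double_first(original)

original  # 1 2 3
modified  # 2 2 3
```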

Anyway, it is always good to know how to have a “pseudo call by reference” in R (you can choose whether to use it, for whatever reason). The trick is to make use of the `eval.parent` function: you add code that replaces the argument's value in the parent environment, so that the function looks as if it implements call by reference. Here are some examples of how to do it:

    set <- function(x, value) {
      eval.parent(substitute(x <- value))
    }
    valX <- 51
    set(valX, 10)
    > valX
    [1] 10

    addOne_1 <- function(x) {
      eval.parent(substitute(x <- x + 1))
    }
    valX <- 51
    addOne_1(valX)
    > valX
    [1] 52

Note that you cannot reassign x inside the function. As long as x is untouched, `substitute` replaces x with the expression the caller supplied (here `valX`). Once x is assigned inside the function, it becomes an ordinary local variable, `substitute` replaces it with its value, and the method no longer works. For example:

    addOne_2 <- function(x) {
      x <- x + 1
      eval.parent(substitute(x <- x))
    }
    valX <- 51
    > addOne_2(valX)
    Error in 52 <- 52 : invalid (do_set) left-hand side to assignment

If you want to change the value of x inside the function, you have to copy x to a new object and work on that object instead. At the end of the function, you can replace the value of x in the parent environment with the new object.

    addOne_3 <- function(x) {
      xx <- x
      xx <- xx + 1
      eval.parent(substitute(x <- xx))
    }
    valX <- 51
    addOne_3(valX)
    > valX
    [1] 52

Another, more formal way to do call by reference is to use the R.oo package.
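Base R also offers reference semantics through environments, which are not copied when passed to a function. The sketch below is an alternative I am adding for comparison, not something from the post; `counter` and `add_one` are hypothetical names.

```r
# Hypothetical example: an environment used as a mutable container
counter <- new.env()
counter$value <- 51

add_one <- function(env) {
  env$value <- env$value + 1   # modifies the caller's environment in place
  invisible(NULL)
}

add_one(counter)
counter$value  # 52
```

Unlike the `eval.parent` trick, this does not depend on the name of the caller's variable, at the cost of wrapping the value in an environment.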

Another way to implement

Categories: R programming

```
```