A pairing function is a one-to-one and onto function that maps a pair of non-negative integers to a single non-negative integer. It is defined as follows:
library(foreach)
pair <- function(x, y) {
  0.5 * (x + y) * (x + y + 1) + x
}
unpair <- function(z) {
  w <- floor((sqrt(8 * z + 1) - 1) / 2)
  t <- w * (w + 1) / 2
  cbind(z - t, w - z + t)
}
foreach(i = 0:4, .combine = rbind) %do% {
  x <- 0:i
  y <- i:0
  key <- pair(x, y)
  unpair_key <- unpair(key)
  cbind(x, y, key = key, unpair_key = unpair_key)
}
      x y key x y
 [1,] 0 0   0 0 0
 [2,] 0 1   1 0 1
 [3,] 1 0   2 1 0
 [4,] 0 2   3 0 2
 [5,] 1 1   4 1 1
 [6,] 2 0   5 2 0
 [7,] 0 3   6 0 3
 [8,] 1 2   7 1 2
 [9,] 2 1   8 2 1
[10,] 3 0   9 3 0
[11,] 0 4  10 0 4
[12,] 1 3  11 1 3
[13,] 2 2  12 2 2
[14,] 3 1  13 3 1
[15,] 4 0  14 4 0
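As a quick sanity check, the round trip can be verified over a whole grid of pairs rather than by eye. The sketch below redefines pair and unpair so that it runs on its own; the grid size 0:50 is an arbitrary choice, and since unpair relies on floating-point sqrt, this check says nothing about very large keys.

```r
# Same definitions as in the post, repeated so the sketch is self-contained
pair <- function(x, y) 0.5 * (x + y) * (x + y + 1) + x
unpair <- function(z) {
  w <- floor((sqrt(8 * z + 1) - 1) / 2)
  t <- w * (w + 1) / 2
  cbind(x = z - t, y = w - z + t)
}

# Every (x, y) combination with 0 <= x, y <= 50 (grid size is arbitrary)
grid <- expand.grid(x = 0:50, y = 0:50)
key <- pair(grid$x, grid$y)
back <- unpair(key)

# One-to-one: no two pairs share a key
no_collisions <- !anyDuplicated(key)
# Invertible: unpair recovers the original x and y
round_trip_ok <- all(back[, "x"] == grid$x) && all(back[, "y"] == grid$y)
```

Both flags should be TRUE on this grid; a FALSE would point to a collision or to a rounding problem in unpair.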
If the ordering of x and y is not important, we can swap x and y whenever x > y. However, the pairing function is then no longer one-to-one, and we cannot recover the original order of x and y from z:
library(compiler)
pair <- cmpfun(function(x, y, ordering_matter = TRUE) {
  if (ordering_matter) {
    return(0.5 * (x + y) * (x + y + 1) + x)
  } else {
    swap <- x > y
    return(0.5 * (x + y) * (x + y + 1) + (x * !swap) + (y * swap))
  }
})
foreach(i = 0:4, .combine = rbind) %do% {
  x <- 0:i
  y <- i:0
  key <- pair(x, y, ordering_matter = FALSE)
  unpair_key <- unpair(key)
  cbind(x, y, key = key, unpair_key = unpair_key)
}
      x y key x y
 [1,] 0 0   0 0 0
 [2,] 0 1   1 0 1
 [3,] 1 0   1 0 1
 [4,] 0 2   3 0 2
 [5,] 1 1   4 1 1
 [6,] 2 0   3 0 2
 [7,] 0 3   6 0 3
 [8,] 1 2   7 1 2
 [9,] 2 1   7 1 2
[10,] 3 0   6 0 3
[11,] 0 4  10 0 4
[12,] 1 3  11 1 3
[13,] 2 2  12 2 2
[14,] 3 1  11 1 3
[15,] 4 0  10 0 4
If we have more than two integers, we can apply the Pairing function in a nested manner.
nestedPair <- function(x) {
  ncol_x <- ncol(x)
  if (ncol_x == 1) {
    return(x)
  } else if (ncol_x == 2) {
    return(pair(x[, 1], x[, 2]))
  } else if (ncol_x > 2) {
    return(pair(x[, 1], nestedPair(x[, 2:ncol_x])))
  }
}
nestedUnpair <- function(x, order) {
  if (order == 1) {
    return(unpair(x))
  } else if (order > 1) {
    out <- unpair(x)
    return(cbind(out[, 1], nestedUnpair(out[, 2], order - 1)))
  }
}
x <- expand.grid(0:2, 0:2, 0:2)
key <- nestedPair(x)
unpair_key <- nestedUnpair(key, 2)
cbind(x = x, key = key, unpair_key = unpair_key)
   x.Var1 x.Var2 x.Var3 key unpair_key.1 unpair_key.2 unpair_key.3
1       0      0      0   0            0            0            0
2       1      0      0   2            1            0            0
3       2      0      0   5            2            0            0
4       0      1      0   3            0            1            0
5       1      1      0   7            1            1            0
6       2      1      0  12            2            1            0
7       0      2      0  15            0            2            0
8       1      2      0  22            1            2            0
9       2      2      0  30            2            2            0
10      0      0      1   1            0            0            1
11      1      0      1   4            1            0            1
12      2      0      1   8            2            0            1
13      0      1      1  10            0            1            1
14      1      1      1  16            1            1            1
15      2      1      1  23            2            1            1
16      0      2      1  36            0            2            1
17      1      2      1  46            1            2            1
18      2      2      1  57            2            2            1
19      0      0      2   6            0            0            2
20      1      0      2  11            1            0            2
21      2      0      2  17            2            0            2
22      0      1      2  28            0            1            2
23      1      1      2  37            1            1            2
24      2      1      2  47            2            1            2
25      0      2      2  78            0            2            2
26      1      2      2  92            1            2            2
27      2      2      2 107            2            2            2
paste0 or paste. Also, it makes your code look better when you have nested paste calls, e.g. paste0("Y~", paste0("z", 1:3, "*x", 1:3, collapse="+")). The drawback is that it may reduce the readability of your code for other R users, since it is a user-defined function. (I guess it should be fine, because it is really intuitive, and other scripting languages have a similar concatenation operator.)
"%+%" <- function(...) {
  paste0(...)
}
> "hello" %+% "world"
[1] "helloworld"
> "hello" %+% "world" %+% 1:3
[1] "helloworld1" "helloworld2" "helloworld3"
Generating formula:
> "Y~" %+% paste0("z", 1:3, "*x", 1:3, collapse = "+")
[1] "Y~z1*x1+z2*x2+z3*x3"
In Stata, the command would be
egen mean_y = mean(y), by(id)
In R, this task can be completed with ave.
Generate dataset:
id <- rep(1:3, each = 3)
t <- rep(1:3, 3)
y <- sample(1:5, 9, replace = TRUE)
my_data <- data.frame(id = id, time = t, y = y)
Original data:
> my_data
  id time y
1  1    1 4
2  1    2 1
3  1    3 4
4  2    1 2
5  2    2 3
6  2    3 3
7  3    1 4
8  3    2 4
9  3    3 3
> within(my_data, {mean_y = ave(y, id)})
  id time y   mean_y
1  1    1 4 3.000000
2  1    2 1 3.000000
3  1    3 4 3.000000
4  2    1 2 2.666667
5  2    2 3 2.666667
6  2    3 3 2.666667
7  3    1 4 3.666667
8  3    2 4 3.666667
9  3    3 3 3.666667
The default summary statistic is the mean. However, we can supply a different function to compute the summary statistic. For example, if we want to compute the sd of y by id, we can use
> within(my_data, {sd_y = ave(y, id, FUN = sd)})
  id time y      sd_y
1  1    1 4 1.7320508
2  1    2 1 1.7320508
3  1    3 4 1.7320508
4  2    1 2 0.5773503
5  2    2 3 0.5773503
6  2    3 3 0.5773503
7  3    1 4 0.5773503
8  3    2 4 0.5773503
9  3    3 3 0.5773503
Remark: within evaluates an expression in an environment created from the data.frame, then returns a modified copy of the data.frame (in our case, with the new variable mean_y or sd_y).
Here is another usage of ave. We would like to create a self-excluded sample mean by group.
Suppose the data has three variables, id, time and y, and we want to compute the mean of y for each id, excluding the value of y in the current time period.
id <- rep(1:3, each = 3)
t <- rep(1:3, 3)
y <- sample(1:5, 9, replace = TRUE)
my_data <- data.frame(id = id, time = t, y = y)
Original data:
> my_data
  id time y
1  1    1 4
2  1    2 1
3  1    3 4
4  2    1 2
5  2    2 3
6  2    3 3
7  3    1 4
8  3    2 4
9  3    3 3
First, we need a function to compute the self-excluded summary. This function takes a vector and a function (default: mean) as arguments. It applies the function to the vector with one element removed. The return value is a vector whose ith element is FUN(x[-i]).
excludeSelfSummary <- function(x, FUN = mean) {
  # x[-i] drops the ith element before FUN is applied
  sapply(seq_along(x), function(i) FUN(x[-i]))
}
> excludeSelfSummary(1:5, mean)
[1] 3.50 3.25 3.00 2.75 2.50
> excludeSelfSummary(1:5, min)
[1] 2 1 1 1 1
> excludeSelfSummary(1:5, max)
[1] 5 5 5 5 4
Then we pass excludeSelfSummary to ave as the FUN argument.
> within(my_data, {sd_y = ave(y,id,FUN=excludeSelfSummary)} )
id time y sd_y
1 1 1 4 2.5
2 1 2 1 4.0
3 1 3 4 2.5
4 2 1 2 3.0
5 2 2 3 2.5
6 2 3 3 2.5
7 3 1 4 3.5
8 3 2 4 3.5
9 3 3 3 4.0
Of course, we could compute the self excluded minimum or maximum.
> within(my_data, {sd_y = ave(y,id,FUN=function(x) excludeSelfSummary(x,min) )})
id time y sd_y
1 1 1 4 1
2 1 2 1 4
3 1 3 4 1
4 2 1 2 3
5 2 2 3 2
6 2 3 3 2
7 3 1 4 3
8 3 2 4 3
9 3 3 3 4
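For the mean specifically, the loop over elements can be avoided, because the self-excluded mean has a closed form: (sum(x) - x[i]) / (n - 1). The sketch below is an illustrative alternative rather than part of the original post; the name excludeSelfMean and the column mean_excl_y are made up here, and the data are the values printed above, typed in by hand. Note that a group with a single observation would yield NaN.

```r
# Hypothetical vectorised helper: self-excluded mean via the closed form
excludeSelfMean <- function(x) (sum(x) - x) / (length(x) - 1)

# The example data shown above, entered literally instead of sampled
my_data <- data.frame(id = rep(1:3, each = 3),
                      time = rep(1:3, 3),
                      y = c(4, 1, 4, 2, 3, 3, 4, 4, 3))

# ave applies the helper within each id and returns a full-length vector
my_data$mean_excl_y <- ave(my_data$y, my_data$id, FUN = excludeSelfMean)
```

This shortcut only works for summaries with such an algebraic form; for min, max or sd the general excludeSelfSummary approach is still needed.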
kin233

How to do egen (stata cmd) in R
https://ctszkin.com/2013/02/12/howtodoegenstatacmdinr/
Tue, 12 Feb 2013 06:36:11 +0000
egen (a Stata command) computes a summary statistic by group and stores it in a new variable. For example, suppose the data has three variables, id, time and y, and we want to compute the mean of y for each id and store it as a new variable mean_y.
In Stata, the command would be
egen mean_y = mean(y), by(id)
In R, this task can be completed with ave.
Generate dataset:
id <- rep(1:3, each = 3)
t <- rep(1:3, 3)
y <- sample(1:5, 9, replace = TRUE)
my_data <- data.frame(id = id, time = t, y = y)
Original data:
> my_data
id time y
1 1 1 4
2 1 2 1
3 1 3 4
4 2 1 2
5 2 2 3
6 2 3 3
7 3 1 4
8 3 2 4
9 3 3 3
> within(my_data, {mean_y = ave(y,id)} )
id time y mean_y
1 1 1 4 3.000000
2 1 2 1 3.000000
3 1 3 4 3.000000
4 2 1 2 2.666667
5 2 2 3 2.666667
6 2 3 3 2.666667
7 3 1 4 3.666667
8 3 2 4 3.666667
9 3 3 3 3.666667
The default summary statistic is the mean. However, we can supply a different function to compute the summary statistic. For example, if we want to compute the sd of y by id, we can use
within(my_data, {sd_y = ave(y,id,FUN=sd)} )
id time y sd_y
1 1 1 4 1.7320508
2 1 2 1 1.7320508
3 1 3 4 1.7320508
4 2 1 2 0.5773503
5 2 2 3 0.5773503
6 2 3 3 0.5773503
7 3 1 4 0.5773503
8 3 2 4 0.5773503
9 3 3 3 0.5773503
Remark: within evaluates an expression in an environment created from the data.frame, then returns a modified copy of the data.frame (in our case, with the new variable mean_y or sd_y).
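What makes ave a natural counterpart to egen is that it returns one value per row rather than one value per group. The sketch below contrasts it with tapply, which collapses the data; it reuses the values printed above (typed in literally, since the post samples them randomly), and the variable names per_row and per_group are only illustrative.

```r
# The example data shown above, entered literally instead of sampled
my_data <- data.frame(id = rep(1:3, each = 3),
                      time = rep(1:3, 3),
                      y = c(4, 1, 4, 2, 3, 3, 4, 4, 3))

# ave: group mean repeated on every row of the group (egen-like)
per_row <- ave(my_data$y, my_data$id)

# tapply: one mean per group (closer to Stata's collapse than to egen)
per_group <- tapply(my_data$y, my_data$id, mean)
```

Because per_row has the same length as the data, it can be assigned directly as a new column, e.g. my_data$mean_y <- per_row.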

Generating a lag/lead variables
https://ctszkin.com/2012/03/11/generatingalagleadvariables/
Mon, 12 Mar 2012 03:06:52 +0000
A few days ago, a friend asked me whether there is any function in R to generate lag/lead variables in a data.frame, or to do something similar to _n in Stata. He wanted to use it to clean up his dataset in R.
From the Stata help manual: _n contains the number of the current observation.
Here’s an example to illustrate what _n does:
set obs 10
generate x = _n
generate x_lag1 = x[_n-1]
generate x_lead1 = x[_n+1]
The data generated would be :
x = {1,2,3,4,5,6,7,8,9,10}
x_lag1 = {NA,1,2,3,4,5,6,7,8,9}
x_lead1 = {2,3,4,5,6,7,8,9,10,NA}
The key feature is the new vector has the same length as the original vector, so we can use it with the original vector or other generated vector.
One application is to create a moving-average (MA) series (this is just an example; it is better to use a function from a time-series package for that):
generate x_ma_1 = (x[_n-1] + x[_n]) / 2
I googled for a while; basically there are two types of methods to generate lag/lead variables in R (reference):
1> Functions that generate a shorter vector (e.g. embed(), or running() in gtools)
2> Functions in ts, zoo, xts, dynlm, dlm.
However, neither solution solves his problem, so I wrote a “shift” function to do the task:
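shift <- function(x, shift_by) {
  stopifnot(is.numeric(shift_by))
  stopifnot(is.numeric(x))
  # vectorized over shift_by: one column per shift
  if (length(shift_by) > 1)
    return(sapply(shift_by, shift, x = x))
  out <- NULL
  abs_shift_by <- abs(shift_by)
  if (shift_by > 0)
    # lead: drop the first elements, pad the end with NA
    out <- c(tail(x, -abs_shift_by), rep(NA, abs_shift_by))
  else if (shift_by < 0)
    # lag: pad the start with NA, drop the last elements
    out <- c(rep(NA, abs_shift_by), head(x, -abs_shift_by))
  else
    out <- x
  out
}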
# Example
d <- data.frame(x = 1:10)
# generate lead variable
d$df_lead2 <- shift(d$x, 2)
# generate lag variable
d$df_lag2 <- shift(d$x, -2)
> d
x df_lead2 df_lag2
1 1 3 NA
2 2 4 NA
3 3 5 1
4 4 6 2
5 5 7 3
6 6 8 4
7 7 9 5
8 8 10 6
9 9 NA 7
10 10 NA 8
# shift_by is vectorized
> shift(d$x, -2:2)
[,1] [,2] [,3] [,4] [,5]
[1,] NA NA 1 2 3
[2,] NA 1 2 3 4
[3,] 1 2 3 4 5
[4,] 2 3 4 5 6
[5,] 3 4 5 6 7
[6,] 4 5 6 7 8
[7,] 5 6 7 8 9
[8,] 6 7 8 9 10
[9,] 7 8 9 10 NA
[10,] 8 9 10 NA NA
# Test
library(testthat)
expect_that(shift(1:10, 2), is_identical_to(c(3:10, NA, NA)))
expect_that(shift(1:10, -2), is_identical_to(c(NA, NA, 1:8)))
expect_that(shift(1:10, 0), is_identical_to(1:10))
expect_that(shift(1:10, 1:2), is_identical_to(cbind(c(2:10, NA), c(3:10, NA, NA))))
Notice that the result depends on how the data.frame is sorted.
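To make the sorting caveat concrete, here is a minimal sketch for panel data: sort by id and time first, then lag within each id so that values never leak across groups. The data frame panel and its values are invented for illustration, shift is condensed to the scalar case so the block runs on its own, and every group is assumed to have more rows than the shift.

```r
# Condensed scalar version of the shift function from the post
shift <- function(x, shift_by) {
  n <- abs(shift_by)
  if (shift_by > 0) c(tail(x, -n), rep(NA, n))       # lead
  else if (shift_by < 0) c(rep(NA, n), head(x, -n))  # lag
  else x
}

# Invented, deliberately unsorted panel: y encodes 10 * id + time
panel <- data.frame(id   = c(2, 1, 1, 2, 1, 2),
                    time = c(2, 3, 1, 1, 2, 3),
                    y    = c(22, 13, 11, 21, 12, 23))

# Sort first: shift only sees row order, not the time variable
panel <- panel[order(panel$id, panel$time), ]

# Lag by one period within each id
panel$y_lag1 <- ave(panel$y, panel$id, FUN = function(v) shift(v, -1))
```

Without the order step, the "previous" row would simply be whatever row happened to sit above, which is rarely the previous time period.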

Overhead cost of a function call
https://ctszkin.com/2011/10/02/overheadcostofafunctioncall/
Sun, 02 Oct 2011 04:25:54 +0000
Recently, I wanted to apply unit testing to my R programs. The first thing I needed to do was to chop every few lines of code into functions so that I could test each of them.
A question came to my mind: what is the overhead cost of a function call? To answer this question, I wrote the following:
library(rbenchmark)
library(compiler)
f <- function(x, y) {
  x + y
}
g <- function(x, y) {
  f(x, y)
}
cmpf <- cmpfun(f)
cmpg <- cmpfun(g)
benchmark(1+2, f(1,2), g(1,2), cmpf(1,2), cmpg(1,2), replications = 1000000, columns = c("test", "replications", "elapsed", "relative"), order = 'relative')
test replications elapsed relative
1 1 + 2 1000000 4.00 1.000
4 cmpf(1, 2) 1000000 4.34 1.085
2 f(1, 2) 1000000 4.82 1.205
5 cmpg(1, 2) 1000000 5.44 1.360
3 g(1, 2) 1000000 5.68 1.420
The result suggests several things:
 The overhead cost of a plain function call, f(1, 2) versus 1 + 2, is about 0.82 seconds per 1,000,000 calls.
 If we compile the function, the overhead cost drops to about 0.34 seconds per 1,000,000 calls.
I don’t know whether this is a huge cost, but I believe the benefit of cleaner code with unit tests is worth more than that!
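To put these numbers on a per-call scale, the arithmetic can be written out. The sketch below only reuses the elapsed times from the table above (it does not rerun the benchmark), and the variable names are illustrative.

```r
# Elapsed seconds copied from the benchmark table above
elapsed <- c(inline = 4.00, cmpf = 4.34, f = 4.82, cmpg = 5.44, g = 5.68)
replications <- 1e6

# Extra time relative to the inline expression 1 + 2,
# converted to microseconds per call
overhead_us <- (elapsed - elapsed[["inline"]]) / replications * 1e6
```

On these figures a single call to f costs roughly 0.82 microseconds more than the inline expression, and the compiled version roughly 0.34 microseconds; the timings are specific to the author's machine and R version.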

Call by reference in R
https://ctszkin.com/2011/09/11/callbyreferenceinr/
Mon, 12 Sep 2011 01:27:26 +0000
Sometimes it is convenient to use “call by reference” evaluation inside an R function. For example, if you want your function to have multiple return values, you can either return a list of values and split them afterward, or return the values via the arguments.
For some reasons (I would like to know too), R does not support call by reference. The first reason that comes to my mind is safety: if a function can do call by reference, it is more difficult to trace and debug the code (you have to find out which function changed the value of your variables by examining the details of each function). In fact, R effectively does “call by reference” when the value of the argument is not changed; it makes a copy of the argument only when the value is modified. So we can expect no efficiency gain (at least not a significant one) even if we could do call by reference.
Anyway, it is always good to know how to do a “pseudo call by reference” in R (you can choose whether or not to use it, for whatever reason). The trick is to make use of the eval.parent function in R. You can add code that replaces the argument value in the parent environment, so that the function looks as if it implements the call-by-reference evaluation strategy. Here are some examples of how to do it:
set <- function(x, value) {
  eval.parent(substitute(x <- value))
}
valX <- 51
set(valX, 10)
valX
>[1] 10
addOne_1 <- function(x, value) {
  eval.parent(substitute(x <- x + 1))
}
valX <- 51
addOne_1(valX)
valX
>[1] 52
Note that you cannot change the value of x inside the function. If you do, x is no longer the unevaluated argument but an ordinary local variable, so substitute replaces x with its new value instead of the name of the caller's variable, and the method no longer works. For example:
addOne_2 <- function(x, value) {
  x <- x + 1
  eval.parent(substitute(x <- x))
}
valX <- 51
addOne_2(valX)
>Error in 52 <- 52 : invalid (do_set) left-hand side to assignment
If you want to change the value of x inside the function, you have to copy x to a new object and work with the new object instead of x. At the end of the function, you can replace the value of x in the parent environment with the new object.
addOne_3 <- function(x, value) {
  xx <- x
  xx <- xx + 1
  eval.parent(substitute(x <- xx))
}
valX <- 51
addOne_3(valX)
valX
>[1] 52
Another way to do call by reference more formally is to use the R.oo package.
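Base R also has one object type with genuine reference semantics: environments are not copied when passed to a function. The sketch below is an alternative to the eval.parent trick rather than something from the original post; the names counter and addOneEnv are illustrative.

```r
# An environment is passed by reference, so changes made inside
# the function are visible to the caller
addOneEnv <- function(counter) {
  counter$value <- counter$value + 1
  invisible(NULL)
}

counter <- new.env()
counter$value <- 51
addOneEnv(counter)
counter$value  # now 52: the caller's object was modified in place
```

The trade-off is that the caller must wrap the value in an environment up front, whereas the eval.parent approach works on an ordinary variable.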

A shortcut function for install.packages() and library()
https://ctszkin.com/2011/09/11/ashortcutfunctionforinstallpackagesandlibrary/
Sun, 11 Sep 2011 04:30:55 +0000
I enjoy trying different kinds of R packages. Since I have more than one computer (one at home, one at the office and a laptop), it is troublesome to check whether I have installed a new package on each of them. Therefore I wrote a function to install and load a package in one step. If the package is not installed, it is downloaded from CRAN and then loaded.
packages <- function(x, repos = "http://cran.r-project.org", ...) {
  # turn the unquoted package name into a string
  x <- deparse(substitute(x))
  if (!require(x, character.only = TRUE)) {
    install.packages(pkgs = x, repos = repos, ...)
    require(x, character.only = TRUE)
  }
}
packages(Hmisc)
Thanks richierocks for the suggestion of using deparse(substitute(x)) in the code.
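When several packages are involved, it can help to first list which ones are missing before installing anything. The sketch below is a hypothetical companion helper, not part of the original function: missingPackages is a made-up name, it takes quoted package names, and it uses requireNamespace, which loads a package's namespace as a side effect but does not attach it.

```r
# Hypothetical helper: return the names of packages that are not installed
missingPackages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" and "utils" ship with R, so nothing should be reported missing
to_install <- missingPackages(c("stats", "utils"))

# One would then install only what is missing, e.g.
# if (length(to_install) > 0) install.packages(to_install)
```

The install step is left commented out so that running the sketch never touches the network.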

A quick way to do row repeat and col repeat (rep.row, rep.col)
https://ctszkin.com/2011/09/02/aquickwaytodoreprowandrepcol/
Fri, 02 Sep 2011 23:47:18 +0000
Today I worked on a simulation program that required me to create a matrix by repeating a vector n times (both by row and by column).
Even though the task is extremely simple and takes only one line (10 seconds) to finish, I have to think about whether the argument in rep should be each or times, and whether the argument in matrix should be nrow or ncol. It distracted me from the original task I was working on.
Just now, I wrote the functions rep.row and rep.col to do what I really want. Next time, I won’t have to worry about how to combine the matrix and rep commands to repeat a vector to form a matrix!
Code
rep.row <- function(x, n) {
  matrix(rep(x, each = n), nrow = n)
}
rep.col <- function(x, n) {
  matrix(rep(x, each = n), ncol = n, byrow = TRUE)
}
x is the vector to be repeated and n is the number of replication. Example:
> rep.row(1:3,5)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 3
[3,] 1 2 3
[4,] 1 2 3
[5,] 1 2 3
> rep.col(1:3,5)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    2    2    2    2    2
[3,]    3    3    3    3    3
I am sure these functions already exist in some package, but it is faster for me to write them than to find them!
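For comparison, the same matrices can be built without rep at all, by letting matrix recycle the vector. This is a sketch of an alternative, not the author's code; the names rep.row2 and rep.col2 are chosen here only to avoid clashing with the functions above.

```r
# Recycle x across rows: byrow = TRUE fills each row with x
rep.row2 <- function(x, n) matrix(x, nrow = n, ncol = length(x), byrow = TRUE)

# Recycle x across columns: default column-wise fill puts x in each column
rep.col2 <- function(x, n) matrix(x, nrow = length(x), ncol = n)

rows <- rep.row2(1:3, 5)  # 5 x 3, every row is 1 2 3
cols <- rep.col2(1:3, 5)  # 3 x 5, every column is 1 2 3
```

Spelling out both nrow and ncol makes the intended shape explicit, which removes the each-versus-times question entirely.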