Archive for February, 2013

A handy concatenation operator

February 12, 2013

It can be useful to define a concatenation operator for character strings. I sometimes find this more intuitive and handy than paste0 or paste. It also makes code look cleaner when you have nested paste calls, e.g. paste0("Y~",paste0("z",1:3, "*x",1:3,collapse="+")). The drawback is that it may reduce the readability of your code for other R users, since it is a self-defined function. (I guess it should be fine, because it is really intuitive, and other scripting languages have similar concatenation operators.)

"%+%" <- function(...){
  # paste0 has no sep argument; it always joins with an empty string
  paste0(...)
}
> "hello" %+% "world"
[1] "helloworld"
> "hello" %+% "world" %+% 1:3
[1] "helloworld1" "helloworld2" "helloworld3"

Generating formula:

> "Y~" %+% paste0("z",1:3, "*x",1:3,collapse="+")
[1] "Y~z1*x1+z2*x2+z3*x3"
Categories: Custom Function
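The formula string built with %+% above is still just a character vector. A minimal sketch of turning it into a formula object, assuming the goal is to pass it to a model-fitting function later; as.formula and all.vars are base R, while the names rhs, f_string and f are only illustrative:

```r
# Same operator as in the post, written without the stray sep argument.
"%+%" <- function(...) paste0(...)

# Build the right-hand side term by term, then join the terms with "+".
rhs <- paste0("z", 1:3, "*x", 1:3, collapse = "+")
f_string <- "Y~" %+% rhs

# as.formula() parses the string into a formula object.
f <- as.formula(f_string)

class(f)      # "formula"
all.vars(f)   # variable names that appear in the formula
```

Keeping the string and the formula in separate variables makes it easy to print the string for checking before it is parsed.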

Compute the self-excluded sample mean by group

Stata's egen command computes a summary statistic by group and stores it in a new variable. For example, suppose the data has three variables, id, time and y, and we want to compute the mean of y for each id and store it as a new variable, mean_y.

In Stata, the command would be

egen mean_y = mean(y), by(id)

In R, this task can be done with ave.

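Generate dataset:

id <- rep(1:3, each = 3)
time <- rep(1:3, 3)  # time rather than t, to avoid confusion with base::t()
y <- sample(1:5, 9, replace = TRUE)  # random draw, so your values may differ
my_data <- data.frame(id = id, time = time, y = y)

Original data:

> my_data
  id time y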
1  1    1 4
2  1    2 1
3  1    3 4
4  2    1 2
5  2    2 3
6  2    3 3
7  3    1 4
8  3    2 4
9  3    3 3
> within(my_data, {mean_y = ave(y,id)} )
  id time y   mean_y
1  1    1 4 3.000000
2  1    2 1 3.000000
3  1    3 4 3.000000
4  2    1 2 2.666667
5  2    2 3 2.666667
6  2    3 3 2.666667
7  3    1 4 3.666667
8  3    2 4 3.666667
9  3    3 3 3.666667

The default summary statistic is the mean, but we can pass a different function through the FUN argument. FUN must be passed by name, since unnamed arguments are treated as grouping variables. For example, to compute the standard deviation of y by id:

> within(my_data, {sd_y = ave(y,id,FUN=sd)} )
  id time y      sd_y
1  1    1 4 1.7320508
2  1    2 1 1.7320508
3  1    3 4 1.7320508
4  2    1 2 0.5773503
5  2    2 3 0.5773503
6  2    3 3 0.5773503
7  3    1 4 0.5773503
8  3    2 4 0.5773503
9  3    3 3 0.5773503

Remark: within evaluates an expression in an environment created from the data.frame and returns a modified copy of it (in our case, with the new variable mean_y or sd_y). The original data.frame is left unchanged unless the result is assigned back.
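To make the remark about within concrete, here is a small sketch. It assumes the y values printed in the output above are typed in directly (instead of drawn with sample) so that the result is deterministic:

```r
# Explicit y values (the ones printed in the post) so the result is deterministic.
my_data <- data.frame(id = rep(1:3, each = 3),
                      time = rep(1:3, 3),
                      y = c(4, 1, 4, 2, 3, 3, 4, 4, 3))

# Calling within() without assigning prints a modified copy
# and leaves my_data untouched.
within(my_data, { mean_y <- ave(y, id) })
names(my_data)   # still just id, time, y

# Assign the result back to keep the new column.
my_data <- within(my_data, { mean_y <- ave(y, id) })
names(my_data)   # now includes mean_y
```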

Here is another use of ave: computing a self-excluded sample mean by group.

Suppose the data has three variables, id, time and y, and we want the mean of y for each id, excluding the value of y in the current time period.

id <- rep(1:3, each = 3)
time <- rep(1:3, 3)  # time rather than t, to avoid confusion with base::t()
y <- sample(1:5, 9, replace = TRUE)  # random draw, so your values may differ
my_data <- data.frame(id = id, time = time, y = y)

Original data:

> my_data
  id time y
1  1    1 4
2  1    2 1
3  1    3 4
4  2    1 2
5  2    2 3
6  2    3 3
7  3    1 4
8  3    2 4
9  3    3 3

First, we need a function to compute the self-excluded mean. It takes a vector x and a function FUN (default mean) as arguments, and applies FUN to x with one element removed at a time. The return value is a vector whose i-th element is FUN(x[-i]).

excludeSelfSummary <- function(x, FUN = mean){
  # seq_along(x) is safe for a zero-length x, unlike 1:length(x)
  sapply(seq_along(x), function(i) FUN(x[-i]))
}
> excludeSelfSummary(1:5,mean)
[1] 3.50 3.25 3.00 2.75 2.50
> excludeSelfSummary(1:5,min)
[1] 2 1 1 1 1
> excludeSelfSummary(1:5,max)
[1] 5 5 5 5 4
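For the mean specifically, the loop is not strictly needed: removing x[i] leaves a sum of sum(x) - x[i] over length(x) - 1 values. The sketch below checks that shortcut against the loop-based function; excludeSelfMean is a hypothetical helper name introduced here, not part of the original code:

```r
# Loop-based version from the post: apply FUN to x with element i removed.
excludeSelfSummary <- function(x, FUN = mean) {
  sapply(seq_along(x), function(i) FUN(x[-i]))
}

# Hypothetical helper (not from the post): closed form for the mean only.
# Removing x[i] leaves a sum of sum(x) - x[i] over length(x) - 1 values.
excludeSelfMean <- function(x) {
  (sum(x) - x) / (length(x) - 1)
}

x <- 1:5
excludeSelfSummary(x, mean)
excludeSelfMean(x)

# Both approaches should agree up to floating-point tolerance.
all.equal(excludeSelfSummary(x, mean), excludeSelfMean(x))
```

The closed form only works for the mean (or the sum); for min, max or other statistics the general excludeSelfSummary is still the way to go.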

Then we pass excludeSelfSummary to ave as the FUN argument.

> within(my_data, {ex_mean_y = ave(y,id,FUN=excludeSelfSummary)} )
  id time y ex_mean_y
1  1    1 4       2.5
2  1    2 1       4.0
3  1    3 4       2.5
4  2    1 2       3.0
5  2    2 3       2.5
6  2    3 3       2.5
7  3    1 4       3.5
8  3    2 4       3.5
9  3    3 3       4.0

Of course, we could also compute the self-excluded minimum or maximum.

> within(my_data, {ex_min_y = ave(y,id,FUN=function(x) excludeSelfSummary(x,min) )})
  id time y ex_min_y
1  1    1 4        1
2  1    2 1        4
3  1    3 4        1
4  2    1 2        3
5  2    2 3        2
6  2    3 3        2
7  3    1 4        3
8  3    2 4        3
9  3    3 3        4
Categories: data cleaning

How to do egen (stata cmd) in R

February 12, 2013

Stata's egen command computes a summary statistic by group and stores it in a new variable. For example, suppose the data has three variables, id, time and y, and we want to compute the mean of y for each id and store it as a new variable, mean_y.

In Stata, the command would be

egen mean_y = mean(y), by(id)

In R, this task can be done with ave.

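Generate dataset:

id <- rep(1:3, each = 3)
time <- rep(1:3, 3)  # time rather than t, to avoid confusion with base::t()
y <- sample(1:5, 9, replace = TRUE)  # random draw, so your values may differ
my_data <- data.frame(id = id, time = time, y = y)

Original data:

> my_data
  id time y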
1  1    1 4
2  1    2 1
3  1    3 4
4  2    1 2
5  2    2 3
6  2    3 3
7  3    1 4
8  3    2 4
9  3    3 3
> within(my_data, {mean_y = ave(y,id)} )
  id time y   mean_y
1  1    1 4 3.000000
2  1    2 1 3.000000
3  1    3 4 3.000000
4  2    1 2 2.666667
5  2    2 3 2.666667
6  2    3 3 2.666667
7  3    1 4 3.666667
8  3    2 4 3.666667
9  3    3 3 3.666667

The default summary statistic is the mean, but we can pass a different function through the FUN argument. FUN must be passed by name, since unnamed arguments are treated as grouping variables. For example, to compute the standard deviation of y by id:

> within(my_data, {sd_y = ave(y,id,FUN=sd)} )
  id time y      sd_y
1  1    1 4 1.7320508
2  1    2 1 1.7320508
3  1    3 4 1.7320508
4  2    1 2 0.5773503
5  2    2 3 0.5773503
6  2    3 3 0.5773503
7  3    1 4 0.5773503
8  3    2 4 0.5773503
9  3    3 3 0.5773503

Remark: within evaluates an expression in an environment created from the data.frame and returns a modified copy of it (in our case, with the new variable mean_y or sd_y). The original data.frame is left unchanged unless the result is assigned back.
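The same ave pattern should carry over to other per-group statistics. As a rough sketch, assuming we want a group size and a group maximum next to each row (the column names n_y and max_y are made up for illustration, and the y values are the ones printed above):

```r
# Explicit y values (the ones printed in the post) so the result is deterministic.
my_data <- data.frame(id = rep(1:3, each = 3),
                      time = rep(1:3, 3),
                      y = c(4, 1, 4, 2, 3, 3, 4, 4, 3))

# Several new variables can be created in one within() call.
# Each FUN returns one value per group, which ave() repeats for every row.
my_data <- within(my_data, {
  n_y   <- ave(y, id, FUN = length)  # number of observations per id
  max_y <- ave(y, id, FUN = max)     # largest y per id
})
my_data
```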

Categories: data cleaning, stata