# Compute the self-excluded sample mean by group

In Stata, the `egen` command computes a summary statistic by group and stores it in a new variable. For example, suppose the data has three variables, `id`, `time` and `y`, and we want to compute the mean of `y` for each `id` and store it in a new variable `mean_y`.

In Stata, the command would be

    egen mean_y = mean(y), by(id)

In R, the same task can be done with `ave`.

Generate dataset:

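    id <- rep(1:3, each = 3)              # three groups, three rows each
    t <- rep(1:3, 3)                      # time periods 1..3 within each id
    y <- sample(1:5, 9, replace = TRUE)   # random draw; values differ between runs
    my_data <- data.frame(id = id, time = t, y = y)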

Original data:

    > my_data
      id time y
    1  1    1 4
    2  1    2 1
    3  1    3 4
    4  2    1 2
    5  2    2 3
    6  2    3 3
    7  3    1 4
    8  3    2 4
    9  3    3 3

    > within(my_data, { mean_y = ave(y, id) })
      id time y   mean_y
    1  1    1 4 3.000000
    2  1    2 1 3.000000
    3  1    3 4 3.000000
    4  2    1 2 2.666667
    5  2    2 3 2.666667
    6  2    3 3 2.666667
    7  3    1 4 3.666667
    8  3    2 4 3.666667
    9  3    3 3 3.666667
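
Because `y` is drawn with `sample()`, rerunning the code above will generally produce different values. As a minimal sketch for following along, the data frame can instead be built by hand from the `y` values shown in the printed table (no seed is given in the post, so the values are typed in directly here):

```r
# y values copied from the printed table above, not drawn with sample()
my_data <- data.frame(
  id   = rep(1:3, each = 3),
  time = rep(1:3, times = 3),
  y    = c(4, 1, 4, 2, 3, 3, 4, 4, 3)
)

# ave() returns a vector as long as y, with each group's mean repeated
my_data$mean_y <- ave(my_data$y, my_data$id)
print(my_data)
```

The `mean_y` column should match the table above: 3 for `id` 1, about 2.67 for `id` 2 and about 3.67 for `id` 3.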

The default summary function is `mean`, but any function can be supplied through the `FUN` argument. For example, to compute the standard deviation of `y` by `id`:

    > within(my_data, { sd_y = ave(y, id, FUN = sd) })
      id time y      sd_y
    1  1    1 4 1.7320508
    2  1    2 1 1.7320508
    3  1    3 4 1.7320508
    4  2    1 2 0.5773503
    5  2    2 3 0.5773503
    6  2    3 3 0.5773503
    7  3    1 4 0.5773503
    8  3    2 4 0.5773503
    9  3    3 3 0.5773503

Remark: `within` evaluates an expression in an environment created from the data frame and returns a modified copy of it (in our case, with the new variable `mean_y` or `sd_y`). The original `my_data` is unchanged unless the result is assigned back to it.
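
The behaviour of `within` can be checked with a small sketch. The toy data frame and the name `result` below are made up for illustration and are not part of the example data:

```r
# Toy data (hypothetical values, chosen so the group means are easy to check)
my_data <- data.frame(id = rep(1:2, each = 2), y = c(1, 3, 5, 9))

# within() hands back a copy that carries the new column
result <- within(my_data, { mean_y <- ave(y, id) })

"mean_y" %in% names(result)    # TRUE: the copy has the new column
"mean_y" %in% names(my_data)   # FALSE: the original is untouched

# To keep the column, assign the result back:
# my_data <- within(my_data, { mean_y <- ave(y, id) })
```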

Here is another use of `ave`: creating a self-excluded sample mean by group.

Suppose the data has three variables, `id`, `time` and `y`, and for each row we want the mean of `y` within its `id`, excluding the value of `y` in the current time period.

    id <- rep(1:3, each = 3)              # three groups, three rows each
    t <- rep(1:3, 3)                      # time periods 1..3 within each id
    y <- sample(1:5, 9, replace = TRUE)   # random draw; values differ between runs
    my_data <- data.frame(id = id, time = t, y = y)

Original data:

    > my_data
      id time y
    1  1    1 4
    2  1    2 1
    3  1    3 4
    4  2    1 2
    5  2    2 3
    6  2    3 3
    7  3    1 4
    8  3    2 4
    9  3    3 3

First, we need a function to compute the self-excluded mean. It takes a vector `x` and a function `FUN` (default `mean`) as arguments and applies `FUN` to the vector with one element removed at a time. The return value is a vector whose i-th element is `FUN(x[-i])`.

    excludeSelfSummary <- function(x, FUN = mean) {
        # for each position i, apply FUN to x with its i-th element dropped;
        # seq_along(x) also behaves correctly when x is empty
        sapply(seq_along(x), function(i) FUN(x[-i]))
    }
    > excludeSelfSummary(1:5, mean)
    [1] 3.50 3.25 3.00 2.75 2.50
    > excludeSelfSummary(1:5, min)
    [1] 2 1 1 1 1
    > excludeSelfSummary(1:5, max)
    [1] 5 5 5 5 4

Then we pass `excludeSelfSummary` to `ave` as the `FUN` argument.

    > within(my_data, { excl_mean_y = ave(y, id, FUN = excludeSelfSummary) })
      id time y excl_mean_y
    1  1    1 4         2.5
    2  1    2 1         4.0
    3  1    3 4         2.5
    4  2    1 2         3.0
    5  2    2 3         2.5
    6  2    3 3         2.5
    7  3    1 4         3.5
    8  3    2 4         3.5
    9  3    3 3         4.0
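
For the mean specifically, the same numbers can be obtained without looping over elements, since the self-excluded mean equals (group sum minus own value) divided by (group size minus 1). The sketch below is an alternative to the `sapply`-based function, not a replacement for it; the helper name `excludeSelfMean` and the column name `excl_mean_y` are made up for illustration, and the `y` values are typed in from the printed table:

```r
# Closed form: (sum of the group - own value) / (group size - 1)
# Hypothetical helper; only valid for the mean, not for min, max, sd, etc.
excludeSelfMean <- function(x) (sum(x) - x) / (length(x) - 1)

my_data <- data.frame(
  id   = rep(1:3, each = 3),
  time = rep(1:3, times = 3),
  y    = c(4, 1, 4, 2, 3, 3, 4, 4, 3)   # copied from the printed table
)

# ave() applies the helper within each id and keeps the original row order
my_data$excl_mean_y <- ave(my_data$y, my_data$id, FUN = excludeSelfMean)
print(my_data)
```

This avoids calling `mean` once per row, which may matter for large groups. It assumes `y` has no missing values.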

Of course, we could also compute the self-excluded minimum or maximum in the same way.

    > within(my_data, { excl_min_y = ave(y, id, FUN = function(x) excludeSelfSummary(x, min)) })
      id time y excl_min_y
    1  1    1 4          1
    2  1    2 1          4
    3  1    3 4          1
    4  2    1 2          3
    5  2    2 3          2
    6  2    3 3          2
    7  3    1 4          3
    8  3    2 4          3
    9  3    3 3          4
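
One edge case worth keeping in mind: when a group has a single observation, removing it leaves an empty vector, so `mean` returns `NaN` and `min` returns `Inf` with a warning. The following is a sketch of one possible guard, under the assumption that `NA` is the desired result for such groups; the name `excludeSelfSummarySafe` and the small vectors are hypothetical:

```r
# Hypothetical variant: returns NA for groups with fewer than two observations
excludeSelfSummarySafe <- function(x, FUN = mean) {
  if (length(x) < 2) return(rep(NA_real_, length(x)))
  # vapply() checks that FUN yields exactly one number per left-out element
  vapply(seq_along(x), function(i) as.numeric(FUN(x[-i])), numeric(1))
}

y  <- c(4, 1, 4, 7)
id <- c(1, 1, 1, 2)   # id 2 has only one observation

out <- ave(y, id, FUN = function(x) excludeSelfSummarySafe(x, min))
print(out)
```

Here the first three elements are the self-excluded minima for `id` 1, and the last element is `NA` because `id` 2 has nothing left once its only value is removed.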

TszKin Julian Chan