[R] Sweep out control

Thaler,Thorn,LAUSANNE,Applied Mathematics Thorn.Thaler at rdls.nestle.com
Mon Dec 10 16:29:23 CET 2012


Dear all,

Assume that I have the following data structure:

d <- expand.grid(subj=1:5, time=1:3, treatment=LETTERS[1:3])
d$value <- 10 ^ (as.numeric(d$treatment) + 1) + 10 * d$subj + d$time
d$value2 <- 100000 + d$value

where d$treatment == "C" stands for my control group. What I want to achieve now is to subtract, for each subject and time point, the control value (d$treatment == "C") from the values of all treatments in order to get the difference between each treatment and the control. If I do that by hand, it will look like:

va <- rep(d$value[d$treatment == "C"], 3) # don't need to rep because R would do the recycling for me anyways
d$value - va
va2 <- rep(d$value2[d$treatment == "C"], 3)
d$value2 - va2

This works only because the data frame is sorted in the right way and all cases are present. Furthermore, it gets tedious if you want to do that for more than a couple of columns, and it is error-prone and does not scale: what if somebody changes the order of the data frame beforehand, or somebody assumes afterwards that the data frame is in a certain order? If I want to add some columns later, I have to add new lines. What if some cases are missing? Thus, this approach is clearly not a good one, especially since I don't like solutions which depend on a certain order.
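
The order dependence can be made concrete with a small sketch. The reordering by subj and time is only an assumed example of what somebody might do to the data frame beforehand; the rest is the by-hand code from above. The check uses the fact that control rows minus their own control values should give 0.

```r
d <- expand.grid(subj = 1:5, time = 1:3, treatment = LETTERS[1:3])
d$value <- 10 ^ (as.numeric(d$treatment) + 1) + 10 * d$subj + d$time

# By-hand approach on the original row order
ok <- d$value - rep(d$value[d$treatment == "C"], 3)
ok_is_zero <- all(ok[d$treatment == "C"] == 0)    # control minus control is 0

# Same code after an (assumed) reordering by subject and time
r <- d[order(d$subj, d$time, d$treatment), ]
bad <- r$value - rep(r$value[r$treatment == "C"], 3)
bad_is_zero <- all(bad[r$treatment == "C"] == 0)  # no longer 0 everywhere

c(original = ok_is_zero, reordered = bad_is_zero)
```

After the reordering, the control values are silently paired with the wrong rows, and R raises neither an error nor a warning.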

So my questions:
1. Is there a ready-made solution for that?
2. If not (which is what I assume), what would be an elegant way of solving this? Is sorting the data the only way? Not that I have any problem with sorting, but I would appreciate a solution which works without it, because I don't want to risk issues downstream with people who assume a certain order in the data. That is of course a no-go anyway, but I assume that finding a solution which does not alter the order takes less time than educating these guys (not in the long run though, but this battle has to be fought later) ;)
3. The solution should be easily extendable to an arbitrary set of columns and should work with missing cases for the treatments, like d <- d[-c(2, 21), ]
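
To make these requirements concrete, here is a minimal sketch of one possible approach based on match(). It is not claimed to be the ready-made or the most elegant solution. The helper subtract_control(), its arguments and the "_diff" column suffix are hypothetical names introduced only for this illustration, and it assumes that subj and time together identify the control row belonging to each observation.

```r
d <- expand.grid(subj = 1:5, time = 1:3, treatment = LETTERS[1:3])
d$value  <- 10 ^ (as.numeric(d$treatment) + 1) + 10 * d$subj + d$time
d$value2 <- 100000 + d$value
d <- d[-c(2, 21), ]               # two missing (non-control) cases
d <- d[order(d$subj, d$time), ]   # some arbitrary row order; never changed below

# Hypothetical helper: subtract the matching control value from each column in
# `cols`, pairing rows by the key columns instead of by position.
subtract_control <- function(data, cols, keys = c("subj", "time"),
                             group = "treatment", control = "C") {
  ctrl <- data[data[[group]] == control, , drop = FALSE]
  # One key string per row, e.g. "2_3" for subj 2 at time 3
  key_all  <- do.call(paste, c(data[keys], sep = "_"))
  key_ctrl <- do.call(paste, c(ctrl[keys], sep = "_"))
  idx <- match(key_all, key_ctrl)  # NA where no control row exists
  for (col in cols) {
    data[[paste0(col, "_diff")]] <- data[[col]] - ctrl[[col]][idx]
  }
  data
}

res <- subtract_control(d, cols = c("value", "value2"))
head(res)
```

The row order of the input is left untouched, further columns only need to be added to cols, and observations without a matching control row get NA instead of a silently wrong difference.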

Thanks for your input, I am looking forward to your suggestions.


Kind Regards,

Thorn Thaler
Mathematician

Applied Mathematics 
Nestec Ltd,
Nestlé Research Center
PO Box 44 
CH-1000 Lausanne 26
Phone: +41 21 785 8220
Fax: +41 21 785 9486


