[R] data management question

Fri May 9 09:01:31 CEST 2008

> I want to draw a subset of "ex" by selecting only the A and B units:
>
> > ex1 <- subset(ex[which(ex$id=="A"|ex$id=="B"),])

or a bit simpler:

ex1 <- subset(ex, id %in% c('A','B'))

In your expression you don't need the subset function, as you are already
using indexing to extract the desired rows. Furthermore, there is no
need to use which() because R will happily use a logical vector for
indexing. Finally, I prefer the solution using %in% because it scales
nicely to longer lists of values, where chaining '|' becomes cumbersome.
So another way to put it would have been:

ex1 <- ex[ex$id %in% c('A','B'), ]
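
To check that the three variants select the same rows, something like
the following sketch could be used. The data frame is only a guess at
what "ex" looks like: the A and B rows match the str() output in this
thread, while the values for unit C are made up for illustration.

```r
# Hypothetical stand-in for "ex"; the rows for C are invented.
ex <- data.frame(
  id   = factor(c("A", "B", "C", "A", "B", "C")),
  year = c(1970, 1970, 1970, 1980, 1980, 1980),
  x    = c(20, 30, 40, 25, 35, 45)
)

# Logical vector with one entry per row: TRUE where id is A or B
keep <- ex$id %in% c("A", "B")

r1 <- ex[which(ex$id == "A" | ex$id == "B"), ]  # indexing via which()
r2 <- subset(ex, id %in% c("A", "B"))           # subset() with %in%
r3 <- ex[keep, ]                                # plain logical indexing

# Compare the results pairwise
all.equal(r1, r2)
all.equal(r2, r3)
```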

> > tapply(ex1$x, ex1$id, mean)
>   A    B    C
> 22.5 32.5   NA
>
> But this gives me an NA value for the unit C, which I thought I had  
> already left out.

id is a factor, and subsetting does not alter its set of levels, even
when no cases of a given level remain:

> str(ex1)
'data.frame':   4 obs. of  3 variables:
 $ id  : Factor w/ 3 levels "A","B","C": 1 2 1 2
 $ year: num  1970 1970 1980 1980
 $ x   : num  20 30 25 35
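
The leftover level can also be seen directly with levels() and table().
A small sketch, again assuming a made-up "ex" in which only the A and B
rows are taken from this thread:

```r
# Hypothetical stand-in for "ex"; the rows for C are invented.
ex <- data.frame(
  id   = factor(c("A", "B", "C", "A", "B", "C")),
  year = c(1970, 1970, 1970, 1980, 1980, 1980),
  x    = c(20, 30, 40, 25, 35, 45)
)

ex1 <- ex[ex$id %in% c("A", "B"), ]

levels(ex1$id)  # still "A" "B" "C", although no C rows are left
table(ex1$id)   # C is listed with a count of 0
```

tapply() produces one cell per level, so the empty level C shows up as
NA in the result.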

If you want to get rid of the unused levels you can "re-build" the
factor like this:

> ex1$id <- factor(ex1$id)
> str(ex1)
'data.frame':   4 obs. of  3 variables:
 $ id  : Factor w/ 2 levels "A","B": 1 2 1 2
 $ year: num  1970 1970 1980 1980
 $ x   : num  20 30 25 35

> tapply(ex1$x, ex1$id, mean)
   A    B
22.5 32.5
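
More recent versions of R also provide droplevels(), which removes the
unused levels from every factor in a data frame in one step. Whether it
is available depends on your R version, so treat this as an alternative
sketch rather than a replacement for the factor() call above. As
before, the data frame is a made-up stand-in in which the C rows are
invented.

```r
# Hypothetical stand-in for "ex"; the rows for C are invented.
ex <- data.frame(
  id   = factor(c("A", "B", "C", "A", "B", "C")),
  year = c(1970, 1970, 1970, 1980, 1980, 1980),
  x    = c(20, 30, 40, 25, 35, 45)
)

# Subset and drop unused factor levels in one go
ex1 <- droplevels(ex[ex$id %in% c("A", "B"), ])

levels(ex1$id)                    # only "A" and "B" remain
m <- tapply(ex1$x, ex1$id, mean)  # group means, no cell for C
m
```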

cu
	Philipp

-- 
Dr. Philipp Pagel
Lehrstuhl für Genomorientierte Bioinformatik
Technische Universität München
Wissenschaftszentrum Weihenstephan
85350 Freising, Germany

 and

Institut für Bioinformatik und Systembiologie / MIPS
Helmholtz Zentrum München -
Deutsches Forschungszentrum für Gesundheit und Umwelt
Ingolstädter Landstrasse 1
85764 Neuherberg, Germany
http://mips.gsf.de/staff/pagel