[R] dropping rows

Fri Dec 3 00:05:45 CET 2004

Douglas Bates <bates at stat.wisc.edu> wrote:
	In R this is called subsetting and the simplest way to do this
	is with the subset function.

	older <- subset(master, year < 1960)

I'm not sure that it's the "simplest".
Since rows for year < 1960 were to be dropped,
I'd say the _simplest_ way to do it is one which exploits
a primitive feature of R:

    master[master$year >= 1960,]
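
To make the comparison concrete, here is a minimal sketch of both forms side by side. The contents of 'master' are made up for illustration; only the names 'master' and 'year' come from the thread.

```r
# Hypothetical data; only the names 'master' and 'year' come from the thread.
master <- data.frame(id = 1:5, year = c(1955, 1960, 1972, 1948, 1981))

# Logical indexing: keep rows with year >= 1960, i.e. drop year < 1960.
recent <- master[master$year >= 1960, ]

# The subset() spelling of the same selection, for comparison.
recent2 <- subset(master, year >= 1960)

print(recent)
print(recent2)
```

With data like this, which has no missing years, both calls should select the same three rows.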

For me, the fact that the 'subset' argument of subset() is evaluated
in the scope of the data frame makes subset() quite a complicated way
to do things.  It's certainly something I'd hesitate to use inside a
function which might be given a data frame without knowing _exactly_
which column names were going to be in scope for the 2nd argument.
The fact that the 'subset' argument is *not* evaluated in the scope
of the 1st argument by other methods (subset.default, used for plain
vectors, treats it as an ordinary argument) also makes subset() a
somewhat confusing function, compared with simple logical indexing.
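
A small sketch of the kind of surprise I have in mind. The helper names and the 'cutoff' column are hypothetical, invented only to show a column name capturing what was meant to be a function argument.

```r
master <- data.frame(id = 1:4, year = c(1955, 1960, 1972, 1948))

# Two hypothetical helpers that look equivalent.
drop_before_subset <- function(df, cutoff) subset(df, year >= cutoff)
drop_before_index  <- function(df, cutoff) df[df$year >= cutoff, ]

# A data frame that happens to carry a column called 'cutoff'.
tricky <- cbind(master, cutoff = 2000)

# subset() looks in the data frame first, so it compares year with the
# column (2000), not with the argument (1960).
nrow(drop_before_subset(tricky, 1960))  # 0
nrow(drop_before_index(tricky, 1960))   # 2

# For a plain vector the second argument is evaluated in the usual way.
years <- c(1955, 1960, 1972)
subset(years, years >= 1960)
```

Both helpers agree on 'master' itself; they disagree only once the extra column is present, which is exactly what makes the problem hard to spot.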

Strengths of subset() include
 - you can select which columns you want, either instead of choosing
   a subset or at the same time (but you can do this with indexing too)
 - the drop= argument of subset() defaults to FALSE instead of TRUE
   (but this is not a problem for indexing data frames, where
    master[master$year == 1960,] will give you a data frame even if
    there is exactly one row with year 1960)
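
Both points can be checked with a short sketch. The data and the 'value' column are made up for the example.

```r
master <- data.frame(id = 1:4,
                     year = c(1955, 1960, 1972, 1948),
                     value = c(2.5, 3.1, 4.0, 1.7))

# Rows and columns chosen at the same time, both ways.
a <- subset(master, year >= 1960, select = c(id, value))
b <- master[master$year >= 1960, c("id", "value")]

# Exactly one matching row: indexing still returns a data frame.
one <- master[master$year == 1960, ]
is.data.frame(one)   # TRUE

# Where drop= does matter: selecting a single column.
v <- master[master$year >= 1960, "value"]                # plain numeric vector
d <- master[master$year >= 1960, "value", drop = FALSE]  # data frame
s <- subset(master, year >= 1960, select = value)        # data frame
```

So the drop= default only shows itself when a single column is selected, and drop = FALSE in the index gets the same result as subset().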

I would suggest that people who aren't yet thoroughly familiar with
what a simple "[" can do should add subset() to the list of things to
learn about _after_ they've finished learning about "[".  On second thoughts,
maybe looking at the implementation of subset.default and subset.data.frame
would be helpful in learning about "[".
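
As a rough guide to what to look for, here is a simplified sketch of the idea behind the data frame method. It is not the actual source: 'my_subset' is a hypothetical name and column selection is left out. Typing subset.data.frame at the prompt shows the real thing.

```r
# Simplified sketch only; 'my_subset' is a made-up name.
my_subset <- function(x, subset) {
  e <- substitute(subset)           # capture the condition unevaluated
  r <- eval(e, x, parent.frame())   # columns of x first, then the caller
  x[r & !is.na(r), , drop = FALSE]  # ordinary logical indexing does the rest
}

master <- data.frame(id = 1:4, year = c(1955, 1960, 1972, 1948))
kept <- my_subset(master, year >= 1960)
print(kept)
```

The last line of the function is the point: once the condition has been evaluated, everything else is "[".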