[R] dataframe subsetting behaviour
Douglas Grove
dgrove at fhcrc.org
Thu Jan 23 00:29:03 CET 2003
> Douglas Grove <dgrove at fhcrc.org> writes:
>
> > Hi,
> >
> > I'm trying to understand a behaviour that I have encountered
> > and can't fathom.
> >
> >
> > Here's some code I will use to illustrate the behaviour:
> >
> > # start with some data frame "a" having some named columns
> > a <- data.frame(a=rep(1,3),c=rep(2,3),d=rep(3,3),e=rep(4,3))
> >
> > # create a subset of the original data frame, but include a
> > # name "b" that is not present in my original data frame
> > b <- a[,c("a","b","c")]
> >
> >
> > ## Up until now no errors are issued, but the following commands
> > ## will give the error shown:
> >
> > b[1,] ## "Error in x[[j]] : subscript out of bounds"
> > b[1,2] ## "Error in "names<-.default"(*tmp*, value = cols) :
> > ## names attribute must be the same length as the vector"
> >
> >
> > Can anyone explain to me the meaning of these error messages in terms
> > of what R is actually doing? These error messages had me baffled and
> > it took me hours to track down that the source of the error was an
> > incorrect column name in my data frame subsetting.
>
> Looks like a (semi-)bug. Indexing outside of the data frame creates a
> "column" which is really the single value NULL, e.g.
>
> > dput(a[,4:5])
> structure(list(e = c(4, 4, 4), "NA" = NULL), .Names = c("e",
> NA), row.names = c("1", "2", "3"), class = "data.frame")
>
> This will print because the format.data.frame called inside
> print.data.frame will recycle the NULL and give you
>
> > a[,4:5]
>   e   NA
> 1 4 NULL
> 2 4 NULL
> 3 4 NULL
>
> However, it confuses the h*ck out of "[.data.frame"
>
> > (a[,4:5])[2]
> Error in "[.data.frame"((a[, 4:5]), 2) : undefined columns selected
> > (a[,4:5])[,2]
> NULL
> > (a[,4:5])[,1]
> [1] 4 4 4
>
> and also the examples you found. However, the main issue is that you
> have managed to construct a corrupt data frame. So indexing outside
> the array should probably either give an error or return a column of
> NA.
Yes, it would be nice if trying to index outside the data frame generated
an error. That is what happens in Splus (at least in the version I have
access to: 6.0 Release 1 for Linux 2.2.12).
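
Until indexing behaves that way, one possible workaround is to check the
requested column names before subsetting, so that a mistyped name fails
at once with a readable message instead of producing a corrupt data frame.
This is only a minimal sketch: select_cols is a hypothetical helper name,
not an existing R function, and the wording of its error message is an
arbitrary choice.

```r
# Hypothetical helper (not part of R): subset a data frame by column name,
# stopping with a clear message if any requested name is absent.
select_cols <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("unknown column(s): ", paste(missing_cols, collapse = ", "))
  }
  df[, cols, drop = FALSE]
}

# Same data frame as in the original example
a <- data.frame(a = rep(1, 3), c = rep(2, 3), d = rep(3, 3), e = rep(4, 3))

# Valid names: returns a data frame with columns "a" and "c"
ok <- select_cols(a, c("a", "c"))

# Invalid name "b": the error is raised at the subsetting step;
# tryCatch is used here only to capture the message text
bad <- tryCatch(select_cols(a, c("a", "b", "c")),
                error = function(e) conditionMessage(e))
```

With a check like this, the mistake shows up where the bad name is used
rather than later in b[1,] or b[1,2].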
>
> --
> O__ ---- Peter Dalgaard Blegdamsvej 3
> c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
> (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
>