[R] Corrupt data frame construction - bug?
Steven McKinney
smckinney at bccrc.ca
Thu Apr 30 03:21:03 CEST 2009
Thanks Duncan,
Comments and a proposed bug fix in-line below:
> -----Original Message-----
> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
> Sent: Wednesday, April 29, 2009 5:10 PM
> To: Steven McKinney
> Cc: R-help at r-project.org
> Subject: Re: [R] Corrupt data frame construction - bug?
>
> On 29/04/2009 6:41 PM, Steven McKinney wrote:
> > Hi useRs,
> >
> > A recent coding infelicity along these lines yielded a corrupt data
> > frame.
> >
> > foo <- matrix(1:12, nrow = 3)
> > bar <- data.frame(foo)
> > bar$NewCol <- foo[foo[, 1] == 4, 4]
> > bar
> > lapply(bar, length)
> >
> >
> >
> >
> >> foo <- matrix(1:12, nrow = 3)
> >> bar <- data.frame(foo)
> >> bar$NewCol <- foo[foo[, 1] == 4, 4]
> >> bar
> > X1 X2 X3 X4 NewCol
> > 1 1 4 7 10 <NA>
> > 2 2 5 8 11 <NA>
> > 3 3 6 9 12 <NA>
> > Warning message:
> > In format.data.frame(x, digits = digits, na.encode = FALSE) :
> > corrupt data frame: columns will be truncated or padded with NAs
> >> lapply(bar, length)
> > $X1
> > [1] 3
> >
> > $X2
> > [1] 3
> >
> > $X3
> > [1] 3
> >
> > $X4
> > [1] 3
> >
> > $NewCol
> > [1] 0
> >
> >
> > Is this a bug in the data.frame machinery?
> > If an attempt is made to add a new column to a data frame, and the
> > new object does not have length = number of rows of data frame, or
> > cannot be made to have such length via recycling, shouldn't an error
> > be thrown?
> >
> > Instead in this example I end up with a "corrupt data frame" having
> > one zero-length column.
> >
> >
> > Should this be reported as a bug, or did I misinterpret the
> > documentation?
>
> I don't think "$" uses any data.frame machinery. You are working at a
> lower level.
>
> If you had added the new column using
>
> bar <- data.frame(bar, NewCol=foo[foo[, 1] == 4, 4])
>
> you would have seen the error:
>
> Error in data.frame(bar, NewCol = foo[foo[, 1] == 4, 4]) :
> arguments imply differing number of rows: 3, 0
>
> But since you treated it as a list, it let you go ahead and create
> something that was labelled as a data.frame but wasn't. This is one
> of the reasons some people prefer S4 methods: it's easier to protect
> against people who mislabel things.
>
I did some more digging on '$' - there is a data.frame method for it:
> getAnywhere("$<-.data.frame" )
A single object matching '$<-.data.frame' was found
It was found in the following places
package:base
registered S3 method for $<- from namespace base
namespace:base
with value
function (x, i, value)
{
    cl <- oldClass(x)
    class(x) <- NULL
    nrows <- .row_names_info(x, 2L)
    if (!is.null(value)) {
        N <- NROW(value)
        if (N > nrows)
            stop(gettextf("replacement has %d rows, data has %d",
                N, nrows), domain = NA)
        if (N < nrows && N > 0L)
            if (nrows%%N == 0L && length(dim(value)) <= 1L)
                value <- rep(value, length.out = nrows)
            else stop(gettextf("replacement has %d rows, data has %d",
                N, nrows), domain = NA)
        if (is.atomic(value))
            names(value) <- NULL
    }
    x[[i]] <- value
    class(x) <- cl
    return(x)
}
<environment: namespace:base>
>
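I placed a browser() call before return(x) and did some poking around.
It seems to me there's a bug in this function. When the replacement has
zero rows (N == 0), neither check fires: N > nrows is false, and the
N > 0L condition skips the recycle-or-stop branch, so the zero-length
value is assigned without complaint. The function should detect this
case and throw an error, as data.frame() does in your example.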
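The fall-through can be traced by hand. This is only a sketch that re-evaluates the two length conditions from the function listed above with the values from my example; the variable names are mine, not part of base R.

```r
# Values taken from the example: a 3-row data frame, zero-length replacement.
nrows <- 3L
value <- integer(0)
N <- NROW(value)                                # 0

# First check in the function: is the replacement longer than the data frame?
too_many_rows <- N > nrows                      # FALSE, so no error here

# Second check: the recycle-or-stop branch is entered only when 0 < N < nrows.
enters_recycle_or_stop <- N < nrows && N > 0L   # FALSE, so the branch is skipped

# Neither branch fires, so execution reaches x[[i]] <- value and the
# zero-length vector is stored as a column.
falls_through <- !too_many_rows && !enters_recycle_or_stop
falls_through
```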
I modified the lines
    if (N < nrows && N > 0L)
        if (nrows%%N == 0L && length(dim(value)) <= 1L)
to read
    if (N < nrows)
        if (N > 0L && nrows%%N == 0L && length(dim(value)) <= 1L)
as in
"$<-.data.frame" <-
function (x, i, value)
{
    cl <- oldClass(x)
    class(x) <- NULL
    nrows <- .row_names_info(x, 2L)
    if (!is.null(value)) {
        N <- NROW(value)
        if (N > nrows)
            stop(gettextf("replacement has %d rows, data has %d",
                N, nrows), domain = NA)
        if (N < nrows)
            if (N > 0L && nrows%%N == 0L && length(dim(value)) <= 1L)
                value <- rep(value, length.out = nrows)
            else stop(gettextf("replacement has %d rows, data has %d",
                N, nrows), domain = NA)
        if (is.atomic(value))
            names(value) <- NULL
    }
    x[[i]] <- value
    class(x) <- cl
    return(x)
}
Now it detects the problem I created, in the fashion you demonstrated
above for the replacement using data.frame().
> foo <- matrix(1:12, nrow = 3)
> bar <- data.frame(foo)
> bar$NewCol <- foo[foo[, 1] == 4, 4]
Error in `$<-.data.frame`(`*tmp*`, "NewCol", value = integer(0)) :
replacement has 0 rows, data has 3
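To check that ordinary recycling is left alone, the length checks of the modified function can be read as a three-way decision. The helper below is hypothetical (replacement_action is my name, not a base R function) and mirrors only the N/nrows logic, not the assignment itself.

```r
# Hypothetical helper mirroring only the length checks of the modified function.
# N = NROW(value), nrows = rows in the data frame, ndims = length(dim(value)).
replacement_action <- function(N, nrows, ndims = 0L) {
    if (N > nrows) return("error")
    if (N < nrows) {
        # N > 0L is tested first, so nrows %% N is never evaluated for N == 0.
        if (N > 0L && nrows %% N == 0L && ndims <= 1L) return("recycle")
        return("error")
    }
    "assign as is"
}

# A few replacement lengths against a 3-row frame, plus the 0-into-0 case.
cases <- data.frame(N     = c(0L, 1L, 2L, 3L, 4L, 0L),
                    nrows = c(3L, 3L, 3L, 3L, 3L, 0L))
cases$action <- mapply(replacement_action, cases$N, cases$nrows)
cases
```

Under these checks a length-1 replacement is still recycled, a full-length one is assigned unchanged, and zero rows into a zero-row frame is still allowed; only the zero-into-nonzero case changes from silent assignment to an error.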
It doesn't appear to stumble on unusual data frames (these are from the
?data.frame help page):
> L3 <- LETTERS[1:3]
> (d <- data.frame(cbind(x=1, y=1:10), fac=sample(L3, 10, replace=TRUE)))
> (d0 <- d[, FALSE]) # NULL data frame with 10 rows
> (d.0 <- d[FALSE, ]) # <0 rows> data frame (3 cols)
> (d00 <- d0[FALSE,]) # NULL data frame with 0 rows
> d0$NewCol <- foo[foo[, 1] == 4, 4]
Error in `$<-.data.frame`(`*tmp*`, "NewCol", value = integer(0)) :
replacement has 0 rows, data has 10
### Correctly catches the problem above (0 rows into a 10-row data frame).
> d.0$NewCol <- foo[foo[, 1] == 4, 4]
> d.0
[1] x y fac NewCol
<0 rows> (or 0-length row.names)
### Correctly lets the above through (0 rows into a 0-row data frame).
> d00$NewCol <- foo[foo[, 1] == 4, 4]
>
> d00
[1] NewCol
<0 rows> (or 0-length row.names)
### Correctly lets the above through as well (0 rows into a 0-row data frame).
Would the above modification work to fix this problem?
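Independently of whether the change is adopted, a data frame that has already been corrupted this way can be flagged with a small check along the lines of the lapply(bar, length) call in my first message. This is a sketch only; has_consistent_columns is a hypothetical name, and the corrupt frame is built by hand so the example does not depend on how `$<-` behaves in the running version of R.

```r
# Hypothetical helper: TRUE when every column has as many rows as the
# row.names attribute claims, FALSE for a frame like the corrupt `bar`.
has_consistent_columns <- function(df) {
    nrows <- .row_names_info(df, 2L)
    col_rows <- vapply(unclass(df), NROW, integer(1))
    all(col_rows == nrows)
}

# A well-formed frame and a hand-built corrupt one (3 rows claimed,
# but NewCol has length 0).
good <- data.frame(X1 = 1:3, X2 = 4:6)
corrupt <- structure(list(X1 = 1:3, NewCol = integer(0)),
                     class = "data.frame", row.names = c(NA, -3L))

has_consistent_columns(good)
has_consistent_columns(corrupt)
```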
> Duncan Murdoch
>
> >
> >
> >
> >
> >> sessionInfo()
> > R version 2.9.0 (2009-04-17)
> > powerpc-apple-darwin8.11.1
> >
> > locale:
> > en_CA.UTF-8/en_CA.UTF-8/C/C/en_CA.UTF-8/en_CA.UTF-8
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > other attached packages:
> > [1] nlme_3.1-90
> >
> > loaded via a namespace (and not attached):
> > [1] grid_2.9.0 lattice_0.17-22 tools_2.9.0
> >
> >
> > Also occurs on Windows box with R 2.8.1
> >
> >
> >
> > Steven McKinney
> >
> > Statistician
> > Molecular Oncology and Breast Cancer Program British Columbia Cancer
> > Research Centre
> >
> > email: smckinney +at+ bccrc +dot+ ca
> >
> > tel: 604-675-8000 x7561
> >
> > BCCRC
> > Molecular Oncology
> > 675 West 10th Ave, Floor 4
> > Vancouver B.C.
> > V5Z 1L3
> > Canada
> >