[R] Corrupt data frame construction - bug?

Steven McKinney smckinney at bccrc.ca
Thu Apr 30 03:21:03 CEST 2009


Thanks Duncan,

Comments and a proposed bug fix in-line below:
 

> -----Original Message-----
> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
> Sent: Wednesday, April 29, 2009 5:10 PM
> To: Steven McKinney
> Cc: R-help at r-project.org
> Subject: Re: [R] Corrupt data frame construction - bug?
> 
> On 29/04/2009 6:41 PM, Steven McKinney wrote:
> > Hi useRs,
> >
> > A recent coding infelicity along these lines yielded a corrupt data
> > frame.
> >
> > foo <- matrix(1:12, nrow = 3)
> > bar <- data.frame(foo)
> > bar$NewCol <- foo[foo[, 1] == 4, 4]
> > bar
> > lapply(bar, length)
> >
> >
> >
> >
> >> foo <- matrix(1:12, nrow = 3)
> >> bar <- data.frame(foo)
> >> bar$NewCol <- foo[foo[, 1] == 4, 4]
> >> bar
> >   X1 X2 X3 X4 NewCol
> > 1  1  4  7 10   <NA>
> > 2  2  5  8 11   <NA>
> > 3  3  6  9 12   <NA>
> > Warning message:
> > In format.data.frame(x, digits = digits, na.encode = FALSE) :
> >   corrupt data frame: columns will be truncated or padded with NAs
> >> lapply(bar, length)
> > $X1
> > [1] 3
> >
> > $X2
> > [1] 3
> >
> > $X3
> > [1] 3
> >
> > $X4
> > [1] 3
> >
> > $NewCol
> > [1] 0
> >
> >
> > Is this a bug in the data.frame machinery?
> > If an attempt is made to add a new column to a data frame, and the
> > new object does not have length = number of rows of data frame, or
> > cannot be made to have such length via recycling, shouldn't an error
> > be thrown?
> >
> > Instead in this example I end up with a "corrupt data frame" having
> > one zero-length column.
> >
> >
> > Should this be reported as a bug, or did I misinterpret the
> > documentation?
> 
> I don't think "$" uses any data.frame machinery.  You are working at a
> lower level.
> 
> If you had added the new column using
> 
> bar <- data.frame(bar, NewCol=foo[foo[, 1] == 4, 4])
> 
> you would have seen the error:
> 
> Error in data.frame(bar, NewCol = foo[foo[, 1] == 4, 4]) :
>    arguments imply differing number of rows: 3, 0
> 
> But since you treated it as a list, it let you go ahead and create
> something that was labelled as a data.frame but wasn't.  This is one
> of the reasons some people prefer S4 methods:  it's easier to protect
> against people who mislabel things.
> 
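
A minimal sketch of the distinction drawn above, assuming nothing beyond
base R.  The object `bad` is built by hand at the list level (an
illustrative construction, not taken from the original post), so it does
not depend on how any particular R version implements `$<-` for data
frames.  It shows that a list-level assignment can produce an object
labelled "data.frame" whose columns disagree in length, while
data.frame() refuses the same input.

```r
bar <- data.frame(matrix(1:12, nrow = 3))

# Treat the data frame as a plain list, add a zero-length element,
# then put the class label back on.
bad <- unclass(bar)
bad$NewCol <- integer(0)
class(bad) <- "data.frame"

# Compare each column's length with the row count stored in row.names.
col_lengths <- vapply(unclass(bad), length, integer(1))
n_rows      <- .row_names_info(bad, 2L)
mismatch    <- names(col_lengths)[col_lengths != n_rows]

# The constructor checks the lengths and signals an error instead.
msg <- tryCatch(
  {
    data.frame(bar, NewCol = integer(0))
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
```

Here `mismatch` names the offending column and `msg` holds the error
message raised by data.frame().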

I did some more digging on '$' - there is a data.frame method for it:

> getAnywhere("$<-.data.frame" )
A single object matching '$<-.data.frame' was found
It was found in the following places
  package:base
  registered S3 method for $<- from namespace base
  namespace:base
with value

function (x, i, value) 
{
    cl <- oldClass(x)
    class(x) <- NULL
    nrows <- .row_names_info(x, 2L)
    if (!is.null(value)) {
        N <- NROW(value)
        if (N > nrows) 
            stop(gettextf("replacement has %d rows, data has %d", 
                N, nrows), domain = NA)
        if (N < nrows && N > 0L) 
            if (nrows%%N == 0L && length(dim(value)) <= 1L) 
                value <- rep(value, length.out = nrows)
            else stop(gettextf("replacement has %d rows, data has %d", 
                N, nrows), domain = NA)
        if (is.atomic(value)) 
            names(value) <- NULL
    }
    x[[i]] <- value
    class(x) <- cl
    return(x)
}<environment: namespace:base>
>


I placed a browser() command before return(x) and did some poking
around.

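It seems to me there's a bug in this function.  A zero-length value
gives N = 0, so the test (N < nrows && N > 0L) is FALSE, neither the
recycling branch nor the stop() branch runs, and the zero-length value
is assigned as is.  The function should detect this case and throw an
error, as you point out data.frame() does.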


I modified the lines
        if (N < nrows && N > 0L) 
            if (nrows%%N == 0L && length(dim(value)) <= 1L)
to read
        if (N < nrows) 
            if (N > 0L && nrows%%N == 0L && length(dim(value)) <= 1L)

as in

"$<-.data.frame" <-
function (x, i, value) 
{
    cl <- oldClass(x)
    class(x) <- NULL
    nrows <- .row_names_info(x, 2L)
    if (!is.null(value)) {
        N <- NROW(value)
        if (N > nrows) 
            stop(gettextf("replacement has %d rows, data has %d", 
                N, nrows), domain = NA)
        if (N < nrows) 
            if (N > 0L && nrows%%N == 0L && length(dim(value)) <= 1L) 
                value <- rep(value, length.out = nrows)
            else stop(gettextf("replacement has %d rows, data has %d", 
                N, nrows), domain = NA)
        if (is.atomic(value)) 
            names(value) <- NULL
    }
    x[[i]] <- value
    class(x) <- cl
    return(x)
} 

Now it detects the problem I created, in the fashion you demonstrated
above for the replacement using data.frame().

> foo <- matrix(1:12, nrow = 3)
> bar <- data.frame(foo)
> bar$NewCol <- foo[foo[, 1] == 4, 4]
Error in `$<-.data.frame`(`*tmp*`, "NewCol", value = integer(0)) : 
  replacement has 0 rows, data has 3
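
The effect of moving the N > 0L test can be seen by isolating the branch
logic from the two versions of the method.  This is only a sketch: the
helpers `original_branch` and `patched_branch` and the `ndim` argument
(standing in for length(dim(value))) are hypothetical names introduced
here, and they report which branch would be taken rather than modifying
any data frame.

```r
# Hypothetical helper: branch taken by the ORIGINAL length check.
original_branch <- function(N, nrows, ndim = 0L) {
  if (N > nrows) return("error")
  if (N < nrows && N > 0L) {
    if (nrows %% N == 0L && ndim <= 1L) return("recycle")
    return("error")
  }
  "assign as is"
}

# Hypothetical helper: branch taken by the MODIFIED length check.
patched_branch <- function(N, nrows, ndim = 0L) {
  if (N > nrows) return("error")
  if (N < nrows) {
    # N > 0L is tested first, so nrows %% N is never evaluated for N == 0.
    if (N > 0L && nrows %% N == 0L && ndim <= 1L) return("recycle")
    return("error")
  }
  "assign as is"
}

# Replacement lengths N against data frames with nrows rows.
cases <- data.frame(N     = c(0L, 0L, 1L, 2L, 3L, 4L),
                    nrows = c(3L, 0L, 3L, 3L, 3L, 3L))
cases$original <- mapply(original_branch, cases$N, cases$nrows)
cases$patched  <- mapply(patched_branch,  cases$N, cases$nrows)
print(cases)
```

The two versions differ only in the first case (N = 0, nrows = 3), which
the original assigns as is and the modified version rejects.  A
zero-length replacement into a zero-row data frame is still accepted by
both.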

The modified method also appears to behave sensibly on degenerate data
frames (these are from the ?data.frame help page):


> L3 <- LETTERS[1:3]
> (d <- data.frame(cbind(x=1, y=1:10), fac=sample(L3, 10, replace=TRUE)))
> (d0  <- d[, FALSE]) # NULL data frame with 10 rows
 
> (d.0 <- d[FALSE, ]) # <0 rows> data frame  (3 cols)

> (d00 <- d0[FALSE,])  # NULL data frame with 0 rows
 
> d0$NewCol <- foo[foo[, 1] == 4, 4]
Error in `$<-.data.frame`(`*tmp*`, "NewCol", value = integer(0)) : 
  replacement has 0 rows, data has 10

### Catches this problem above alright.

> d.0$NewCol <- foo[foo[, 1] == 4, 4]
> d.0
[1] x      y      fac    NewCol
<0 rows> (or 0-length row.names)

### Lets the above one through alright.

> d00$NewCol <- foo[foo[, 1] == 4, 4]
> 
> d00
[1] NewCol
<0 rows> (or 0-length row.names)
### Lets the above one through alright.


Would the above modification work to fix this problem?





> Duncan Murdoch
> 
> >
> >
> >
> >
> >> sessionInfo()
> > R version 2.9.0 (2009-04-17)
> > powerpc-apple-darwin8.11.1
> >
> > locale:
> > en_CA.UTF-8/en_CA.UTF-8/C/C/en_CA.UTF-8/en_CA.UTF-8
> >
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
> >
> > other attached packages:
> > [1] nlme_3.1-90
> >
> > loaded via a namespace (and not attached):
> > [1] grid_2.9.0      lattice_0.17-22 tools_2.9.0
> >
> >
> > Also occurs on Windows box with R 2.8.1
> >
> >
> >
> > Steven McKinney
> >
> > Statistician
> > Molecular Oncology and Breast Cancer Program British Columbia Cancer
> > Research Centre
> >
> > email: smckinney +at+ bccrc +dot+ ca
> >
> > tel: 604-675-8000 x7561
> >
> > BCCRC
> > Molecular Oncology
> > 675 West 10th Ave, Floor 4
> > Vancouver B.C.
> > V5Z 1L3
> > Canada
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.



