[R] Corrupt data frame construction - bug?
Duncan Murdoch
murdoch at stats.uwo.ca
Thu Apr 30 09:53:26 CEST 2009
On 29/04/2009 9:21 PM, Steven McKinney wrote:
> Thanks Duncan,
>
> Comments and a proposed bug fix in-line below:
Thanks; sorry for the misinformation about the $ method.
I'm not going to have time today to look at the patch, but will check it
out tomorrow, unless someone else gets there first.
Duncan Murdoch
>
>
>> -----Original Message-----
>> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
>> Sent: Wednesday, April 29, 2009 5:10 PM
>> To: Steven McKinney
>> Cc: R-help at r-project.org
>> Subject: Re: [R] Corrupt data frame construction - bug?
>>
>> On 29/04/2009 6:41 PM, Steven McKinney wrote:
>>> Hi useRs,
>>>
>>> A recent coding infelicity along these lines yielded a corrupt data
>>> frame.
>>>
>>> foo <- matrix(1:12, nrow = 3)
>>> bar <- data.frame(foo)
>>> bar$NewCol <- foo[foo[, 1] == 4, 4]
>>> bar
>>> lapply(bar, length)
>>>
>>>
>>>
>>>
>>>> foo <- matrix(1:12, nrow = 3)
>>>> bar <- data.frame(foo)
>>>> bar$NewCol <- foo[foo[, 1] == 4, 4]
>>>> bar
>>> X1 X2 X3 X4 NewCol
>>> 1 1 4 7 10 <NA>
>>> 2 2 5 8 11 <NA>
>>> 3 3 6 9 12 <NA>
>>> Warning message:
>>> In format.data.frame(x, digits = digits, na.encode = FALSE) :
>>> corrupt data frame: columns will be truncated or padded with NAs
>>>> lapply(bar, length)
>>> $X1
>>> [1] 3
>>>
>>> $X2
>>> [1] 3
>>>
>>> $X3
>>> [1] 3
>>>
>>> $X4
>>> [1] 3
>>>
>>> $NewCol
>>> [1] 0
>>>
>>>
>>> Is this a bug in the data.frame machinery?
>>> If an attempt is made to add a new column to a data frame, and the
>> new
>>> object does not have length = number of rows of data frame, or
> cannot
>>> be made to have such length via recycling, shouldn't an error be
>>> thrown?
>>>
>>> Instead in this example I end up with a "corrupt data frame" having
>>> one zero-length column.
>>>
>>>
>>> Should this be reported as a bug, or did I misinterpret the
>>> documentation?
>> I don't think "$" uses any data.frame machinery. You are working at a
>> lower level.
>>
>> If you had added the new column using
>>
>> bar <- data.frame(bar, NewCol=foo[foo[, 1] == 4, 4])
>>
>> you would have seen the error:
>>
>> Error in data.frame(bar, NewCol = foo[foo[, 1] == 4, 4]) :
>> arguments imply differing number of rows: 3, 0
>>
>> But since you treated it as a list, it let you go ahead and create
>> something that was labelled as a data.frame but wasn't. This is one
> of
>> the reasons some people prefer S4 methods: it's easier to protect
>> against people who mislabel things.
>>
>
> I did some more digging on '$' - there is a data.frame method for it:
>
>> getAnywhere("$<-.data.frame" )
> A single object matching '$<-.data.frame' was found
> It was found in the following places
> package:base
> registered S3 method for $<- from namespace base
> namespace:base
> with value
>
> function (x, i, value)
> {
> cl <- oldClass(x)
> class(x) <- NULL
> nrows <- .row_names_info(x, 2L)
> if (!is.null(value)) {
> N <- NROW(value)
> if (N > nrows)
> stop(gettextf("replacement has %d rows, data has %d",
> N, nrows), domain = NA)
> if (N < nrows && N > 0L)
> if (nrows%%N == 0L && length(dim(value)) <= 1L)
> value <- rep(value, length.out = nrows)
> else stop(gettextf("replacement has %d rows, data has %d",
> N, nrows), domain = NA)
> if (is.atomic(value))
> names(value) <- NULL
> }
> x[[i]] <- value
> class(x) <- cl
> return(x)
> }<environment: namespace:base>
>
>
> I placed a browser() command before return(x) and did some poking
> around.
>
> It seems to me there's a bug in this function. It should be able to
> detect the problem I threw at it, and throw an error as you point out is
> thrown by the other data.frame assign method.
>
>
> I modified the rows
> if (N < nrows && N > 0L)
> if (nrows%%N == 0L && length(dim(value)) <= 1L)
> to read
> if (N < nrows)
> if (N > 0L && nrows%%N == 0L && length(dim(value)) <= 1L)
>
> as in
>
> "$<-.data.frame" <-
> function (x, i, value)
> {
> cl <- oldClass(x)
> class(x) <- NULL
> nrows <- .row_names_info(x, 2L)
> if (!is.null(value)) {
> N <- NROW(value)
> if (N > nrows)
> stop(gettextf("replacement has %d rows, data has %d",
> N, nrows), domain = NA)
> if (N < nrows)
> if (N > 0L && nrows%%N == 0L && length(dim(value)) <= 1L)
> value <- rep(value, length.out = nrows)
> else stop(gettextf("replacement has %d rows, data has %d",
> N, nrows), domain = NA)
> if (is.atomic(value))
> names(value) <- NULL
> }
> x[[i]] <- value
> class(x) <- cl
> return(x)
> }
>
> Now it detects the problem I created, in the fashion you demonstrated
> above for the replacement using data.frame().
>
>> foo <- matrix(1:12, nrow = 3)
>> bar <- data.frame(foo)
>> bar$NewCol <- foo[foo[, 1] == 4, 4]
> Error in `$<-.data.frame`(`*tmp*`, "NewCol", value = integer(0)) :
> replacement has 0 rows, data has 3
>
> It doesn't appear to stumble on weird data frames (these from the
> ?data.frame help page)
>
>
>> L3 <- LETTERS[1:3]
>> (d <- data.frame(cbind(x=1, y=1:10), fac=sample(L3, 10,
> replace=TRUE)))
>> (d0 <- d[, FALSE]) # NULL data frame with 10 rows
>
>> (d.0 <- d[FALSE, ]) # <0 rows> data frame (3 cols)
>
>> (d00 <- d0[FALSE,]) # NULL data frame with 0 rows
>
>> d0$NewCol <- foo[foo[, 1] == 4, 4]
> Error in `$<-.data.frame`(`*tmp*`, "NewCol", value = integer(0)) :
> replacement has 0 rows, data has 10
>
> ### Catches this problem above alright.
>
>> d.0$NewCol <- foo[foo[, 1] == 4, 4]
>> d.0
> [1] x y fac NewCol
> <0 rows> (or 0-length row.names)
>
> ### Lets the above one through alright.
>
>> d00$NewCol <- foo[foo[, 1] == 4, 4]
>>
>> d00
> [1] NewCol
> <0 rows> (or 0-length row.names)
> ### Lets the above one through alright.
>
>
> Would the above modification work to fix this problem?
>
>
>
>
>
>> Duncan Murdoch
>>
>>>
>>>
>>>
>>>> sessionInfo()
>>> R version 2.9.0 (2009-04-17)
>>> powerpc-apple-darwin8.11.1
>>>
>>> locale:
>>> en_CA.UTF-8/en_CA.UTF-8/C/C/en_CA.UTF-8/en_CA.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] nlme_3.1-90
>>>
>>> loaded via a namespace (and not attached):
>>> [1] grid_2.9.0 lattice_0.17-22 tools_2.9.0
>>>
>>>
>>> Also occurs on Windows box with R 2.8.1
>>>
>>>
>>>
>>> Steven McKinney
>>>
>>> Statistician
>>> Molecular Oncology and Breast Cancer Program British Columbia Cancer
>>> Research Centre
>>>
>>> email: smckinney +at+ bccrc +dot+ ca
>>>
>>> tel: 604-675-8000 x7561
>>>
>>> BCCRC
>>> Molecular Oncology
>>> 675 West 10th Ave, Floor 4
>>> Vancouver B.C.
>>> V5Z 1L3
>>> Canada
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list