[Rd] Apparent bug in summary.data.frame() with columns of Date class and NA's present
Marc Schwartz
marc_schwartz at me.com
Mon Feb 8 23:03:24 CET 2016
Hi all,
Based upon an exchange with Göran Broström on R-Help today:
https://stat.ethz.ch/pipermail/r-help/2016-February/435992.html
there appears to be a bug in summary.data.frame(): when a data frame contains Date class columns that contain NA's, the NA count for those columns is dropped from the output unless at least one other column also contains NA's.
Example, modified from R-Help:
x <- c(18000000, 18810924, 19091227, 19027233, 19310526, 19691228, NA)
x.Date <- as.Date(as.character(x), format = "%Y%m%d")
DF.Dates <- data.frame(Col1 = x.Date)
> summary(x.Date)
        Min.      1st Qu.       Median         Mean      3rd Qu.
"1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17"
        Max.         NA's
"1969-12-28"          "3"
# NA's missing from output
> summary(DF.Dates)
      Col1
 Min.   :1881-09-24
 1st Qu.:1902-12-04
 Median :1920-09-10
 Mean   :1923-04-12
 3rd Qu.:1941-01-17
 Max.   :1969-12-28
DF.Dates$x1 <- 1:7
> DF.Dates
        Col1 x1
1       <NA>  1
2 1881-09-24  2
3 1909-12-27  3
4       <NA>  4
5 1931-05-26  5
6 1969-12-28  6
7       <NA>  7
# NA's still missing
> summary(DF.Dates)
      Col1                  x1
 Min.   :1881-09-24   Min.   :1.0
 1st Qu.:1902-12-04   1st Qu.:2.5
 Median :1920-09-10   Median :4.0
 Mean   :1923-04-12   Mean   :4.0
 3rd Qu.:1941-01-17   3rd Qu.:5.5
 Max.   :1969-12-28   Max.   :7.0
DF.Dates$x2 <- c(1:6, NA)
# NA's show if another column has any
> summary(DF.Dates)
      Col1                  x1            x2
 Min.   :1881-09-24   Min.   :1.0   Min.   :1.00
 1st Qu.:1902-12-04   1st Qu.:2.5   1st Qu.:2.25
 Median :1920-09-10   Median :4.0   Median :3.50
 Mean   :1923-04-12   Mean   :4.0   Mean   :3.50
 3rd Qu.:1941-01-17   3rd Qu.:5.5   3rd Qu.:4.75
 Max.   :1969-12-28   Max.   :7.0   Max.   :6.00
 NA's   :3                          NA's   :1
The behavior appears to occur because summary.Date() assigns an "NAs" attribute internally that contains the count of NA's in the source Date vector:
    x <- summary.default(unclass(object), digits = digits, ...)
    if (m <- match("NA's", names(x), 0)) {
        NAs <- as.integer(x[m])
        x <- x[-m]
        attr(x, "NAs") <- NAs
    }
rather than the count being retained as an actual element in the result vector, as in summary.default():
    nas <- is.na(object)
    object <- object[!nas]
    qq <- stats::quantile(object)
    qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)
    names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.",
                   "Max.")
    if (any(nas))
        c(qq, `NA's` = sum(nas))
    else qq
This results in an apparent (but not real) error in the value of the variable 'nr' within summary.data.frame(), which is used to set the length of the result created within that function:
    nr <- if (nv)
        max(unlist(lapply(z, NROW)))
    else 0
'nr' is used later in the function to set the length of the initial result vector 'sms', which in turn is assigned back to the result list 'z'.
In the case of my example above, where the NA's are not printed, 'nr' is 6 rather than 7. The value 6 is correct, since that is the actual length of the result vector from summary.Date(), even though the printed output of that result appears to contain 7 elements, including the NA count, because of the behavior of print.summaryDefault().
This results in an apparent truncation of the result, with a loss of the "NAs" attribute from summary.Date(), when the result is returned by summary.data.frame().
If the source vector is numeric, as per my example above, then 'nr' is set to 7 when NA's are present and the result is correctly printed along with the other columns.
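The length difference described above can be reproduced without calling any summary() method, which keeps it independent of the R version in use. The following is only a sketch: the two vectors are hand-built stand-ins for the two storage styles, and the names 'as_element', 'as_attribute', 'nr_attr_only' and 'nr_mixed' are mine, not from the R sources.

```r
# Hand-built stand-ins for the two storage styles (values are illustrative).
labels <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
vals <- c(1, 2.25, 3.5, 3.5, 4.75, 6)

# Style of summary.default(): the NA count is a seventh element.
as_element <- c(setNames(vals, labels), "NA's" = 1)

# Style of summary.Date(): six elements, NA count kept only as an attribute.
as_attribute <- setNames(vals, labels)
attr(as_attribute, "NAs") <- 1L

# Same expression as in summary.data.frame(): attributes do not add to NROW().
nr_attr_only <- max(unlist(lapply(list(Col1 = as_attribute), NROW)))
nr_mixed <- max(unlist(lapply(list(Col1 = as_attribute, x2 = as_element), NROW)))

print(c(attr_only = nr_attr_only, mixed = nr_mixed))
```

With only the attribute-style vector present, the computed length has no slot for the NA count; adding one element-style vector raises it by one, which matches the behavior seen once 'x2' is added to DF.Dates.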
The history behind the different ways in which the NA counts are stored across the summary() methods is not clear to me, so I am unsure how best to approach a resolution.
At least three options seem possible and I have not fully thought through the implications of each yet:
1. Modify the code that creates and uses 'nr' in summary.data.frame(), to account for the NAs attribute from summary.Date().
2. Restore the NAs attribute later in the code, if present in the vector that results from summary.Date().
3. Modify the code in summary.Date() so that it mimics the approach in summary.default() relative to storing the NA count.
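To make option 1 a bit more concrete, here is a rough sketch of how the length computation could account for the attribute. This is not a proposed patch and not code from R itself: 'summary_rows' is a hypothetical helper, and the list 'z' is built by hand to stand in for the per-column summaries of the example above.

```r
labels <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")

# Stand-in for summary.Date() output: six values plus an "NAs" attribute.
col1 <- setNames(
    as.numeric(as.Date(c("1881-09-24", "1902-12-04", "1920-09-10",
                         "1923-04-12", "1941-01-17", "1969-12-28"))),
    labels)
attr(col1, "NAs") <- 3L

# Stand-in for a numeric column summary without NA's.
x1 <- setNames(c(1, 2.5, 4, 4, 5.5, 7), labels)

z <- list(Col1 = col1, x1 = x1)

# Hypothetical helper: count one extra row when an "NAs" attribute is present.
summary_rows <- function(s) NROW(s) + as.integer(!is.null(attr(s, "NAs")))

nr_original <- max(unlist(lapply(z, NROW)))
nr_adjusted <- max(vapply(z, summary_rows, integer(1)))

print(c(original = nr_original, adjusted = nr_adjusted))
```

Adjusting 'nr' alone would not be sufficient, since the code that fills the result would also need to write the NA count into the extra row, so this covers only the first half of option 1.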
It is important to note that summary.POSIXct() has code similar to summary.Date() relative to the handling of NA's.
In addition, print.summaryDefault() contains checks for both Date and POSIXct classes and outputs accordingly. So the inter-dependencies of the handling of NA's across the methods are notable.
Thus, since there are likely to be other implications for the choice of resolution that I am not considering here and I am likely to be missing some nuances here, I defer to others for comments/corrections.
Thanks and regards,
Marc Schwartz