[R] Dates and missing values

Marc Schwartz marc_schwartz at me.com
Mon Feb 8 21:36:56 CET 2016


> On Feb 8, 2016, at 12:45 PM, Göran Broström <goran.brostrom at umu.se> wrote:
> 
> Thanks Marc, but see below!
> 
> On 2016-02-08 19:26, Marc Schwartz wrote:
>> 
>>> On Feb 8, 2016, at 11:26 AM, Göran Broström <goran.brostrom at umu.se> wrote:
>>> 
>>> I have a data frame with dates as integers:
>>> 
>>>> summary(persons[, c("foddat", "doddat")])
>>>     foddat             doddat
>>> Min.   :16790000   Min.   :18000000
>>> 1st Qu.:18760904   1st Qu.:18810924
>>> Median :19030426   Median :19091227
>>> Mean   :18946659   Mean   :19027233
>>> 3rd Qu.:19220911   3rd Qu.:19310526
>>> Max.   :19660124   Max.   :19691228
>>> NA's   :624        NA's   :207570
>>> 
>>> After converting the dates to Date format ('as.Date') I get:
>>> 
>>>> summary(per[, c("foddat", "doddat")])
>>>    foddat               doddat
>>> Min.   :1679-07-01   Min.   :1800-01-26
>>> 1st Qu.:1876-09-04   1st Qu.:1881-09-24
>>> Median :1903-04-26   Median :1909-12-27
>>> Mean   :1895-02-04   Mean   :1903-02-22
>>> 3rd Qu.:1922-09-10   3rd Qu.:1931-05-26
>>> Max.   :1966-01-24   Max.   :1969-12-28
>>> 
>>> My question is: Why are the numbers of missing values not printed in the second case? 'is.na' gives the correct (same) numbers.
>>> 
>>> Can I somehow force 'summary' to print NA's? I found no clues in the documentation.
>> 
>> 
>> Hi,
>> 
>> Two things:
>> 
>> 1. We are going to need to see the exact call to as.Date() that you used. as.Date() will take a numeric vector as input, but the presumption is that the number represents the number of days since an origin, which needs to be specified explicitly. If you coerced the numeric vector to character first, presuming a "%Y%m%d" format, then you need to be cautious about how that is done and the result.
>> 
>> 2. Your second call is to a data frame called 'per', which may or may not have the same content as 'persons' in your first call.
>> 
>> 
>> If I do the following, taking some of your numeric values from above:
>> 
>> x <- c(18000000, 18810924, 19091227, 19027233, 19310526, 19691228, NA)
>> 
>> DF <- data.frame(x)
>> 
>>> summary(DF)
>>        x
>>  Min.   :18000000
>>  1st Qu.:18865001
>>  Median :19059230
>>  Mean   :18988523
>>  3rd Qu.:19255701
>>  Max.   :19691228
>>  NA's   :1
>> 
>>> as.character(DF$x)
>> [1] "1.8e+07"  "18810924" "19091227" "19027233" "19310526" "19691228"
>> [7] NA
>> 
>> DF$x.Date <- as.Date(as.character(DF$x), format = "%Y%m%d")
>> 
>>> DF
>>          x     x.Date
>> 1 18000000       <NA>
>> 2 18810924 1881-09-24
>> 3 19091227 1909-12-27
>> 4 19027233       <NA>
>> 5 19310526 1931-05-26
>> 6 19691228 1969-12-28
>> 7       NA       <NA>
>> 
>>> summary(DF)
>>        x                x.Date
>>  Min.   :18000000   Min.   :1881-09-24
>>  1st Qu.:18865001   1st Qu.:1902-12-04
>>  Median :19059230   Median :1920-09-10
>>  Mean   :18988523   Mean   :1923-04-12
>>  3rd Qu.:19255701   3rd Qu.:1941-01-17
>>  Max.   :19691228   Max.   :1969-12-28
>>  NA's   :1          NA's   :3
>> 
> But:
> 
> > summary(DF[, "x.Date", drop = FALSE])
>     x.Date
> Min.   :1881-09-24
> 1st Qu.:1902-12-04
> Median :1920-09-10
> Mean   :1923-04-12
> 3rd Qu.:1941-01-17
> Max.   :1969-12-28
> 
> No NA's. But again:
> 
> > summary(DF[, "x.Date"])
>        Min.      1st Qu.       Median         Mean      3rd Qu.   Max.
> "1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17" "1969-12-28"
>        NA's
>         "3"
> 
>> 
>> So summary does support the reporting of NA's for Dates, using summary.Date().
> 
> Not always, as it seems. Strange. (The 'persons' vs. 'per' is a red herring.)
> 
> Göran Broström


Ok, thanks for the clarification.

I spent some time running summary.Date() under debug, trying to see where things fail.

Within the function, the result object 'x' is created correctly, with the correct class attributes and with the count of NA values retained in an "NAs" attribute.

However, upon function exit, the class attributes appear to be lost and the result is of class table, which also drops the "NAs" attribute that was assigned within the function body.

I believe that this is happening within summary.data.frame().

I can extend the example more generally, when the only columns in the source data frame are Dates:

DF.Dates <- data.frame(Col1 = DF$x.Date, Col2 = DF$x.Date)

> DF.Dates
        Col1       Col2
1       <NA>       <NA>
2 1881-09-24 1881-09-24
3 1909-12-27 1909-12-27
4       <NA>       <NA>
5 1931-05-26 1931-05-26
6 1969-12-28 1969-12-28
7       <NA>       <NA>

> summary(DF.Dates)
      Col1                 Col2           
 Min.   :1881-09-24   Min.   :1881-09-24  
 1st Qu.:1902-12-04   1st Qu.:1902-12-04  
 Median :1920-09-10   Median :1920-09-10  
 Mean   :1923-04-12   Mean   :1923-04-12  
 3rd Qu.:1941-01-17   3rd Qu.:1941-01-17  
 Max.   :1969-12-28   Max.   :1969-12-28  


So, the behavior does not depend on the subsetting used in your original call per se; it occurs whenever the data frame passed to summary.data.frame() consists of only Date class columns.
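
For anyone who wants to reproduce this without the earlier steps, the following is a minimal, self-contained sketch. The Date values are the ones from the example above; whether the NA's row shows up in the summary() output may well depend on the R version in use, so no expected output is given for that part.

```r
# Rebuild the Date-only data frame from the example above.
d <- as.Date(c(NA, "1881-09-24", "1909-12-27", NA,
               "1931-05-26", "1969-12-28", NA))
DF.Dates <- data.frame(Col1 = d, Col2 = d)

# is.na() gives the reference count of missing values per column.
na.counts <- colSums(is.na(DF.Dates))
print(na.counts)

# Compare with what summary() prints for the same data frame.
print(summary(DF.Dates))
```

If the printed summary has no NA's row while 'na.counts' is non-zero, the discrepancy described here is present.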

I am still working through the code, but my preliminary finding is that the issue stems from the following line in summary.data.frame():

length(sms) <- nr

which truncates the internal object 'sms' from length 7 before that line to length 6 afterwards:

Browse[2]> nr
[1] 6
Browse[2]> sms
[1] "Min.   :1881-09-24  " "1st Qu.:1902-12-04  " "Median :1920-09-10  "
[4] "Mean   :1923-04-12  " "3rd Qu.:1941-01-17  " "Max.   :1969-12-28  "
[7] "NA's   :3  "         
Browse[2]> 
debug: length(sms) <- nr
Browse[2]> sms
[1] "Min.   :1881-09-24  " "1st Qu.:1902-12-04  " "Median :1920-09-10  "
[4] "Mean   :1923-04-12  " "3rd Qu.:1941-01-17  " "Max.   :1969-12-28  "
[7] "NA's   :3  "         
Browse[2]> 
debug: z[[i]] <- sms
Browse[2]> sms
[1] "Min.   :1881-09-24  " "1st Qu.:1902-12-04  " "Median :1920-09-10  "
[4] "Mean   :1923-04-12  " "3rd Qu.:1941-01-17  " "Max.   :1969-12-28  "
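
The effect of that assignment can be seen in isolation. This is only a sketch with a stand-in vector that mimics the 'sms' values from the trace; it does not call summary.data.frame() itself.

```r
# Stand-in for the formatted cells of one column, as seen in the debugger.
sms <- c("Min.   :1881-09-24", "1st Qu.:1902-12-04", "Median :1920-09-10",
         "Mean   :1923-04-12", "3rd Qu.:1941-01-17", "Max.   :1969-12-28",
         "NA's   :3")
nr <- 6L

# Shortening a vector with `length<-` silently drops the trailing elements.
length(sms) <- nr
print(sms)

# Lengthening pads with NA instead (presumably the reason the line exists:
# shorter columns get filled up to a common number of rows).
short <- c("a", "b")
length(short) <- 4L
print(short)
```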



OK, I now believe that I have found the issue...

Internally, an object 'z' is created by the following:

z <- lapply(X = as.list(object), FUN = summary, maxsum = maxsum, 
        digits = 12L, ...)

For my data frame, DF.Dates, 'z' is:

Browse[2]> z
$Col1
        Min.      1st Qu.       Median         Mean      3rd Qu. 
"1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17" 
        Max.         NA's 
"1969-12-28"          "3" 

$Col2
        Min.      1st Qu.       Median         Mean      3rd Qu. 
"1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17" 
        Max.         NA's 
"1969-12-28"          "3" 

which shows the result of summary.Date() on the two columns. 

The print()ed output is the result of each list element being of the class set by summary.Date():

Browse[2]> class(z$Col1)
[1] "summaryDefault" "table"          "Date"          
Browse[2]> class(z$Col2)
[1] "summaryDefault" "table"          "Date"       


The problem is that the NA component of the result is an attribute and not part of the vector itself:

Browse[2]> str(z)
List of 2
 $ Col1: summaryDefault[1:6], format: "1881-09-24" ...
  ..- attr(*, "names")="Min." ...
 $ Col2: summaryDefault[1:6], format: "1881-09-24" ...
  ..- attr(*, "names")="Min." ...


Note that each list element is of length 6, hence the value used in 'nr' above, rather than 7.

The counts of NA values are stored in attributes:

Browse[2]> attr(z$Col1, "NAs")
[1] 3
Browse[2]> attr(z$Col2, "NAs")
[1] 3


Hence, when the internal variable 'nr' is set, it is:

Browse[2]> max(unlist(lapply(z, NROW)))
[1] 6

Browse[2]> nr
[1] 6
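
The distinction between elements and attributes can be checked with a stand-in object. The vector 's' below is made up for illustration; it only mimics the shape of the summary.Date() result (six values plus an "NAs" attribute).

```r
# Stand-in for one column's summary: six values plus an "NAs" attribute.
s <- structure(c(10, 20, 30, 40, 50, 60), NAs = 3L)

print(length(s))       # counts elements only
print(NROW(s))         # same as length() for a plain vector
print(attr(s, "NAs"))  # the NA count lives outside the vector
```

Any row count derived from NROW() therefore cannot see the NA count.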


And...that results in the truncation seen above and the loss of the NA counts that summary.Date() had returned.

My original example, where a Date column is present alongside columns of other data types, worked because 'nr' is set to the correct length (7) by the other columns, BUT only if NA's are present in at least one of those other columns:

DF.Dates$Col3 <- 1:7

> DF.Dates
        Col1       Col2 Col3
1       <NA>       <NA>    1
2 1881-09-24 1881-09-24    2
3 1909-12-27 1909-12-27    3
4       <NA>       <NA>    4
5 1931-05-26 1931-05-26    5
6 1969-12-28 1969-12-28    6
7       <NA>       <NA>    7

> summary(DF.Dates)
      Col1                 Col2                 Col3    
 Min.   :1881-09-24   Min.   :1881-09-24   Min.   :1.0  
 1st Qu.:1902-12-04   1st Qu.:1902-12-04   1st Qu.:2.5  
 Median :1920-09-10   Median :1920-09-10   Median :4.0  
 Mean   :1923-04-12   Mean   :1923-04-12   Mean   :4.0  
 3rd Qu.:1941-01-17   3rd Qu.:1941-01-17   3rd Qu.:5.5  
 Max.   :1969-12-28   Max.   :1969-12-28   Max.   :7.0  


DF.Dates$Col3 <- c(1:6, NA)

> summary(DF.Dates)
      Col1                 Col2                 Col3     
 Min.   :1881-09-24   Min.   :1881-09-24   Min.   :1.00  
 1st Qu.:1902-12-04   1st Qu.:1902-12-04   1st Qu.:2.25  
 Median :1920-09-10   Median :1920-09-10   Median :3.50  
 Mean   :1923-04-12   Mean   :1923-04-12   Mean   :3.50  
 3rd Qu.:1941-01-17   3rd Qu.:1941-01-17   3rd Qu.:4.75  
 Max.   :1969-12-28   Max.   :1969-12-28   Max.   :6.00  
 NA's   :3            NA's   :3            NA's   :1     



So, from what this suggests, there is a bug in summary.data.frame() that shows up when the only columns containing NA's are Date class columns.
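
Until that is addressed, one possible workaround is to summarise each column on its own, since the output earlier in this thread shows summary() on a single Date vector reporting the NA's. This is a sketch only; it sidesteps summary.data.frame() rather than fixing it.

```r
# Same Date-only data frame as in the example above.
d <- as.Date(c(NA, "1881-09-24", "1909-12-27", NA,
               "1931-05-26", "1969-12-28", NA))
DF.Dates <- data.frame(Col1 = d, Col2 = d)

# Summarise each column separately, bypassing summary.data.frame().
per.column <- lapply(DF.Dates, summary)
print(per.column)
```

The result is a list with one summary per column rather than a single table, so the layout differs from the usual data frame summary.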

The key would seem to be to modify the code that creates 'nr', which is currently:

nr <- if (nv)
    max(unlist(lapply(z, NROW)))
else 0


to account for the presence of the "NAs" attribute from summary.Date(); or to restore the attribute further down in the code, if present; or, alternatively, to modify the code for summary.Date() so that, rather than adding the "NAs" attribute:

x <- summary.default(unclass(object), digits = digits, ...)
if (m <- match("NA's", names(x), 0)) {
    NAs <- as.integer(x[m])
    x <- x[-m]
    attr(x, "NAs") <- NAs
}


it behaves more like summary.default(), so that the NA count is an actual element in the result vector, rather than an attribute:

nas <- is.na(object)
object <- object[!nas]
qq <- stats::quantile(object)
qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)
names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.",
    "Max.")
if (any(nas))
    c(qq, `NA's` = sum(nas))
else qq
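
To make the two directions a little more concrete, here is a rough sketch of each. Both are hypothetical: 'rows.needed' and 'summary_date_chr' are made-up names, the list 'z' is a mock of the per-column summaries, and none of this has been tested against the actual summary.data.frame() or summary.Date() sources.

```r
# Option 1 (hypothetical): let 'nr' reserve a row for the "NAs" attribute.
# Mock per-column summaries: 'a' mimics summary.Date() output, 'b' has no NAs.
z <- list(a = structure(1:6, NAs = 3L), b = 1:6)
nr.old <- max(unlist(lapply(z, NROW)))          # calculation quoted above
rows.needed <- function(s) NROW(s) + !is.null(attr(s, "NAs"))
nr.new <- max(unlist(lapply(z, rows.needed)))   # one extra row for 'a'

# Option 2 (hypothetical): return the NA count as a real element.
# 'summary_date_chr' is an illustrative helper, not part of base R. It
# returns a character vector so dates and the count can share one vector.
summary_date_chr <- function(x) {
    stopifnot(inherits(x, "Date"))
    nas <- is.na(x)
    obs <- unclass(x[!nas])                     # days since 1970-01-01
    qq <- stats::quantile(obs, names = FALSE)
    vals <- c(qq[1:3], mean(obs), qq[4:5])
    out <- format(as.Date(vals, origin = "1970-01-01"))
    names(out) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
    if (any(nas)) c(out, "NA's" = as.character(sum(nas))) else out
}

d <- as.Date(c(NA, "1881-09-24", "1909-12-27", NA,
               "1931-05-26", "1969-12-28", NA))
res <- summary_date_chr(d)
print(res)
```

With the second option the result has seven elements when NA's are present, so a length-based 'nr' would see the extra row without any change to summary.data.frame(). The cost is that the result is no longer a Date vector, which may matter to other code that relies on the current class.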


This is where I would defer to a member of R Core for guidance, since I presume that there may be some logic in the difference, other than perhaps different authors over time, and there may be other implications that I am not considering here.

Regards,

Marc


