[R] Dates and missing values

Marc Schwartz marc_schwartz at me.com
Mon Feb 8 19:26:54 CET 2016


> On Feb 8, 2016, at 11:26 AM, Göran Broström <goran.brostrom at umu.se> wrote:
> 
> I have a data frame with dates as integers:
> 
> > summary(persons[, c("foddat", "doddat")])
>     foddat             doddat
> Min.   :16790000   Min.   :18000000
> 1st Qu.:18760904   1st Qu.:18810924
> Median :19030426   Median :19091227
> Mean   :18946659   Mean   :19027233
> 3rd Qu.:19220911   3rd Qu.:19310526
> Max.   :19660124   Max.   :19691228
> NA's   :624        NA's   :207570
> 
> After converting the dates to Date format ('as.Date') I get:
> 
> > summary(per[, c("foddat", "doddat")])
>    foddat               doddat
> Min.   :1679-07-01   Min.   :1800-01-26
> 1st Qu.:1876-09-04   1st Qu.:1881-09-24
> Median :1903-04-26   Median :1909-12-27
> Mean   :1895-02-04   Mean   :1903-02-22
> 3rd Qu.:1922-09-10   3rd Qu.:1931-05-26
> Max.   :1966-01-24   Max.   :1969-12-28
> 
> My question is: Why are the numbers of missing values not printed in the second case? 'is.na' gives the correct (same) numbers.
> 
> Can I somehow force 'summary' to print NA's? I found no clues in the documentation.


Hi,

Two things:

1. We are going to need to see the exact call to as.Date() that you used. as.Date() will accept a numeric vector, but it presumes that each number is a count of days since an origin, which needs to be specified explicitly. If you instead coerced the numeric vector to character first, presuming a "%Y%m%d" format, then you need to be careful about how that coercion is done and check the result.

2. Your second call is to a data frame called 'per', which may or may not have the same content as 'persons' in your first call.
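
As a rough sketch of the two routes mentioned in point 1: the values and the names 'n', 's', 'by_origin' and 'by_format' are made up for illustration and are not taken from the 'persons' data.

```r
# Route A: numeric input is read as a count of days since 'origin'.
n <- c(0, 365, NA)
by_origin <- as.Date(n, origin = "1970-01-01")

# Route B: character input is parsed against an explicit format string.
s <- c("18810924", "19091227", NA)
by_format <- as.Date(s, format = "%Y%m%d")

print(by_origin)
print(by_format)

# Note: a number such as 18810924 passed through route A would be read as
# about 18.8 million days after the origin, not as 1881-09-24.
```

Which route was used matters for the result, which is why the exact call is needed.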


If I do the following, taking some of your numeric values from above:

x <- c(18000000, 18810924, 19091227, 19027233, 19310526, 19691228, NA)

DF <- data.frame(x)

> summary(DF)
       x           
 Min.   :18000000  
 1st Qu.:18865001  
 Median :19059230  
 Mean   :18988523  
 3rd Qu.:19255701  
 Max.   :19691228  
 NA's   :1   

> as.character(DF$x)
[1] "1.8e+07"  "18810924" "19091227" "19027233" "19310526" "19691228"
[7] NA    

DF$x.Date <- as.Date(as.character(DF$x), format = "%Y%m%d")

> DF
         x     x.Date
1 18000000       <NA>
2 18810924 1881-09-24
3 19091227 1909-12-27
4 19027233       <NA>
5 19310526 1931-05-26
6 19691228 1969-12-28
7       NA       <NA>
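
Rows 1 and 4 come out as NA for different reasons: "1.8e+07" cannot match "%Y%m%d" at all, and 19027233 would need a month 72. A minimal sketch of one way to avoid the scientific notation and to flag the values that still fail, assuming sprintf() is an acceptable way to build the strings (the names 'x_chr', 'x_date' and 'bad' are only for illustration):

```r
x <- c(18000000, 18810924, 19091227, 19027233, 19310526, 19691228, NA)

# Fixed notation, no decimals; a missing value becomes the string "NA".
x_chr <- sprintf("%.0f", x)

# Strings that are not valid calendar dates are parsed to NA.
x_date <- as.Date(x_chr, format = "%Y%m%d")

# Values that were present as numbers but could not be parsed as dates.
bad <- is.na(x_date) & !is.na(x)

print(data.frame(x, x_chr, x_date, bad))
```

Even in fixed notation, 18000000 stays NA here, since month 00 and day 00 are not valid, so such placeholder dates would need a separate decision.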

> summary(DF)
       x                x.Date          
 Min.   :18000000   Min.   :1881-09-24  
 1st Qu.:18865001   1st Qu.:1902-12-04  
 Median :19059230   Median :1920-09-10  
 Mean   :18988523   Mean   :1923-04-12  
 3rd Qu.:19255701   3rd Qu.:1941-01-17  
 Max.   :19691228   Max.   :1969-12-28  
 NA's   :1          NA's   :3   


So summary() does report NA's for Date columns, via summary.Date().
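
If the NA's line still does not show up in your own output, a small sketch of counting the missing values per column directly, without relying on how summary() prints (this rebuilds the toy 'DF' from above; 'na_counts' is just an illustrative name):

```r
x <- c(18000000, 18810924, 19091227, 19027233, 19310526, 19691228, NA)
DF <- data.frame(x)
DF$x.Date <- as.Date(as.character(DF$x), format = "%Y%m%d")

# Number of missing values in each column, numeric or Date alike.
na_counts <- sapply(DF, function(col) sum(is.na(col)))
print(na_counts)
```

Running the same sapply() call on both 'persons' and 'per' would also show whether the two data frames really hold the same content.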

Regards,

Marc Schwartz


