[Rd] hist.Date() and cut.Date(): approximations used when using breaks = 'months' or 'years'

Sat Jan 12 03:35:21 CET 2008

Hi all,

I came across some curious behavior today when using hist.Date() and
subsequently noted the same behavior in cut.Date(). Both functions use
similar code when 'breaks = "months"' or 'breaks = "years"'.

I was in the process of creating a histogram of subject enrollment in a
clinical trial. The counts needed to be by month, so I essentially used:

  hist(Dates, breaks = "months")

When reviewing the counts generated, I noted a discrepancy between the
histogram and another frequency table generated independently. In
attempting to identify the etiology, I reviewed the code for hist.Date()
and noted the following:

          start <- as.POSIXlt(min(x, na.rm = TRUE))
...
          if (valid == 3) {
              start$mday <- 1
              incr <- 31
          }
          if (valid == 4) {
              start$mon <- 0
              incr <- 366
          }
          start <- .Internal(POSIXlt2Date(start))
          maxx <- max(x, na.rm = TRUE)
          breaks <- seq.int(start, maxx + incr, breaks)
          breaks <- breaks[1:(1 + max(which(breaks < maxx)))]
...
    res <- hist.default(unclass(x), unclass(breaks), plot = FALSE,
        ...)

The first check is for breaks = "months" and the second for "years".  If
I am reading it correctly, it seems that the discrepancy is due to
approximating the number of days in a month (31) and in a year (366).
The resulting breaks drift further and further from the true calendar
boundaries, which matters most for dates near the interval breaks.
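
As an illustration of the kind of drift described above, here is a
minimal sketch. It assumes breaks spaced a fixed 31 days apart, which is
only a reading of 'incr' in the quoted code and not a confirmed account
of what hist.Date() does internally. The names 'fixed', 'exact' and
'drift' are made up for this example.

```r
# Hypothetical illustration: fixed 31-day steps versus calendar months
start <- as.Date("2006-01-01")

# Breaks assumed to be spaced exactly 31 days apart
fixed <- seq(start, by = 31, length.out = 4)

# Breaks at true calendar month starts
exact <- seq(start, by = "month", length.out = 4)

# Days by which each fixed break is later than the calendar break
drift <- as.integer(fixed) - as.integer(exact)

print(data.frame(fixed, exact, drift))
```

Under that assumption the third break already falls on 2006-03-04
rather than 2006-03-01, so dates on March 1 through 3 would be counted
in the February bin.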

The use of approximations is not noted in ?hist.Date (or in ?cut.Date),
so I was a bit surprised.

To give a specific example, I have uploaded a text file containing a
date series that shows at least some aspects of the discrepancy.

# Read in the file and convert to dates
# Total of 1361 entries
Dates <- as.Date(scan("http://home.comcast.net/~marc_schwartz/Dates.txt",
                      what = "character"))

# Get the hist.Date() counts for months
> hist(Dates, breaks = "months", plot = FALSE)$counts
 [1]  2  3  2  9 10 15 21 34 52 85 77 59 56 71 73 55 52 88 67 66 74 86
[23] 58 96 64 71 15

# Get the hist.Date() counts for years
> hist(Dates, breaks = "years", plot = FALSE)$counts
[1]   6 533 822

# Now format the dates for the subsequent counts
months <- format(Dates, format = "%m")
years <- format(Dates, format = "%Y")

# Tabulate the years - NOTE there are 4 years, not 3 as above
> table(years)
years
2005 2006 2007 2008
   5  491  850   15

# Now split months by years and tabulate - NOTE count diffs
> sapply(split(months, years), table)
$`2005`

11 12
 1  4

$`2006`

01 02 03 04 05 06 07 08 09 10 11 12
 2  8 11 14 18 38 45 85 84 58 54 74

$`2007`

01 02 03 04 05 06 07 08 09 10 11 12
71 57 52 78 74 69 70 90 57 87 74 71

$`2008`

01
15

I think that it becomes clear just how far off the hist.Date() based
counts are, though this is clearly affected by the specific date series
in question.

I would like to suggest that a warning be added to both hist.Date() and
to cut.Date() giving users a heads up that approximations are being used
for these intervals, possibly resulting in count errors.

If it is desirable, I would be willing to spend some time incorporating
code similar to the format()-based tabulations above, as appropriate for
each interval specification, and make it available for both functions. I
suspect additional tweaking would be required to handle other aspects of
the two functions.
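
A minimal sketch of what such format()-based counting could look like
for months follows. The helper name 'count_by_month' and the sample
dates are hypothetical. This is not the hist.Date() or cut.Date()
implementation, only one possible way to get exact calendar-month
counts, including months with no observations.

```r
# Hypothetical helper: exact calendar-month counts for a Date vector
count_by_month <- function(x) {
  x <- x[!is.na(x)]

  # First day of the earliest and latest months present in the data
  first <- as.Date(format(min(x), "%Y-%m-01"))
  last  <- as.Date(format(max(x), "%Y-%m-01"))

  # One level per calendar month, so that empty months show up as 0
  lev <- format(seq(first, last, by = "month"), "%Y-%m")

  table(factor(format(x, "%Y-%m"), levels = lev))
}

# Made-up dates that sit on or next to month boundaries
d <- as.Date(c("2005-11-30", "2005-12-01", "2006-02-01", "2006-02-28"))
res <- count_by_month(d)
print(res)
```

Because each date is assigned to a month by its own "%Y-%m" label, a
date on the first of a month cannot fall into the preceding bin.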

If there are any pitfalls that I should be aware of that perhaps have
led to the use of the current approach, I'd love to hear about them, so
that I can avoid re-inventing the wheel, if it is desired for me to
proceed with code updates here.

Thanks,

Marc Schwartz