[Rd] hist.Date() and cut.Date(): approximations used when using breaks = 'months' or 'years'
Marc Schwartz
marc_schwartz at comcast.net
Sat Jan 12 03:35:21 CET 2008
Hi all,
I came across some curious behavior today in using hist.Date() and
subsequently noted the same behavior in cut.Date(), both of which are
using similar code when 'breaks = "months"' or 'breaks = "years"'.
I was in the process of creating a histogram of subject enrollment in a
clinical trial. The counts needed to be by month, so I essentially used:
hist(Dates, breaks = "months")
When reviewing the counts generated, I noted a discrepancy between the
histogram and another frequency table generated independently. In
attempting to identify the etiology, I reviewed the code for hist.Date()
and noted the following:
start <- as.POSIXlt(min(x, na.rm = TRUE))
...
if (valid == 3) {
    start$mday <- 1
    incr <- 31
}
if (valid == 4) {
    start$mon <- 0
    incr <- 366
}
start <- .Internal(POSIXlt2Date(start))
maxx <- max(x, na.rm = TRUE)
breaks <- seq.int(start, maxx + incr, breaks)
breaks <- breaks[1:(1 + max(which(breaks < maxx)))]
...
res <- hist.default(unclass(x), unclass(breaks), plot = FALSE,
                    ...)
The first check is for breaks = "months" and the second for "years". If
I am reading it correctly, the discrepancy appears to come from
approximating a month as 31 days and a year as 366 days. The resulting
breaks would drift further and further from the true calendar
boundaries, which matters most for dates that fall near the interval
breaks.
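To illustrate the suspected drift, here is a minimal sketch comparing
break points spaced a fixed 31 days apart with true calendar-month break
points. It assumes the reading of the code above is right, which has not
been confirmed, and the start date is made up for illustration rather
than taken from the data.

```r
# Hypothetical start date, chosen only for illustration
start <- as.Date("2005-11-01")

# Break points spaced a fixed 31 days apart (the suspected approximation)
fixed <- seq(start, by = 31, length.out = 6)

# Break points on the first day of each calendar month
calendar <- seq(start, by = "month", length.out = 6)

# Days by which each fixed-step break falls after the calendar break
drift <- as.numeric(fixed - calendar)
print(data.frame(fixed, calendar, drift))
```

In this sketch the fifth fixed-step break lands on 2006-03-05 rather than
2006-03-01, so a date such as 2006-03-02 would be binned with the
preceding interval if breaks were spaced this way.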
The use of approximations is not noted in ?hist.Date (or in ?cut.Date),
so I was a bit surprised.
To give a specific example, I have uploaded a text file containing a
date series that shows at least some aspects of the discrepancy.
# Read in the file and convert to dates
# Total of 1361 entries
Dates <- as.Date(scan("http://home.comcast.net/~marc_schwartz/Dates.txt",
                      what = "character"))
# Get the hist.Date() counts for months
> hist(Dates, breaks = "months", plot = FALSE)$counts
[1] 2 3 2 9 10 15 21 34 52 85 77 59 56 71 73 55 52 88 67 66 74 86
[23] 58 96 64 71 15
# Get the hist.Date() counts for years
> hist(Dates, breaks = "years", plot = FALSE)$counts
[1] 6 533 822
# Now format the dates for the subsequent counts
months <- format(Dates, format = "%m")
years <- format(Dates, format = "%Y")
# Tabulate the years - NOTE there are 4 years, not 3 as above
> table(years)
years
2005 2006 2007 2008
   5  491  850   15
# Now split months by years and tabulate - NOTE count diffs
> sapply(split(months, years), table)
$`2005`

11 12
 1  4

$`2006`

01 02 03 04 05 06 07 08 09 10 11 12
 2  8 11 14 18 38 45 85 84 58 54 74

$`2007`

01 02 03 04 05 06 07 08 09 10 11 12
71 57 52 78 74 69 70 90 57 87 74 71

$`2008`

01
15
I think it becomes clear just how far off the hist.Date() based counts
are, though the size of the discrepancy clearly depends on the specific
date series in question.
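The independent tabulation can also be done in one pass by labelling each
date with its year and month. The following is a minimal sketch on a
small made-up vector (not the dates from the uploaded file); the variable
names are illustrative only. Building the factor levels from a
month-by-month sequence keeps empty months in the table with a zero
count.

```r
# Hypothetical sample dates, not the data from Dates.txt
x <- as.Date(c("2005-11-30", "2005-12-01", "2005-12-31",
               "2006-01-01", "2006-03-15"))

# Label each date with its calendar year and month
ym <- format(x, "%Y-%m")

# Full set of month labels from the first to the last observed month,
# so that months with no dates still appear with a count of zero
first <- as.Date(paste0(min(ym), "-01"))
last  <- as.Date(paste0(max(ym), "-01"))
lev   <- format(seq(first, last, by = "month"), "%Y-%m")

counts <- table(factor(ym, levels = lev))
print(counts)
```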
I would like to suggest that a warning be added to both hist.Date() and
to cut.Date() giving users a heads up that approximations are being used
for these intervals, possibly resulting in count errors.
If it is desirable, I would be willing to spend some time incorporating
code similar to the above, as appropriate for each interval
specification, and make it available for both functions. I suspect
additional tweaking would be required to handle other aspects of the two
functions as required.
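One possible shape for such code, as a rough sketch only: the helpers
below build break points on exact calendar boundaries and count dates in
left-closed intervals. The names calendar_breaks() and calendar_counts()
and the sample dates are hypothetical; they are not part of R, and this
is not a patch against hist.Date() or cut.Date().

```r
# Hypothetical helper: breaks on exact month or year boundaries covering x
calendar_breaks <- function(x, unit = c("month", "year")) {
  unit <- match.arg(unit)
  rng <- range(x, na.rm = TRUE)

  # Move the earliest date back to the start of its month (or year)
  lo <- as.POSIXlt(rng[1])
  lo$mday <- 1L
  if (unit == "year") lo$mon <- 0L
  lo <- as.Date(lo)

  # Count the period starts up to the latest date, then add one closing break
  n <- length(seq(lo, rng[2], by = unit))
  seq(lo, by = unit, length.out = n + 1)
}

# Hypothetical helper: counts per interval [breaks[i], breaks[i + 1])
calendar_counts <- function(x, unit = c("month", "year")) {
  unit <- match.arg(unit)
  x <- x[!is.na(x)]
  brks <- calendar_breaks(x, unit)
  idx <- findInterval(as.numeric(x), as.numeric(brks))
  tabulate(idx, nbins = length(brks) - 1L)
}

# Hypothetical sample dates, not the data from Dates.txt
x <- as.Date(c("2005-11-30", "2005-12-01", "2005-12-31",
               "2006-01-01", "2006-03-15"))
print(calendar_counts(x, "month"))
print(calendar_counts(x, "year"))
```

Using left-closed intervals here is a design choice for the sketch: it
places a date that falls on the first of a month in the month it starts,
which matches the format()-based tabulation.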
If there are any pitfalls that I should be aware of that perhaps have
led to the use of the current approach, I'd love to hear about them, so
that I can avoid re-inventing the wheel, if it is desired for me to
proceed with code updates here.
Thanks,
Marc Schwartz