[R] Histogram omitting/collapsing groups

Aren Cambre aren at arencambre.com
Sat Dec 31 16:25:25 CET 2011


I have two large datasets (156K and 2.06M records). Each row has the
hour that an event happened, represented by an integer from 0 to 23.

R's hist() is combining two of the hours into a single bin.

Here's the command I ran to get the histogram:
> histinfo <- hist(crashes$hour, right=FALSE)

Here's histinfo:
> histinfo
$breaks
 [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

$counts
 [1]  4755  4618  5959  3292  2378  2715  4592  6144  6860  5598  5601  6596  7152  7490  8166
[16]  9758 11301 11745  9943  7494  6272  6220 11669

$intensities
 [1] 0.03041876 0.02954234 0.03812101 0.02105963 0.01521258 0.01736844 0.02937602 0.03930449
 [9] 0.04388490 0.03581161 0.03583081 0.04219604 0.04575289 0.04791515 0.05223967 0.06242403
[17] 0.07229494 0.07513530 0.06360752 0.04794074 0.04012334 0.03979068 0.07464911

$density
 [1] 0.03041876 0.02954234 0.03812101 0.02105963 0.01521258 0.01736844 0.02937602 0.03930449
 [9] 0.04388490 0.03581161 0.03583081 0.04219604 0.04575289 0.04791515 0.05223967 0.06242403
[17] 0.07229494 0.07513530 0.06360752 0.04794074 0.04012334 0.03979068 0.07464911

$mids
 [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5
[19] 18.5 19.5 20.5 21.5 22.5

$xname
[1] "crashes$hour"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

Note that the last value in $counts is 11669. Compare that with the
output of table(crashes$hour):
    0     1     2     3     4     5     6     7     8     9    10    11    12    13    14
 4755  4618  5959  3292  2378  2715  4592  6144  6860  5598  5601  6596  7152  7490  8166
   15    16    17    18    19    20    21    22    23
 9758 11301 11745  9943  7494  6272  6220  6000  5669

Notice that the counts for hours 22 and 23 in table(crashes$hour) sum
to 11669 (6000 + 5669). Is it correct for the histogram to combine
hours 22 and 23? Since I specified right = FALSE, I figured there was
no way 23 would be combined with 22.
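
For anyone who wants to reproduce this without my data, here is a
minimal sketch. It assumes a stand-in vector, hour, rebuilt from the
table(crashes$hour) counts above (the name hour and the use of
plot = FALSE are my additions for illustration, not part of the
original command):

```r
# Hourly counts copied from the table(crashes$hour) output above.
counts <- c(4755, 4618, 5959, 3292, 2378, 2715, 4592, 6144, 6860, 5598,
            5601, 6596, 7152, 7490, 8166, 9758, 11301, 11745, 9943, 7494,
            6272, 6220, 6000, 5669)

# Hypothetical stand-in for crashes$hour: each hour 0..23 repeated by its count.
hour <- rep(0:23, times = counts)

# Same call as above, with plot = FALSE so only the bin information is computed.
h <- hist(hour, right = FALSE, plot = FALSE)

h$breaks            # break points chosen by hist()
length(h$counts)    # number of bins
tail(h$counts, 1)   # count in the last bin
sum(hour >= 22)     # number of records with hour 22 or 23
```

If the last two values printed agree, the last bin is holding both
hour 22 and hour 23.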

Adding breaks=24 to the hist call makes no difference; it's still
stuck at 23 bins (24 break points, 0 through 23). I also tried
breaks=25 and 23 and several other values, in case I am
misinterpreting what breaks means, but none of them make a difference.
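
My reading of ?hist is that a single number given for breaks is only a
suggestion, and that the break points are then set to "pretty" values,
which might be why these values change nothing. A minimal sketch of the
alternative, assuming an explicit vector of break points (0:24 here) is
used instead, and using a stand-in vector hour rebuilt from the
table(crashes$hour) counts above rather than the real data:

```r
# Hourly counts copied from the table(crashes$hour) output above.
counts <- c(4755, 4618, 5959, 3292, 2378, 2715, 4592, 6144, 6860, 5598,
            5601, 6596, 7152, 7490, 8166, 9758, 11301, 11745, 9943, 7494,
            6272, 6220, 6000, 5669)

# Hypothetical stand-in for crashes$hour: each hour 0..23 repeated by its count.
hour <- rep(0:23, times = counts)

# Explicit break points 0, 1, ..., 24 give 24 bins, so hour 23 falls in its
# own bin (23 to 24) instead of sharing the last one with hour 22.
h24 <- hist(hour, breaks = 0:24, right = FALSE, plot = FALSE)

length(h24$counts)                          # number of bins
all(h24$counts == as.vector(table(hour)))   # do bins match table() hour by hour?
```

Break points at the half hours, seq(-0.5, 23.5, by = 1), should be
another option; they would center each bar on its hour.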

I imagine this is a n00b question, so my apologies if this is obvious.

Aren


