[R] Drawing a histogram from a massive dataset

Tue Jul 19 01:30:18 CEST 2011

On Mon, Jul 18, 2011 at 2:08 PM, Paul Smith <phhs80 at gmail.com> wrote:
> On Mon, Jul 18, 2011 at 9:11 PM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
>>> [snip] I guess that I must have a data frame to plot a histogram.
>>
>> Not at all!
>>
>> ## a *vector* of 100 million observations
>> x <- rnorm(10^8)
>> ## a histogram for it (see attached for the result from my system)
>> hist(x)
>>
>> No data frame required.  I would not try this straight in anything but
>> traditional graphics for a 100 million observation vector, but if you
>> wanted it made in ggplot2 or something, you could prebin the data and
>> THEN plot bars corresponding to the bins.
>
> Thanks, Joshua, for your answer.
>
> True: A vector is enough to supply data for hist(). But my point is:
> Can a histogram be drawn without having all data on the computer
> memory? You partially answer this question by suggesting to prebin
> the data. Can this prebinning process be done transparently but chunk
> by chunk of data underneath?

Sure, as long as you can figure out some basic details about the full
dataset.  Just define your breaks, and then for chunks of the data at
a time, count how many fall into any particular bin.  Once you are
done, add up all the counts for each bin, and voila.

## In practice, get n, min, and max from the full data (e.g., using SQL);
## a small simulated vector stands in for the full data here
x <- rnorm(1000)
n <- length(x)
minx <- min(x)
maxx <- max(x)

## Sturges style breaks
breaks <- pretty(c(minx, maxx), n = ceiling(log2(n) + 1))
nB <- length(breaks)

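## Nudge the breaks slightly, as hist() does internally, so that values
## sitting right on a break are binned consistently
fuzz <- rep(1e-07 * median(diff(breaks)), nB)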
fuzz[1] <- fuzz[1] * -1
fuzzybreaks <- breaks + fuzz

chunks <- 10

counts <- matrix(NA, nrow = chunks, ncol = nB - 1,
  dimnames = list(paste("Sec", 1:chunks, sep = ''),
    as.character(fuzzybreaks[-1])))

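for(i in 1:chunks) {
  ## positions of the observations in chunk i
  ## (assumes n is a multiple of chunks)
  index <- seq(1, n/chunks) + (n/chunks * (i - 1))
  ## plot = FALSE: only the counts are needed, so skip drawing each chunk
  counts[i, ] <- hist(x[index], breaks = fuzzybreaks, plot = FALSE)$counts
}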

## The heights of your bars
colSums(counts)
## results using hist() on x all at once
hist(x)$counts
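
If you want an actual plot from the summed counts rather than just the
numbers, one option is to assemble an object that looks like what
hist() returns and hand it to plot().  This is only a sketch: the
breaks and counts below are made-up stand-ins for `breaks` and
`colSums(counts)`, and it relies on plot() for class "histogram" using
the breaks, counts, density, and equidist components.

```r
## Made-up stand-ins for the breaks and the summed per-bin counts
breaks <- seq(-4, 4, by = 1)
counts <- c(1, 21, 136, 342, 341, 136, 22, 1)

## Build a list with the same components hist() returns
h <- structure(
  list(breaks   = breaks,
       counts   = counts,
       ## density = count / (total count * bin width)
       density  = counts / (sum(counts) * diff(breaks)),
       mids     = breaks[-length(breaks)] + diff(breaks) / 2,
       xname    = "x",
       equidist = TRUE),
  class = "histogram")

## Dispatches to the plot method for "histogram" objects
plot(h, main = "Histogram from pre-binned counts")
```

The same counts could also go to barplot() or be put in a small data
frame for ggplot2, since at this point there is one number per bin.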

You would not even need to know the number of chunks you were going to
split your data into beforehand; I just did it for convenience and to
instantiate a full-sized matrix to hold the results.  If you are
selecting subsets of your data using SQL rather than R, it becomes
even simpler.  Once you have your fuzzybreaks, you just keep calling
hist() on each new chunk using the predefined breaks and saving the
results.

Still, I do not exceed about 4.5 GB of memory just to plot a histogram
of a 100 million observation vector, and it is difficult to imagine
the shape of the distribution changing appreciably if you used a
random sample of 100 million observations instead of the full data.
It also takes less than 10 seconds to calculate and draw the histogram
on my computer.  The point being, I suspect you will spend more time
getting everything set up and working than seems worth it: you can
already create a histogram of such a large vector easily and quickly,
and the distribution is unlikely to vary anyway.  Whatever floats your
boat, though.
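
For what it is worth, here is a rough sketch of the "keep calling
hist() and saving the results" idea when the number of chunks is not
known in advance.  Everything specific in it is an assumption for
illustration: the data live in a temporary text file with one number
per line, the breaks are fixed up front and wide enough to cover all
of the data, and the chunk size of 128 lines is arbitrary.

```r
## Stand-in for a large on-disk dataset: one number per line
set.seed(1)
x <- rnorm(1000)
tmp <- tempfile(fileext = ".txt")
writeLines(as.character(x), tmp)

## Assumed known beforehand (e.g., from min/max of the full data)
breaks <- seq(-6, 6, by = 0.5)

## Running per-bin totals; no matrix and no chunk count needed
total <- numeric(length(breaks) - 1)

con <- file(tmp, open = "r")
repeat {
  lines <- readLines(con, n = 128)   # read the next chunk of lines
  if (length(lines) == 0) break      # nothing left to read
  chunk <- as.numeric(lines)
  ## plot = FALSE: count only, do not draw each chunk
  total <- total + hist(chunk, breaks = breaks, plot = FALSE)$counts
}
close(con)
unlink(tmp)

## The heights of the bars for the full data
total
```

Only one chunk is in memory at a time (apart from the simulated x used
to create the file), and hist() will stop with an error if a chunk
contains values outside the predefined breaks.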

Cheers,

Josh

>
> Paul

-- 
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
https://joshuawiley.com/