[R] qqnorm & huge datasets

cberry at tajo.ucsd.edu
Thu Dec 22 22:13:26 CET 2011


Bert Gunter <gunter.berton at gene.com> writes:

> Chuck:
>
> A bad idea, I think: Rounding to unique values loses data density,
> while sampling preserves it (to display resolution -- also a form of
> rounding).

Bert,

The subject is "qqnorm & huge datasets".

Downsampling for qqplots produces unreliable tails: a random subsample
seldom includes the most extreme observations. Try downsampling the
example below to see this.
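
A minimal sketch of that experiment, assuming the same kind of data as
the example below. The seed, the 1% subsample size and the names idx,
xx.sub, qq.full and qq.sub are illustrative choices, not part of the
original example:

```r
set.seed(1)                       # arbitrary seed, only for repeatability
xx <- rexp(1e05)                  # same kind of data as the quoted example
qq.full <- qqnorm(xx, plot.it = FALSE)

# Hypothetical downsample: keep a random 1% of the observations
idx <- sample(length(xx), 1000)
xx.sub <- xx[idx]
qq.sub <- qqnorm(xx.sub, plot.it = FALSE)

# The subsample spans a narrower range of theoretical quantiles ...
print(range(qq.full$x))
print(range(qq.sub$x))

# ... and its largest observation can be no larger than the full-data one
print(c(full = max(xx), sub = max(xx.sub)))
```

Plotting qq.sub over qq.full should show the subsample stopping short
of the extreme points, and a different draw of idx gives a different
upper tail.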

The distortion induced by unique(round(...,4)) in this context is
invisible on any display much smaller than a billboard (and not
misleading even in that case as the tails are fully shown).
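
A rough way to check this, again assuming simulated data like the
example below. The seed, the range of digits tried and the names qq,
slim and err are illustrative assumptions:

```r
set.seed(1)                       # arbitrary seed, only for repeatability
xx <- rexp(1e05)
qq <- as.data.frame(qqnorm(xx, plot.it = FALSE))

# How many distinct points survive at each rounding level
for (digits in 2:4) {
  slim <- unique(round(qq, digits))
  cat("digits =", digits, " points kept =", nrow(slim), "\n")
}

# With 4 digits no coordinate moves by more than about half a unit
# in the fourth decimal place
err <- max(abs(as.matrix(round(qq, 4) - qq)))
print(err)

# The extreme points are still present, up to that rounding
slim <- unique(round(qq, 4))
print(range(qq$y))
print(range(slim$y))
```

The exact counts depend on the simulated data, so no particular output
is claimed here.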

When 'data density in huge datasets' is the subject of interest, then a
scatterplot would be a bad idea --- unless you have a billboard or a
movie screen to display it on.

Chuck
>
> -- Bert
>
> On Thu, Dec 22, 2011 at 11:10 AM,  <cberry at tajo.ucsd.edu> wrote:
>> Sam Steingold <sds at gnu.org> writes:
>>
>>> Hi,
>>> When I run qqnorm on a vector of length 10M+ I get a huge pdf file which
>>> cannot be loaded by acroread or evince.
>>> Any suggestions? (apart from sampling the data).
>>> Thanks.
>>
>> Among the other suggestions, I did not notice mention of another
>> trick for slimming down graphs of many points, viz.
>>
>> Do not plot points that substantially overlap:
>>
>>> xx <- rexp(1e05)
>>> qq.results <- qqnorm(xx, plot.it=FALSE)
>>> qq.slim <- unique(round(as.data.frame(qq.results),3))
>>> dim(qq.slim)
>> [1] 10233     2
>>> plot(qq.slim)
>>>
>>
>> Choose the digits arg in round to be large enough that points which
>> do not overlap can still be seen, and small enough to slim down the
>> number of plotted points. In the example above, 10233 vs 100000.
>>
>> HTH,
>>
>> Chuck
>>
>> --
>> Charles C. Berry                            Dept of Family/Preventive Medicine
>> cberry at ucsd edu                          UC San Diego
>> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>>


