[R] qqnorm & huge datasets

peter dalgaard pdalgd at gmail.com
Thu Dec 22 19:09:15 CET 2011


On Dec 22, 2011, at 18:17 , Sam Steingold wrote:

>> * peter dalgaard <cqnytq at tznvy.pbz> [2011-12-21 23:59:18 +0100]:
>> On Dec 21, 2011, at 23:10 , Sam Steingold wrote:
>>> When qqnorm on a vector of length 10M+ I get a huge pdf file which
>>> cannot be loaded by acroread or evince.
>>> Any suggestions? (apart from sampling the data).
>> 
>> Sample intelligently? Things like
>> 
>>> qq <- seq(-4,4,,10001)
>>> qqplot(qq,quantile(x,pnorm(qq)),type="l")
> 
> Perfect! Thanks!
> 
>  m <- mean(x); s <- sd(x);
>  qq <- seq(min(x), max(x),, sqrt(length(x)));
>  qu <- quantile(x, pnorm(qq, mean=m, sd=s));
>  qqplot(qq, qu, type="l", xlab=paste("normal(",m,",",s,")"),
>         ylab="log scaled weights",
>         main="log scaled weight quantile");

I tried this with exponentially distributed x, and it did reveal a weakness. If qq has values way off the normal range, you end up with the last bit of your curve being a horizontal line through max(x) because pnorm(qq,...) is essentially 1.00. 

So somehow you should restrict the range of qq to what is compatible with a normal distribution, rather than what is observed in data. 
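
One way to do that, as a rough sketch rather than a recipe: pick the range on the standard normal scale from the sample size and rescale by m and s. The cutoff 1/(n+1), the 1001 grid points and the exponential test data are assumptions made for illustration; none of them come from the thread.

```r
set.seed(1)
x <- rexp(1e5)                 # illustrative skewed data, not the poster's
n <- length(x)
m <- mean(x); s <- sd(x)

# Most extreme probabilities a normal sample of size n would plausibly
# reach (assumed cutoff), turned into a grid on the standard normal scale.
p.lo <- 1/(n + 1)
z <- seq(qnorm(p.lo), qnorm(1 - p.lo), length.out = 1001)

qq <- m + s*z                               # normal(m, s) quantiles
qu <- quantile(x, pnorm(z), names = FALSE)  # matching sample quantiles

plot(qq, qu, type = "l",
     xlab = "normal quantiles", ylab = "sample quantiles")
```

Because pnorm(z) stays strictly inside (0,1), the curve no longer runs flat along max(x) at the right-hand end.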


> 
> Now, how do I add the perfect line there?

abline(0,1), perhaps? Or maybe retrace the logic of qqline and work out the line through the quartiles. Lessee... Does this do it?

qua <- quantile(x, c(.25,.75))
slope <- diff(qua)/diff(qnorm(c(.25,.75),mean=m,sd=s))
int <- mean(qua)-slope*m
abline(int,slope) 
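
A quick sanity check of that logic, as a sketch on made-up normal data (the seed, sample size, mean and sd are arbitrary assumptions): the line should reproduce the sample quartiles when evaluated at the normal(m,s) quartiles.

```r
set.seed(2)
x <- rnorm(1e5, mean = 3, sd = 2)   # illustrative data only
m <- mean(x); s <- sd(x)

qua <- quantile(x, c(.25, .75), names = FALSE)
nq <- qnorm(c(.25, .75), mean = m, sd = s)
slope <- diff(qua)/diff(nq)
int <- mean(qua) - slope*m

# Evaluate the line at the two normal quartiles; the result
# should agree with the sample quartiles in qua.
fitted <- int + slope*nq
print(rbind(qua, fitted))
print(c(int = int, slope = slope))
```

For data that really are normal, slope should come out close to 1 and int close to 0, which is why abline(0,1) is the natural first guess.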

> Why do neither qqline(qq) nor qqline(qu) add anything to the plot?

Why should they? I suspect that if you do qqnorm(qq) and qqnorm(qu), you'll realize that the scales don't match...
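
To see the mismatch in numbers, here is a sketch that redoes what qqline is documented to do (a line through the quartiles of its argument against standard normal quartiles) and checks where that line sits over the x-range of the plot. The data, the seed and the grid size of 300 are assumptions chosen for illustration.

```r
set.seed(3)
x <- rnorm(1e5, mean = 20, sd = 2)   # illustrative data only
m <- mean(x); s <- sd(x)
qq <- seq(min(x), max(x), length.out = 300)
qu <- quantile(x, pnorm(qq, mean = m, sd = s), names = FALSE)

# The line qqline(qu) would draw: quartiles of qu against
# *standard* normal quartiles, i.e. x values of about -0.67 and 0.67.
y <- quantile(qu, c(.25, .75), names = FALSE)
z <- qnorm(c(.25, .75))
slope <- diff(y)/diff(z)
int <- y[1] - slope*z[1]

# Height of that line over the plotted x-range, against the plotted y-range.
line.range <- int + slope*range(qq)
print(line.range)
print(range(qu))
```

With these numbers the whole line lies above the plotted y-range, so it is drawn but falls outside the plot region.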

> 
> -- 
> Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
> http://www.memritv.org http://palestinefacts.org
> http://thereligionofpeace.com http://mideasttruth.com http://pmw.org.il
> nobody's life, liberty or property are safe while the legislature is in session

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


