[R] Re: an interesting qqnorm question

Sun Apr 24 00:07:12 CEST 2005

If I understand your problem, you are computing the difference between
your data and the quantiles of a standard gaussian variable -- in
other words, the difference between the data and the red line, in the
following picture.

  N <- 100  # Sample size
  m <- 1    # Mean
  s <- 2    # dispersion
  x <- m + s * rt(N, df=2)  # Non-gaussian data

  qqnorm(x)
  abline(0,1, col="red") 

And you get 

  y <- sort(x) - qnorm(ppoints(N))
  hist(y)

This is probably not the right line: the intercept is wrong because your
mean is 1, and the slope is wrong as well. If the data were gaussian, you
could estimate the intercept with the mean and the slope with the
standard deviation.
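
For comparison, here is a sketch of the line you would draw if the data
really were gaussian, with the sample mean as intercept and the sample
standard deviation as slope. The seed and the names (y_sd, the colour)
are my additions, only there to make the example self-contained. With
heavy-tailed data such as these, the standard deviation tends to be
inflated by the tails, so do not expect this line to follow the bulk of
the points.

```r
set.seed(1)               # my addition, for reproducibility only
N <- 100                  # Sample size
m <- 1                    # Mean
s <- 2                    # Scale
x <- m + s * rt(N, df=2)  # Non-gaussian data, as above

qqnorm(x)
abline(0, 1, col="red")                  # standard gaussian line
abline(mean(x), sd(x), col="darkgreen")  # gaussian with the sample mean and sd

# Differences with respect to that line
y_sd <- sort(x) - (mean(x) + sd(x) * qnorm(ppoints(N)))
hist(y_sd)
```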

You can use the "qqline" function to get the line passing through the
first and third quartiles, which is probably closer to what you have
in mind.

  qqnorm(x)
  abline(0,1, col="red") 
  qqline(x, col="blue")

The differences are 

  x1 <- quantile(x, .25)
  x2 <- quantile(x, .75)
  b <- (x2-x1) / (qnorm(.75)-qnorm(.25))
  a <- x1 - b * qnorm(.25)
  y <- sort(x) - (a + b * qnorm(ppoints(N)))
  hist(y)

And you want to know where the differences cease to be "significantly"
different from zero.

  plot(y)
  abline(h=0, lty=3)

You can use the plot to fix a threshold, but unless you have a model
describing how non-gaussian your data are, this will be empirical.
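
One way to make the threshold a little less arbitrary is to state a
reference model explicitly and simulate from it. The sketch below
assumes a gaussian reference whose scale is estimated from the
interquartile range; that choice, the helper qq_diff, the 1000
replications and the 95% level are all my assumptions, not something
taken from your code. The band is pointwise, not simultaneous, so a few
points outside it are expected even for gaussian data.

```r
set.seed(1)                # my addition, for reproducibility only
N <- 100
x <- 1 + 2 * rt(N, df=2)   # same kind of data as above

# Differences with respect to the line through the quartiles
# (qq_diff is a hypothetical helper, written for this example)
qq_diff <- function(x) {
  n <- length(x)
  q <- quantile(x, c(.25, .75))
  b <- (q[2] - q[1]) / (qnorm(.75) - qnorm(.25))
  a <- q[1] - b * qnorm(.25)
  as.vector(sort(x) - (a + b * qnorm(ppoints(n))))
}
y <- qq_diff(x)

# Assumed reference model: gaussian, scale estimated from the IQR
scale_hat <- IQR(x) / (qnorm(.75) - qnorm(.25))
sim   <- replicate(1000, qq_diff(rnorm(N, sd = scale_hat)))  # N x 1000 matrix
lower <- apply(sim, 1, quantile, probs = .025)
upper <- apply(sim, 1, quantile, probs = .975)

plot(y)
abline(h=0, lty=3)
lines(lower, col="blue")
lines(upper, col="blue")

# Ranks (positions in the sorted sample) that fall outside the band
outside <- which(y < lower | y > upper)
```

The ranks in "outside" that sit at the two ends of the sorted sample are
candidates for the two tail groups; where exactly you cut remains a
judgment call.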

You will note that, in those simulations, the differences (either
yours or those from the lines through the first and third quartiles)
are not gaussian at all.
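
This may also explain why your own "diff" looked perfectly normal. Your
data are concentrated very close to 1 (s is about 0.0076 in your
fitdistr output), while the theoretical quantiles span several units,
so l$y - l$x is essentially 1 minus the theoretical quantiles, which
are gaussian by construction. The sketch below imitates that situation;
the sample size and the seed are my assumptions, and the parameters are
rounded from your fitdistr output.

```r
set.seed(1)                         # my addition, for reproducibility only
N  <- 1000                          # hypothetical sample size
kk <- 1 + 0.0076 * rt(N, df=3.74)   # rounded values from the fitdistr output

l <- qqnorm(kk)
d <- l$y - l$x                      # your "diff", without the loop

range(kk)     # very narrow, close to 1
range(l$x)    # several units wide

qqnorm(d)     # looks gaussian...
cor(d, -l$x)  # ...because d is almost exactly 1 - l$x
```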

-- Vincent

On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> hope it is not b/c some central limit theory, otherwise my initial
> plan will fail :)
> 
> On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> > Hi, r-gurus:
> > 
> > I happened to have a question in my work:
> > 
> > I have a dataset, which has only one dimension, like
> > 0.99037297527605
> > 0.991179836732708
> > 0.995635340631367
> > 0.997186769599305
> > 0.991632565640424
> > 0.984047197106486
> > 0.99225943762649
> > 1.00555642128421
> > 0.993725402926564
> > ....
> > 
> > the data is saved in a file called f392.txt.
> > 
> > I used the following codes to play around :)
> > 
> > k<-read.table("f392.txt", header=F)    # read into k
> > kk<-k[[1]]
> > l<-qqnorm(kk)
> > diff=c()
> > lenk<-length(kk)
> > i=1
> > while (i<=lenk){
> > diff[i]=l$y[i]-l$x[i]   # save the difference of theoretical quantile and sample quantile
> >                            # remember, my sample mean is around 1 while the theoretical one, 0
> > i<-i+1
> > }
> > hist(diff, breaks=300)  # analyze the distr of such diff
> > qqnorm(diff)
> > 
> > my question is:
> > from l<-qqnorm(kk), I wanted to know, from which point (or cut), the
> > sample points start to move away from theoretical ones. That's the
> > reason I played around the "diff" list, which gives me the difference.
> > To my surprise, the diff is perfectly normal. I tried to use some
> > kk<-c(1, 2, 5, -1 , ...) to test, I concluded it must be some
> > distribution my sample follows gives this finding.
> > 
> > So, any suggestion on the distribution of my sample?   I think there
> > might be some mathematical inference which can leads this observation,
> > but not quite sure.
> > 
> > btw,
> > > fitdistr(kk, 't')
> >         m              s              df
> >   9.999965e-01   7.630770e-03   3.742244e+00
> >  (5.317674e-05) (5.373884e-05) (8.584725e-02)
> > 
> > btw2, can anyone suggest a way to find the "cut" or "threshold" from
> > my sample to discretize them into 3 groups: two tail-group and one
> > main group.--------- my focus.
> > 
> > Thanks,
> > 
> > Ed
> >
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
>