[R] have to point it out again: a distribution question

Thu Apr 28 22:18:14 CEST 2005

Here is a summary of the following:
l<-qqnorm(kk) # kk is my sample
# l$y is my sample
# l$x is the theoretical quantiles
diff<-l$y-l$x

and 
> summary(l$y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.9007  0.9942  0.9998  0.9999  1.0060  1.1070
> summary(l$x)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-4.145e+00 -6.745e-01  0.000e+00  2.383e-17  6.745e-01  4.145e+00
> summary(diff)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-3.0380  0.3311  0.9998  0.9999  1.6690  5.0460

Comparing diff with l$x, although the 1st Qu. and 3rd Qu. are different,
diff and l$x seem similar to each other, which is confirmed by
qqnorm(l$x) and qqnorm(diff).
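
A quick arithmetic check on the summaries above (a sketch using only the printed values, so agreement is limited by rounding): l$x spans about -4.1 to 4.1 while l$y only spans about 0.90 to 1.11, so diff should decrease as l$x increases, and the quantiles of diff should pair up with those of l$y - l$x in reverse order. The names y_q, x_q, diff_q and predicted are made up for this illustration.

```r
# Printed summary values (Min, 1st Qu., Median, 3rd Qu., Max), copied from above
y_q    <- c(0.9007, 0.9942, 0.9998, 1.0060, 1.1070)   # summary(l$y)
x_q    <- c(-4.145, -0.6745, 0, 0.6745, 4.145)        # summary(l$x)
diff_q <- c(-3.0380, 0.3311, 0.9998, 1.6690, 5.0460)  # summary(diff)

# If diff falls as l$x rises, the largest (y, x) pair gives the smallest diff
predicted <- rev(y_q - x_q)

# Discrepancies between predicted and printed quantiles of diff
round(predicted - diff_q, 4)
```

The discrepancies are all well below 0.001, which is consistent with diff being roughly 1 - l$x.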

Running the following code:

r<-rnorm(1000)+1 # since my sample is shifted from zero to 1
qq(r[r>0.9 & r<1.2])  # select the central part

This gives me a straight line now.
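
The same effect can be sketched with simulated data. This is only an illustration under assumptions: kk_sim is a hypothetical stand-in for kk (the real data are not posted), built from a t distribution with location, scale and df rounded from the fitdistr() output quoted below, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical stand-in for kk: values rounded from the quoted fitdistr() output
kk_sim <- 1 + 0.0077 * rt(10000, df = 3.76)

l <- qqnorm(kk_sim, plot.it = FALSE)
d <- l$y - l$x          # same quantity as diff inside qq()

sd(l$y) / sd(l$x)       # spread of the sample relative to the N(0,1) quantiles
cor(d, l$x)             # near -1: d is essentially (constant - l$x)
```

When the ratio of spreads is tiny, d is dominated by -l$x, so qqnorm(d) mostly plots normal quantiles against themselves, whatever the shape of the sample.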

Thanks for the good explanation of the phenomenon.

Then, Reid, or other r-gurus, is there a good way to discretize the
sample into 3 categories: the 2 tails and the body?
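
One possible rule of thumb, sketched under assumptions rather than offered as the answer: take the line through the first and third quartiles (the one qqline() draws, as in Vincent's message quoted further down) and call a point a tail point when it lies more than k robust standard deviations from the center. The cutoff k = 3, the stand-in data x and the names a, b and grp are all assumptions made for this illustration.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical stand-in for kk (the real data are not posted)
x <- 1 + 0.0077 * rt(10000, df = 3.76)

# Intercept a and slope b of the line through the quartiles (what qqline() draws);
# b acts as a robust estimate of the standard deviation of the body
q <- unname(quantile(x, c(0.25, 0.75)))
b <- (q[2] - q[1]) / (qnorm(0.75) - qnorm(0.25))
a <- q[1] - b * qnorm(0.25)

k <- 3  # assumed cutoff in robust-sd units; an empirical choice, not a derived one

# Three groups: below a - k*b, between the cutoffs, above a + k*b
grp <- cut(x,
           breaks = c(-Inf, a - k * b, a + k * b, Inf),
           labels = c("lower tail", "body", "upper tail"))
table(grp)
```

Without a model for how the tails depart from the normal body, the choice of k stays empirical; plotting the sample with the two cutoffs marked is a reasonable way to judge it.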

Thanks again,

Weiwei

On 4/28/05, Huntsinger, Reid <reid_huntsinger at merck.com> wrote:
> Stock returns and other financial data have often been found to be heavy-tailed.
> Even Cauchy distributions (without even a first absolute moment) have been
> entertained as models.
> 
> Your qq function subtracts numbers on the scale of a normal (0,1)
> distribution from the input data. When the variation in the input data is
> insignificant compared to 1, say, then l$y - l$x gives you back essentially
> the "theoretical quantiles", i.e. the "x" component of the list (with the
> sign flipped and shifted by a constant). l$x is basically a sample from a
> normal(0,1) distribution, so the points do line up perfectly in the second
> qqnorm(). Is that what's happening?
> 
> Reid Huntsinger
> 
> 
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of WeiWei Shi
> Sent: Thursday, April 28, 2005 1:38 PM
> To: Vincent ZOONEKYND
> Cc: R-help at stat.math.ethz.ch
> Subject: [R] have to point it out again: a distribution question
> 
> Dear R-helpers:
> I pointed out my question last time but it is only partially solved.
> So I would like to point it out again since I think it is very
> interesting, at least to me.
> It is not a question about how to use R; instead it is a kind of
> theoretical plus practical question, expressed in R.
> 
> I came across this question when I built a model for some stock returns.
> That's the reason I cannot post the complete data here. But I would
> like to attach some plots here (I zipped them since the original ones
> are too big).
> 
> The first plot qq1, is qqnorm plot of my sample, giving me some
> "S"-shape. Since I am not very experienced, I am not sure what kind of
> distribution my sample follows.
> 
> The second plot, qq2, is obtained via
> qqnorm(rt(10000, 4)) since I ran
> fitdistr(kk, 't') and got
>         m              s              df
>   9.998789e-01   7.663799e-03   3.759726e+00
>  (5.332631e-05) (5.411400e-05) (8.684956e-02)
> 
> The second plot seems to say my sample distr follows t-distr. (not sure of
> this)
> 
> BTW, what are the commands for simulating other distributions such as
> log-normal, exponential, and so on?
> 
> The third one was obtained by running the following R code:
> 
> Suppose my data is read into dataset k from file "f392.txt":
> k<-read.table("f392.txt", header=F)    # read into k
> kk<-k[[1]]
> qq(kk)
> 
> qq function is defined as below:
> qq<-function(dataset){
> l<-qqnorm(dataset, plot.it=F)
> diff<-l$y-l$x # difference b/w sample and its theoretical quantile
> qqnorm(diff)
> }
> 
> The most interesting thing is (if there is not any stupid game here,
> and if my sample follows some kind of distribution (no matter if such
> distr has been found or not)), my qq function seems like a way to
> evaluate it. But what I am worried about is that the line is too "perfect",
> which indicates there is something goofy here that could probably be
> explained by some mathematical reasoning. However, I used
> qq(rnorm(10000))
> qq(rt(10000, 3.7))
> qq(rf(....))
> 
> None of them gave me this perfect line!
> 
> Sorry for the long question but I want to make it clear to everybody
> about my question. I tried my best :)
> 
> Thanks for your reading,
> 
> Weiwei (Ed) Shi, Ph.D
> 
> On 4/23/05, Vincent ZOONEKYND <zoonek at gmail.com> wrote:
> > If I understand your problem, you are computing the difference between
> > your data and the quantiles of a standard gaussian variable -- in
> > other words, the difference between the data and the red line, in the
> > following picture.
> >
> >   N <- 100  # Sample size
> >   m <- 1    # Mean
> >   s <- 2    # dispersion
> >   x <- m + s * rt(N, df=2)  # Non-gaussian data
> >
> >   qqnorm(x)
> >   abline(0,1, col="red")
> >
> > And you get
> >
> >   y <- sort(x) - qnorm(ppoints(N))
> >   hist(y)
> >
> > This is probably not the right line (not only because your mean is 1,
> > the slope is wrong as well -- if the data were gaussian, you could
> > estimate it with the standard deviation).
> >
> > You can use the "qqline" function to get the line passing through the
> > first and third quartiles, which is probably closer to what you have
> > in mind.
> >
> >   qqnorm(x)
> >   abline(0,1, col="red")
> >   qqline(x, col="blue")
> >
> > The differences are
> >
> >   x1 <- quantile(x, .25)
> >   x2 <- quantile(x, .75)
> >   b <- (x2-x1) / (qnorm(.75)-qnorm(.25))
> >   a <- x1 - b * qnorm(.25)
> >   y <- sort(x) - (a + b * qnorm(ppoints(N)))
> >   hist(y)
> >
> > And you want to know when the differences cease to be "significantly"
> > different from zero.
> >
> >   plot(y)
> >   abline(h=0, lty=3)
> >
> > You can use the plot to fix a threshold, but unless you have a model
> > describing how non-gaussian your data are, this will be empirical.
> >
> > You will note that, in those simulations, the differences (either
> > yours or those from the lines through the first and third quartiles)
> > are not gaussian at all.
> >
> > -- Vincent
> >
> >
> > On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> > > I hope it is not b/c of some central limit theory, otherwise my initial
> > > plan will fail :)
> > >
> > > On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> > > > Hi, r-gurus:
> > > >
> > > > I happened to have a question in my work:
> > > >
> > > > I have a dataset, which has only one dimension, like
> > > > 0.99037297527605
> > > > 0.991179836732708
> > > > 0.995635340631367
> > > > 0.997186769599305
> > > > 0.991632565640424
> > > > 0.984047197106486
> > > > 0.99225943762649
> > > > 1.00555642128421
> > > > 0.993725402926564
> > > > ....
> > > >
> > > > the data is saved in a file called f392.txt.
> > > >
> > > > I used the following code to play around :)
> > > >
> > > > k<-read.table("f392.txt", header=F)    # read into k
> > > > kk<-k[[1]]
> > > > l<-qqnorm(kk)
> > > > diff=c()
> > > > lenk<-length(kk)
> > > > i=1
> > > > while (i<=lenk){
> > > > # save the difference between sample quantile and theoretical quantile;
> > > > # remember, my sample mean is around 1 while the theoretical one is 0
> > > > diff[i]=l$y[i]-l$x[i]
> > > > i<-i+1
> > > > }
> > > > hist(diff, breaks=300)  # analyze the distr of such diff
> > > > qqnorm(diff)
> > > >
> > > > my question is:
> > > > from l<-qqnorm(kk), I wanted to know from which point (or cut) the
> > > > sample points start to move away from the theoretical ones. That's the
> > > > reason I played around with the "diff" list, which gives me the difference.
> > > > To my surprise, the diff is perfectly normal. I tried to use some
> > > > kk<-c(1, 2, 5, -1 , ...) to test, and I concluded that it must be the
> > > > distribution my sample follows that gives this finding.
> > > >
> > > > So, any suggestion on the distribution of my sample?   I think there
> > > > might be some mathematical reasoning which can lead to this observation,
> > > > but I am not quite sure.
> > > >
> > > > btw,
> > > > > fitdistr(kk, 't')
> > > >         m              s              df
> > > >   9.999965e-01   7.630770e-03   3.742244e+00
> > > >  (5.317674e-05) (5.373884e-05) (8.584725e-02)
> > > >
> > > > btw2, can anyone suggest a way to find the "cut" or "threshold" from
> > > > my sample to discretize them into 3 groups: two tail groups and one
> > > > main group. This is my focus.
> > > >
> > > > Thanks,
> > > >
> > > > Ed
> > > >
> > >