[R] have to point it out again: a distribution question

Huntsinger, Reid reid_huntsinger at merck.com
Thu Apr 28 20:40:58 CEST 2005


Stock returns and other financial data have often been found to be heavy-tailed.
Even Cauchy distributions (without even a first absolute moment) have been
entertained as models.

Your qq function subtracts numbers on the scale of a normal(0,1)
distribution from the input data. When the input data are scaled so that
their spread is insignificant compared to 1, say, then l$y - l$x gives you
back essentially the "theoretical quantiles", i.e. the "x" component of the
list, up to sign and a constant shift. l$x is basically a sample from a
normal(0,1) distribution, so the points do line up perfectly in the second
qqnorm(). Is that what's happening?
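
A minimal sketch of this effect, assuming simulated data in place of the
real returns (which were not posted). The location, scale and df are
rounded from the fitdistr() output quoted below. The names qq_diff and
qq_cor are made up for this illustration.

```r
set.seed(1)                            # reproducible simulation
n  <- 10000

# Assumption: a scaled t sample standing in for the unposted returns
kk <- 1 + 0.0077 * rt(n, df = 3.76)

# Same computation as the qq() function in the original post, minus the plot
qq_diff <- function(x) {
  l <- qqnorm(x, plot.it = FALSE)      # l$x: normal quantiles, l$y: the data
  l$y - l$x
}

# Correlation between sample and normal quantiles; 1 means a perfect line
qq_cor <- function(v) {
  q <- qqnorm(v, plot.it = FALSE)
  cor(q$x, q$y)
}

l <- qqnorm(kk, plot.it = FALSE)
d <- qq_diff(kk)

r_lx       <- cor(d, l$x)              # close to -1: d is roughly 1 - l$x
r_scaled   <- qq_cor(d)                # close to 1: the "perfect" line
r_unscaled <- qq_cor(qq_diff(rt(n, df = 3.7)))  # spread comparable to 1

print(c(r_lx = r_lx, r_scaled = r_scaled, r_unscaled = r_unscaled))
```

If this is what is going on, the straight line only reflects the scale of
the data relative to the normal quantiles. It says nothing about the
distribution of the returns.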

Reid Huntsinger



-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of WeiWei Shi
Sent: Thursday, April 28, 2005 1:38 PM
To: Vincent ZOONEKYND
Cc: R-help at stat.math.ethz.ch
Subject: [R] have to point it out again: a distribution question


Dear R-helpers:
I raised my question last time, but it was only partially solved.
So I would like to raise it again, since I think it is very
interesting, at least to me.
It is not a question about how to use R; rather, it is a
theoretical plus practical question, illustrated with R.

I came across this question when I built a model for some stock returns.
That's the reason I cannot post the complete data here. But I would
like to attach some plots here (I zipped them since the original ones
are too big).

The first plot, qq1, is a qqnorm plot of my sample, which shows an
"S" shape. Since I am not very experienced, I am not sure what kind of
distribution my sample follows.

The second plot, qq2, is obtained via
qqnorm(rt(10000, 4)) since I ran
fitdistr(kk, 't') and got
        m              s              df
  9.998789e-01   7.663799e-03   3.759726e+00
 (5.332631e-05) (5.411400e-05) (8.684956e-02)

The second plot seems to say my sample follows a t distribution (not sure
of this).

BTW, what are the commands for simulating other distributions like
log-normal, exponential, and so on?

The third one was obtained by running the following R code:

Suppose my data is read into dataset k from file "f392.txt":
k<-read.table("f392.txt", header=FALSE)    # read into k
kk<-k[[1]]
qq(kk)


The qq function is defined as below:
qq<-function(dataset){
  l<-qqnorm(dataset, plot.it=FALSE)
  diff<-l$y-l$x # difference b/w sample and its theoretical quantile
  qqnorm(diff)
}


The most interesting thing is that (if there is no trick at play here,
and if my sample follows some kind of distribution, whether or not such
a distribution is already known) my qq function seems like a way to
evaluate it. But what worries me is that the line is too "perfect",
which indicates there is something goofy here that could be explained
by some mathematical argument. However, I used
qq(rnorm(10000))
qq(rt(10000, 3.7))
qq(rf(....))

None of them gave me this perfect line!

Sorry for the long question, but I want to make my question clear to
everybody. I tried my best :)

Thanks for reading,

Weiwei (Ed) Shi, Ph.D



On 4/23/05, Vincent ZOONEKYND <zoonek at gmail.com> wrote:
> If I understand your problem, you are computing the difference between
> your data and the quantiles of a standard gaussian variable -- in
> other words, the difference between the data and the red line, in the
> following picture.
> 
>   N <- 100  # Sample size
>   m <- 1    # Mean
>   s <- 2    # dispersion
>   x <- m + s * rt(N, df=2)  # Non-gaussian data
> 
>   qqnorm(x)
>   abline(0,1, col="red")
> 
> And you get
> 
>   y <- sort(x) - qnorm(ppoints(N))
>   hist(y)
> 
> This is probably not the right line (not only because your mean is 1,
> the slope is wrong as well -- if the data were gaussian, you could
> estimate it with the standard deviation).
> 
> You can use the "qqline" function to get the line passing through the
> first and third quartiles, which is probably closer to what you have
> in mind.
> 
>   qqnorm(x)
>   abline(0,1, col="red")
>   qqline(x, col="blue")
> 
> The differences are
> 
>   x1 <- quantile(x, .25)
>   x2 <- quantile(x, .75)
>   b <- (x2-x1) / (qnorm(.75)-qnorm(.25))
>   a <- x1 - b * qnorm(.25)
>   y <- sort(x) - (a + b * qnorm(ppoints(N)))
>   hist(y)
> 
> And you want to know when the differences cease to be "significantly"
> different from zero.
> 
>   plot(y)
>   abline(h=0, lty=3)
> 
> You can use the plot to fix a threshold, but unless you have a model
> describing how non-gaussian your data are, this will be empirical.
> 
> You will note that, in those simulations, the differences (either
> yours or those from the lines through the first and third quartiles)
> are not gaussian at all.
> 
> -- Vincent
> 
> 
> On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> > hope it is not b/c of some central limit theory, otherwise my initial
> > plan will fail :)
> >
> > On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> > > Hi, r-gurus:
> > >
> > > I happened to have a question in my work:
> > >
> > > I have a dataset, which has only one dimension, like
> > > 0.99037297527605
> > > 0.991179836732708
> > > 0.995635340631367
> > > 0.997186769599305
> > > 0.991632565640424
> > > 0.984047197106486
> > > 0.99225943762649
> > > 1.00555642128421
> > > 0.993725402926564
> > > ....
> > >
> > > the data is saved in a file called f392.txt.
> > >
> > > I used the following code to play around :)
> > >
> > > k<-read.table("f392.txt", header=FALSE)    # read into k
> > > kk<-k[[1]]
> > > l<-qqnorm(kk)
> > > diff=c()
> > > lenk<-length(kk)
> > > i=1
> > > while (i<=lenk){
> > >   # save the difference of theoretical quantile and sample quantile
> > >   # remember, my sample mean is around 1 while the theoretical one is 0
> > >   diff[i]=l$y[i]-l$x[i]
> > >   i<-i+1
> > > }
> > > hist(diff, breaks=300)  # analyze the distr of such diff
> > > qqnorm(diff)
> > >
> > > my question is:
> > > from l<-qqnorm(kk), I wanted to know at which point (or cut) the
> > > sample points start to move away from the theoretical ones. That's the
> > > reason I played around with the "diff" list, which gives me the
> > > difference. To my surprise, the diff is perfectly normal. I tried some
> > > kk<-c(1, 2, 5, -1 , ...) to test, and I concluded that it must be the
> > > distribution my sample follows that gives this finding.
> > >
> > > So, any suggestions on the distribution of my sample?   I think there
> > > might be some mathematical argument which leads to this observation,
> > > but I am not quite sure.
> > >
> > > btw,
> > > > fitdistr(kk, 't')
> > >         m              s              df
> > >   9.999965e-01   7.630770e-03   3.742244e+00
> > >  (5.317674e-05) (5.373884e-05) (8.584725e-02)
> > >
> > > btw2, can anyone suggest a way to find the "cut" or "threshold" from
> > > my sample to discretize it into 3 groups: two tail groups and one
> > > main group? This is my focus.
> > >
> > > Thanks,
> > >
> > > Ed
> > >
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html
> >
>



