[R] have to point it out again: a distribution question

Huntsinger, Reid reid_huntsinger at merck.com
Fri Apr 29 22:28:41 CEST 2005


There are many ways to discretize data. That's one way of looking at
clustering ("vector quantization"). You might also look into modelling
approaches which don't require it: splines, trees, etc. What sort of data
mining are you trying to do?
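
As a minimal sketch of the two views above, and not a recommendation for your
data: the simulated sample, the 5% tail probability, and the names tail_prob,
cuts, grp and km are all assumptions made for illustration. The first part
cuts at empirical quantiles, the second treats discretization as
one-dimensional clustering.

```r
# Illustrative only: simulated heavy-tailed data standing in for the real sample.
set.seed(1)
kk <- 1 + 0.0077 * rt(10000, df = 4)

# Assumed tail probability; 5% in each tail is an arbitrary choice.
tail_prob <- 0.05
cuts <- unname(quantile(kk, probs = c(tail_prob, 1 - tail_prob)))

# cut() assigns each value to one of three ordered intervals.
grp <- cut(kk,
           breaks = c(-Inf, cuts, Inf),
           labels = c("lower tail", "body", "upper tail"))
print(table(grp))

# Clustering view ("vector quantization"): k-means with 3 centers on the
# same one-dimensional data. The clusters need not match the tails/body split.
km <- kmeans(kk, centers = 3, nstart = 10)
print(table(km$cluster))
```

Where the cutpoints should go depends on what the categories are used for,
so the 5% here is only a placeholder.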

Reid Huntsinger

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of WeiWei Shi
Sent: Friday, April 29, 2005 3:22 PM
To: bogdan romocea
Cc: R-help at stat.math.ethz.ch
Subject: Re: [R] have to point it out again: a distribution question


The goal is discretization from a continuous domain to a categorical one,
so that some data mining algorithm can be applied to it.  Maybe there
should be more than 3 categories, I don't know.
I googled some papers in the financial field, and any more suggestions or
references would be helpful.

Ed


On 4/29/05, bogdan romocea <br44114 at gmail.com> wrote:
> > Then, Reid, or other r-gurus, is there a good way to discretize
> > the sample into 3 categories: 2 tails and the body?
> 
> Out of curiosity, how do you plan to use that information? What would
> you do if you knew that the 'body' starts here and ends there?
> 
> 
> -----Original Message-----
> From: WeiWei Shi [mailto:helprhelp at gmail.com]
> Sent: Thursday, April 28, 2005 4:18 PM
> To: Huntsinger, Reid
> Cc: R-help at stat.math.ethz.ch
> Subject: Re: [R] have to point it out again: a distribution question
> 
> Here is a summary of
> l<-qqnorm(kk) # kk is my sample
> l$y (which is my sample)
> l$x (which is the theoretical quantile)
> diff<-l$y-l$x
> 
> and
> > summary(l$y)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  0.9007  0.9942  0.9998  0.9999  1.0060  1.1070
> > summary(l$x)
>       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
> -4.145e+00 -6.745e-01  0.000e+00  2.383e-17  6.745e-01  4.145e+00
> > summary(diff)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> -3.0380  0.3311  0.9998  0.9999  1.6690  5.0460
> 
> Comparing diff with l$x, though the 1st Qu. and 3rd Qu. are different,
> diff and l$x seem similar to each other, which is confirmed by
> qqnorm(l$x) and qqnorm(diff).
> 
> Running the following code:
> 
> r<-rnorm(1000)+1 # since my sample is shifted from zero to 1
> qq(r[r>0.9 & r<1.2])  # select the central part
> 
> gives me a straight line now.
> 
> Thanks for the good explanation of the phenomenon.
> 
> Then, Reid, or other r-gurus, is there a good way to discretize the
> sample into 3 categories: 2 tails and the body?
> 
> Thanks again,
> 
> Weiwei
> 
> On 4/28/05, Huntsinger, Reid <reid_huntsinger at merck.com> wrote:
> > Stock returns and other financial data have often been found to be
> > heavy-tailed. Even Cauchy distributions (without even a first absolute
> > moment) have been entertained as models.
> >
> > Your qq function subtracts numbers on the scale of a normal(0,1)
> > distribution from the input data. When the spread of the input data is
> > insignificant compared to 1, say, then l$y - l$x gives you back
> > essentially the "theoretical quantiles", i.e. the "x" component of the
> > list (up to sign and a constant shift). l$x is basically a sample from a
> > normal(0,1) distribution so they do line up perfectly in the second
> > qqnorm(). Is that what's happening?
> >
> > Reid Huntsinger
> >
> >
> > -----Original Message-----
> > From: r-help-bounces at stat.math.ethz.ch
> > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of WeiWei Shi
> > Sent: Thursday, April 28, 2005 1:38 PM
> > To: Vincent ZOONEKYND
> > Cc: R-help at stat.math.ethz.ch
> > Subject: [R] have to point it out again: a distribution question
> >
> > Dear R-helpers:
> > I pointed out my question last time but it was only partially solved.
> > So I would like to point it out again since I think it is very
> > interesting, at least to me.
> > It is not a question about how to use R; instead it is a kind of
> > theoretical plus practical question, represented in R.
> >
> > I came across this question when I built a model for some stock returns.
> > That's the reason I cannot post the complete data here. But I would
> > like to attach some plots here (I zipped them since the original ones
> > are too big).
> >
> > The first plot qq1, is qqnorm plot of my sample, giving me some
> > "S"-shape. Since I am not very experienced, I am not sure what kind of
> > distribution my sample follows.
> >
> > The second plot, qq2, is obtained via
> > qqnorm(rt(10000, 4)) since I ran
> > fitdistr(kk, 't') and got
> >         m              s              df
> >   9.998789e-01   7.663799e-03   3.759726e+00
> >  (5.332631e-05) (5.411400e-05) (8.684956e-02)
> >
> > The second plot seems to say my sample distr follows a t-distr. (not sure
> > of this)
> >
> > BTW, what are the commands for simulating other distrs like log-normal,
> > exponential, and so on?
> >
> > The third one was obtained by running the following R code:
> >
> > Suppose my data is read into dataset k from file "f392.txt":
> > k<-read.table("f392.txt", header=F)    # read into k
> > kk<-k[[1]]
> > qq(kk)
> >
> > qq function is defined as below:
> > qq<-function(dataset){
> > l<-qqnorm(dataset, plot.it=F)
> > diff<-l$y-l$x # difference b/w sample and its theoretical quantile
> > qqnorm(diff)
> > }
> >
> > The most interesting thing is (if there is not any stupid game here,
> > and if my sample follows some kind of distribution (no matter if such
> > distr has been found or not)), my qq function seems like a way to
> > evaluate it. But what I am worried about is that the line is too
> > "perfect", which indicates there is something goofy here, which could be
> > shown via some mathematical inference. However I used
> > qq(rnorm(10000))
> > qq(rt(10000, 3.7))
> > qq(rf(....))
> >
> > None of them gave me this perfect line!
> >
> > Sorry for the long question but I want to make it clear to everybody
> > about my question. I tried my best :)
> >
> > Thanks for your reading,
> >
> > Weiwei (Ed) Shi, Ph.D
> >
> > On 4/23/05, Vincent ZOONEKYND <zoonek at gmail.com> wrote:
> > > If I understand your problem, you are computing the difference between
> > > your data and the quantiles of a standard gaussian variable -- in
> > > other words, the difference between the data and the red line, in the
> > > following picture.
> > >
> > >   N <- 100  # Sample size
> > >   m <- 1    # Mean
> > >   s <- 2    # dispersion
> > >   x <- m + s * rt(N, df=2)  # Non-gaussian data
> > >
> > >   qqnorm(x)
> > >   abline(0,1, col="red")
> > >
> > > And you get
> > >
> > >   y <- sort(x) - qnorm(ppoints(N))
> > >   hist(y)
> > >
> > > This is probably not the right line (not only because your mean is 1,
> > > the slope is wrong as well -- if the data were gaussian, you could
> > > estimate it with the standard deviation).
> > >
> > > You can use the "qqline" function to get the line passing through the
> > > first and third quartiles, which is probably closer to what you have
> > > in mind.
> > >
> > >   qqnorm(x)
> > >   abline(0,1, col="red")
> > >   qqline(x, col="blue")
> > >
> > > The differences are
> > >
> > >   x1 <- quantile(x, .25)
> > >   x2 <- quantile(x, .75)
> > >   b <- (x2-x1) / (qnorm(.75)-qnorm(.25))
> > >   a <- x1 - b * qnorm(.25)
> > >   y <- sort(x) - (a + b * qnorm(ppoints(N)))
> > >   hist(y)
> > >
> > > And you want to know when the differences cease to be "significantly"
> > > different from zero.
> > >
> > >   plot(y)
> > >   abline(h=0, lty=3)
> > >
> > > You can use the plot to fix a threshold, but unless you have a model
> > > describing how non-gaussian your data are, this will be empirical.
> > >
> > > You will note that, in those simulations, the differences (either
> > > yours or those from the lines through the first and third quartiles)
> > > are not gaussian at all.
> > >
> > > -- Vincent
> > >
> > >
> > > On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> > > > I hope it is not because of some central limit theory, otherwise my
> > > > initial plan will fail :)
> > > >
> > > > On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> > > > > Hi, r-gurus:
> > > > >
> > > > > I happened to have a question in my work:
> > > > >
> > > > > I have a dataset, which has only one dimension, like
> > > > > 0.99037297527605
> > > > > 0.991179836732708
> > > > > 0.995635340631367
> > > > > 0.997186769599305
> > > > > 0.991632565640424
> > > > > 0.984047197106486
> > > > > 0.99225943762649
> > > > > 1.00555642128421
> > > > > 0.993725402926564
> > > > > ....
> > > > >
> > > > > the data is saved in a file called f392.txt.
> > > > >
> > > > > I used the following code to play around :)
> > > > >
> > > > > k<-read.table("f392.txt", header=F)    # read into k
> > > > > kk<-k[[1]]
> > > > > l<-qqnorm(kk)
> > > > > diff=c()
> > > > > lenk<-length(kk)
> > > > > i=1
> > > > > while (i<=lenk){
> > > > > diff[i]=l$y[i]-l$x[i]   # difference of sample and theoretical quantile
> > > > >                         # (my sample mean is around 1, the theoretical one is 0)
> > > > > i<-i+1
> > > > > }
> > > > > hist(diff, breaks=300)  # analyze the distr of such diff
> > > > > qqnorm(diff)
> > > > >
> > > > > my question is:
> > > > > from l<-qqnorm(kk), I wanted to know from which point (or cut) the
> > > > > sample points start to move away from the theoretical ones. That's
> > > > > the reason I played around with the "diff" list, which gives me the
> > > > > difference. To my surprise, the diff is perfectly normal. I tried
> > > > > some kk<-c(1, 2, 5, -1 , ...) to test, and I concluded that it must
> > > > > be the distribution my sample follows that gives this finding.
> > > > >
> > > > > So, any suggestion on the distribution of my sample? I think there
> > > > > might be some mathematical inference which can lead to this
> > > > > observation, but I am not quite sure.
> > > > >
> > > > > btw,
> > > > > > fitdistr(kk, 't')
> > > > >         m              s              df
> > > > >   9.999965e-01   7.630770e-03   3.742244e+00
> > > > >  (5.317674e-05) (5.373884e-05) (8.584725e-02)
> > > > >
> > > > > btw2, can anyone suggest a way to find the "cut" or "threshold" in
> > > > > my sample to discretize it into 3 groups: two tail groups and one
> > > > > main group? That is my focus.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Ed
> > > > >
> > > >
> > > > ______________________________________________
> > > > R-help at stat.math.ethz.ch mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide!
> > > > http://www.R-project.org/posting-guide.html
> > > >
> > >
> >
> >