[R] have to point it out again: a distribution question

Fri Apr 29 22:42:49 CEST 2005

In general, I have been trying "tree" algorithms, which of course include
many variants. I also did some clustering (EM, K-means, K-medians, etc.).
But since the original distribution is not normal, IMHO those methods
might not work.
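
One way to discretize without assuming normality is to cut at empirical quantiles. Below is a minimal sketch, not a recommendation: the data are simulated (a scaled t-distribution using the s and df values from the fitdistr() output quoted further down), and the 5%/95% cut points are an arbitrary assumption chosen only for illustration.

```r
# Illustrative only: simulated heavy-tailed data standing in for the real sample.
set.seed(1)
kk_sim <- 1 + 0.0077 * rt(10000, df = 3.76)

# Assumed cut points: the empirical 5% and 95% quantiles (an arbitrary choice).
cuts <- quantile(kk_sim, probs = c(0.05, 0.95))

# cut() maps the continuous values onto a three-level factor.
grp <- cut(kk_sim,
           breaks = c(-Inf, cuts, Inf),
           labels = c("lower tail", "body", "upper tail"))

table(grp)   # number of observations in each group
```

The quantile levels are the only tuning knob here; more than three categories can be obtained by passing more probabilities to quantile().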

Weiwei

On 4/29/05, Huntsinger, Reid <reid_huntsinger at merck.com> wrote:
> There are many ways to discretize data. That's one way of looking at
> clustering ("vector quantization"). You might also look into modelling
> approaches which don't require it: splines, trees, etc. What sort of data
> mining are you trying to do?
> 
> Reid Huntsinger
> 
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of WeiWei Shi
> Sent: Friday, April 29, 2005 3:22 PM
> To: bogdan romocea
> Cc: R-help at stat.math.ethz.ch
> Subject: Re: [R] have to point it out again: a distribution question
> 
> Discretization from a continuous domain to a categorical one, so that some
> data mining algorithm can be applied to it.  Maybe there should be
> more than 3 categories, I don't know.
> I googled some papers in the financial field; any more suggestions or
> references would be helpful.
> 
> Ed
> 
> On 4/29/05, bogdan romocea <br44114 at gmail.com> wrote:
> > > Then, Reid, or other r-gurus, is there a good way to discretize
> > > the sample into 3 categories: 2 tails and the body?
> >
> > Out of curiosity, how do you plan to use that information? What would
> > you do if you knew that the 'body' starts here and ends there?
> >
> >
> > -----Original Message-----
> > From: WeiWei Shi [mailto:helprhelp at gmail.com]
> > Sent: Thursday, April 28, 2005 4:18 PM
> > To: Huntsinger, Reid
> > Cc: R-help at stat.math.ethz.ch
> > Subject: Re: [R] have to point it out again: a distribution question
> >
> > Here is a summary of
> > l<-qqnorm(kk) # kk is my sample
> > l$y (which is my sample)
> > l$x (which is the theoretical quantile)
> > diff<-l$y-l$x
> >
> > and
> > > summary(l$y)
> >    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> >  0.9007  0.9942  0.9998  0.9999  1.0060  1.1070
> > > summary(l$x)
> >       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
> > -4.145e+00 -6.745e-01  0.000e+00  2.383e-17  6.745e-01  4.145e+00
> > > summary(diff)
> >    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> > -3.0380  0.3311  0.9998  0.9999  1.6690  5.0460
> >
> > Comparing diff with l$x, though the 1st Qu. and 3rd Qu. are different,
> > diff and l$x seem similar to each other, which is confirmed by
> > qqnorm(l$x) and qqnorm(diff).
> >
> > Running the following code:
> >
> > r<-rnorm(1000)+1 # since my sample shift from zero to 1
> > qq(r[r>0.9 & r<1.2])  # select the central part
> >
> > this gives me a straight line now.
> >
> > Thanks for the good explanation of the phenomenon.
> >
> > Then, Reid, or other r-gurus, is there a good way to discretize the
> > sample into 3 categories: 2 tails and the body?
> >
> > Thanks again,
> >
> > Weiwei
> >
> > On 4/28/05, Huntsinger, Reid <reid_huntsinger at merck.com> wrote:
> > > Stock returns and other financial data have often been found to be
> > > heavy-tailed. Even Cauchy distributions (without even a first absolute
> > > moment) have been entertained as models.
> > >
> > > Your qq function subtracts numbers on the scale of a normal (0,1)
> > > distribution from the input data. When the input data are scaled so that
> > > they are insignificant compared to 1, say, then you get essentially the
> > > "theoretical quantiles", i.e. the "x" component of the list, back from
> > > l$x - l$y. l$x is basically a sample from a normal(0,1) distribution, so
> > > they do line up perfectly in the second qqnorm(). Is that what's happening?
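
The effect described above can be reproduced with simulated data. This is a minimal sketch under stated assumptions: the sample is simulated (centred near 1 with a spread of about 0.0077, as in the fitdistr() output quoted below), and the rescaling step at the end is just one possible remedy, not something proposed in the thread.

```r
set.seed(42)
# Simulated stand-in for the real sample: centred near 1 with a tiny spread.
kk_sim <- 1 + 0.0077 * rt(10000, df = 3.76)

l <- qqnorm(kk_sim, plot.it = FALSE)
d <- l$y - l$x        # what qq() computes before calling qqnorm()

# l$y varies far less than l$x, so d is dominated by -l$x (shifted by about 1).
r <- cor(d, l$x)      # expected to be very close to -1
r

# Assumed remedy: put the sample on the N(0,1) scale before differencing,
# using robust estimates of location and spread.
z  <- (kk_sim - median(kk_sim)) / mad(kk_sim)
lz <- qqnorm(z, plot.it = FALSE)
dz <- lz$y - lz$x     # no longer just a mirror image of the normal quantiles
```

Calling qqnorm(d) on the first difference gives the "too perfect" line; qqnorm(dz) does not, because the two terms are then on comparable scales.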
> > >
> > > Reid Huntsinger
> > >
> > >
> > > -----Original Message-----
> > > From: r-help-bounces at stat.math.ethz.ch
> > > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of WeiWei Shi
> > > Sent: Thursday, April 28, 2005 1:38 PM
> > > To: Vincent ZOONEKYND
> > > Cc: R-help at stat.math.ethz.ch
> > > Subject: [R] have to point it out again: a distribution question
> > >
> > > Dear R-helpers:
> > > I pointed out my question last time but it is only partially solved.
> > > So I would like to point it out again since I think it is very
> > > interesting, at least to me.
> > > It is not a question about how to use R; instead it is a kind of
> > > theoretical plus practical question, expressed in R.
> > >
> > > I came across this question when I built a model for some stock returns.
> > > That's the reason I cannot post the complete data here. But I would
> > > like to attach some plots here (I zipped them since the original ones
> > > are too big).
> > >
> > > The first plot qq1, is qqnorm plot of my sample, giving me some
> > > "S"-shape. Since I am not very experienced, I am not sure what kind of
> > > distribution my sample follows.
> > >
> > > The second plot, qq2, is obtained via
> > > qqnorm(rt(10000, 4)) since I run
> > > fitdistr(kk, 't') and got
> > >         m              s              df
> > >   9.998789e-01   7.663799e-03   3.759726e+00
> > >  (5.332631e-05) (5.411400e-05) (8.684956e-02)
> > >
> > > The second plot seems to say my sample distr follows a t-distr. (not sure
> > > of this)
> > >
> > > BTW, what are the commands for simulating other distributions like
> > > log-normal, exponential, and so on?
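
For reference, base R's stats package follows an r<name>() naming pattern for random generation. A minimal sketch follows; the sample size and all parameter values are arbitrary and chosen only for illustration.

```r
set.seed(123)
n <- 1000   # arbitrary sample size

x_lnorm  <- rlnorm(n, meanlog = 0, sdlog = 1)     # log-normal
x_exp    <- rexp(n, rate = 1)                     # exponential
x_gamma  <- rgamma(n, shape = 2, rate = 1)        # gamma
x_weib   <- rweibull(n, shape = 1.5, scale = 1)   # Weibull
x_cauchy <- rcauchy(n, location = 0, scale = 1)   # Cauchy

summary(x_lnorm)
```

Any of these vectors can be passed to qqnorm() in the same way as rt(10000, 4) above; the matching d<name>(), p<name>() and q<name>() functions give the density, distribution function and quantiles.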
> > >
> > > The third one was obtained by running the following R code:
> > >
> > > Suppose my data is read into dataset k from file "f392.txt":
> > > k<-read.table("f392.txt", header=F)    # read into k
> > > kk<-k[[1]]
> > > qq(kk)
> > >
> > > qq function is defined as below:
> > > qq<-function(dataset){
> > > l<-qqnorm(dataset, plot.it=F)
> > > diff<-l$y-l$x # difference b/w sample and its theoretical quantile
> > > qqnorm(diff)
> > > }
> > >
> > > The most interesting thing is (if there is not any stupid game here,
> > > and if my sample follows some kind of distribution (no matter if such
> > > distr has been found or not)), my qq function seems like a way to
> > > evaluate it. But what worries me is that the line is too "perfect",
> > > which indicates there is something goofy here, which could probably be
> > > shown by some mathematical argument. However, I used
> > > qq(rnorm(10000))
> > > qq(rt(10000, 3.7))
> > > qq(rf(....))
> > >
> > > None of them gave me this perfect line!
> > >
> > > Sorry for the long question but I want to make it clear to everybody
> > > about my question. I tried my best :)
> > >
> > > Thanks for your reading,
> > >
> > > Weiwei (Ed) Shi, Ph.D
> > >
> > > On 4/23/05, Vincent ZOONEKYND <zoonek at gmail.com> wrote:
> > > > If I understand your problem, you are computing the difference between
> > > > your data and the quantiles of a standard gaussian variable -- in
> > > > other words, the difference between the data and the red line, in the
> > > > following picture.
> > > >
> > > >   N <- 100  # Sample size
> > > >   m <- 1    # Mean
> > > >   s <- 2    # dispersion
> > > >   x <- m + s * rt(N, df=2)  # Non-gaussian data
> > > >
> > > >   qqnorm(x)
> > > >   abline(0,1, col="red")
> > > >
> > > > And you get
> > > >
> > > >   y <- sort(x) - qnorm(ppoints(N))
> > > >   hist(y)
> > > >
> > > > This is probably not the right line (not only because your mean is 1,
> > > > the slope is wrong as well -- if the data were gaussian, you could
> > > > estimate it with the standard deviation).
> > > >
> > > > You can use the "qqline" function to get the line passing through the
> > > > first and third quartiles, which is probably closer to what you have
> > > > in mind.
> > > >
> > > >   qqnorm(x)
> > > >   abline(0,1, col="red")
> > > >   qqline(x, col="blue")
> > > >
> > > > The differences are
> > > >
> > > >   x1 <- quantile(x, .25)
> > > >   x2 <- quantile(x, .75)
> > > >   b <- (x2-x1) / (qnorm(.75)-qnorm(.25))
> > > >   a <- x1 - b * qnorm(.25)
> > > >   y <- sort(x) - (a + b * qnorm(ppoints(N)))
> > > >   hist(y)
> > > >
> > > > And you want to know when the differences cease to be "significantly"
> > > > different from zero.
> > > >
> > > >   plot(y)
> > > >   abline(h=0, lty=3)
> > > >
> > > > You can use the plot to fix a threshold, but unless you have a model
> > > > describing how non-gaussian your data are, this will be empirical.
> > > >
> > > > You will note that, in those simulations, the differences (either
> > > > yours or those from the lines through the first and third quartiles)
> > > > are not gaussian at all.
> > > >
> > > > -- Vincent
> > > >
> > > >
> > > > On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> > > > > hope it is not b/c of some central limit theorem, otherwise my initial
> > > > > plan will fail :)
> > > > >
> > > > > On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> > > > > > Hi, r-gurus:
> > > > > >
> > > > > > I happened to have a question in my work:
> > > > > >
> > > > > > I have a dataset, which has only one dimension, like
> > > > > > 0.99037297527605
> > > > > > 0.991179836732708
> > > > > > 0.995635340631367
> > > > > > 0.997186769599305
> > > > > > 0.991632565640424
> > > > > > 0.984047197106486
> > > > > > 0.99225943762649
> > > > > > 1.00555642128421
> > > > > > 0.993725402926564
> > > > > > ....
> > > > > >
> > > > > > the data is saved in a file called f392.txt.
> > > > > >
> > > > > > I used the following code to play around :)
> > > > > >
> > > > > > k<-read.table("f392.txt", header=F)    # read into k
> > > > > > kk<-k[[1]]
> > > > > > l<-qqnorm(kk)
> > > > > > diff=c()
> > > > > > lenk<-length(kk)
> > > > > > i=1
> > > > > > while (i<=lenk){
> > > > > > diff[i]=l$y[i]-l$x[i]   # save the difference of theoretical quantile
> > > > > >                         # and sample quantile
> > > > > >                         # remember, my sample mean is around 1
> > > > > >                         # while the theoretical one is 0
> > > > > > i<-i+1
> > > > > > }
> > > > > > hist(diff, breaks=300)  # analyze the distr of such diff
> > > > > > qqnorm(diff)
> > > > > >
> > > > > > my question is:
> > > > > > from l<-qqnorm(kk), I wanted to know from which point (or cut) the
> > > > > > sample points start to move away from the theoretical ones. That's
> > > > > > the reason I played around with the "diff" list, which gives me the
> > > > > > difference. To my surprise, the diff is perfectly normal. I tried
> > > > > > some kk<-c(1, 2, 5, -1 , ...) to test, and concluded that it must be
> > > > > > the distribution my sample follows that gives this finding.
> > > > > >
> > > > > > So, any suggestion on the distribution of my sample?   I think there
> > > > > > might be some mathematical argument which leads to this observation,
> > > > > > but I am not quite sure.
> > > > > >
> > > > > > btw,
> > > > > > > fitdistr(kk, 't')
> > > > > >         m              s              df
> > > > > >   9.999965e-01   7.630770e-03   3.742244e+00
> > > > > >  (5.317674e-05) (5.373884e-05) (8.584725e-02)
> > > > > >
> > > > > > btw2, can anyone suggest a way to find the "cut" or "threshold" in
> > > > > > my sample to discretize it into 3 groups: two tail groups and one
> > > > > > main group? This is my focus.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Ed
> > > > > >
> > > > >
> > > > > ______________________________________________
> > > > > R-help at stat.math.ethz.ch mailing list
> > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > PLEASE do read the posting guide!
> > > > > http://www.R-project.org/posting-guide.html
> > > > >
> > > >
> > >
> > >
> > > ----------------------------------------------------------------------
> > > Notice:  This e-mail message, together with any attachment...{{dropped}}