[R] outliers using Random Forest
Liaw, Andy
andy_liaw at merck.com
Mon Apr 19 14:30:42 CEST 2004
> From: Edgar Acuna [mailto:edgar at cs.uprm.edu]
>
> Dear Andy,
> Thanks for your quick answer. I increased the number of trees and the
> outlyingness measure became more stable. But I still do not know whether I
> am working with the raw measure or with the normalized measure mentioned
> in Breiman's Wald lecture. The normalized measure nout is
>
> nout=(nout-med)/mean(abs(nout-med))
> where med is the median of the class containing the case corresponding
> to nout.
Looking at the Fortran subroutine `locateout' in rfsub.f: yes, the values
returned are normalized. (That part of the code is unchanged from Breiman &
Cutler's original.)
Andy
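
For readers who want to check the arithmetic, here is a minimal sketch of the
normalization exactly as written in the quoted formula: each raw score is
centered on the median of its class and divided by the class mean absolute
deviation from that median. It is not the Fortran code in rfsub.f and it does
not call randomForest; the function name, the toy scores and the labels are
all made up for illustration, and returning 0 when a class has zero spread is
an assumption of this sketch.

```python
from statistics import median


def normalize_outlyingness(raw, labels):
    """Apply nout = (nout - med) / mean(abs(nout - med)) within each class.

    raw    -- raw outlyingness scores, one per case (hypothetical input)
    labels -- class label of each case, same length as raw
    """
    # Group the raw scores by class.
    by_class = {}
    for score, label in zip(raw, labels):
        by_class.setdefault(label, []).append(score)

    # Per class: the median, and the mean absolute deviation from that median.
    stats = {}
    for label, scores in by_class.items():
        med = median(scores)
        dev = sum(abs(s - med) for s in scores) / len(scores)
        stats[label] = (med, dev)

    # Normalize each case using the statistics of its own class.
    out = []
    for score, label in zip(raw, labels):
        med, dev = stats[label]
        # Assumption of this sketch: a class with no spread gets score 0.
        out.append((score - med) / dev if dev > 0 else 0.0)
    return out


# Toy data (invented): one large raw score in class "a".
raw = [1.0, 2.0, 3.0, 10.0, 5.0, 5.0, 6.0]
labels = ["a", "a", "a", "a", "b", "b", "b"]
scores = normalize_outlyingness(raw, labels)
print(scores)
```

Because the scaling is done class by class, a case is judged against the
spread of its own class only, so normalized values are comparable across
classes of different tightness.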
> Best regards
> Edgar Acuna
>
> On Sun, 18 Apr 2004, Liaw, Andy wrote:
>
> > The thing to do is probably:
> >
> > 1. Use a fairly large number of trees (e.g., 1000).
> > 2. Run a few times and average the results.
> >
> > The reason for the instability is twofold:
> >
> > 1. The random forest algorithm itself is based on randomization. That's
> > why it's probably a good idea to have 500-1000 trees to get more stable
> > proximity measures (on which the outlying measures are based).
> >
> > 2. If you are running randomForest in unsupervised mode (i.e., not giving
> > it the class labels), then the program treats the data as "class 1",
> > creates a synthetic "class 2", and runs the classification algorithm to
> > get the proximity measures. You probably need to run the algorithm a few
> > times so that the result will be based on several simulated data sets,
> > instead of just one.
> >
> > HTH,
> > Andy
> >
> > > From: Edgar Acuna
> > >
> > > Hello,
> > > Does anybody know if the outscale option of randomForest yields the
> > > standardized version of the outlier measure for each case, or are the
> > > results only the raw values? Also, I have noticed that this measure
> > > shows very high variability. I mean that if I repeat the experiment I
> > > get very different values for this measure, and it is hard to flag the
> > > outliers. This does not happen with two other criteria that I am using:
> > > LOF and Bay's Orca. I am getting several cases that can be considered
> > > outliers with both approaches.
> > > I ran my experiments using Bupa and Diabetes, available at the
> > > UCI Machine Learning Repository.
> > >
> > > Thanks in advance for any response.
> > >
> > > ______________________________________________
> > > R-help at stat.math.ethz.ch mailing list
> > > https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide!
> > > http://www.R-project.org/posting-guide.html