[R] outliers using Random Forest

Edgar Acuna edgar at cs.uprm.edu
Mon Apr 19 10:58:38 CEST 2004


Dear Andy,
Thanks for your quick answer. I increased the number of trees and the
outlyingness measure became more stable. But I still do not know whether I
am working with the raw measure or with the normalized measure mentioned
in Breiman's Wald lecture. The normalized measure nout is

nout=(nout-med)/mean(abs(nout-med))
where med is the median of the measure within the class containing the
case corresponding to nout.
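
To make the normalization concrete, here is a minimal sketch in plain Python. It is not the randomForest code, and I do not claim it matches what the package does internally. It only applies the formula exactly as written above, class by class, using the mean absolute deviation from the class median as the scale. The function name normalize_outlyingness and the sample values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, median


def normalize_outlyingness(raw, labels):
    """Apply nout = (nout - med) / mean(abs(nout - med)) within each class.

    raw    : raw outlyingness values, one per case (hypothetical input)
    labels : class label of each case, same length as raw
    """
    # Group the raw values by class label.
    by_class = defaultdict(list)
    for value, label in zip(raw, labels):
        by_class[label].append(value)

    # Per-class median and mean absolute deviation from that median.
    center = {}
    scale = {}
    for label, values in by_class.items():
        med = median(values)
        center[label] = med
        scale[label] = mean([abs(v - med) for v in values])

    # Normalize each case with the statistics of its own class.
    # A class whose values are all identical has scale 0; return 0.0 there
    # (an arbitrary choice made for this sketch).
    return [
        (v - center[l]) / scale[l] if scale[l] > 0 else 0.0
        for v, l in zip(raw, labels)
    ]


# Made-up raw values: four cases in class 1, two cases in class 2.
example_raw = [1.0, 2.0, 3.0, 10.0, 4.0, 6.0]
example_labels = [1, 1, 1, 1, 2, 2]
example_out = normalize_outlyingness(example_raw, example_labels)
```

In this toy example class 1 has median 2.5 and mean absolute deviation 2.5, so the case with raw value 10 gets a normalized value of 3.0, while the other class 1 cases stay between -0.6 and 0.2.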

Best regards
Edgar Acuna

On Sun, 18 Apr 2004, Liaw, Andy wrote:

> The thing to do is probably:
>
> 1. Use a fairly large number of trees (e.g., 1000).
> 2. Run a few times and average the results.
>
> The reason for the instability is twofold:
>
> 1. The random forest algorithm itself is based on randomization.  That's why
> it's probably a good idea to have 500-1000 trees to get more stable
> proximity measures (on which the outlying measures are based).
>
> 2. If you are running randomForest in unsupervised mode (i.e., not giving it
> the class labels), then the program treats the data as "class 1", creates a
> synthetic "class 2", and runs the classification algorithm to get the
> proximity measures.  You probably need to run the algorithm a few times so
> that the result will be based on several simulated data sets, instead of
> just one.
>
> HTH,
> Andy
>
> > From: Edgar Acuna
> >
> > Hello,
> > Does anybody know whether the outscale option of randomForest yields the
> > standardized version of the outlier measure for each case, or are the
> > results only the raw values? I have also noticed that this measure shows
> > very high variability: if I repeat the experiment I get very different
> > values, and it is hard to flag the outliers. This does not happen with
> > the two other criteria that I am using, LOF and Bay's Orca; with both of
> > those approaches I get several cases that can be considered outliers.
> > I ran my experiments using the Bupa and Diabetes data sets available at
> > the UCI Machine Learning Repository.
> >
> > Thanks in advance for any response.
> >



