[R-SIG-Finance] really puzzled by this R script
R. Michael Weylandt
michael.weylandt at gmail.com
Sun Feb 26 02:25:21 CET 2012
This isn't really a finance question...
Your problem is that you use names() instead of just getting the row
numbers from outlierTest, and then convert those names to integers and
try to remove rows by position. Once the data no longer starts at row 1,
the row name and the row position differ, so fullData[-ret, ] doesn't
actually remove the right row, the outlier remains, and the infinite
loop is triggered.
To put it concretely, add a browser() at the top of the while loop and
note this:
x <- 1:20
y <- x; y[c(3, 14)] <- 1000*c(1, 1.05); y <- jitter(y);
dat <- data.frame(x = x, y = y)
rmOutlier(y ~ x, dat[-1,])
## Once in browser note this:
ret # should give "14" because the 14th spot is a problem by construction
fullData[-ret, ] # Still has the outlier at "14" because it's not in
                 # the 14th row!
If you just use the names and index appropriately, you should be fine.
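As a rough sketch of what indexing by name could look like (base R only, no car needed): the variable names sub, bad, byPos and byName are made up for illustration, jitter() is left out so the result is deterministic, and the outlier's name is hard-coded rather than taken from outlierTest.

```r
# Same toy data as above, with the first row dropped so that
# row names and row positions no longer line up.
x <- 1:20
y <- x; y[c(3, 14)] <- 1000 * c(1, 1.05)
dat <- data.frame(x = x, y = y)
sub <- dat[-1, ]              # row names are now "2", ..., "20"

bad <- "14"                   # assumed: the row *name* reported for the outlier

# Positional removal: drops the 14th row of sub, whose name is "15".
byPos <- sub[-as.numeric(bad), ]
"14" %in% rownames(byPos)     # TRUE: the outlier is still there

# Removal by name: drops the row that is actually called "14".
byName <- sub[!(rownames(sub) %in% bad), ]
"14" %in% rownames(byName)    # FALSE: the outlier is gone
```

In rmOutlier that would presumably mean keeping the names as a character vector (dropping the as.numeric() calls) and subsetting with fullData[!(rownames(fullData) %in% ret), ] inside the loop, though I haven't run it against your data.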
Further help is better suited to R-help, as this isn't very
financial. Just a heads up, though: you'll also probably get told off
for mentioning outlier removal on R-help; it's something of a rite of
passage.
Michael
On Sat, Feb 25, 2012 at 1:27 AM, <johnzli at comcast.net> wrote:
> Dear all,
>
> I wrote an R script that tries to identify outliers and returns a non-empty vector containing the indices of the outliers, or a NULL object if there are no outliers.
> I have been puzzled by the strange behavior of this function. Let's say we have 10 outliers in a data frame of 1000 rows.
> 1) If I run rmOutlier(y ~ x1 + x2, xyData), where xyData is a data frame with column names "y", "x1", "x2", the program runs fine and returns the indices of those 10 outliers.
> 2) If I run rmOutlier(y ~ x1 + x2, xyData[1:200, ]), or any subset starting from the first row (i.e., 1:xx), the program runs fine.
> 3) If I run the script on a subset of the data not starting from the first row, e.g., rmOutlier(y ~ x1 + x2, xyData[100:1000, ]), and no outlier falls within xyData[100:1000, ], the program runs fine.
>
> However, in case 3), if any outlier falls within xyData[100:1000, ], the program runs in an infinite loop (the "while" loop in the script). Troubleshooting indicates that outlierTest(lm(lm_form, data = fullData[-ret, ])) always returns the same set of outlier indices, and fullData[-ret, ] seems to have no effect.
>
> What went wrong here? Any help will be greatly appreciated.
>
> Thank you.
>
>
> John Li
>
>
> This is the script:
>
>
> "rmOutlier" <- function(lm_form, fullData) {
> # Find and return Outliers indices based on Bonferroni Outlier Test
> # The program returns a non-empty vector or a NULL object
>
>
> # AUTHOR:
> # John Li
> # Date: Feb 17, 2012
> # Revised: Feb 24, 2012
> #
>
>   require(car, quietly = TRUE)
>
>   #sanity check
>   stopifnot(is.data.frame(fullData), ncol(fullData) > 1, length(names(fullData)) == ncol(fullData))
>
>   outlier <- outlierTest( lm(lm_form, data = fullData), n.max = Inf)
>   if ( outlier$signif ) {
>     ex <- c(as.numeric(names(outlier$rstudent)))
>     ret <- ex
>   }
>   else {
>     ret <- NULL
>     return(ret)
>   }
>
>   while( outlier$signif ) {
>     outlier <- outlierTest( lm(lm_form, data = fullData[-ret, ]), n.max = Inf) # fullData[-ret, ] seems not work
>     if ( outlier$signif ) {
>       ex <- as.numeric(names(outlier$rstudent))
>       ret <- c(ret, ex)
>     }
>   }
>   return(ret)
> }
>