[R] Removing Outliers Function
Ravi Varadhan
rvaradhan at jhmi.edu
Wed Feb 9 04:36:15 CET 2011
David,
Please allow me to digress a lot here. You are one of the few (including yours truly!) who use the phrase "shallow learning curve" to indicate difficulty of learning (I assume this is what you meant). I have always felt that "steep learning curve" is incorrect. If you plot the amount learned on the Y-axis and time on the X-axis, a steep learning curve means that one learns very quickly, which is just the opposite of what is usually meant.
Best,
Ravi.
____________________________________________________________________
Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University
Ph. (410) 502-2619
email: rvaradhan at jhmi.edu
----- Original Message -----
From: David Winsemius <dwinsemius at comcast.net>
Date: Tuesday, February 8, 2011 10:09 pm
Subject: Re: [R] Removing Outliers Function
To: kirtau <kirtau at live.com>
Cc: r-help at r-project.org
> On Feb 8, 2011, at 9:11 PM, kirtau wrote:
>
> >
> >I am working on a function that will remove outliers for regression
> >analysis. I am stating that a data point is an outlier if its
> >studentized residual is above or below 3 and -3, respectively. The
> >code below is what I have thus far for the function
> >
> >x = c(1:20)
> >y = c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
> >data1 = data.frame(x,y)
> >
> >
> >rm.outliers = function(dataset,dependent,independent){
> > dataset$predicted = predict(lm(dependent~independent))
> > dataset$stdres = rstudent(lm(dependent~independent))
> > m = 1
> > for(i in 1:length(dataset$stdres)){
> > dataset$outlier_counter[i] = if(dataset$stdres[i] >= 3 |
> >dataset$stdres[i] <= -3) {m} else{0}
> > }
> > j = length(which(dataset$outlier_counter >= 1))
> > while(j>=1){
> > print(dataset[which(dataset$outlier_counter >= 1),])
> > dataset = dataset[which(dataset$outlier_counter == 0),]
> > dataset$predicted = predict(lm(dependent~independent))
> > dataset$stdres = rstudent(lm(dependent~independent))
> > m = m+1
> > for(k in 1:length(dataset$stdres)){
> > dataset$outlier_counter[k] = if(dataset$stdres[k] >= 3 |
> >dataset$stdres[k] <= -3) {m} else{0}
> > }
> > j = length(which(dataset$outlier_counter >= 1))
> > }
> > return(dataset)
> >}
> >
> >The problem that I run into is that I receive this error when I type
> >
> >rm.outliers(data1,data1$y,data1$x)
> >
> >" x y predicted stdres outlier_counter
> >16 16 85 22.98647 24.04862 1
> >Error in `$<-.data.frame`(`*tmp*`, "predicted", value = c(0.114285714285714,
> >:
> > replacement has 20 rows, data has 19"
> >
> >Note: the outlier_counter variable is used to state which "round" of
> >the loop the datapoint was marked as an outlier.
> >
> >This would be a HUGE help to me and a few buddies who run a lot of different
> >regression tests.
>
> The solution is about 3 or 4 lines of code to make the function, but
> removing outliers like this is simply statistical malpractice. Maybe
> it's a good thing that R has a shallow learning curve.
>
> --
>
> David Winsemius, MD
> West Hartford, CT
>
> ______________________________________________
> R-help at r-project.org mailing list
>
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.
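
On the error in the quoted function: the call passes data1$y and data1$x as plain vectors of length 20, so every lm(dependent~independent) inside the function refits on all 20 points, even after a row has been dropped from dataset. That is why predict() returns 20 values for a data frame with 19 rows. Below is a minimal sketch of one way around it, not the solution David had in mind (which was not posted). The function name flag_outliers and the arguments model_formula and cutoff are made up for illustration. It assumes no missing values in the model variables, and it only flags points and records the round in which each was flagged, rather than deleting them, given the warning above about dropping outliers mechanically.

```r
# Hypothetical helper; the name and arguments are illustrative only.
# The model is refitted on the rows still unflagged in each round, so the
# residuals always line up with the rows they describe.
flag_outliers <- function(dataset, model_formula, cutoff = 3) {
  dataset$outlier_round <- 0L
  round <- 0L
  repeat {
    keep <- dataset$outlier_round == 0L
    fit <- lm(model_formula, data = dataset[keep, , drop = FALSE])
    r <- rstudent(fit)
    # Treat NaN/Inf residuals (possible with very few rows) as "not flagged"
    flagged <- is.finite(r) & abs(r) >= cutoff
    if (!any(flagged)) break
    round <- round + 1L
    # Map positions within the kept subset back to rows of the full data frame
    dataset$outlier_round[which(keep)[flagged]] <- round
  }
  dataset
}

x <- 1:20
y <- c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
data1 <- data.frame(x, y)

res <- flag_outliers(data1, y ~ x)
print(res[res$outlier_round > 0, ])
```

Passing a formula plus the data frame, instead of two free-standing vectors, is what lets lm() see the current subset. Whether the flagged points should then be excluded is a separate, substantive question.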