[R] Removing Outliers Function

David Winsemius dwinsemius at comcast.net
Wed Feb 9 05:43:45 CET 2011


Exactly right. I use the phrase to catch the unwary's attention. I  
think the effect is properly placed on the y-axis.

IIRC, Ben Bolker (or was it Bert Gunter?) has also commented in the R-help
or r-devel pages on this curious inversion of functional meaning.

-- 
David

On Feb 8, 2011, at 10:36 PM, Ravi Varadhan wrote:

> David,
>
> Please allow me to digress a lot here. You are one of the few
> (including yours truly!) who use the phrase "shallow learning
> curve" to indicate difficulty of learning (I assume this is what you
> meant). I always felt that "steep learning curve" was incorrect. If
> you plotted the amount of learning on the Y-axis and time on the
> X-axis, a steep learning curve would mean that one learns very quickly,
> which is just the opposite of what is actually meant.
>
> Best,
> Ravi.
> ____________________________________________________________________
>
> Ravi Varadhan, Ph.D.
> Assistant Professor,
> Division of Geriatric Medicine and Gerontology
> School of Medicine
> Johns Hopkins University
>
> Ph. (410) 502-2619
> email: rvaradhan at jhmi.edu
>
>
> ----- Original Message -----
> From: David Winsemius <dwinsemius at comcast.net>
> Date: Tuesday, February 8, 2011 10:09 pm
> Subject: Re: [R] Removing Outliers Function
> To: kirtau <kirtau at live.com>
> Cc: r-help at r-project.org
>
>
>> On Feb 8, 2011, at 9:11 PM, kirtau wrote:
>>
>>>
>>> I am working on a function that will remove outliers for regression
>>> analysis. I am stating that a data point is an outlier if its studentized
>>> residual is above 3 or below -3. The code below is what I have thus far
>>> for the function:
>>>
>>> x = c(1:20)
>>> y = c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
>>> data1 = data.frame(x,y)
>>>
>>>
>>> rm.outliers = function(dataset,dependent,independent){
>>>  dataset$predicted = predict(lm(dependent~independent))
>>>  dataset$stdres = rstudent(lm(dependent~independent))
>>>  m = 1
>>>  for(i in 1:length(dataset$stdres)){
>>>    dataset$outlier_counter[i] = if(dataset$stdres[i] >= 3 |
>>> dataset$stdres[i] <= -3) {m} else{0}
>>>  }
>>>  j = length(which(dataset$outlier_counter >= 1))
>>>  while(j>=1){
>>>    print(dataset[which(dataset$outlier_counter >= 1),])
>>>    dataset = dataset[which(dataset$outlier_counter == 0),]
>>>    dataset$predicted = predict(lm(dependent~independent))
>>>    dataset$stdres = rstudent(lm(dependent~independent))
>>>      m = m+1
>>>      for(k in 1:length(dataset$stdres)){
>>>        dataset$outlier_counter[k] = if(dataset$stdres[k] >= 3 |
>>> dataset$stdres[k] <= -3) {m} else{0}
>>>      }
>>>    j = length(which(dataset$outlier_counter >= 1))
>>>  }
>>>  return(dataset)
>>> }
>>>
>>> The problem that I run into is that I receive this error when I type
>>>
>>> rm.outliers(data1,data1$y,data1$x)
>>>
>>> "    x  y predicted   stdres outlier_counter
>>> 16 16 85  22.98647 24.04862               1
>>> Error in `$<-.data.frame`(`*tmp*`, "predicted", value = c(0.114285714285714,  :
>>>   replacement has 20 rows, data has 19"
>>>
>>> Note: the outlier_counter variable is used to state which "round" of the
>>> loop the data point was marked as an outlier.
>>>
>>> This would be a HUGE help to me and a few buddies who run a lot of
>>> different regression tests.
>>
>> The solution is about 3 or 4 lines of code to make the function, but
>> removing outliers like this is simply statistical malpractice. Maybe
>> it's a good thing that R has a shallow learning curve.
>>
>> -- 
>>
>> David Winsemius, MD
>> West Hartford, CT
>>

David Winsemius, MD
West Hartford, CT
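
On the mechanics of the error quoted above: "replacement has 20 rows, data has 19" arises because the function is handed the vectors data1$y and data1$x, which keep all 20 values after `dataset` has been cut down to 19 rows, so lm() keeps refitting on the original data. Below is a minimal sketch of one way around that, assuming a formula plus lm()'s `data` argument is acceptable and that the data contain no missing values. The name flag_outliers, its arguments, and the outlier_round column are illustrative and do not come from the thread. It only marks rows rather than deleting them, and it does nothing to answer the objection that discarding outliers by this rule is poor statistical practice.

```r
# Hypothetical helper: the name, arguments and outlier_round column are
# illustrative assumptions, not code from the original thread.
flag_outliers <- function(dataset, formula, cutoff = 3) {
  dataset$outlier_round <- 0L           # 0 means "never flagged"
  keep <- rep(TRUE, nrow(dataset))      # rows still included in the fit
  round <- 0L
  repeat {
    # Refit on the rows still kept; the formula is evaluated inside the
    # subset, so fitted values and residuals match the current row count.
    fit <- lm(formula, data = dataset[keep, , drop = FALSE])
    res <- rstudent(fit)
    flagged <- abs(res) >= cutoff
    flagged[is.na(flagged)] <- FALSE    # treat undefined residuals as not flagged
    if (!any(flagged)) break
    round <- round + 1L
    idx <- which(keep)[flagged]         # map back to rows of the full data frame
    dataset$outlier_round[idx] <- round
    keep[idx] <- FALSE
  }
  dataset
}

# Data from the original post
x <- 1:20
y <- c(1, 3, 4, 2, 5, 6, 18, 8, 10, 8, 11, 13, 14, 14, 15, 85, 17, 19, 19, 20)
data1 <- data.frame(x, y)

result <- flag_outliers(data1, y ~ x)
print(result[result$outlier_round > 0, ])
```

Keeping every row and recording the round in which it was flagged leaves the decision about what to do with those points to the analyst.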


