[R] detect and replace outliers by the average
Richard O'Keefe
r@oknz @end|ng |rom gm@||@com
Sat Apr 22 01:30:43 CEST 2023
This can be seen as three steps:
(1) identify outliers
(2) replace them with NA (trivial)
(3) impute missing values.
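
A minimal sketch of the three steps, done within each level of the
grouping column. The toy data frame `dat` and the helper names
`outliers_to_na` and `impute_mean` are made up for illustration, and the
boxplot rule is only one of many possible ways to flag outliers.

```r
# Hypothetical toy data standing in for the attached file (not the real data).
dat <- data.frame(
  factor = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  x1     = c(10, 11, 12, NA, 11, 50, 20, 21, 19, 22, NA, 20)
)

# Steps (1) and (2): flag values outside the boxplot whiskers, set them to NA.
outliers_to_na <- function(x) {
  x[x %in% boxplot.stats(x)$out] <- NA
  x
}

# Step (3): fill every NA with the mean of the remaining values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# ave() applies FUN separately within each level of the grouping column.
dat$x1_clean <- ave(dat$x1, dat$factor,
                    FUN = function(x) impute_mean(outliers_to_na(x)))
dat
```

Note that `x %in% boxplot.stats(x)$out` is FALSE for NA, so the original
NAs have to be filled in a separate step, as `impute_mean` does here.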
There are packages for imputing missing data.
See
https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
Here I just want to address the first step.
An observation is only an outlier relative to some model.
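
A small illustration of that point, using made-up numbers: the same
values are flagged by the boxplot rule on the raw scale, but nothing is
flagged if you think the process is multiplicative and look at logs.

```r
# Hypothetical data that double at each step.
y <- 2^(0:10)

# On the raw scale the two largest values fall outside the whiskers.
boxplot.stats(y)$out        # 512 1024

# On the log scale the values are evenly spaced and nothing is flagged.
boxplot.stats(log2(y))$out  # numeric(0)
```

Which of the two answers is "right" depends entirely on which model you
believe, not on the numbers themselves.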
Outliers can indicate
- data that are just wrong (data entry error, failing battery in the
  measurement device, and so on). In this case, deletion + imputation
  makes sense.
- data that are generated by a mixture of two or more processes,
  not the single process you thought was there. In this case,
  deletion + imputation is dangerous. The world is trying to tell
  you something and you are ignoring it.
- a model that is wrong. Here again, deletion + imputation is
  dangerous. You need a better model.
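
To make the mixture case concrete, here is a sketch with invented
readings: ten from one process near 10 and three from another near 30.
The numbers are hypothetical, chosen only to show the effect.

```r
# Hypothetical readings from two processes.
y <- c(9, 10, 10, 11, 10, 9, 11, 10, 10, 11, 30, 31, 29)

# The boxplot rule flags every reading from the second process.
flagged <- y %in% boxplot.stats(y)$out
y[flagged]          # 30 31 29

# Flagged values that agree closely with each other suggest a second
# process rather than isolated errors.
range(y[flagged])   # 29 31

# Deleting them and imputing the mean of the rest erases that process.
cleaned <- replace(y, flagged, mean(y[!flagged]))
range(cleaned)      # 9 11
```

After "cleaning", the data look like a single tidy process, and the
evidence that something else was going on is gone.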
"Detecting outliers in R" as a web query turned up
https://statsandr.com/blog/outliers-detection-in-r/
on the first page of results. There's plenty of material
about finding outliers.
But please give very VERY serious consideration to the
possibility that some or even all of your outliers are
actually GOOD data telling you something you need to know.
On Fri, 21 Apr 2023 at 06:38, AbouEl-Makarim Aboueissa <
abouelmakarim1962 using gmail.com> wrote:
> Dear All:
>
> *Re:* detect and replace outliers by the average
>
> The dataset, please see attached, contains a group factoring column
> “*factor*” and two columns of data “x1” and “x2” with some NA values. I need
> some help to detect the outliers and replace it and the NAs with the
> average within each level (0,1,2) for each variable “x1” and “x2”.
>
> I tried the below code, but it did not accomplish what I want to do.
>
> data<-read.csv("G:/20-Spring_2023/Outliers/data.csv", header=TRUE)
>
> data
>
> replace_outlier_with_mean <- function(x) {
>   replace(x, x %in% boxplot.stats(x)$out, mean(x, na.rm=TRUE)) #### , na.rm=TRUE NOT working
> }
>
> data[] <- lapply(data, replace_outlier_with_mean)
>
> Thank you all very much for your help in advance.
>
> with many thanks
>
> abou
>
> ______________________
>
> *AbouEl-Makarim Aboueissa, PhD*
>
> *Professor, Mathematics and Statistics*
> *Graduate Coordinator*
>
> *Department of Mathematics and Statistics*
> *University of Southern Maine*