[R] Replacing NAs with the average
Martin Maechler
m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Tue Oct 19 09:32:00 CEST 2021
>>>>> Richard O'Keefe
>>>>> on Tue, 19 Oct 2021 14:22:53 +1300 writes:
> It *sounds* as though you are trying to impute missing data.
> There are better approaches than just plugging in means.
> You might want to look into CALIBERrfimpute or missForest.
Yes, indeed!
Put even more strongly: "Imputation" has been an
important topic for decades, and it has been shown since the
1980s that plugging in column means can be *very misleading*
for everything you do later with that modified data set.
The Wikipedia page is quite good as a short intro:
https://en.wikipedia.org/wiki/Imputation_(statistics)
When I've been teaching about this, I've strongly recommended
multiple imputation and the "state-of-the-art" package 'mice',
which comes with a really good textbook:
Stef van Buuren (2012) -- Flexible Imputation of Missing Data
https://doi.org/10.1201/b11826
(= reference [12] in the Wikipedia article)
where in the first chapter you see a nice example of how bad
mean imputation typically will be.
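
To make the point concrete, here is a minimal base-R sketch. It uses
simulated data (an assumption for illustration; it is neither the
poster's data nor the example from the book) and shows that mean
imputation shrinks the standard deviation of the imputed variable and
weakens its correlation with another variable:

```r
# Simulated data: y depends on x (illustrative assumption)
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- x + rnorm(n)

# Delete 400 of the 1000 y values completely at random
y_obs <- y
y_obs[sample(n, 400)] <- NA

# Mean imputation: every NA becomes the mean of the observed values
y_imp <- ifelse(is.na(y_obs), mean(y_obs, na.rm = TRUE), y_obs)

# Spread of y: observed values only vs. mean-imputed vector
c(sd_observed = sd(y_obs, na.rm = TRUE), sd_imputed = sd(y_imp))

# Correlation with x: complete cases vs. mean-imputed vector
c(cor_observed = cor(x, y_obs, use = "complete.obs"),
  cor_imputed  = cor(x, y_imp))
```

The imputed values all sit exactly at the mean, so they add no spread
and no co-variation with x, while the sample size used in the
denominators grows. Standard errors computed from such data therefore
look better than they should.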
The JSS paper on mice is more technical (I'd say "to be used
once you are already aware that 'mean imputation' should rarely be used"):
> citation(package="mice")
To cite mice in publications use:
Stef van Buuren, Karin Groothuis-Oudshoorn (2011). mice: Multivariate
Imputation by Chained Equations in R. Journal of Statistical Software, 45(3),
1-67. URL https://www.jstatsoft.org/v45/i03/.
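
For orientation, a typical multiple-imputation workflow with mice might
look like the sketch below. It assumes the 'mice' package is installed
and uses the small 'nhanes' example data set that ships with it; the
model formula, the number of imputations and the seed are arbitrary
choices for illustration, not a recommendation for the poster's data:

```r
library(mice)   # assumes the 'mice' package is installed

# 'nhanes' ships with mice: columns age, bmi, hyp, chl, with several NAs.
# Create m = 5 completed data sets using predictive mean matching.
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Fit the same model separately on each completed data set ...
fit <- with(imp, lm(chl ~ age + bmi))

# ... and combine the five sets of estimates using Rubin's rules.
pooled <- pool(fit)
summary(pooled)

# A single completed data set, e.g. for inspection only
head(complete(imp, 1))
```

The analysis is run on every completed data set and then pooled, so
the uncertainty caused by the missing values ends up in the standard
errors instead of being hidden.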
Best regards,
Martin Maechler
ETH Zurich and R Core team
> On Tue, 19 Oct 2021 at 01:39, Admire Tarisirayi Chirume
> <atchirume using gmail.com> wrote:
>>
>> Good day colleagues. Below is a csv file attached which i am using in my
>> > analysis.
>> >
>> > household.id hd17.perm hd17employ health.exp total.food.exp total.nfood.exp
>> >            1         2        yes       1654          23654           23655
>> >            2         2        yes         NA             NA           65984
>> >            3         6         no       2547         123311           52416
>> >            4         8         NA       2365          13648           12544
>> >            5         6         NA       1254          36549           12365
>> >            6         8        yes       1236         236541           26522
>> >            7         8         no         NA          13264           23698
>> >
>> > So I created a df from the above; it is a csv file, read as follows
>> >
>> > wbpractice <- read.csv("world_practice.csv")
>> >
>> > Now i am doing data cleaning and trying to replace all missing values with
>> > the averages of the respective columns.
>> >
>> > the dimension of the actual dataset is;
>> >
>> > dim(wbpractice)
>> [1] 31998 6
>>
>> I used the following script, which I executed, but I got some error messages
>>
>> for (i in 1:ncol(wbpractice)) {
>>   wbpractice[is.na(wbpractice[, i]), i] <- mean(wbpractice[, i], na.rm = TRUE)
>> }
>>
>> Any help to replace all NAs with average values in my dataframe?
>>
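
On the literal question: the loop above presumably complains because
hd17employ is not numeric. mean() returns NA with a warning for such a
column, so there is nothing sensible to fill in there. If column means
are nevertheless wanted, e.g. as a baseline to compare against a proper
imputation, a sketch restricted to the numeric columns could look like
this. The data frame below is a stand-in built from the seven rows shown
in the question, not the full 31998-row file:

```r
# Stand-in for the poster's data (the seven rows from the question)
wbpractice <- data.frame(
  household.id    = 1:7,
  hd17.perm       = c(2, 2, 6, 8, 6, 8, 8),
  hd17employ      = c("yes", "yes", "no", NA, NA, "yes", "no"),
  health.exp      = c(1654, NA, 2547, 2365, 1254, 1236, NA),
  total.food.exp  = c(23654, NA, 123311, 13648, 36549, 236541, 13264),
  total.nfood.exp = c(23655, 65984, 52416, 12544, 12365, 26522, 23698)
)

# Only numeric columns have a meaningful mean
num_cols <- names(wbpractice)[vapply(wbpractice, is.numeric, logical(1))]

for (col in num_cols) {
  x <- wbpractice[[col]]
  x[is.na(x)] <- mean(x, na.rm = TRUE)   # fill NAs with the column mean
  wbpractice[[col]] <- x
}

# Remaining NAs per column (hd17employ is left untouched)
colSums(is.na(wbpractice))
```

Note that an ID column such as household.id also counts as numeric here;
with real data you would probably want to exclude it explicitly.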