[R] Imputing missing values using "LSmeans" (i.e., population marginal means) - advice in R?
Jenn Barrett
jsbarret at sfu.ca
Fri Apr 6 18:30:28 CEST 2012
Thanks Andy. I did read that posting, but didn't find that it answered my questions.
Ok - so I've confirmed that I can use popMeans in the doBy package to obtain the LSmeans as described in my e-mail below; however, the output has me puzzled.
Recall that my data consists of counts at various sites (n=93 this time) over about 35 years. Each site was only counted once per year; however, not all sites were counted in all years (i.e., data is unbalanced).
Sample code is as follows:
> dat<-read.csv("CountData.csv”)
> dat$COUNTYR_F<-as.factor(dat$COUNT_YR) # Convert year variable to factor
> lmMod<-lm(log(COUNT+0.5)~SITE + COUNTYR_F, data=dat) # Run lm on log transformed counts
> library(doBy)
> pM<-popMeans(lmMod, c("SITE","COUNTYR_F"))
> LMest<-exp(pM$Estimate) # Transform LS estimates back to count and save to a vector
> head(LMest,10)
[1] 2.3006217 0.7012738 0.6707810 8.4331212 4.6810141 0.5902387 1.2535870 903.2004994 31.7064744 351.7324390
# e.g., compare above output to actual counts: 2, 0, 0, NA, 14, 0, NA, 1031, NA, NA
# This output looks alright. However:
> pM.year<-popMeans(lmMod,"COUNTYR_F")
> LMest.year<-exp(pM.year$Estimate)
> LMest.year
[1] 35.94605 52.21187 38.26182 45.04494 48.26065 31.57805 38.20253 29.08914 27.01732 32.25929 32.54706 25.29704 31.99606 35.86583 [...]
> range(LMest.year)
[1] 20.34141 52.21187
If I calculate the mean for 1975 (which corresponds to the first value in above vector --> 35.94605) using the ouput from popMeans(lmMod, c("SITE","COUNTYR_F")) above, I obtain 1530.253, which makes sense given my data. So why are the population marginal means for year (i.e., averaging across sites) so low in the vector immediately above? I'm obviously missing something crucial here...(and I know I'll feel like a dimwit when it hits me).
Note that if I use popMeans to obtain the marginal mean across years for each site the output looks just fine:
> pM.site<-popMeans(lmMod,"SITE")
> LMest.site<-exp(pM.site$Estimate)
> round(LMest.site,3)
[1] 2.108 0.643 0.615 7.728 4.289 0.541 1.149 827.645 29.054 322.309 73.696 1.067 27116.446 44.367 0.885 17.267
[17] 0.743 2529.955 114.254 5624.021 2.652 167.986 6181.059 32.175 0.728 0.685 0.590 6.184 2399.361 0.633 6943.247 0.902
[33] 0.740 3.934 0.831 11.362 0.843 733.442 18.123 1807.352 2361.726 2.260 0.650 226.013 1.037 3808.097 294.388 1.161
[49] 2.428 16.572 3006.224 0.776 0.946 4587.312 30.342 0.628 0.986 8.147 0.798 241.995 40.880 466.779 395.398 0.688
[65] 45.668 34.119 529.253 0.615 67.455 6.129 883.504 1487.803 2133.575 298.472 31.981 907.871 0.982 1.271 3.636 9.387
[81] 110.531 1129.330 31.332 2905.735 23512.168 88.917 8666.546 7276.974 215.724 1740.381 21.530 29.327 1.388
> mean(LMest.site)
[1] 1319.349
My ultimate goal is to use the marginal means to impute the missing values; however, I'm concerned that I'm doing something wrong given the output for the marginal means for years (i.e., averaging across sites).
Thanks in advance for any input.
Cheers,
Jenn
----- Original Message ----
From: "Andy Liaw" <andy_liaw at merck.com>
To: "Jenn Barrett" <jsbarret at sfu.ca>, r-help at r-project.org
Sent: Thursday, April 5, 2012 8:40:04 AM
Subject: RE: [R] Imputing missing values using "LSmeans" (i.e., population marginal means) - advice in R?
Don't know how you searched, but perhaps this might help:
https://stat.ethz.ch/pipermail/r-help/2007-March/128064.html
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Jenn Barrett
> Sent: Tuesday, April 03, 2012 1:23 AM
> To: r-help at r-project.org
> Subject: [R] Imputing missing values using "LSmeans" (i.e.,
> population marginal means) - advice in R?
>
> Hi folks,
>
> I have a dataset that consists of counts over a ~30 year
> period at multiple (>200) sites. Only one count is conducted
> at each site in each year; however, not all sites are
> surveyed in all years. I need to impute the missing values
> because I need an estimate of the total population size
> (i.e., sum of counts across all sites) in each year as input
> to another model.
>
> > head(newdat,40)
> SITE YEAR COUNT
> 1 1 1975 12620
> 2 1 1976 13499
> 3 1 1977 45575
> 4 1 1978 21919
> 5 1 1979 33423
> ...
> 37 2 1975 40000
> 38 2 1978 40322
> 39 2 1979 70000
> 40 2 1980 16244
>
>
> It was suggested to me by a statistician to use LSmeans to do
> this; however, I do not have SAS, nor do I know anything much
> about SAS. I have spent DAYS reading about these "LSmeans"
> and while (I think) I understand what they are, I have
> absolutely no idea how to a) calculate them in R and b) how
> to use them to impute my missing values in R. Again, I've
> searched the mail lists, internet and literature and have not
> found any documentation to advise on how to do this - I'm lost.
>
> I've looked at popMeans, but have no clue how to use this
> with predict() - if this is even the route to go. Any advice
> would be much appreciated. Note that YEAR will be treated as
> a factor and not a linear variable (i.e., the relationship
> between COUNT and YEAR is not linear - rather there are highs
> and lows about every 10 or so years).
>
> One thought I did have was to just set up a loop to calculate
> the least-squares estimates as:
>
> Yij = (IYi + JYj - Y)/[(I-1)(J-1)]
> where I = number of treatments and J = number of blocks (so
> I = sites and J = years). I found this formula in some stats
> lecture handouts by UC Davis on unbalanced data and
> LSMeans...but does it yield the same thing as using the
> LSmeans estimates? Does it make any sense? Thoughts?
>
> Many thanks in advance.
>
> Jenn
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
Notice: This e-mail message, together with any attachme...{{dropped:11}}
More information about the R-help
mailing list