[R] Imputing missing values using "LSmeans" (i.e., population marginal means) - advice in R?

Søren Højsgaard sorenh at math.aau.dk
Sun Apr 8 01:32:36 CEST 2012


Dear Jenn,

Could you please provide a reproducible example. In your case, you could for example provide the data (using dput(dat)).

Regards
Søren


-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Jenn Barrett
Sent: 6. april 2012 18:30
To: Andy Liaw
Cc: r-help at r-project.org
Subject: Re: [R] Imputing missing values using "LSmeans" (i.e., population marginal means) - advice in R?

Thanks Andy. I did read that posting, but didn't find that it answered my questions.

Ok - so I've confirmed that I can use popMeans in the doBy package to obtain the LSmeans as described in my e-mail below; however, the output has me puzzled.

Recall that my data consists of counts at various sites (n=93 this time) over about 35 years. Each site was only counted once per year; however, not all sites were counted in all years (i.e., data is unbalanced). 

Sample code is as follows:

> dat<-read.csv("CountData.csv”)
> dat$COUNTYR_F<-as.factor(dat$COUNT_YR) # Convert year variable to 
> factor lmMod<-lm(log(COUNT+0.5)~SITE + COUNTYR_F, data=dat)  # Run lm 
> on log transformed counts
> library(doBy)
> pM<-popMeans(lmMod, c("SITE","COUNTYR_F"))
> LMest<-exp(pM$Estimate)  # Transform LS estimates back to count and 
> save to a vector
> head(LMest,10)
 [1]   2.3006217   0.7012738   0.6707810   8.4331212   4.6810141   0.5902387   1.2535870 903.2004994  31.7064744 351.7324390

# e.g., compare above output to actual counts: 2, 0, 0, NA, 14, 0, NA, 1031, NA, NA # This output looks alright. However:

> pM.year<-popMeans(lmMod,"COUNTYR_F")
> LMest.year<-exp(pM.year$Estimate)
> LMest.year
 [1] 35.94605 52.21187 38.26182 45.04494 48.26065 31.57805 38.20253 29.08914 27.01732 32.25929 32.54706 25.29704 31.99606 35.86583 [...]
> range(LMest.year)
[1] 20.34141 52.21187
 
If I calculate the mean for 1975 (which corresponds to the first value in above vector --> 35.94605) using the ouput from popMeans(lmMod, c("SITE","COUNTYR_F")) above, I obtain 1530.253, which makes sense given my data. So why are the population marginal means for year (i.e., averaging across sites) so low in the vector immediately above?  I'm obviously missing something crucial here...(and I know I'll feel like a dimwit when it hits me). 

Note that if I use popMeans to obtain the marginal mean across years for each site the output looks just fine:
> pM.site<-popMeans(lmMod,"SITE")
> LMest.site<-exp(pM.site$Estimate)
> round(LMest.site,3)
 [1]     2.108     0.643     0.615     7.728     4.289     0.541     1.149   827.645    29.054   322.309    73.696     1.067 27116.446    44.367     0.885    17.267
[17]     0.743  2529.955   114.254  5624.021     2.652   167.986  6181.059    32.175     0.728     0.685     0.590     6.184  2399.361     0.633  6943.247     0.902
[33]     0.740     3.934     0.831    11.362     0.843   733.442    18.123  1807.352  2361.726     2.260     0.650   226.013     1.037  3808.097   294.388     1.161
[49]     2.428    16.572  3006.224     0.776     0.946  4587.312    30.342     0.628     0.986     8.147     0.798   241.995    40.880   466.779   395.398     0.688
[65]    45.668    34.119   529.253     0.615    67.455     6.129   883.504  1487.803  2133.575   298.472    31.981   907.871     0.982     1.271     3.636     9.387
[81]   110.531  1129.330    31.332  2905.735 23512.168    88.917  8666.546  7276.974   215.724  1740.381    21.530    29.327     1.388

> mean(LMest.site)
[1] 1319.349 

My ultimate goal is to use the marginal means to impute the missing values; however, I'm concerned that I'm doing something wrong given the output for the marginal means for years (i.e., averaging across sites).

Thanks in advance for any input. 

Cheers,
Jenn

----- Original Message ----
From: "Andy Liaw" <andy_liaw at merck.com>
To: "Jenn Barrett" <jsbarret at sfu.ca>, r-help at r-project.org
Sent: Thursday, April 5, 2012 8:40:04 AM
Subject: RE: [R] Imputing missing values using "LSmeans" (i.e., population marginal means) - advice in R?

Don't know how you searched, but perhaps this might help:

https://stat.ethz.ch/pipermail/r-help/2007-March/128064.html 

> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Jenn Barrett
> Sent: Tuesday, April 03, 2012 1:23 AM
> To: r-help at r-project.org
> Subject: [R] Imputing missing values using "LSmeans" (i.e., population 
> marginal means) - advice in R?
> 
> Hi folks,
> 
> I have a dataset that consists of counts over a ~30 year period at 
> multiple (>200) sites. Only one count is conducted at each site in 
> each year; however, not all sites are surveyed in all years. I need to 
> impute the missing values because I need an estimate of the total 
> population size (i.e., sum of counts across all sites) in each year as 
> input to another model.
> 
> > head(newdat,40)
>    SITE YEAR COUNT
> 1     1 1975 12620
> 2     1 1976 13499
> 3     1 1977 45575
> 4     1 1978 21919
> 5     1 1979 33423
> ...
> 37    2 1975 40000
> 38    2 1978 40322
> 39    2 1979 70000
> 40    2 1980 16244
> 
> 
> It was suggested to me by a statistician to use LSmeans to do this; 
> however, I do not have SAS, nor do I know anything much about SAS. I 
> have spent DAYS reading about these "LSmeans"
> and while (I think) I understand what they are, I have absolutely no 
> idea how to a) calculate them in R and b) how to use them to impute my 
> missing values in R. Again, I've searched the mail lists, internet and 
> literature and have not found any documentation to advise on how to do 
> this - I'm lost.
> 
> I've looked at popMeans, but have no clue how to use this with 
> predict() - if this is even the route to go. Any advice would be much 
> appreciated. Note that YEAR will be treated as a factor and not a 
> linear variable (i.e., the relationship between COUNT and YEAR is not 
> linear - rather there are highs and lows about every 10 or so years).
> 
> One thought I did have was to just set up a loop to calculate the 
> least-squares estimates as:
> 
> Yij = (IYi + JYj - Y)/[(I-1)(J-1)]
> where  I = number of treatments and J = number of blocks (so I = sites 
> and J = years). I found this formula in some stats lecture handouts by 
> UC Davis on unbalanced data and LSMeans...but does it yield the same 
> thing as using the LSmeans estimates? Does it make any sense? 
> Thoughts?
> 
> Many thanks in advance.
> 
> Jenn
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


More information about the R-help mailing list