[R-sig-ME] simulation of grouped data

Fri Jun 20 21:26:25 CEST 2014

mervat mohamed <mervat_moh2006 at ...> writes:

> 
> Dear R group

> I simulate 100 data sets, each data set consists of 50000
> observations (X, Y) and 100 groups, each group contains 500
> observations (data file=indiv).  Then I use the simulated population
> to draw a sample of 25 observations from 5 groups (s.grp) (5
> observations from each group (p)) (data file=indiv.str). And ,I am
> doing lmer on the drawing sample with these two variables Y & X and
> group as random effect. I estimate the slope of the model of y on
> x. But, I have a problem which is the difference between the
> estimated slope (beta) and the real value (r.beta=0.2) is sometimes
> large and I do not know the reason.

(It would help substantially if you could format your code more nicely ...)

in any case (see end of this message), if you draw a histogram of
your results you'll see that they're approximately normally distributed
with a mean near the true value (0.2) and a s.d. of about 0.25.
If you saw occasional outliers that might suggest that the fit was
sometimes going wrong, but this result strongly suggests that everything
is exactly as expected (if you look at the estimated standard errors
of your slope parameter they should be around 0.25 as well).  The
difference is sometimes large just through the effects of stochastic
sampling.

You can probably run your simulations considerably more efficiently/
with more compact code (although since most of the computation is
taken by lmer(), it might not matter much).

 The program code is as follows:

hh=100              ##  the number of simulated data sets
beta<-rep(NA,hh)    ## the estimated values of the slope of the model
for (h in 1:hh){
     n=50000  ##  the total size of the simulated population
     grp=100  ##  the number of groups of the simulated population
     s.grp=5  ##  the number of sampled groups from the simulated popualtion
     p=5      ##  the number of sampled observations 
              ##      from each group of the sampled groups
     r.beta=0.2  ## the real value of beta 
     x=c(0)
     y=c(0)
     group=c(0)
     e=c(0)   ##  the error term within each group
     u=c(0)  ## the error term between the groups
     for (j in 1:100){
          u[j] <- rnorm (1,0,1)
          for (i in 1:500){
              group[i+j*500-500]=j
              e[i+j*500-500]<-rnorm(1,0,1)
              x[i+j*500-500]=rnorm(1,3,1)
              y[i+j*500-500]=2+b*x[i+j*500-500]+e[i+j*500-500]+u[j]
           }
     }
     indiv<- data.frame(group,x,y)
     s.id<-sample(1:100,s.grp)
     str.rows<-c(0)
     r1<-c(0)
     r2<-c(0)
     for (j in 1:s.grp){
         r2[j]=500*s.id[j]
         r1[j]=r2[j]-500+1
         for (i in 1:p){
             str.rows[i+j*p-p] <-c(sample(r1[j]:r2[j],1))
       }
     }
     str <-indiv[str.rows,]
     indiv.str=data.frame(str$group,str$x,str$y)
     names(indiv.str)<-c("group","x","y")
     fcn=lmer(y~x+(1|group),data=indiv.str)
     beta[h]<-fixef(fcn)["x"]
   }

## results reformatted
res <- c(0.231995519,-0.188624285,0.018167679,0.053498977,0.534498744,
-0.066925736,0.080323888,0.386649648,0.516902831,0.515848003,
0.203306100,-0.068883270,-0.267411862,0.390552183,0.276073648,
0.007330362,0.104371287,-0.160514069,-0.195189352,0.387713063,
0.184829681,0.144895545,0.631665413,0.646678912,0.015387116,
-0.192252053,0.181388009,0.171305460,-0.015506910,-0.058342395,
0.507789283,0.169497841,-0.307642411,0.523006541,0.404130042,
0.195879055,0.229919059,-0.069286463,0.447006382,0.244852676,
-0.186131780,0.308434630,0.178955713,0.043044122,-0.095523847,
0.165376798,0.638022903,0.264380374,-0.054814189,0.022157603,
0.219759274,-0.034772547,0.758598418,0.052701825,0.439542452,
0.130938211,0.025220602,0.183105063,0.125483056,0.365605925,
-0.105069737,-0.007282673,-0.140160646,0.415598284,0.073348887,
-0.055605028,0.238275730,-0.052132121,0.063960296,0.085801721,
0.093794361,-0.204151040,-0.159957827,0.351359718,0.390467192,
0.683148716,0.382928456,0.174347587,0.430329912,0.202229216,
0.397045769,0.305609822,0.125820851,0.164247981,0.253652445,
0.237743071,0.219102084,0.474830159,-0.181714623,0.384835726,
0.361376723,0.032047569,0.097940523,0.287583642,0.572813309,
0.338947802,-0.030983428,0.334985479,0.533231095,0.283882400)

par(las=1)
hist(sort(res),main="",col="gray")
abline(v=0.2,col=2,lwd=2)