[R] simulate a gradual increase of the number of subjects based on two variables
Paul Johnson
pauljohn32 at gmail.com
Tue Mar 13 20:44:00 CET 2012
Suggestion below:
On Tue, Mar 13, 2012 at 1:24 PM, guillaume chaumet
<guillaumechaumet at gmail.com> wrote:
> I forgot to mention that I have already tried to generate data based on
> the mean and SD of two variables.
>
> x <- rnorm(20, 1, 5) + 1:20
> y <- rnorm(20, 1, 7) + 41:60
>
> ## grow x and y one case at a time, drawing each new case from the
> ## current mean and sd, and keep the growing vectors at each step
> simu <- function(x, y, n) {
>     out <- vector("list", length = n)
>     for (i in 1:n) {
>         x <- c(x, rnorm(1, mean(x), sd(x)))
>         y <- c(y, rnorm(1, mean(y), sd(y)))
>         out[[i]]$x <- x
>         out[[i]]$y <- y
>     }
>     return(out)
> }
>
> test <- simu(x, y, 60)
> lapply(test, function(s) cor.test(s$x, s$y))
>
> As you can see, the correlation disappears as N increases.
> Perhaps a bootstrap with lm or cor.test could solve my problem.
>
In this case, you should consider creating the LARGEST sample first,
and then removing cases to create the smaller samples.
The problem now is that you are drawing a completely fresh sample
every time, so you are seeing not only the effect of sample size but
also the extra randomness introduced because every case, starting with
case 1, is replaced on each run.
I am fairly confident (80%) that if you approach it my way, the
mystery you see will start to clarify itself. That is, draw the big
sample with the desired characteristics, and once you understand the
sampling distribution of cor for that big sample, you will also
understand what happens when each large sample is reduced by a few
cases. A sketch of what I mean is below.
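Something like this (an untested sketch; the seed, the top size of 80,
and the names N, sizes, and cors are my choices, while the generating
process is carried over from your code):

set.seed(1234)                      # so the runs are repeatable
N <- 80                             # largest sample: your 20 cases plus 60 more
x <- rnorm(N, 1, 5) + 1:N           # same generating process as your example
y <- rnorm(N, 1, 7) + 41:(40 + N)
sizes <- 20:N                       # every smaller sample is a nested subset
cors <- sapply(sizes, function(n) cor(x[1:n], y[1:n]))
plot(sizes, cors, type = "l",
     xlab = "sample size n", ylab = "cor(x[1:n], y[1:n])")

Each smaller sample now differs from the next larger one only by the
cases you dropped, so the change in cor reflects sample size alone.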
BTW, if you were doing this on a truly massive scale, my way would
also run much faster: allocate memory once, and then you never need to
manually delete rows; you just trim the index into the rows. (Same
data-access concept as the bootstrap.)
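Continuing the sketch above (dat, idx, keep, and cors2 are
hypothetical names; x, y, and sizes come from the previous block):

dat <- data.frame(x = x, y = y)     # allocate the full data set once
idx <- sample(nrow(dat))            # one random ordering of the row numbers
cors2 <- sapply(sizes, function(n) {
    keep <- idx[1:n]                # "trim the index", not the data
    cor(dat$x[keep], dat$y[keep])
})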
pj
--
Paul E. Johnson
Professor, Political Science
Assoc. Director, Center for Research Methods
1541 Lilac Lane, Room 504
University of Kansas
http://pj.freefaculty.org
http://quant.ku.edu