[R] Generating a binomial random variable correlated with a normal random variable
Giovanni Petris
GPetris at uark.edu
Mon Apr 18 15:57:15 CEST 2005
You may generate a single standard normal random variable Z and set
X = (Z > x). A back-of-the-envelope calculation (since E[Z] = 0,
Cov(Z, X) = E[Z*X] = dnorm(x), while Var(X) = pnorm(x)*(1-pnorm(x)))
shows that the two have a correlation of
exp(-x^2/2)/sqrt(2*pi*pnorm(x)*(1-pnorm(x)))
which decreases from a maximum of 2*dnorm(0), about 0.798, at x = 0
to 0 as x goes to infinity.
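You can check the formula quickly by simulation (a small sketch; rho
here is just a helper name for the expression above):

rho <- function(x) exp(-x^2/2)/sqrt(2*pi*pnorm(x)*(1-pnorm(x)))
z <- rnorm(1e6)
for (x in c(0, 0.5, 1, 2))
    cat("x =", x, " theory:", rho(x),
        " simulation:", cor(z, as.numeric(z > x)), "\n")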
If you aim at a correlation of 0.7, here is how to do it:
> rt <- uniroot(function(x) exp(-x^2/2)/sqrt(2*pi*pnorm(x)*(1-pnorm(x)))-0.7,
+ lower=0,upper=3)
> rt$root
[1] 0.841188
> x <- rnorm(1000)
> y <- as.numeric(x>rt$root)
> cor(x,y)
[1] 0.7069753
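The same recipe can be wrapped into a little function (a sketch only;
rbinnorm is a made-up name, and it assumes the target rho is below
the 0.798 ceiling noted above, so that a root exists):

rbinnorm <- function(n, rho) {
  ## threshold x at which the indicator (z > x) has correlation rho with z
  f <- function(x) exp(-x^2/2)/sqrt(2*pi*pnorm(x)*(1-pnorm(x))) - rho
  x0 <- uniroot(f, lower=0, upper=8)$root
  z <- rnorm(n)                       # the normal variable
  list(z = z, b = as.numeric(z > x0)) # and its correlated 0/1 companion
}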
Best,
Giovanni
--
__________________________________________________
[ ]
[ Giovanni Petris GPetris at uark.edu ]
[ Department of Mathematical Sciences ]
[ University of Arkansas - Fayetteville, AR 72701 ]
[ Ph: (479) 575-6324, 575-8630 (fax) ]
[ http://definetti.uark.edu/~gpetris/ ]
[__________________________________________________]
> Date: Sun, 17 Apr 2005 11:52:49 +0100 (BST)
> From: Ted.Harding at nessie.mcc.ac.uk (Ted Harding)
> Cc: r-help at stat.math.ethz.ch
>
> On 16-Apr-05 Ashraf Chaudhary wrote:
> > Ted:
> > Thank you for your help. All I want is a binomial random
> > variable that is correlated with a normal random variable
> > with specified correlation. By linear I mean the ordinary
> > Pearson correlation. I tried the following two methods,
> > in each case the resulting correlation is substantially
> > less than the one specified.
> >
> > Method I: Generate two correlated normals (using Cholesky
> > decomposition method) and dichotomize one (0/1) to get the
> > binomial. Method II: Generate two correlated variables, one
> > binomial and one normal, using the Cholesky decomposition method.
> >
> > Here is how I did it:
> >
> > X <- rnorm(100)
> > Y <- rnorm(100)
> > r<- 0.7
> > Y1 <- X*r+Y*sqrt(1-r**2)
> > cor(X,Y1) # Correlated normals using Cholesky decomposition
> > cor(X>0.84,Y1) # Method I
> >
> >##
> > X1 <- rbinom(100,1,0.5)
> > Y2 <- X1*r+Y*sqrt(1-r**2)
> > cor(X1,Y2); # Method II
>
> Hello Ashraf,
>
> The above more explicit explanation certainly helps to clarify
> your question! (I echo Bill's counsel to spell out essential
> detail).
>
> I'll follow on from your "second method" above, which makes it
> explicit that, alongside each of your 100 values (Y) sampled from
> rnorm, you sample a value from {0,1} to get one of your 100
> binomial (n=1) variables (X1). However, in this second method you
> also create the derived variable Y2 = X1*r+Y*sqrt(1-r**2), and
> apparently you want this (Y2) to be the Normal variate which
> is correlated with X1.
>
> Unfortunately, given the way Y2 is constructed, it does not
> have a Normal distribution. This is almost obvious from the
> fact that, conditional on the number of 1s in X1, the values
> of Y2 are a mixture of N(0,1-r^2) and N(r,1-r^2).
>
> You can explore this in R by looking at the histogram of a
> much larger sample of Y2. Thus:
>
> r <- 0.7
> Y <- rnorm(100000)
> X1 <- rbinom(100000,1,0.5)
> Y2 <- X1*r+Y*sqrt(1-r**2)
> m <- mean(Y2) ; s <- sd(Y2)
> hist(Y2,breaks=0.1*(-45:45))    # bins of width 0.1
> lines(0.1*(-45:45),0.1*100000*dnorm(0.1*(-45:45),m,s))  # expected counts
>
> Do it once, and the fit looks pretty good. However, if you
> repeat the above commands several times, you will observe
> that, near the peak of the curve, there is a definite
> tendency for the peak of the histogram to lie below the peak
> of the fitted curve. This illustrates the fact that you are
> looking at a superposition of two Normal curves, with identical
> (since you have p=0.5 in the Binomial sample) proportions
> and variances, but slightly shifted relative to each other
> (by an amount r which is in magnitude less than 1). This
> superposition is flatter at the top than any single Normal
> curve would be, which is reflected in the "deficiency"
> of the histogram relative to the fitted curve.
>
> You can also see this theoretically, since your Y2 is
>
> Y2 = A*X1 + B*Y   (with A = r and B = sqrt(1-r^2))
>
> so that, splitting on the two values of X1,
>
> P[Y2 <= y] = (1-p)*P[B*Y <= y] + p*P[B*Y <= y - A]
>
>            = (1-p)*P[Y <= y/B] + p*P[Y <= (y-A)/B]
>
> where p is the Binomial P[X1 = 1].
>
> Hence the density function of Y2 is the derivative of this
> w.r.t. y, namely
>
> (1-p)*f(y/B)/B + p*f((y-A)/B)/B
>
> where f(x) is the density function of the standard N(0,1).
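>
> For instance, a quick sketch along these lines (plotting this exact
> mixture density against the single Normal with matching mean and SD,
> using the facts E[Y2] = p*A and Var(Y2) = B^2 + p*(1-p)*A^2) shows
> the flattened peak directly:
>
> r <- 0.7 ; p <- 0.5 ; A <- r ; B <- sqrt(1-r^2)
> y <- seq(-4, 4, by=0.01)
> mix <- (1-p)*dnorm(y/B)/B + p*dnorm((y-A)/B)/B   # exact density of Y2
> m <- p*A ; s <- sqrt(B^2 + p*(1-p)*A^2)          # mean and SD of Y2
> plot(y, mix, type="l")                           # mixture: flatter on top
> lines(y, dnorm(y, m, s), lty=2)                  # matching single Normal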
>
> While in the case of your example (see histograms) the
> distribution of Y2 might be close enough to Normal for
> practical purposes (but this depends on your practical
> purpose), it is nonetheless better to try to avoid the
> theoretical difficulty if possible.
>
> So I'd suggest experimenting along the following lines.
>
> 1. Let X1 be a sample of size N using rbinom(N,1,p)
> (where, in general, p need not be 0.5)
>
> 2. Let Y be a sample of size N using rnorm(N,mu,sigma)
> (and again, in general, mu need not be 0 nor sigma 1).
>
> This is as in your example. So far, you have a true
> Binomial variate X1 and a true Normal variate Y.
>
> 3. X1 will have, say, r 1s in it, and (N-r) 0s (this r is a
> count, not the target correlation). Now consider
> setting up a correspondence between the 1s and 0s in
> X1 and the values in Y, in such a way that the 1s
> tend to get allocated to the higher values of Y and
> the 0s to lower values of Y. The resulting set of
> pairs (X1,Y) is then your correlated sample.
>
> In other words, permute the sample X1 and cbind the
> result to Y.
>
> This is similar to your "dichotomy" method (and would
> be almost identical if you simply allocated the r 1s
> to the r largest Ys), but is more flexible since I'm
> only saying "tend". In other words, consider sampling
> from the N Binomial 0s and 1s according to a non-uniform
> probability distribution.
>
> The theoretical benefit from doing it this way (provided
> you can arrange the sampling in (3) so as to get your
> desired correlation) is that, by construction, the marginal
> distributions of X1 and Y are exactly as they were to start
> with, namely Binomial and Normal respectively.
>
> Here is an example of a possible approach:
>
> X0 <- rbinom(100,1,0.5)                 # a true Binomial(1,0.5) sample
> Y <- rnorm(100) ; Y <- sort(Y)          # a true Normal sample, sorted
> p0 <- ((1:100)-0.5)/100 ; p0 <- p0/sum(p0) ; p <- cumsum(p0)
> r <- sum(X0)                            # how many 1s to place
> ix <- sample((1:100),r,replace=FALSE,prob=p)  # weights favour high ranks
> X1 <- numeric(100) ; X1[ix] <- 1 ; sum(X1)
> ## [1] 50
> cor(X1,Y)
> ## [1] 0.6150937
>
> showing that it's quite easy to get close to your (example)
> correlation of 0.7 by simple means. (Note that X1, as
> constructed above, is equivalent to the original Binomial
> sample X0, since it has the same number of 0s and 1s; it's
> just a matter of rearranging these so as to tend to line
> up with the higher values of Y.)
>
> To refine the method (on this approach), play with the way that p
> is derived from p0 (the third line of the code above). You might
> look at some of the suggestions in Bill Venables' (private)
> mail.
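>
> One concrete way to do that tuning (a rough sketch; corr_for_k and
> the rank^k weighting are just one ad-hoc choice, not the only one):
>
> corr_for_k <- function(k, n=100, reps=200) {
>   ## average cor(X1,Y) when the 1s are placed with weights (1:n)^k
>   mean(replicate(reps, {
>     Y <- sort(rnorm(n))
>     X0 <- rbinom(n, 1, 0.5)
>     ix <- sample(1:n, sum(X0), replace=FALSE, prob=(1:n)^k)
>     X1 <- numeric(n) ; X1[ix] <- 1
>     cor(X1, Y)
>   }))
> }
> sapply(1:3, corr_for_k)   # average correlation for k = 1, 2, 3
>
> Larger k pushes the 1s harder towards the big values of Y, so one
> can tune k until the average correlation is close to the target.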
>
> Even something as banal as
>
> X1 <- sort(rbinom(100,1,0.5))
> Y <- sort(rnorm(100))
> cor(X1,Y)
> ## [1] 0.768373
>
> gives an interesting result, though this is far too inflexible
> for general purposes.
>
> There may even be a theoretical way of arranging the 1s
> so as to get the desired correlation exactly on average
> -- my nose indicates that this may be so, but I have not
> followed it up!
>
> Best wishes,
> Ted.
>
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 17-Apr-05 Time: 11:52:49
> ------------------------------ XFMail ------------------------------
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html