[R] Generating a binomial random variable correlated with a normal random variable
Giovanni Petris
GPetris at uark.edu
Mon Apr 18 15:57:15 CEST 2005
You may generate a single standard normal random variable Z and set
X = (Z > x). A back-of-the-envelope calculation (since E[Z] = 0,
Cov(Z, X) = E[Z*X] = dnorm(x), while Var(X) = pnorm(x)*(1-pnorm(x)))
shows that the two have a correlation of
exp(-x^2/2)/sqrt(2*pi*pnorm(x)*(1-pnorm(x)))
which decreases from a maximum of 2*dnorm(0), about 0.798, at x = 0
to 0 as x goes to infinity.
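You can check the formula quickly by simulation (a small sketch; rho
here is just a helper name for the expression above):

rho <- function(x) exp(-x^2/2)/sqrt(2*pi*pnorm(x)*(1-pnorm(x)))
z <- rnorm(1e6)
for (x in c(0, 0.5, 1, 2))
    cat("x =", x, " theory:", rho(x),
        " simulation:", cor(z, as.numeric(z > x)), "\n")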
If you aim at a correlation of 0.7, here is how to do it:
> rt <- uniroot(function(x) exp(-x^2/2)/sqrt(2*pi*pnorm(x)*(1-pnorm(x)))-0.7,
+ lower=0,upper=3)
> rt$root
[1] 0.841188
> x <- rnorm(1000)
> y <- as.numeric(x>rt$root)
> cor(x,y)
[1] 0.7069753
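The same recipe can be wrapped into a little function (a sketch only;
rbinnorm is a made-up name, and it assumes the target rho is below
the 0.798 ceiling noted above, so that a root exists):

rbinnorm <- function(n, rho) {
  ## threshold x at which the indicator (z > x) has correlation rho with z
  f <- function(x) exp(-x^2/2)/sqrt(2*pi*pnorm(x)*(1-pnorm(x))) - rho
  x0 <- uniroot(f, lower=0, upper=8)$root
  z <- rnorm(n)                       # the normal variable
  list(z = z, b = as.numeric(z > x0)) # and its correlated 0/1 companion
}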
Best,
Giovanni
--
__________________________________________________
[ ]
[ Giovanni Petris GPetris at uark.edu ]
[ Department of Mathematical Sciences ]
[ University of Arkansas - Fayetteville, AR 72701 ]
[ Ph: (479) 575-6324, 575-8630 (fax) ]
[ http://definetti.uark.edu/~gpetris/ ]
[__________________________________________________]
> Date: Sun, 17 Apr 2005 11:52:49 +0100 (BST)
> From: Ted.Harding at nessie.mcc.ac.uk (Ted Harding)
> Cc: r-help at stat.math.ethz.ch
>
> On 16-Apr-05 Ashraf Chaudhary wrote:
> > Ted:
> > Thank you for your help. All I want is a binomial random
> > variable that is correlated with a normal random variable
> > with specified correlation. By linear I mean the ordinary
> > Pearson correlation. I tried the following two methods,
> > in each case the resulting correlation is substantially
> > less than the one specified.
> >
> > Method I: Generate two correlated normals (using Cholesky
> > decomposition method) and dichotomize one (0/1) to get the
> > binomial. Method II: Generate two correlated variables, one
> > binomial and one normal, using the Cholesky decomposition method.
> >
> > Here is how I did it:
> >
> > X <- rnorm(100)
> > Y <- rnorm(100)
> > r<- 0.7
> > Y1 <- X*r+Y*sqrt(1-r**2)
> > cor(X,Y1) # Correlated normals using Cholesky decomposition
> > cor(X>0.84,Y1) # Method I
> >
> >##
> > X1 <- rbinom(100,1,0.5)
> > Y2 <- X1*r+Y*sqrt(1-r**2)
> > cor(X1,Y2); # Method II
>
> Hello Ashraf,
>
> The above more explicit explanation certainly helps to clarify
> your question! (I echo Bill's counsel to spell out essential
> detail).
>
> I'll follow on from your "second method" above, which makes it
> explicit that, alongside each of your 100 values (Y) sampled from
> rnorm, you sample a value from {0,1} to get one of your 100
> binomial (n=1) variables (X1). However, in this second method you
> also create the derived variable Y2 = X1*r+Y*sqrt(1-r**2), and
> apparently you want this (Y2) to be the Normal variate which
> is correlated with X1.
>
> Unfortunately, given the way Y2 is constructed, it does not
> have a Normal distribution. This is almost obvious from the
> fact that, conditional on the number of 1s in X1, the values
> of Y2 are a mixture of N(0,1-r^2) and N(r,1-r^2).
>
> You can explore this in R by looking at the histogram of a
> much larger sample of Y2. Thus:
>
> r <- 0.7
> Y <- rnorm(100000)
> X1 <- rbinom(100000,1,0.5)
> Y2 <- X1*r+Y*sqrt(1-r**2)
> m <- mean(Y2) ; s <- sd(Y2)
> hist(Y2,breaks=0.1*(-45:45))    # bins of width 0.1
> lines(0.1*(-45:45),0.1*100000*dnorm(0.1*(-45:45),m,s))  # expected counts
>
> Do it once, and the fit looks pretty good. However, if you
> repeat the above commands several times, you will observe
> that, near the peak of the curve, there is a definite
> tendency for the peak of the histogram to lie below the peak
> of the fitted curve. This illustrates the fact that you are
> looking at a superposition of two Normal curves, with identical
> (since you have p=0.5 in the Binomial sample) proportions
> and variances, but slightly shifted relative to each other
> (by an amount r which is in magnitude less than 1). This
> superposition is flatter at the top than any single Normal
> curve would be, which is reflected in the "deficiency"
> of the histogram relative to the fitted curve.
>
> You can also see this theoretically, since your Y2 is
>
> Y2 = A*X1 + B*Y   (with A = r and B = sqrt(1-r^2))
>
> so that, splitting on the two values of X1,
>
> P[Y2 <= y] = (1-p)*P[B*Y <= y] + p*P[B*Y <= y - A]
>
>            = (1-p)*P[Y <= y/B] + p*P[Y <= (y-A)/B]
>
> where p is the Binomial P[X1 = 1].
>
> Hence the density function of Y2 is the derivative of this
> w.r.t. y, namely
>
> (1-p)*f(y/B)/B + p*f((y-A)/B)/B
>
> where f(x) is the density function of the standard N(0,1).
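>
> For instance, a quick sketch along these lines (plotting this exact
> mixture density against the single Normal with matching mean and SD,
> using the facts E[Y2] = p*A and Var(Y2) = B^2 + p*(1-p)*A^2) shows
> the flattened peak directly:
>
> r <- 0.7 ; p <- 0.5 ; A <- r ; B <- sqrt(1-r^2)
> y <- seq(-4, 4, by=0.01)
> mix <- (1-p)*dnorm(y/B)/B + p*dnorm((y-A)/B)/B   # exact density of Y2
> m <- p*A ; s <- sqrt(B^2 + p*(1-p)*A^2)          # mean and SD of Y2
> plot(y, mix, type="l")                           # mixture: flatter on top
> lines(y, dnorm(y, m, s), lty=2)                  # matching single Normal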
>
> While in the case of your example (see histograms) the
> distribution of Y2 might be close enough to Normal for
> practical purposes (but this depends on your practical
> purpose), it is nonetheless better to try to avoid the
> theoretical difficulty if possible.
>
> So I'd suggest experimenting along the following lines.
>
> 1. Let X1 be a sample of size N using rbinom(N,1,p)
> (where, in general, p need not be 0.5)
>
> 2. Let Y be a sample of size N using rnorm(N,mu,sigma)
> (and again, in general, mu need not be 0 nor sigma 1).
>
> This is as in your example. So far, you have a true
> Binomial variate X1 and a true Normal variate Y.
>
> 3. X1 will have, say, r 1s in it, and (N-r) 0s (this r is a
> count, not the target correlation). Now consider
> setting up a correspondence between the 1s and 0s in
> X1 and the values in Y, in such a way that the 1s
> tend to get allocated to the higher values of Y and
> the 0s to lower values of Y. The resulting set of
> pairs (X1,Y) is then your correlated sample.
>
> In other words, permute the sample X1 and cbind the
> result to Y.
>
> This is similar to your "dichotomy" method (and would
> be almost identical if you simply allocated the r 1s
> to the r largest Ys), but is more flexible since I'm
> only saying "tend". In other words, consider sampling
> from the N Binomial 0s and 1s according to a non-uniform
> probability distribution.
>
> The theoretical benefit from doing it this way (provided
> you can arrange the sampling in (3) so as to get your
> desired correlation) is that, by construction, the marginal
> distributions of X1 and Y are exactly as they were to start
> with, namely Binomial and Normal respectively.
>
> Here is an example of a possible approach:
>
> X0 <- rbinom(100,1,0.5)                 # a true Binomial(1,0.5) sample
> Y <- rnorm(100) ; Y <- sort(Y)          # a true Normal sample, sorted
> p0 <- ((1:100)-0.5)/100 ; p0 <- p0/sum(p0) ; p <- cumsum(p0)
> r <- sum(X0)                            # how many 1s to place
> ix <- sample((1:100),r,replace=FALSE,prob=p)  # weights favour high ranks
> X1 <- numeric(100) ; X1[ix] <- 1 ; sum(X1)
> ## [1] 50
> cor(X1,Y)
> ## [1] 0.6150937
>
> showing that it's quite easy to get close to your (example)
> correlation of 0.7 by simple means. (Note that X1, as
> constructed above, is equivalent to the original Binomial
> sample X0, since it has the same number of 0s and 1s; it's
> just a matter of rearranging these so as to tend to line
> up with the higher values of Y.)
>
> To refine the method (on this approach), play with the way that p
> is derived from p0 (the third line of the code above). You might
> look at some of the suggestions in Bill Venables' (private)
> mail.
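>
> One concrete way to do that tuning (a rough sketch; corr_for_k and
> the rank^k weighting are just one ad-hoc choice, not the only one):
>
> corr_for_k <- function(k, n=100, reps=200) {
>   ## average cor(X1,Y) when the 1s are placed with weights (1:n)^k
>   mean(replicate(reps, {
>     Y <- sort(rnorm(n))
>     X0 <- rbinom(n, 1, 0.5)
>     ix <- sample(1:n, sum(X0), replace=FALSE, prob=(1:n)^k)
>     X1 <- numeric(n) ; X1[ix] <- 1
>     cor(X1, Y)
>   }))
> }
> sapply(1:3, corr_for_k)   # average correlation for k = 1, 2, 3
>
> Larger k pushes the 1s harder towards the big values of Y, so one
> can tune k until the average correlation is close to the target.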
>
> Even something as banal as
>
> X1 <- sort(rbinom(100,1,0.5))
> Y <- sort(rnorm(100))
> cor(X1,Y)
> ## [1] 0.768373
>
> gives an interesting result, though this is far too inflexible
> for general purposes.
>
> There may even be a theoretical way of arranging the 1s
> so as to get the desired correlation exactly on average
> -- my nose indicates that this may be so, but I have not
> followed it up!
>
> Best wishes,
> Ted.
>
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 17-Apr-05 Time: 11:52:49
> ------------------------------ XFMail ------------------------------
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html