[R] Ruofei Mo - How can I generate correlated data with non-normal distribution?

Dalthorp, Daniel ddalthorp at usgs.gov
Thu Mar 3 17:31:07 CET 2016


Ruofei,

Ben's suggestion is simple and gets you close:

require(MASS)
nsim <- 1000000
rho <- -.9
Z <- mvrnorm(nsim, mu=c(0,0),Sigma = cbind(c(1,rho),c(rho, 1)))
U <- pnorm(Z);
a <- Z[,1]
b <- qunif(U[,2])
cor(a,b)

Pearson correlation characterizes the linear relationship between normal
r.v.'s, but there's always a question about what Pearson correlation means
for non-normal r.v.'s....the "close" approach that Ben gave gives a pair of
r.v.'s with a non-linear relationship (because of the pnorm non-linear
transformation), which can be seen if you plot the means of a in bins
across the range of b.

plot(0,0,type='n',xlim=range(a), ylim=c(-.2,1.1))
for (i in seq(0,.99,by=0.01)){
  ind<-which(b>min(b)+i*diff(range(b)) & b<=min(b)+(i+.01)*diff(range(b)))
  points(mean(a[ind]),i-.005,pch=3) # means of a in 100 bins that span the
range of b
}

# so Spearman rank correlation is often used for "correlation" between
non-normal r.v.'s. Exact Spearman correlations can be attained using the
same approach, but adjusting the target correlation.

# to get Spearman correlation of rho = -0.90, use 2*sin(rho*pi/6) in place
of rho in the mvrnorm simulation:
nsim <- 1000000
rho <- -.9
Z <- mvrnorm(nsim, mu=c(0,0),Sigma =
cbind(c(1,2*sin(rho*pi/6)),c(2*sin(rho*pi/6), 1)))
a<-Z[,1]; b<-qunif(pnorm(Z[,2]))
cor(rank(a),rank(b))

# this gives r.v.'s with exact Spearman correlations as desired, while
aiming at Pearson correlation tends to be off by some amount

TargetCorrelation<-seq(-.99, -.1, by=.01)
SimulatedCorrelationP<-numeric(length(TargetCorrelation))
SimulatedCorrelationS<-numeric(length(TargetCorrelation))

nsim<-100000
for (i in 1:length(TargetCorrelation)){
  rho<-TargetCorrelation[i]
  Z <- mvrnorm(nsim, mu=c(0,0),Sigma = cbind(c(1,rho),c(rho, 1)))
  U <- pnorm(Z);
  a <- Z[,1]
  b <- qunif(U[,2])
  SimulatedCorrelationP[i]<-cor(a,b)

  Z <- mvrnorm(nsim, mu=c(0,0),Sigma =
cbind(c(1,2*sin(rho*pi/6)),c(2*sin(rho*pi/6), 1)))
  U <- pnorm(Z);
  a <- Z[,1]
  b <- qunif(U[,2])
  SimulatedCorrelationS[i]<-cor(rank(a),rank(b))
}
plot(TargetCorrelation,SimulatedCorrelationP)
points(TargetCorrelation,SimulatedCorrelationS,pch=20)
lines(c(-1,0),c(-1,0),col=2)
plot(TargetCorrelation,SimulatedCorrelationP)
points(TargetCorrelation,SimulatedCorrelationS,pch=20)
lines(c(-1,0),c(-1,0),col=2)


-Dan

On Wed, Mar 2, 2016 at 3:46 PM, Ben Bolker <bbolker at gmail.com> wrote:

> Ruofei Mo【莫若飞】 <911mruofei <at> tongji.edu.cn> writes:
>
> >
> > Hi, All,
> >
> > I have a question about how to generate correlated data with non-normal
> > distribution? Basic, I have a variable a that follows a normal
> distribution,
> > a ~ N(0,1), then I want to generate another variable b that follows a
> > uniform distribution, b ~ U(0, 1). Most importantly, I want the
> correlation
> > between a and b to be fixed at -.9, cor(a,b) = -.90
> >
> > I tried the following code,
> >
> > ### Correlation matrix rmvnorm() function ###
> >
>
>   I don't know that there's a closed-form solution to this problem.
> Here's an attempt to do it by brute force.  By eyeball, you need to
> set the nominal rho to about -0.92 to get a realized rho of -0.9.
>
> simfun <- function(rho,n=10000) {
>     cormat <- matrix(c(1, rho, rho, 1), ncol = 2)
>     dd <- setNames(data.frame(MASS::mvrnorm(1000, mu=c(0,0),
> Sigma=cormat)),
>                    c("a","trans"))
>     dd <- transform(dd,
>                b=pnorm(trans,mean(trans),sd(trans)))
>     dd[,c("a","b")]
> }
>
> cvec <- seq(-0.999,-0.85,length=51)
> res <- expand.grid(rho=cvec,rep=1:10)
> set.seed(101)
> res$cor <- sapply(res$rho,
>                   function(r) cor(simfun(rho=r,n1e6))[1,2])
>
> par(las=1,bty="l")
> plot(cor~rho,data=res)
> abline(a=0,b=1,col=2)
> abline(h=-0.9,col=4)
> abline(v=-0.92,col=4)
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.




-- 
Dan Dalthorp, PhD
USGS Forest and Rangeland Ecosystem Science Center
Forest Sciences Lab, Rm 189
3200 SW Jefferson Way
Corvallis, OR 97331
ph: 541-750-0953
ddalthorp at usgs.gov

	[[alternative HTML version deleted]]



More information about the R-help mailing list