[R] Generating uniformly distributed correlated data.

Mon Feb 21 23:58:04 CET 2011

[OOPS! The example I gave before does not correspond to the stated
 seed. At the end is a revised example where I have checked that
 it does correspond. Sorry.
]

On 21-Feb-11 21:38:11, Ted Harding wrote:
> [See at end]
> 
> On 21-Feb-11 17:13:03, Larry Hotchkiss wrote:
>>   Hi list:
>> Here is one approach that will generate two uniformly distributed 
>> variables with a correlation of  0.5 --
>> [snip]
>>> >  -----Urspr?ngliche Nachricht-----
>>> >  Von:r-help-bounces at r-project.org
>>> >  [mailto:r-help-bounces at r-project.org] Im Auftrag von S??ren Faurby
>>> >  Gesendet: Sonntag, 20. Februar 2011 03:18
>>> >  An:r-help at r-project.org
>>> >  Betreff: [R] Generating uniformly distributed correlated data.
>>> >  
>>> >  I wish to generate a vector of uniformly distributed data
>>> >  with a defined correlation to another vector
>>> >  
>>> >  The only function I have been able to find doing something
>>> >  similar is corgen from the library ecodist.
>>> >  
>>> >  The following code generates data with the desired
>>> >  correlation to the vector x but the resulting vector y is
>>> >  normal and not uniform distributed
>>> >  
>>> >  library(ecodist)
>>> >  x<- runif(105)
>>> >  y<- corgen(x=x, r=.5)$y
>>> >  
>>> >  Do anyone know a similar function generating uniform
>>> >  distributed data or a way of transforming y to the desired
>>> >  distribution while keeping the correlation between x and y
>>> >  
>>> >  Kind regards, Soren
> 
> I have got intrigued by this problem! In his original query,
> Søren Faurby did not specify the original vector 'x' for
> which he wanted to generate uniformly distributed 'y' with
> a given correlation 'r':
> 
>   "I wish to generate a vector of uniformly distributed
>    data with a defined correlation to another vector"
> 
> In particular he did not state whether it should have any
> specified distriobution -- he only specified that 'y' should
> be uniformly distributed.
> 
> So I'm addressing the general problem in which 'x' may have
> an arbitrary distribution (or be an arbitrary given vector),
> and y is to have a uniform distribution and have a given
> correlation 'r' with 'x'. In the following, I assume r>0.
> 
> I have an idea for an algorithm, though not sure how best to
> write a general function for it. It is based on the idea that
> you first sort 'x'; then generate 'y' as runif(length(x))
> and sort 'y'. This generates a vector 'y' with maximal
> correlation with 'x'. Now, you start at the two extreme ends
> of 'y', and swap those two elements of 'y'. If the result
> still exceeds the desired correlation 'x', you move in
> and swap again, and so on. If you drop below the desired
> correlation, then you back-track, and try swapping a pair
> closer to the middle. And so on. Once you get really close,
> but still just above 'r', you switch to working out from
> the middle to fine-tune the result.
> 
> The basic idea is to permute the elements of 'y' to get
> as close as possible to 'r'. But one needs to avoid an
> exhaustive search through all permutations!
> 
> The following is an example, with length(x)=100 and 'x'
> Normally distributed, and the desired correlation r=0.5,
> was done by hand in about 20 minutes. Only the highlights
> are presented.
> 
> set.seed(54321)
> X0 <- rnorm(100) ; Y0 <- runif(100)
> r <- 0.5
> 
> X1 <- sort(X0) ; Y1 <- sort(Y0)
> 
># First try (goes a bit too far):
> X <- X1 ; Y <- Y1
> N <- length(Y)
> T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
> T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
> T <- Y[3] ; Y[3] <- Y[N-2] ; Y[N-2] <- T # cor(X,Y) = 0.4803957
> 
># So back up and skip the (3<->98) swap)
> X <- X1 ; Y <- Y1
> T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
> T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
> T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.5031032
> T <- Y[5] ; Y[5] <- Y[N-4] ; Y[N-4] <- T # cor(X,Y) =  0.400471
> 
># Following a few tries at (20,81), (30,71), (40,61):
> X <- X1 ; Y <- Y1
> T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
> T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
> T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.5031032
> T <- Y[45]; Y[45]<- Y[N-44]; Y[N-44]<- T # cor(X,Y) = 0.5017024
> 
># Now grope more finely:
> X <- X1 ; Y <- Y1
> T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
> T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
> T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.5031032
> T <- Y[43]; Y[43]<- Y[N-42]; Y[N-42]<- T # cor(X,Y) = 0.5001594
> 
># And again:
> X <- X1 ; Y <- Y1
> T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
> T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
> T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.5031032
> T<- Y[43] ; Y[43]<- Y[N-42]; Y[N-42]<- T # cor(X,Y) = 0.5001594
> T<- Y[49] ; Y[49]<- Y[N-48]; Y[N-48]<- T # cor(X,Y) = 0.500079
> T<- Y[50] ; Y[50]<- Y[N-49]; Y[N-49]<- T # cor(X,Y) = 0.5000634
> 
>### And this is probably as close as one can get. Pretty close!
> 
> Now one can plot(X,Y). Maybe the resulting cross-shape (more
> or less inevitable from such a procedure) is undesirable.
> But at least the desiderata stated in the query have been met.
> 
> Ted.
> --------------------------------------------------------------------

set.seed(54321)
X0 <- rnorm(100) ; Y0 <- runif(100)
r <- 0.5

X1 <- sort(X0) ; Y1 <- sort(Y0)

X <- X1 ; Y <- Y1
N <- length(Y)
T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.8231312
T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6779782
T <- Y[3] ; Y[3] <- Y[N-2] ; Y[N-2] <- T # cor(X,Y) = 0.5470702
T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.4284003

X <- X1 ; Y <- Y1
T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.8231312
T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6779782
T <- Y[3] ; Y[3] <- Y[N-2] ; Y[N-2] <- T # cor(X,Y) = 0.5470702
T <- Y[18]; Y[18]<- Y[N-18]; Y[N-18]<- T # cor(X,Y) = 0.5062198
T <- Y[25]; Y[25]<- Y[N-24]; Y[N-24]<- T # cor(X,Y) = 0.4819134

X <- X1 ; Y <- Y1
T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.8231312
T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6779782
T <- Y[3] ; Y[3] <- Y[N-2] ; Y[N-2] <- T # cor(X,Y) = 0.5470702
T <- Y[18]; Y[18]<- Y[N-18]; Y[N-18]<- T # cor(X,Y) = 0.5062198
T <- Y[40]; Y[40]<- Y[N-39]; Y[N-39]<- T # cor(X,Y) = 0.5018108
T <- Y[45]; Y[45]<- Y[N-44]; Y[N-44]<- T # cor(X,Y) = 0.5007854
T <- Y[46]; Y[46]<- Y[N-45]; Y[N-45]<- T # cor(X,Y) = 0.5001801
T <- Y[49]; Y[49]<- Y[N-48]; Y[N-48]<- T # cor(X,Y) = 0.5001178
T <- Y[50]; Y[50]<- Y[N-49]; Y[N-49]<- T # cor(X,Y) = 0.5001087

Once again, the above shows only a selection of the intermediate
results (various local probings omitted). OInce again done by
hand in about 20 minutes.

Apologies for the mistake.
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at wlandres.net>
Fax-to-email: +44 (0)870 094 0861
Date: 21-Feb-11                                       Time: 22:58:00
------------------------------ XFMail ------------------------------