[R] Generating uniformly distributed correlated data.
(Ted Harding)
ted.harding at wlandres.net
Mon Feb 21 23:58:04 CET 2011
[OOPS! The example I gave before does not correspond to the stated
seed. At the end is a revised example where I have checked that
it does correspond. Sorry.
]
On 21-Feb-11 21:38:11, Ted Harding wrote:
> [See at end]
>
> On 21-Feb-11 17:13:03, Larry Hotchkiss wrote:
>> Hi list:
>> Here is one approach that will generate two uniformly distributed
>> variables with a correlation of 0.5 --
>> [snip]
>>> > -----Urspr?ngliche Nachricht-----
>>> > Von:r-help-bounces at r-project.org
>>> > [mailto:r-help-bounces at r-project.org] Im Auftrag von S??ren Faurby
>>> > Gesendet: Sonntag, 20. Februar 2011 03:18
>>> > An:r-help at r-project.org
>>> > Betreff: [R] Generating uniformly distributed correlated data.
>>> >
>>> > I wish to generate a vector of uniformly distributed data
>>> > with a defined correlation to another vector
>>> >
>>> > The only function I have been able to find doing something
>>> > similar is corgen from the library ecodist.
>>> >
>>> > The following code generates data with the desired
>>> > correlation to the vector x but the resulting vector y is
>>> > normal and not uniform distributed
>>> >
>>> > library(ecodist)
>>> > x<- runif(105)
>>> > y<- corgen(x=x, r=.5)$y
>>> >
>>> > Do anyone know a similar function generating uniform
>>> > distributed data or a way of transforming y to the desired
>>> > distribution while keeping the correlation between x and y
>>> >
>>> > Kind regards, Soren
>
> I have got intrigued by this problem! In his original query,
> Søren Faurby did not specify the original vector 'x' for
> which he wanted to generate uniformly distributed 'y' with
> a given correlation 'r':
>
> "I wish to generate a vector of uniformly distributed
> data with a defined correlation to another vector"
>
> In particular he did not state whether it should have any
> specified distriobution -- he only specified that 'y' should
> be uniformly distributed.
>
> So I'm addressing the general problem in which 'x' may have
> an arbitrary distribution (or be an arbitrary given vector),
> and y is to have a uniform distribution and have a given
> correlation 'r' with 'x'. In the following, I assume r>0.
>
> I have an idea for an algorithm, though not sure how best to
> write a general function for it. It is based on the idea that
> you first sort 'x'; then generate 'y' as runif(length(x))
> and sort 'y'. This generates a vector 'y' with maximal
> correlation with 'x'. Now, you start at the two extreme ends
> of 'y', and swap those two elements of 'y'. If the result
> still exceeds the desired correlation 'x', you move in
> and swap again, and so on. If you drop below the desired
> correlation, then you back-track, and try swapping a pair
> closer to the middle. And so on. Once you get really close,
> but still just above 'r', you switch to working out from
> the middle to fine-tune the result.
>
> The basic idea is to permute the elements of 'y' to get
> as close as possible to 'r'. But one needs to avoid an
> exhaustive search through all permutations!
>
> The following is an example, with length(x)=100 and 'x'
> Normally distributed, and the desired correlation r=0.5,
> was done by hand in about 20 minutes. Only the highlights
> are presented.
>
> set.seed(54321)
> X0 <- rnorm(100) ; Y0 <- runif(100)
> r <- 0.5
>
> X1 <- sort(X0) ; Y1 <- sort(Y0)
>
># First try (goes a bit too far):
> X <- X1 ; Y <- Y1
> N <- length(Y)
> T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
> T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
> T <- Y[3] ; Y[3] <- Y[N-2] ; Y[N-2] <- T # cor(X,Y) = 0.4803957
>
># So back up and skip the (3<->98) swap)
> X <- X1 ; Y <- Y1
> T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
> T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
> T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.5031032
> T <- Y[5] ; Y[5] <- Y[N-4] ; Y[N-4] <- T # cor(X,Y) = 0.400471
>
># Following a few tries at (20,81), (30,71), (40,61):
> X <- X1 ; Y <- Y1
> T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
> T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
> T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.5031032
> T <- Y[45]; Y[45]<- Y[N-44]; Y[N-44]<- T # cor(X,Y) = 0.5017024
>
># Now grope more finely:
> X <- X1 ; Y <- Y1
> T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
> T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
> T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.5031032
> T <- Y[43]; Y[43]<- Y[N-42]; Y[N-42]<- T # cor(X,Y) = 0.5001594
>
># And again:
> X <- X1 ; Y <- Y1
> T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.7864082
> T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6210372
> T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.5031032
> T<- Y[43] ; Y[43]<- Y[N-42]; Y[N-42]<- T # cor(X,Y) = 0.5001594
> T<- Y[49] ; Y[49]<- Y[N-48]; Y[N-48]<- T # cor(X,Y) = 0.500079
> T<- Y[50] ; Y[50]<- Y[N-49]; Y[N-49]<- T # cor(X,Y) = 0.5000634
>
>### And this is probably as close as one can get. Pretty close!
>
> Now one can plot(X,Y). Maybe the resulting cross-shape (more
> or less inevitable from such a procedure) is undesirable.
> But at least the desiderata stated in the query have been met.
>
> Ted.
> --------------------------------------------------------------------
set.seed(54321)
X0 <- rnorm(100) ; Y0 <- runif(100)
r <- 0.5
X1 <- sort(X0) ; Y1 <- sort(Y0)
X <- X1 ; Y <- Y1
N <- length(Y)
T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.8231312
T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6779782
T <- Y[3] ; Y[3] <- Y[N-2] ; Y[N-2] <- T # cor(X,Y) = 0.5470702
T <- Y[4] ; Y[4] <- Y[N-3] ; Y[N-3] <- T # cor(X,Y) = 0.4284003
X <- X1 ; Y <- Y1
T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.8231312
T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6779782
T <- Y[3] ; Y[3] <- Y[N-2] ; Y[N-2] <- T # cor(X,Y) = 0.5470702
T <- Y[18]; Y[18]<- Y[N-18]; Y[N-18]<- T # cor(X,Y) = 0.5062198
T <- Y[25]; Y[25]<- Y[N-24]; Y[N-24]<- T # cor(X,Y) = 0.4819134
X <- X1 ; Y <- Y1
T <- Y[1] ; Y[1] <- Y[N-0] ; Y[N-0] <- T # cor(X,Y) = 0.8231312
T <- Y[2] ; Y[2] <- Y[N-1] ; Y[N-1] <- T # cor(X,Y) = 0.6779782
T <- Y[3] ; Y[3] <- Y[N-2] ; Y[N-2] <- T # cor(X,Y) = 0.5470702
T <- Y[18]; Y[18]<- Y[N-18]; Y[N-18]<- T # cor(X,Y) = 0.5062198
T <- Y[40]; Y[40]<- Y[N-39]; Y[N-39]<- T # cor(X,Y) = 0.5018108
T <- Y[45]; Y[45]<- Y[N-44]; Y[N-44]<- T # cor(X,Y) = 0.5007854
T <- Y[46]; Y[46]<- Y[N-45]; Y[N-45]<- T # cor(X,Y) = 0.5001801
T <- Y[49]; Y[49]<- Y[N-48]; Y[N-48]<- T # cor(X,Y) = 0.5001178
T <- Y[50]; Y[50]<- Y[N-49]; Y[N-49]<- T # cor(X,Y) = 0.5001087
Once again, the above shows only a selection of the intermediate
results (various local probings omitted). OInce again done by
hand in about 20 minutes.
Apologies for the mistake.
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at wlandres.net>
Fax-to-email: +44 (0)870 094 0861
Date: 21-Feb-11 Time: 22:58:00
------------------------------ XFMail ------------------------------
More information about the R-help
mailing list