[R] elimination duplicate elements sampling!

Tue Jul 12 19:25:30 CEST 2011

On 7/7/2011 3:23 PM, elephann wrote:
> Hi everyone!
> I have a data frame with 1112 time series, and I want to randomly
> sample r series z times to build portfolios of a given size (an
> r-security portfolio). For r=2 and z=10000, that is:
> z=10000
> A=seq(1:1112)
> x1=sample(A,z,replace =TRUE)
> x2=sample(A,z,replace =TRUE)
> M=cbind(x1,x2) # combination of 2 series
> A portfolio with x1[i]=x2[i] (i=1,2,...,10000) is a one-security
> portfolio, not a two-security one, so it should be eliminated and
> resampled. As r increases, for example r=k, how do I efficiently
> eliminate all portfolios such as x1[i]=x2[i]=...=xk[i]?

Why not sample r securities without replacement, and replicate that 
z times?

z <- 10000 # number of replicates
r <- 2 # number in each replicate
A <- 1:1112 # space to sample from

M <- t(replicate(z, sample(A, r)))
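
As a quick sanity check, here is a small sketch that confirms no row of M 
contains the same security twice.  The seed and the names has_repeat and 
any_repeat are my own additions for illustration; they are not part of 
the original question.

```r
set.seed(1)                  # assumption: seed chosen only for reproducibility
z <- 10000                   # number of replicates
r <- 2                       # number in each replicate
A <- 1:1112                  # space to sample from

# sample() draws without replacement by default, so each row is r distinct values
M <- t(replicate(z, sample(A, r)))

# TRUE for any row in which some security appears more than once
has_repeat <- apply(M, 1, function(row) anyDuplicated(row) > 0)
any_repeat <- any(has_repeat)   # should be FALSE
```

Because sample() defaults to replace = FALSE, no elimination or 
resampling step is needed for any r up to length(A).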

> Besides, r-security portfolios with the same combination of securities
> are the same portfolio (given the same weights, as here), e.g.
> M(x1[i],x5[i],x7[i],x1000[i]) and M(x5[i],x7[i],x1[i],x1000[i]) or
> M(x1[i],x7[i],x5[i],x1000[i]) are the same. How do I efficiently eliminate
> these possibilities?

Do you mean you don't want any of the replicates to be the same?  You 
can eliminate duplicates by sorting each replicate, so that the order no 
longer matters, and then dropping repeated rows:

M <- t(replicate(z, sort(sample(A, r))))
M <- M[!duplicated(M),]
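
To see why the sort() matters, here is a toy sketch.  The matrix P and 
the other names are made up for illustration; the values mirror the 
orderings in the question.

```r
# Toy data: rows 1 and 2 hold the same securities in a different order
P <- rbind(c(5, 7, 1, 1000),
           c(1, 7, 5, 1000),
           c(2, 3, 4, 5))

dup_raw    <- duplicated(P)          # rows compared as-is: nothing is flagged
P_sorted   <- t(apply(P, 1, sort))   # put every row in ascending order
dup_sorted <- duplicated(P_sorted)   # row 2 is now flagged as a repeat of row 1

# Keep only the first occurrence of each portfolio
U <- P_sorted[!dup_sorted, , drop = FALSE]
```

duplicated() compares whole rows element by element, so two portfolios 
only match once their securities are in the same order.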

Or you can create all possible portfolios of size r and sample z of 
them without replacement, which does it in one pass:

cmb <- t(combn(A, r))
M <- cmb[sample(nrow(cmb), z),]

Note that this is not practical for r > 2.  combn(A, r) returns a 
matrix of size r by choose(length(A), r) (which is 2 x 617716 in this 
case).  For r = 3 it is 3 x 228554920, and for r greater than 3 this 
won't even work with the 1112 sample space.  But for the three-security 
case, the probability of getting a duplicate portfolio is small.
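
To put rough numbers on this, here is a small sketch.  The names n_comb 
and expected_dup_pairs are mine, and the calculation assumes every 
portfolio of size r is equally likely on each draw, which is what 
sort(sample(A, r)) gives.

```r
n <- 1112                    # size of the sample space
z <- 10000                   # number of portfolios drawn

# Number of distinct portfolios of size r = 2, 3, 4
n_comb <- choose(n, 2:4)

# Expected number of pairs of draws that coincide when z portfolios are
# drawn independently and uniformly from n_comb possibilities
# (the usual birthday-problem count: choose(z, 2) pairs, each matching
# with probability 1 / n_comb)
expected_dup_pairs <- choose(z, 2) / n_comb
```

With these values the expected number of coinciding pairs is about 81 
for r = 2 and about 0.2 for r = 3, so duplicates are routine with two 
securities and rare with three.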

A better approach is to sample a few extra, so that enough remain after 
throwing out duplicates:

M <- t(replicate(ceiling(1.01*z), sort(sample(A, r))))
M <- M[!duplicated(M),][1:z,]

The 1.01 multiplier may not be big enough; there is no multiplier that 
will guarantee that you will have z samples when you are done.  However, 
the second line will throw an error if there are fewer than z unique 
samples, so a shortfall will not go unnoticed.
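
If you would rather not guess a multiplier at all, one possible variant 
is to keep topping up until z unique rows remain.  This is only a sketch: 
the helper draw() is a name I made up, and it assumes r >= 2 and 
z <= choose(length(A), r) (otherwise the loop can never finish).

```r
set.seed(42)                 # assumption: seed chosen only for reproducibility
z <- 10000                   # number of unique portfolios wanted
r <- 2                       # number in each portfolio
A <- 1:1112                  # space to sample from

# Hypothetical helper: draw n sorted portfolios of size r, one per row
draw <- function(n) t(replicate(n, sort(sample(A, r))))

M <- draw(z)
M <- M[!duplicated(M), , drop = FALSE]
while (nrow(M) < z) {
  # draw only the shortfall, append it, and drop any new duplicates
  M <- rbind(M, draw(z - nrow(M)))
  M <- M[!duplicated(M), , drop = FALSE]
}
```

The drop = FALSE keeps M a matrix even if only one row survives, and 
the loop stops with exactly z rows because each pass draws only the 
number still missing.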

-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University