[R] Deleting duplicate rows in a matrix at random

Thu Jun 3 20:04:50 CEST 2010

 > I need to remove all but one of each [row in a matrix], which
 > must be chosen at random.

This request (included in full at the bottom) has gone unanswered for a 
while, but I had the same problem and ended up writing a function to 
solve it. I call it "duplicated.random()". It does exactly the same 
thing as the "duplicated()" function, except that the choice of which 
of the duplicated observations gets a FALSE in the result is random, 
rather than always being the first. There is no way to specify 
selection probabilities; each duplicated observation is equally likely 
to be chosen.

The implementation permutes the original using "sample()", runs 
"duplicated()" on the permuted copy, and finally reverses the 
permutation on the result. The randomization should therefore have the 
same properties as sample(), probably including reproducibility when 
the random seed is set (although I haven't tested that explicitly).
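
As a small sketch of the three steps on a toy vector (the values and 
the seed 42 are arbitrary choices for illustration, not part of the 
function), including a check that resetting the seed repeats the result:

```r
x = c("a", "b", "a", "c", "b", "a")

set.seed(42)                          # arbitrary seed, only for repeatability
perm     = sample(seq_along(x))       # 1. random permutation of the indices
dup.perm = duplicated(x[perm])        # 2. duplicated() on the shuffled copy
dup      = dup.perm[order(perm)]      # 3. order(perm) undoes the permutation
x[!dup]                               # one survivor per distinct value

# Repeat with the same seed and compare
set.seed(42)
perm2 = sample(seq_along(x))
dup2  = duplicated(x[perm2])[order(perm2)]
identical(dup, dup2)                  # TRUE
```

order(perm) works as the inverse because perm[order(perm)] is just 
1, 2, ..., length(x), so each flag is sent back to the position its 
value had in x.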

The function and some test code are included below. It handles vectors 
and matrices for now, but adding other data structures that are handled 
correctly by duplicated() should be a simple matter of ensuring that the 
indexing is handled correctly in the permutation process. If anyone 
makes any improvements to the function, I'd be grateful to be notified.

#############################################################

# This function returns a logical vector with one element per element
# (or row) of x. Within each set of duplicated values, all but one
# instance is marked TRUE; every other element is FALSE.
# The only difference between this function and the duplicated()
# function is that rather than always returning FALSE for the first
# instance of a duplicated value, the choice of instance is random.
duplicated.random = function(x, incomparables = FALSE, ...)
{
     if ( is.vector(x) )
     {
         # Sampling an index vector also behaves for length 0 and 1
         permutation = sample(seq_along(x))
         x.perm      = x[permutation]
         result.perm = duplicated(x.perm, incomparables, ...)
         # order(permutation) is the inverse of the permutation
         result      = result.perm[order(permutation)]
         return(result)
     }
     else if ( is.matrix(x) )
     {
         permutation = sample(seq_len(nrow(x)))
         # drop = FALSE keeps x.perm a matrix even with one row or column
         x.perm      = x[permutation, , drop = FALSE]
         result.perm = duplicated(x.perm, incomparables, ...)
         result      = result.perm[order(permutation)]
         return(result)
     }
     else
     {
         stop(paste("duplicated.random() only supports vectors",
                    "and matrices for now."))
     }
}

#############################################################

# Test code for vector case
x = sample(1:5, 10, replace = TRUE)
d = duplicated(x)
r = duplicated.random(x)
cbind(x,d,r)
x[!d]
x[!r]

# Test code for matrix case
x = matrix(sample(1:2, 30, replace = TRUE), ncol=3)
d = duplicated(x)
r = duplicated.random(x)
cbind(x,d,r)

#############################################################
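
For the gene table in the question below, the duplicates are in the 
title column only, so the flags would be computed on that column and 
then used to subset whole rows. A rough sketch, assuming the data sits 
in a data frame; the column names "title" and "pvalue", the repeated 
titles and all the P-values are made up for illustration:

```r
# Toy data: gene titles borrowed from the question, duplicates and
# P-values invented purely for illustration.
genes = data.frame(
    title  = c("KCTD12", "UNC93A", "KCTD12", "CDKN3", "UNC93A", "KCTD12"),
    pvalue = c(1e-22, 2e-22, 3e-21, 4e-21, 5e-20, 6e-20),
    stringsAsFactors = FALSE
)

# Same permute / duplicated() / un-permute idea, applied to one column
perm = sample(seq_len(nrow(genes)))
dup  = duplicated(genes$title[perm])[order(perm)]

# Keep one randomly chosen row per gene title, in the original row order
genes.unique = genes[!dup, ]
genes.unique
```

With duplicated.random() loaded, the two lines computing perm and dup 
should reduce to dup = duplicated.random(genes$title).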

On 3/24/2010 11:44 AM, jeff.m.ewers wrote:
>
> Hello,
>
> I am relatively new to R, and I've run into a problem formatting my data for
> input into the package RankAggreg.
>
> I have a matrix of gene titles and P-values (weights) in two columns:
>
> KCTD12	4.06904E-22
> UNC93A	9.91852E-22
> CDKN3	1.24695E-21
> CLEC2B	4.71759E-21
> DAB2	1.12062E-20
> HSPB1	1.23125E-20
> ...
>
> The data contains many, many duplicate gene titles, and I need to remove all
> but one of each, which must be chosen at random. I have looked for quite
> some time, and I've been unable to find a way to do this. Any help would be
> greatly appreciated!
>
> Thanks,
>
> Jeff