[R] Removing duplicated rows within a matrix, with missing data as wildcards
Dimitris Rizopoulos
dimitris.rizopoulos at med.kuleuven.be
Fri Mar 9 16:14:29 CET 2007
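You could also try something like the following:

x <- matrix(c(1, NA, 3, NA, 2, 3, 1, 3, 2, 2, 1, 3, 1, NA, 2, 2, 1, 3),
            ncol = 3, byrow = TRUE)
wildcardVals <- 1:3 # possible wildcard values
ind <- complete.cases(x) # TRUE for rows without NA
nc <- ncol(x)
nr <- sum(ind) # number of complete rows
nwld <- length(wildcardVals)
# expand each incomplete row into nwld candidate rows, one per wildcard
# value (if a row has several NAs, they all receive the same value)
posb <- apply(x[!ind, , drop = FALSE], 1, function(y) {
    out <- matrix(y, nwld, nc, byrow = TRUE)
    out[, is.na(y)] <- wildcardVals
    t(out)
})
posb <- matrix(c(posb), ncol = nc, byrow = TRUE)
# despite its name, keep.ind flags the rows to drop
keep.ind <- duplicated(rbind(x[ind, , drop = FALSE], posb))
# if any candidate of an incomplete row is a duplicate, flag all of them
keep.ind[-(1:nr)] <- apply(matrix(keep.ind[-(1:nr)], ncol = nwld, byrow = TRUE),
    1, function(x) if (any(x)) rep(TRUE, length(x)) else x)
out <- rbind(x[ind, , drop = FALSE],
    matrix(rep(x[!ind, ], each = nwld), ncol = nc))
unique(out[!keep.ind, , drop = FALSE])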
I hope it works ok.
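
A pairwise variant of the same idea can be sketched as follows. It is only a sketch under stated assumptions: complete rows take priority, incomplete rows are visited in their original order, and an NA is taken to match any value, so no list of wildcard values is needed. The helper names compatible() and removeWildcardDups() are illustrative and do not come from this thread.

```r
# Illustrative helpers; the names are not from this thread.
# Two rows match if every position is equal or NA in at least one of them.
compatible <- function(a, b) all(is.na(a) | is.na(b) | a == b)

removeWildcardDups <- function(x) {
    ind <- complete.cases(x)
    # start from the complete rows, exact duplicates removed
    kept <- unique(x[ind, , drop = FALSE])
    for (i in which(!ind)) {
        # keep an incomplete row only if it matches no row kept so far
        hit <- any(vapply(seq_len(nrow(kept)),
            function(k) compatible(kept[k, ], x[i, ]), logical(1)))
        if (!hit) kept <- rbind(kept, x[i, ])
    }
    kept
}

x <- matrix(c(1, NA, 3, NA, 2, 3, 1, 3, 2, 2, 1, 3, 1, NA, 2, 2, 1, 3),
            ncol = 3, byrow = TRUE)
res <- removeWildcardDups(x)
res
```

For the example x this leaves the three rows (1, 3, 2), (2, 1, 3) and (1, NA, 3). The loop compares each incomplete row against every row kept so far, so it may be slow for very large matrices.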
Best,
Dimitris
----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven
Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://med.kuleuven.be/biostat/
http://www.student.kuleuven.be/~m0390867/dimitris.htm
----- Original Message -----
From: "stacey thompson" <stacey.lee.thompson at gmail.com>
To: <hpages at fhcrc.org>; <r-help at stat.math.ethz.ch>
Cc: <petr.pikal at precheza.cz>
Sent: Friday, March 09, 2007 3:09 PM
Subject: Re: [R] Removing duplicated rows within a matrix,with missing
data as wildcards
> Hi H.,
>
> Your response has improved the clarity of my thinking. Kind thanks.
> Also, your use of seq_len() prompted me to update from R version 2.3.1
> on this machine.
>
> For your matrix
>
> > x <- matrix(c(1, NA, 3, NA, 2, 3), ncol=3, byrow=TRUE)
> > x
>      [,1] [,2] [,3]
> [1,]    1   NA    3
> [2,]   NA    2    3
>
> I would want to delete either x[1,] or x[2,] but not both.
> Practically, your "removeLooseDupRows(x)"
>
> removeLooseDupRows <- function(x)
> {
> if (nrow(x) <= 1)
> return(x)
> ii <- do.call("order",
> args=lapply(seq_len(ncol(x)),
> function(col) x[ , col]))
> dup_index <- logical(nrow(x))
> i0 <- -1
> for (k in 1:length(ii)) {
> i <- ii[k]
> if (any(is.na(x[i, ]))) {
> if (i0 == -1)
> next
> if (any(x[i, ] != x[i0, ], na.rm=TRUE))
> next
> dup_index[i] <- TRUE
> } else {
> i0 <- i
> }
> }
> x[!dup_index, ]
> }
>
> should leave no such ambiguous cases for my data, as the nrow(x) are
> very high with few NA in each x. For example, a row of (1, 2, 3) is
> very likely to exist in my data.
>
> However, to find the row numbers of any remaining ambiguous matches,
> should they exist, using example:
>
>> x <- matrix(c(1, NA, 3, NA, 2, 3, 1, 3, 2, 2, 1, 3, 1, NA, 2, 2, 1,
>> 3), ncol=3, byrow=TRUE)
>> x
>      [,1] [,2] [,3]
> [1,]    1   NA    3
> [2,]   NA    2    3
> [3,]    1    3    2
> [4,]    2    1    3
> [5,]    1   NA    2
> [6,]    2    1    3
>
> after your suggested
>
>> removeLooseDupRows(x)
>      [,1] [,2] [,3]
> [1,]    1   NA    3
> [2,]   NA    2    3
> [3,]    1    3    2
> [4,]    2    1    3
> [5,]    2    1    3
>
>> q <- removeLooseDupRows(unique(x))
>> q
>      [,1] [,2] [,3]
> [1,]    1   NA    3
> [2,]   NA    2    3
> [3,]    1    3    2
> [4,]    2    1    3
>
> I could
>
>> # ambiguous matches in matrix form
>> apply(q, 1, function(row1) apply(q, 1, function(row2)
>> all(is.na(row1) | is.na(row2) | row1==row2)))
>
>       [,1]  [,2]  [,3]  [,4]
> [1,]  TRUE  TRUE FALSE FALSE
> [2,]  TRUE  TRUE FALSE FALSE
> [3,] FALSE FALSE  TRUE FALSE
> [4,] FALSE FALSE FALSE  TRUE
>
>> # indices of ambiguous matches
>> m <- which(apply(q, 1, function(row1) apply(q, 1, function(row2)
>> all(is.na(row1) | is.na(row2) | row1==row2))), arr=T)
>> m
>      row col
> [1,]   1   1
> [2,]   2   1
> [3,]   1   2
> [4,]   2   2
> [5,]   3   3
> [6,]   4   4
>
>> #put in order and omit duplicates
>> m2 <- unique(t(apply(m, 1, sort)))
>> m2
>      [,1] [,2]
> [1,]    1    1
> [2,]    1    2
> [3,]    2    2
> [4,]    3    3
> [5,]    4    4
>
>> # show the ambiguous matches
>> m2[m2[,1]!=m2[,2], , drop=F]
>      [,1] [,2]
> [1,]    1    2
>
> ...and proceed from there.
>
> This solution came from another helpful "R-help" respondent to my
> poorly-defined problem.
>
> Appreciative thanks to everyone for your instructive help.
>
> Cheers,
> stacey
>
> --
> -stacey lee thompson-
> Stagiaire post-doctorale
> Institut de recherche en biologie végétale
> Université de Montréal
> 4101 Sherbrooke Est
> Montréal, Québec H1X 2B2 Canada
> stacey.thompson at umontreal.ca