[R] Matching on multiple columns
Farrel Buchinsky
fjbuch at gmail.com
Thu Jan 11 21:56:52 CET 2007
Am I correct in believing that one cannot match on multiple columns?
One can indeed subset on multiple criteria from different variables
(or columns) but not from unique combinations thereof.
I need to exclude about 10000 rows from 108000 rows of data based on
several unique combinations of identifiers in two columns. Only
merge() seems to be able to do that. Merge would allow me to
positively select but it would not allow me to deselect (or exclude).
Look at how I got around the problem.
It is inelegant. Have a missed a more direct function?
x<-seq(from=1,to=10)
y <-rep(1:2,5)
data<-data.frame(cbind(x,y))
x<-seq(from=1,to=10)
y <-rep(1:5,2)
data1<-data.frame(cbind(x,y))
newdata<-rbind(data,data1)
newdata[10,2] <- 3
exclude<-newdata[18:20,]
#This is a simulation of real life problem
#We now have two dataframes. Newdata is the data set from an
experiment. In real #life it has an additional 9 columns of data
#exclude is a dataframe that was manually created after discovering
some quality #control issues
#Any row in newdata that matches any row in exclude must be discarded
match(newdata$x,exclude$x)# useless because it is only on one column
match(newdata$y,exclude$y)# useless because it is only on one column
newdata$x %in% exclude$x # useless because it is only on one column
newdata$y %in% exclude$y # useless because it is only on one column
newdata$x %in% exclude$x & newdata$y %in% exclude$y # useless,
#eventhough it is using both columns, because it is not
#using them in a synchronous manner. Row 10 in new data should not
have been #marked "TRUE"
#It was only labeled such because the 10 in the x column is indeed in
the exclude x #column and the 3 in the y column is indeed in the
exclude y column but not #together
which(newdata$x %in% exclude$x & newdata$y %in% exclude$y)#also
gets it #wrong
match(newdata,exclude)# intuitively this could have worked but alas
match can #only handle vectors and not dataframes. It cannot match on
multiple columns
#I have to stoop to the inelegant maneuver of creating a combined
variable of the #two columns, albeit only temporarily
paste(newdata$x,newdata$y,sep=":") %in% paste(exclude$x,exclude$y,sep=":")
#or one could do this
match(paste(newdata$x,newdata$y,sep=":"),paste(exclude$x,exclude$y,sep=":"))
#or
{which(paste(newdata$x,newdata$y,sep=":") %in% paste
(exclude$x,exclude$y,sep=":"))}
#Or if you want to use the result in an index term or a selection
argument in a subset command
{paste(newdata$x,newdata$y,sep=":") %in%
paste(exclude$x,exclude$y,sep=":")==FALSE}
#As in
{newdata[paste(newdata$x,newdata$y,sep=":") %in%
paste(exclude$x,exclude$y,sep=":")==FALSE,]}
#or
{subset(newdata,paste(newdata$x,newdata$y,sep=":") %in%
paste(exclude$x,exclude$y,sep=":")==FALSE)}
--
Farrel Buchinsky
More information about the R-help
mailing list