[R] union data in column

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Sat Jul 24 09:23:07 CEST 2010


Fahim Md wrote:
> Is there any function/way to merge/unite the following data
>
>   GENEID      col1          col2             col3                col4
>   G234064         1             0                  0                   0
>   G234064         1             0                  0                   0
>   G234064         1             0                  0                   0
>   G234064         0             1                  0                   0
>   G234065         0             1                  0                   0
>   G234065         0             1                  0                   0
>   G234065         0             1                  0                   0
>   G234065         0             0                  1                   0
>   G234065         0             0                  1                   0
>   G234065         0             0                  0                   1
>
>
> into
> GENEID      col1          col2             col3                col4
>   G234064         1             1                  0                   0
> // 1 appears in col1 and col2 above, rest are zero
>   G234065         0             1                  1                   1
> // 1 appears in col2 , 3 and 4 above.
>
>
> Thank

Warning on terminology: there is a "merge" function in R that lines up 
rows from different tables to make a new set of longer rows (more 
columns). The usual term for combining column values from multiple rows 
is "aggregation".

In addition to the example offered by Jim Holtzman, here are some other 
options in no particular order:

x <- read.table(textConnection(" GENEID col1 col2 col3 col4
G234064 1 0 0 0
G234064 1 0 0 0
G234064 1 0 0 0
G234064 0 1 0 0
G234065 0 1 0 0
G234065 0 1 0 0
G234065 0 1 0 0
G234065 0 0 1 0
G234065 0 0 1 0
G234065 0 0 0 1
"), header=TRUE, as.is=TRUE, row.names=NULL)
closeAllConnections()

# syntactic repackaging of Jim's basic approach
library(plyr)
ddply( x, .(GENEID), function(df) 
{with(as.integer(c(col1=any(col1),col2=any(col2),col3=any(col3),col4=any(col4))))} 
)

# if you are familiar with SQL, this approach allows you merge data 
frames and aggregate rows in the same step, but has a somewhat limited 
range of aggregating functions available
library(sqldf)
sqldf("SELECT GENEID, max(col1) AS col1, max(col2) AS col2, max(col3) AS 
col3, max(col4) AS col4 FROM x GROUP BY GENEID")

# when the data you want to crunch is separate from the key(s) or you 
can separate them yourself
crunch <- function(tmp){as.integer(any(tmp))}
aggregate( x[,c("col1","col2","col3","col4")], by=list(x$GENEID), 
FUN="crunch" )

# this is typically used if you want to aggregate many columns with the 
same set of operations
library(doBy)
summaryBy( .~GENEID, data=x, FUN=c(sum,crunch) )



More information about the R-help mailing list