[R] Sorting Data Frame Without Loop

jim holtman jholtman at gmail.com
Fri Jun 10 19:12:46 CEST 2011


Is this what you want:

> colA <- c("M-012847-00" ,"M-012847-00" ,"M-015544-00", "M-015544-00", "M-024202-00","M-024202-00")
> colB <- c("NM_057175", "NM_057175", "NM_153008", "NM_153027", "NM_207330", "NM_207330")
> colC <- c( 'NARG1', 'TBDN100', 'FLJ30277', 'FLJ31659', 'NIPAL1', 'NPAL1')
>
> dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC)
> require(sqldf)
> sqldf("
+     select ProteinID
+             , group_concat(GeneID) as GeneID
+             , group_concat(GeneName) as GeneName
+         from dupes
+         group by ProteinID
+ ", method = 'raw')
    ProteinID              GeneID          GeneName
1 M-012847-00 NM_057175,NM_057175     NARG1,TBDN100
2 M-015544-00 NM_153008,NM_153027 FLJ30277,FLJ31659
3 M-024202-00 NM_207330,NM_207330      NIPAL1,NPAL1
>


On Fri, Jun 10, 2011 at 4:35 AM, hi Berven <theberve at hotmail.com> wrote:
>
> Hello all!
>
> I am currently trying to sort a data frame in a particular way, but I am having some difficulties with this. Specifically I want to sort the below dataset in such a way that there is only one line per ProteinID and if there are multiple GeneID or GeneName entries for a single proteinID, that they be concatenated with a comma separating them. The way I have done it earlier worked fine for small datasets, but as I am working with around 30,000 entries, it proved too slow and I'm not sure how to do it in another way.
> Here is an example of the input.
>
>
>
>
>
> ProteinID
>
> GeneID
>
> GeneName
>
>
> M-012847-00
>
> NM_057175
>
> NARG1
>
>
> M-012847-00
>
> NM_057175
>
> TBDN100
>
>
> M-015544-00
>
> NM_153008
>
> FLJ30277
>
>
> M-015544-00
>
> NM_153027
>
> FLJ31659
>
>
> M-024202-00
>
> NM_207330
>
> NIPAL1
>
>
> M-024202-00
>
> NM_207330
>
> NPAL1
>
> Here is an example showing what I want:
>
>
>
>
> ProteinID
>
> GeneID
>
> GeneName
>
>
> M-012847-00
>
> NM_057175
>
> NARG1, TBDN100
>
>
> M-015544-00
>
> NM_153008, NM_153027
>
> FLJ30277, FLJ31659
>
>
> M-024202-00
>
> NM_207330
>
> NIPAL1, NPAL1
>
> Here is the code I have been using so far. I have only managed to get this to work by using a loop, which I know is not the best way, but at the moment I'm stuck.
>
> colA <- c("M-012847-00" ,"M-012847-00" ,"M-015544-00", "M-015544-00", "M-024202-00","M-024202-00")
> colB <- c("NM_057175", "NM_057175", "NM_153008", "NM_153027", "NM_207330", "NM_207330")
> colC <- c( 'NARG1', 'TBDN100', 'FLJ30277', 'FLJ31659', 'NIPAL1', 'NPAL1')
>
> dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC)
>
> idVec <- character()
> geneIDVec <- character()
> geneNameVec <- character()
> dataType <- "ProteinID"
> annotation <- data.frame()
>
> for (id in unique(dupes[[dataType]])) {
>       print (id)
>       idVec <- c(idVec, id)
>       geneIDVec <- c(geneIDVec, paste(unique(dupes$GeneID[dupes[[dataType]] == id]), collapse=", "))
>       geneNameVec <- c(geneNameVec, paste(unique(dupes$GeneName[dupes[[dataType]] == id]), collapse=", "))
>       annotation[[dataType]][annotation[[dataType]] == id] <- NA
> }
> filtered <- data.frame(ProteinID=idVec, GeneID=geneIDVec, GeneName=geneNameVec)
>
>
> Thanks!
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?



More information about the R-help mailing list