[BioC] clustering genes in GO categories
Martin Morgan
mtmorgan at fhcrc.org
Wed Aug 29 15:17:53 CEST 2012
On 08/29/2012 12:34 AM, Assa Yeroslaviz wrote:
> Hello bioC users,
>
> as you can see below, this was posted over a year ago. Unfortunately, I
> tried the same thing today and for some mysterious reason it no longer
> works correctly.
> What I have is the same data.frame:
>> dat
>             id flybasename_gene flybase_gene_id entrezgene
> 1 1616608_a_at             Gpdh     FBgn0001128      33824
> 2 1622892_s_at          CG33057     FBgn0053057     318833
> 3 1622892_s_at            mkg-p     FBgn0035889      38955
> 4   1622893_at              IM3     FBgn0040736      50209
> 5   1622894_at          CG15120     FBgn0034454      37248
>
>   GOMF
> 1 carboxylesterase activity:hydrolase activity:3',5'-cyclic-nucleotide phosphodiesterase activity:protein binding:
> 2 nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity
> 3 nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium transmembrane transporter activity
> 4 aminopeptidase activity:metalloexopeptidase activity:hydrolase activity:manganese ion binding
> 5 protein binding
>
> What I would like to have is a second data frame with the GO categories as
> row names and the gene IDs placed in each of the GO categories they
> belong to, like this:
>
>
> GO                                           genes
> protein binding                              FBgn0001128 FBgn0053057 FBgn0035889 etc.
> ammonium transmembrane transporter activity  FBgn0053057 FBgn0035889
> hydrolase activity                           FBgn0040736 FBgn0001128
>
>
> Below is the script I used before, and as far as I can remember it
> worked very well:
>
>
> lst <- tapply(1:nrow(dat), dat$flybase_gene_id, function(x) dat[x,"GOMF"])
> lst2 <- lapply(lst, function(x) unlist(strsplit(as.character(x), ":")))
>
> unlst <- cbind(rep(names(lst2), sapply(lst2, length)), unlist(lst2,
> use.names = FALSE))
> done <- tapply(1:nrow(unlst), unlst[,2], function(x) unlst[x,1])
> done_df <- lapply(done, paste, collapse = ",")
> out <- data.frame(GO = names(done_df), FBgn = unlist(done_df))
>
> But the result I am getting contains not the GO categories but a numbered
> list of the gene IDs, which looks like this:
>
>> out
> GO FBgn
> 1 1 FBgn0040736
> 2 2 FBgn0001128
> 3 3 FBgn0035889,FBgn0053057
> 4 4 FBgn0034454
GOMF is probably a factor now, whereas it used to be a character vector.
Convert it back before running your script:

    dat$GOMF <- as.character(dat$GOMF)

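
To illustrate why a factor could lead to a numbered GO column: a factor stores each distinct string as an integer code, and those codes can surface when the values are simplified or combined. The snippet below is only a minimal sketch; `gomf` is a hypothetical two-element stand-in for dat$GOMF, not the poster's data.

```r
# Hypothetical stand-in for dat$GOMF, stored as a factor
gomf <- factor(c("protein binding", "ATP binding:protein binding"))

codes  <- as.integer(gomf)     # internal integer codes, not the GO terms
labels <- as.character(gomf)   # the original strings

# strsplit() requires character input, so split the labels
terms <- strsplit(labels, ":", fixed = TRUE)

print(codes)
print(terms)
```

Converting with as.character() before any splitting or grouping keeps the GO terms as text throughout.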
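Here's a different code chunk, using Biobase::reverseSplit:

    library(Biobase)
    ## named list: gene id -> vector of GO terms (GOMF must be character)
    map <- with(dat, strsplit(setNames(GOMF, flybase_gene_id), ":"))
    ## invert to GO term -> gene ids, then collapse each to one string
    revmap <- sapply(reverseSplit(map), paste, collapse=",")
    data.frame(GO=names(revmap), FBgn=as.vector(revmap))
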
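
If Biobase is not available, the same inversion can be sketched in base R alone. This is an illustrative alternative under stated assumptions, not the code from the thread: the toy `dat` below uses made-up GOMF strings, and `terms`, `long` and `genes_by_go` are names introduced here for the example.

```r
# Toy data modelled on the thread; the GOMF values are illustrative only
dat <- data.frame(
    flybase_gene_id = c("FBgn0001128", "FBgn0053057", "FBgn0040736"),
    GOMF = c("hydrolase activity:protein binding",
             "protein binding:ATP binding",
             "hydrolase activity"),
    stringsAsFactors = TRUE   # mimic the suspected factor columns
)

# Convert the factor to character, then split each entry on ":"
terms <- strsplit(as.character(dat$GOMF), ":", fixed = TRUE)

# One row per (gene, GO term) pair
long <- data.frame(
    gene = rep(as.character(dat$flybase_gene_id), sapply(terms, length)),
    GO = unlist(terms),
    stringsAsFactors = FALSE
)

# Group the gene ids by GO term and collapse each group to one string
genes_by_go <- split(long$gene, long$GO)
out <- data.frame(
    GO = names(genes_by_go),
    FBgn = vapply(genes_by_go, paste, character(1), collapse = ","),
    row.names = NULL,
    stringsAsFactors = FALSE
)
print(out)
```

Building the long (gene, term) table first keeps the intermediate result easy to inspect, and split() does the grouping that reverseSplit performs in the version above.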
Martin
>
> I would like to know if something was changed in the apply command
> structure to prevent the same results as before. I would appreciate your
> help.
>
> Thanks
> Assa
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793