[BioC] clustering genes in GO categories

Wed Jan 12 17:32:58 CET 2011

Hi Assa,

OK, I see your point. This is still pretty easy.

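# one entry per gene ID, holding its colon-separated GOMF string(s);
# as.character() guards against GOMF having been read in as a factor
lst <- tapply(1:nrow(dat), dat$flybase_gene_id, function(x) as.character(dat[x,"GOMF"]))

# split each string on ':' to get the individual GO terms per gene
lst2 <- lapply(lst, function(x) unlist(strsplit(x, ":")))
# two-column matrix: gene ID (repeated once per term), GO term
unlst <- cbind(rep(names(lst2), sapply(lst2, length)), unlist(lst2, 
use.names = FALSE))

# regroup by GO term, collecting the gene IDs for each
done <- tapply(1:nrow(unlst), unlst[,2], function(x) unlst[x,1])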

There are assuredly other more elegant ways to do this, but this should 
suffice.
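
For what it's worth, here is a minimal self-contained sketch of the same idea on a made-up toy table (three gene IDs from your example, with shortened GO lists). It uses split() in place of tapply(), so it is a variant of the code above rather than the same thing, and it assumes the column names flybase_gene_id and GOMF match your table. The names 'terms', 'long' and 'out' are just illustrative.

```r
# toy data (assumption: GO terms are colon-separated in a single GOMF column)
dat <- data.frame(
  flybase_gene_id = c("FBgn0001128", "FBgn0040736", "FBgn0034454"),
  GOMF = c("carboxylesterase activity:hydrolase activity:protein binding",
           "aminopeptidase activity:hydrolase activity",
           "protein binding"),
  stringsAsFactors = FALSE
)

# one character vector of GO terms per row
terms <- strsplit(dat$GOMF, ":", fixed = TRUE)

# long format: one (gene, term) pair per row
long <- data.frame(
  gene = rep(dat$flybase_gene_id, sapply(terms, length)),
  term = unlist(terms),
  stringsAsFactors = FALSE
)

# named list: GO term -> unique gene IDs annotated with it
done <- lapply(split(long$gene, long$term), unique)

# one tab-separated line per GO term, term first, then its gene IDs
out <- vapply(done, paste, character(1), collapse = "\t")
writeLines(paste(names(out), out, sep = "\t"))
```

Pass a file name or connection as the second argument to writeLines() if you want the result written to disk instead of printed.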

Best,

Jim

On 1/12/2011 7:28 AM, Assa Yeroslaviz wrote:
> Hi James,
>
> thanks for this idea, but unfortunately it wasn't exactly what I needed.
> This kind of transformation I was able to do on my own. The problem is that
> I would like to split the third column into single GO categories.
>
> This is what I have so far, after applying the tapply command:
> "carboxylesterase activity:hydrolase activity:3',5'-cyclic-nucleotide
> phosphodiesterase activity:protein binding"    FBgn0001128
> aminopeptidase activity:metalloexopeptidase activity:hydrolase
> activity:manganese ion binding    FBgn0040736
> nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium
> transmembrane transporter activity    FBgn0053057,FBgn0035889
> protein binding    FBgn0034454
>
> What I need is to split the first column (or, in the original file, the third
> column) into separate names (in this column they are separated by ':'),
> and then attach ALL the matching IDs to each GO category,
> to get something like:
> carboxylesterase activity    FBgn0001128   ....
> hydrolase activity    FBgn0001128    FBgn0040736  .....
> 3',5'-cyclic-nucleotide phosphodiesterase activity    FBgn0001128   ....
> protein binding    FBgn0001128    FBgn0034454   FBgn0053057    FBgn0035889
> ....
> nucleotide binding    FBgn0053057    FBgn0035889   ...
> ATP binding    FBgn0053057    FBgn0035889 ....
> chaperone binding    FBgn0053057    FBgn0035889  ....
> ammonium transmembrane transporter activity    FBgn0053057
> FBgn0035889     ....
> aminopeptidase activity    FBgn0040736  ....
> metalloexopeptidase activity    FBgn0040736  ....
> manganese ion binding    FBgn0040736     ....
> ....
>
> I would appreciate any help on that subject.
>
> THX
> Assa
>
> On Thu, Jan 6, 2011 at 22:09, James MacDonald<jmacdon at med.umich.edu>  wrote:
>
>> Hi Assa,
>>
>> I don't think you need a package for that. A call to tapply() followed by a
>> call to do.call() should get you where you want to go.
>>
>> Say you read your table into R, and call it 'dat'.
>>
>> thelist<- tapply(1:nrow(dat), dat$GOMF, function(x) dat[x, 3])
>>
>> then you will have a list, with the names being the GOMF and the list items
>> being all the gene ids. Collapsing that to a matrix is difficult because you
>> will have different numbers of columns. So you can either collapse all the
>> list items using commas, or directly write out to a file. Collapsing with
>> commas is easy:
>>
>> commalist<- lapply(thelist, paste, collapse = ",")
>> avector<- do.call("c", commalist)
>> names(avector)<- names(commalist)
>>
>> or you could just write out to a file using something like
>>
>> con<- file("mydata.txt", "w")
>>
>> for(i in seq(along = commalist)) cat(names(commalist)[i], commalist[[i]],
>> "\n", sep = "\t", file = con)
>>
>> close(con)
>>
>> All untested, so  you might have to fiddle a bit to get the results you
>> want.
>>
>> Best,
>>
>> Jim
>>
>> James W. MacDonald, M.S.
>> Biostatistician
>> Douglas Lab
>> 5912 Buhl
>> 1241 E. Catherine St.
>> Ann Arbor MI 48109-5618
>> 734-615-7826
>>>>> Assa Yeroslaviz  01/06/11 1:02 PM>>>
>> Hi, everybody,
>>
>> I was wondering whether there is a package to cluster a list of genes to
>> different GO categories
>>
>> my problem is as such:
>> I have a list of genes (a tab delimited file):
>> id    flybasename_gene    flybase_gene_id    entrezgene    GOMF
>>
>> 1616608_a_at    Gpdh    FBgn0001128    33824    carboxylesterase activity
>> hydrolase activity    3',5'-cyclic-nucleotide phosphodiesterase activity
>> protein binding
>> 1622892_s_at    CG33057    FBgn0053057    318833    nucleotide binding
>> protein binding    ATP binding    chaperone binding    ammonium
>> transmembrane transporter activity
>> 1622892_s_at    mkg-p    FBgn0035889    38955    nucleotide binding
>> protein binding    ATP binding    chaperone binding    ammonium
>> transmembrane transporter activity
>> 1622893_at    IM3    FBgn0040736    50209    aminopeptidase activity
>> metalloexopeptidase activity    hydrolase activity    manganese ion binding
>> 1622894_at    CG15120    FBgn0034454    37248    protein binding
>>
>> I would like to try and group the genes in various GO categories, which are
>> mentioned here in the last columns. The GO categories take more than one
>> column and the number is not equal in each line, depending on the depth of
>> the annotation for each gene.
>> Is there a way of transforming the table, so that I have in the first column a
>> list of my GO categories and then on each line a list of gene IDs (the
>> exact IDs are not important, as I can change them as I wish)?
>> I would like to have something like that:
>> GO    genes
>> protein binding     FBgn0001128    FBgn0053057     FBgn0035889 etc.
>> ammonium transmembrane transporter activity      FBgn0053057    FBgn0035889
>> hydrolase activity   FBgn0040736     FBgn0001128
>>
>>
>> I would appreciate any kind of help or ideas
>>
>> Thanks
>> Assa
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> **********************************************************
>> Electronic Mail is not secure, may not be read every day, and should not be
>> used for urgent or sensitive issues
>>
>>
>

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826