[BioC] clustering genes in GO categories
James W. MacDonald
jmacdon at med.umich.edu
Wed Jan 12 17:32:58 CET 2011
Hi Assa,
OK, I see your point. This is still pretty easy.
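# GOMF strings per gene (as.character guards against a factor column)
lst <- tapply(1:nrow(dat), dat$flybase_gene_id, function(x) as.character(dat[x,"GOMF"]))
# split each string into single GO terms
lst2 <- lapply(lst, function(x) unlist(strsplit(x, ":")))
# two-column matrix: gene ID, GO term
unlst <- cbind(rep(names(lst2), sapply(lst2, length)), unlist(lst2,
use.names = FALSE))
# gene IDs grouped by GO term
done <- tapply(1:nrow(unlst), unlst[,2], function(x) unlst[x,1])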
There are assuredly other more elegant ways to do this, but this should
suffice.
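
To make the steps concrete, here is a minimal sketch on toy data. The data frame below is a made-up subset built from IDs mentioned in this thread, the names 'terms', 'long' and 'go2gene' are only illustrative, and split() is swapped in for the final tapply() call as an alternative that should give the same grouping.

```r
# Toy data: hypothetical subset of the table discussed in this thread
dat <- data.frame(
  flybase_gene_id = c("FBgn0001128", "FBgn0040736", "FBgn0034454"),
  GOMF = c("carboxylesterase activity:hydrolase activity:protein binding",
           "aminopeptidase activity:hydrolase activity",
           "protein binding"),
  stringsAsFactors = FALSE
)

# One character vector of GO terms per gene (fixed = TRUE: ":" is literal)
terms <- strsplit(dat$GOMF, ":", fixed = TRUE)

# Long format: repeat each gene ID once per GO term it carries
long <- data.frame(
  gene = rep(dat$flybase_gene_id, sapply(terms, length)),
  go   = unlist(terms),
  stringsAsFactors = FALSE
)

# Invert the mapping: gene IDs grouped by GO term
go2gene <- split(long$gene, long$go)

go2gene[["hydrolase activity"]]
```

With this toy input, the "hydrolase activity" element should hold FBgn0001128 and FBgn0040736. Rows whose ID field already contains several comma-joined IDs are not handled here and would need an extra strsplit() on ",".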
Best,
Jim
On 1/12/2011 7:28 AM, Assa Yeroslaviz wrote:
> Hi James,
>
> thanks for this idea, but unfortunately it wasn't exactly what I needed.
> This kind of transformation I was able to do on my own. The problem is that
> I would like to split the third column into single GO categories.
>
> This is what I have so far, after applying the tapply command:
> "carboxylesterase activity:hydrolase activity:3',5'-cyclic-nucleotide
> phosphodiesterase activity:protein binding" FBgn0001128
> aminopeptidase activity:metalloexopeptidase activity:hydrolase
> activity:manganese ion binding FBgn0040736
> nucleotide binding:protein binding:ATP binding:chaperone binding:ammonium
> transmembrane transporter activity FBgn0053057,FBgn0035889
> protein binding FBgn0034454
>
> What I need is to split the first column (the third column in the original
> file) into separate names (in this column they are separated by ':')
> and attach ALL the matching IDs to each GO category,
> to get something like:
> carboxylesterase activity FBgn0001128 ....
> hydrolase activity FBgn0001128 FBgn0040736 .....
> 3',5'-cyclic-nucleotide phosphodiesterase activity FBgn0001128 ....
> protein binding FBgn0001128 FBgn0034454 FBgn0053057 FBgn0035889
> ....
> nucleotide binding FBgn0053057 FBgn0035889 ...
> ATP binding FBgn0053057 FBgn0035889 ....
> chaperone binding FBgn0053057 FBgn0035889 ....
> ammonium transmembrane transporter activity FBgn0053057
> FBgn0035889 ....
> aminopeptidase activity FBgn0040736 ....
> metalloexopeptidase activity FBgn0040736 ....
> manganese ion binding FBgn0040736 ....
> ....
>
> I would appreciate any help on that subject.
>
> THX
> Assa
>
> On Thu, Jan 6, 2011 at 22:09, James MacDonald<jmacdon at med.umich.edu> wrote:
>
>> Hi Assa,
>>
>> I don't think you need a package for that. A call to tapply() followed by a
>> call to do.call() should get you where you want to go.
>>
>> Say you read your table into R, and call it 'dat'.
>>
>> thelist<- tapply(1:nrow(dat), dat$GOMF, function(x) dat[x, 3])
>>
>> then you will have a list, with the names being the GOMF and the list items
>> being all the gene ids. Collapsing that to a matrix is difficult because you
>> will have different numbers of columns. So you can either collapse all the
>> list items using commas, or directly write out to a file. Collapsing with
>> commas is easy:
>>
>> commalist<- lapply(thelist, paste, collapse = ",")
>> avector<- do.call("c", commalist)
>> names(avector)<- names(commalist)
>>
>> or you could just write out to a file using something like
>>
>> con<- file("mydata.txt", "w")
>>
>> for(i in seq(along = commalist)) cat(names(commalist)[i], "\t",
>> commalist[[i]], "\n", sep = "", file = con)
>>
>> close(con)
>>
>> All untested, so you might have to fiddle a bit to get the results you
>> want.
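
A small sketch of the collapse-and-write step on a made-up list, for anyone who wants to try it without the real table. The list contents and the names 'avector' and 'outfile' are only illustrative, and vapply() and writeLines() are swapped in here for the do.call() and cat() loop; this is one possible variant, not the only way to do it.

```r
# Hypothetical list in the shape tapply() would return: names are GO terms
thelist <- list(
  "hydrolase activity" = c("FBgn0001128", "FBgn0040736"),
  "protein binding"    = c("FBgn0001128", "FBgn0034454")
)

# Collapse each element to one comma-separated string; names are kept
avector <- vapply(thelist, paste, character(1), collapse = ",")

# Write "GO term <tab> gene list", one line per term, to a temporary file
outfile <- tempfile(fileext = ".txt")
writeLines(paste(names(avector), avector, sep = "\t"), outfile)

# Read the file back to check the result
readLines(outfile)
```

Writing all lines in one writeLines() call avoids opening and closing a connection by hand, at the cost of building the full character vector in memory first.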
>>
>> Best,
>>
>> Jim
>>
>> James W. MacDonald, M.S.
>> Biostatistician
>> Douglas Lab
>> 5912 Buhl
>> 1241 E. Catherine St.
>> Ann Arbor MI 48109-5618
>> 734-615-7826
>>>>> Assa Yeroslaviz 01/06/11 1:02 PM>>>
>> Hi, everybody,
>>
>> I was wondering whether there is a package to cluster a list of genes into
>> different GO categories.
>>
>> My problem is as follows:
>> I have a list of genes (a tab-delimited file):
>> id flybasename_gene flybase_gene_id entrezgene GOMF
>>
>> 1616608_a_at Gpdh FBgn0001128 33824 carboxylesterase activity
>> hydrolase activity 3',5'-cyclic-nucleotide phosphodiesterase activity
>> protein binding
>> 1622892_s_at CG33057 FBgn0053057 318833 nucleotide binding
>> protein binding ATP binding chaperone binding ammonium
>> transmembrane transporter activity
>> 1622892_s_at mkg-p FBgn0035889 38955 nucleotide binding
>> protein binding ATP binding chaperone binding ammonium
>> transmembrane transporter activity
>> 1622893_at IM3 FBgn0040736 50209 aminopeptidase activity
>> metalloexopeptidase activity hydrolase activity manganese ion binding
>> 1622894_at CG15120 FBgn0034454 37248 protein binding
>>
>> I would like to group the genes into the GO categories listed in the last
>> columns. The GO categories span more than one column, and their number
>> differs from line to line, depending on the depth of the annotation for
>> each gene.
>> Is there a way of transforming the table so that I have a list of my GO
>> categories in the first column and, on each line, a list of gene IDs (the
>> exact ID type is not important, as I can change it as I wish)?
>> I would like to have something like this:
>> GO genes
>> protein binding FBgn0001128 FBgn0053057 FBgn0035889 etc.
>> ammonium transmembrane transporter activity FBgn0053057 FBgn0035889
>> hydrolase activity FBgn0040736 FBgn0001128
>>
>>
>> I would appreciate any kind of help or ideas.
>>
>> Thanks
>> Assa
>>
>> **********************************************************
>> Electronic Mail is not secure, may not be read every day, and should not be
>> used for urgent or sensitive issues
>>
>>
>
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826