[BioC] clustering genes in GO categories

James W. MacDonald jmacdon at med.umich.edu
Mon Jan 24 15:37:24 CET 2011


Hi Assa,

On 1/24/2011 4:52 AM, Assa Yeroslaviz wrote:
> Hello James and Bioconductor users,
>
> It starts to look better now. here is a short summary of my script:
> dat<- changedGenes.sub  # changedGenes.sub is the complete data from the
> # file FB_simulated_contrasts for Luke
>
> lst<- tapply(1:nrow(dat), dat$flybase_gene_id, function(x)
> dat[x,"bioProc"])
>
> lst2<- lapply(lst, function(x) unlist(strsplit(as.character(x), ":")))
> unlst<- cbind(rep(names(lst2), sapply(lst2, length)), unlist(lst2,
> use.names = FALSE))
>
> done<- tapply(1:nrow(unlst), unlst[,2], function(x) unlst[x,1])

The only way you will be able to get this into a data.frame is if you 
have a consistent number of columns. Since you can have an arbitrary 
number of Flybase genes associated with a particular GO term, you have 
to collapse each list item to length one.

This is easy enough to do: just collapse each list item to a single 
string, with the gene IDs separated by commas.

done <- lapply(done, paste, collapse = ",")
out <- data.frame(GO = names(done), FBgn = unlist(done))
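
In case it helps, here is a small self-contained sketch of the same idea 
on toy data. The 'done' list below is made up from IDs quoted earlier in 
this thread, and the object names 'collapsed', 'long' and 'outfile', as 
well as the output file name, are only placeholders, so adjust as needed.

```r
# Toy stand-in for 'done': GO term -> character vector of FlyBase gene IDs
# (hypothetical subset, not the real 103-item list).
done <- list(
  "protein binding"    = c("FBgn0001128", "FBgn0034454", "FBgn0053057"),
  "hydrolase activity" = c("FBgn0001128", "FBgn0040736"),
  "ATP binding"        = "FBgn0053057"
)

# Wide form: collapse each list item to one comma-separated string,
# so every GO term contributes exactly one row.
collapsed <- vapply(done, paste, character(1), collapse = ",")
out <- data.frame(GO = names(collapsed), FBgn = unname(collapsed),
                  row.names = NULL, stringsAsFactors = FALSE)

# Long form: one row per (GO term, gene) pair, built without looking
# items up by name.
long <- data.frame(GO = rep(names(done), lengths(done)),
                   FBgn = unlist(done, use.names = FALSE),
                   stringsAsFactors = FALSE)

# Tab-delimited export; the file name is a placeholder.
outfile <- file.path(tempdir(), "GO_to_FBgn.txt")
write.table(out, file = outfile, sep = "\t", quote = FALSE,
            row.names = FALSE)
```

As an aside, the "differing number of rows: 0, 1" error is probably due 
to the list item with the empty name: an empty string never matches a 
name, so done[[""]] returns NULL. Neither form above indexes by name, so 
both should sidestep that.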

Best,

Jim


>
> The result I get is a list of lists:
>> str(done)
> List of 103
>   $                                 : chr [1:4] "FBgn0010359" "FBgn0021800" "FBgn0031420" "FBgn0034345"
>   $ actin cytoskeleton organization : chr "FBgn0000318"
>   $ actin filament organization     : chr "FBgn0000318"
>   $ adenosine to inosine editing    : chr "FBgn0044510"
>   $ adult behavior                  : chr "FBgn0044510"
>   $ adult locomotory behavior       : chr "FBgn0044510"
>   $ antimicrobial humoral response  : chr "FBgn0000318"
>   $ apoptosis                       : chr "FBgn0016977"
>   $ apposition of dorsal and ventral imaginal disc-derived wing surfaces: chr "FBgn0034326"
>   $ asymmetric cell division        : chr "FBgn0052484"
> ...
>
> Unfortunately I can't find a way of converting this list of lists into an
> exportable table/file to work with.
> What I would like to have is the same content, in the same form as this list
> of lists, but as a data.frame with two columns.
> like that (this is a *hypothetical object,* which I couldn't generate until
> now) :
>> done
>   GO category                       Gene_IDs
>   no category                       : chr [1:4] "FBgn0010359" "FBgn0021800" "FBgn0031420" "FBgn0034345"
>   actin cytoskeleton organization   : chr "FBgn0000318"
>   actin filament organization       : chr "FBgn0000318"
>   adenosine to inosine editing      : chr "FBgn0044510"
>   adult behavior                    : chr "FBgn0044510"
>   adult locomotory behavior         : chr "FBgn0044510"
>   antimicrobial humoral response    : chr "FBgn0000318"
>   apoptosis                         : chr "FBgn0016977"
>   apposition of dorsal and ventral imaginal disc-derived wing surfaces: chr "FBgn0034326"
>   asymmetric cell division          : chr "FBgn0052484"
>
> Just using as.data.frame can't convert it as it still stays a list, which is
> not exportable.
> I tried to convert the list of lists using:
>> done.df<- do.call('rbind', lapply(names(done),
> function(.name){data.frame(done[[.name]], Name=.name)}))
>
> But I get the error message that I have different length of rows.
> Error in data.frame(done[[.name]], Name = .name) :
>    arguments imply differing number of rows: 0, 1
>
> I would like to know if there is a way of exporting a list of lists into a
> table, or to convert it into a data.frame.
>
> Thanks for any help
>
> Assa
>
>
> On Mon, Jan 17, 2011 at 16:56, Assa Yeroslaviz<frymor at gmail.com>  wrote:
>
>> Hi again,
>>
>> ok. I solved it. well to be honest, it wasn't that difficult. I just added
>>
>>> lst2<- lapply(list, function(x) unlist(strsplit(as.character(x),
>> ":")))
>>
>> Assa
>>
>>
>> On Mon, Jan 17, 2011 at 16:42, Assa Yeroslaviz<frymor at gmail.com>  wrote:
>>
>>>
>>> Hi James,
>>>
>>> thanks for the help, but unfortunately I get an error message when running
>>> the second line
>>>
>>>> list<- tapply(1:nrow(dat), dat$flybase_gene_id, function(x)
>>> dat[x,"GOMF"])
>>>
>>>> lst2<- lapply(list, function(x) unlist(strsplit(x, ":")))
>>>
>>> Error in strsplit(x, ":") : non-character argument
>>>
>>>> str(list)
>>> List of 13369
>>>   $ FBgn0000008: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>> activity:transferase activity:transferase activity, transferring glycosyl
>>> groups\"",..: NA
>>>   $ FBgn0000014: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>> activity:transferase activity:transferase activity, transferring glycosyl
>>> groups\"",..: 3330 NA
>>>   $ FBgn0000015: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>> activity:transferase activity:transferase activity, transferring glycosyl
>>> groups\"",..: 2546 880
>>>   $ FBgn0000017: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>> activity:transferase activity:transferase activity, transferring glycosyl
>>> groups\"",..: NA 35
>>>   $ FBgn0000018: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>> activity:transferase activity:transferase activity, transferring glycosyl
>>> groups\"",..: NA
>>>   $ FBgn0000022: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>> activity:transferase activity:transferase activity, transferring glycosyl
>>> groups\"",..: 893
>>>   $ FBgn0000024: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>> activity:transferase activity:transferase activity, transferring glycosyl
>>> groups\"",..: 2546
>>>   $ FBgn0000028: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>> activity:transferase activity:transferase activity, transferring glycosyl
>>> groups\"",..: NA
>>>
>>> I tried to convert the factor of the data.frame into characters, but it
>>> still give me the same error.
>>> list1<- data.frame(lapply(list, as.character), stringsAsFactors=FALSE)
>>>
>>> Is there a way of converting the lines to characters?
>>>
>>> THX
>>> Assa
>>>
>>>
>>>
>>>
>>> Hi Assa,
>>>>
>>>> OK, I see your point. This is still pretty easy.
>>>>
>>>> lst<- tapply(1:nrow(dat), dat$flybase_gene_id, function(x) dat[x,"GOMF"])
>>>>
>>>> lst2<- lapply(lst, function(x) unlist(strsplit(x, ":")))
>>>>
>>>
>>>
>>>
>>>> unlst<- cbind(rep(names(lst2), sapply(lst2, length)), unlist(lst2,
>>>> use.names = FALSE))
>>>>
>>>> done<- tapply(1:nrow(unlst), unlst[,2], function(x) unlst[x,1])
>>>>
>>>> There are assuredly other more elegant ways to do this, but this should
>>>> suffice.
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>
>>>>
>>>> On 1/12/2011 7:28 AM, Assa Yeroslaviz wrote:
>>>>
>>>>> Hi James,
>>>>>
>>>>> thanks for this idea, but unfortunately it wasn't exactly what I needed.
>>>>> This kind of transformation I was able to do on my own. The problem is
>>>>> that I would like to split the third column into single GO categories.
>>>>>
>>>>> This is what I have until now, after applying the tapply command:
>>>>> "carboxylesterase activity:hydrolase activity:3',5'-cyclic-nucleotide
>>>>> phosphodiesterase activity:protein binding"    FBgn0001128
>>>>> aminopeptidase activity:metalloexopeptidase activity:hydrolase
>>>>> activity:manganese ion binding    FBgn0040736
>>>>> nucleotide binding:protein binding:ATP binding:chaperone
>>>>> binding:ammonium
>>>>> transmembrane transporter activity    FBgn0053057,FBgn0035889
>>>>> protein binding    FBgn0034454
>>>>>
>>>>> What I need is to split the first column (or in the original file the
>>>>> third column) into separate names (in this column these are separated
>>>>> by ':'), and attach ALL the matching IDs to EACH of the GO categories.
>>>>> As if to get something like:
>>>>> carboxylesterase activity    FBgn0001128   ....
>>>>> hydrolase activity    FBgn0001128    FBgn0040736  .....
>>>>> 3',5'-cyclic-nucleotide phosphodiesterase activity    FBgn0001128   ....
>>>>> protein binding    FBgn0001128    FBgn0034454   FBgn0053057
>>>>>   FBgn0035889
>>>>> ....
>>>>> nucleotide binding    FBgn0053057    FBgn0035889   ...
>>>>> ATP binding    FBgn0053057    FBgn0035889 ....
>>>>> chaperone binding    FBgn0053057    FBgn0035889  ....
>>>>> ammonium transmembrane transporter activity    FBgn0053057
>>>>> FBgn0035889     ....
>>>>> aminopeptidase activity    FBgn0040736  ....
>>>>> metalloexopeptidase activity    FBgn0040736  ....
>>>>> manganese ion binding    FBgn0040736     ....
>>>>> ....
>>>>>
>>>>> I would appreciate any help on that subject.
>>>>>
>>>>> THX
>>>>> Assa
>>>>>
>>>>> On Thu, Jan 6, 2011 at 22:09, James MacDonald<jmacdon at med.umich.edu>
>>>>>   wrote:
>>>>>
>>>>>   Hi Assa,
>>>>>>
>>>>>> I don't think you need a package for that. A call to tapply() followed
>>>>>> by a
>>>>>> call to do.call() should get you where you want to go.
>>>>>>
>>>>>> Say you read your table into R, and call it 'dat'.
>>>>>>
>>>>>> thelist<- tapply(1:nrow(dat), dat$GOMF, function(x) dat[x, 3])
>>>>>>
>>>>>> then you will have a list, with the names being the GOMF and the list
>>>>>> items
>>>>>> being all the gene ids. Collapsing that to a matrix is difficult
>>>>>> because you
>>>>>> will have different numbers of columns. So you can either collapse all
>>>>>> the
>>>>>> list items using commas, or directly write out to a file. Collapsing
>>>>>> with
>>>>>> commas is easy:
>>>>>>
>>>>>> commalist<- lapply(thelist, paste, collapse = ",")
>>>>>> avector<- do.call("c", commalist)
>>>>>> names(avector)<- names(commalist)
>>>>>>
>>>>>> or you could just write out to a file using something like
>>>>>>
>>>>>> con<- file("mydata.txt", "w")
>>>>>>
>>>>>> for(i in seq(along = commalist)) cat(names(commalist)[i],
>>>>>> commalist[[i]],
>>>>>> "\n", sep = "\t", file = con)
>>>>>>
>>>>>> close(con)
>>>>>>
>>>>>> All untested, so  you might have to fiddle a bit to get the results you
>>>>>> want.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>> James W. MacDonald, M.S.
>>>>>> Biostatistician
>>>>>> Douglas Lab
>>>>>> 5912 Buhl
>>>>>> 1241 E. Catherine St.
>>>>>> Ann Arbor MI 48109-5618
>>>>>> 734-615-7826
>>>>>>
>>>>>>>   Assa Yeroslaviz  01/06/11 1:02 PM>>>
>>>>>>>>>
>>>>>>>> Hi, everybody,
>>>>>>
>>>>>> I was wondering whether there is a package to cluster a list of genes
>>>>>> to
>>>>>> different GO categories
>>>>>>
>>>>>> my problem is as such:
>>>>>> i have a list of genes (a tab delimited file):
>>>>>> id    flybasename_gene    flybase_gene_id    entrezgene    GOMF
>>>>>>
>>>>>> 1616608_a_at    Gpdh    FBgn0001128    33824    carboxylesterase activity    hydrolase activity    3',5'-cyclic-nucleotide phosphodiesterase activity    protein binding
>>>>>> 1622892_s_at    CG33057    FBgn0053057    318833    nucleotide binding    protein binding    ATP binding    chaperone binding    ammonium transmembrane transporter activity
>>>>>> 1622892_s_at    mkg-p    FBgn0035889    38955    nucleotide binding    protein binding    ATP binding    chaperone binding    ammonium transmembrane transporter activity
>>>>>> 1622893_at    IM3    FBgn0040736    50209    aminopeptidase activity    metalloexopeptidase activity    hydrolase activity    manganese ion binding
>>>>>> 1622894_at    CG15120    FBgn0034454    37248    protein binding
>>>>>>
>>>>>> I would like to try and group the genes in various GO categories, which
>>>>>> are
>>>>>> mentioned here in the last columns. The GO categories take more than
>>>>>> one
>>>>>> column and the number is not equal in each line, depending on the depth
>>>>>> of
>>>>>> the annotation for each gene.
>>>>>> Is there a way of transforming the table, so that I have in the first
>>>>>> column a list of my GO categories and then on each line a list of gene
>>>>>> IDs (the right IDs are not important, as I can change them as I wish)?
>>>>>> I would like to have something like that:
>>>>>> GO    genes
>>>>>> protein binding     FBgn0001128    FBgn0053057     FBgn0035889 etc.
>>>>>> ammonium transmembrane transporter activity      FBgn0053057
>>>>>>   FBgn0035889
>>>>>> hydrolase activity   FBgn0040736     FBgn0001128
>>>>>>
>>>>>>
>>>>>> I would appreciate any kind of help or ideas
>>>>>>
>>>>>> Thanks
>>>>>> Assa
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>> **********************************************************
>>>>>> Electronic Mail is not secure, may not be read every day, and should
>>>>>> not be
>>>>>> used for urgent or sensitive issues
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826