[BioC] clustering genes in GO categories

Martin Morgan mtmorgan at fhcrc.org
Mon Jan 24 18:18:25 CET 2011


On 01/24/2011 06:37 AM, James W. MacDonald wrote:
> Hi Assa,
> 
> On 1/24/2011 4:52 AM, Assa Yeroslaviz wrote:
>> Hello James and Bioconductor users,
>>
>> It starts to look better now. Here is a short summary of my script:
>> # changedGenes.sub is the complete data from the file
>> # FB_simulated_contrasts for Luke
>> dat<- changedGenes.sub
>>
>> lst<- tapply(1:nrow(dat), dat$flybase_gene_id, function(x)
>> dat[x,"bioProc"])
>>
>> lst2<- lapply(lst, function(x) unlist(strsplit(as.character(x), ":")))
>> unlst<- cbind(rep(names(lst2), sapply(lst2, length)), unlist(lst2,
>> use.names = FALSE))
>>
>> done<- tapply(1:nrow(unlst), unlst[,2], function(x) unlst[x,1])
> 
> The only way you will be able to get this into a data.frame is if you
> have a consistent number of columns. Since you can have an arbitrary
> number of Flybase genes associated with a particular GO term, you have
> to collapse each list item to length one.
> 
> This is easy enough to do, just collapse to a single string, separated
> by commas.
> 
> done <- lapply(done, paste, collapse = ",")
> out <- data.frame(GO = names(done), FBgn = unlist(done))
> 
> Best,
> 
> Jim
> 
> 
>>
>> The result I get is a list of lists:
>>> str(done)
>> List of 103
>>   $                                 : chr [1:4] "FBgn0010359" "FBgn0021800" "FBgn0031420" "FBgn0034345"
>>   $ actin cytoskeleton organization : chr "FBgn0000318"
>>   $ actin filament organization     : chr "FBgn0000318"

Jumping in in the middle so perhaps not understanding, but...

To create a flat data frame that contains these data in a 'denormalized'
form (one row per term/gene pair) you might try

len <- sapply(done, length)
data.frame(Term=rep(names(done), len),
           FBgn=unlist(done, use.names=FALSE))

use.names=FALSE is an efficiency that likely does not make a difference
in the current situation. It might be necessary to first filter out
elements that are NULL, e.g., done <- Filter(Negate(is.null), done)
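
As a rough, self-contained sketch of the idea on a toy list: the gene IDs and
term names are borrowed from this thread, but the NULL element, the object
name 'flat', the column name 'FBgn' and the output file name are assumptions
made only for illustration.

```r
# Toy stand-in for 'done': a named list of character vectors, one per GO term.
# The NULL element is hypothetical, included only to exercise the filter.
done <- list(
    "actin cytoskeleton organization" = "FBgn0000318",
    "apoptosis"                       = "FBgn0016977",
    "protein binding"                 = c("FBgn0001128", "FBgn0034454"),
    "hypothetical empty term"         = NULL
)

# Keep only the non-NULL elements
done <- Filter(Negate(is.null), done)

# One row per (term, gene) pair: repeat each term once per gene it holds
len <- sapply(done, length)
flat <- data.frame(Term = rep(names(done), len),
                   FBgn = unlist(done, use.names = FALSE),
                   stringsAsFactors = FALSE)

# Export as a tab-delimited file (file name is an assumption)
out_file <- file.path(tempdir(), "go_genes.txt")
write.table(flat, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)
```

Because every row has exactly two fields, the result can be written with
write.table() and read back with read.delim() without the ragged-column
problem discussed earlier in the thread.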

Martin


>>   $ adenosine to inosine editing    : chr "FBgn0044510"
>>   $ adult behavior                  : chr "FBgn0044510"
>>   $ adult locomotory behavior       : chr "FBgn0044510"
>>   $ antimicrobial humoral response  : chr "FBgn0000318"
>>   $ apoptosis                       : chr "FBgn0016977"
>>   $ apposition of dorsal and ventral imaginal disc-derived wing surfaces : chr "FBgn0034326"
>>   $ asymmetric cell division        : chr "FBgn0052484"
>> ...
>>
>> Unfortunately I can't find a way of converting this list of lists into an
>> exportable table/file to work with.
>> What I would like to have is the same content as this list of lists, but
>> as a data.frame with two columns, like this (a hypothetical object, which
>> I haven't been able to generate so far):
>>> done
>>   GO category                      Gene_IDs
>>   no category                      : chr [1:4] "FBgn0010359" "FBgn0021800" "FBgn0031420" "FBgn0034345"
>>   actin cytoskeleton organization  : chr "FBgn0000318"
>>   actin filament organization      : chr "FBgn0000318"
>>   adenosine to inosine editing     : chr "FBgn0044510"
>>   adult behavior                   : chr "FBgn0044510"
>>   adult locomotory behavior        : chr "FBgn0044510"
>>   antimicrobial humoral response   : chr "FBgn0000318"
>>   apoptosis                        : chr "FBgn0016977"
>>   apposition of dorsal and ventral imaginal disc-derived wing surfaces : chr "FBgn0034326"
>>   asymmetric cell division         : chr "FBgn0052484"
>>
>> Just using as.data.frame doesn't convert it; the result still stays a
>> list, which is not exportable.
>> I tried to convert the list of lists using:
>>> done.df<- do.call('rbind', lapply(names(done),
>> function(.name){data.frame(done[[.name]], Name=.name)}))
>>
>> But I get the error message that I have different length of rows.
>> Error in data.frame(done[[.name]], Name = .name) :
>>    arguments imply differing number of rows: 0, 1
>>
>> I would like to know if there is a way of exporting a list of lists
>> into a
>> table, or to convert it into a data.frame.
>>
>> Thanks for any help
>>
>> Assa
>>
>>
>> On Mon, Jan 17, 2011 at 16:56, Assa Yeroslaviz<frymor at gmail.com>  wrote:
>>
>>> Hi again,
>>>
>>> OK, I solved it. To be honest, it wasn't that difficult. I just
>>> added as.character():
>>>
>>>> lst2<- lapply(list, function(x) unlist(strsplit(as.character(x), ":")))
>>>
>>> Assa
>>>
>>>
>>> On Mon, Jan 17, 2011 at 16:42, Assa Yeroslaviz<frymor at gmail.com>  wrote:
>>>
>>>>
>>>> Hi James,
>>>>
>>>> thanks for the help, but unfortunately I get an error message when
>>>> running
>>>> the second line
>>>>
>>>>> list<- tapply(1:nrow(dat), dat$flybase_gene_id, function(x)
>>>> dat[x,"GOMF"])
>>>>
>>>>> lst2<- lapply(list, function(x) unlist(strsplit(x, ":")))
>>>>
>>>> Error in strsplit(x, ":") : non-character argument
>>>>
>>>>> str(list)
>>>> List of 13369
>>>>   $ FBgn0000008: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: NA
>>>>   $ FBgn0000014: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: 3330 NA
>>>>   $ FBgn0000015: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: 2546 880
>>>>   $ FBgn0000017: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: NA 35
>>>>   $ FBgn0000018: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: NA
>>>>   $ FBgn0000022: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: 893
>>>>   $ FBgn0000024: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: 2546
>>>>   $ FBgn0000028: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: NA
>>>>
>>>> I tried to convert the factors of the data.frame into characters, but it
>>>> still gives me the same error.
>>>> list1<- data.frame(lapply(list, as.character), stringsAsFactors=FALSE)
>>>>
>>>> Is there a way of converting the lines to characters?
>>>>
>>>> THX
>>>> Assa
>>>>
>>>>
>>>>
>>>>
>>>>> Hi Assa,
>>>>>
>>>>> OK, I see your point. This is still pretty easy.
>>>>>
>>>>> lst<- tapply(1:nrow(dat), dat$flybase_gene_id, function(x)
>>>>> dat[x,"GOMF"])
>>>>>
>>>>> lst2<- lapply(lst, function(x) unlist(strsplit(x, ":")))
>>>>>
>>>>
>>>>
>>>>
>>>>> unlst<- cbind(rep(names(lst2), sapply(lst2, length)), unlist(lst2,
>>>>> use.names = FALSE))
>>>>>
>>>>> done<- tapply(1:nrow(unlst), unlst[,2], function(x) unlst[x,1])
>>>>>
>>>>> There are assuredly other more elegant ways to do this, but this
>>>>> should
>>>>> suffice.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 1/12/2011 7:28 AM, Assa Yeroslaviz wrote:
>>>>>
>>>>>> Hi James,
>>>>>>
>>>>>> thanks for this idea, but unfortunately it wasn't exactly what I
>>>>>> needed.
>>>>>> This kind of transformation I was able to do on my own. The problem
>>>>>> is that I would like to split the third column into single GO
>>>>>> categories.
>>>>>>
>>>>>> This is what I have so far, after applying the tapply command:
>>>>>> "carboxylesterase activity:hydrolase activity:3',5'-cyclic-nucleotide
>>>>>> phosphodiesterase activity:protein binding"    FBgn0001128
>>>>>> aminopeptidase activity:metalloexopeptidase activity:hydrolase
>>>>>> activity:manganese ion binding    FBgn0040736
>>>>>> nucleotide binding:protein binding:ATP binding:chaperone
>>>>>> binding:ammonium
>>>>>> transmembrane transporter activity    FBgn0053057,FBgn0035889
>>>>>> protein binding    FBgn0034454
>>>>>>
>>>>>> What I need is to split the first column (or in the original file the
>>>>>> third
>>>>>> column) in to separate names (in this column these are separated by
>>>>>> ':').
>>>>>> and concatenate ALL the right IDs to the ALL the right GO categories.
>>>>>> As if to get something like:
>>>>>> carboxylesterase activity    FBgn0001128   ....
>>>>>> hydrolase activity    FBgn0001128    FBgn0040736  .....
>>>>>> 3',5'-cyclic-nucleotide phosphodiesterase activity   
>>>>>> FBgn0001128   ....
>>>>>> protein binding    FBgn0001128    FBgn0034454   FBgn0053057
>>>>>>   FBgn0035889
>>>>>> ....
>>>>>> nucleotide binding    FBgn0053057    FBgn0035889   ...
>>>>>> ATP binding    FBgn0053057    FBgn0035889 ....
>>>>>> chaperone binding    FBgn0053057    FBgn0035889  ....
>>>>>> ammonium transmembrane transporter activity    FBgn0053057
>>>>>> FBgn0035889     ....
>>>>>> aminopeptidase activity    FBgn0040736  ....
>>>>>> metalloexopeptidase activity    FBgn0040736  ....
>>>>>> manganese ion binding    FBgn0040736     ....
>>>>>> ....
>>>>>>
>>>>>> I would appreciate any help on that subject.
>>>>>>
>>>>>> THX
>>>>>> Assa
>>>>>>
>>>>>> On Thu, Jan 6, 2011 at 22:09, James MacDonald<jmacdon at med.umich.edu>
>>>>>>   wrote:
>>>>>>
>>>>>>   Hi Assa,
>>>>>>>
>>>>>>> I don't think you need a package for that. A call to tapply()
>>>>>>> followed
>>>>>>> by a
>>>>>>> call to do.call() should get you where you want to go.
>>>>>>>
>>>>>>> Say you read your table into R, and call it 'dat'.
>>>>>>>
>>>>>>> thelist<- tapply(1:nrow(dat), dat$GOMF, function(x) dat[x, 3])
>>>>>>>
>>>>>>> then you will have a list, with the names being the GOMF and the
>>>>>>> list
>>>>>>> items
>>>>>>> being all the gene ids. Collapsing that to a matrix is difficult
>>>>>>> because you
>>>>>>> will have different numbers of columns. So you can either
>>>>>>> collapse all
>>>>>>> the
>>>>>>> list items using commas, or directly write out to a file. Collapsing
>>>>>>> with
>>>>>>> commas is easy:
>>>>>>>
>>>>>>> commalist<- lapply(thelist, paste, collapse = ",")
>>>>>>> avector<- do.call("c", commalist)
>>>>>>> names(avector)<- names(commalist)
>>>>>>>
>>>>>>> or you could just write out to a file using something like
>>>>>>>
>>>>>>> con<- file("mydata.txt", "w")
>>>>>>>
>>>>>>> for(i in seq(along = commalist)) cat(names(commalist)[i],
>>>>>>> commalist[[i]],
>>>>>>> "\n", sep = "\t", file = con)
>>>>>>>
>>>>>>> close(con)
>>>>>>>
>>>>>>> All untested, so  you might have to fiddle a bit to get the
>>>>>>> results you
>>>>>>> want.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>> James W. MacDonald, M.S.
>>>>>>> Biostatistician
>>>>>>> Douglas Lab
>>>>>>> 5912 Buhl
>>>>>>> 1241 E. Catherine St.
>>>>>>> Ann Arbor MI 48109-5618
>>>>>>> 734-615-7826
>>>>>>>
>>>>>>>>   Assa Yeroslaviz  01/06/11 1:02 PM>>>
>>>>>>>>>>
>>>>>>>>> Hi, everybody,
>>>>>>>
>>>>>>> I was wondering whether there is a package to cluster a list of
>>>>>>> genes
>>>>>>> to
>>>>>>> different GO categories
>>>>>>>
>>>>>>> my problem is as such:
>>>>>>> i have a list of genes (a tab delimited file):
>>>>>>> id    flybasename_gene    flybase_gene_id    entrezgene    GOMF
>>>>>>>
>>>>>>> 1616608_a_at    Gpdh    FBgn0001128    33824    carboxylesterase
>>>>>>> activity
>>>>>>> hydrolase activity    3',5'-cyclic-nucleotide phosphodiesterase
>>>>>>> activity
>>>>>>> protein binding
>>>>>>> 1622892_s_at    CG33057    FBgn0053057    318833    nucleotide
>>>>>>> binding
>>>>>>> protein binding    ATP binding    chaperone binding    ammonium
>>>>>>> transmembrane transporter activity
>>>>>>> 1622892_s_at    mkg-p    FBgn0035889    38955    nucleotide binding
>>>>>>> protein binding    ATP binding    chaperone binding    ammonium
>>>>>>> transmembrane transporter activity
>>>>>>> 1622893_at    IM3    FBgn0040736    50209    aminopeptidase activity
>>>>>>> metalloexopeptidase activity    hydrolase activity    manganese ion
>>>>>>> bindin
>>>>>>> 1622894_at    CG15120    FBgn0034454    37248    protein binding
>>>>>>>
>>>>>>> I would like to try and group the genes in various GO categories,
>>>>>>> which
>>>>>>> are
>>>>>>> mentioned here in the last columns. The GO categories take more than
>>>>>>> one column and the number is not equal in each line, depending on the
>>>>>>> depth of the annotation for each gene.
>>>>>>> Is there a way of transforming the table, so that I have in the first
>>>>>>> column a list of my GO categories and then on each line a list of gene
>>>>>>> IDs (the right IDs are not important as I can change them as I wish)?
>>>>>>> I would like to have something like that:
>>>>>>> GO    genes
>>>>>>> protein binding     FBgn0001128    FBgn0053057     FBgn0035889 etc.
>>>>>>> ammonium transmembrane transporter activity      FBgn0053057
>>>>>>>   FBgn0035889
>>>>>>> hydrolase activity   FBgn0040736     FBgn0001128
>>>>>>>
>>>>>>>
>>>>>>> I would appreciate any kind of help or ideas
>>>>>>>
>>>>>>> Thanks
>>>>>>> Assa
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>>>> **********************************************************
>>>>>>> Electronic Mail is not secure, may not be read every day, and should
>>>>>>> not be
>>>>>>> used for urgent or sensitive issues
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>> -- 
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> Douglas Lab
>>>>> University of Michigan
>>>>> Department of Human Genetics
>>>>>
>>>>> 5912 Buhl
>>>>> 1241 E. Catherine St.
>>>>> Ann Arbor MI 48109-5618
>>>>> 734-615-7826
>>>>>
>>>>
>>>>
>>>
>>
> 


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


