[BioC] clustering genes in GO categories
Martin Morgan
mtmorgan at fhcrc.org
Mon Jan 24 18:18:25 CET 2011
On 01/24/2011 06:37 AM, James W. MacDonald wrote:
> Hi Assa,
>
> On 1/24/2011 4:52 AM, Assa Yeroslaviz wrote:
>> Hello James and Bioconductor users,
>>
>> It starts to look better now. here is a short summary of my script:
>> dat<- changedGenes.sub # changedGenes.sub is the complete data from the
>> # file FB_simulated_contrasts for Luke
>>
>> lst<- tapply(1:nrow(dat), dat$flybase_gene_id, function(x)
>> dat[x,"bioProc"])
>>
>> lst2<- lapply(lst, function(x) unlist(strsplit(as.character(x), ":")))
>> unlst<- cbind(rep(names(lst2), sapply(lst2, length)), unlist(lst2,
>> use.names = FALSE))
>>
>> done<- tapply(1:nrow(unlst), unlst[,2], function(x) unlst[x,1])
>
> The only way you will be able to get this into a data.frame is if you
> have a consistent number of columns. Since you can have an arbitrary
> number of Flybase genes associated with a particular GO term, you have
> to collapse each list item to length one.
>
> This is easy enough to do, just collapse to a single string, separated
> by commas.
>
> done <- lapply(done, paste, collapse = ",")
> out <- data.frame(GO = names(done), FBgn = unlist(done))
>
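A minimal sketch of the collapse-and-export step described above, in case it helps. The toy `done` list and the output file name are made up for illustration; they are not the real data.

```r
# Toy stand-in for the real 'done' list (GO term -> FlyBase gene IDs).
done <- list(
  "protein binding"    = c("FBgn0001128", "FBgn0034454"),
  "hydrolase activity" = c("FBgn0001128", "FBgn0040736"),
  "ATP binding"        = "FBgn0053057"
)

# Collapse each list element to a single comma-separated string.
collapsed <- vapply(done, paste, character(1), collapse = ",")

# Two columns: one row per GO term.
out <- data.frame(GO = names(collapsed), FBgn = unname(collapsed),
                  stringsAsFactors = FALSE)

# Write a tab-delimited file; the file name here is hypothetical.
outfile <- file.path(tempdir(), "go_to_genes.txt")
write.table(out, file = outfile, sep = "\t", quote = FALSE, row.names = FALSE)
```

The resulting file has one line per GO term, so it opens cleanly in a spreadsheet, at the cost of having to split the gene column again later.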
> Best,
>
> Jim
>
>
>>
>> The result I get is a list of lists:
>>> str(done)
>> List of 103
>>  $ : chr [1:4] "FBgn0010359" "FBgn0021800" "FBgn0031420" "FBgn0034345"
>>  $ actin cytoskeleton organization: chr "FBgn0000318"
>>  $ actin filament organization: chr "FBgn0000318"
Jumping in in the middle so perhaps not understanding, but...
To create a flat data frame that contains these data in a 'denormalized'
form (one row per term/gene pair) you might try

len <- sapply(done, length)
data.frame(Term=rep(names(done), len),
           FBgn=unlist(done, use.names=FALSE))

use.names=FALSE is an efficiency that likely does not make a difference
in the current situation. It might be necessary to first drop NULL
elements, keeping only those that are not NULL, e.g.,
done <- Filter(Negate(is.null), done)
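
A small self-contained sketch of the idea, under the assumption that `done` is a named list of character vectors. The toy values and the column name `FBgn` are illustrative only.

```r
# Toy list: GO term -> gene IDs, including one NULL element.
done <- list(
  "protein binding" = c("FBgn0001128", "FBgn0034454"),
  "apoptosis"       = "FBgn0016977",
  "empty term"      = NULL
)

# Keep only the elements that are not NULL.
done <- Filter(Negate(is.null), done)

# Repeat each term once per gene, then line the genes up alongside.
len <- sapply(done, length)
long <- data.frame(Term = rep(names(done), len),
                   FBgn = unlist(done, use.names = FALSE),
                   stringsAsFactors = FALSE)
```

Each row of `long` is a single term/gene pair, so write.table() can export it directly.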
Martin
>>  $ adenosine to inosine editing: chr "FBgn0044510"
>>  $ adult behavior: chr "FBgn0044510"
>>  $ adult locomotory behavior: chr "FBgn0044510"
>>  $ antimicrobial humoral response: chr "FBgn0000318"
>>  $ apoptosis: chr "FBgn0016977"
>>  $ apposition of dorsal and ventral imaginal disc-derived wing surfaces: chr "FBgn0034326"
>>  $ asymmetric cell division: chr "FBgn0052484"
>> ...
>>
>> Unfortunately I can't find a way of converting this list of lists into an
>> exportable table/file to work with.
>> What I would like to have is the same content as this list of lists,
>> but as a data.frame with two columns, like this (this is a
>> *hypothetical object*, which I couldn't generate until now):
>>> done
>> GO category    Gene_IDs
>> no category : chr [1:4] "FBgn0010359" "FBgn0021800" "FBgn0031420" "FBgn0034345"
>> actin cytoskeleton organization : chr "FBgn0000318"
>> actin filament organization : chr "FBgn0000318"
>> adenosine to inosine editing : chr "FBgn0044510"
>> adult behavior : chr "FBgn0044510"
>> adult locomotory behavior : chr "FBgn0044510"
>> antimicrobial humoral response : chr "FBgn0000318"
>> apoptosis : chr "FBgn0016977"
>> apposition of dorsal and ventral imaginal disc-derived wing surfaces : chr "FBgn0034326"
>> asymmetric cell division : chr "FBgn0052484"
>>
>> Just using as.data.frame can't convert it as it still stays a list,
>> which is
>> not exportable.
>> I tried to convert the list of lists using:
>>> done.df<- do.call('rbind', lapply(names(done),
>> function(.name){data.frame(done[[.name]], Name=.name)}))
>>
>> But I get the error message that I have different length of rows.
>> Error in data.frame(done[[.name]], Name = .name) :
>> arguments imply differing number of rows: 0, 1
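
One way this error can arise is a list element of length zero: data.frame() cannot pair zero gene IDs with a single Name. The sketch below reproduces that with a made-up list and then skips empty elements. Whether this is the actual cause in the real data is an assumption.

```r
# Toy list with one empty element (hypothetical data).
done <- list(
  "apoptosis"  = "FBgn0016977",
  "empty term" = character(0)
)

# Reproduce the error: zero rows of genes against one row of Name.
msg <- tryCatch(
  {
    data.frame(done[["empty term"]], Name = "empty term")
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Keep only terms with at least one gene, then bind the pieces.
keep <- done[sapply(done, length) > 0]
pieces <- lapply(names(keep), function(nm) {
  data.frame(Gene = keep[[nm]], Name = nm, stringsAsFactors = FALSE)
})
done.df <- do.call(rbind, pieces)
```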
>>
>> I would like to know if there is a way of exporting a list of lists
>> into a
>> table, or to convert it into a data.frame.
>>
>> Thanks for any help
>>
>> Assa
>>
>>
>> On Mon, Jan 17, 2011 at 16:56, Assa Yeroslaviz<frymor at gmail.com> wrote:
>>
>>> Hi again,
>>>
>>> ok. I solved it. well to be honest, it wasn't that difficult. I just
>>> added
>>>
>>>> lst2<- lapply(list, function(x) unlist(strsplit(as.character(x),
>>> ":")))
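
A short illustration of why the as.character() call matters: strsplit() only accepts character vectors, and the columns here were read in as factors. The example values are invented.

```r
# A factor such as read.table() may have produced for the GO column.
x <- factor(c("protein binding:ATP binding", "hydrolase activity"))

# strsplit() refuses a factor; capture the error message.
err <- tryCatch(strsplit(x, ":"), error = function(e) conditionMessage(e))

# Converting to character first makes the split work.
terms <- unlist(strsplit(as.character(x), ":"))
```

Reading the file with stringsAsFactors = FALSE (or as.is = TRUE) should avoid the factor in the first place.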
>>>
>>> Assa
>>>
>>>
>>> On Mon, Jan 17, 2011 at 16:42, Assa Yeroslaviz<frymor at gmail.com> wrote:
>>>
>>>>
>>>> Hi James,
>>>>
>>>> thanks for the help, but unfortunately I get an error message when
>>>> running
>>>> the second line
>>>>
>>>>> list<- tapply(1:nrow(dat), dat$flybase_gene_id, function(x)
>>>> dat[x,"GOMF"])
>>>>
>>>>> lst2<- lapply(list, function(x) unlist(strsplit(x, ":")))
>>>>
>>>> Error in strsplit(x, ":") : non-character argument
>>>>
>>>>> str(list)
>>>> List of 13369
>>>> $ FBgn0000008: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: NA
>>>> $ FBgn0000014: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: 3330 NA
>>>> $ FBgn0000015: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: 2546 880
>>>> $ FBgn0000017: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: NA 35
>>>> $ FBgn0000018: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: NA
>>>> $ FBgn0000022: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: 893
>>>> $ FBgn0000024: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: 2546
>>>> $ FBgn0000028: Factor w/ 3814 levels "\"1,3-beta-glucan synthase
>>>> activity:transferase activity:transferase activity, transferring
>>>> glycosyl
>>>> groups\"",..: NA
>>>>
>>>> I tried to convert the factor of the data.frame into characters, but it
>>>> still give me the same error.
>>>> list1<- data.frame(lapply(list, as.character), stringsAsFactors=FALSE)
>>>>
>>>> Is there a way of converting the lines to characters?
>>>>
>>>> THX
>>>> Assa
>>>>
>>>>
>>>>
>>>>
>>>> Hi Assa,
>>>>>
>>>>> OK, I see your point. This is still pretty easy.
>>>>>
>>>>> lst<- tapply(1:nrow(dat), dat$flybase_gene_id, function(x)
>>>>> dat[x,"GOMF"])
>>>>>
>>>>> lst2<- lapply(lst, function(x) unlist(strsplit(x, ":")))
>>>>>
>>>>
>>>>
>>>>
>>>>> unlst<- cbind(rep(names(lst2), sapply(lst2, length)), unlist(lst2,
>>>>> use.names = FALSE))
>>>>>
>>>>> done<- tapply(1:nrow(unlst), unlst[,2], function(x) unlst[x,1])
>>>>>
>>>>> There are assuredly other more elegant ways to do this, but this
>>>>> should
>>>>> suffice.
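
As one possible alternative to the steps above, here is a sketch that uses split() for the final inversion. The toy `dat` is invented and assumes one row per gene with a colon-separated GOMF column, which may not match the real file.

```r
# Toy table: one row per gene, GO terms separated by ':'.
dat <- data.frame(
  flybase_gene_id = c("FBgn0001128", "FBgn0040736", "FBgn0034454"),
  GOMF = c("hydrolase activity:protein binding",
           "aminopeptidase activity:hydrolase activity",
           "protein binding"),
  stringsAsFactors = FALSE
)

# Split each gene's annotation string into individual GO terms.
terms <- strsplit(as.character(dat$GOMF), ":")

# One row per (gene, term) pair.
pairs <- data.frame(
  gene = rep(dat$flybase_gene_id, sapply(terms, length)),
  term = unlist(terms),
  stringsAsFactors = FALSE
)

# Invert the mapping: GO term -> all genes annotated with it.
done <- split(pairs$gene, pairs$term)
```

If a gene appears in several rows, wrapping the result in lapply(done, unique) would remove repeated IDs.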
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 1/12/2011 7:28 AM, Assa Yeroslaviz wrote:
>>>>>
>>>>>> Hi James,
>>>>>>
>>>>>> thanks for this idea, but unfortunately it wasn't exactly what I
>>>>>> needed.
>>>>>> This kind of transformation I was able to do on my own. The problem
>>>>>> is that
>>>>>> I would like to split the third column into single GO categories.
>>>>>>
>>>>>> This is what I have until now, after applying the tapply command:
>>>>>> "carboxylesterase activity:hydrolase activity:3',5'-cyclic-nucleotide
>>>>>> phosphodiesterase activity:protein binding" FBgn0001128
>>>>>> aminopeptidase activity:metalloexopeptidase activity:hydrolase
>>>>>> activity:manganese ion binding FBgn0040736
>>>>>> nucleotide binding:protein binding:ATP binding:chaperone
>>>>>> binding:ammonium
>>>>>> transmembrane transporter activity FBgn0053057,FBgn0035889
>>>>>> protein binding FBgn0034454
>>>>>>
>>>>>> What I need is to split the first column (or in the original file the
>>>>>> third
>>>>>> column) in to separate names (in this column these are separated by
>>>>>> ':').
>>>>>> and concatenate ALL the right IDs to the ALL the right GO categories.
>>>>>> As if to get something like:
>>>>>> carboxylesterase activity FBgn0001128 ....
>>>>>> hydrolase activity FBgn0001128 FBgn0040736 .....
>>>>>> 3',5'-cyclic-nucleotide phosphodiesterase activity
>>>>>> FBgn0001128 ....
>>>>>> protein binding FBgn0001128 FBgn0034454 FBgn0053057
>>>>>> FBgn0035889
>>>>>> ....
>>>>>> nucleotide binding FBgn0053057 FBgn0035889 ...
>>>>>> ATP binding FBgn0053057 FBgn0035889 ....
>>>>>> chaperone binding FBgn0053057 FBgn0035889 ....
>>>>>> ammonium transmembrane transporter activity FBgn0053057
>>>>>> FBgn0035889 ....
>>>>>> aminopeptidase activity FBgn0040736 ....
>>>>>> metalloexopeptidase activity FBgn0040736 ....
>>>>>> manganese ion binding FBgn0040736 ....
>>>>>> ....
>>>>>>
>>>>>> I would appreciate any help on that subject.
>>>>>>
>>>>>> THX
>>>>>> Assa
>>>>>>
>>>>>> On Thu, Jan 6, 2011 at 22:09, James MacDonald<jmacdon at med.umich.edu>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Assa,
>>>>>>>
>>>>>>> I don't think you need a package for that. A call to tapply()
>>>>>>> followed
>>>>>>> by a
>>>>>>> call to do.call() should get you where you want to go.
>>>>>>>
>>>>>>> Say you read your table into R, and call it 'dat'.
>>>>>>>
>>>>>>> thelist<- tapply(1:nrow(dat), dat$GOMF, function(x) dat[x, 3])
>>>>>>>
>>>>>>> then you will have a list, with the names being the GOMF and the
>>>>>>> list
>>>>>>> items
>>>>>>> being all the gene ids. Collapsing that to a matrix is difficult
>>>>>>> because you
>>>>>>> will have different numbers of columns. So you can either
>>>>>>> collapse all
>>>>>>> the
>>>>>>> list items using commas, or directly write out to a file. Collapsing
>>>>>>> with
>>>>>>> commas is easy:
>>>>>>>
>>>>>>> commalist<- lapply(thelist, paste, collapse = ",")
>>>>>>> avector<- do.call("c", commalist)
>>>>>>> names(avector)<- names(commalist)
>>>>>>>
>>>>>>> or you could just write out to a file using something like
>>>>>>>
>>>>>>> con<- file("mydata.txt", "w")
>>>>>>>
>>>>>>> for(i in seq(along = commalist)) cat(names(commalist)[i],
>>>>>>> commalist[[i]],
>>>>>>> "\n", sep = "\t", file = con)
>>>>>>>
>>>>>>> close(con)
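
A variant of the same idea using writeLines(), which avoids the trailing tab that cat() with sep = "\t" puts before each newline. The toy `commalist` and the temporary file path are placeholders.

```r
# Toy collapsed list: GO term -> comma-separated gene IDs.
commalist <- list("protein binding" = "FBgn0001128,FBgn0034454",
                  "apoptosis"       = "FBgn0016977")

# Write one "term<TAB>genes" line per list element.
outfile <- file.path(tempdir(), "mydata.txt")
con <- file(outfile, "w")
writeLines(paste(names(commalist), unlist(commalist), sep = "\t"), con)
close(con)

# Read the file back to check what was written.
written <- readLines(outfile)
```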
>>>>>>>
>>>>>>> All untested, so you might have to fiddle a bit to get the
>>>>>>> results you
>>>>>>> want.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>> James W. MacDonald, M.S.
>>>>>>> Biostatistician
>>>>>>> Douglas Lab
>>>>>>> 5912 Buhl
>>>>>>> 1241 E. Catherine St.
>>>>>>> Ann Arbor MI 48109-5618
>>>>>>> 734-615-7826
>>>>>>>
>>>>>>>> Assa Yeroslaviz 01/06/11 1:02 PM>>>
>>>>>>>>>>
>>>>>>>>> Hi, everybody,
>>>>>>>
>>>>>>> I was wondering whether there is a package to cluster a list of
>>>>>>> genes
>>>>>>> to
>>>>>>> different GO categories
>>>>>>>
>>>>>>> my problem is as such:
>>>>>>> i have a list of genes (a tab delimited file):
>>>>>>> id flybasename_gene flybase_gene_id entrezgene GOMF
>>>>>>>
>>>>>>> 1616608_a_at Gpdh FBgn0001128 33824 carboxylesterase
>>>>>>> activity
>>>>>>> hydrolase activity 3',5'-cyclic-nucleotide phosphodiesterase
>>>>>>> activity
>>>>>>> protein binding
>>>>>>> 1622892_s_at CG33057 FBgn0053057 318833 nucleotide
>>>>>>> binding
>>>>>>> protein binding ATP binding chaperone binding ammonium
>>>>>>> transmembrane transporter activity
>>>>>>> 1622892_s_at mkg-p FBgn0035889 38955 nucleotide binding
>>>>>>> protein binding ATP binding chaperone binding ammonium
>>>>>>> transmembrane transporter activity
>>>>>>> 1622893_at IM3 FBgn0040736 50209 aminopeptidase activity
>>>>>>> metalloexopeptidase activity hydrolase activity manganese ion
>>>>>>> bindin
>>>>>>> 1622894_at CG15120 FBgn0034454 37248 protein binding
>>>>>>>
>>>>>>> I would like to try and group the genes in various GO categories,
>>>>>>> which
>>>>>>> are
>>>>>>> mentioned here in the last columns. The GO categories take more than
>>>>>>> one column and the number is not equal in each line, depending on the
>>>>>>> depth of the annotation for each gene.
>>>>>>> Is there a way of transforming the table, so that I have in the first
>>>>>>> column a list of my GO categories and then on each line a list with
>>>>>>> gene IDs (the right IDs are not important as I can change them as I
>>>>>>> wish)?
>>>>>>> I would like to have something like that:
>>>>>>> GO genes
>>>>>>> protein binding FBgn0001128 FBgn0053057 FBgn0035889 etc.
>>>>>>> ammonium transmembrane transporter activity FBgn0053057
>>>>>>> FBgn0035889
>>>>>>> hydrolase activity FBgn0040736 FBgn0001128
>>>>>>>
>>>>>>>
>>>>>>> I would appreciate any kind of help or ideas
>>>>>>>
>>>>>>> Thanks
>>>>>>> Assa
>>>>>>>
>>>>>>> [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>>>> **********************************************************
>>>>>>> Electronic Mail is not secure, may not be read every day, and should
>>>>>>> not be
>>>>>>> used for urgent or sensitive issues
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>> --
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> Douglas Lab
>>>>> University of Michigan
>>>>> Department of Human Genetics
>>>>>
>>>>> 5912 Buhl
>>>>> 1241 E. Catherine St.
>>>>> Ann Arbor MI 48109-5618
>>>>> 734-615-7826
>>>>>
>>>>
>>>>
>>>
>>
>
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793