[BioC] GOstats and microRNA pipeline using biomaRt

David martin vilanew at gmail.com
Thu Mar 31 11:47:42 CEST 2011


OK, thanks.
Any idea how to turn the biomaRt output into valid GOFrame input?

For example, I wrote this function:

# Assumes biomaRt is loaded and `mart` was created with useMart(),
# as in the earlier message quoted below.
getgoids <- function(id) {
  getBM(
    attributes = c(
      'entrezgene',
      'ensembl_transcript_id',
      'go_biological_process_id',
      'go_biological_process_linkage_type',
      'go_cellular_component_id',
      'go_cellular_component_linkage_type',
      'go_molecular_function_id',
      'go_molecular_function_linkage_type'),
    filters = "ensembl_transcript_id",
    values = id,
    mart = mart)
}
getgoids('ENST00000306434')

How do I turn this into a valid GOFrame object?
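
One possible reshaping, as a minimal sketch rather than a tested recipe: the goframeData line quoted further down builds a three-column data frame (GO ID, evidence code, gene ID, in that order), whereas getBM() returns one wide table with an ID/linkage-type pair per ontology. The helper name to_goframe_input and its gene_col argument are hypothetical, the table bm is a hand-made stand-in for real getBM() output with placeholder values, and I am assuming the *_linkage_type columns hold the evidence codes. I have not checked whether GOFrame accepts Ensembl transcript IDs as gene IDs, so gene_col can be switched to 'entrezgene'.

```r
# Hand-made stand-in for the wide table returned by getgoids().
# All values are placeholders, not real annotations.
bm <- data.frame(
  entrezgene = c(1L, 1L),
  ensembl_transcript_id = c("ENST_A", "ENST_A"),
  go_biological_process_id = c("GO:0000001", "GO:0000002"),
  go_biological_process_linkage_type = c("IEA", "IDA"),
  go_cellular_component_id = c("GO:0000003", ""),
  go_cellular_component_linkage_type = c("TAS", ""),
  go_molecular_function_id = c("GO:0000004", "GO:0000004"),
  go_molecular_function_linkage_type = c("IEA", "IEA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stack the three (GO ID, linkage type) column pairs
# into one long table with columns go_id, Evidence, gene_id.
to_goframe_input <- function(bm, gene_col = "ensembl_transcript_id") {
  ontologies <- c("biological_process", "cellular_component",
                  "molecular_function")
  pieces <- lapply(ontologies, function(ont) {
    data.frame(
      go_id    = bm[[paste0("go_", ont, "_id")]],
      Evidence = bm[[paste0("go_", ont, "_linkage_type")]],
      gene_id  = as.character(bm[[gene_col]]),
      stringsAsFactors = FALSE
    )
  })
  long <- do.call(rbind, pieces)
  # Drop rows without a GO ID (empty string or NA), then exact duplicates.
  keep <- !is.na(long$go_id) & nzchar(long$go_id)
  long <- unique(long[keep, ])
  rownames(long) <- NULL
  long
}

goframeData <- to_goframe_input(bm)
print(goframeData)

# With the Bioconductor packages loaded, the result could then be passed on
# as in the earlier message:
# goFrame <- GOFrame(goframeData, organism = "Homo sapiens")
```

Note that unique() removes repeated rows, so any weighting by the number of binding sites would have to be handled separately from this table.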

thanks,
david




On 03/31/2011 10:10 AM, James F. Reid wrote:
> Hi David,
>
> On 03/30/2011 08:31 PM, David martin wrote:
>  > Yes, absolutely. A few Ensembl releases ago UTRs tended to be shorter, but
>  > this is getting better now. How would you normalize that based on
>  > length?
>
> I'm afraid I don't have a simple answer to this; it would need some
> thought, especially with respect to your GO enrichment analysis.
> Any ideas from the members of the list?
>
> Best,
> J.
>
>> On 03/30/2011 07:00 PM, James F. Reid wrote:
>>> Hi David,
>>>
>>> I understand your reasoning for counting the number of miRNA binding
>>> sites within the 3' UTR of a predicted target: you are trying to include
>>> the 'combinatorial' effect of miRNA targeting.
>>> I would, however, try to account for the length of each UTR (some kind of
>>> normalization, if you wish), since the longer the UTR, the more likely it
>>> is that a miRNA will bind.
>>> Does this make sense?
>>>
>>> Best,
>>> J.
>>>
>>> On 03/30/2011 05:23 PM, David martin wrote:
>>>> On 03/30/2011 04:56 PM, Steve Lianoglou wrote:
>>>>> Hi,
>>>>>
>>>>> On Wed, Mar 30, 2011 at 9:43 AM, David
>>>>> martin<vilanew at gmail.com> wrote:
>>>>>> Hi,
>>>>>> I am opening this new discussion so as not to confuse it with the
>>>>>> previous one.
>>>>>>
>>>>>> The objective here is to look for overrepresented GO terms among
>>>>>> microRNA targets. One microRNA can have several targets (genes), and a
>>>>>> single gene can be targeted by several microRNAs. The aim is to check,
>>>>>> for a specific microRNA, which GO terms are overrepresented.
>>>>>>
>>>>>>
>>>>>> OK, so let's say my microRNA of interest is miR-A.
>>>>>>
>>>>>> Step 1: based on my favorite prediction algorithm, I have obtained a
>>>>>> list of genes targeted by miR-A. The genes are Ensembl transcripts and,
>>>>>> as I said before, miR-A can target the same transcript several times
>>>>>> (at different locations), so I need to account for this.
>>>>>>
>>>>>> miR-A targets ->
>>>>>> (ENST001,ENST001,ENST001,ENST0025,ENST089,ENST099,ENST0099......) up
>>>>>> to 300 different transcripts.
>>>>>
>>>>> I don't get why you'd want to have the same transcript multiple times
>>>>> as a target for the miRNA -- if the miRNA targets the same transcript
>>>>> in two different locations, you then want to double count the GO terms
>>>>> associated with that transcript?
>>>>
>>>> That's correct. The idea behind that is that a transcript targeted at
>>>> different locations is more "likely to be twice targeted", and therefore
>>>> the GO terms associated with this transcript have to be replicated. This
>>>> sounds good to me, but I do not expect you to agree with that.
>>>>
>>>>
>>>> I have managed to get all GO IDs with a small function. Basically, you
>>>> pass one transcript ID per call in a loop:
>>>>
>>>> getgoids <- function (id) {
>>>> getBM(attributes=c(
>>>> 'go_biological_process_id',
>>>> 'go_biological_process_linkage_type',
>>>> 'go_cellular_component_id',
>>>> 'go_cellular_component_linkage_type',
>>>> 'go_molecular_function_id',
>>>> 'go_molecular_function_linkage_type')
>>>> ,filters="ensembl_transcript_id", values=id, mart=mart)
>>>> }
>>>>
>>>> # genes: character vector of all Ensembl transcript IDs
>>>> goid <- vector("list", length(genes))
>>>> for (i in seq_along(genes))
>>>> {
>>>> goid[[i]] <- getgoids(genes[i])
>>>> }
>>>>
>>>> I agree with you that I might need to add the transcript ID so that
>>>> GOstats can map between transcripts and GO IDs.
>>>>
>>>>
>>>> Now I want to use that as the universe set for GOstats and run the
>>>> hypergeometric test to compare with the GO terms for a specific microRNA.
>>>>
>>>> I guess:
>>>>
>>>> goframeData = data.frame(frame$go_id, frame$Evidence, frame$gene_id)
>>>> #list of all GOids from all transcripts targeted by all microRNA
>>>>
>>>> goFrame = GOFrame(goframeData, organism = "Homo sapiens")
>>>> goAllFrame = GOAllFrame(goFrame) #Geneid to ALL go id mapping
>>>>
>>>>
>>>> Can you correct me on the GSEAGOHyperGParams call below?
>>>> geneSetCollection = list of all GO IDs of all transcripts targeted by
>>>> all microRNAs
>>>> single_mir_transcripts_ids = list of Ensembl transcript IDs targeted by
>>>> a specific microRNA
>>>> universeGeneIds = list of transcript-to-GO mappings
>>>> Is this correct?
>>>>
>>>>
>>>> gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
>>>> params <- GSEAGOHyperGParams(name = "My Custom GSEA based annot Params",
>>>>     geneSetCollection = gsc, geneIds = single_mir_transcripts_ids,
>>>>     universeGeneIds = universe, ontology = "BP", pvalueCutoff = 0.05,
>>>>     conditional = FALSE, testDirection = "over")
>>>>
>>>>
>>>>>
>>>>> Somehow that seems wrong to me -- if the "hit count" of the miRNA to
>>>>> the transcript is important to you, one thing you can do is store your
>>>>> miR-A vector as its "table()" so the names will be the transcripts,
>>>>> and the values will be the number of hits.
>>>>>
>>>>>> I use biomaRt to get the corresponding GO IDs for these transcripts.
>>>>>>
>>>>>> ....
>>>>>> #Select mart database
>>>>>> mart<- useMart("ensembl", dataset="hsapiens_gene_ensembl")
>>>>>>
>>>>>> #Get GO for a specific transcript
>>>>>> # First problem: biomaRt will not return GO terms twice for duplicated
>>>>>> # transcripts. The example below shows that for transcripts
>>>>>> # c("ENST00000347770","ENST00000347770") I get the same GO terms as for
>>>>>> # transcript c("ENST00000347770").
>>>>>> # As I said before, a microRNA can target the same transcript several
>>>>>> # times, so there should be twice the number of GO terms associated
>>>>>> # with this particular microRNA.
>>>>>> # Can we force biomaRt to return redundant GO terms?
>>>>>
>>>>> I'm actually still not sure what you want to do, but if you follow my
>>>>> advice above, you can manipulate the data.frame you get from getBM to
>>>>> replicate rows (or whatever you're trying to do).
>>>>>
>>>>> You will also want to add "ensembl_transcript_id" to your vector of
>>>>> attributes so you can reassociate the rows in the table that is
>>>>> returned to you with your original ensembl transcripts you are
>>>>> querying for, eg:
>>>>>
>>>>> R> gomir<- getBM(attributes=c('ensembl_transcript_id', 'go..', ...),
>>>>> filters='ensembl_transcript_id', values=c("ENST..."), mart=mart)
>>>>>
>>>>> Hope that helps,
>>>>> -steve
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>
>


