[BioC] How to annotate genomic coordinates

Wed Nov 14 15:03:48 CET 2012

Hi Herve,

On 11/13/12 17:32, Hervé Pagès wrote:
> Hi all,
>
> On 11/12/2012 10:24 AM, Valerie Obenchain wrote:
>> Hi Jose,
>>
>> Here is a slightly different approach.
>>
>> library(Mus.musculus)
>> library(TxDb.Mmusculus.UCSC.mm9.knownGene)
>> txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
>>
>> ## I assume you've figured out how to read your data into a GRanges.
>> ## Here we use a small example.
>> gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20))
>>
>> ## The locateVariants() function in the VariantAnnotation package will
>> ## give you the GENEIDs for all ranges in 'query'. If the range does not
>> ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and
>> ## FOLLOWID are provided. The LOCATION column indicates what
>> ## region the range fell in. See ?locateVariants for details on the
>> ## different regions and the ability to set 'upstream' and 'downstream'
>> ## values for promoter regions.
>> library(VariantAnnotation)
>> loc <- locateVariants(query=gr, subject=txdb, region=AllVariants())
>>
>> > loc
>> GRanges with 1 range and 7 metadata columns:
>>        seqnames               ranges strand | LOCATION   QUERYID      
>> TXID
>> <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
>>    [1]    chr17 [31245606, 31245625]      * |   coding         1     
>> 47716
>>            CDSID      GENEID   PRECEDEID    FOLLOWID
>> <integer> <character> <character> <character>
>>    [1]    184246       11307 <NA> <NA>
>
> Nice! So IIUC locateVariants() is not intrinsically for variants but
> seems to be generally useful for "locating" any set of genomic
> positions. Shouldn't we have this (or something similar) in
> GenomicFeatures (probably with a different name)?

Yes, locateVariants() locates any range, not just variants. The name 
choice was probably not the best. I like the idea of possibly renaming 
and moving to GenomicRanges.

Val
>
> Cheers,
> H.
>
>>
>>
>> ## Use the select() function to map these GENEID's to the other
>> ## values you are interested in. The GENEID's from locateVariants()
>> ## are Entrez ID's so we use those as our 'keytype'.
>> entrezid <- loc$GENEID
>> select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
>> cols=c("GENENAME", "ENSEMBL"))
>> > select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
>> cols=c("GENENAME", "ENSEMBL"))
>>    ENTREZID            ENSEMBL
>> 1    11307 ENSMUSG00000024030
>>                                                GENENAME
>> 1 ATP-binding cassette, sub-family G (WHITE), member 1
>>
>>
>> Valerie
>>
>>
>>
>> On 11/12/12 06:59, José Luis Lavín wrote:
>>> Hello all,
>>>
>>> First of all I want to thank everybody that gave me advice on this
>>> subject.
>>> trying to follow the advice, I added some modifications mixing codes 
>>> from
>>> Tim,  Harris and James , but it seems I got lost somewhere in 
>>> between...
>>> I'm sorry for bothering you all again, but I'm afraid I need some more
>>> help.
>>>
>>> I need to read my ids.txt file and then iterate use each line of id
>>> (chr.position) to perform the annotation process with it. I thought 
>>> of a
>>> "for" loop to achieve it, but I do not seem to catch the essence of R
>>> processes and I guess I miss my tryout.
>>> Please have a look at my "disaster" and help me with it If such 
>>> thing is
>>> possible...
>>>
>>> biocLite('Mus.musculus')
>>> require(Mus.musculus)
>>> require(TxDb.Mmusculus.UCSC.mm9.knownGene)
>>> txdb<- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, 'gene')
>>> egid<- names(txdb)
>>> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA))
>>> length(name) == length(egid) ## TRUE, OK to use
>>> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA))
>>> length(esbl) == length(egid) ## FALSE, do not use
>>>
>>> #read input table
>>> coordTable<- read.delim(/path_to_dir/ids.txt, sep=".", header=FALSE,
>>> as.is
>>> =TRUE)
>>>
>>> for(i in 1:length(coordTable))
>>> {
>>> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1),
>>>                   strand='*')
>>> }
>>>
>>> genome(probes)<- 'mm9' ## prevents some stoopid mistakes
>>>
>>> geneBodyProbes<- findOverlaps(probes, txdb)
>>> geneBodyProbes
>>>
>>> write.table
>>> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
>>>
>>> ## Hits of length 1
>>> ## queryLength: 1
>>> ## subjectLength: 21761
>>> ##   queryHits subjectHits
>>> ##<integer> <integer>
>>> ##     1         1        1641
>>> ##
>>>
>>> name[subjectHits(geneBodyProbes)]
>>>
>>> ##   11307  # egid
>>> ## "Abcg1"  # name
>>> ##
>>>
>>> org.Mm.egCHRLOC[['11307']]
>>>
>>> ##       17
>>> ## 31057694
>>> ##
>>>
>>> org.Mm.egENSEMBL[['11307']]
>>>
>>> ## [1] "ENSMUSG00000024030"
>>>
>>> promotersByGene<- flank(txdb, 1500) # or however many bases you want
>>> promotersByGene<- flank(promotersByGene, 200, FALSE) # a little extra
>>> promoterProbes<- findOverlaps(probes, promotersByGene)
>>> promoterProbes
>>>
>>> write.table
>>> (promoterProbes,file="/path_to_dir/promo_trans_id.tsv",quote=FALSE,sep="\t") 
>>>
>>>
>>>
>>> ## Hits of length 0
>>> ## queryLength: 1
>>> ## subjectLength: 21761
>>> ##
>>> ## read the GRanges and GenomicFeatures vignettes for more
>>>
>>> write.table
>>> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
>>>
>>>
>>> *Thanks in advance*
>>>
>>> JL
>>>
>>> 2012/11/8 Harris A. Jaffee<hj at jhu.edu>
>>>
>>>> On the elementary end of all this...
>>>>
>>>> If the sites are on a *file*, one per line, you could do
>>>>
>>>>          y<- read.delim(filename, sep=".", header=FALSE, as.is=TRUE)
>>>>
>>>> etc.
>>>>
>>>> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote:
>>>>
>>>>> Hi Jose,
>>>>>
>>>>> On 11/8/2012 10:28 AM, José Luis Lavín wrote:
>>>>>> Dear James,
>>>>>>
>>>>>> To answer your questions swiftly, the features are methylation sites
>>>> (that's why I only have a coordinate instead of having a Start/End
>>>> pair) in
>>>> mouse (mm9) genome and I have a list of those in the following format:
>>>>>> chr17.31246506
>>>>>>
>>>>>> How could I read the list so that I can input the "chr" and the
>>>> "coordinate" parameters into the expression you recommended?
>>>>> First you need to split your data to chr and pos.
>>>>>
>>>>> chr.pos<- do.call("rbind", strsplit(<your vector of chr17.pos data>,
>>>> "\\."))
>>>>> Note that your vector of chr.pos data may have been in as a 
>>>>> factor, so
>>>> you may need to wrap with an as.character(). If you don't know what I
>>>> mean
>>>> by that, you should first read 'An Introduction to R' (
>>>> http://cran.r-project.org/doc/manuals/R-intro.html). That will be time
>>>> well spent.
>>>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>>>> Both the seqnames and ranges argument to GRanges() can be based on
>>>> vectors. So if you had a matrix (called 'y') like
>>>>> chr16    3423543
>>>>> chr3    403992
>>>>> chr18    3494202
>>>>>
>>>>> then you can do
>>>>>
>>>>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1))
>>>>>
>>>>> see ?GRanges for more info.
>>>>>
>>>>> As a side note, IIRC, methylation sites are not in general found in
>>>> exons, but are more likely to be found upstream of a given gene. You
>>>> might
>>>> then want to use fiveUTRsByTranscript() rather than exonsBy(). See
>>>> ?exonsBy
>>>> after loading the GenomicFeatures package.
>>>>>> I 'm lookin forward to obtain a table where my "coordinate-based Id"
>>>> appears in a column, and the gene_name and if possible, the
>>>> Entrez/Ensembl
>>>> Ids in two other columns :
>>>>>> Coordinate Gene_name  Entrez_ID  Ensembl_ID
>>>>>>
>>>>>> Is it easy to do this in R?
>>>>> Of course! Everything is easy to do in R. You should see my sweet R
>>>> toast 'n coffee maker ;-D
>>>>> But you should note that the fiveUTRByTranscript() function is
>>>> transcript based (obvs), and so you will have multiple transcripts per
>>>> gene. This makes things much more difficult, as a given methylation 
>>>> site
>>>> may overlap multiple transcripts of the same gene. So that makes it
>>>> hard to
>>>> merge your original data with the overlapping transcripts.
>>>>> You could do something like
>>>>>
>>>>> ex<- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, 
>>>>> use.names
>>>> = TRUE)
>>>>> ex2<- do.call("rbind", sapply(ex[ex %in% x], 
>>>>> elementMetadata))$exon_id
>>>>>
>>>>> then you could use unique Gene IDs thusly
>>>>>
>>>>> my.trans<- select(Mus.musculus, unique(as.character(ex2)),
>>>> c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID")
>>>>> That should at least give you a start. See where you can go on your
>>>>> own,
>>>> and let us know if you get stuck.
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>> I'm still really new to R and I lack many concepts you may consider
>>>> basic... I'm awfully sorry
>>>>>> Best
>>>>>>
>>>>>> JL
>>>>>>
>>>>>> 2012/11/8 James W. MacDonald<jmacdon at uw.edu<mailto:jmacdon at uw.edu>>
>>>>>>
>>>>>>     Hi Jose,
>>>>>>
>>>>>>
>>>>>>     On 11/8/2012 8:19 AM, José Luis Lavín wrote:
>>>>>>
>>>>>>         Dear Bioconductor list,
>>>>>>
>>>>>>         I write you this email asking for a Bioconductor module that
>>>>>>         allows me to
>>>>>>         annotate genomic coordinates and get different GeneIds.
>>>>>>         I'll show you an example of what I'm referring to:
>>>>>>
>>>>>>         I have this data:
>>>>>>         Chromosome  coordinate
>>>>>>         chr17              31246506
>>>>>>
>>>>>>
>>>>>>     It depends on what that coordinate is. Is it the start of a
>>>>>>     transcript? A SNP? Do you really just have a single 
>>>>>> coordinate, or
>>>>>>     do you have a range? What species are we talking about here?
>>>>>>
>>>>>>     Depending on what your data are, you might want to use either 
>>>>>> one
>>>>>>     of the TxDb packages, or a SNPlocs package. There really isn't
>>>>>>     much to go on here. If I assume this is a coordinate that one
>>>>>>     might think is within an exon, and if I further assume you are
>>>>>>     working with H. sapiens, I could suggest something like
>>>>>>
>>>>>>     library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>>>     ex<- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
>>>>>>
>>>>>>     x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>>>>>
>>>>>>     ex[ex %in% x]
>>>>>>
>>>>>>     or maybe more appropriately
>>>>>>
>>>>>>     names(ex)[ex %in% x]
>>>>>>
>>>>>>     which will give you the Gene ID, and you can go from there using
>>>>>>     the org.Hs.eg.db package.
>>>>>>
>>>>>>     If however, your coordinate isn't in an exon, but might be in a
>>>>>>     UTR, you can look at ?exonsBy to see what other sequences you 
>>>>>> can
>>>>>>     extract to compare with.
>>>>>>
>>>>>>     If these are SNPs, then you can look at the help pages for the
>>>>>>     relevant SNPlocs package.
>>>>>>
>>>>>>     Best,
>>>>>>
>>>>>>     Jim
>>>>>>
>>>>>>
>>>>>>
>>>>>>         which can also be written this way by the program that 
>>>>>> yielded
>>>>>>         the result:
>>>>>>         chr17.31246506
>>>>>>
>>>>>>         And I need to convert this data into a gene name and known
>>>>>>         gene Ids, such
>>>>>>         as:
>>>>>>
>>>>>>         Gene name  Entrez_ID  Ensembl_ID
>>>>>>
>>>>>>         Tff3 NM_011575 20050
>>>>>>         Can you please advice me about a module able to perform this
>>>>>>         ID conversion
>>>>>>         using a list of  "chr17.31246506" type coordinates as input?
>>>>>>
>>>>>>         Thanks in advance
>>>>>>
>>>>>>         With best wishes
>>>>>>
>>>>>>
>>>>>>
>>>>>>         _______________________________________________
>>>>>>         Bioconductor mailing list
>>>>>>         
>>>>>> Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>>>>>>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>         Search the archives:
>>>>>>
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>>     --     James W. MacDonald, M.S.
>>>>>>     Biostatistician
>>>>>>     University of Washington
>>>>>>     Environmental and Occupational Health Sciences
>>>>>>     4225 Roosevelt Way NE, # 100
>>>>>>     Seattle WA 98105-6099
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> -- 
>>>>>> Dr. José Luis Lavín Trueba
>>>>>>
>>>>>> Dpto. de Producción Agraria
>>>>>> Grupo de Genética y Microbiología
>>>>>> Universidad Pública de Navarra
>>>>>> 31006 Pamplona
>>>>>> Navarra
>>>>>> SPAIN
>>>>> -- 
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> University of Washington
>>>>> Environmental and Occupational Health Sciences
>>>>> 4225 Roosevelt Way NE, # 100
>>>>> Seattle WA 98105-6099
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>