[BioC] How to annotate genomic coordinates

Mon Nov 12 19:24:10 CET 2012

Hi Jose,

Here is a slightly different approach.

library(Mus.musculus)
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene

## I assume you've figured out how to read your data into a GRanges.
## Here we use a small example.
gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20))

## The locateVariants() function in the VariantAnnotation package will
## give you the GENEIDs for all ranges in 'query'. If the range does not
## fall in a gene (i.e., it is intergenic), then the PRECEDEID and
## FOLLOWID are provided. The LOCATION column indicates what
## region the range fell in. See ?locateVariants for details on the
## different regions and the ability to set 'upstream' and 'downstream'
## values for promoter regions.
library(VariantAnnotation)
loc <- locateVariants(query=gr, subject=txdb, region=AllVariants())

 > loc
GRanges with 1 range and 7 metadata columns:
       seqnames               ranges strand | LOCATION   QUERYID      TXID
<Rle> <IRanges> <Rle> | <factor> <integer> <integer>
   [1]    chr17 [31245606, 31245625]      * |   coding         1     47716
           CDSID      GENEID   PRECEDEID    FOLLOWID
<integer> <character> <character> <character>
   [1]    184246       11307 <NA> <NA>

## Use the select() function to map these GENEID's to the other
## values you are interested in. The GENEID's from locateVariants()
## are Entrez ID's so we use those as our 'keytype'.
entrezid <- loc$GENEID
select(Mus.musculus, keytype="ENTREZID", keys=entrezid, 
cols=c("GENENAME", "ENSEMBL"))
 > select(Mus.musculus, keytype="ENTREZID", keys=entrezid, 
cols=c("GENENAME", "ENSEMBL"))
   ENTREZID            ENSEMBL
1    11307 ENSMUSG00000024030
                                               GENENAME
1 ATP-binding cassette, sub-family G (WHITE), member 1

Valerie

On 11/12/12 06:59, José Luis Lavín wrote:
> Hello all,
>
> First of all I want to thank everybody that gave me advice on this subject.
> trying to follow the advice, I added some modifications mixing codes from
> Tim,  Harris and James , but it seems I got lost somewhere in between...
> I'm sorry for bothering you all again, but I'm afraid I need some more help.
>
> I need to read my ids.txt file and then iterate use each line of id
> (chr.position) to perform the annotation process with it. I thought of a
> "for" loop to achieve it, but I do not seem to catch the essence of R
> processes and I guess I miss my tryout.
> Please have a look at my "disaster" and help me with it If such thing is
> possible...
>
> biocLite('Mus.musculus')
> require(Mus.musculus)
> require(TxDb.Mmusculus.UCSC.mm9.knownGene)
> txdb<- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, 'gene')
> egid<- names(txdb)
> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA))
> length(name) == length(egid) ## TRUE, OK to use
> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA))
> length(esbl) == length(egid) ## FALSE, do not use
>
> #read input table
> coordTable<- read.delim(/path_to_dir/ids.txt, sep=".", header=FALSE, as.is
> =TRUE)
>
> for(i in 1:length(coordTable))
> {
> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1),
>                   strand='*')
> }
>
> genome(probes)<- 'mm9' ## prevents some stoopid mistakes
>
> geneBodyProbes<- findOverlaps(probes, txdb)
> geneBodyProbes
>
> write.table
> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
>
> ## Hits of length 1
> ## queryLength: 1
> ## subjectLength: 21761
> ##   queryHits subjectHits
> ##<integer>    <integer>
> ##     1         1        1641
> ##
>
> name[subjectHits(geneBodyProbes)]
>
> ##   11307  # egid
> ## "Abcg1"  # name
> ##
>
> org.Mm.egCHRLOC[['11307']]
>
> ##       17
> ## 31057694
> ##
>
> org.Mm.egENSEMBL[['11307']]
>
> ## [1] "ENSMUSG00000024030"
>
> promotersByGene<- flank(txdb, 1500) # or however many bases you want
> promotersByGene<- flank(promotersByGene, 200, FALSE) # a little extra
> promoterProbes<- findOverlaps(probes, promotersByGene)
> promoterProbes
>
> write.table
> (promoterProbes,file="/path_to_dir/promo_trans_id.tsv",quote=FALSE,sep="\t")
>
> ## Hits of length 0
> ## queryLength: 1
> ## subjectLength: 21761
> ##
> ## read the GRanges and GenomicFeatures vignettes for more
>
> write.table
> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
>
>
> *Thanks in advance*
>
> JL
>
> 2012/11/8 Harris A. Jaffee<hj at jhu.edu>
>
>> On the elementary end of all this...
>>
>> If the sites are on a *file*, one per line, you could do
>>
>>          y<- read.delim(filename, sep=".", header=FALSE, as.is=TRUE)
>>
>> etc.
>>
>> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote:
>>
>>> Hi Jose,
>>>
>>> On 11/8/2012 10:28 AM, José Luis Lavín wrote:
>>>> Dear James,
>>>>
>>>> To answer your questions swiftly, the features are methylation sites
>> (that's why I only have a coordinate instead of having a Start/End pair) in
>> mouse (mm9) genome and I have a list of those in the following format:
>>>> chr17.31246506
>>>>
>>>> How could I read the list so that I can input the "chr" and the
>> "coordinate" parameters into the expression you recommended?
>>> First you need to split your data to chr and pos.
>>>
>>> chr.pos<- do.call("rbind", strsplit(<your vector of chr17.pos data>,
>> "\\."))
>>> Note that your vector of chr.pos data may have been in as a factor, so
>> you may need to wrap with an as.character(). If you don't know what I mean
>> by that, you should first read 'An Introduction to R' (
>> http://cran.r-project.org/doc/manuals/R-intro.html). That will be time
>> well spent.
>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>> Both the seqnames and ranges argument to GRanges() can be based on
>> vectors. So if you had a matrix (called 'y') like
>>> chr16    3423543
>>> chr3    403992
>>> chr18    3494202
>>>
>>> then you can do
>>>
>>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1))
>>>
>>> see ?GRanges for more info.
>>>
>>> As a side note, IIRC, methylation sites are not in general found in
>> exons, but are more likely to be found upstream of a given gene. You might
>> then want to use fiveUTRsByTranscript() rather than exonsBy(). See ?exonsBy
>> after loading the GenomicFeatures package.
>>>> I 'm lookin forward to obtain a table where my "coordinate-based Id"
>> appears in a column, and the gene_name and if possible, the  Entrez/Ensembl
>> Ids in two other columns :
>>>> Coordinate Gene_name  Entrez_ID  Ensembl_ID
>>>>
>>>> Is it easy to do this in R?
>>> Of course! Everything is easy to do in R. You should see my sweet R
>> toast 'n coffee maker ;-D
>>> But you should note that the fiveUTRByTranscript() function is
>> transcript based (obvs), and so you will have multiple transcripts per
>> gene. This makes things much more difficult, as a given methylation site
>> may overlap multiple transcripts of the same gene. So that makes it hard to
>> merge your original data with the overlapping transcripts.
>>> You could do something like
>>>
>>> ex<- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, use.names
>> = TRUE)
>>> ex2<- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id
>>>
>>> then you could use unique Gene IDs thusly
>>>
>>> my.trans<- select(Mus.musculus, unique(as.character(ex2)),
>> c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID")
>>> That should at least give you a start. See where you can go on your own,
>> and let us know if you get stuck.
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>>> I'm still really new to R and I lack many concepts you may consider
>> basic... I'm awfully sorry
>>>> Best
>>>>
>>>> JL
>>>>
>>>> 2012/11/8 James W. MacDonald<jmacdon at uw.edu<mailto:jmacdon at uw.edu>>
>>>>
>>>>     Hi Jose,
>>>>
>>>>
>>>>     On 11/8/2012 8:19 AM, José Luis Lavín wrote:
>>>>
>>>>         Dear Bioconductor list,
>>>>
>>>>         I write you this email asking for a Bioconductor module that
>>>>         allows me to
>>>>         annotate genomic coordinates and get different GeneIds.
>>>>         I'll show you an example of what I'm referring to:
>>>>
>>>>         I have this data:
>>>>         Chromosome  coordinate
>>>>         chr17              31246506
>>>>
>>>>
>>>>     It depends on what that coordinate is. Is it the start of a
>>>>     transcript? A SNP? Do you really just have a single coordinate, or
>>>>     do you have a range? What species are we talking about here?
>>>>
>>>>     Depending on what your data are, you might want to use either one
>>>>     of the TxDb packages, or a SNPlocs package. There really isn't
>>>>     much to go on here. If I assume this is a coordinate that one
>>>>     might think is within an exon, and if I further assume you are
>>>>     working with H. sapiens, I could suggest something like
>>>>
>>>>     library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>     ex<- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
>>>>
>>>>     x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>>>
>>>>     ex[ex %in% x]
>>>>
>>>>     or maybe more appropriately
>>>>
>>>>     names(ex)[ex %in% x]
>>>>
>>>>     which will give you the Gene ID, and you can go from there using
>>>>     the org.Hs.eg.db package.
>>>>
>>>>     If however, your coordinate isn't in an exon, but might be in a
>>>>     UTR, you can look at ?exonsBy to see what other sequences you can
>>>>     extract to compare with.
>>>>
>>>>     If these are SNPs, then you can look at the help pages for the
>>>>     relevant SNPlocs package.
>>>>
>>>>     Best,
>>>>
>>>>     Jim
>>>>
>>>>
>>>>
>>>>         which can also be written this way by the program that yielded
>>>>         the result:
>>>>         chr17.31246506
>>>>
>>>>         And I need to convert this data into a gene name and known
>>>>         gene Ids, such
>>>>         as:
>>>>
>>>>         Gene name  Entrez_ID  Ensembl_ID
>>>>
>>>>         Tff3 NM_011575 20050
>>>>         Can you please advice me about a module able to perform this
>>>>         ID conversion
>>>>         using a list of  "chr17.31246506" type coordinates as input?
>>>>
>>>>         Thanks in advance
>>>>
>>>>         With best wishes
>>>>
>>>>
>>>>
>>>>         _______________________________________________
>>>>         Bioconductor mailing list
>>>>         Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>>>>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>         Search the archives:
>>>>
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>     --     James W. MacDonald, M.S.
>>>>     Biostatistician
>>>>     University of Washington
>>>>     Environmental and Occupational Health Sciences
>>>>     4225 Roosevelt Way NE, # 100
>>>>     Seattle WA 98105-6099
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> --
>>>> Dr. José Luis Lavín Trueba
>>>>
>>>> Dpto. de Producción Agraria
>>>> Grupo de Genética y Microbiología
>>>> Universidad Pública de Navarra
>>>> 31006 Pamplona
>>>> Navarra
>>>> SPAIN
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor