[BioC] How to annotate genomic coordinates
Valerie Obenchain
vobencha at fhcrc.org
Wed Nov 14 15:03:48 CET 2012
Hi Herve,
On 11/13/12 17:32, Hervé Pagès wrote:
> Hi all,
>
> On 11/12/2012 10:24 AM, Valerie Obenchain wrote:
>> Hi Jose,
>>
>> Here is a slightly different approach.
>>
>> library(Mus.musculus)
>> library(TxDb.Mmusculus.UCSC.mm9.knownGene)
>> txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
>>
>> ## I assume you've figured out how to read your data into a GRanges.
>> ## Here we use a small example.
>> gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20))
>>
>> ## The locateVariants() function in the VariantAnnotation package will
>> ## give you the GENEIDs for all ranges in 'query'. If the range does not
>> ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and
>> ## FOLLOWID are provided. The LOCATION column indicates what
>> ## region the range fell in. See ?locateVariants for details on the
>> ## different regions and the ability to set 'upstream' and 'downstream'
>> ## values for promoter regions.
>> library(VariantAnnotation)
>> loc <- locateVariants(query=gr, subject=txdb, region=AllVariants())
>>
>> > loc
>> GRanges with 1 range and 7 metadata columns:
>> seqnames ranges strand | LOCATION QUERYID
>> TXID
>> <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
>> [1] chr17 [31245606, 31245625] * | coding 1
>> 47716
>> CDSID GENEID PRECEDEID FOLLOWID
>> <integer> <character> <character> <character>
>> [1] 184246 11307 <NA> <NA>
>
> Nice! So IIUC locateVariants() is not intrinsically for variants but
> seems to be generally useful for "locating" any set of genomic
> positions. Shouldn't we have this (or something similar) in
> GenomicFeatures (probably with a different name)?
Yes, locateVariants() locates any range, not just variants. The name
choice was probably not the best. I like the idea of possibly renaming
and moving to GenomicRanges.
Val
>
> Cheers,
> H.
>
>>
>>
>> ## Use the select() function to map these GENEID's to the other
>> ## values you are interested in. The GENEID's from locateVariants()
>> ## are Entrez ID's so we use those as our 'keytype'.
>> entrezid <- loc$GENEID
>> select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
>> cols=c("GENENAME", "ENSEMBL"))
>> > select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
>> cols=c("GENENAME", "ENSEMBL"))
>> ENTREZID ENSEMBL
>> 1 11307 ENSMUSG00000024030
>> GENENAME
>> 1 ATP-binding cassette, sub-family G (WHITE), member 1
>>
>>
>> Valerie
>>
>>
>>
>> On 11/12/12 06:59, José Luis Lavín wrote:
>>> Hello all,
>>>
>>> First of all I want to thank everybody that gave me advice on this
>>> subject.
>>> trying to follow the advice, I added some modifications mixing codes
>>> from
>>> Tim, Harris and James , but it seems I got lost somewhere in
>>> between...
>>> I'm sorry for bothering you all again, but I'm afraid I need some more
>>> help.
>>>
>>> I need to read my ids.txt file and then iterate use each line of id
>>> (chr.position) to perform the annotation process with it. I thought
>>> of a
>>> "for" loop to achieve it, but I do not seem to catch the essence of R
>>> processes and I guess I miss my tryout.
>>> Please have a look at my "disaster" and help me with it If such
>>> thing is
>>> possible...
>>>
>>> biocLite('Mus.musculus')
>>> require(Mus.musculus)
>>> require(TxDb.Mmusculus.UCSC.mm9.knownGene)
>>> txdb<- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, 'gene')
>>> egid<- names(txdb)
>>> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA))
>>> length(name) == length(egid) ## TRUE, OK to use
>>> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA))
>>> length(esbl) == length(egid) ## FALSE, do not use
>>>
>>> #read input table
>>> coordTable<- read.delim(/path_to_dir/ids.txt, sep=".", header=FALSE,
>>> as.is
>>> =TRUE)
>>>
>>> for(i in 1:length(coordTable))
>>> {
>>> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1),
>>> strand='*')
>>> }
>>>
>>> genome(probes)<- 'mm9' ## prevents some stoopid mistakes
>>>
>>> geneBodyProbes<- findOverlaps(probes, txdb)
>>> geneBodyProbes
>>>
>>> write.table
>>> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
>>>
>>> ## Hits of length 1
>>> ## queryLength: 1
>>> ## subjectLength: 21761
>>> ## queryHits subjectHits
>>> ##<integer> <integer>
>>> ## 1 1 1641
>>> ##
>>>
>>> name[subjectHits(geneBodyProbes)]
>>>
>>> ## 11307 # egid
>>> ## "Abcg1" # name
>>> ##
>>>
>>> org.Mm.egCHRLOC[['11307']]
>>>
>>> ## 17
>>> ## 31057694
>>> ##
>>>
>>> org.Mm.egENSEMBL[['11307']]
>>>
>>> ## [1] "ENSMUSG00000024030"
>>>
>>> promotersByGene<- flank(txdb, 1500) # or however many bases you want
>>> promotersByGene<- flank(promotersByGene, 200, FALSE) # a little extra
>>> promoterProbes<- findOverlaps(probes, promotersByGene)
>>> promoterProbes
>>>
>>> write.table
>>> (promoterProbes,file="/path_to_dir/promo_trans_id.tsv",quote=FALSE,sep="\t")
>>>
>>>
>>>
>>> ## Hits of length 0
>>> ## queryLength: 1
>>> ## subjectLength: 21761
>>> ##
>>> ## read the GRanges and GenomicFeatures vignettes for more
>>>
>>> write.table
>>> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
>>>
>>>
>>> *Thanks in advance*
>>>
>>> JL
>>>
>>> 2012/11/8 Harris A. Jaffee<hj at jhu.edu>
>>>
>>>> On the elementary end of all this...
>>>>
>>>> If the sites are on a *file*, one per line, you could do
>>>>
>>>> y<- read.delim(filename, sep=".", header=FALSE, as.is=TRUE)
>>>>
>>>> etc.
>>>>
>>>> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote:
>>>>
>>>>> Hi Jose,
>>>>>
>>>>> On 11/8/2012 10:28 AM, José Luis Lavín wrote:
>>>>>> Dear James,
>>>>>>
>>>>>> To answer your questions swiftly, the features are methylation sites
>>>> (that's why I only have a coordinate instead of having a Start/End
>>>> pair) in
>>>> mouse (mm9) genome and I have a list of those in the following format:
>>>>>> chr17.31246506
>>>>>>
>>>>>> How could I read the list so that I can input the "chr" and the
>>>> "coordinate" parameters into the expression you recommended?
>>>>> First you need to split your data to chr and pos.
>>>>>
>>>>> chr.pos<- do.call("rbind", strsplit(<your vector of chr17.pos data>,
>>>> "\\."))
>>>>> Note that your vector of chr.pos data may have been in as a
>>>>> factor, so
>>>> you may need to wrap with an as.character(). If you don't know what I
>>>> mean
>>>> by that, you should first read 'An Introduction to R' (
>>>> http://cran.r-project.org/doc/manuals/R-intro.html). That will be time
>>>> well spent.
>>>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>>>> Both the seqnames and ranges argument to GRanges() can be based on
>>>> vectors. So if you had a matrix (called 'y') like
>>>>> chr16 3423543
>>>>> chr3 403992
>>>>> chr18 3494202
>>>>>
>>>>> then you can do
>>>>>
>>>>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1))
>>>>>
>>>>> see ?GRanges for more info.
>>>>>
>>>>> As a side note, IIRC, methylation sites are not in general found in
>>>> exons, but are more likely to be found upstream of a given gene. You
>>>> might
>>>> then want to use fiveUTRsByTranscript() rather than exonsBy(). See
>>>> ?exonsBy
>>>> after loading the GenomicFeatures package.
>>>>>> I 'm lookin forward to obtain a table where my "coordinate-based Id"
>>>> appears in a column, and the gene_name and if possible, the
>>>> Entrez/Ensembl
>>>> Ids in two other columns :
>>>>>> Coordinate Gene_name Entrez_ID Ensembl_ID
>>>>>>
>>>>>> Is it easy to do this in R?
>>>>> Of course! Everything is easy to do in R. You should see my sweet R
>>>> toast 'n coffee maker ;-D
>>>>> But you should note that the fiveUTRByTranscript() function is
>>>> transcript based (obvs), and so you will have multiple transcripts per
>>>> gene. This makes things much more difficult, as a given methylation
>>>> site
>>>> may overlap multiple transcripts of the same gene. So that makes it
>>>> hard to
>>>> merge your original data with the overlapping transcripts.
>>>>> You could do something like
>>>>>
>>>>> ex<- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene,
>>>>> use.names
>>>> = TRUE)
>>>>> ex2<- do.call("rbind", sapply(ex[ex %in% x],
>>>>> elementMetadata))$exon_id
>>>>>
>>>>> then you could use unique Gene IDs thusly
>>>>>
>>>>> my.trans<- select(Mus.musculus, unique(as.character(ex2)),
>>>> c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID")
>>>>> That should at least give you a start. See where you can go on your
>>>>> own,
>>>> and let us know if you get stuck.
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>> I'm still really new to R and I lack many concepts you may consider
>>>> basic... I'm awfully sorry
>>>>>> Best
>>>>>>
>>>>>> JL
>>>>>>
>>>>>> 2012/11/8 James W. MacDonald<jmacdon at uw.edu<mailto:jmacdon at uw.edu>>
>>>>>>
>>>>>> Hi Jose,
>>>>>>
>>>>>>
>>>>>> On 11/8/2012 8:19 AM, José Luis Lavín wrote:
>>>>>>
>>>>>> Dear Bioconductor list,
>>>>>>
>>>>>> I write you this email asking for a Bioconductor module that
>>>>>> allows me to
>>>>>> annotate genomic coordinates and get different GeneIds.
>>>>>> I'll show you an example of what I'm referring to:
>>>>>>
>>>>>> I have this data:
>>>>>> Chromosome coordinate
>>>>>> chr17 31246506
>>>>>>
>>>>>>
>>>>>> It depends on what that coordinate is. Is it the start of a
>>>>>> transcript? A SNP? Do you really just have a single
>>>>>> coordinate, or
>>>>>> do you have a range? What species are we talking about here?
>>>>>>
>>>>>> Depending on what your data are, you might want to use either
>>>>>> one
>>>>>> of the TxDb packages, or a SNPlocs package. There really isn't
>>>>>> much to go on here. If I assume this is a coordinate that one
>>>>>> might think is within an exon, and if I further assume you are
>>>>>> working with H. sapiens, I could suggest something like
>>>>>>
>>>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>>> ex<- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
>>>>>>
>>>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>>>>>
>>>>>> ex[ex %in% x]
>>>>>>
>>>>>> or maybe more appropriately
>>>>>>
>>>>>> names(ex)[ex %in% x]
>>>>>>
>>>>>> which will give you the Gene ID, and you can go from there using
>>>>>> the org.Hs.eg.db package.
>>>>>>
>>>>>> If however, your coordinate isn't in an exon, but might be in a
>>>>>> UTR, you can look at ?exonsBy to see what other sequences you
>>>>>> can
>>>>>> extract to compare with.
>>>>>>
>>>>>> If these are SNPs, then you can look at the help pages for the
>>>>>> relevant SNPlocs package.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>>
>>>>>>
>>>>>> which can also be written this way by the program that
>>>>>> yielded
>>>>>> the result:
>>>>>> chr17.31246506
>>>>>>
>>>>>> And I need to convert this data into a gene name and known
>>>>>> gene Ids, such
>>>>>> as:
>>>>>>
>>>>>> Gene name Entrez_ID Ensembl_ID
>>>>>>
>>>>>> Tff3 NM_011575 20050
>>>>>> Can you please advice me about a module able to perform this
>>>>>> ID conversion
>>>>>> using a list of "chr17.31246506" type coordinates as input?
>>>>>>
>>>>>> Thanks in advance
>>>>>>
>>>>>> With best wishes
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>>
>>>>>> Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>>
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>> -- James W. MacDonald, M.S.
>>>>>> Biostatistician
>>>>>> University of Washington
>>>>>> Environmental and Occupational Health Sciences
>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>> Seattle WA 98105-6099
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> --
>>>>>> Dr. José Luis Lavín Trueba
>>>>>>
>>>>>> Dpto. de Producción Agraria
>>>>>> Grupo de Genética y Microbiología
>>>>>> Universidad Pública de Navarra
>>>>>> 31006 Pamplona
>>>>>> Navarra
>>>>>> SPAIN
>>>>> --
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> University of Washington
>>>>> Environmental and Occupational Health Sciences
>>>>> 4225 Roosevelt Way NE, # 100
>>>>> Seattle WA 98105-6099
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list