[BioC] How to annotate genomic coordinates
Valerie Obenchain
vobencha at fhcrc.org
Mon Nov 12 19:24:10 CET 2012
Hi Jose,
Here is a slightly different approach.
library(Mus.musculus)
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
## I assume you've figured out how to read your data into a GRanges.
## Here we use a small example.
gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20))
## The locateVariants() function in the VariantAnnotation package will
## give you the GENEIDs for all ranges in 'query'. If the range does not
## fall in a gene (i.e., it is intergenic), then the PRECEDEID and
## FOLLOWID are provided. The LOCATION column indicates what
## region the range fell in. See ?locateVariants for details on the
## different regions and the ability to set 'upstream' and 'downstream'
## values for promoter regions.
library(VariantAnnotation)
loc <- locateVariants(query=gr, subject=txdb, region=AllVariants())
> loc
GRanges with 1 range and 7 metadata columns:
seqnames ranges strand | LOCATION QUERYID TXID
<Rle> <IRanges> <Rle> | <factor> <integer> <integer>
[1] chr17 [31245606, 31245625] * | coding 1 47716
CDSID GENEID PRECEDEID FOLLOWID
<integer> <character> <character> <character>
[1] 184246 11307 <NA> <NA>
## Use the select() function to map these GENEID's to the other
## values you are interested in. The GENEID's from locateVariants()
## are Entrez ID's so we use those as our 'keytype'.
entrezid <- loc$GENEID
select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
cols=c("GENENAME", "ENSEMBL"))
> select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
cols=c("GENENAME", "ENSEMBL"))
ENTREZID ENSEMBL
1 11307 ENSMUSG00000024030
GENENAME
1 ATP-binding cassette, sub-family G (WHITE), member 1
Valerie
On 11/12/12 06:59, José Luis Lavín wrote:
> Hello all,
>
> First of all I want to thank everybody that gave me advice on this subject.
> trying to follow the advice, I added some modifications mixing codes from
> Tim, Harris and James , but it seems I got lost somewhere in between...
> I'm sorry for bothering you all again, but I'm afraid I need some more help.
>
> I need to read my ids.txt file and then iterate use each line of id
> (chr.position) to perform the annotation process with it. I thought of a
> "for" loop to achieve it, but I do not seem to catch the essence of R
> processes and I guess I miss my tryout.
> Please have a look at my "disaster" and help me with it If such thing is
> possible...
>
> biocLite('Mus.musculus')
> require(Mus.musculus)
> require(TxDb.Mmusculus.UCSC.mm9.knownGene)
> txdb<- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, 'gene')
> egid<- names(txdb)
> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA))
> length(name) == length(egid) ## TRUE, OK to use
> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA))
> length(esbl) == length(egid) ## FALSE, do not use
>
> #read input table
> coordTable<- read.delim(/path_to_dir/ids.txt, sep=".", header=FALSE, as.is
> =TRUE)
>
> for(i in 1:length(coordTable))
> {
> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1),
> strand='*')
> }
>
> genome(probes)<- 'mm9' ## prevents some stoopid mistakes
>
> geneBodyProbes<- findOverlaps(probes, txdb)
> geneBodyProbes
>
> write.table
> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
>
> ## Hits of length 1
> ## queryLength: 1
> ## subjectLength: 21761
> ## queryHits subjectHits
> ##<integer> <integer>
> ## 1 1 1641
> ##
>
> name[subjectHits(geneBodyProbes)]
>
> ## 11307 # egid
> ## "Abcg1" # name
> ##
>
> org.Mm.egCHRLOC[['11307']]
>
> ## 17
> ## 31057694
> ##
>
> org.Mm.egENSEMBL[['11307']]
>
> ## [1] "ENSMUSG00000024030"
>
> promotersByGene<- flank(txdb, 1500) # or however many bases you want
> promotersByGene<- flank(promotersByGene, 200, FALSE) # a little extra
> promoterProbes<- findOverlaps(probes, promotersByGene)
> promoterProbes
>
> write.table
> (promoterProbes,file="/path_to_dir/promo_trans_id.tsv",quote=FALSE,sep="\t")
>
> ## Hits of length 0
> ## queryLength: 1
> ## subjectLength: 21761
> ##
> ## read the GRanges and GenomicFeatures vignettes for more
>
> write.table
> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
>
>
> *Thanks in advance*
>
> JL
>
> 2012/11/8 Harris A. Jaffee<hj at jhu.edu>
>
>> On the elementary end of all this...
>>
>> If the sites are on a *file*, one per line, you could do
>>
>> y<- read.delim(filename, sep=".", header=FALSE, as.is=TRUE)
>>
>> etc.
>>
>> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote:
>>
>>> Hi Jose,
>>>
>>> On 11/8/2012 10:28 AM, José Luis Lavín wrote:
>>>> Dear James,
>>>>
>>>> To answer your questions swiftly, the features are methylation sites
>> (that's why I only have a coordinate instead of having a Start/End pair) in
>> mouse (mm9) genome and I have a list of those in the following format:
>>>> chr17.31246506
>>>>
>>>> How could I read the list so that I can input the "chr" and the
>> "coordinate" parameters into the expression you recommended?
>>> First you need to split your data to chr and pos.
>>>
>>> chr.pos<- do.call("rbind", strsplit(<your vector of chr17.pos data>,
>> "\\."))
>>> Note that your vector of chr.pos data may have been in as a factor, so
>> you may need to wrap with an as.character(). If you don't know what I mean
>> by that, you should first read 'An Introduction to R' (
>> http://cran.r-project.org/doc/manuals/R-intro.html). That will be time
>> well spent.
>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>> Both the seqnames and ranges argument to GRanges() can be based on
>> vectors. So if you had a matrix (called 'y') like
>>> chr16 3423543
>>> chr3 403992
>>> chr18 3494202
>>>
>>> then you can do
>>>
>>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1))
>>>
>>> see ?GRanges for more info.
>>>
>>> As a side note, IIRC, methylation sites are not in general found in
>> exons, but are more likely to be found upstream of a given gene. You might
>> then want to use fiveUTRsByTranscript() rather than exonsBy(). See ?exonsBy
>> after loading the GenomicFeatures package.
>>>> I 'm lookin forward to obtain a table where my "coordinate-based Id"
>> appears in a column, and the gene_name and if possible, the Entrez/Ensembl
>> Ids in two other columns :
>>>> Coordinate Gene_name Entrez_ID Ensembl_ID
>>>>
>>>> Is it easy to do this in R?
>>> Of course! Everything is easy to do in R. You should see my sweet R
>> toast 'n coffee maker ;-D
>>> But you should note that the fiveUTRByTranscript() function is
>> transcript based (obvs), and so you will have multiple transcripts per
>> gene. This makes things much more difficult, as a given methylation site
>> may overlap multiple transcripts of the same gene. So that makes it hard to
>> merge your original data with the overlapping transcripts.
>>> You could do something like
>>>
>>> ex<- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, use.names
>> = TRUE)
>>> ex2<- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id
>>>
>>> then you could use unique Gene IDs thusly
>>>
>>> my.trans<- select(Mus.musculus, unique(as.character(ex2)),
>> c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID")
>>> That should at least give you a start. See where you can go on your own,
>> and let us know if you get stuck.
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>>> I'm still really new to R and I lack many concepts you may consider
>> basic... I'm awfully sorry
>>>> Best
>>>>
>>>> JL
>>>>
>>>> 2012/11/8 James W. MacDonald<jmacdon at uw.edu<mailto:jmacdon at uw.edu>>
>>>>
>>>> Hi Jose,
>>>>
>>>>
>>>> On 11/8/2012 8:19 AM, José Luis Lavín wrote:
>>>>
>>>> Dear Bioconductor list,
>>>>
>>>> I write you this email asking for a Bioconductor module that
>>>> allows me to
>>>> annotate genomic coordinates and get different GeneIds.
>>>> I'll show you an example of what I'm referring to:
>>>>
>>>> I have this data:
>>>> Chromosome coordinate
>>>> chr17 31246506
>>>>
>>>>
>>>> It depends on what that coordinate is. Is it the start of a
>>>> transcript? A SNP? Do you really just have a single coordinate, or
>>>> do you have a range? What species are we talking about here?
>>>>
>>>> Depending on what your data are, you might want to use either one
>>>> of the TxDb packages, or a SNPlocs package. There really isn't
>>>> much to go on here. If I assume this is a coordinate that one
>>>> might think is within an exon, and if I further assume you are
>>>> working with H. sapiens, I could suggest something like
>>>>
>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>> ex<- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
>>>>
>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>>>
>>>> ex[ex %in% x]
>>>>
>>>> or maybe more appropriately
>>>>
>>>> names(ex)[ex %in% x]
>>>>
>>>> which will give you the Gene ID, and you can go from there using
>>>> the org.Hs.eg.db package.
>>>>
>>>> If however, your coordinate isn't in an exon, but might be in a
>>>> UTR, you can look at ?exonsBy to see what other sequences you can
>>>> extract to compare with.
>>>>
>>>> If these are SNPs, then you can look at the help pages for the
>>>> relevant SNPlocs package.
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>
>>>> which can also be written this way by the program that yielded
>>>> the result:
>>>> chr17.31246506
>>>>
>>>> And I need to convert this data into a gene name and known
>>>> gene Ids, such
>>>> as:
>>>>
>>>> Gene name Entrez_ID Ensembl_ID
>>>>
>>>> Tff3 NM_011575 20050
>>>> Can you please advice me about a module able to perform this
>>>> ID conversion
>>>> using a list of "chr17.31246506" type coordinates as input?
>>>>
>>>> Thanks in advance
>>>>
>>>> With best wishes
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>>
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>> -- James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> --
>>>> Dr. José Luis Lavín Trueba
>>>>
>>>> Dpto. de Producción Agraria
>>>> Grupo de Genética y Microbiología
>>>> Universidad Pública de Navarra
>>>> 31006 Pamplona
>>>> Navarra
>>>> SPAIN
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list