[BioC] How to annotate genomic coordinates
Valerie Obenchain
vobencha at fhcrc.org
Wed Nov 14 19:37:11 CET 2012
On 11/14/2012 07:51 AM, José Luis Lavín wrote:
> Dear Valerie,
>
> *I've tried your last approach in the following way:*
>
> library(Mus.musculus)
> library(TxDb.Mmusculus.UCSC.mm9.knownGene)
> library(VariantAnnotation)
>
> txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
>
> y <- read.delim("/path_to_file/ids_noM.txt", sep=".", header=FALSE,
> as.is <http://as.is>=TRUE)
> gr<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width = 20))
>
> #gr <- GRanges(c("chr1", "chr17", "chr16"), IRanges(c(41245606,
> 31245606, 11248806), width = 20))
>
> loc <- locateVariants(query=gr, subject=txdb, region=AllVariants())
> head(loc)
>
> idx <- loc[ ,c("QUERYID", "GENEID")]
> idx <- idx[!duplicated(idx)] #no duplicar ids
> head(idx)
> geneid <- idx[!is.na <http://is.na>(idx$GENEID)] #Remove NA GENEID's and
> call select().
>
> genetable <- select(Mus.musculus, keytype="ENTREZID",
> keys=geneid$GENEID, cols=c("GENENAME", "ENSEMBL"))
> head(genetable)
>
> id <- paste(as.character(seqnames(geneid)), ".", start(geneid), sep="")
> data.frame(id, genetable)
>
> * and I get this error*:
>
> /Error in data.frame(id, genetable) :
> arguments imply differing number of rows: 180912, 12962/
The error means you have a different number of rows in 'id' and
'genetable'. Take a look at the ?select page and work through the
examples at the bottom. What does select() return? What happens if one
of the keys you give to select() doesn't have a match? What if there are
multiple matches?
If the results aren't one-to-one you will need to match the ENTREZID in
the select() result to the ENTREZID in the genetable.
Valerie
>
>
> It seems we are losing rows in the process,but I have no clue where or
> why this happens...
>
> Please forgive me for being still so unable to catch R file processing
> dynamics properly.
>
> Thank you again
>
>
>
>
> 2012/11/14 Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>>
>
> Hello,
>
> Are you familiar with the R help pages? For any function or class
> object you can type ? followed by the name.
>
> ?locateVariants
> ?select
>
> The page for locateVariants explains each of the the metadata
> columns in the result. One of the columns is 'QUERYID'. This id
> represents the row in your original query (i.e., the GRanges you
> entered as the 'query' argument). The page also explains that the
> result has one row for each range-transcript match.
>
> gr <- GRanges(c("chr1", "chr17", "chr16"), IRanges(c(41245606,
> 31245606, 11248806), width = 20))
>
> loc <- locateVariants(query=gr, subject=txdb, region=AllVariants())
>
> The first range hit an intergenic region and we therefore have
> PRECEDEID and FOLLOWID. The second range hit a coding region in a
> single transcript and the third range hit an intron in two different
> transcripts. Because the third range hit 2 transcripts we have 2
> rows of data, each has a unique TXID but the same GENEID.
>
> > loc
> GRanges with 4 ranges and 7 metadata columns:
>
> seqnames ranges strand | LOCATION QUERYID
> TXID
> <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
> [1] chr1 [41245606, 41245625] * | intergenic 1 <NA>
> [2] chr17 [31245606, 31245625] * | coding 2
> 47716
> [3] chr16 [11248806, 11248825] * | intron 3
> 46316
> [4] chr16 [11248806, 11248825] * | intron 3
> 46317
>
> CDSID GENEID PRECEDEID FOLLOWID
> <integer> <character> <character> <character>
> [1] <NA> <NA> 211798 381339
> [2] 184246 11307 <NA> <NA>
> [3] <NA> 14852 <NA> <NA>
> [4] <NA> 14852 <NA> <NA>
>
>
> Pull out the unique combinations of GENEID and QUERYID.
>
> idx <- loc[ ,c("QUERYID", "GENEID")]
> idx <- idx[!duplicated(idx)]
> > idx
> GRanges with 3 ranges and 2 metadata columns:
> seqnames ranges strand | QUERYID GENEID
> <Rle> <IRanges> <Rle> | <integer> <character>
> [1] chr1 [41245606, 41245625] * | 1 <NA>
> [2] chr17 [31245606, 31245625] * | 2 11307
> [3] chr16 [11248806, 11248825] * | 3 14852
>
> Remove NA GENEID's and call select().
> geneid <- idx[!is.na <http://is.na>(idx$GENEID)]
> genetable <- select(Mus.musculus, keytype="ENTREZID",
> keys=geneid$GENEID, cols=c("GENENAME", "ENSEMBL"))
> > genetable
>
> ENTREZID ENSEMBL
> 1 11307 ENSMUSG00000024030
> 2 14852 ENSMUSG00000062203
>
> GENENAME
> 1 ATP-binding cassette, sub-family G (WHITE), member 1
> 2 G1 to S phase transition 1
>
> You can use the QUERYID to map back to an original list of
> identifiers. If you just need 'chromosome.start' information you can
> extract that from the ranges in geneid. Note that the QUERYID maps
> back to the original 'query' but those same ranges are also included
> in the result.
> id <- paste(as.character(seqnames(__geneid)), ".", start(geneid),
> sep="")
> > data.frame(id, genetable)
> id ENTREZID ENSEMBL
> 1 chr17.31245606 11307 ENSMUSG00000024030
> 2 chr16.11248806 14852 ENSMUSG00000062203
>
> GENENAME
> 1 ATP-binding cassette, sub-family G (WHITE), member 1
> 2 G1 to S phase transition 1
>
>
> Valerie
>
>
>
> On 11/14/12 01:00, José Luis Lavín wrote:
>
> Dear Valerie,
>
> I have an issue with the results of your approach. Now I get a
> table with the following fields:
>
> ENTREZID ENSEMBL GENENAME
>
> 1 216285 ENSMUSG00000036602 ALX homeobox 1
>
>
> But the Identifier I converted to obtain the previously
> mentioned data is not present in that table (eg.
> chr10.100837476) , so there's nothing I can do with this results
> table because I can't correlate my previous chr.location type
> identifiers to the results of the table.
> I guess I've done something wrong somewhere in the code I
> modified so that I missed my "chr.coordinate" ids column. So
> I'll paste the code here again to see if any of you can tell me
> where I lost that column.
>
> source("http://bioconductor.__org/biocLite.R
> <http://bioconductor.org/biocLite.R>")
> biocLite('Mus.musculus')
> biocLite('TxDb.Mmusculus.UCSC.__mm9.knownGene')
>
> biocLite('VariantAnnotation')
> library(Mus.musculus)
> library(TxDb.Mmusculus.UCSC.__mm9.knownGene)
>
> txdb <- TxDb.Mmusculus.UCSC.mm9.__knownGene
> y <- read.delim("/path_to_dir/ids___noM.txt", sep=".",
> header=FALSE, as.is <http://as.is> <http://as.is>=TRUE)
>
> probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2],
> width=1))
>
> library(VariantAnnotation)
> loc <- locateVariants(query=probes, subject=txdb,
> region=AllVariants())
> head(loc)
>
> entrezid <- loc$GENEID
> entrezid2<-select(Mus.__musculus, keytype="ENTREZID",
> keys=entrezid, cols=c("GENENAME", "ENSEMBL"))
>
> write.table(entrezid2, file = "Annotated_Ids.csv",quote = FALSE,
> sep = ",", eol = "\n")
>
> Thank you again for your advice
>
>
> 2012/11/14 Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>>>
>
>
> Hi all,
>
>
> On 11/12/2012 10:24 AM, Valerie Obenchain wrote:
>
> Hi Jose,
>
> Here is a slightly different approach.
>
> library(Mus.musculus)
> library(TxDb.Mmusculus.UCSC.__mm9.knownGene)
> txdb <- TxDb.Mmusculus.UCSC.mm9.__knownGene
>
> ## I assume you've figured out how to read your data into a
> GRanges.
> ## Here we use a small example.
> gr <- GRanges(seq = "chr17", IRanges(start = 31245606,
> width =
> 20))
>
> ## The locateVariants() function in the VariantAnnotation
> package will
> ## give you the GENEIDs for all ranges in 'query'. If the
> range does not
> ## fall in a gene (i.e., it is intergenic), then the
> PRECEDEID and
> ## FOLLOWID are provided. The LOCATION column indicates
> what
> ## region the range fell in. See ?locateVariants for
> details
> on the
> ## different regions and the ability to set 'upstream' and
> 'downstream'
> ## values for promoter regions.
> library(VariantAnnotation)
> loc <- locateVariants(query=gr, subject=txdb,
> region=AllVariants())
>
> > loc
> GRanges with 1 range and 7 metadata columns:
> seqnames ranges strand | LOCATION
> QUERYID TXID
> <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
> [1] chr17 [31245606, 31245625] * | coding
> 1 47716
> CDSID GENEID PRECEDEID FOLLOWID
> <integer> <character> <character> <character>
> [1] 184246 11307 <NA> <NA>
>
>
> Nice! So IIUC locateVariants() is not intrinsically for
> variants but
> seems to be generally useful for "locating" any set of genomic
> positions. Shouldn't we have this (or something similar) in
> GenomicFeatures (probably with a different name)?
>
> Cheers,
> H.
>
>
>
>
> ## Use the select() function to map these GENEID's to
> the other
> ## values you are interested in. The GENEID's from
> locateVariants()
> ## are Entrez ID's so we use those as our 'keytype'.
> entrezid <- loc$GENEID
> select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
> cols=c("GENENAME", "ENSEMBL"))
> > select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
> cols=c("GENENAME", "ENSEMBL"))
> ENTREZID ENSEMBL
> 1 11307 ENSMUSG00000024030
> GENENAME
> 1 ATP-binding cassette, sub-family G (WHITE), member 1
>
>
> Valerie
>
>
>
> On 11/12/12 06:59, José Luis Lavín wrote:
>
> Hello all,
>
> First of all I want to thank everybody that gave me
> advice
> on this
> subject.
> trying to follow the advice, I added some modifications
> mixing codes from
> Tim, Harris and James , but it seems I got lost
> somewhere
> in between...
> I'm sorry for bothering you all again, but I'm afraid I
> need some more
> help.
>
> I need to read my ids.txt file and then iterate use
> each
> line of id
> (chr.position) to perform the annotation process
> with it.
> I thought of a
> "for" loop to achieve it, but I do not seem to catch the
> essence of R
> processes and I guess I miss my tryout.
> Please have a look at my "disaster" and help me
> with it If
> such thing is
> possible...
>
> biocLite('Mus.musculus')
> require(Mus.musculus)
> require(TxDb.Mmusculus.UCSC.__mm9.knownGene)
> txdb<-
> transcriptsBy(TxDb.Mmusculus.__UCSC.mm9.knownGene,
> 'gene')
> egid<- names(txdb)
> name<- unlist(mget(egid, org.Mm.egSYMBOL,
> ifnotfound=NA))
> length(name) == length(egid) ## TRUE, OK to use
> esbl<- unlist(mget(egid, org.Mm.egENSEMBL,
> ifnotfound=NA))
> length(esbl) == length(egid) ## FALSE, do not use
>
> #read input table
> coordTable<- read.delim(/path_to_dir/ids.__txt,
> sep=".",
> header=FALSE,
> as.is <http://as.is> <http://as.is>
>
> =TRUE)
>
> for(i in 1:length(coordTable))
> {
> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2],
> width = 1),
> strand='*')
> }
>
> genome(probes)<- 'mm9' ## prevents some stoopid
> mistakes
>
> geneBodyProbes<- findOverlaps(probes, txdb)
> geneBodyProbes
>
> write.table
>
> (geneBodyProbes,file="/path___to_dir/trans_id.tsv",quote=__FALSE,sep="\t")
>
> ## Hits of length 1
> ## queryLength: 1
> ## subjectLength: 21761
> ## queryHits subjectHits
> ##<integer> <integer>
> ## 1 1 1641
> ##
>
> name[subjectHits(__geneBodyProbes)]
>
> ## 11307 # egid
> ## "Abcg1" # name
> ##
>
> org.Mm.egCHRLOC[['11307']]
>
> ## 17
> ## 31057694
> ##
>
> org.Mm.egENSEMBL[['11307']]
>
> ## [1] "ENSMUSG00000024030"
>
> promotersByGene<- flank(txdb, 1500) # or however many
> bases you want
> promotersByGene<- flank(promotersByGene, 200,
> FALSE) # a
> little extra
> promoterProbes<- findOverlaps(probes, promotersByGene)
> promoterProbes
>
> write.table
>
> (promoterProbes,file="/path___to_dir/promo_trans_id.tsv",__quote=FALSE,sep="\t")
>
>
> ## Hits of length 0
> ## queryLength: 1
> ## subjectLength: 21761
> ##
> ## read the GRanges and GenomicFeatures vignettes
> for more
>
> write.table
>
> (geneBodyProbes,file="/path___to_dir/trans_id.tsv",quote=__FALSE,sep="\t")
>
>
> *Thanks in advance*
>
> JL
>
> 2012/11/8 Harris A. Jaffee<hj at jhu.edu
> <mailto:hj at jhu.edu> <mailto:hj at jhu.edu <mailto:hj at jhu.edu>>>
>
>
> On the elementary end of all this...
>
> If the sites are on a *file*, one per line, you
> could do
>
> y<- read.delim(filename, sep=".",
> header=FALSE, as.is <http://as.is>
> <http://as.is>=TRUE)
>
>
> etc.
>
> On Nov 8, 2012, at 11:38 AM, James W. MacDonald
> wrote:
>
> Hi Jose,
>
> On 11/8/2012 10:28 AM, José Luis Lavín wrote:
>
> Dear James,
>
> To answer your questions swiftly, the
> features
> are methylation sites
>
> (that's why I only have a coordinate instead of
> having
> a Start/End
> pair) in
> mouse (mm9) genome and I have a list of those
> in the
> following format:
>
> chr17.31246506
>
> How could I read the list so that I can
> input
> the "chr" and the
>
> "coordinate" parameters into the expression you
> recommended?
>
> First you need to split your data to chr
> and pos.
>
> chr.pos<- do.call("rbind", strsplit(<your
> vector
> of chr17.pos data>,
>
> "\\."))
>
> Note that your vector of chr.pos data may have
> been in as a factor, so
>
> you may need to wrap with an as.character(). If you
> don't know what I
> mean
> by that, you should first read 'An Introduction
> to R' (
> http://cran.r-project.org/doc/__manuals/R-intro.html
> <http://cran.r-project.org/doc/manuals/R-intro.html>).
> That will be time
> well spent.
>
> x<- GRanges(seq = "chr17", IRanges(start =
> 31245606, width = 1))
>
> Both the seqnames and ranges argument to
> GRanges()
> can be based on
>
> vectors. So if you had a matrix (called 'y') like
>
> chr16 3423543
> chr3 403992
> chr18 3494202
>
> then you can do
>
> x<- GRanges(y[,1], IRanges(start = y[,2],
> width = 1))
>
> see ?GRanges for more info.
>
> As a side note, IIRC, methylation sites are
> not in
> general found in
>
> exons, but are more likely to be found upstream
> of a
> given gene. You
> might
> then want to use fiveUTRsByTranscript() rather than
> exonsBy(). See
> ?exonsBy
> after loading the GenomicFeatures package.
>
> I 'm lookin forward to obtain a table
> where my
> "coordinate-based Id"
>
> appears in a column, and the gene_name and if
> possible, the
> Entrez/Ensembl
> Ids in two other columns :
>
> Coordinate Gene_name Entrez_ID Ensembl_ID
>
> Is it easy to do this in R?
>
> Of course! Everything is easy to do in R. You
> should see my sweet R
>
> toast 'n coffee maker ;-D
>
> But you should note that the
> fiveUTRByTranscript()
> function is
>
> transcript based (obvs), and so you will have
> multiple
> transcripts per
> gene. This makes things much more difficult, as a
> given methylation site
> may overlap multiple transcripts of the same
> gene. So
> that makes it
> hard to
> merge your original data with the overlapping
> transcripts.
>
> You could do something like
>
> ex<-
>
> fiveUTRsByTranscript(TxDb.__Mmusculus.UCSC.mm10.knownGene,
> use.names
>
> = TRUE)
>
> ex2<- do.call("rbind", sapply(ex[ex %in% x],
> elementMetadata))$exon_id
>
> then you could use unique Gene IDs thusly
>
> my.trans<- select(Mus.musculus,
> unique(as.character(ex2)),
>
> c("CHR","CHRLOC","ENTREZID","__ENSEMBL"),
> "ENTREZID")
>
> That should at least give you a start. See
> where
> you can go on your
> own,
>
> and let us know if you get stuck.
>
> Best,
>
> Jim
>
>
>
> I'm still really new to R and I lack many
> concepts you may consider
>
> basic... I'm awfully sorry
>
> Best
>
> JL
>
> 2012/11/8 James W.
> MacDonald<jmacdon at uw.edu <mailto:jmacdon at uw.edu>
> <mailto:jmacdon at uw.edu
> <mailto:jmacdon at uw.edu>><__mailto:jmacdon at uw.edu
> <mailto:jmacdon at uw.edu>
>
> <mailto:jmacdon at uw.edu <mailto:jmacdon at uw.edu>>>>
>
> Hi Jose,
>
>
> On 11/8/2012 8:19 AM, José Luis
> Lavín wrote:
>
> Dear Bioconductor list,
>
> I write you this email asking for a
> Bioconductor module that
> allows me to
> annotate genomic coordinates
> and get
> different GeneIds.
> I'll show you an example of
> what I'm
> referring to:
>
> I have this data:
> Chromosome coordinate
> chr17 31246506
>
>
> It depends on what that coordinate
> is. Is
> it the start of a
> transcript? A SNP? Do you really
> just have
> a single coordinate, or
> do you have a range? What species
> are we
> talking about here?
>
> Depending on what your data are,
> you might
> want to use either one
> of the TxDb packages, or a SNPlocs
> package. There really isn't
> much to go on here. If I assume
> this is a
> coordinate that one
> might think is within an exon, and if I
> further assume you are
> working with H. sapiens, I could
> suggest
> something like
>
>
> library(TxDb.Hsapiens.UCSC.__hg19.knownGene)
> ex<-
>
> exonsBy(TxDb.Hsapiens.UCSC.__hg19.knownGene, "gene")
>
> x<- GRanges(seq = "chr17",
> IRanges(start =
> 31245606, width = 1))
>
> ex[ex %in% x]
>
> or maybe more appropriately
>
> names(ex)[ex %in% x]
>
> which will give you the Gene ID,
> and you
> can go from there using
> the org.Hs.eg.db package.
>
> If however, your coordinate isn't in an
> exon, but might be in a
> UTR, you can look at ?exonsBy to
> see what
> other sequences you can
> extract to compare with.
>
> If these are SNPs, then you can look at
> the help pages for the
> relevant SNPlocs package.
>
> Best,
>
> Jim
>
>
>
> which can also be written this
> way by
> the program that yielded
> the result:
> chr17.31246506
>
> And I need to convert this data
> into a
> gene name and known
> gene Ids, such
> as:
>
> Gene name Entrez_ID Ensembl_ID
>
> Tff3 NM_011575 20050
> Can you please advice me about a
> module able to perform this
> ID conversion
> using a list of
> "chr17.31246506" type
> coordinates as input?
>
> Thanks in advance
>
> With best wishes
>
>
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>><mailto:Bioconduct__or at r-project.org
> <mailto:Bioconductor at r-project.org>
>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
>
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
> -- James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational
> Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
>
>
>
> --
> --
> Dr. José Luis Lavín Trueba
>
> Dpto. de Producción Agraria
> Grupo de Genética y Microbiología
> Universidad Pública de Navarra
> 31006 Pamplona
> Navarra
> SPAIN
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
>
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
> -- Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
> <tel:%28206%29%20667-1319>
>
>
>
>
>
> --
> --
> Dr. José Luis Lavín Trueba
>
> Dpto. de Producción Agraria
> Grupo de Genética y Microbiología
> Universidad Pública de Navarra
> 31006 Pamplona
> Navarra
> SPAIN
>
>
>
>
>
> --
> --
> Dr. José Luis Lavín Trueba
>
> Dpto. de Producción Agraria
> Grupo de Genética y Microbiología
> Universidad Pública de Navarra
> 31006 Pamplona
> Navarra
> SPAIN
More information about the Bioconductor
mailing list