[BioC] How to annotate genomic coordinates
Harris A. Jaffee
hj at jhu.edu
Thu Nov 8 19:49:54 CET 2012
On the elementary end of all this...
If the sites are on a *file*, one per line, you could do
y <- read.delim(filename, sep=".", header=FALSE, as.is=TRUE)
etc.
On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote:
> Hi Jose,
>
> On 11/8/2012 10:28 AM, José Luis Lavín wrote:
>> Dear James,
>>
>> To answer your questions swiftly, the features are methylation sites (that's why I only have a coordinate instead of having a Start/End pair) in mouse (mm9) genome and I have a list of those in the following format:
>>
>> chr17.31246506
>>
>> How could I read the list so that I can input the "chr" and the "coordinate" parameters into the expression you recommended?
>
> First you need to split your data to chr and pos.
>
> chr.pos <- do.call("rbind", strsplit(<your vector of chr17.pos data>, "\\."))
>
> Note that your vector of chr.pos data may have been in as a factor, so you may need to wrap with an as.character(). If you don't know what I mean by that, you should first read 'An Introduction to R' (http://cran.r-project.org/doc/manuals/R-intro.html). That will be time well spent.
>
>>
>> x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>
> Both the seqnames and ranges argument to GRanges() can be based on vectors. So if you had a matrix (called 'y') like
>
> chr16 3423543
> chr3 403992
> chr18 3494202
>
> then you can do
>
> x <- GRanges(y[,1], IRanges(start = y[,2], width = 1))
>
> see ?GRanges for more info.
>
> As a side note, IIRC, methylation sites are not in general found in exons, but are more likely to be found upstream of a given gene. You might then want to use fiveUTRsByTranscript() rather than exonsBy(). See ?exonsBy after loading the GenomicFeatures package.
>
>>
>> I 'm lookin forward to obtain a table where my "coordinate-based Id" appears in a column, and the gene_name and if possible, the Entrez/Ensembl Ids in two other columns :
>>
>> Coordinate Gene_name Entrez_ID Ensembl_ID
>>
>> Is it easy to do this in R?
>
> Of course! Everything is easy to do in R. You should see my sweet R toast 'n coffee maker ;-D
>
> But you should note that the fiveUTRByTranscript() function is transcript based (obvs), and so you will have multiple transcripts per gene. This makes things much more difficult, as a given methylation site may overlap multiple transcripts of the same gene. So that makes it hard to merge your original data with the overlapping transcripts.
>
> You could do something like
>
> ex <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, use.names = TRUE)
> ex2 <- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id
>
> then you could use unique Gene IDs thusly
>
> my.trans <- select(Mus.musculus, unique(as.character(ex2)), c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID")
>
> That should at least give you a start. See where you can go on your own, and let us know if you get stuck.
>
> Best,
>
> Jim
>
>
>
>>
>> I'm still really new to R and I lack many concepts you may consider basic... I'm awfully sorry
>>
>> Best
>>
>> JL
>>
>> 2012/11/8 James W. MacDonald <jmacdon at uw.edu <mailto:jmacdon at uw.edu>>
>>
>> Hi Jose,
>>
>>
>> On 11/8/2012 8:19 AM, José Luis Lavín wrote:
>>
>> Dear Bioconductor list,
>>
>> I write you this email asking for a Bioconductor module that
>> allows me to
>> annotate genomic coordinates and get different GeneIds.
>> I'll show you an example of what I'm referring to:
>>
>> I have this data:
>> Chromosome coordinate
>> chr17 31246506
>>
>>
>> It depends on what that coordinate is. Is it the start of a
>> transcript? A SNP? Do you really just have a single coordinate, or
>> do you have a range? What species are we talking about here?
>>
>> Depending on what your data are, you might want to use either one
>> of the TxDb packages, or a SNPlocs package. There really isn't
>> much to go on here. If I assume this is a coordinate that one
>> might think is within an exon, and if I further assume you are
>> working with H. sapiens, I could suggest something like
>>
>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>> ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
>>
>> x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>
>> ex[ex %in% x]
>>
>> or maybe more appropriately
>>
>> names(ex)[ex %in% x]
>>
>> which will give you the Gene ID, and you can go from there using
>> the org.Hs.eg.db package.
>>
>> If however, your coordinate isn't in an exon, but might be in a
>> UTR, you can look at ?exonsBy to see what other sequences you can
>> extract to compare with.
>>
>> If these are SNPs, then you can look at the help pages for the
>> relevant SNPlocs package.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> which can also be written this way by the program that yielded
>> the result:
>> chr17.31246506
>>
>> And I need to convert this data into a gene name and known
>> gene Ids, such
>> as:
>>
>> Gene name Entrez_ID Ensembl_ID
>>
>> Tff3 NM_011575 20050
>> Can you please advice me about a module able to perform this
>> ID conversion
>> using a list of "chr17.31246506" type coordinates as input?
>>
>> Thanks in advance
>>
>> With best wishes
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>> -- James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>>
>>
>>
>> --
>> --
>> Dr. José Luis Lavín Trueba
>>
>> Dpto. de Producción Agraria
>> Grupo de Genética y Microbiología
>> Universidad Pública de Navarra
>> 31006 Pamplona
>> Navarra
>> SPAIN
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list