[BioC] Mapping genomic coordinates to transcript coordinates? (revived)
Pages, Herve
hpages at fhcrc.org
Thu Mar 17 02:13:21 CET 2011
Hi Michael,
The naming of the function and its arguments makes it sound restricted to
transcripts and exons, but, currently, the function can use any set of
compound ranges. For example, if those ranges are in a GRangesList
object 'grl', then use 'start(grl)' for the exons starts and 'end(grl)'
for the exon ends. One restriction is that only one strand value can
be assigned for any given group of ranges (i.e. any top-level element in
the GRangesList). Assuming 'grl' satisfies this (and I guess that would
be the case for spliced reads), then you can extract the value to pass
to the strand argument with something like
sapply(grl, function(gr) strand(gr)[1L])
(there are more efficient ways to do this)
As I said previously transcriptLocs2refLocs() is low-level and would need
to be improved or wrapped into some higher-level tool (generic function?)
that would work on GRangesList, GappedAlignments, and maybe other high-level
representations of sets of compound ranges.
Cheers,
H.
----- Original Message -----
From: "Michael Lawrence" <lawrence.michael at gene.com>
To: "Herve Pages" <hpages at fhcrc.org>
Cc: "Chris Fields" <cjfields at illinois.edu>, bioconductor at r-project.org
Sent: Wednesday, March 16, 2011 4:16:56 PM
Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates? (revived)
On Wed, Mar 2, 2011 at 11:58 PM, Pages, Herve < hpages at fhcrc.org > wrote:
Hi Chris, Malcolm,
There is the transcriptLocs2refLocs() function in Biostrings that
does the reverse mapping i.e. it maps transcript coordinates to
genomic coordinates. There is no doubt that the GenomicFeatures
package would be a better place for this function so we should move
it there.
Here is how it works:
Let's say we have 3 transcripts, 2 on the + strand and 1 on the -
strand, with the following exons (1-based genomic coordinates):
exonStarts <- list(
c(2401L, 3922L, 4806L),
c(2401L, 4806L),
c(6291L, 5278L)
)
exonEnds <- list(
c(3344L, 4421L, 5200L),
c(3344L, 5200L),
c(6500L, 5899L)
)
strand <- c("+", "+", "-")
The lengths of the transcripts are:
> sapply(1:3, function(i) sum(exonEnds[[i]] - exonStarts[[i]] + 1L))
[1] 1839 1339 832
Let's say we have positions on each transcript that we want to
map to the genome. Those positions are stored in a list of 3
integer vectors:
tlocs <- list(
c(1L, 200L, 1000L, 1839L),
940:950,
c(1L, 832L)
)
All those positions can be converted into genomic coordinates with
transcriptLocs2refLocs():
This is a cool function but can't it be generalized for any compound set of ranges, like any GRangesList? For example, one could want to translate read-relative positions in a spliced read alignment to global positions, or the other way around. Especially the other way around. Returning an IntegerList would be one way to handle multiple possibilities.
> library(Biostrings)
> transcriptLocs2refLocs(tlocs, exonStarts, exonEnds, strand)
[[1]]
[1] 2401 2600 3977 5200
[[2]]
[1] 3340 3341 3342 3343 3344 4806 4807 4808 4809 4810 4811
[[3]]
[1] 6500 5278
It's vectorized and fast (implemented in C).
Unfortunately we don't have a refLocs2transcriptLocs() function at
the moment for going the other way around but, yes, that's something
we should definitely have. When called on the previous result and with
the same 'exonStarts', 'exonEnds' and 'strand' values, it should return
the original 'tlocs'.
There would be 2 complications for such a refLocs2transcriptLocs though:
1. If the genomic location doesn't hit the transcript. Not a big deal,
NA could be used for this.
2. Sometimes (very rarely) the genomic location hits an ambiguous
location on the transcript (e.g. for a small number of transcripts
in UCSC knownGene track, some exons overlap). What to do then?
Also those 2 functions should really be in GenomicFeatures, not
in Biostrings, and their interface should be modernized to accept
a GRangesList object instead of exonStarts, exonEnds and strand
(the transcriptLocs2refLocs() function predates the GenomicRanges
era).
Here in Seattle we didn't work on this yet because of lack of time
and also because there was apparently no demand for it so far. For
now, I'm just going to move transcriptLocs2refLocs() to GenomicFeatures
so it's more visible and it will also make it easier for someone
interested to contribute.
H.
----- Original Message -----
From: "Chris Fields" < cjfields at illinois.edu >
To: "Malcolm Cook" < MEC at stowers.org >
Cc: " lawrence.michael at gene.com " < lawrence.michael at gene.com >, " amackey at virginia.edu " < amackey at virginia.edu >, " bioconductor at r-project.org " < bioconductor at r-project.org >
Sent: Friday, February 25, 2011 7:45:11 AM
Subject: Re: [BioC] Mapping genomic coordinates to transcript coordinates? (revived)
+1 (from the bioperl end). Would be nice to have a straightforward BioC-based way of doing this. Aaron's suggestion of a similar (maybe simpler?) API to Bio::Coordinate::GeneMapper is good (though as pointed out in the thread it is a bit complex).
chris
On Feb 25, 2011, at 9:25 AM, Cook, Malcolm wrote:
> Hiya,
>
> I am reviving this thread
>
> https://stat.ethz.ch/pipermail/bioconductor/2010-November/036661.html
>
> and am poised to follow the lead proposed by Marc Carlson
>
> but before I do..... has anyone beaten me to it?
>
> And/or does anyone have a suggestion as to how to best contribute such an effort back ....
>
> ... as an addition to GenomicFeatures perhaps
>
> Best,
>
> Malcolm Cook
> Stowers Institute for Medical Research - Bioinformatics
> Kansas City, Missouri USA
>
>
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list