[BioC] rtracklayer and UCSC
Keith Satterley
keith at wehi.EDU.AU
Fri May 15 02:23:42 CEST 2009
My understanding of UCSC co-ordinates is, as Sean says, that starts are zero
based and ends are one based. However, I have stopped using the words "start"
and "end" with UCSC co-ordinates; I believe it would be better to use "left"
and "right".
The UCSC data definitions of their annotation files, see:
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.sql
use txStart/txEnd, cdsStart/cdsEnd, exonStarts/exonEnds. However these
co-ordinates are only start and end co-ordinates for positive strand genes. They
are end and start co-ordinates for negative strand genes, assuming that start
means the 5 prime end of a gene.
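
To make the strand point concrete, here is a minimal sketch in Python (not
rtracklayer code). The function name is hypothetical, and it assumes txStart
is zero based and txEnd is one based, as in the downloaded refGene table:

```python
def five_prime_one_based(tx_start, tx_end, strand):
    """Return the one-based position of the 5' end of a transcript.

    Assumes UCSC table conventions: tx_start (the left end) is zero based,
    tx_end (the right end) is one based. Hypothetical helper for illustration.
    """
    if strand == "+":
        # Positive strand: the 5' end is the left end; shift it to one based.
        return tx_start + 1
    if strand == "-":
        # Negative strand: the 5' end is the right end, already one based.
        return tx_end
    raise ValueError("strand must be '+' or '-'")


print(five_prime_one_based(100, 200, "+"))
print(five_prime_one_based(100, 200, "-"))
```

So txStart is only the biological start for "+" genes; for "-" genes the
biological start is txEnd.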
I think it is more accurate to say that LEFT end UCSC co-ordinates are zero
based and RIGHT end UCSC co-ordinates are one based.
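
If that reading is right, converting the exonStarts/exonEnds columns to
one-based, closed intervals only needs a +1 on the left ends. A rough sketch
in Python, assuming the fields are comma separated strings that may carry a
trailing comma; the function names are made up for illustration:

```python
def parse_ucsc_list(field):
    """Split a comma separated UCSC field such as '100,300,' into integers."""
    # Skip empty pieces so a trailing comma does not produce a bad entry.
    return [int(piece) for piece in field.split(",") if piece]


def exons_one_based(exon_starts, exon_ends):
    """Convert zero-based left / one-based right exon co-ordinates to
    one-based closed (left, right) pairs."""
    lefts = parse_ucsc_list(exon_starts)
    rights = parse_ucsc_list(exon_ends)
    if len(lefts) != len(rights):
        raise ValueError("exonStarts and exonEnds have different lengths")
    # Only the left ends move; the right ends are already one based.
    return [(left + 1, right) for left, right in zip(lefts, rights)]


print(exons_one_based("100,300,", "200,400,"))
```

The exon length is unchanged by the conversion: 200 - 100 in the table form
equals 200 - 101 + 1 in the one-based form.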
However, note that whenever UCSC displays co-ordinates to GUI users, it
adjusts left end co-ordinates back to being one based. If I remember correctly,
if you use the DNA option in the UCSC browser to get DNA bases, the co-ordinates
are all still one based. But, as stated, if you download the annotation files,
such as refGene.txt, from the above link, the left co-ordinates are zero based.
I don't know how rtracklayer handles this issue.
cheers,
Keith
Sean Davis wrote:
> On Thu, May 14, 2009 at 7:29 PM, Kasper Daniel Hansen <
> khansen at stat.berkeley.edu> wrote:
>
>> As far as I know UCSC uses zero-based indexing of their genomes, R uses
>> 1-based. What kind of conversion is being used by rtracklayer - I suspect
>> none at all? It might be worthwhile to add a discussion about this somewhere
>> in the vignette?
>
>
> It is even slightly more complicated than that. They use zero-based starts
> and 1-based ends, except for graphical display:
>
> http://genome.ucsc.edu/FAQ/FAQtracks#tracks1
>
> Sean
>
>
>>
>> More specifically, I have downloaded a couple of tables from UCSC using
>> rtracklayer and I wanted to know if I need to add 1 to the column named
>> exonStart (after a suitable splitting - it is a comma separated character
>> list).
>>
>> Kasper
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>