[Bioc-devel] Mistake in extractTranscriptSeqs() {from: GenomicFeatures} when transcripts are on minus strand
Hervé Pagès
hpages at fhcrc.org
Tue Aug 26 21:13:36 CEST 2014
Hi Kristoffer,
On 08/26/2014 07:40 AM, Kristoffer Vitting-Seerup wrote:
> The problem is best illustrated with an example (with comments in bold):
>
> # Load libraries:
> library('GenomicFeatures')
> library("BSgenome.Hsapiens.UCSC.hg19�)
>
> # Create two test transcripts, one on each strand
> testTranscriptPlus <- GRanges(seqnames='chr1', ranges=IRanges(start=1e5+c(1,11), end=1e5+c(3,13)), strand='+')
> testTranscriptMinus <- GRanges(seqnames='chr1', ranges=IRanges(start=1e5+c(1,11), end=1e5+c(3,13)), strand='-�)
>
> # Use getSeq() to extract sequence
>
> getSeq(Hsapiens, testTranscriptPlus)
> A DNAStringSet instance of length 2
> width seq
> [1] 3 ACT
> [2] 3 CAG
>
>> testTranscriptMinus <- GRanges(seqnames='chr1', ranges=IRanges(start=1e5+c(1,11), end=1e5+c(3,13)), strand='-')
>> getSeq(Hsapiens, testTranscriptMinus)
> A DNAStringSet instance of length 2
> width seq
> [1] 3 AGT
> [2] 3 CTG
>
> # By comparing the plus and minus transcripts it can be seen that getSeq returns the sequences in the 5� to 3� orientation no matter the strand (very nice feature by the way)
> # Furthermore we would expect the transcript sequence of the testTranscriptMinus to be: CTGAGT (since we are on the minus strand).
Nope. You should expect the transcript sequence to be AGTCTG. Because
the convention we use is that exons are ranked according to their
position (i.e. index) in the GRanges object, and this *independently*
of the strand. So the 1st element in the GRanges object is the 1st
exon, the 2nd element is the 2nd exon, etc...
>
> # Now lets extract the sequences of the minus strand transcript using the extractTranscriptSeqs() function
> extractTranscriptSeqs( Hsapiens, split( testTranscriptMinus, f=c('test','test')))
> A DNAStringSet instance of length 1
> width seq names
> [1] 6 AGTCTG test
>
> # From which it can be observed that extractTranscriptSeqs() have concatenated the exons without taking the order into account (which works fine on plus strand but results in non-exisiting transcript from the minus strand).
Yes, exon rank was taken into account. See above.
Note that extractTranscriptSeqs() is more often used on a GRangesList
object that represents a set of transcripts:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
'transcripts' is a named GRangesList object with one list element per
transcript. The names on 'transcripts' are the UCSC transcript ids.
Each list element in 'transcripts' is a small GRanges object that lists
the exons for a particular transcript. Let's look at 3 transcripts, 2
on the + strand, and 1 on the - strand:
> transcripts[4072:4074]
GRangesList of length 3:
$uc009xhd.3
GRanges with 4 ranges and 3 metadata columns:
seqnames ranges strand | exon_id exon_name
exon_rank
<Rle> <IRanges> <Rle> | <integer> <character>
<integer>
[1] chr1 [249200442, 249200541] + | 13958 <NA>
1
[2] chr1 [249208015, 249208078] + | 13959 <NA>
2
[3] chr1 [249208638, 249208758] + | 13960 <NA>
3
[4] chr1 [249210801, 249213345] + | 13961 <NA>
4
$uc021pmh.1
GRanges with 1 range and 3 metadata columns:
seqnames ranges strand | exon_id exon_name
exon_rank
[1] chr1 [249211537, 249212562] + | 13963 <NA>
1
$uc009vis.3
GRanges with 4 ranges and 3 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank
[1] chr1 [16607, 16765] - | 13970 <NA> 1
[2] chr1 [15796, 15942] - | 13968 <NA> 2
[3] chr1 [14970, 15038] - | 13966 <NA> 3
[4] chr1 [14362, 14829] - | 13964 <NA> 4
---
seqlengths:
chr1 chr2 ... chrUn_gl000249
249250621 243199373 ... 38502
As you can see, exons are always listed in the order corresponding to
their rank, independently of the strand. This means that, for a normal
transcript on the minus strand (like transcript uc009vis.3), exons
will usually be sorted by descending coordinates. But for some exotic
transcripts (like the very unrealistic one you crafted in
'testTranscriptMinus') this might not be the case.
Using getSeq() on transcript uc009vis.3:
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
exon_seqs <- getSeq(genome, transcripts$uc009vis.3)
Then:
> exon_seqs
A DNAStringSet instance of length 4
width seq
[1] 159
ACCTACAAGATGGGGTACTAACACCACCCCCACC...AGGACGACAGCAGCAGCAGCGCGTCTCCTTCAG
[2] 147
GGAGCTCCCAGGGAAGTGGTTGACCCCTCCGGTG...TGGAGAAGCAGCAGCAGAAGGAGCAGGAGCAAG
[3] 69
TGAGAGCCACGAGCCAAGGTGGGCACTTGATGTCGGATCTCTTCAACAAGCTGGTCATGAGGCGCAAGG
[4] 468
GCATCTCTGGGAAAGGACCTGGGGCTGGTGAGGG...TGCTTTTAATAAAGGATCTCTAGCTGTGCAGGA
Now with extractTranscriptSeqs():
> tx_seq <- extractTranscriptSeqs(genome, transcripts["uc009vis.3"])
> tx_seq
A DNAStringSet instance of length 1
width seq names
[1] 843 ACCTACAAGATGGGGTACTAACA...AAAGGATCTCTAGCTGTGCAGGA uc009vis.3
'tx_seq' contains the 4 sequences returned by getSeq() and concatenated:
> tx_seq == unlist(exon_seqs)
[1] TRUE
Using UCSC online query tools will give you the same sequence:
http://genome.ucsc.edu/cgi-bin/hgGene?db=hg19&hgg_gene=uc009vis.3
(Click on Genomic Sequence, then check 5' UTR Exons, CDS Exons, and
3' UTR Exons. No introns, no promoter/upstream, no downstream
sequences.)
Hope this helps clarifying things.
H.
>
>> sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] grDevices datasets grid parallel stats graphics utils methods base
>
> other attached packages:
> [1] GenomicFeatures_1.17.14 AnnotationDbi_1.27.9 Biobase_2.25.0 BSgenome.Hsapiens.UCSC.hg19_1.3.99
> [5] BSgenome_1.33.8 Biostrings_2.33.13 XVector_0.5.7 spliceR_1.7.1
> [9] plyr_1.8.1 RColorBrewer_1.0-5 VennDiagram_1.6.7 cummeRbund_2.7.2
> [13] Gviz_1.9.11 rtracklayer_1.25.13 GenomicRanges_1.17.28 GenomeInfoDb_1.1.18
> [17] IRanges_1.99.24 S4Vectors_0.1.2 fastcluster_1.1.13 reshape2_1.4
> [21] ggplot2_1.0.0 RSQLite_0.11.4 DBI_0.2-7 BiocGenerics_0.11.4
>
>
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list