[Bioc-devel] Mistake in extractTranscriptSeqs() {from: GenomicFeatures} when transcripts are on minus strand

Tue Aug 26 21:13:36 CEST 2014

Hi Kristoffer,

On 08/26/2014 07:40 AM, Kristoffer Vitting-Seerup wrote:
> The problem is best illustrated with an example (with comments in bold):
>
> # Load libraries:
> library('GenomicFeatures')
> library("BSgenome.Hsapiens.UCSC.hg19�)
>
> # Create two test transcripts, one on each strand
> testTranscriptPlus <- GRanges(seqnames='chr1', ranges=IRanges(start=1e5+c(1,11), end=1e5+c(3,13)), strand='+')
> testTranscriptMinus <- GRanges(seqnames='chr1', ranges=IRanges(start=1e5+c(1,11), end=1e5+c(3,13)), strand='-�)
>
> # Use getSeq() to extract sequence
>
> getSeq(Hsapiens, testTranscriptPlus)
>    A DNAStringSet instance of length 2
>      width seq
> [1]     3 ACT
> [2]     3 CAG
>
>> testTranscriptMinus <- GRanges(seqnames='chr1', ranges=IRanges(start=1e5+c(1,11), end=1e5+c(3,13)), strand='-')
>> getSeq(Hsapiens, testTranscriptMinus)
>    A DNAStringSet instance of length 2
>      width seq
> [1]     3 AGT
> [2]     3 CTG
>
> # By comparing the plus and minus transcripts it can be seen that getSeq returns the sequences in the 5� to 3� orientation no matter the strand (very nice feature by the way)
> # Furthermore we would expect the transcript sequence of the testTranscriptMinus to be: CTGAGT (since we are on the minus strand).

Nope. You should expect the transcript sequence to be AGTCTG. Because
the convention we use is that exons are ranked according to their
position (i.e. index) in the GRanges object, and this *independently*
of the strand. So the 1st element in the GRanges object is the 1st
exon, the 2nd element is the 2nd exon, etc...

>
> # Now lets extract the sequences of the minus strand transcript using the extractTranscriptSeqs() function
> extractTranscriptSeqs( Hsapiens, split( testTranscriptMinus, f=c('test','test')))
>    A DNAStringSet instance of length 1
>      width seq                                                                                                                             names
> [1]     6 AGTCTG                                                                                                                          test
>
> # From which it can be observed that extractTranscriptSeqs() have concatenated the exons without taking the order into account (which works fine on plus strand but results in non-exisiting transcript from the minus strand).

Yes, exon rank was taken into account. See above.

Note that extractTranscriptSeqs() is more often used on a GRangesList
object that represents a set of transcripts:

   library(TxDb.Hsapiens.UCSC.hg19.knownGene)
   txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
   transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)

'transcripts' is a named GRangesList object with one list element per
transcript. The names on 'transcripts' are the UCSC transcript ids.
Each list element in 'transcripts' is a small GRanges object that lists
the exons for a particular transcript. Let's look at 3 transcripts, 2
on the + strand, and 1 on the - strand:

   > transcripts[4072:4074]
   GRangesList of length 3:
   $uc009xhd.3
   GRanges with 4 ranges and 3 metadata columns:
         seqnames                 ranges strand |   exon_id   exon_name 
exon_rank
            <Rle>              <IRanges>  <Rle> | <integer> <character> 
<integer>
     [1]     chr1 [249200442, 249200541]      + |     13958        <NA> 
         1
     [2]     chr1 [249208015, 249208078]      + |     13959        <NA> 
         2
     [3]     chr1 [249208638, 249208758]      + |     13960        <NA> 
         3
     [4]     chr1 [249210801, 249213345]      + |     13961        <NA> 
         4

   $uc021pmh.1
   GRanges with 1 range and 3 metadata columns:
         seqnames                 ranges strand | exon_id exon_name 
exon_rank
     [1]     chr1 [249211537, 249212562]      + |   13963      <NA> 
     1

   $uc009vis.3
   GRanges with 4 ranges and 3 metadata columns:
         seqnames         ranges strand | exon_id exon_name exon_rank
     [1]     chr1 [16607, 16765]      - |   13970      <NA>         1
     [2]     chr1 [15796, 15942]      - |   13968      <NA>         2
     [3]     chr1 [14970, 15038]      - |   13966      <NA>         3
     [4]     chr1 [14362, 14829]      - |   13964      <NA>         4

   ---
   seqlengths:
                     chr1                  chr2 ...        chrUn_gl000249
                249250621             243199373 ...                 38502

As you can see, exons are always listed in the order corresponding to
their rank, independently of the strand. This means that, for a normal
transcript on the minus strand (like transcript uc009vis.3), exons
will usually be sorted by descending coordinates. But for some exotic
transcripts (like the very unrealistic one you crafted in
'testTranscriptMinus') this might not be the case.

Using getSeq() on transcript uc009vis.3:

   library(BSgenome.Hsapiens.UCSC.hg19)
   genome <- BSgenome.Hsapiens.UCSC.hg19
   exon_seqs <- getSeq(genome, transcripts$uc009vis.3)

Then:

   > exon_seqs
     A DNAStringSet instance of length 4
       width seq
   [1]   159 
ACCTACAAGATGGGGTACTAACACCACCCCCACC...AGGACGACAGCAGCAGCAGCGCGTCTCCTTCAG
   [2]   147 
GGAGCTCCCAGGGAAGTGGTTGACCCCTCCGGTG...TGGAGAAGCAGCAGCAGAAGGAGCAGGAGCAAG
   [3]    69 
TGAGAGCCACGAGCCAAGGTGGGCACTTGATGTCGGATCTCTTCAACAAGCTGGTCATGAGGCGCAAGG
   [4]   468 
GCATCTCTGGGAAAGGACCTGGGGCTGGTGAGGG...TGCTTTTAATAAAGGATCTCTAGCTGTGCAGGA

Now with extractTranscriptSeqs():

   > tx_seq <- extractTranscriptSeqs(genome, transcripts["uc009vis.3"])
   > tx_seq
     A DNAStringSet instance of length 1
       width seq                                               names 

   [1]   843 ACCTACAAGATGGGGTACTAACA...AAAGGATCTCTAGCTGTGCAGGA uc009vis.3

'tx_seq' contains the 4 sequences returned by getSeq() and concatenated:

   > tx_seq == unlist(exon_seqs)
   [1] TRUE

Using UCSC online query tools will give you the same sequence:

   http://genome.ucsc.edu/cgi-bin/hgGene?db=hg19&hgg_gene=uc009vis.3

(Click on Genomic Sequence, then check 5' UTR Exons, CDS Exons, and
3' UTR Exons. No introns, no promoter/upstream, no downstream
sequences.)

Hope this helps clarifying things.

H.

>
>> sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] grDevices datasets  grid      parallel  stats     graphics  utils     methods   base
>
> other attached packages:
>   [1] GenomicFeatures_1.17.14            AnnotationDbi_1.27.9               Biobase_2.25.0                     BSgenome.Hsapiens.UCSC.hg19_1.3.99
>   [5] BSgenome_1.33.8                    Biostrings_2.33.13                 XVector_0.5.7                      spliceR_1.7.1
>   [9] plyr_1.8.1                         RColorBrewer_1.0-5                 VennDiagram_1.6.7                  cummeRbund_2.7.2
> [13] Gviz_1.9.11                        rtracklayer_1.25.13                GenomicRanges_1.17.28              GenomeInfoDb_1.1.18
> [17] IRanges_1.99.24                    S4Vectors_0.1.2                    fastcluster_1.1.13                 reshape2_1.4
> [21] ggplot2_1.0.0                      RSQLite_0.11.4                     DBI_0.2-7                          BiocGenerics_0.11.4
>
>
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319