[Bioc-sig-seq] strandedness bug in Rsamtools?

Vincent Carey stvjc at channing.harvard.edu
Sat Feb 6 19:58:14 CET 2010


recalling the doc in scanBam man page that

     seq: The query sequence, in the 5' to 3' orientation. If aligned
          to the minus strand, it is the reverse complement of the
          original sequence.

the 36mer given in the SAM file matches the reference sequence
directly, and the flag is 16.   i would assume that SAM is telling us
through the flag that the hit was to the minus strand, but is giving
us the reverse complement of the original query (?); scanBam is
telling us that it is a plus-strand hit (bug?) but giving reverse
complement of SAM's sequence (and thus the original query) in the
read.

"by indirections find directions out" -- lord polonius



On Sat, Feb 6, 2010 at 1:23 PM, Vincent Carey
<stvjc at channing.harvard.edu> wrote:
> using a bam file in leeBamSet/inst/bam (in svn experimentData, not yet released)
>
>> wt5 = scanBam("isowt5_13e.bam")
>> p1 = wt5[[1]]$pos[1]
>> p1
> [1] 799975
>> str1 = wt5[[1]]$strand[1]
>> str1
> [1] +
> Levels: + - *
>> rd1 = wt5[[1]]$seq[1]
>> rd1
>  A DNAStringSet instance of length 1
>    width seq
> [1]    36 GGAGGATAGAGTGGTAACTAAAAAGACTTGGATAAC
>> rd1 = wt5[[1]]$seq[[1]]
>> rd1
>  36-letter "DNAString" instance
> seq: GGAGGATAGAGTGGTAACTAAAAAGACTTGGATAAC
>> library(BSgenome.Scerevisiae.UCSC.sacCer2)
>> sc13 = Scerevisiae$chrXIII
>> matchPattern(rd1, sc13)
>  Views on a 924429-letter DNAString subject
> subject: CCACACACACACCACACCCACACCACACCCACAC...GTGGGTGTGGTGTGGGTGTGGTGTGTGTGTGGGG
> views: NONE
>> matchPattern(reverseComplement(rd1), sc13)
>  Views on a 924429-letter DNAString subject
> subject: CCACACACACACCACACCCACACCACACCCACAC...GTGGGTGTGGTGTGGGTGTGGTGTGTGTGTGGGG
> views:
>     start    end width
> [1] 799975 800010    36 [GTTATCCAAGTCTTTTTAGTTACCACTCTATCCTCC]
>
> So although the read is marked + with scanBam and has  flag 16
>
> bash-3.2$ samtools view isowt5*bam | head -2
> GA-EAS46_1_209DH:5:14:206:926   16      Scchr13 799975  25      36M
>  *       0       0       GTTATCCAAGTCTTTTTAGTTACCACTCTATCCTCC
> JONSHWfURPJUgXcadYKhfUhh^hhhhhhhh^hh    NM:i:0  X0:i:1  MD:Z:36
>
> i need to take reverseComplement to match to the reference
>
> the locale is C
>
>> sessionInfo()
> R version 2.11.0 Under development (unstable) (2010-01-07 r50940)
> i386-apple-darwin9.8.0
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  tools     utils     methods
> [8] base
>
> other attached packages:
> [1] BSgenome.Scerevisiae.UCSC.sacCer2_1.3.16
> [2] Rsamtools_0.1.29
> [3] BSgenome_1.15.4
> [4] Biostrings_2.15.18
> [5] IRanges_1.5.39
> [6] weaver_1.13.0
> [7] codetools_0.2-2
> [8] digest_0.4.1
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.7.0
>
>
>
> On Sat, Feb 6, 2010 at 11:40 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 02/06/2010 07:29 AM, Steve Lianoglou wrote:
>>> Hi,
>>>
>>> On Sat, Feb 6, 2010 at 4:17 AM, Steve Lianoglou
>>> <mailinglist.honeypot at gmail.com> wrote:
>>>
>>>> I think there might be a bug in Rsamtools. I have reason to believe
>>>> that the strand information that's being returned from
>>>> scanBam(...)$strand is flipped, ie. a read that should be on the '+'
>>>> strand is being reported as aligning to the '-' strand.
>>>
>>> I just wanted to include an example of what I'm seeing (sorry I didn't
>>> before, the need for shut-eye was competing with the desire to write a
>>> better message :-)
>>>
>>> Anyway, here's what I get from my bamfile:
>>>
>>> $ samtools view my-sorted.bam chr2:2355-2360
>>> 1805160 16      chr2    2355    255     20M     *       0       0
>>>  GTCTTACTTTGATTATGATC    IIIIIIIIIIIIIIIIIIII    XA:i:0  MD:Z:20
>>> NM:i:0
>>> 4681745 16      chr2    2355    255     20M     *       0       0
>>>  GTCTTACTTTGATTATGATC    IIIIIIIIIIIIIIIIIIII    XA:i:0  MD:Z:20
>>> NM:i:0
>>>
>>> I believe a value of 16 in the FLAG field indicates the negative strand ...
>>>
>>> Now in R:
>>>
>>> R> bf <- 'my-sorted.bam
>>> R> r <- scanBam(bf,
>>> param=ScanBamParam(which=RangesList(chr2=IRanges(start=2355,
>>> end=2360))))[[1]]
>>> R> r$pos
>>> [1] 2355 2355
>>> R> r$strand
>>> [1] + +
>>> Levels: + - *
>>
>> I notice that your levels are different from mine; what does
>> levels(strand()) say? what's your locale? (sesssionInfo() ;) --
>> Rsamtools and strand() in general was trying to be locale-independent,
>> but this seems not to be the case...
>>
>> The workaround is likely to work in a "C" locale, ?Sys.setlocale
>>
>> Martin
>>
>>> If I'm still thinking about this straight, and it's a bug, maybe for
>>> now I'll just swap the levels of the r$strand to be c('-', '+', '*')
>>> when I get my reads back from scanBam.
>>>
>>> Thanks,
>>> -steve
>>>
>>
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>



More information about the Bioc-sig-sequencing mailing list