[Bioc-sig-seq] strandedness bug in Rsamtools?

Martin Morgan mtmorgan at fhcrc.org
Sat Feb 6 19:57:34 CET 2010


ok, I think it's this

------------------------------------------------------------------------
r44085 | p.aboyoun | 2010-01-19 15:16:56 -0800 (Tue, 19 Jan 2010) | 2 lines
Changed paths:
   M /trunk/madman/Rpacks/BSgenome/R/strand.R

Added support for Rle strand column in strand,DataTable method.
Also made "+" the first level of strand factor.
------------------------------------------------------------------------

which changed the order of strand levels. This will take a bit of
coordination on our part. Thanks for the reports.

Martin


On 02/06/2010 10:23 AM, Vincent Carey wrote:
> using a bam file in leeBamSet/inst/bam (in svn experimentData, not yet released)
> 
>> wt5 = scanBam("isowt5_13e.bam")
>> p1 = wt5[[1]]$pos[1]
>> p1
> [1] 799975
>> str1 = wt5[[1]]$strand[1]
>> str1
> [1] +
> Levels: + - *
>> rd1 = wt5[[1]]$seq[1]
>> rd1
>   A DNAStringSet instance of length 1
>     width seq
> [1]    36 GGAGGATAGAGTGGTAACTAAAAAGACTTGGATAAC
>> rd1 = wt5[[1]]$seq[[1]]
>> rd1
>   36-letter "DNAString" instance
> seq: GGAGGATAGAGTGGTAACTAAAAAGACTTGGATAAC
>> library(BSgenome.Scerevisiae.UCSC.sacCer2)
>> sc13 = Scerevisiae$chrXIII
>> matchPattern(rd1, sc13)
>   Views on a 924429-letter DNAString subject
> subject: CCACACACACACCACACCCACACCACACCCACAC...GTGGGTGTGGTGTGGGTGTGGTGTGTGTGTGGGG
> views: NONE
>> matchPattern(reverseComplement(rd1), sc13)
>   Views on a 924429-letter DNAString subject
> subject: CCACACACACACCACACCCACACCACACCCACAC...GTGGGTGTGGTGTGGGTGTGGTGTGTGTGTGGGG
> views:
>      start    end width
> [1] 799975 800010    36 [GTTATCCAAGTCTTTTTAGTTACCACTCTATCCTCC]
> 
> So although the read is marked + with scanBam and has  flag 16
> 
> bash-3.2$ samtools view isowt5*bam | head -2
> GA-EAS46_1_209DH:5:14:206:926   16      Scchr13 799975  25      36M
>  *       0       0       GTTATCCAAGTCTTTTTAGTTACCACTCTATCCTCC
> JONSHWfURPJUgXcadYKhfUhh^hhhhhhhh^hh    NM:i:0  X0:i:1  MD:Z:36
> 
> i need to take reverseComplement to match to the reference
> 
> the locale is C
> 
>> sessionInfo()
> R version 2.11.0 Under development (unstable) (2010-01-07 r50940)
> i386-apple-darwin9.8.0
> 
> locale:
> [1] C
> 
> attached base packages:
> [1] stats     graphics  grDevices datasets  tools     utils     methods
> [8] base
> 
> other attached packages:
> [1] BSgenome.Scerevisiae.UCSC.sacCer2_1.3.16
> [2] Rsamtools_0.1.29
> [3] BSgenome_1.15.4
> [4] Biostrings_2.15.18
> [5] IRanges_1.5.39
> [6] weaver_1.13.0
> [7] codetools_0.2-2
> [8] digest_0.4.1
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.7.0
> 
> 
> 
> On Sat, Feb 6, 2010 at 11:40 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 02/06/2010 07:29 AM, Steve Lianoglou wrote:
>>> Hi,
>>>
>>> On Sat, Feb 6, 2010 at 4:17 AM, Steve Lianoglou
>>> <mailinglist.honeypot at gmail.com> wrote:
>>>
>>>> I think there might be a bug in Rsamtools. I have reason to believe
>>>> that the strand information that's being returned from
>>>> scanBam(...)$strand is flipped, ie. a read that should be on the '+'
>>>> strand is being reported as aligning to the '-' strand.
>>>
>>> I just wanted to include an example of what I'm seeing (sorry I didn't
>>> before, the need for shut-eye was competing with the desire to write a
>>> better message :-)
>>>
>>> Anyway, here's what I get from my bamfile:
>>>
>>> $ samtools view my-sorted.bam chr2:2355-2360
>>> 1805160 16      chr2    2355    255     20M     *       0       0
>>>  GTCTTACTTTGATTATGATC    IIIIIIIIIIIIIIIIIIII    XA:i:0  MD:Z:20
>>> NM:i:0
>>> 4681745 16      chr2    2355    255     20M     *       0       0
>>>  GTCTTACTTTGATTATGATC    IIIIIIIIIIIIIIIIIIII    XA:i:0  MD:Z:20
>>> NM:i:0
>>>
>>> I believe a value of 16 in the FLAG field indicates the negative strand ...
>>>
>>> Now in R:
>>>
>>> R> bf <- 'my-sorted.bam
>>> R> r <- scanBam(bf,
>>> param=ScanBamParam(which=RangesList(chr2=IRanges(start=2355,
>>> end=2360))))[[1]]
>>> R> r$pos
>>> [1] 2355 2355
>>> R> r$strand
>>> [1] + +
>>> Levels: + - *
>>
>> I notice that your levels are different from mine; what does
>> levels(strand()) say? what's your locale? (sesssionInfo() ;) --
>> Rsamtools and strand() in general was trying to be locale-independent,
>> but this seems not to be the case...
>>
>> The workaround is likely to work in a "C" locale, ?Sys.setlocale
>>
>> Martin
>>
>>> If I'm still thinking about this straight, and it's a bug, maybe for
>>> now I'll just swap the levels of the r$strand to be c('-', '+', '*')
>>> when I get my reads back from scanBam.
>>>
>>> Thanks,
>>> -steve
>>>
>>
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-sig-sequencing mailing list