[Bioc-sig-seq] strandedness bug in Rsamtools?

Martin Morgan mtmorgan at fhcrc.org
Sat Feb 13 01:43:18 CET 2010


On 02/06/2010 10:57 AM, Martin Morgan wrote:
> ok, I think it's this
> 
> ------------------------------------------------------------------------
> r44085 | p.aboyoun | 2010-01-19 15:16:56 -0800 (Tue, 19 Jan 2010) | 2 lines
> Changed paths:
>    M /trunk/madman/Rpacks/BSgenome/R/strand.R
> 
> Added support for Rle strand column in strand,DataTable method.
> Also made "+" the first level of strand factor.
> ------------------------------------------------------------------------
> 
> which changed the order of strand levels. This will take a bit of
> coordination on our part. Thanks for the reports.

Steve, Vince, and others --

Thanks for your patience. The strand issue has been resolved in favor of
the optimists --

> levels(strand())
[1] "+" "-" "*"

In addition, the default way in which reads and quality strings are
returned by scanBam have been CHANGED -- reads and quality strings are
returned as represented in the BAM file. This mean (supposing that the
reads have entered the BAM file correctly) that all reads are presented
as they occur when mapped to the 5' strand. This is different from
ShortRead, where reads are represented in their native 5'->3'
orientation; the file 'scripts/.readAligned_bam.R', which reads BAM
files into ShortRead AlignedRead objects, has been updated to input
reads in a manner consistent with ShortRead.

This is with Rsamtools v. 0.1.34

Martin

> 
> Martin
> 
> 
> On 02/06/2010 10:23 AM, Vincent Carey wrote:
>> using a bam file in leeBamSet/inst/bam (in svn experimentData, not yet released)
>>
>>> wt5 = scanBam("isowt5_13e.bam")
>>> p1 = wt5[[1]]$pos[1]
>>> p1
>> [1] 799975
>>> str1 = wt5[[1]]$strand[1]
>>> str1
>> [1] +
>> Levels: + - *
>>> rd1 = wt5[[1]]$seq[1]
>>> rd1
>>   A DNAStringSet instance of length 1
>>     width seq
>> [1]    36 GGAGGATAGAGTGGTAACTAAAAAGACTTGGATAAC
>>> rd1 = wt5[[1]]$seq[[1]]
>>> rd1
>>   36-letter "DNAString" instance
>> seq: GGAGGATAGAGTGGTAACTAAAAAGACTTGGATAAC
>>> library(BSgenome.Scerevisiae.UCSC.sacCer2)
>>> sc13 = Scerevisiae$chrXIII
>>> matchPattern(rd1, sc13)
>>   Views on a 924429-letter DNAString subject
>> subject: CCACACACACACCACACCCACACCACACCCACAC...GTGGGTGTGGTGTGGGTGTGGTGTGTGTGTGGGG
>> views: NONE
>>> matchPattern(reverseComplement(rd1), sc13)
>>   Views on a 924429-letter DNAString subject
>> subject: CCACACACACACCACACCCACACCACACCCACAC...GTGGGTGTGGTGTGGGTGTGGTGTGTGTGTGGGG
>> views:
>>      start    end width
>> [1] 799975 800010    36 [GTTATCCAAGTCTTTTTAGTTACCACTCTATCCTCC]
>>
>> So although the read is marked + with scanBam and has  flag 16
>>
>> bash-3.2$ samtools view isowt5*bam | head -2
>> GA-EAS46_1_209DH:5:14:206:926   16      Scchr13 799975  25      36M
>>  *       0       0       GTTATCCAAGTCTTTTTAGTTACCACTCTATCCTCC
>> JONSHWfURPJUgXcadYKhfUhh^hhhhhhhh^hh    NM:i:0  X0:i:1  MD:Z:36
>>
>> i need to take reverseComplement to match to the reference
>>
>> the locale is C
>>
>>> sessionInfo()
>> R version 2.11.0 Under development (unstable) (2010-01-07 r50940)
>> i386-apple-darwin9.8.0
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices datasets  tools     utils     methods
>> [8] base
>>
>> other attached packages:
>> [1] BSgenome.Scerevisiae.UCSC.sacCer2_1.3.16
>> [2] Rsamtools_0.1.29
>> [3] BSgenome_1.15.4
>> [4] Biostrings_2.15.18
>> [5] IRanges_1.5.39
>> [6] weaver_1.13.0
>> [7] codetools_0.2-2
>> [8] digest_0.4.1
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.7.0
>>
>>
>>
>> On Sat, Feb 6, 2010 at 11:40 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>> On 02/06/2010 07:29 AM, Steve Lianoglou wrote:
>>>> Hi,
>>>>
>>>> On Sat, Feb 6, 2010 at 4:17 AM, Steve Lianoglou
>>>> <mailinglist.honeypot at gmail.com> wrote:
>>>>
>>>>> I think there might be a bug in Rsamtools. I have reason to believe
>>>>> that the strand information that's being returned from
>>>>> scanBam(...)$strand is flipped, ie. a read that should be on the '+'
>>>>> strand is being reported as aligning to the '-' strand.
>>>>
>>>> I just wanted to include an example of what I'm seeing (sorry I didn't
>>>> before, the need for shut-eye was competing with the desire to write a
>>>> better message :-)
>>>>
>>>> Anyway, here's what I get from my bamfile:
>>>>
>>>> $ samtools view my-sorted.bam chr2:2355-2360
>>>> 1805160 16      chr2    2355    255     20M     *       0       0
>>>>  GTCTTACTTTGATTATGATC    IIIIIIIIIIIIIIIIIIII    XA:i:0  MD:Z:20
>>>> NM:i:0
>>>> 4681745 16      chr2    2355    255     20M     *       0       0
>>>>  GTCTTACTTTGATTATGATC    IIIIIIIIIIIIIIIIIIII    XA:i:0  MD:Z:20
>>>> NM:i:0
>>>>
>>>> I believe a value of 16 in the FLAG field indicates the negative strand ...
>>>>
>>>> Now in R:
>>>>
>>>> R> bf <- 'my-sorted.bam
>>>> R> r <- scanBam(bf,
>>>> param=ScanBamParam(which=RangesList(chr2=IRanges(start=2355,
>>>> end=2360))))[[1]]
>>>> R> r$pos
>>>> [1] 2355 2355
>>>> R> r$strand
>>>> [1] + +
>>>> Levels: + - *
>>>
>>> I notice that your levels are different from mine; what does
>>> levels(strand()) say? what's your locale? (sesssionInfo() ;) --
>>> Rsamtools and strand() in general was trying to be locale-independent,
>>> but this seems not to be the case...
>>>
>>> The workaround is likely to work in a "C" locale, ?Sys.setlocale
>>>
>>> Martin
>>>
>>>> If I'm still thinking about this straight, and it's a bug, maybe for
>>>> now I'll just swap the levels of the r$strand to be c('-', '+', '*')
>>>> when I get my reads back from scanBam.
>>>>
>>>> Thanks,
>>>> -steve
>>>>
>>>
>>>
>>> --
>>> Martin Morgan
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
> 
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-sig-sequencing mailing list