[BioC] Possible bug in ShortRead pileup function

Fri Mar 20 14:29:16 CET 2009

David Rossell <david.rossell at irbbarcelona.org> writes:

> It turns out the error is due to the width method for "AlignedRead" objects.
> "width" works fine on DNAStringSet objects but not on "AlignedRead" objects,
> as it returns NA after the first few reads (see below)
>
>> width(aln1)[1:40]
>  [1] 25 22 20 19 18 21 17 16 12 23 14 13 40 39 34 38 37 35 15 26 33 29 28 27
> 24
> [26] 30 32 31 36 NA NA NA NA NA NA NA NA NA NA NA

width(aln) is meant to return the unique widths, and you're indexing
past the end of those. The documentation is ambiguous about what
is really being returned.

>> width(aln1 at sread)[1:40]
>  [1] 25 22 20 20 19 20 19 18 19 21 21 19 19 19 19 19 19 20 20 20 21 21 20 21
> 19
> [26] 21 19 20 20 21 17 16 19 19 19 20 12 12 12 12

use the accessor width(sread(aln1))

Martin

> David
>
>
> On Fri, Mar 20, 2009 at 11:16 AM, David Rossell <
> david.rossell at irbbarcelona.org> wrote:
>
>> Hi, the pileup function seems to stop doing the pileup after the first few
>> sequences. For instance, if I test the function with my first 10 sequences
>> it works fine:
>>
>> > start <- position(aln1)[sel][1:10]; fraglength <- width(aln1)[sel][1:10];
>> dir1 <- str1[1:10]
>> > puprova <-
>> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>> > sum(puprova)
>> [1] 193
>>
>> I manually checked that the result is correct. Then I repeat the process
>> with the first 50 sequences
>>
>> > sel <- chromosome(aln1)=="chr2L"
>> > str1 <- factor(strand(aln1)[sel],levels=c('-','+'),labels=c('-','+'))
>> > start <- position(aln1)[sel][1:50]; fraglength <- width(aln1)[sel][1:50];
>> dir1 <- str1[1:50]
>> > puprova <-
>> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>> > sum(puprova)
>> [1] 754
>>
>> So far so good, but now if I repeat the process with the first 100
>> sequences the pileup has not incorporated any new information.
>>
>> > start <- position(aln1)[sel][1:100]; fraglength <-
>> width(aln1)[sel][1:100]; dir1 <- str1[1:100]
>> > puprova <-
>> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>> > sum(puprova)
>> [1] 754
>>
>> Same if I use the full chromosome chr2L
>>
>> > start <- position(aln1)[sel]; fraglength <- width(aln1)[sel]; dir1 <-
>> str1
>> > puprova <-
>> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>> > sum(puprova)
>> [1] 754
>>
>> As you can see below there are many more sequences in chromosome chr2L
>>
>> > table(chromosome(aln1))
>>
>>     chr2L  chr2LHet     chr2R  chr2RHet     chr3L  chr3LHet     chr3R
>> chr3RHet
>>    903343       526    978985      3898    200033      8135    639261
>> 6096
>>      chr4      chrM      chrU chrUextra      chrX   chrXHet   chrYHet
>>      6246      3241     11115     15711    106300      1028       111
>>
>> There's also a minor bug that returns an error if the argument 'dir' to
>> 'pileup' is a factor with levels(dir)=c('+','-'), the routine requires the
>> levels to be ('-','+') in this exact order. Simple fixing the levels(dir)==
>> c('-','+') for an levels(dir) %in% c('-','+') should fix the issue. Also,
>> notice that readAligned typically returns a factor with levels ('-','+','*')
>> and this causes an error in pileup.
>>
>> Thank you, take care,
>>
>> David
>>
>>
>>

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793