[BioC] Possible bug in ShortRead pileup function

Wed Mar 25 14:43:09 CET 2009

Martin Morgan <mtmorgan at fhcrc.org> writes:

> David Rossell <david.rossell at irbbarcelona.org> writes:
>
>> It turns out the error is due to the width method for "AlignedRead" objects.
>> "width" works fine on DNAStringSet objects but not on "AlignedRead" objects,
>> as it returns NA after the first few reads (see below)
>>
>>> width(aln1)[1:40]
>>  [1] 25 22 20 19 18 21 17 16 12 23 14 13 40 39 34 38 37 35 15 26 33 29 28 27
>> 24
>> [26] 30 32 31 36 NA NA NA NA NA NA NA NA NA NA NA
>
> width(aln) is meant to return the unique widths, and you're indexing
> past the end of those. The documentation is ambiguous about what
> is really being returned.

For the record, the behavior of width() has been changed in the
development (soon to be release!) version of the ShortRead package to
return a vector of widths, with the length of the vector equal to the
length of the object.

Thanks for bringing this inconsistency to the fore.

Martin

>>> width(aln1 at sread)[1:40]
>>  [1] 25 22 20 20 19 20 19 18 19 21 21 19 19 19 19 19 19 20 20 20 21 21 20 21
>> 19
>> [26] 21 19 20 20 21 17 16 19 19 19 20 12 12 12 12
>
> use the accessor width(sread(aln1))
>
> Martin
>
>> David
>>
>>
>> On Fri, Mar 20, 2009 at 11:16 AM, David Rossell <
>> david.rossell at irbbarcelona.org> wrote:
>>
>>> Hi, the pileup function seems to stop doing the pileup after the first few
>>> sequences. For instance, if I test the function with my first 10 sequences
>>> it works fine:
>>>
>>> > start <- position(aln1)[sel][1:10]; fraglength <- width(aln1)[sel][1:10];
>>> dir1 <- str1[1:10]
>>> > puprova <-
>>> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>>> > sum(puprova)
>>> [1] 193
>>>
>>> I manually checked that the result is correct. Then I repeat the process
>>> with the first 50 sequences
>>>
>>> > sel <- chromosome(aln1)=="chr2L"
>>> > str1 <- factor(strand(aln1)[sel],levels=c('-','+'),labels=c('-','+'))
>>> > start <- position(aln1)[sel][1:50]; fraglength <- width(aln1)[sel][1:50];
>>> dir1 <- str1[1:50]
>>> > puprova <-
>>> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>>> > sum(puprova)
>>> [1] 754
>>>
>>> So far so good, but now if I repeat the process with the first 100
>>> sequences the pileup has not incorporated any new information.
>>>
>>> > start <- position(aln1)[sel][1:100]; fraglength <-
>>> width(aln1)[sel][1:100]; dir1 <- str1[1:100]
>>> > puprova <-
>>> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>>> > sum(puprova)
>>> [1] 754
>>>
>>> Same if I use the full chromosome chr2L
>>>
>>> > start <- position(aln1)[sel]; fraglength <- width(aln1)[sel]; dir1 <-
>>> str1
>>> > puprova <-
>>> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>>> > sum(puprova)
>>> [1] 754
>>>
>>> As you can see below there are many more sequences in chromosome chr2L
>>>
>>> > table(chromosome(aln1))
>>>
>>>     chr2L  chr2LHet     chr2R  chr2RHet     chr3L  chr3LHet     chr3R
>>> chr3RHet
>>>    903343       526    978985      3898    200033      8135    639261
>>> 6096
>>>      chr4      chrM      chrU chrUextra      chrX   chrXHet   chrYHet
>>>      6246      3241     11115     15711    106300      1028       111
>>>
>>> There's also a minor bug that returns an error if the argument 'dir' to
>>> 'pileup' is a factor with levels(dir)=c('+','-'), the routine requires the
>>> levels to be ('-','+') in this exact order. Simple fixing the levels(dir)==
>>> c('-','+') for an levels(dir) %in% c('-','+') should fix the issue. Also,
>>> notice that readAligned typically returns a factor with levels ('-','+','*')
>>> and this causes an error in pileup.
>>>
>>> Thank you, take care,
>>>
>>> David
>>>
>>>
>>>

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793