[BioC] Possible bug in ShortRead pileup function
Martin Morgan
mtmorgan at fhcrc.org
Fri Mar 20 14:11:37 CET 2009
Hi again --
David Rossell <david.rossell at irbbarcelona.org> writes:
> Hi, the pileup function seems to stop doing the pileup after the first few
> sequences. For instance, if I test the function with my first 10 sequences
> it works fine:
>
>> start <- position(aln1)[sel][1:10]; fraglength <- width(aln1)[sel][1:10];
> dir1 <- str1[1:10]
>> puprova <-
> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>> sum(puprova)
> [1] 193
>
> I manually checked that the result is correct. Then I repeat the process
> with the first 50 sequences
>
>> sel <- chromosome(aln1)=="chr2L"
>> str1 <- factor(strand(aln1)[sel],levels=c('-','+'),labels=c('-','+'))
>> start <- position(aln1)[sel][1:50]; fraglength <- width(aln1)[sel][1:50];
> dir1 <- str1[1:50]
>> puprova <-
> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>> sum(puprova)
> [1] 754
>
> So far so good, but now if I repeat the process with the first 100 sequences
> the pileup has not incorporated any new information.
>
>> start <- position(aln1)[sel][1:100]; fraglength <-
> width(aln1)[sel][1:100]; dir1 <- str1[1:100]
>> puprova <-
> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>> sum(puprova)
> [1] 754
>
> Same if I use the full chromosome chr2L
>
>> start <- position(aln1)[sel]; fraglength <- width(aln1)[sel]; dir1 <- str1
>> puprova <-
> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>> sum(puprova)
> [1] 754
If I run example(readAligned), and then do the following (to get some
data that is reproducible, though not meaningful)
> example(readAligned)
[snip]
> aln <- readAligned(ap, "s_2_export.txt", "SolexaExport")
> aln1 = aln[!is.na(position(aln))]
I end up with 406 reads, which pile up to
> pup = pileup(position(aln1), 35, max(position(aln1))+35,
+ dir=factor(strand(aln), levels=c("-", "+")))
and
> sum(pup) / width(aln2)
[1] 406
so something non-trivial is going on. Can you provide a reproducible
example (privately or otherwise) and your sessionInfo()? also...
> As you can see below there are many more sequences in chromosome chr2L
>
>> table(chromosome(aln1))
>
> chr2L chr2LHet chr2R chr2RHet chr3L chr3LHet chr3R
> chr3RHet
> 903343 526 978985 3898 200033 8135 639261
> 6096
> chr4 chrM chrU chrUextra chrX chrXHet chrYHet
> 6246 3241 11115 15711 106300 1028 111
>
> There's also a minor bug that returns an error if the argument 'dir' to
> 'pileup' is a factor with levels(dir)=c('+','-'), the routine requires the
> levels to be ('-','+') in this exact order. Simple fixing the levels(dir)==
> c('-','+') for an levels(dir) %in% c('-','+') should fix the issue. Also,
> notice that readAligned typically returns a factor with levels ('-','+','*')
> and this causes an error in pileup.
The code requires factors with levels in the specified order, but the
development (soon to be released) version has tidied this up. One
small facility is a method for strand() that takes a character vector
and creates a factor with the levels ordered consistently.
In general the code in ShortRead and supporting packages has changed
a lot, and using the 'devel' version of R and Bioconductor for these
applications is a good idea.
Here's my sessionInfo()
> sessionInfo()
R version 2.8.1 Patched (2009-02-25 r48007)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
attached base packages:
[1] tools stats graphics utils datasets grDevices methods
[8] base
other attached packages:
[1] ShortRead_1.0.7 lattice_0.17-15 Biobase_2.2.2 Biostrings_2.10.21
[5] IRanges_1.0.14
loaded via a namespace (and not attached):
[1] grid_2.8.1 Matrix_0.999375-22
Martin
> Thank you, take care,
>
> David
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793
More information about the Bioconductor
mailing list