[BioC] FastqSampler not sampling correctly in ShortRead

Martin Morgan mtmorgan at fhcrc.org
Wed Apr 25 13:43:46 CEST 2012


Thanks Marcus, this serious bug was introduced in ShortRead 1.12.3 / 
1.13.9 on Dec 4/5 2011. It is fixed in ShortRead 1.4.3 / 1.5.4. Records 
2:n were actually records 2:n in the file.

FWIW in case there is confusion the intended way to draw independent 
samples is just

   f = open(FastqSample(fl, 50)
   s1 = yield(f); s2 = yield(f)
   close(f)

Martin

On 04/25/2012 12:50 AM, Marcus Davy wrote:
> Hi,
> there appears to be a problem in FastqSampler() which seems to sample the
> first read at random, but then the next n-1 reads
> are always the same, so the next sample obtained with yield() is not
> independent. Is this a bug, or my code implementation wrong?
>
> ## From example(FastqSampler)
> sp<- SolexaPath(system.file('extdata', package='ShortRead'))
> fl<- file.path(analysisPath(sp), "s_1_sequence.txt")
>
> f<- FastqFile(fl)
> rfq<- readFastq(f)
>
> set.seed(1)
> f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE)
> s1<- yield(f)    # sample of size n=50
> set.seed(2)
> f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE)
> s2<- yield(f)    # independent sample of size 50
>
>
> ## Intersection length between samples always 50-1
> length(intersect(id(s1), id(s2)))
>
> ## First read is different, the next 2:n reads are the same
> head(sread(s1))
> head(sread(s2))
> close(f)
>
>> head(sread(s1))
>    A DNAStringSet instance of length 6
>      width seq
> [1]    36 GTCTGGAAACGTACGGATTGTTCAGTAACTTTACTC # Different
> [2]    36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC #Same
> [3]    36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT # ...
> [4]    36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA # ...
> [5]    36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT # ...
> [6]    36 GTTCTCTAAAAACCATTTTTCGTCCCCTTCGGGGCG Same
>> head(sread(s2))
>    A DNAStringSet instance of length 6
>      width seq
> [1]    36 GTTTACGAATTAAATCGAAGTGGACTGCTGGGGGGA # Different
> [2]    36 GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC #Same
> [3]    36 GCGGTGGTCTATAGTGTTATTAATATCAATTTGGGT # ...
> [4]    36 GTTACCATGATGTTATTTCTTCATTTGGAGGTAAAA # ...
> [5]    36 GTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTT # ...
> [6]    36 GTTCTCTAAAAACCATTTTTCGTCCCCTTCGGGGCG #Same
>
> Marcus
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] ShortRead_1.14.2    latticeExtra_0.6-19 RColorBrewer_1.0-5
> [4] Rsamtools_1.8.4     lattice_0.20-6      Biostrings_2.24.1
> [7] GenomicRanges_1.8.3 IRanges_1.14.2      BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0    hwriter_1.3
>   stats4_2.15.0
> [6] tools_2.15.0   zlibbioc_1.2.0
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioconductor mailing list