[Bioc-devel] ShortRead::countLines integer overflow with large fastq files

Thomas Girke thomas.girke at ucr.edu
Wed Feb 21 16:47:46 CET 2018


Great, thanks!

Thomas

On Wed, Feb 21, 2018 at 3:11 AM Martin Morgan <martin.morgan at roswellpark.org>
wrote:

> Thanks Thomas, countLines() (in ShortRead 1.37.3 and later) will return
> numeric() rather than integer() and hence support large files.
>
> Martin
>
> On 02/20/2018 10:08 PM, Thomas Girke wrote:
> > Dear Martin,
> >
> > countLines in ShortRead returns the line counts as integers, which appears
> > to create problems with large FASTQ files (>536.8 million reads at four
> > lines per read) due to R's
> > integer limit (2^31-1). When the integer limit is reached or exceeded, it
> > seems that countLines returns negative values that no longer reflect the
> > number of lines in a file. At least this is what I learned after several
> > users reported this problem and I then ran some tests myself on large FASTQ
> > files with line counts around the integer limit. If my conclusion is
> > correct and there aren't any strong reasons against it, would it be
> > possible to return numeric values instead, either by default or
> > conditionally (e.g. when the count is >= .Machine$integer.max), to lift
> > this limit? If this is not possible, then returning NAs instead of negative
> > values would be a sensible compromise.
> >
> > Thanks,
> >
> > Thomas
> >
> >> sessionInfo()
> > R version 3.4.2 (2017-09-28)
> > Platform: x86_64-pc-linux-gnu (64-bit)
> > Running under: CentOS Linux 7 (Core)
> >
> > Matrix products: default
> > BLAS: /usr/lib64/libblas.so.3.4.2
> > LAPACK: /usr/lib64/liblapack.so.3.4.2
> >
> > locale:
> > [1] C
> >
> > attached base packages:
> > [1] stats4    parallel  stats     graphics  utils     datasets  grDevices
> > [8] methods   base
> >
> > other attached packages:
> >  [1] ShortRead_1.36.0           GenomicAlignments_1.14.1
> >  [3] SummarizedExperiment_1.8.0 DelayedArray_0.4.1
> >  [5] matrixStats_0.52.2         Biobase_2.38.0
> >  [7] Rsamtools_1.30.0           GenomicRanges_1.30.0
> >  [9] GenomeInfoDb_1.14.0        Biostrings_2.46.0
> > [11] XVector_0.18.0             IRanges_2.12.0
> > [13] S4Vectors_0.16.0           BiocParallel_1.12.0
> > [15] BiocGenerics_0.24.0        setwidth_1.0-4
> > [17] colorout_1.1-3
> >
> > loaded via a namespace (and not attached):
> >  [1] zlibbioc_1.24.0         lattice_0.20-35         hwriter_1.3.2
> >  [4] tools_3.4.2             grid_3.4.2              latticeExtra_0.6-28
> >  [7] Matrix_1.2-12           GenomeInfoDbData_0.99.1 RColorBrewer_1.1-2
> > [10] bitops_1.0-6            RCurl_1.95-4.8          compiler_3.4.2
> >
> > _______________________________________________
> > Bioc-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >
>
>
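
For readers following the thread, here is a minimal base-R sketch of the limit under discussion. It does not call ShortRead. The helper `sum_counts()` is hypothetical and only shows why a numeric (double) total avoids the problem. Note that R's own integer arithmetic returns NA on overflow; the negative values reported above would be consistent with a 32-bit counter wrapping in compiled code, which is an assumption here and not something the thread confirms.

```r
# R integers are 32-bit signed; the largest value is 2^31 - 1.
int_max <- .Machine$integer.max

# Integer arithmetic in R does not wrap; it yields NA (with a warning).
overflowed <- suppressWarnings(int_max + 1L)

# Doubles hold whole numbers exactly up to 2^53, so the count stays correct.
as_numeric <- as.numeric(int_max) + 1

# A FASTQ record spans 4 lines, so the line count passes the integer
# limit at about 536.9 million reads.
reads_at_limit <- 2^31 / 4

# Hypothetical helper (not part of ShortRead): convert per-chunk line
# counts to numeric before summing so the total cannot overflow.
sum_counts <- function(chunk_counts) sum(as.numeric(chunk_counts))
total <- sum_counts(c(int_max, 10L))
```

Returning numeric() costs nothing in precision at these sizes, which is presumably why it is a simpler fix than switching types conditionally.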
