[Bioc-devel] ShortRead::countLines integer overflow with large fastq files
Martin Morgan
martin.morgan at roswellpark.org
Wed Feb 21 12:11:13 CET 2018
Thanks Thomas, countLines() in ShortRead (1.37.3 and later) will return
numeric() rather than integer() and hence support large files.
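
A minimal sketch of why a numeric() return value lifts the limit, using
base R only. The helper count_lines_numeric() is hypothetical and is not
the ShortRead implementation; it only illustrates accumulating a line
count in a double. Note that R's own integer arithmetic yields NA (with a
warning) on overflow; the negative values in the report would be
consistent with a 32-bit counter wrapping in compiled code, which is an
assumption here rather than something confirmed in this thread.

```r
# R integers are 32-bit signed; the largest representable value is 2^31 - 1.
int_max <- .Machine$integer.max

# Integer arithmetic past the limit yields NA (with a warning) in R itself.
overflowed <- suppressWarnings(int_max + 1L)

# Doubles represent every integer exactly up to 2^53, so a numeric count
# keeps working well past the integer limit.
as_numeric <- as.numeric(int_max) + 1

# Hypothetical helper (not part of ShortRead): read the file in chunks and
# accumulate the total in a double rather than an integer.
count_lines_numeric <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0  # double, not 0L
  repeat {
    n <- length(readLines(con, n = chunk_size, warn = FALSE))
    if (n == 0L) break
    total <- total + n  # double + integer is a double
  }
  total
}
```

For a FASTQ file with four lines per record, count_lines_numeric(path) / 4
would then give the number of reads without hitting the integer limit.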
Martin
On 02/20/2018 10:08 PM, Thomas Girke wrote:
> Dear Martin,
>
> countLines in ShortRead returns the line counts as integers, which appears
> to create problems with large FASTQ files (>536.8 Mio reads, i.e. about
> 2147 Mio lines) due to R's
> integer limit (2^31-1). When the integer limit is reached/exceeded it seems
> that countLines returns negative values not reflecting the number of lines
> in a file anymore. At least this is what I learned after several users
> reported this problem and then running some tests myself on large FASTQ
> files with variable line numbers around the integer limit. If my conclusion
> is correct and there aren't any strong reasons against it, would it be
> possible to consider returning numeric values instead either by default or
> conditionally (e.g. when the count is >= .Machine$integer.max) to lift this
> limit? If this is not possible, then returning NAs instead of negative
> values would be a sensible compromise.
>
> Thanks,
>
> Thomas
>
>> sessionInfo()
> R version 3.4.2 (2017-09-28)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: CentOS Linux 7 (Core)
>
> Matrix products: default
> BLAS: /usr/lib64/libblas.so.3.4.2
> LAPACK: /usr/lib64/liblapack.so.3.4.2
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats4 parallel stats graphics utils datasets grDevices
> methods base
>
> other attached packages:
> [1] ShortRead_1.36.0 GenomicAlignments_1.14.1
> SummarizedExperiment_1.8.0 DelayedArray_0.4.1 matrixStats_0.52.2
> Biobase_2.38.0 Rsamtools_1.30.0
> GenomicRanges_1.30.0 GenomeInfoDb_1.14.0 Biostrings_2.46.0
> XVector_0.18.0 IRanges_2.12.0
> S4Vectors_0.16.0
> [14] BiocParallel_1.12.0 BiocGenerics_0.24.0 setwidth_1.0-4
> colorout_1.1-3
>
> loaded via a namespace (and not attached):
> [1] zlibbioc_1.24.0 lattice_0.20-35 hwriter_1.3.2
> tools_3.4.2 grid_3.4.2 latticeExtra_0.6-28
> Matrix_1.2-12 GenomeInfoDbData_0.99.1 RColorBrewer_1.1-2
> bitops_1.0-6 RCurl_1.95-4.8 compiler_3.4.2