[Bioc-devel] ShortRead::countLines integer overflow with large fastq files

Martin Morgan martin.morgan at roswellpark.org
Wed Feb 21 12:11:13 CET 2018


Thanks Thomas. countLines() (in ShortRead 1.37.3 and later) will return
numeric() rather than integer() and hence supports large files.

Martin
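
For readers of the archive, here is a minimal sketch of why a numeric()
return value lifts the limit. It is not the ShortRead implementation;
wrap_int32() and count_lines_numeric() are hypothetical helpers written
only for illustration. The wrap-around helper assumes that the negative
values reported below come from a 32-bit signed counter overflowing in
compiled code, which the thread does not confirm.

```r
int_max <- .Machine$integer.max           # 2^31 - 1

# Integer arithmetic in R itself does not wrap: it yields NA with a warning.
overflowed <- suppressWarnings(int_max + 1L)
is.na(overflowed)

# The same count held as a double stays exact (doubles are exact up to 2^53).
as_double <- as.numeric(int_max) + 1
as_double == 2^31

# Hypothetical helper: the value a 32-bit signed counter would hold after
# wrapping around (assumption: this is how negative counts can arise).
wrap_int32 <- function(n) ((n + 2^31) %% 2^32) - 2^31
wrap_int32(2^31 + 5)                      # negative

# Hypothetical helper: count lines chunk by chunk, accumulating the total
# as a double so it cannot overflow at 2^31 - 1.
count_lines_numeric <- function(path, chunk = 1e6) {
  con <- file(path, open = "r")           # file() also reads gzip input
  on.exit(close(con))
  total <- 0                              # double accumulator
  repeat {
    n <- length(readLines(con, n = chunk, warn = FALSE))
    if (n == 0L) break
    total <- total + n
  }
  total
}
```

A pure-R loop like this is much slower than compiled code; it only shows
that accumulating in a double avoids the overflow.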

On 02/20/2018 10:08 PM, Thomas Girke wrote:
> Dear Martin,
> 
> countLines in ShortRead returns the line counts as integers, which appears
> to create problems with large FASTQ files (>536.8 million reads, i.e. about
> 2.147 billion lines at four lines per read) due to R's integer limit
> (2^31-1). When the integer limit is exceeded, countLines seems to return
> negative values that no longer reflect the number of lines in the file. At
> least this is what I learned after several users reported this problem and
> I then ran some tests myself on large FASTQ files with line counts around
> the integer limit. If my conclusion is correct and there aren't any strong
> reasons against it, would it be possible to consider returning numeric
> values instead, either by default or conditionally (e.g. when the count is
> >= .Machine$integer.max), to lift this limit? If this is not possible, then
> returning NAs instead of negative values would be a sensible compromise.
> 
> Thanks,
> 
> Thomas
> 
>> sessionInfo()
> R version 3.4.2 (2017-09-28)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: CentOS Linux 7 (Core)
> 
> Matrix products: default
> BLAS: /usr/lib64/libblas.so.3.4.2
> LAPACK: /usr/lib64/liblapack.so.3.4.2
> 
> locale:
> [1] C
> 
> attached base packages:
> [1] stats4    parallel  stats     graphics  utils     datasets  grDevices methods   base
> 
> other attached packages:
>  [1] ShortRead_1.36.0           GenomicAlignments_1.14.1   SummarizedExperiment_1.8.0 DelayedArray_0.4.1         matrixStats_0.52.2         Biobase_2.38.0             Rsamtools_1.30.0           GenomicRanges_1.30.0       GenomeInfoDb_1.14.0        Biostrings_2.46.0          XVector_0.18.0             IRanges_2.12.0             S4Vectors_0.16.0
> [14] BiocParallel_1.12.0        BiocGenerics_0.24.0        setwidth_1.0-4             colorout_1.1-3
> 
> loaded via a namespace (and not attached):
>  [1] zlibbioc_1.24.0         lattice_0.20-35         hwriter_1.3.2           tools_3.4.2             grid_3.4.2              latticeExtra_0.6-28     Matrix_1.2-12           GenomeInfoDbData_0.99.1 RColorBrewer_1.1-2      bitops_1.0-6            RCurl_1.95-4.8          compiler_3.4.2