[Bioc-sig-seq] Fastq File size limit in the Short Read Package

Martin Morgan mtmorgan at fhcrc.org
Sat Apr 3 00:37:45 CEST 2010


Hi Sirisha --

On 04/02/2010 02:57 PM, Sirisha Sunkara wrote:
> Hi Martin,
> 
> The readFastq function in the devel version of ShortRead, installed with
> the devel version of R, does seem to read >6.5 Gb files fine, but
> extracting the quality scores and converting them to a matrix gives the
> following memory error...
> 
>> reads <- readFastq("./s_7_1_sequence.txt", qualityType="SFastqQuality")
>> qual <- quality(reads)
>> qual <- as(qual, "matrix")
> Error in asMethod(object) : allocMatrix: too many elements specified
> 
> This fastq file has >31 million 76 cycle reads. Is this a known issue?

I cc'd the bioc-sig-seq mailing list, as this might be useful to others.
R is not able to create a matrix of that size:

> matrix(0, 31000000, 72)
Error in matrix(0, 3.1e+07, 72) : too many elements specified
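
As a rough check of the arithmetic (a sketch only; the read count is the
approximate figure quoted above, and I'm assuming the limit in the R
versions discussed here is the largest representable integer, 2^31 - 1):

```r
n_reads  <- 31e6   # approximate number of reads reported above
n_cycles <- 76     # cycles per read reported above

n_elements <- n_reads * n_cycles   # elements the full matrix would need
n_elements                         # about 2.36e9
.Machine$integer.max               # 2^31 - 1

## TRUE: the requested matrix has more elements than the limit allows
n_elements > .Machine$integer.max
```

The 72-column example above exceeds the same bound, which is why it fails
in the same way.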

So yes, this is a fundamental limit imposed by R. If the idea is to
summarize the quality scores in some way, then perhaps

 qual = as(quality(reads)[sample(length(reads), 1e7)], "matrix")

or looping over subsets would capture enough information to be useful?
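
A minimal sketch of the looping idea, assuming all reads have the same
width and that a per-cycle mean is the summary wanted. The helper
chunked_col_means and its arguments are hypothetical names made up for
illustration, not part of ShortRead; the toy matrix only stands in for the
real quality scores so that the example runs on its own.

```r
## Accumulate per-cycle sums one chunk at a time, so that only a
## chunk-sized matrix exists in memory at any moment.
## get_chunk(idx) must return a numeric matrix with one row per read in idx.
chunked_col_means <- function(n, n_cycles, get_chunk, chunk_size = 1e6) {
    total <- numeric(n_cycles)
    for (start in seq(1, n, by = chunk_size)) {
        idx <- start:min(start + chunk_size - 1, n)
        total <- total + colSums(get_chunk(idx))
    }
    total / n
}

## Toy stand-in for the quality matrix: 5 "reads" by 4 "cycles"
toy <- matrix(1:20, nrow = 5)
res <- chunked_col_means(nrow(toy), ncol(toy),
                         function(i) toy[i, , drop = FALSE],
                         chunk_size = 2)
stopifnot(isTRUE(all.equal(res, colMeans(toy))))

## With ShortRead the call might look like this (untested sketch):
## q <- quality(reads)
## chunked_col_means(length(reads), 76, function(i) as(q[i], "matrix"))
```

The chunk size is arbitrary; anything that keeps chunk_size times the
number of cycles comfortably below the limit should do.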

Martin

> 
> Thank You,
> Sirisha
> 
>> sessionInfo()
> R version 2.11.0 Under development (unstable) (2010-03-07 r51225)
> x86_64-unknown-linux-gnu
> 
> locale:
> [1] C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] ShortRead_1.5.21    lattice_0.18-3      Biostrings_2.15.22
> [4] GenomicRanges_0.1.0 IRanges_1.5.74
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.7.5 grid_2.11.0   hwriter_1.2
> 
> 
> 
> Martin Morgan wrote:
>> On 03/23/2010 05:00 PM, Sirisha Sunkara wrote:
>>  
>>> Hi Martin,
>>>
>>> Using the ShortRead package, for files > 6.5 Gb size, I seem to be
>>> running into this error using the readFastq function:
>>>
>>> Error in .Call(.read_solexa_fastq, src, withIds) :
>>>  negative length vectors are not allowed
>>>
>>> If this is memory related - is there a work-around to working with the
>>> entire file?
>>>     
>>
>> Hi Sirisha,
>>
>> This is addressed in the 'devel' version of ShortRead, for which you
>> would need to install the 'devel' version of R and then re-install
>> Bioconductor packages. The workaround is to use an external tool (e.g.,
>> the command 'split' in linux) to split the file into smaller chunks.
>> Use the -l option with a line count that is a multiple of 4, since
>> each fastq record spans 4 lines and must not be cut across files.
>>
>> Martin
>>
>>  
>>> Thank You,
>>> Sirisha
>>>
>>>    
>>>> sessionInfo()
>>>>       
>>> R version 2.10.1 (2009-12-14)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] ShortRead_1.4.0    lattice_0.17-26    BSgenome_1.14.2    Biostrings_2.14.12
>>> [5] IRanges_1.4.11
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.6.1 grid_2.10.1   hwriter_1.1
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>     
>>
>>
>>   
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


