[Bioc-sig-seq] Size of Illumina fastaq files to be read in shortReads

Martin Morgan mtmorgan at fhcrc.org
Wed Jun 24 21:04:49 CEST 2009


Hi Anastasia --

Anastasia Gioti wrote:
> Dear list,
> I just started playing with shortReads package in order to read fastaq
> files from the illumina analyzer, and i have some issues.
> The most important is the fact that the readFastaq crashes because of
> memory I suppose when i try to read files >1GB. Ex:
> fqpattern='s_3_1_sequence.txt'

It is worth being more careful in specifying the pattern, e.g., as
'^s_3_1_sequence.txt$', otherwise the pattern also matches
s_3_1_sequence.txt.gz. Using list.files(dirPath, pattern) is a good way
to evaluate which files will be parsed.
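
A minimal sketch of that check, using a throwaway directory instead of the
real GERALD folder (the temporary directory and its two empty files are
made up for illustration). It also escapes the dot, which is otherwise a
regular-expression wildcard:

```r
# Throwaway directory holding the two file names discussed in this thread.
dirPath <- tempfile("fastq_demo_")
dir.create(dirPath)
invisible(file.create(file.path(
    dirPath, c("s_3_1_sequence.txt", "s_3_1_sequence.txt.gz"))))

# Unanchored pattern: matches anywhere in the name, so the .gz file is
# picked up too.
loose <- list.files(dirPath, pattern = "s_3_1_sequence.txt")

# Anchored pattern with an escaped dot: only the exact file name matches.
strict <- list.files(dirPath, pattern = "^s_3_1_sequence\\.txt$")

print(loose)
print(strict)
```

Running list.files() like this before calling readFastq shows exactly
which files the parser would try to load.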

>> afrN=file.path(analysisPath(sp), fqpattern)
>> afrN
> [1]
> "/Users/nat/Data/Illumina/Solexa_disk_modforR/Data/HJSN_FC1_280409_3//Data/C1-C55Firecrest/Bustard1.3.2_06-05-2009_rdixon/GERALD_06-05-2009_rdixon/s_3_1_sequence.txt"
> 
>> afrNq=readFastq(sp, fqpattern)
> Error: cannot allocate vector of size 27.0 Mb
> R(1337,0xa07a2720) malloc: *** mmap(size=28340224) failed (error code=12)
> *** error: can't allocate region
> *** set a breakpoint in malloc_error_break to debug
> R(1337,0xa07a2720) malloc: *** mmap(size=28340224) failed (error code=12)
> *** error: can't allocate region
> *** set a breakpoint in malloc_error_break to debug
> 
> I only succeeded in reading a file < 1GB, but i suppose that the
> shortReads class is designed for big files ;-).

All the data ends up in memory, so there are definite limits. readFastq
has an argument withIds (see ?readFastq) that, if set to FALSE, skips
the identifier strings and so makes the data more manageable.
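
A rough sketch of the withIds argument, assuming the ShortRead package is
installed (the block is skipped otherwise). The two-record file and its
name demo_sequence.txt are hypothetical stand-ins for a real Illumina
file:

```r
# Requires the Bioconductor package ShortRead; does nothing if it is absent.
if (requireNamespace("ShortRead", quietly = TRUE)) {
    # Hypothetical two-record FASTQ file in a temporary directory.
    dirPath <- tempfile("fastq_ids_")
    dir.create(dirPath)
    writeLines(c("@read1", "ACGTACGT", "+", "IIIIIIII",
                 "@read2", "TTGGCCAA", "+", "IIIIIIII"),
               file.path(dirPath, "demo_sequence.txt"))

    # Default: identifier strings are read along with bases and qualities.
    fqFull <- ShortRead::readFastq(dirPath, "^demo_sequence\\.txt$")

    # withIds = FALSE: the identifier strings are not read.
    fqNoIds <- ShortRead::readFastq(dirPath, "^demo_sequence\\.txt$",
                                    withIds = FALSE)

    # The number of reads is the same either way.
    print(length(fqFull))
    print(length(fqNoIds))
}
```

Note that dirPath here is an ordinary directory path, not a SolexaPath
object.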

There is some effort under way to make the parsers more space-efficient,
but you'll only see this in the development version of the package,
which requires using the development version of R (until the next
Bioconductor release, in the fall). Usually R-devel is fairly stable,
but both it and the Bioconductor development packages can go through
periods of reduced functionality.

> Another minor issue is the names of the folders in the Illumina output
> directory that I need to designate in exptPath so that
> p=SolexaPath(exptPath) is correctly parsed. I finally managed to find
> the logic behind this, but I would like to confirm that the path
> absolutely needs to contain this string: Data/C1-C(readlength)Firecrest.

SolexaPath has additional arguments for the different components, so you
can provide those directly to override the defaults. Also, SolexaPath
was meant as a convenient way of navigating the folder hierarchy, so if
it's getting in the way then just provide regular directory paths to the
I/O functions.

> At least in my hands it would not work with other names (which are
> currently produced by illumina, for ex IPAR instead of Firecrest). Is

ShortRead tries to keep up with the major file formats. Parsing of
IPAR-specific files was introduced into the development branch in March
and is in the released version of ShortRead, available with R-2.9.

> that correct? Maybe this parser is hard coded for previous versions of
> Illumina outputs? In that case is there any plan to update it? Although
> this is not very important
> 
> I use R2.8 on a Leopard with 8GB of memory, so I think that my problem
> with fastq does not come from my computer...
> Any help /suggestions are welcome!
> Thank you,
> 
> Anastasia Gioti
> Post-Doc, Evolutionary Biology Department
> Upssala University
> Norbyvagen 18D
> SE-752 36  UPPSALA
> anastasia.gioti at ebc.uu.se
> Tel: +46-18-471 6465
> Fax: +46-18-471 6310
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


