[R] R "write" strange behavior in huge file

Maxime Vallee ValleeM at iarc.fr
Thu Sep 18 16:40:40 CEST 2014

Thank you, it is exactly that.

I followed your idea of chunks (1 GB chunks, to be on the safe side) and appended them. Worked like a charm, thank you.


From: "Stefan Evert (Mailing Lists)" <stefanML at collocations.de>
Date: Wednesday 17 September 2014 15:39
To: Maxime Vallée <valleem at iarc.fr>
Cc: R-help Mailing List <r-help at r-project.org>
Subject: Re: [R] R "write" strange behavior in huge file

You probably told R to write out the file as a single long line with fields separated alternately by 380 TABs and one newline – that’s what the ncol argument does (write is just a small wrapper around cat()).
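A small illustration of that behaviour, assuming a toy vector and temporary files (both are stand-ins, not the original data). The full argument name is ncolumns; ncol works through partial matching. write() hands the vector to cat() with a separator vector of ncolumns - 1 TABs followed by one newline:

```r
# Toy stand-in for the unlisted data: 6 values, 3 per row.
x <- 1:6
out <- tempfile(fileext = ".txt")

# write() builds the separator vector c(rep.int(sep, ncolumns - 1), "\n")
# and passes everything on to cat().
write(x, file = out, ncolumns = 3, sep = "\t")

# The same output produced with cat() directly.
out2 <- tempfile(fileext = ".txt")
cat(x, file = out2, sep = c(rep.int("\t", 2), "\n"))

lines1 <- readLines(out)
lines2 <- readLines(out2)
identical(lines1, lines2)
```

With 381 columns the separator vector simply has 380 TABs before the newline, so from cat()'s point of view the whole file is one long stream of values.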

cat() doesn’t print lines that are longer than 2 GiB, so it will insert an extra \n after every 2 GiB of data. (IIRC, this is because in the C code, fill=FALSE is replaced by fill=INT_MAX or so.)

The only way around this limitation that I can think of is to write a wrapper function that breaks up the matrix or list of vectors into smaller chunks and appends them separately to the output file. I’m planning to add such a function to one of my packages, so I’d be interested if somebody has a better solution.
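A minimal sketch of such a wrapper, assuming the data is a list of equal-length vectors. The function name write_rows_chunked and its rows_per_chunk parameter are hypothetical and not taken from any package. Each chunk goes through its own cat() call, so rows_per_chunk should be chosen such that the output of one call stays well below 2 GiB:

```r
# Hypothetical helper (not from any package): write a list of equal-length
# vectors as separator-delimited rows, a fixed number of rows per cat() call.
write_rows_chunked <- function(rows, file, rows_per_chunk = 10000L, sep = "\t") {
  n <- length(rows)
  if (n == 0L) {
    file.create(file)
    return(invisible(0L))
  }
  ncol <- length(rows[[1]])
  starts <- seq(1L, n, by = rows_per_chunk)
  for (i in seq_along(starts)) {
    idx <- starts[i]:min(starts[i] + rows_per_chunk - 1L, n)
    # Same separator vector that write() would build for ncolumns = ncol.
    cat(unlist(rows[idx]), file = file,
        sep = c(rep.int(sep, ncol - 1L), "\n"),
        append = i > 1L)  # first chunk overwrites, later chunks append
  }
  invisible(n)
}

# Usage on toy data: 25 rows of 4 fields, 10 rows per chunk.
rows <- lapply(1:25, function(i) i * 10 + 1:4)
out <- tempfile(fileext = ".txt")
write_rows_chunked(rows, out, rows_per_chunk = 10L)
lines <- readLines(out)
```

Chunking by whole rows keeps every newline on a row boundary, so the appended pieces join up without any extra bookkeeping.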


On 16 Sep 2014, at 18:54, Maxime Vallee <ValleeM at iarc.fr> wrote:

In my script I have one list of 1,132,533 vectors (each vector contains
381 elements).

When I use "write" to save this list in a flat text file (I unlist my
list, separate by tabs, and set ncol to 381), I end up with a file of
1,132,535 lines (2 additional lines). I checked back: my R list does not
have those two additional items before writing.

With awk, I checked for lines that were not made of 381 fields: there were
two, separated by around 400k lines.
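The same check can also be done from within R. A minimal sketch, assuming a TAB-separated file and using count.fields() from the utils package on a toy file with one deliberately broken row (the file and the expected field count are stand-ins for the real 381-column data):

```r
# Toy file: rows of 3 fields, with one row split in two by a stray newline.
f <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "d\te", "f", "g\th\ti"), f)

expected <- 3L
# quote = "" and comment.char = "" so that quotes and '#' in the data
# are treated as ordinary characters.
n_fields <- count.fields(f, sep = "\t", quote = "", comment.char = "")

# Line numbers whose field count differs from the expected one.
bad_lines <- which(n_fields != expected)
bad_lines
```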

I made sub-files, using those "incomplete" lines as boundaries. My files
are very close in size: 1.9 GB (1971841853 B and 1972614897 B,
respectively). It feels like a 32-bit / 64-bit issue.

My R version is this:
./Rscript -e 'sessionInfo()$platform'
[1] "x86_64-unknown-linux-gnu (64-bit)"

Somewhere around 1.9 GB, something is changing my tabs to
unwanted newlines...
Any idea what might cause this, and whether it looks solvable in R?
