[R] R "write" strange behavior in huge file

Maxime Vallee ValleeM at iarc.fr
Thu Sep 18 16:40:40 CEST 2014


Thank you, it is exactly that.

I followed your idea of chunks (1 GB chunks, to be on the safe side) and appended them. It worked like a charm, thank you.

--Maxime



From: "Stefan Evert (Mailing Lists)" <stefanML at collocations.de>
Date: Wednesday, 17 September 2014 15:39
To: Maxime Vallée <valleem at iarc.fr>
Cc: R-help Mailing List <r-help at r-project.org>
Subject: Re: [R] R "write" strange behavior in huge file

You probably told R to write out the file as a single long line with fields separated alternately by 380 TABs and one newline – that’s what the ncol argument does (write is just a small wrapper around cat()).
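To make the wrapper relationship concrete, here is a small sketch on toy data (3 columns instead of 381; the temporary files and variable names are only for illustration). It writes the same vector once with write() and once with the corresponding cat() call, whose separator vector is recycled as ncolumns - 1 TABs followed by one newline.

```r
# Toy data: 6 values, written as 2 lines of 3 TAB-separated fields.
x <- 1:6
ncolumns <- 3

f_write <- tempfile(fileext = ".txt")
f_cat   <- tempfile(fileext = ".txt")

# write() with an explicit column count and TAB separator
write(x, file = f_write, ncolumns = ncolumns, sep = "\t")

# The corresponding cat() call: the separator vector is recycled, so every
# third separator is a newline instead of a TAB.
cat(x, file = f_cat, sep = c(rep.int("\t", ncolumns - 1), "\n"))

lines_write <- readLines(f_write)
lines_cat   <- readLines(f_cat)
identical(lines_write, lines_cat)
```

From cat()'s point of view the whole output is one stream of values with alternating separators, which is why a limit inside cat() can show up in the middle of what looks like a row.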

cat() doesn’t print lines that are longer than 2 GiB, so it will insert an extra \n after every 2 GiB of data. (IIRC, this is because in the C code, fill=FALSE is replaced by fill=INT_MAX or so.)

The only way around this limitation that I can think of is to write a wrapper function that breaks up the matrix or list of vectors into smaller chunks and appends them separately to the output file. I’m planning to add such a function to one of my packages, so I’d be interested if somebody has a better solution.
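A minimal sketch of such a wrapper could look like the following. The function name write_in_chunks and its chunk_rows argument are hypothetical (they do not come from any package), and the sketch assumes a non-empty list of equal-length vectors. Each call to write() handles only a bounded number of rows, and every call after the first uses append = TRUE.

```r
# Hypothetical helper: name and arguments are illustrative only.
# Writes a list of equal-length vectors as TAB-separated lines, one vector
# per line, passing at most `chunk_rows` rows to each write() call.
write_in_chunks <- function(rows, file, chunk_rows = 10000L, sep = "\t") {
  stopifnot(length(rows) > 0L)
  ncolumns <- length(rows[[1]])
  n <- length(rows)
  starts <- seq(1L, n, by = chunk_rows)
  for (i in seq_along(starts)) {
    idx <- starts[i]:min(starts[i] + chunk_rows - 1L, n)
    # The first chunk creates or truncates the file; later chunks append.
    write(unlist(rows[idx]), file = file, ncolumns = ncolumns,
          sep = sep, append = i > 1L)
  }
  invisible(file)
}

# Usage on toy data: 25 rows of 4 fields, written 10 rows at a time.
rows <- lapply(1:25, function(i) i * 10 + 1:4)
out <- tempfile(fileext = ".tsv")
write_in_chunks(rows, out, chunk_rows = 10L)
fields <- count.fields(out, sep = "\t")
```

For real data, chunk_rows would need to be chosen so that the text produced by one chunk stays well below 2 GiB.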

Best,
Stefan


On 16 Sep 2014, at 18:54, Maxime Vallee <ValleeM at iarc.fr> wrote:

In my script I have one list of 1,132,533 vectors (each vector contains
381 elements).

When I use "write" to save this list in a flat text file (I unlist my
list, separate by tabs, and set ncol to 381), I end up with a file of
1,132,535 lines (2 additional lines). I checked: my R list does not
have those two additional items before writing.

With awk, I looked for lines that did not have 381 fields: there were
two, separated by around 400k lines.
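The same kind of check can also be sketched from within R using count.fields() from the utils package. The toy file below (4 expected fields, one deliberately short line) is made up for illustration; it is not the original data.

```r
# Toy file: three good lines of 4 fields and one short line of 2 fields.
f <- tempfile(fileext = ".tsv")
writeLines(c("a\tb\tc\td", "e\tf", "g\th\ti\tj", "k\tl\tm\tn"), f)

expected <- 4
# Count TAB-separated fields per line, with quoting and comments disabled
# so that every character is treated as data.
n_fields <- count.fields(f, sep = "\t", quote = "", comment.char = "")
bad_lines <- which(n_fields != expected)
bad_lines
```

Here bad_lines holds the line numbers whose field count differs from the expected one.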

I made sub-files, using those "incomplete" lines as boundaries. My files
are very close in size: 1.9 GB (1971841853 B and 1972614897 B,
respectively). It feels like a 32 bit / 64 bit issue.

My R version is this:
./Rscript -e 'sessionInfo()$platform'
[1] "x86_64-unknown-linux-gnu (64-bit)"

Somewhere around the 1.9 GB mark, something is changing my tabs into
unwanted newlines...
Any idea what might cause this, and whether it is solvable in R?

