[R] reading data saved with writeBin() into anything other than R
Mike Miller
mbmiller+l at gmail.com
Tue Apr 22 02:59:47 CEST 2014
After saving a file like so...
con <- gzcon("file.gz", "wb"))
writeBin(vector, con, size=2)
close(con)
I can read it back into R like so...
con <- gzcon("file.gz", "rb"))
vector <- readBin(con, integer(), 48000000, size=2, signed=FALSE)
close(con)
...and I'm wondering what other programs might be able to read in these
data. It seems very straightforward: when I store 5436 integers for
each of 7696 subjects, at two bytes per integer that ought to be
5436*7696*2 = 83670912 bytes, and it is exactly that:
$ zcat file.gz | wc -c
83670912
So if I just convert every pair of bytes to an integer, that should do
it (see the sketch below). I stored the data this way because it was
compact, but this scheme can also work well when other software needs
to read the data.
For me, that other software would probably be Octave. I'd be interested
to hear whether anyone here has read these files into Octave, a C
program, or anything else. If I don't get a good answer here, I'll try
the Octave list, and I'll send my best answers here.
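One detail I should flag for anyone trying this from Octave or C:
readBin() and writeBin() default to the platform's native byte order
(.Platform$endian), so a file written on one architecture might not
read correctly on another. Passing endian explicitly pins the format
down. A sketch with the same calls as above:

con <- gzcon(file("file.gz", "wb"))
writeBin(vector, con, size = 2, endian = "little")
close(con)

con <- gzcon(file("file.gz", "rb"))
vector <- readBin(con, integer(), 48000000, size = 2, signed = FALSE,
                  endian = "little")
close(con)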
The rest of this is some related info for readers of this list. You don't
need to read below to answer my question above. Thanks.
In case anyone is interested, I did some comparisons of loading speed
and file size for a number of ways of storing my data. The data all
consist of positive numbers between 0 and 2, with three digits to the
right of the decimal, so I can save them as double-precision floating
point, or multiply by 1000 and store them as integers (see the sketch
after the file-size list). The test here was for a matrix of
5000 x 7845 = 39,225,000 values. These are the file sizes:
202.1 MB  tab-delimited text file, original, uncompressed
 29.9 MB  tab-delimited text file, original, gzip compressed
187.7 MB  tab-delimited text file, integers, uncompressed
 24.6 MB  tab-delimited text file, integers, gzip compressed
 38.9 MB  R save() original numeric values (doubles)
 27.0 MB  R save() integers
 19.7 MB  R writeBin() 16-bit integer gzipped
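For completeness, the winning method amounts to the following sketch,
where D stands in for my real numeric matrix (the names here are just
placeholders). Note the transpose: R stores matrices column by column,
so writing t(D) puts each row's values consecutively, which is what the
byrow=TRUE in my readBin() examples below expects.

ints <- as.integer(round(t(D) * 1000))   # values 0..2000 fit in 16 bits
con <- gzcon(file("D000_test.gz", "wb"))
writeBin(ints, con, size = 2)
close(con)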
So, for file size (important in my case), the gzipped writeBin() method
storing 16-bit integers was the winner. Impressively, storing the data
that way and dividing by 1000 on the fly to return the original numbers
was faster than reading an Rdata file of the matrix:
The integer text file:
> system.time( D <- matrix( scan( file = "D/D000", what=integer(0) ), ncol=7845, byrow=TRUE ) )
Read 39225000 items
   user  system elapsed
 10.626   0.344  10.971
The R save() original numeric values (doubles):
> system.time( load("D000_test.Rdata") )
   user  system elapsed
  5.579   0.119   5.698
The R save() integers:
> system.time( load("D000_test.Rdata") )
   user  system elapsed
  4.863   0.050   4.913
The writeBin() 16-bit integer gzipped file:
> con <- gzcon(file("D000_test.gz", "rb"))
> system.time( D <- matrix( readBin( con, integer(), 7845*5000, size=2, signed=FALSE ), ncol=7845, byrow=TRUE ) )
   user  system elapsed
  3.769   0.138   3.906
> close(con)
The writeBin() 16-bit integer gzipped file, converted to numeric by
dividing by 1000 on the fly:
> con <- gzcon(file("D000_test.gz", "rb"))
> system.time( D <- matrix( readBin( con, integer(), 7845*5000, size=2, signed=FALSE ), ncol=7845, byrow=TRUE )/1000 )
   user  system elapsed
  4.159   0.237   4.397
> close(con)
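One last remark: dividing by 1000 should recover the original doubles
exactly, at least when they were originally parsed from three-decimal
text, because the rounded integer divided by 1000 is again the closest
double to the three-decimal value. A toy check (made-up numbers, not my
data):

x <- c(0, 0.001, 1.234, 2)
ints <- as.integer(round(x * 1000))
identical(ints / 1000, x)   # TRUE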
Best,
Mike
--
Michael B. Miller, Ph.D.
Minnesota Center for Twin and Family Research
Department of Psychology
University of Minnesota
http://scholar.google.com/citations?user=EV_phq4AAAAJ