[BioC] Fastest way to read CSV files
Martin Morgan
mtmorgan at fhcrc.org
Fri Aug 20 15:40:13 CEST 2010
On 08/20/2010 06:26 AM, Stijn van Dongen wrote:
>
> Thanks Misha, that's very instructive.
> I'd like to add that this can be made quite parametrizable, in that it is
> possible to write and read the dimensions of the object as well. In fact, by
> writing some kind of 'cookie' number it would be possible to have code that can
> recognize what *type* of data it needs to read. In the example below however,
> just the dimensions are first written to and then read from file. When reading,
> the dimensions are no longer hardcoded, but read from the same connection.
>
> x <- matrix(floor(runif(1.7e4 * 20)*1000),nr=20)
> cn <- file("test.bin","wb")
> writeBin(dim(x), cn)
> writeBin(as.vector(x), cn)
> close(cn)
>
> cn <- file("test.bin", "rb")
> dims <- readBin(cn, integer(), 2)
> x2 <- matrix(readBin(cn,numeric(), dims[1] * dims[2]), nrow=dims[1], ncol=dims[2])
> close(cn)
>
> sum(x != x2)
>
> a hex dump of the file test.bin gives this for the first line:
>
>         <-int 1-> <-int 2->
> 0000000 0014 0000 4268 0000 0000 0000 c000 4070
>
> indeed, hexadecimal 0x14 == 20 and hexadecimal 4268 == 17000,
> this on a little endian machine.
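The 'cookie' idea could look roughly like the sketch below. The header layout (a magic number, a type code, then the two dimensions), the value of MAGIC, and the function names write_matrix() and read_matrix() are all assumptions made up for illustration, not an established format.

```r
# Hypothetical header layout (an assumption, not a standard):
#   integer 1: magic cookie, integer 2: type code, integers 3-4: dimensions
MAGIC <- 20100820L                     # arbitrary cookie value for this sketch
TYPES <- c(integer = 1L, double = 2L)  # hypothetical type codes

write_matrix <- function(x, path) {
  stopifnot(is.matrix(x), typeof(x) %in% names(TYPES))
  cn <- file(path, "wb")
  on.exit(close(cn))
  writeBin(c(MAGIC, TYPES[[typeof(x)]], dim(x)), cn)  # four-integer header
  writeBin(as.vector(x), cn)                          # payload, column-major
  invisible(path)
}

read_matrix <- function(path) {
  cn <- file(path, "rb")
  on.exit(close(cn))
  hdr <- readBin(cn, integer(), 4)
  if (length(hdr) != 4 || hdr[1] != MAGIC)
    stop("not a file written by write_matrix()")
  what <- names(TYPES)[match(hdr[2], TYPES)]          # type code -> type name
  if (is.na(what)) stop("unknown type code: ", hdr[2])
  n <- as.numeric(hdr[3]) * hdr[4]                    # avoid integer overflow
  matrix(readBin(cn, what, n), nrow = hdr[3], ncol = hdr[4])
}

x <- matrix(floor(runif(1.7e4 * 20) * 1000), nrow = 20)
f <- tempfile(fileext = ".bin")
write_matrix(x, f)
x2 <- read_matrix(f)
sum(x != x2)
```

The reader checks the cookie before trusting anything else in the file, and the type code decides whether the payload is read as integer or double. Like the example above, this writes in the machine's native byte order, so the file is not portable across endianness unless endian= is set explicitly on both sides.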
Maybe worth mentioning save(..., compress=FALSE) / load(), which will be
fast (though not as fast as readBin, and difficult to load parts of the
data) and robust. Also SQL, NetCDF and friends which will be portable /
interoperable.
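A minimal sketch of the save() / load() round trip, assuming a temporary file and a smaller matrix than the one discussed in the thread:

```r
x <- matrix(floor(runif(1.7e4 * 20) * 1000), nrow = 20)
f <- tempfile(fileext = ".RData")    # temporary path used only for this sketch

save(x, file = f, compress = FALSE)  # uncompressed: larger file, less CPU work
x_orig <- x
rm(x)

loaded <- load(f)                    # restores the object under its saved name
stopifnot(identical(loaded, "x"), identical(x, x_orig))
```

Unlike the readBin() approach, the dimensions and type travel with the object, so nothing has to be hardcoded on the reading side. The cost is that load() restores the whole object at once.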
Depending on use case, it can be tricky to get good timings on these
operations -- your OS has probably cached those values when written, so
input seems very fast, whereas when they've been removed from cache the
first access could be considerably slower (order of magnitude is my
casual impression).
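As a rough sketch of why a single timing can mislead: the reads below all follow immediately after the write, so they most likely measure cached access only. The helper read_once() is a name made up for this example, and getting a genuine cold-cache number needs OS-specific steps that are not shown here.

```r
x <- matrix(floor(runif(1.7e4 * 20) * 1000), nrow = 20)
f <- tempfile(fileext = ".bin")
writeBin(as.vector(x), f)            # writeBin accepts a file name as 'con'

# Elapsed wall-clock seconds for one full read of the file
read_once <- function(path, n) {
  system.time(readBin(path, numeric(), n))[["elapsed"]]
}

# Five reads in a row; the file was just written, so these are warm reads
elapsed <- vapply(1:5, function(i) read_once(f, length(x)), numeric(1))
print(elapsed)
```

Reporting several repetitions, and saying whether the file had just been written, makes it easier to compare timings between machines.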
Martin
>
>
> best,
> Stijn
>
>
> On Fri, Aug 20, 2010 at 09:45:14AM +0100, Misha Kapushesky wrote:
>> Hi,
>>
>> If you did do this in binary, we'd see the following:
>>
>>> x <- matrix(floor(runif(1.7e6 * 20)*1000),nr=20)
>>> z <- writeBin(as.vector(x),file("test.bin","wb"))
>>
>>> system.time({zz <- readBin(file("test.bin","rb"),numeric(),20*1700000);
>>> dim(zz) <- c(20,1700000)})
>> user system elapsed
>> 0.171 0.574 0.751
>>
>> So, less than a second to read this in.
>>
>> If you were working in, say, Perl, you could write data like this as
>> follows:
>>
>> open M, ">test2.bin";
>> for($i=0; $i<20*1700000; $i++) {
>> print M pack('i',$i);
>> }
>> close M;
>>
>> and read that file into R as:
>>
>>> system.time({e <- readBin("test2.bin",integer(),20*1700000,size=4);
>> dim(e) <- c(20,1700000)})
>> user system elapsed
>> 0.093 0.273 0.370
>>
>> Even faster, specifying explicitly the int size.
>>
>> --Misha
>>
>> On Thu, 19 Aug 2010, Sean Davis wrote:
>>
>>> On Thu, Aug 19, 2010 at 7:31 PM, Stijn van Dongen <stijn at ebi.ac.uk> wrote:
>>>
>>>>
>>>> This piqued my interest, as for really large datasets it can in general
>>>> speed up things greatly to use binary formats (1.5 million does not sound
>>>> *that* big to me). I have no experience with this in R, but a little
>>>> search brought up e.g. readBin(). So it might be possible, especially if
>>>> your data is quite simple (all integers), to first convert your data
>>>> externally to a binary format (using perl or python or ..) and then read
>>>> it with readBin().
>>>>
>>>> Disclaimer: Quite likely a random thought from an ill-informed bystander.
>>>>
>>>>
>>> Binary is always a good thought, but reading into another language to write
>>> binary to load into R is probably not going to be a big time saver over
>>> using R's capabilities.
>>>
>>>> x=matrix(floor(runif(1.7e6 * 20)*1000),nr=20)
>>>> dim(x)
>>> [1] 20 1700000
>>>> write.table(x,file='abc.txt',sep="\t",col.names=FALSE,row.names=FALSE)
>>>> system.time((y = matrix(scan('abc.txt',what=integer()),nr=20)))
>>> Read 34000000 items
>>> user system elapsed
>>> 17.555 0.685 18.258
>>>> dim(y)
>>> [1] 20 1700000
>>>
>>> So, a 1.7 million column by 20 row table of integers can be read in about
>>> 18 seconds using scan, just to give a rough sketch of profiling results.
>>> You might be able to get close using read.table and setting column classes
>>> appropriately, also.
>>>
>>> Sean
>>>
>>>
>>>> best,
>>>> Stijn
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Aug 19, 2010 at 05:43:22PM -0400, Sean Davis wrote:
>>>>> Try using scan and then rearrange the resulting vector.
>>>>>
>>>>> Sean
>>>>>
>>>>> On Aug 19, 2010 5:32 PM, "Gaston Fiore" <gaston.fiore at gmail.com> wrote:
>>>>>
>>>>> Hello everyone,
>>>>>
>>>>> Is there a faster method to read CSV files than the read.csv function?
>>>>> I've CSV files containing a rectangular array with about 17 rows and 1.5
>>>>> million columns with integer entries, and read.csv is being too slow for
>>>>> my needs.
>>>>>
>>>>> Thanks for your help,
>>>>>
>>>>> -Gaston
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>> --
>>>> Stijn van Dongen >8< -o) O< forename pronunciation:
>>>> [Stan]
>>>> EMBL-EBI /\\ Tel: +44-(0)1223-492675
>>>> Hinxton, Cambridge, CB10 1SD, UK _\_/ http://micans.org/stijn
>>>>
>>>
>>>
>
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793