[BioC] Fastest way to read CSV files

Fri Aug 20 10:45:14 CEST 2010

Hi,

If you did this in binary, you'd see the following:

> x <- matrix(floor(runif(1.7e6 * 20)*1000),nr=20)
> con <- file("test.bin","wb"); writeBin(as.vector(x),con); close(con)

> system.time({zz <- readBin(file("test.bin","rb"),numeric(),20*1700000); dim(zz) <- c(20,1700000)})
    user  system elapsed
   0.171   0.574   0.751

So, less than a second to read this in.
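
A minimal sketch of the same round trip at a smaller scale, with the connections closed explicitly and a check that the data survives unchanged. The size of 1000 columns and the temporary file are assumptions made for illustration, not part of the timing above.

```r
# Small-scale round trip: write a numeric matrix as raw doubles, read it back.
set.seed(1)
nr <- 20L
nc <- 1000L   # hypothetical small size; the timing above used 1.7e6 columns
x <- matrix(floor(runif(nr * nc) * 1000), nrow = nr)

path <- tempfile(fileext = ".bin")   # illustrative temporary file

# Write: 8 bytes per value, column-major order, no header.
con <- file(path, "wb")
writeBin(as.vector(x), con)
close(con)

# Read: the element count must be supplied, because the file stores
# neither its length nor the matrix dimensions.
con <- file(path, "rb")
zz <- readBin(con, what = numeric(), n = nr * nc)
close(con)
dim(zz) <- c(nr, nc)

# The values come back bit-for-bit identical.
stopifnot(identical(zz, x))
```

Since the file carries no dimensions, the reader has to know the row and column counts from somewhere else.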

If you were working in, say, Perl, you could write data like this as 
follows:

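# Write 20*1700000 native signed integers to a binary file.
open(my $fh, '>', 'test2.bin') or die "Cannot open test2.bin: $!";
binmode($fh);   # avoid newline translation on platforms that do it
for (my $i = 0; $i < 20*1700000; $i++) {
   print $fh pack('i', $i);
}
close($fh);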

and read that file into R as:

> system.time({e <- readBin("test2.bin",integer(),20*1700000,size=4); dim(e) <- c(20,1700000)})
    user  system elapsed
   0.093   0.273   0.370

Even faster: with the integer size given explicitly, these are 4-byte
integers rather than 8-byte doubles, so there is half as much data to read.
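
The integer file can also be written from R itself. The following is only a sketch under some assumptions: the small matrix size, the temporary files and the explicit endian = "little" are my choices for illustration. It also reads the same data back from a tab-separated text file for comparison. Two details of the text path: what = integer() is used because a quoted 'integer' is a character string and makes scan() return character data, and byrow = TRUE is needed because write.table() writes one matrix row per line.

```r
# Integer variant: 4 bytes per value instead of 8.
set.seed(1)
nr <- 20L
nc <- 1000L   # hypothetical small size for illustration
x <- matrix(as.integer(floor(runif(nr * nc) * 1000)), nrow = nr)

bin_path <- tempfile(fileext = ".bin")   # illustrative temporary files
txt_path <- tempfile(fileext = ".txt")

# Binary write with an explicit element size and byte order.
con <- file(bin_path, "wb")
writeBin(as.vector(x), con, size = 4L, endian = "little")
close(con)

# Binary read; size and endian must match what was written.
e <- readBin(bin_path, what = integer(), n = nr * nc,
             size = 4L, endian = "little")
dim(e) <- c(nr, nc)

# Text path for comparison: one matrix row per line, so fill by row.
write.table(x, file = txt_path, sep = "\t",
            col.names = FALSE, row.names = FALSE)
y <- matrix(scan(txt_path, what = integer(), quiet = TRUE),
            nrow = nr, byrow = TRUE)

# Both routes reproduce the original integer matrix.
stopifnot(identical(e, x), identical(y, x))
```

When the file comes from another program, such as the Perl script above, the byte order is whatever that program wrote; readBin() defaults to the platform's native order.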

--Misha

On Thu, 19 Aug 2010, Sean Davis wrote:

> On Thu, Aug 19, 2010 at 7:31 PM, Stijn van Dongen <stijn at ebi.ac.uk> wrote:
>
>>
>> This piqued my interest, as for really large datasets it can in general
>> speed up things greatly to use binary formats (1.5 million does not sound
>> *that* big to me). I have no experience with this in R, but a little
>> search brought up e.g. readBin(). So it might be possible, especially if
>> your data is quite simple (all integers), to first convert your data
>> externally to a binary format (using perl or python or ..) and then read
>> it with readBin().
>>
>> Disclaimer: Quite likely a random thought from an ill-informed bystander.
>>
>>
> Binary is always a good thought, but reading into another language to write
> binary to load into R is probably not going to be a big time saver over
> using R's capabilities.
>
> > x=matrix(floor(runif(1.7e6 * 20)*1000),nr=20)
> > dim(x)
> [1]      20 1700000
> > write.table(x,file='abc.txt',sep="\t",col.names=FALSE,row.names=FALSE)
> > system.time((y = matrix(scan('abc.txt',what='integer'),nr=20)))
> Read 34000000 items
>   user  system elapsed
> 17.555   0.685  18.258
> > dim(y)
> [1]      20 1700000
>
> So, a 1.7 million column by 20 row table of integers can be read in about 18
> seconds using scan, just to give a rough sketch of profiling results.  You
> might be able to get close using read.table and setting column classes
> appropriately, also.
>
> Sean
>
>
>> best,
>> Stijn
>>
>>
>>
>>
>> On Thu, Aug 19, 2010 at 05:43:22PM -0400, Sean Davis wrote:
>>> Try using scan and then rearrange the resulting vector.
>>>
>>> Sean
>>>
>>> On Aug 19, 2010 5:32 PM, "Gaston Fiore" <gaston.fiore at gmail.com> wrote:
>>>
>>> Hello everyone,
>>>
>>> Is there a faster method to read CSV files than the read.csv function?
>>> I have CSV files containing a rectangular array with about 17 rows and
>>> 1.5 million columns with integer entries, and read.csv is too slow for
>>> my needs.
>>>
>>> Thanks for your help,
>>>
>>> -Gaston
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> --
>> Stijn van Dongen         >8<        -o)   O<  forename pronunciation: [Stan]
>> EMBL-EBI                            /\\   Tel: +44-(0)1223-492675
>> Hinxton, Cambridge, CB10 1SD, UK   _\_/   http://micans.org/stijn