[R] Slow reading multiple tick data files into list of dataframes
jim holtman
jholtman at gmail.com
Tue Oct 12 02:30:17 CEST 2010
For a file of about 200,000 rows, it took about 2 seconds to read it in on my system:
> system.time(x <- read.table('/recv/test.txt', as.is=TRUE))
user system elapsed
1.92 0.08 2.08
> str(x)
'data.frame': 196588 obs. of 7 variables:
$ V1: int 1 2 3 4 1 2 3 1 2 3 ...
$ V2: chr "bid" "ask" "ask" "bid" ...
$ V3: chr "CON" "CON" "CON" "CON" ...
$ V4: chr "09:30:00.722" "09:30:00.782" "09:30:00.809" "09:30:00.783" ...
$ V5: chr "09:30:00.722" "09:30:00.810" "09:30:00.810" "09:30:00.810" ...
$ V6: num 32.7 33.1 33.1 33.1 32.7 ...
$ V7: int 98 300 414 200 98 300 414 98 300 414 ...
> object.size(x)
6291928 bytes
>
Given that you have about 85 files, I would guess that you would need
about 800MB if all were 300K lines long. You might be getting memory
fragmentation. You might try calling gc() every so often in the loop.
What are you going to do with the data? Are you going to make one big
file? In that case you might want a 64-bit version, since you will
have a single instance of about 800MB and will probably need 2-3X that
much memory if copies are being made during processing. Objects may
also be larger in 64-bit.
Maybe you need to follow Gabor's advice and read it into a database
and then process it from there.
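One thing that usually helps with read.table/read.csv on large files is
pre-declaring the column types with colClasses, which skips the
type-guessing pass. A minimal sketch along those lines, combined with
the gc() suggestion above; the temp file here just stands in for one of
the 85 real nbbo files, and the column order is assumed from the sample
posted below:

```r
## Stand-in for one of the real nbbo csv files (layout assumed from the
## sample in the original post).
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,bid,CON,09:30:00.722,09:30:00.722,32.71,98",
             "2,ask,CON,09:30:00.782,09:30:00.810,33.14,300"), tmp)

## Pre-declare column types so read.csv does not have to guess them.
cls <- c("integer", "character", "character",
         "character", "character", "numeric", "integer")

nbbo.list <- lapply(c(tmp), function(f) {
  df <- read.csv(f, header = FALSE, colClasses = cls)
  gc()   # encourage the collector between large reads
  df
})
str(nbbo.list[[1]])
```

In the real loop you would pass your filePath vector instead of c(tmp);
adding nrows (an upper bound on the row count) can also save some
reallocation.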
On Mon, Oct 11, 2010 at 5:48 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> On Mon, Oct 11, 2010 at 5:39 PM, rivercode <aquanyc at gmail.com> wrote:
>>
>> Hi,
>>
>> I am trying to find the best way to read 85 tick data files of format:
>>
>>> head(nbbo)
>> 1 bid CON 09:30:00.722 09:30:00.722 32.71 98
>> 2 ask CON 09:30:00.782 09:30:00.810 33.14 300
>> 3 ask CON 09:30:00.809 09:30:00.810 33.14 414
>> 4 bid CON 09:30:00.783 09:30:00.810 33.06 200
>>
>> Each file has between 100,000 to 300,300 rows.
>>
>> Currently doing nbbo.list <- lapply(filePath, read.csv) to create a list
>> with 85 data.frame objects...but it is taking minutes to read in the data
>> and afterwards I get the following message on the console when taking
>> further actions (though it does then stop):
>>
>> The R Engine is busy. Please wait, and try your command again later.
>>
>> filePath in the above example is a vector of filenames:
>>> head(filePath)
>> [1] "C:/work/A/A_2010-10-07_nbbo.csv"
>> [2] "C:/work/AAPL/AAPL_2010-10-07_nbbo.csv"
>> [3] "C:/work/ADBE/ADBE_2010-10-07_nbbo.csv"
>> [4] "C:/work/ADI/ADI_2010-10-07_nbbo.csv"
>>
>> Is there a better/quicker or more R way of doing this ?
>>
>
> You could try (possibly with suitable additional arguments):
>
> library(sqldf)
> lapply(filePath, read.csv.sql)
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?