[R] Processing large datasets
Steve Lianoglou
mailinglist.honeypot at gmail.com
Wed May 25 16:00:31 CEST 2011
Hi,
On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko <roman at bestroman.com> wrote:
> Hi R list,
>
> I'm new to R, so I'd like to ask about its capabilities.
> What I'm looking to do is to run some statistical tests on quite big
> tables which are aggregated quotes from a market feed.
>
> This is a typical set of data.
> Each day contains millions of records (up to 10 million unfiltered).
>
> 2011-05-24 750 Bid DELL 14130770 400 15.4800 BATS 35482391 Y 1 1 0 0
> 2011-05-24 904 Bid DELL 14130772 300 15.4800 BATS 35482391 Y 1 0 0 0
> 2011-05-24 904 Bid DELL 14130773 135 15.4800 BATS 35482391 Y 1 0 0 0
>
> I'll need to filter it first based on some criteria.
> Since I keep it in a MySQL database, this can be done with a query. It's
> not super efficient; I've checked that already.
>
> Then I need to aggregate the dataset into different time frames (time is
> represented in ms from midnight, like 35482391).
> Again, this can be done with a database query; I'm not sure which is
> going to be faster.
> The aggregated tables are going to be much smaller, on the order of
> thousands of rows per observation day.
>
> Then calculate basic statistics: mean, standard deviation, sums, etc.
> After stats are calculated, I need to perform some statistical
> hypothesis tests.
>
> So, my question is: which tool is faster for data aggregation and
> filtering on big datasets: MySQL or R?
Why not try a few experiments and see for yourself -- I guess the
answer will depend on what exactly you are doing.
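
As a rough starting point for such an experiment, here is a minimal sketch
in base R that times a filter-then-aggregate step on simulated quotes. The
column names (time_ms, side, size, price), the value ranges, and the
one-minute bucket width are all assumptions made for illustration, not your
actual schema; scale n toward your real daily volume and compare the timing
against the equivalent MySQL query.

```r
# Simulated quotes; every column name and value range here is an assumption.
set.seed(1)
n <- 1e5  # increase toward your real daily row count
quotes <- data.frame(
  time_ms = sample.int(86400000L, n, replace = TRUE) - 1L,  # ms from midnight
  side    = sample(c("Bid", "Ask"), n, replace = TRUE),
  size    = sample(seq(100L, 1000L, by = 100L), n, replace = TRUE),
  price   = round(15.48 + rnorm(n, sd = 0.05), 4)
)

elapsed <- system.time({
  bids <- quotes[quotes$side == "Bid", ]        # filter step
  bids$minute <- bids$time_ms %/% 60000L        # 1-minute buckets
  per_minute <- aggregate(price ~ minute, data = bids, FUN = mean)
})
print(elapsed)
```

Keep in mind that this only measures the in-memory work; the time needed to
load the data into R would have to be counted separately.
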
If your datasets are *really* huge, check out some packages listed
under the "Large memory and out-of-memory data" section of the
"HighPerformanceComputing" task view at CRAN:
http://cran.r-project.org/web/views/HighPerformanceComputing.html
Also, if you find yourself needing to do lots of
"grouping/summarizing" type of calculations over large data frame-like
objects, you might want to check out the data.table package:
http://cran.r-project.org/web/packages/data.table/index.html
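
To give an idea of what that kind of grouping looks like, here is a minimal
sketch with data.table. The rows are made-up toy values loosely modelled on
your sample, and the column names (symbol, time_ms, size, price) are my
guesses at what your fields mean, so treat them as assumptions.

```r
library(data.table)

# Toy data for illustration only; column names and values are assumptions.
quotes <- data.table(
  symbol  = c("DELL", "DELL", "DELL", "DELL"),
  time_ms = c(35482391L, 35482950L, 35541002L, 35543500L),  # ms from midnight
  size    = c(400L, 300L, 135L, 200L),
  price   = c(15.48, 15.48, 15.49, 15.47)
)

# Bucket each quote into a 1-minute time frame.
quotes[, minute := time_ms %/% 60000L]

# Summary statistics per symbol and time frame.
stats <- quotes[, list(n          = .N,
                       mean_price = mean(price),
                       sd_price   = sd(price),
                       total_size = sum(size)),
                by = list(symbol, minute)]
print(stats)
```

Changing the divisor (60000L here) changes the width of the time frame, so
the same call can be reused for other aggregation windows.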
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact