[R] Can R handle medium and large size data sets?

Wed Jan 25 10:16:36 CET 2006

>>>>> "Martin" == Martin Lam <tmlammail at yahoo.com>
>>>>>     on Tue, 24 Jan 2006 12:13:07 -0800 (PST) writes:

    Martin> Dear Gueorgui,

    >> Is it true that R generally cannot handle medium sized
    >> data sets (a couple of hundred thousand observations)
    >> and therefore large data sets (a couple of million
    >> observations)?

    Martin> It depends on what you want to do with the data sets.
    Martin> Loading the data sets shouldn't be any problem I
    Martin> think. But using the data sets for analysis using self
    Martin> written R code can get (very) slow,  since R is an
    Martin> interpreted language (correct me if I'm wrong).

(Since you asked for it ;-) )
Yes, you are wrong to quite some extent (you are partially
right, too):  Of course one *can* write  ``self written R code''
that is very slow, and yes, we have seen such code more than
once.  However, 98% of the problems {never trust a statistic
unless you made it up ... :-) :-) } are relatively easily
solvable very efficiently with R.
You are right that it is easier to write slow code in an
interpreted language than in a compiled one.
E.g., not making use of vectorized operations in R is one famous
recipe to produce slow code pretty successfully ...
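
To make the point about vectorization concrete, here is a minimal
sketch (the function names slow_sumsq and fast_sumsq and the test
vector are made up for illustration, they are not from any package).
Both functions compute the same sum of squares; the first does it
element by element in an interpreted loop, the second with a single
vectorized call.  The actual timings will depend on your machine and
R version, so none are quoted here.

```r
# Hypothetical example data: 100,000 numbers in (0, 1]
n <- 1e5
x <- seq_len(n) / n

# Loop version: one interpreted step per element
slow_sumsq <- function(x) {
  s <- 0
  for (i in seq_along(x)) {
    s <- s + x[i]^2
  }
  s
}

# Vectorized version: squaring and summing both run in compiled code
fast_sumsq <- function(x) sum(x^2)

# Both give the same answer (up to floating point tolerance)
stopifnot(isTRUE(all.equal(slow_sumsq(x), fast_sumsq(x))))

# Compare the elapsed times on your own machine
print(system.time(slow_sumsq(x)))
print(system.time(fast_sumsq(x)))
```
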

    Martin>  To increase speed you will often need to experiment with
    Martin> the R code. For example, what I've noticed is that
    Martin> processing data sets as matrices works much faster
    Martin> than data.frame().

yes, indeed;  see also the other answers to Gueorgui's question.
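
A small sketch of the matrix vs. data.frame difference, assuming
purely numeric data (the object names m, df and the helper row_loop
are invented for this illustration).  The same element-wise loop is
run once on a matrix and once on the equivalent data frame; indexing
single cells of a data frame goes through more R-level dispatch, so
it is typically the slower of the two, but please measure on your
own setup rather than take that on faith.

```r
set.seed(1)

# Hypothetical data: 2000 rows, 5 numeric columns, stored both ways
m  <- matrix(rnorm(2000 * 5), ncol = 5)
df <- as.data.frame(m)

# Adds the first two columns row by row, using [i, j] indexing,
# which works for matrices and data frames alike
row_loop <- function(obj) {
  out <- numeric(nrow(obj))
  for (i in seq_len(nrow(obj))) {
    out[i] <- obj[i, 1] + obj[i, 2]
  }
  out
}

t_mat <- system.time(r_mat <- row_loop(m))
t_df  <- system.time(r_df  <- row_loop(df))

# Same result from both representations
stopifnot(isTRUE(all.equal(r_mat, r_df)))

print(t_mat)
print(t_df)
```

Of course, for this particular task the vectorized m[, 1] + m[, 2]
avoids the loop altogether and is the better choice in either case.
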

    Martin> Writing your code in C(++), compiling it and including
    Martin> it in your R code is often the best way.

    Martin> HTH,

    Martin> Martin