[R] Running *slow*

R. Michael Weylandt michael.weylandt at gmail.com
Thu Oct 6 18:41:45 CEST 2011


Patrick is right: most of the time is probably taken up for the
reasons documented in the (masterful) R Inferno, namely the repeated
rbind() calls, each of which grows ltable by one row.
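
To make the cost of growing an object concrete, here is a minimal
sketch that compares appending with rbind() inside a loop against
filling a preallocated matrix. The function names and the size n are
my own, chosen only for illustration; they are not from Thomas's code,
and the timings will vary by machine.

```r
# Hypothetical helpers for illustration only; n is kept small so this runs quickly.
grow_with_rbind <- function(n) {
  out <- NULL
  for (i in seq_len(n)) {
    out <- rbind(out, i)  # builds a new, larger matrix on every pass
  }
  out
}

fill_preallocated <- function(n) {
  out <- matrix(NA_integer_, nrow = n, ncol = 1)  # allocate once, up front
  for (i in seq_len(n)) {
    out[i, 1] <- i        # assign into the existing matrix
  }
  out
}

n <- 5000
system.time(a <- grow_with_rbind(n))
system.time(b <- fill_preallocated(n))

# Same contents either way (rbind() adds row names, so drop them first)
all.equal(unname(a), b)
```

Both versions still loop in R, so neither is fast; the point is only
that the rbind() version gets disproportionately slower as n grows.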

There is another problem though and it gets at the very core of R, and
for that matter, all interpreted languages that I'm familiar with.
I'll give a fairly elementary explanation and gloss over many of the
subtleties that R core worries about so we mere mortals don't have to.

At the end of the day, everything is looped; there's no way to get
around it. However, from a code perspective we have a choice of
looping in C or in R. Whenever possible it is better to loop in C than
in R, and most of the key built-in functions, like unique(), are
designed to do just that. The reason is pretty straightforward:
consider what has to happen to run a loop in R:

Iterator is defined: a sequence of C calls starts this
First line of loop is hit -> interpreted by R -> sent to C code ->
executed -> changed back into an R result -> passed to the next line
of the loop
Iterator is increased: C again
Second line of loop is hit -> interpreted by R -> sent to C code ->
executed -> changed back into an R result -> passed to the next line
of the loop
etc.

Complicated and/or multiple lines of code only compound the problem
because you have to go up and down multiple times at each iteration.

Looping on the C level gets rid of all those "translations" between C
and R, save two (one on the way in and one on the way out), and
thereby mightily increases efficiency. Hence, even if you are using
the same (or heaven forbid a faster!) algorithm on the R level, it can
look super slow because of all the moving up and down the ladder. I
don't know how unique.C is implemented, but my guess is that it's more
or less like what you have now, with more efficient memory
usage/preallocation; it just looks *much* faster because of the C
architecture.
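
A rough way to see the difference is to do the same job twice, once
with an explicit R loop and once with a built-in that loops in
compiled code. This is only a sketch: sum_in_r is a made-up name, the
vector size is arbitrary, and the size of the gap depends on your
machine and R version (newer versions byte-compile loops, which
narrows it).

```r
set.seed(1)
x <- runif(1e6)

# Hypothetical helper: add up a vector one element at a time in R
sum_in_r <- function(v) {
  total <- 0
  for (val in v) {
    total <- total + val  # every pass goes through the R evaluator
  }
  total
}

system.time(s1 <- sum_in_r(x))  # loop runs at the R level
system.time(s2 <- sum(x))       # one call; the loop runs inside C

# Same answer up to floating-point rounding
all.equal(s1, s2)
```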

DISCLAIMER: there are quite a few inaccuracies in here, most small,
maybe a few large, and I am probably aware of only a small fraction
of them; this wasn't intended to be a super accurate explanation.

On another note, I should explain my solution a little more clearly.

A straight call to unique() on a matrix would check for unique ROWS,
not unique values of x. I take x, make a copy so as not to harm the
original object, strip it of its dimensionality (thereby converting it
to a vector efficiently), and then apply unique(), which will now find
unique values. It's not a huge thing, but not immediately apparent
from what I did.
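
Spelled out on a toy example, the steps just described look roughly
like this. The small matrix x is a made-up stand-in for Dat, and the
variable names are only for illustration. It assumes x is a matrix; if
Dat is a data frame, it would need converting first (for instance with
as.matrix()).

```r
# Toy stand-in for Dat: 4 rows, 2 columns, with values repeated across columns
x <- matrix(c(1, 2, 2, 3,
              3, 4, 1, 5), ncol = 2)

# On a matrix, unique() compares whole rows, so all 4 distinct rows are kept
unique(x)

y <- x           # work on a copy so x keeps its shape
dim(y) <- NULL   # dropping the dim attribute leaves a plain vector
ltable <- unique(y)
ltable
```

The values come out in order of first appearance, reading down the
first column and then the second.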

Hope this helps,

Michael


On Thu, Oct 6, 2011 at 11:59 AM, Patrick Burns <pburns at pburns.seanet.com> wrote:
> Probably most of the time you're waiting
> for this you are in Circle 2 of 'The R
> Inferno'.  If the values are numbers,
> you might also be in Circle 1.
>
> On 06/10/2011 13:37, Thomas wrote:
>>
>> Anyone got any hints on how to make this code more efficient? An early
>> version (which to be fair did more than this one does) ran for 330 hours
>> and produced no output.
>>
>> I have a two column table, Dat, with 12,000,000 rows and I want to
>> produce a lookup table, ltable, in a 1 dimensional matrix with one copy
>> of each of the values in Dat:
>>
>> for (i in 1:nrow(Dat))
>> {
>> for (j in 1:2)
>> {
>> #If next value is already in ltable, do nothing
>> if (is.na(match(Dat[i,j], ltable))){ltable <- rbind(ltable,Dat[i,j])}
>> }
>> }
>>
>> but it takes forever to produce anything.
>>
>> Any advice gratefully received.
>>
>> Thomas
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> --
> Patrick Burns
> pburns at pburns.seanet.com
> twitter: @portfolioprobe
> http://www.portfolioprobe.com/blog
> http://www.burns-stat.com
> (home of 'Some hints for the R beginner'
> and 'The R Inferno')
>


