[R] Improve code efficiency with do.call, rbind and split construction

Jun Shen jun.shen.ut at gmail.com
Fri Sep 2 20:37:26 CEST 2016


Hi Bert,

This is the best method I have seen this year! The do.call/rbind approach
has just gone to the museum :)

It took ~30 seconds to get the results. You deserve a medal!!!!

Jun

On Fri, Sep 2, 2016 at 1:51 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:

> This is the sort of thing that dplyr or the data.table packages can
> probably do elegantly and efficiently. So you might consider looking
> at them. But as I use neither, let me suggest a base R solution. As
> you supplied no data for a reproducible example, I'll make up my own
> and hopefully I have understood you correctly. If not, maybe someone
> else will get it straight. Anyway...
>
> The "trick" is to use tapply() to select the necessary row indices of
> your data frame and forget about all the do.call and rbind stuff. e.g.
>
> > set.seed(1001)
> > df <- data.frame(f = factor(sample(LETTERS[1:4], 100, replace = TRUE)),
> +                  g = factor(sample(letters[1:6], 100, replace = TRUE)),
> +                  y = runif(100))
> >
> > ix <- seq_len(nrow(df))   # row numbers 1..nrow(df)
> >
> > ## last row number within each f-by-g combination
> > ix <- with(df, tapply(ix, list(f, g), function(x) x[length(x)]))
> > ix
>    a  b   c  d  e  f
> A 94 69 100 59 80 87
> B 89 57  65 90 75 88
> C 85 92  86 95 97 62
> D 47 73  72 74 99 96
>
> ## ix can now be used as an index into df as:
> df[ix,]
>
> This should help somewhat, but you still have to contend with the
> tapply() loop at the interpreted level. I'll leave speed comparisons
> to you.
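
A minimal sketch, not from the original message, checking on made-up data that the tapply() index approach returns the same rows as the do.call/rbind/split construction. The data frame `simout` and its sizes are hypothetical; only the column names SID and DOSENO come from the thread.

```r
# Hypothetical stand-in for the poster's data: 50 subjects, 4 doses,
# 3 records per (SID, DOSENO) group, rows deliberately shuffled.
set.seed(1)
simout <- expand.grid(SID = 1:50, DOSENO = 1:4, REC = 1:3)
simout <- simout[sample(nrow(simout)), ]
rownames(simout) <- NULL
simout$y <- runif(nrow(simout))

# Original construction: split, take the last row of each piece, rbind.
slow <- do.call(rbind, lapply(split(simout, simout[c("SID", "DOSENO")]),
                              function(x) x[nrow(x), ]))

# tapply() construction: find the last row index per group, then index once.
ix <- seq_len(nrow(simout))
last_ix <- with(simout, tapply(ix, list(SID, DOSENO), function(x) x[length(x)]))
last_ix <- as.vector(last_ix)          # column-major: SID varies fastest
last_ix <- last_ix[!is.na(last_ix)]    # empty SID/DOSENO cells come back as NA
fast <- simout[last_ix, ]

# Same rows in the same order (split() also varies the first factor fastest).
same_rows <- all.equal(unname(as.matrix(slow)), unname(as.matrix(fast)))
```

Dropping the NA entries matters on real data, where some SID/DOSENO combinations may not occur; indexing a data frame with NA would otherwise produce all-NA rows.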
>
> Cheers,
> Bert
>
> ## Note: if, in fact, your data frame is arranged in a regular way
> with, e.g. your SID, DOSENO groups all of the same size and together,
> then you can calculate the indices you want directly and skip the
> tapply business. I'm assuming this is not the case... Again, no data...
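
The direct calculation mentioned in the note could look like the sketch below. It assumes, hypothetically, that every SID/DOSENO group has exactly k rows stored together; the data frame `d` and the value of k are made up for illustration.

```r
# Hypothetical regular layout: 5 SIDs x 4 DOSENOs, each group occupying
# k = 3 consecutive rows (expand.grid varies its first argument fastest).
k <- 3
d <- expand.grid(REC = seq_len(k), DOSENO = 1:4, SID = 1:5)
d$y <- seq_len(nrow(d))

# The last row of each group is then every k-th row, so no grouping is needed.
last_ix <- seq(from = k, to = nrow(d), by = k)
last_rows <- d[last_ix, ]
```

This only holds if the layout really is regular; a single group with a different number of rows would shift every later index.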
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Fri, Sep 2, 2016 at 10:02 AM, Jun Shen <jun.shen.ut at gmail.com> wrote:
> > Dear list,
> >
> > I have the following line of code to extract the last line of the split
> > data and put them back together.
> >
> > do.call(rbind, lapply(split(simout.s1, simout.s1[c('SID','DOSENO')]), function(x) x[nrow(x),]))
> >
> > The problem is that when I have a huge dataset, it takes too long to run
> > (actually it has been more than 3 hours and it's still running).
> >
> > The dataset is pretty big. I have 200,000 unique SID and 4 DOSENO, so
> > 800,000 split data sets in total. Is there any way to speed it up? Thanks.
> >
> > Jun
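
For reference, a vectorized base R sketch that is not part of the thread: duplicated() with fromLast = TRUE flags every row except the last one of each SID/DOSENO combination, so negating it keeps the last rows without an R-level loop over groups. The data frame `simout` below is hypothetical, and the result stays in original row order rather than group order.

```r
# Hypothetical data with irregular group sizes.
set.seed(2)
simout <- data.frame(SID    = sample(1:20, 200, replace = TRUE),
                     DOSENO = sample(1:4, 200, replace = TRUE),
                     y      = runif(200))

# TRUE for the last occurrence of each (SID, DOSENO) pair, FALSE otherwise.
keep <- !duplicated(simout[c("SID", "DOSENO")], fromLast = TRUE)
last_rows <- simout[keep, ]
```

Whether this is faster than the tapply() approach on 800,000 groups is not something the thread measures; it would need timing on the real data.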
>