[R] merging pre-sorted data frames

Mike Miller mbmiller+l at gmail.com
Thu Jan 15 04:17:40 CET 2015


Thanks, Jeff.  You really know the packages.  I searched, but I guess I 
didn't use the right terms.  That package seems to do exactly what I 
wanted.

Mike


On Tue, 13 Jan 2015, Jeff Newmiller wrote:

> On Tue, 13 Jan 2015, Mike Miller wrote:
>
>> I have many pairs of data frames.  Each data frame has about 15 million 
>> records, and the two in a pair have about 10 million records in common.  
>> They are sorted by two of their fields and will be merged by those same 
>> fields.
>> 
>> The fact that the data are sorted could be used to greatly speed up a 
>> merge, but I have the impression that merge() cannot "know" in advance that 
>> the fields are already sorted.
>
> There are different versions of "merge". This sounds like a job for the 
> data.table package, which has its own way of doing merges that is likely to 
> be useful here. However, be warned that data.table takes some getting used 
> to, and if it can't figure out from your use of it how to use the fast 
> techniques then it will often fall back on the slower data.frame approaches. 
> [1] covers the single-column case, but merging on multiple columns is 
> quite doable.
>
> You might also find sqldf helpful if you are more comfortable with SQL than 
> data.table's way of doing things.
>
> [1] http://stackoverflow.com/questions/17331684/fast-exists-in-data-table
>
>> I'm sure that I can use merge(), but I suspect that it is doing a lot of 
>> unnecessary work and that it will take much more time than the job really 
>> should require.  Is that correct?  Can anything be done about it?
>> 
>> The inspiration for my question comes partly from the way GNU comm works.
>
> Not familiar with that.
>
>> If you have any ideas about this, I'd love to hear them.
>> 
>> Thanks in advance.
>> 
>> Mike
>> 
>> -- 
>> Michael B. Miller, Ph.D.
>> University of Minnesota
>> http://scholar.google.com/citations?user=EV_phq4AAAAJ
>> 
>
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                      Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------
>
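
For readers who want to try the data.table route suggested above, here is a
minimal sketch of a two-column keyed merge.  The tables a and b, the key
columns id1 and id2, and the value columns x and y are made-up stand-ins for
the real data, not anything from the original post.  As far as I know,
setkey() only reorders rows when they are not already in key order, so
pre-sorted input should cost little at that step, but that is worth timing
on the real data.

```r
library(data.table)

# Hypothetical toy tables standing in for the 15-million-row data frames.
# Both are already sorted by the two key fields, id1 and id2.
a <- data.table(id1 = c(1L, 1L, 2L, 3L),
                id2 = c(1L, 2L, 1L, 1L),
                x   = c(10, 11, 12, 13))
b <- data.table(id1 = c(1L, 2L, 2L, 3L),
                id2 = c(2L, 1L, 2L, 1L),
                y   = c(20, 21, 22, 23))

# Mark both tables as keyed (sorted) on the two merge fields.
setkey(a, id1, id2)
setkey(b, id1, id2)

# Inner join on the key columns: only records present in both tables.
ab <- merge(a, b, by = c("id1", "id2"))
print(ab)
```

Passing all = TRUE to merge() would keep the unmatched records from both
tables instead of dropping them.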

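The idea behind the question, that two sorted inputs can be matched in a
single pass much as GNU comm walks two sorted files, can be sketched in base
R as below.  This is only an illustration under stated assumptions: the
function name sorted_match is hypothetical, it handles a single key column
sorted in ascending order with no duplicate keys, and an R-level loop like
this would be far too slow for millions of rows.  It shows the technique, not
a replacement for merge() or data.table.

```r
# Walk two sorted key vectors in step and return the positions that match.
# Assumes ka and kb are sorted ascending and contain no duplicates.
sorted_match <- function(ka, kb) {
  i <- 1L
  j <- 1L
  n <- 0L
  ia <- integer(min(length(ka), length(kb)))
  ib <- integer(length(ia))
  while (i <= length(ka) && j <= length(kb)) {
    if (ka[i] < kb[j]) {
      i <- i + 1L            # key only in the first input: skip it
    } else if (ka[i] > kb[j]) {
      j <- j + 1L            # key only in the second input: skip it
    } else {
      n <- n + 1L            # key in both: record the pair of positions
      ia[n] <- i
      ib[n] <- j
      i <- i + 1L
      j <- j + 1L
    }
  }
  list(a = ia[seq_len(n)], b = ib[seq_len(n)])
}

# Hypothetical toy data frames, each sorted by key.
a <- data.frame(key = c(1, 3, 4, 7), x = c("a1", "a3", "a4", "a7"))
b <- data.frame(key = c(2, 3, 7, 9), y = c("b2", "b3", "b7", "b9"))

m <- sorted_match(a$key, b$key)
merged <- cbind(a[m$a, , drop = FALSE], y = b$y[m$b])
print(merged)
```

Each input is read once from top to bottom, so the work grows with the sum
of the two lengths rather than requiring a fresh sort or a hash lookup.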

