[R] merging pre-sorted data frames
Mike Miller
mbmiller+l at gmail.com
Thu Jan 15 04:17:40 CET 2015
Thanks, Jeff. You really know the packages. I searched, but I guess I
didn't use the right terms. That package seems to do exactly what I
wanted.
Mike
On Tue, 13 Jan 2015, Jeff Newmiller wrote:
> On Tue, 13 Jan 2015, Mike Miller wrote:
>
>> I have many pairs of data frames, each with about 15 million records,
>> and each pair has about 10 million records in common. They are sorted by
>> two of their fields and will be merged by those same fields.
>>
>> The fact that the data are sorted could be used to greatly speed up a
>> merge, but I have the impression that merge() cannot "know" in advance that
>> the fields are already sorted.
>
> There are different versions of "merge". This sounds like a job for the
> data.table package, which has its own way of doing merges that is likely to
> be useful here. However, be warned that data.table takes some getting used
> to, and if it can't figure out from your use of it how to use the fast
> techniques then it will often fall back on the slower data.frame approaches.
> [1] covers the single-column case... but the multi-column case is quite doable.
>
> You might also find sqldf helpful if you are more comfortable with SQL than
> data.table's way of doing things.
>
> [1] http://stackoverflow.com/questions/17331684/fast-exists-in-data-table
>
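For readers following the thread, here is a minimal sketch of the kind of two-column keyed join Jeff is pointing to. It assumes the data.table package is installed. The column names (id1, id2, x, y) and the toy data are hypothetical stand-ins for the real 15-million-row frames, and this is only one of several ways to write such a join, not necessarily the one Jeff had in mind.

```r
library(data.table)

# Hypothetical toy tables, already sorted by (id1, id2) like the real data.
a <- data.table(id1 = c(1L, 1L, 2L, 3L),
                id2 = c(1L, 2L, 1L, 1L),
                x   = c(10, 20, 30, 40))
b <- data.table(id1 = c(1L, 2L, 2L, 3L),
                id2 = c(2L, 1L, 2L, 1L),
                y   = c("p", "q", "r", "s"))

# Declare the two sort fields as the key of each table.
setkey(a, id1, id2)
setkey(b, id1, id2)

# Inner join on the shared key: keep only records present in both tables.
both <- a[b, nomatch = 0L]

# The same result written with the data.table method for merge().
both2 <- merge(a, b, by = c("id1", "id2"))

print(both)
```

For real data, an existing data frame can be converted in place with setDT() before setting the key, which avoids copying a large object.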
>> I'm sure that I can use merge(), but I suspect that it is doing a lot of
>> unnecessary work and that it will take much more time than the job really
>> should require. Is that correct? Can anything be done about it?
>>
>> The inspiration for my question comes partly from the way GNU comm works.
>
> Not familiar with that.
>
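For context, GNU comm reads two sorted files in a single pass and reports which lines are unique to each file and which are common to both. The sketch below applies the same single-pass idea to two data frames sorted by a two-field key. It is an illustration only: the function name merge_sorted and the columns k1, k2, x, y are hypothetical, the keys are assumed to be numeric, unique within each frame, and sorted ascending, and an R-level loop like this would be far too slow for 15 million rows.

```r
# Single-pass merge of two data frames already sorted by (k1, k2).
# Assumes numeric keys that are unique within each frame.
merge_sorted <- function(a, b) {
  i <- 1L; j <- 1L; n <- 0L
  ia <- integer(min(nrow(a), nrow(b)))  # matched row numbers in a
  ib <- integer(length(ia))             # matched row numbers in b
  while (i <= nrow(a) && j <= nrow(b)) {
    # Compare the two-field keys lexicographically: -1, 0, or 1.
    cmp <- sign(a$k1[i] - b$k1[j])
    if (cmp == 0) cmp <- sign(a$k2[i] - b$k2[j])
    if (cmp < 0) {
      i <- i + 1L                       # key occurs only in a
    } else if (cmp > 0) {
      j <- j + 1L                       # key occurs only in b
    } else {
      n <- n + 1L                       # key occurs in both
      ia[n] <- i; ib[n] <- j
      i <- i + 1L; j <- j + 1L
    }
  }
  keep <- seq_len(n)
  extra <- setdiff(names(b), c("k1", "k2"))
  cbind(a[ia[keep], , drop = FALSE], b[ib[keep], extra, drop = FALSE])
}

# Hypothetical toy data, sorted by (k1, k2).
a <- data.frame(k1 = c(1, 1, 2, 3), k2 = c(1, 2, 1, 1), x = c(10, 20, 30, 40))
b <- data.frame(k1 = c(1, 2, 2, 3), k2 = c(2, 1, 2, 1), y = c("p", "q", "r", "s"))

res <- merge_sorted(a, b)
print(res)
```

Each row of either frame is visited at most once, which is the saving that sorted input makes possible; a general-purpose merge() cannot assume that ordering.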
>> If you have any ideas about this, I'd love to hear them.
>>
>> Thanks in advance.
>>
>> Mike
>>
>> --
>> Michael B. Miller, Ph.D.
>> University of Minnesota
>> http://scholar.google.com/citations?user=EV_phq4AAAAJ
>>
>
> ---------------------------------------------------------------------------
> Jeff Newmiller
> DCN: <jdnewmil at dcn.davis.ca.us>
> Research Engineer (Solar/Batteries/Software/Embedded Controllers)
> ---------------------------------------------------------------------------
>