[R] merging pre-sorted data frames

Alex Fun alex at glasshat.com
Thu Jan 15 04:41:02 CET 2015


The dplyr package's full_join(), left_join(), right_join(), and inner_join()
are also comparable in speed to data.table, and their syntax is closer to
merge()'s.
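
For illustration, here is a minimal sketch of that syntax on a two-column
key. The data frames and the column names (chr, pos, x, y) are hypothetical
stand-ins, not the data from this thread, and nothing here has been
benchmarked at the 15-million-row scale discussed below.

```r
library(dplyr)

# Hypothetical stand-ins for one pair of data frames; `chr` and `pos` are
# made-up names for the two fields used as the merge key.
a <- data.frame(chr = c(1, 1, 2, 2),
                pos = c(10, 20, 10, 30),
                x   = c("a1", "a2", "a3", "a4"))
b <- data.frame(chr = c(1, 2, 2, 3),
                pos = c(20, 10, 40, 10),
                y   = c("b1", "b2", "b3", "b4"))

# Records present in both, analogous to merge(a, b, by = c("chr", "pos"))
both <- inner_join(a, b, by = c("chr", "pos"))

# Records present in either, analogous to
# merge(a, b, by = c("chr", "pos"), all = TRUE)
either <- full_join(a, b, by = c("chr", "pos"))
```

As with merge(), the key columns are named in `by`; left_join() and
right_join() take the same arguments.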

On Thu, Jan 15, 2015 at 2:17 PM, Mike Miller <mbmiller+l at gmail.com> wrote:

> Thanks, Jeff.  You really know the packages.  I searched, but I guess I
> didn't use the right terms.  That package seems to do exactly what I wanted.
>
> Mike
>
>
>
> On Tue, 13 Jan 2015, Jeff Newmiller wrote:
>
>> On Tue, 13 Jan 2015, Mike Miller wrote:
>>>
>>> I have many pairs of data frames, each with about 15 million records
>>> and about 10 million records in common.  They are sorted by two of their
>>> fields and will be merged by those same fields.
>>>
>>> The fact that the data are sorted could be used to greatly speed up a
>>> merge, but I have the impression that merge() cannot "know" in advance that
>>> the fields are already sorted.
>>>
>>
>> There are different versions of "merge". This sounds like a job for the
>> data.table package, which has its own way of doing merges that is likely to
>> be useful here. However, be warned that data.table takes some getting used
>> to, and if it can't figure out from your use of it how to use the fast
>> techniques then it will often fall back on the slower data.frame
>> approaches. [1] covers the single-column case... but multiple columns is
>> quite doable.
>>
>> You might also find sqldf helpful if you are more comfortable with SQL
>> than data.table's way of doing things.
>>
>> [1] http://stackoverflow.com/questions/17331684/fast-exists-in-data-table
>>
>>> I'm sure that I can use merge(), but I suspect that it is doing a lot of
>>> unnecessary work and that it will take much more time than the job really
>>> should require.  Is that correct?  Can anything be done about it?
>>>
>>> The inspiration for my question comes partly from the way GNU comm works.
>>>
>>
>> Not familiar with that.
>>
>>> If you have any ideas about this, I'd love to hear them.
>>>
>>> Thanks in advance.
>>>
>>> Mike
>>>
>>> --
>>> Michael B. Miller, Ph.D.
>>> University of Minnesota
>>> http://scholar.google.com/citations?user=EV_phq4AAAAJ
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                      Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------
>>
>>



-- 
Alex Fun, Resident Scientist
Email: alex at glasshat.com
Website: www.glasshat.com
Address: Level 9, 70 Pitt Street Sydney NSW 2000
Office: 02 9114 9515
Personal: 02 9114 9515

