[Rd] Efficient Merging of two huge sorted data frames?---Use merge()?
Prof Brian Ripley
ripley at stats.ox.ac.uk
Tue May 9 09:50:36 CEST 2006
merge() is not optimized for large data frames. To do things on this
scale you really want to be using a DBMS, not R. See the 'R Data
Import/Export' manual.
Sorting is not really relevant, especially as merge() does not assume that
the match is unique. Hashing could be used, but is not.
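
As a rough illustration of the hashing idea (this is not what merge() does,
and not a drop-in replacement for it), a keyed lookup with match() might look
like the sketch below. match() is, to the best of my knowledge, hash-based
internally. The toy data mirror the example in the quoted message; the names
one, two, key_one, key_two, idx, keep and merged are made up for the
illustration, and the sketch assumes each (X,Y) pair occurs at most once in
the second data frame, which the original message does not state.

```r
# Hypothetical toy data mirroring the example in the quoted message.
one <- data.frame(X = c(1, 1, 1), Y = c(1, 2, 3),
                  seq = c("AA", "AG", "GC"), pos = c(32, 44, 65),
                  stringsAsFactors = FALSE)
two <- data.frame(X = c(1, 1), Y = c(1, 3),
                  intensities = c(0.4, 0.552))

# Build one character key per row from the (X, Y) pair.
# ASSUMPTION: X and Y are whole numbers, so "_" cannot occur inside a value.
key_one <- paste(one$X, one$Y, sep = "_")
key_two <- paste(two$X, two$Y, sep = "_")

# ASSUMPTION: each (X, Y) pair occurs at most once in 'two'.
# If it does not, match() would silently pick only the first occurrence.
stopifnot(anyDuplicated(key_two) == 0L)

# For each row of 'one', find the row index of the same key in 'two'
# (NA where 'two' has no such key).
idx  <- match(key_one, key_two)
keep <- !is.na(idx)

# Keep only the rows of 'one' that found a partner, and attach the intensities.
merged <- one[keep, , drop = FALSE]
merged$intensities <- two$intensities[idx[keep]]
rownames(merged) <- NULL
print(merged)
```

Repeated (X,Y) pairs in the first data frame are handled naturally here,
since each of its rows is looked up independently. If the second data frame
can also repeat a pair, this lookup is not sufficient and a true many-to-many
join is needed.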
As R is open source, you have the source code and it would be kinder to
read it yourself rather than expect this list to read it for you. A
useful contribution to the R project would be to contribute a more
efficient version, and we look forward to seeing your contribution.
On Mon, 8 May 2006, Charles Cheung wrote:
> Hello all,
>
> A problem I encountered today is the time it takes to sort two huge data
> frames...
>
> I wish to sort by (X,Y)
>
> Dataframe One consists of variables:
> X, Y, sequence, position
> having ~700 000 records
>
> another dataframe consists of
> X,Y, intensities
> having ~900 000 records
>
>
> Every (X,Y) pair in dataframe One is included in dataframe Two,
> however, the reverse is not true.
> Furthermore, (X,Y, position) in data frame One makes the record unique.
> (That means there can be multiple records with the same (X,Y) values!)
>
> Added together, it makes it hard to just combine the two data frames
> together by simply going
> data.frame(dataFrameOne, dataFrameTwo) because the mapping won't correspond
> even in sorted records by X and Y.
>
>
> Intuitively, it should require very little time <O(n) complexity> after
> the data records are sorted.
> However, it takes so long (it had not finished after 20 minutes; it
> should only take <1 min) to merge the data frames by X and Y using
>
> merge(dataFrameOne, dataFrameTwo, by=c("X","Y")), which leads me to suspect
> this process is not optimized for already sorted data.
>
> * assuming the two frames have been sorted, I would be able to do the
> following:
>
>
> X Y seq pos
> 1 1 AA  32
> 1 2 AG  44
> 1 3 GC  65
>
>
> X Y intensities
> 1 1 0.4
> 1 3 0.552
>
>>> Cursor at beginning (1,1) (1,1) -->merge the (1,1) pair.. then cursor
>>> moves to (1,2) (1,3) --> can't find.. cursor moves to (1,3) (1,3) ..
>>> merge that pair
>
> Is the merge function doing that already?
>
>
> Is there an efficient way to merge the data frames? (What do you suggest I
> should do?)
>
>
> (to produce)
> X Y seq pos intensities
> 1 1 AA  32  0.4
> 1 3 GC  65  0.552
>
> Thank you in advance!
>
>
> Charles Cheung
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595