[Rd] Efficient Merging of two huge sorted data frames?---Use merge()?
Charles Cheung
boom2k1 at hotmail.com
Tue May 9 02:20:23 CEST 2006
Hello all,
A problem I encounter today is the speed which takes to sort two huge data
frames...
I wish to sort by (X,Y)
Dataframe One consists of variables:
X, Y, sequence, position
having ~700 000 records
another dataframe consists of
X,Y, intensities
having ~900 000 records
Every (X,Y) pair in dataframe One is included in dataframe Two,
however, the reverse is not true.
Furthermore, (X,Y, position) in data frame One makes the record unique.
(That means there can be multiple records with the same (X,Y) records!)
Added together, it makes it hard to just combine the two data frames
together by simply going
data.frame(dataFrameOne, dataFrameTwo) because the mapping won't correspond
even in sorted records by X and Y.
Intuitive, it should only require very little time <O(n) complexity> after
the data records are sorted.
However, it takes so long (I haven't finished the process in 20 minutes.. it
should only take <1 min) to merge the list by X and Y using
merge(dataFrameOne, dataFrameTwo, by=c("X","Y") , which leads me to suspect
this process is not optimized for already sorted list.
* assuming the two frames have been sorted, I would be able to do the
following:
X Y seq Pos
1 1 AA 32
1 2 AG 44
1 3 GC 65
X Y intensities
1 1 0.4
1 3 0.552
>>Cursor at beginning (1,1) (1,1) -->merge the (1,1) pair.. then cursor
>>moves to (1,2) (1,3) --> can't find.. cursor moves to (1,3) (1,3) ..
>>merge that pair
Is the merge function doing that already?
Is there an efficient way to merge the data frames? (What do you suggest I
should do?)
(to produce)
X Y seq pos intensities
1 1 AA 32 0.4
1 3 GC 65 0.552
Thank you in advance!
Charles Cheung
More information about the R-devel
mailing list