[Rd] Performing Merge and Duplicated on very large files
Eitan Rubin
erubin at bgu.ac.il
Wed Apr 18 05:44:06 CEST 2007
Hi,
I am working with very large matrices (>1 million records), and need to:
1. Join the files (can be achieved with merge)
2. Find rows that have the same value in some field (after the join) and
randomly sample one row from each such group.
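A minimal sketch of these two steps on toy data, in case it helps frame the question. The tables and the column names "id", "x" and "group" are hypothetical, not taken from the real files, and this is only one possible way to pick a random row per group (shuffle the rows, then keep the first occurrence of each value):

```r
# Hypothetical toy tables; all column names here are assumptions.
a <- data.frame(id = 1:6, x = c(10, 20, 30, 40, 50, 60))
b <- data.frame(id = c(1:5, 7L),
                group = c("g1", "g1", "g2", "g3", "g3", "g4"))

# Step 1: inner join on the shared key column.
merged <- merge(a, b, by = "id")

# Step 2: keep one randomly chosen row for each value of "group".
# Shuffling the rows first means the row that !duplicated() keeps
# (the first occurrence of each value) is a random member of its group.
set.seed(1)  # only so the sketch is reproducible
shuffled <- merged[sample(nrow(merged)), ]
picked <- shuffled[!duplicated(shuffled$group), ]
```

Here picked has exactly one row per distinct value of "group"; the rows come back in shuffled order, so they may need re-sorting afterwards.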
I am concerned about the complexity of merge - how (in)efficient is it? I
don't have access to the real data; I need to send the script to someone who
does, so I can't just try it and see what happens.
Similarly, I am worried about the duplicated function - will it run on the
merged matrix? The merged matrix is expected to be ~500,000 rows long and to
have small clusters of duplicated values (1-10 repeats of the same value).
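One way to get a rough feel for this without the real data might be to time both calls on synthetic data of about the same shape. The sketch below is only that: the function name simulate_run, the column names, and the way the 1-10 repeat clusters are generated are all assumptions, and the real data may well behave differently.

```r
# Hypothetical helper: builds synthetic tables of the sizes described
# above and times merge() and duplicated() on them.
simulate_run <- function(n_left = 1e6, n_merged = 5e5) {
  # Left table: one row per id.
  left <- data.frame(id = seq_len(n_left), value = rnorm(n_left))

  # Field values arranged in clusters of 1 to 10 repeats (assumed shape).
  sizes <- sample(1:10, n_merged, replace = TRUE)
  field <- rep(seq_along(sizes), times = sizes)[seq_len(n_merged)]

  # Right table: a random subset of ids, with the field values shuffled.
  right <- data.frame(id = sample(n_left, n_merged), field = sample(field))

  t_merge <- system.time(merged <- merge(left, right, by = "id"))
  t_dup <- system.time(dup <- duplicated(merged$field))

  list(rows = nrow(merged),
       n_duplicated = sum(dup),
       merge_seconds = t_merge[["elapsed"]],
       duplicated_seconds = t_dup[["elapsed"]])
}
```

Calling simulate_run() with the defaults uses the sizes mentioned above; a call such as simulate_run(1000, 500) is a quick check that the script itself works before sending it on.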
ER
- - - - - -
Eitan Rubin
Dept. of Microbiology and Immunology
AND
National Institute of Biotechnology in the Negev
Ben Gurion University
Beer Sheva, Israel
Phone: 08-6479197