[R] Performing Analysis on Subset of External data
Thomas Lumley
tlumley at u.washington.edu
Wed Oct 6 20:13:17 CEST 2004
On Wed, 6 Oct 2004, Laura Quinn wrote:
> Hi,
>
> I want to perform some analysis on subsets of huge data files. There are
> 20 of these files and I want to select the same subsets of each one (each
> subset is a chunk of 1500 or so consecutive rows out of several million).
> To save time and processing power, is there a way to tell R to *only*
> read in these rows, rather than reading in the entire dataset and then
> selecting the subsets and deleting the extraneous data? That approach
> takes a rather silly amount of time and results in memory problems.
It depends on the data format. If, for example, you have free-format text
files, it isn't possible to locate a specific chunk without reading all of
the earlier entries. You can still save time and space by having some other
program (?Perl) read the file and write out a new file containing just the
1500 rows you want.
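As a sketch of that preprocessing step (here in Python rather than Perl; the file name and chunk boundaries are made up for illustration), the key point is to stream the file and keep only the wanted rows, so memory use stays at 1500 rows even though the earlier lines must still be scanned:

```python
import csv
from itertools import islice

def read_chunk(path, start, n):
    """Return rows start .. start+n-1 (0-based) of a delimited text file.

    The earlier lines are still read and discarded -- unavoidable with
    free-format text -- but only the n wanted rows are ever held in
    memory, and the result can be written back out for R to read.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        return list(islice(reader, start, start + n))
```

One could then write the returned rows to a small file with `csv.writer` and read that into R, repeating for each of the 20 files.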
A better strategy would be for the data to be either in a database or in a
format such as netCDF designed for random access.
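To illustrate why a database helps (a minimal sketch using SQLite; the table and column names are hypothetical, and in R the same idea would go through DBI/RSQLite): once the text file has been loaded into a table, any 1500-row chunk can be fetched by its integer key without scanning the rest of the data.

```python
import sqlite3

def build_db(db_path, rows):
    # One-time load: copy the rows of the text file into an SQLite table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS obs (a TEXT, b TEXT)")
    con.executemany("INSERT INTO obs VALUES (?, ?)", rows)
    con.commit()
    return con

def fetch_chunk(con, start, n):
    # rowid is SQLite's built-in integer key; a BETWEEN range on it is
    # an index lookup, not a full scan of the table.
    cur = con.execute(
        "SELECT a, b FROM obs WHERE rowid BETWEEN ? AND ?",
        (start, start + n - 1),
    )
    return cur.fetchall()
```

A netCDF file gives the same random-access property for array-shaped data, with the indexing done by the file format itself.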
-thomas