[R] Performing Analysis on Subset of External data

Wed Oct 6 20:11:05 CEST 2004

Laura Quinn wrote:
> Hi,
> 
> I want to perform some analysis on subsets of huge data files. There are
> 20 of the files and I want to select the same subsets of each one (each
> subset is a chunk of 1500 or so consecutive rows from several million). To
> save time and processing power is there a method to tell R to *only* read
> in these rows, rather than reading in the entire dataset then selecting
> subsets and deleting the extraneous data? This method takes a rather silly
> amount of time and results in memory problems.
> 
> I am using R 1.9.0 on SuSE 9.0
> 
> Thanks in advance!
> 

Hi Laura,

If you know which row of the file your subset starts at and how many 
lines you want to read in, you can use scan() with the arguments skip 
and nlines (see ?scan). Only the requested lines are kept in memory, 
although the skipped lines still have to be read past in the file.
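
As a minimal sketch of the scan approach, assuming whitespace-separated 
numeric data with a known number of columns (the temporary file, the 
starting row and the variable names below are made up for illustration):

```r
# Hypothetical stand-in for one of the large files:
# 10,000 rows of 3 numeric columns, no header line
path <- tempfile(fileext = ".txt")
write.table(matrix(seq_len(30000), ncol = 3, byrow = TRUE), path,
            row.names = FALSE, col.names = FALSE)

first_row <- 2501   # first row of the subset (assumed known)
n_rows    <- 1500   # number of consecutive rows wanted
n_cols    <- 3      # number of columns per row (assumed known)

# skip discards the lines before the subset; nlines stops reading
# once the requested number of lines has been read
vals <- scan(path, skip = first_row - 1, nlines = n_rows, quiet = TRUE)

# scan() returns a flat vector; reshape it to one row per line of the file
chunk <- matrix(vals, ncol = n_cols, byrow = TRUE)
dim(chunk)
```

The same call could be wrapped in a function and applied to each of the 
20 files with lapply(). If a data frame is more convenient, read.table() 
also accepts skip and nrows.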

A better way, and one that is often recommended on this list, is to 
store your data in a database and use one of the R packages that can 
connect to it, so that you extract only the rows you need.
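
A rough sketch of the database route, assuming an SQLite database 
accessed through the DBI and RSQLite packages (this choice of database, 
the table name obs and the column row_id are illustrative assumptions, 
not something the question specifies):

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for the real one
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table; row_id records the original row number in the file
obs <- data.frame(row_id = 1:10000, value = sqrt(1:10000))
dbWriteTable(con, "obs", obs)

# Only the 1500 requested rows are returned to R
chunk <- dbGetQuery(con,
  "SELECT * FROM obs WHERE row_id BETWEEN 2501 AND 4000 ORDER BY row_id")

dbDisconnect(con)
nrow(chunk)
```

In practice the data would be loaded into the database once, and each 
analysis would then issue only the SELECT for the rows it needs.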

See the R Data Import/Export manual for more on scan and using 
relational databases with R.

Hope this helps,

Gav
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                     [T] +44 (0)20 7679 5522
ENSIS Research Fellow             [F] +44 (0)20 7679 7565
ENSIS Ltd. & ECRC                 [E] gavin.simpson at ucl.ac.uk
UCL Department of Geography       [W] http://www.ucl.ac.uk/~ucfagls/cv/
26 Bedford Way                    [W] http://www.ucl.ac.uk/~ucfagls/
London.  WC1H 0AP.
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%