[R] Performing Analysis on Subset of External data

Wed Oct 6 20:58:32 CEST 2004

On 06-Oct-04 Laura Quinn wrote:
> I want to perform some analysis on subsets of huge data files.
> There are 20 of the files and I want to select the same subsets
> of each one (each subset is a chunk of 1500 or so consecutive
> rows from several million).
> To save time and processing power is there a method to tell R
> to *only* read in these rows, rather than reading in the entire
> dataset then selecting subsets and deleting the extraneous data?
> This method takes a rather silly amount of time and results in
> memory problems.
> 
> I am using R 1.9.0 on SuSe 9.0

Hi Laura,
If there is a neat time- and memory-efficient R solution then I'm sure
someone will tell you! But since you're using Linux, I can suggest
an alternative, which is to use some combination of the Unix file
utilities which you will already have in your SuSE, entering them
as a command line at the system prompt, or executing a shell script
file which contains the command.

For example, to read just lines (say) 500001-501500 you could use

  head -n 501500 bigdata | tail -n 1500 > smalldata

which reads the first 501500 lines of bigdata and then the last
1500 lines of these, and directs the result of this into the file
smalldata.
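
A variant of the same idea, as a rough sketch only: sed can print a
range of lines and then quit straight after the last wanted line, so
the remainder of the file is never read. The stand-in file generated
with seq is an assumption purely for illustration (each line simply
holds its own line number); with real data you would skip that step.

```shell
# Stand-in for the real file (assumption): 600000 lines, each holding its line number.
seq 1 600000 > bigdata

# Print only lines 500001-501500, then quit so the rest of the file is not read.
sed -n '500001,501500p;501500q' bigdata > smalldata

# Quick check of what was extracted.
wc -l < smalldata      # number of lines kept
head -n 1 smalldata    # first line kept
```

Stopping early matters most when the chunk sits near the start of a
file with several million lines.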

That's OK for a single chunk of 1500, but suppose (as seems likely
to be the case) you want (say) the first line of the file (for names)
and 5 chunks of 1500 starting at lines 100001, 200001, 300001,
400001, 500001 respectively. Then awk will do what you want, along
the lines of

  awk '
    # Keep line 1 (the names) and the five 1500-line chunks; skip the rest.
    {nr=NR;
      if(
          (nr==1) ||
          ((nr>=100001)&&(nr<=101500)) ||
          ((nr>=200001)&&(nr<=201500)) ||
          ((nr>=300001)&&(nr<=301500)) ||
          ((nr>=400001)&&(nr<=401500)) ||
          ((nr>=500001)&&(nr<=501500))
        ) {print $0}
      else {next}
    }' bigdata > smalldata

(The above can be typed in as shown, and will be a single command).
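
Since the same chunks are wanted from each of 20 files, the awk
approach can be wrapped in a shell loop. What follows is only a sketch
under stated assumptions: the file names bigdata1..bigdata3 and the
small_ prefix are hypothetical, the stand-in data are generated with
seq, and the chunk starts are held in a list rather than written out
as separate conditions. It also exits after the last wanted line, so
the tail of each file is not read.

```shell
# Stand-ins for the real files (hypothetical names): each line holds its line number.
for i in 1 2 3; do seq 1 600000 > "bigdata$i"; done

# For each file keep line 1 plus five chunks of 1500 lines.
for f in bigdata1 bigdata2 bigdata3; do
  awk -v len=1500 '
    # Chunk start lines, in increasing order.
    BEGIN { n = split("100001 200001 300001 400001 500001", start, " ") }
    # Always keep the first line (the names).
    NR == 1 { print; next }
    {
      # Print the line if it falls inside any chunk.
      for (i = 1; i <= n; i++)
        if (NR >= start[i] && NR < start[i] + len) { print; break }
    }
    # Stop once the last line of the last chunk has been handled.
    NR >= start[n] + len - 1 { exit }
  ' "$f" > "small_$f"
done
```

Each small_ file then holds 1 + 5*1500 = 7501 lines, and changing the
chunks means editing only the list in the BEGIN block.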

Having done this, you can then use smalldata as the dataset to
read into R, instead of bigdata.

These are just examples of what can be done externally using such
utilities.

(Now, whether or not there's a simple R workaround, I shall undoubtedly
be trumped by some Perl freak).

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 06-Oct-04                                       Time: 19:58:32
------------------------------ XFMail ------------------------------