[R] Importing subsets of rows into R from a large text data file

Marc Schwartz MSchwartz at MedAnalytics.com
Sat Apr 23 18:32:16 CEST 2005


Just an FYI here as I think that this approach sometimes gets
overlooked.

When one has a very large dataset to read into R from a text file (as
opposed to an external database that can be queried) and in actuality
only needs a subset of sequential records for subsequent analysis, there
are options to read.table() and family, which enable you to skip records
in the incoming data file and also restrict the number of records to be
read in, using the 'skip' and 'nrow' arguments.


Simple example:

# Create a data.frame with 20 rows
> df <- data.frame(rec.no = 1:20, Col1 = sample(letters, 20), 
+                  Col2 = sample(1:100, 20))

> df
   rec.no Col1 Col2
1       1    m   30
2       2    s   44
3       3    o    1
4       4    i   45
5       5    v   97
6       6    x   34
7       7    f   91
8       8    r    4
9       9    u   99
10     10    g   81
11     11    k   64
12     12    d   68
13     13    c   96
14     14    b   13
15     15    z   15
16     16    a   35
17     17    h   11
18     18    t   67
19     19    l   93
20     20    e   37


# Write it to a text file
> write.table(df, "df.dat", row.names = FALSE)


# Now read in recs 5:10
# Note that the header row (1) plus 4 data rows (total 5)
# are skipped here. Then 6 rows are read in from that point
> read.table("df.dat", skip = 5, nrow = 6)
  V1 V2 V3
1  5  v 97
2  6  x 34
3  7  f 91
4  8  r  4
5  9  u 99
6 10  g 81

Note that the header row is not read here, therefore generic colnames
are created.


One quick way around this is to use the 'col.names' argument in
read.table() to set the colnames for the incoming data in this fashion:

> read.table("df.dat", skip = 5, nrow = 6,
+             col.names = colnames(read.table("df.dat", nrow = 1, 
+                                             header = TRUE)))
  rec.no Col1 Col2
1      5    v   97
2      6    x   34
3      7    f   91
4      8    r    4
5      9    u   99
6     10    g   81

The construct:

> colnames(read.table("df.dat", nrow = 1, header = TRUE))
[1] "rec.no" "Col1"   "Col2"

gets the colnames from the header row in the text file and then assigns
them to the subset of data that is read in using the 'col.names'
argument.

This is a very simple example, but with a large dataset and with perhaps
RAM resource restrictions on your computer, this might be helpful in
certain scenarios.

Needless to say the use of post-import indexing is easy and the subset()
function brings to bear a great deal of flexibility post import, when
more complex logic is required to subset rows based upon multiple
parameters and/or where you might also only want a subset of columns.

See ?read.table, ?Extract and ?subset for more information.

HTH,

Marc Schwartz




More information about the R-help mailing list