[R] Processing a large number of files

Douglas Bates bates at stat.wisc.edu
Wed Jul 23 00:11:14 CEST 2003

I maintain the Devore5 package which contains the data sets from the
5th edition of Jay Devore's text "Probability and Statistics for
Engineering and the Sciences".  The 6th edition has now been published
and it includes several new data sets in exercises and examples.  In
addition, some exercises and examples from the 5th edition are
renumbered in the 6th edition.

I face the daunting task of adding and documenting the new data sets
and updating the numbering.  I had thought of going back to the text
files but discovered that it was easier to work from another format.

A CD-ROM with the book provides the data sets in several different
formats, including SPSS saved data sets.  I was pleasantly surprised
that I could write an R script that read the data from the .sav file,
converted it to an R data frame, converted the SPSS name such as
ex01-11.sav to an allowable R name (ex01.11), and saved the resulting
data set in a new directory.  In the past I would have written Python
or Perl scripts to do all the manipulations of iterating over files
but with the current facilities in R for listing file names, etc., I
can do the whole thing in R.  My script, which worked on the first
try, is

library(foreign)                        # provides read.spss
SPSS = "/cdrom/Manual Install/Datasets/SPSS/"  # change as appropriate
Rdata = "/tmp/Devore6/data/"            # change as appropriate
chapters = c("CH01", "CH04", "CH06", "CH07", "CH08", "CH09",
    "CH10", "CH11", "CH12", "CH13", "CH15", "Ch14", "Ch16")
for (ch in chapters) {
    path = paste(SPSS, ch, sep = '')
    # pattern is a regular expression, not a shell glob
    files = list.files(path = path, pattern = '\\.sav$')
    for (ff in files) {
        # e.g. ex01-11.sav becomes ex01.11
        dsn = gsub('-', '.', gsub('\\.sav$', '', ff))
        assign(dsn, data.frame(read.spss(paste(path, ff, sep = '/'))))
        save(list = dsn, file = paste(Rdata, dsn, ".rda", sep = ''))
    }
}

In fact this script processed the 326 files so quickly that I thought
I must have made a mistake and somehow missed most of the files.  I
had to look in the output directory to convince myself that it had
indeed run properly.
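
A check of this kind can also be done from within R instead of by
inspecting the directory by hand.  The following is only a minimal
sketch and was not part of the script above: the helper name
count_files is hypothetical, and the temporary directories stand in
for the CD-ROM and output directories so that the example runs
anywhere.

```r
# Hypothetical helper: count the files under `dir` (searched recursively)
# whose names end in the given extension.
count_files <- function(dir, ext) {
    pattern <- paste0("\\.", ext, "$")
    length(list.files(path = dir, pattern = pattern, recursive = TRUE))
}

# Stand-in directories (assumptions for illustration); in practice these
# would be the SPSS and Rdata directories used by the script.
src <- tempfile("spss_demo")
out <- tempfile("rda_demo")
dir.create(file.path(src, "CH01"), recursive = TRUE)
dir.create(out)
file.create(file.path(src, "CH01", c("ex01-11.sav", "ex01-12.sav")))
file.create(file.path(out, c("ex01.11.rda", "ex01.12.rda")))

n_in <- count_files(src, "sav")
n_out <- count_files(out, "rda")
stopifnot(n_in == n_out)   # one output file per input file
```

If the two counts differ, comparing the converted input names against
the output names with setdiff would show which files were skipped.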

I would encourage others to consider using list.files, gsub,
etc. within R for such scripting applications.
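
As a small illustration of that style, here is a sketch of the
name-conversion step on its own.  The file names are made up for the
example and the helper name sav_to_rname is hypothetical; only
list.files, gsub and a few other base R functions are used.

```r
# Hypothetical helper: turn an SPSS file name such as "ex01-11.sav" into
# an allowable R name such as "ex01.11".
sav_to_rname <- function(fname) {
    gsub("-", ".", gsub("\\.sav$", "", fname))
}

# Create a few empty files in a temporary directory to iterate over.
demo_dir <- tempfile("sav_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir,
                      c("ex01-11.sav", "exmp04-02.sav", "notes.txt")))

# The pattern argument is a regular expression, so the dot is escaped and
# the match is anchored at the end of the name; notes.txt is not listed.
files <- list.files(path = demo_dir, pattern = "\\.sav$")
rnames <- sav_to_rname(files)
print(rnames)
```

Passing the result through make.names is one way to confirm that each
converted name is already syntactically valid in R.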
