[R] Processing a large number of files

Wed Jul 23 08:58:00 CEST 2003

Douglas Bates wrote:
> I maintain the Devore5 package which contains the data sets from the
> 5th edition of Jay Devore's text "Probability and Statistics for
> Engineering and the Sciences".  The 6th edition has now been published
> and it includes several new data sets in exercises and examples.  In
> addition, some exercises and examples from the 5th edition are
> renumbered in the 6th edition.
> 
> I face the daunting task of adding and documenting the new data sets
> and updating the numbering.  I had thought of going back to the text
> files but discovered that it was easier to work from another form.
> 
> A CD-ROM with the book provides the data sets in several different
> formats, including SPSS saved data sets.  I was pleasantly surprised
> that I could write an R script that read the data from the .sav file,
> converted it to an R data frame, converted the SPSS name such as
> ex01-11.sav to an allowable R name (ex01.11), and saved the resulting
> data set in a new directory.  In the past I would have written Python
> or Perl scripts to do all the manipulations of iterating over files
> but with the current facilities in R for listing file names, etc., I
> can do the whole thing in R.  My script, which worked on the first
> try, is
> 
> library(foreign)
> SPSS = "/cdrom/Manual Install/Datasets/SPSS/"  # change as appropriate
> Rdata = "/tmp/Devore6/data/"            # change as appropriate
> chapters = c("CH01", "CH04", "CH06", "CH07", "CH08", "CH09",
>     "CH10", "CH11", "CH12", "CH13", "CH15", "Ch14", "Ch16")
> for (ch in chapters) {
>     path = paste(SPSS, ch, sep = '')
>         files = list.files(path = path, pattern = '\\.sav$')  # pattern is a regex, not a glob
>     for (ff in files) {
>         dsn = gsub('-', '.', gsub('\\.sav$', '', ff))  # ex01-11.sav -> ex01.11
>         assign(dsn, data.frame(read.spss(paste(path, ff, sep = '/'))))
>         save(list = dsn, file = paste(Rdata, dsn, ".rda", sep = ''))
>     }
> }
> 
> In fact this script processed the 326 files so quickly that I thought
> I must have made a mistake and somehow missed most of the files.  I
> had to look in the output directory to convince myself that it had
> indeed run properly.
> 
> I would encourage others to consider using list.files, gsub,
> etc. within R for such scripting applications.
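
As a small illustration of the list.files/gsub idiom recommended above,
here is a minimal sketch that needs no CD-ROM. The scratch directory and
the file names are made up for the example; the point it shows is that
the 'pattern' argument of list.files() is a regular expression, so the
dot is escaped and the match is anchored at the end of the name.

```r
# Scratch directory with a few empty files (hypothetical names).
tmp <- file.path(tempdir(), "sav-demo")
dir.create(tmp, showWarnings = FALSE)
file.create(file.path(tmp, c("ex01-11.sav", "ex01-12.sav", "notes.txt")))

# 'pattern' is a regular expression: escape the dot, anchor at the end.
files <- list.files(path = tmp, pattern = "\\.sav$")

# Strip the extension, then turn '-' into '.' to get valid R names.
dsn <- gsub("-", ".", sub("\\.sav$", "", files), fixed = TRUE)
print(dsn)
```

Only the two .sav files are picked up, and their names become ex01.11
and ex01.12; notes.txt is ignored.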

Doug, indeed, it's great. Most of the current automated scripts for
compiling R binary packages for Windows are written in R, including the
processing of files (e.g. checking which of the 2xx CRAN packages have
been updated) and the generation of Windows *.bat files for the final
processing and upload steps.
In principle, the whole thing could be done in a single R script, but
that would be more difficult to debug, so it is not implemented that way.
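
To give a rough idea of what generating a *.bat file from R can look
like, here is a minimal sketch. It is not the actual build script: the
package names, the output file name and the command written into the
batch file are all assumptions made for illustration.

```r
# Hypothetical package names; a real list would come from the CRAN check.
pkgs <- c("pkgA", "pkgB")

# One command line per package (the command itself is an assumption).
cmds <- sprintf("R CMD INSTALL --build %s", pkgs)

# Write the batch file into a temporary directory and read it back.
bat <- file.path(tempdir(), "build.bat")
writeLines(c("@echo off", cmds), bat)
print(readLines(bat))
```

Keeping the generated file separate from the R code that writes it makes
it possible to inspect the batch file before running it.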

Uwe

Uwe Ligges