[R] Processing a large number of files
Uwe Ligges
ligges at statistik.uni-dortmund.de
Wed Jul 23 08:58:00 CEST 2003
Douglas Bates wrote:
> I maintain the Devore5 package which contains the data sets from the
> 5th edition of Jay Devore's text "Probability and Statistics for
> Engineering and the Sciences". The 6th edition has now been published
> and it includes several new data sets in exercises and examples. In
> addition, some exercises and examples from the 5th edition are
> renumbered in the 6th edition.
>
> I face the daunting task of adding and documenting the new data sets
> and updating the numbering. I had thought of going back to the text
> files but discovered that it was easier to work from another form.
>
> A CD-ROM with the book provides the data sets in several different
> formats, including SPSS saved data sets. I was pleasantly surprised
> that I could write an R script that read the data from the .sav file,
> converted it to an R data frame, converted the SPSS name such as
> ex01-11.sav to an allowable R name (ex01.11), and saved the resulting
> data set in a new directory. In the past I would have written Python
> or Perl scripts to do all the manipulations of iterating over files
> but with the current facilities in R for listing file names, etc., I
> can do the whole thing in R. My script, which worked on the first
> try, is
>
> library(foreign)   # provides read.spss
> SPSS = "/cdrom/Manual Install/Datasets/SPSS/" # change as appropriate
> Rdata = "/tmp/Devore6/data/" # change as appropriate
> chapters = c("CH01", "CH04", "CH06", "CH07", "CH08", "CH09",
>              "CH10", "CH11", "CH12", "CH13", "CH15", "Ch14", "Ch16")
> for (ch in chapters) {
>     path = paste(SPSS, ch, sep = '')
>     # pattern is a regular expression, not a shell glob
>     files = list.files(path = path, pattern = '\\.sav$')
>     for (ff in files) {
>         # ex01-11.sav -> ex01.11
>         dsn = gsub('-', '.', gsub('\\.sav$', '', ff))
>         assign(dsn, data.frame(read.spss(paste(path, ff, sep = '/'))))
>         save(list = dsn, file = paste(Rdata, dsn, ".rda", sep = ''))
>     }
> }
>
> In fact this script processed the 326 files so quickly that I thought
> I must have made a mistake and somehow missed most of the files. I
> had to look in the output directory to convince myself that it had
> indeed run properly.
>
> I would encourage others to consider using list.files, gsub,
> etc. within R for such scripting applications.
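The file-name handling described above can be tried without any SPSS files. The following is only a minimal sketch: the scratch directory and the file names in it are made up for illustration, and nothing is read with read.spss. It shows that the pattern argument of list.files is a regular expression, and that glob2rx can translate a shell-style wildcard when that is more convenient.

```r
# Hypothetical file names in a scratch directory; no SPSS data is read here.
demo_dir <- file.path(tempdir(), "sav_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("ex01-11.sav", "ex01-12.sav", "notes.txt")))

# 'pattern' is a regular expression: escape the dot and anchor the end.
files <- list.files(path = demo_dir, pattern = "\\.sav$")

# glob2rx() turns a shell-style wildcard into an equivalent regular expression.
files_glob <- list.files(path = demo_dir, pattern = glob2rx("*.sav"))

# Drop the extension, then replace '-' with '.' to get an allowable R name.
dsn <- gsub("-", ".", sub("\\.sav$", "", files), fixed = TRUE)
print(dsn)
```

Both calls to list.files should pick up the two .sav files and skip notes.txt.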
Doug, indeed, it's great. Most of the current automated scripts for
compiling R binary packages for Windows are written in R, including the
processing of files (e.g. checking which of the 2xx CRAN packages have
been updated) and the generation of Windows *.bat files for the final
processing and upload steps.
In principle, the whole thing could be done in a single R script, but
that would be more difficult to debug, so it is not implemented that way.
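As a rough illustration of that kind of workflow (not the actual build scripts, which are not shown here), the sketch below picks out files changed since a given time and writes a batch file from R. The directory, the package file names, the timestamp, and the command line written into the .bat file are all hypothetical placeholders.

```r
# Hypothetical layout: one source tarball per package in a scratch directory.
src_dir <- file.path(tempdir(), "cran_src")
dir.create(src_dir, showWarnings = FALSE)
tarballs <- file.path(src_dir, c("pkgA_1.0.tar.gz", "pkgB_2.1.tar.gz"))
file.create(tarballs)

# Pretend the last build ran an hour ago and that pkgA is older than that.
last_build <- Sys.time() - 3600
Sys.setFileTime(tarballs[1], last_build - 3600)

# Keep only the tarballs modified since the last build.
info <- file.info(tarballs)
updated <- tarballs[info$mtime > last_build]

# Write one placeholder command per updated package into a batch file.
bat_file <- file.path(src_dir, "build.bat")
cmds <- paste("R CMD INSTALL --build", shQuote(basename(updated), type = "cmd"))
writeLines(c("@echo off", cmds), bat_file)
print(readLines(bat_file))
```

Keeping the selection step and the batch-file step separate makes each one easy to inspect on its own, which fits the debugging concern mentioned above.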
Uwe
Uwe Ligges