[R] efficiency and "forcing" questions

Wed Mar 28 19:43:28 CEST 2001

Dear R listers --

The program below does the following tasks:

1.  It creates a file (wintemp4) that is a subset of alldata4 consisting of
"winner" records in 50 industry groups (about 5400 obs);

2.  It defines a function (myppr1) that runs the ppr function in modreg
once to obtain a goodness-of-fit measure (sum of squared errors) for each
number of terms in the model, and then reruns ppr using the number of terms
with the lowest sum of squared errors;

3.  It grinds through a loop, subsetting wintemp4 by group and running
myppr1 for each group subset; and

4.  It puts the ppr output into a separate vector element for each group
(in an attempt to avoid "growing" the vector).
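
A minimal sketch of the preallocation idea in step 4, using a toy stand-in for the per-group fit. The names toy_data, toy_fit, groups and results are hypothetical and do not appear in the program below; the point is only that the list is sized once and then filled by index.

```r
# Toy data: 3 groups, hypothetical columns standing in for wintemp4
toy_data <- data.frame(group = rep(1:3, each = 4),
                       award = c(1, 2, 3, 4, 2, 4, 6, 8, 5, 5, 5, 5))

# Hypothetical stand-in for myppr1(): returns the mean award for one group
toy_fit <- function(x) mean(x$award)

groups  <- sort(unique(toy_data$group))
# Allocate the full-length list once, before the loop
results <- vector(mode = "list", length = length(groups))

for (i in seq(along = groups)) {
  sub_i <- toy_data[toy_data$group == groups[i], ]
  # Fill slot i in place; the list never changes length
  results[[i]] <- toy_fit(sub_i)
}
```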

I am using R version 1.2.2 in Emacs/ESS on Win98 with 256 MB RAM.

I have two questions; I would be most grateful for any help the list can
provide:

A.  This program *seems* to take a long time.  I have been careful to free
as much memory as I can, and the gc()'s seem to help avoid using the
swapfile and to keep available system resources above 90%.  Is there
anything else I can do to make the program more efficient?
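
One way to make question A measurable is to time each call separately with system.time(). The sketch below is only an illustration: slow_fit is a hypothetical placeholder for an expensive per-group fit, not myppr1 itself, and the sizes are arbitrary.

```r
# Hypothetical placeholder for an expensive per-group fit
slow_fit <- function(n) sum(sqrt(seq_len(n)))

sizes   <- c(1e4, 1e5, 1e6)
elapsed <- numeric(length(sizes))   # preallocated timing vector

for (i in seq(along = sizes)) {
  # system.time() reports user, system and elapsed times; keep elapsed
  elapsed[i] <- system.time(slow_fit(sizes[i]))["elapsed"]
}

# One row per call, to show which calls dominate the total run time
timings <- data.frame(size = sizes, elapsed = elapsed)
```

Applied to the real loop, a table like timings would show whether a few large groups account for most of the hour.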

B.  I say "seems" because after running the program for an hour, I typed
Ctrl-G to quit.  The *R* session seemed to terminate, with about 40 or
so groups processed, so I opened up another R session to try to see what
had happened.  After I quit the second session, suddenly the first session
seemed to come back to life and spit out the printed output for the rest of
the groups!  So I wonder if there is something I need to add to my program
to "force" it to finish processing?  (I apologize for the inarticulate way
I am posing this question!)
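
If the delayed output in question B is only a buffering effect (an assumption; the cause is not established here), a sketch like the following might make progress visible as each group finishes. report_progress is a hypothetical helper; flush.console() is the R function that asks the console to write out pending output.

```r
# Hypothetical helper: print a progress line and flush it immediately
report_progress <- function(i, n) {
  msg <- paste("finished group", i, "of", n)
  cat(msg, "\n")
  flush.console()   # push any buffered console output out now
  invisible(msg)    # return the message without printing it again
}

for (i in 1:3) report_progress(i, 3)
```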

Thanks in advance.

David N. Beede
Economist
Office of Policy Development
Economics and Statistics Administration
U.S. Department of Commerce
Room 4858 HCHB
14th Street and Pennsylvania Avenue, N.W.
Washington, DC  20230
Voice:  202.482.1226
Fax:    202.482.0325
e-mail:  david.beede at mail.doc.gov

#Here is the program
for(i in 1:4) gc()
load("alldata4.Rdata")
wintemp4 <- subset(alldata4, 1 <= group & group <= 50 & winner==1)
rm(alldata4)
for(i in 1:4) gc()
library(modreg)
attach(wintemp4)

myppr1 <- function(x)
{
#run ppr once to get the sums of squared errors for different numbers of terms
      pprfile.ppr <- ppr(
               award~
               ilogemp+ilogage+sdb+allsmall+
               size2+size3+size4+size5+size6+size7+size8+size9+size10+
               X.Iprimnaic.2+X.Iprimnaic.3+X.Iprimnaic.4+X.Iprimnaic.5+X.Iprimnaic.6+
               X.Iprimnaic.7+X.Iprimnaic.8+X.Iprimnaic.9+X.Iprimnaic.10+X.Iprimnaic.11+
               X.Iprimnaic.12+X.Iprimnaic.13+X.Iprimnaic.14+X.Iprimnaic.15+X.Iprimnaic.16+
               X.Iprimnaic.17+X.Iprimnaic.18+X.Iprimnaic.19+X.Iprimnaic.20+X.Iprimnaic.21+
               X.Iprimnaic.22+X.Iprimnaic.23+X.Iprimnaic.24+X.Iprimnaic.25+X.Iprimnaic.26,
               data=x, nterms=1, max.terms= min(nrow(x),40), optlevel=3
                        )
#pick number of terms giving best fit
         numterm <- which.min(pprfile.ppr$gofn[pprfile.ppr$gofn>0])
         pprfile.ppr <- ppr(
               award~
               ilogemp+ilogage+sdb+allsmall+
               size2+size3+size4+size5+size6+size7+size8+size9+size10+
               X.Iprimnaic.2+X.Iprimnaic.3+X.Iprimnaic.4+X.Iprimnaic.5+X.Iprimnaic.6+
               X.Iprimnaic.7+X.Iprimnaic.8+X.Iprimnaic.9+X.Iprimnaic.10+X.Iprimnaic.11+
               X.Iprimnaic.12+X.Iprimnaic.13+X.Iprimnaic.14+X.Iprimnaic.15+X.Iprimnaic.16+
               X.Iprimnaic.17+X.Iprimnaic.18+X.Iprimnaic.19+X.Iprimnaic.20+X.Iprimnaic.21+
               X.Iprimnaic.22+X.Iprimnaic.23+X.Iprimnaic.24+X.Iprimnaic.25+X.Iprimnaic.26,
               data=x, nterms=numterm, max.terms= min(nrow(x),40), optlevel=3
                            )
      cat("group =", x$group[1],"\n")
      cat("NAIC =", x$naic4[1],"\n")
      cat("cendiv =", as.character(x$cendiv[1]),"\n")
      cat("number of obs used =", nrow(x),"\n")
      print(summary(pprfile.ppr))
}

grouparr <- levels(as.factor(wintemp4$group))
pprest <- vector(mode="list",length=length(grouparr))

for(i in seq(along=grouparr))
  {
    subi <- subset(wintemp4,wintemp4$group==grouparr[i])
    if(nrow(subi) > 40) pprest[[i]] <- myppr1(subi)
    rm(subi)
    print(gc())
  }

detach(wintemp4)
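
The model formula is written out twice in myppr1. As a sketch, it could be built once from the variable names and reused; predictors and ppr_formula are hypothetical names introduced here, while the column names are copied from the program above.

```r
# Predictor names taken from the formula in myppr1
predictors <- c("ilogemp", "ilogage", "sdb", "allsmall",
                paste("size", 2:10, sep = ""),
                paste("X.Iprimnaic.", 2:26, sep = ""))

# Build award ~ ilogemp + ... + X.Iprimnaic.26 once
ppr_formula <- as.formula(paste("award ~", paste(predictors, collapse = " + ")))
```

The resulting ppr_formula could then be passed as the first argument of both ppr calls, so the two fits cannot drift apart if a term is added or dropped.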

2. How can one prevent "for loop" output data frame growth?
On p. 178 of "S Programming" by VR, there is a suggestion that it is more
efficient to create an object at least the size of the ultimate output
object, in order to avoid generating copies of the object at each iteration
of a for loop.  This seems easy enough for a vector, as illustrated by VR.
However, it is not obvious to me how to do this for the data frame I wish
to
