[R] rbind and data.frame [simplified]
Göran Broström
gb at stat.umu.se
Mon Dec 10 09:01:17 CET 2001
Thanks for the interest in my timing problem. I have stripped out all
calculations in order to isolate it, and it is obvious that size
matters a lot, and also that 'matrices are faster than data frames'.
I give you the full listing here, but it is really only the last few
lines (the loop) that are interesting (= slow):
The test function koll ('koll' ~ 'check', Swedish):
--------------------------------------------------------------------
koll <- function(dat, com.dat, com.ins, no.of.outrows = 1000){
    ## 'dat' is a data frame with variables:
    ##   bdate = birth date
    ##   enter = left truncation time
    ##   exit  = right censoring/event time
    ##   event = event indicator (0 if no event).
    ##   other covariates.
    ## com.dat is a data frame whose columns are communal covariates
    ## com.ins is a description of com.dat: (Is a vector for now!)
    ##   start year, period (length == 2)
    ## NOTE: any names(com.dat) must be != any names(dat) !!!
    nn <- nrow(dat)
    n.years <- nrow(com.dat)
    n.com <- ncol(com.dat) ## No. of communal covariates.
    ## if (nrow(com.ins) != n.com) stop("Error in com.ins: wrong no of rows")
    iv.length <- com.ins[2]
    cuts <- com.ins[1] + c(0, (1:n.years) * iv.length)
    beg.per <- cuts[1]
    n.yearsp1 <- n.years + 1
    end.per <- cuts[n.years + 1]
    get.iv <- function(dates)
        cbind(pmin(pmax(1, ceiling((dates[, 1] - beg.per) / iv.length)),
                   n.years),
              pmin(pmax(1, ceiling((dates[, 2] - beg.per) / iv.length)),
                   n.years))
    ## First, find the size of the new data frame (nn.out):
    nn.out <- 0
    ind.date <- cbind(dat$bdate + dat$enter, dat$bdate + dat$exit)
    ## Element-wise '&' (not '&&'): one TRUE/FALSE per row of 'dat'.
    cases <- ( (ind.date[, 1] < end.per) & (ind.date[, 2] > beg.per) )
    ind.iv <- get.iv(ind.date)
    ##return(ind.iv)
    nn.out <- sum(ind.iv[cases, 2] - ind.iv[cases, 1] + 1)
    ##return(nn.out)
    ## We now have 'nn.out'. We next create an empty data frame 'dat.out':
    xx <- cbind(dat[1, , drop = FALSE], com.dat[1, , drop = FALSE])
    dat.out <- matrix(NA, ncol = ncol(xx), nrow = nn.out)
    dat.out <- data.frame(dat.out)
    names(dat.out) <- names(xx)
    dat.out <- rbind(xx, dat.out)[-1, ]
    ##return(dat.out)
    ## And so we fill it!
    cat("Loop starting:\n")
    fixed.rec <- cbind(dat[1, , drop = FALSE], com.dat[1, , drop = FALSE])
    ## This part is the slow one (and simplified here) :
    for (cur.row in (1:no.of.outrows)){
        dat.out[cur.row, ] <- fixed.rec
        ## cbind(fixed.rec, com.dat[1, , drop = FALSE])
        ## cat("row = ", cur.row, "\n")
    }
    ## return(dat.out)
}
------------------------------------------------------------------------
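To make the interval bookkeeping in 'koll' concrete, here is a small
stand-alone sketch of the 'cuts' / 'get.iv' / 'nn.out' step. The numbers
(start year 1800, period length 1, three communal records, two individuals)
are made up for illustration and are not taken from the data in this post.

```r
## Hypothetical setup: 3 communal records, starting 1800, each 1 year long.
com.ins <- c(1800, 1)
n.years <- 3
iv.length <- com.ins[2]
cuts <- com.ins[1] + c(0, (1:n.years) * iv.length)  # interval end points
beg.per <- cuts[1]
end.per <- cuts[n.years + 1]

## Map each date to the index of the interval it falls in,
## clamped to the range 1..n.years.
get.iv <- function(dates)
    cbind(pmin(pmax(1, ceiling((dates[, 1] - beg.per) / iv.length)), n.years),
          pmin(pmax(1, ceiling((dates[, 2] - beg.per) / iv.length)), n.years))

## Two made-up individuals: calendar dates of entry (col 1) and exit (col 2).
ind.date <- cbind(c(1799.5, 1801.2), c(1801.5, 1805.0))

## Element-wise '&' gives one TRUE/FALSE per individual.
cases <- (ind.date[, 1] < end.per) & (ind.date[, 2] > beg.per)
ind.iv <- get.iv(ind.date)

## Each individual contributes one output row per interval it overlaps.
nn.out <- sum(ind.iv[cases, 2] - ind.iv[cases, 1] + 1)
print(ind.iv)
print(nn.out)
```

With these made-up dates the first individual overlaps intervals 1 to 2 and
the second intervals 2 to 3, so four output rows are needed.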
> str(com.dat)
`data.frame': 215 obs. of 7 variables:
$ V1: num 0.0000 0.0000 0.0807 0.0987 0.1801 ...
$ V2: num 0.0277 0.0467 0.0654 0.0831 0.0992 ...
$ V3: num -0.0277 -0.0467 0.0153 0.0156 0.0809 ...
$ V4: num 0.0000 0.0000 0.0000 0.0000 0.0162 ...
$ V5: num 0.00083 0.00132 0.00180 0.00224 0.00262 ...
$ V6: num -0.00083 -0.00132 -0.00180 -0.00224 0.01360 ...
$ V7: num 0.1905 0.0447 -0.4172 -0.1982 0.7761 ...
> str(dat)
`data.frame': 19848 obs. of 15 variables:
$ enter : num 57 58 59 60 63 ...
$ exit : num 58 59 60 63 64 ...
$ stdod2 : num 0 0 0 0 0 0 0 0 1 0 ...
$ stdod : num 0 0 0 0 0 0 0 0 29 0 ...
$ bdate : num 1754 1754 1754 1754 1754 ...
$ birthdate: num 1754 1754 1754 1754 1754 ...
$ sex : num 1 1 1 1 1 1 1 1 1 0 ...
$ stparity : num 0 0 0 0 0 0 0 0 0 0 ...
$ bthq : num 4 4 4 4 4 4 4 4 4 3 ...
$ bthpar : num 1 1 1 1 1 1 1 1 1 1 ...
$ socc : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 1 ...
$ parish : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ indiv : num 1e+08 1e+08 1e+08 1e+08 1e+08 ...
$ famil : num 1e+05 1e+05 1e+05 1e+05 1e+05 ...
$ familnu : num 1e+05 1e+05 1e+05 1e+05 1e+05 ...
Now some timings. In the first two (identical) examples the output data
frame is of the order of 55000 cases by 22 variables, but we fill only 100
of these cases:
> unix.time(koll(dat, com.dat, com.info[1, 1:2], 100))
[1] 48.70 23.86 74.00 0.00 0.00
Note that R seems to be 'learning':
> unix.time(koll(dat, com.dat, com.info[1, 1:2], 100))
[1] 33.00 23.28 57.69 0.00 0.0
In this example the output data frame is of size only around 300 x 22,
while exactly the same amount of information is written to it as above:
> unix.time(koll(dat[1:100, ], com.dat, com.info[1, 1:2], 100))
[1] 0.44 0.13 0.74 0.00 0.00
According to 'top' (I'm on Linux), no swapping is involved (I have
1.2 GB of memory).
> gc()
           used (Mb) gc trigger  (Mb)
Ncells  1346357 36.0    2251281  60.2
Vcells 12622828 96.4   23650735 180.5
So size matters! Note that the full scale function will take a couple
of hours even without any calculations at all.
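A minimal sketch of this size effect, not the code timed above: the helper
'fill.df' and its two-column layout are made up for illustration. It fills
the same 100 rows of a pre-allocated data frame, once with 300 rows and once
with 50000 rows. Each row assignment goes through the data frame replacement
method, so the cost per row can be expected to grow with the size of the
frame. Timings depend on machine and R version, so none are quoted here.

```r
## Fill the first 'n.fill' rows of a pre-allocated data frame, one at a time.
## 'fill.df' is a hypothetical helper, not part of 'koll'.
fill.df <- function(n.rows, n.fill = 100) {
    out <- data.frame(a = numeric(n.rows), b = numeric(n.rows))
    rec <- data.frame(a = 1, b = 2)        # one made-up record
    for (i in seq_len(n.fill)) out[i, ] <- rec
    out
}

## Same amount of information written, very different frame sizes.
t.small <- system.time(small <- fill.df(300))
t.large <- system.time(large <- fill.df(50000))
print(c(small = t.small[["elapsed"]], large = t.large[["elapsed"]]))
```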
Now the good part. If I rewrite 'koll' so that data are matrices instead
of data frames:
> unix.time(hej <- koll(haag, com.dat, com.info[1, 1:2], 50000))
[1] 1.67 0.22 1.89 0.00 0.00
NOTE: 50000 rows are filled here, not 100!
This is only ~3 times the time of the compiled code. That's great!!
(Of course, some time will be added by the real calculations.)
Moral: Avoid data frames for manipulations of this kind.
(Am I right?)
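The matrix approach can be sketched as follows, assuming all columns can be
held as numbers (factors such as 'socc' and 'parish' would have to be stored
as codes and rebuilt afterwards). The record 'rec' and its column names are
made up; this is an illustration, not the rewritten 'koll'.

```r
n.rows <- 50000
rec <- c(a = 1, b = 2, c = 3)     # one made-up numeric record

## Pre-allocate a numeric matrix of the final size and fill it row by row.
mat <- matrix(NA_real_, nrow = n.rows, ncol = length(rec),
              dimnames = list(NULL, names(rec)))
t.mat <- system.time(for (i in seq_len(n.rows)) mat[i, ] <- rec)

## Convert to a data frame once, at the very end.
dat.out <- as.data.frame(mat)
print(t.mat[["elapsed"]])
```

The point of the design is that the data frame is created in a single call
after the loop, so the loop itself only ever touches a plain matrix.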
Göran
On Fri, 7 Dec 2001 james.holtman at convergys.com wrote:
>
> Here are some timings from a 700 MHz laptop running WIN/2000:
>
> > x.1 <- data.frame(a=integer(85000), b=double(85000), c=character(85000))
> > str(x.1)
> `data.frame': 85000 obs. of 3 variables:
> $ a: int 0 0 0 0 0 0 0 0 0 0 ...
> $ b: num 0 0 0 0 0 0 0 0 0 0 ...
> $ c: Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...
> #
> # loading up a variable with a vector takes very little time
> #
> > system.time(x.1$a <- 1:85000)
> [1] 0.03 0.00 0.03 NA NA
> > str(x.1)
> `data.frame': 85000 obs. of 3 variables:
> $ a: int 1 2 3 4 5 6 7 8 9 10 ...
> $ b: num 0 0 0 0 0 0 0 0 0 0 ...
> $ c: Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...
> #
> # a 'for' loop by itself is only 0.3 seconds
> #
> > system.time(for (i in 1:85000)invisible(1))
> [1] 0.30 0.00 0.31 NA NA
> #
> # it takes me 5 seconds to initialize 85,000 of a variable, so I would assume
> # it would depend on how many and what type. If 'factors', I would assume you
> # would declare those as 'character' and then convert to 'factor' at the end.
> # so it seems fast; is there something I am missing?
> #
> > system.time(for (i in 1:85000) x.1$a[i] <- i)
> [1] 5.12 0.04 5.22 NA NA
> >
>
>
>
>
> "Liaw, Andy" <andy_liaw at merck.com>@stat.math.ethz.ch on 12/07/2001 10:32:31
>
> Sent by: owner-r-help at stat.math.ethz.ch
>
>
> To: r-help at stat.math.ethz.ch
> cc:
> Subject: RE: [R] rbind and data.frame
>
>
> Are you sure that the time difference is *only* in creating the data frame,
> rather than other computations in the loop?
>
> Andy
>
> > -----Original Message-----
> > From: Göran Broström [mailto:gb at stat.umu.se]
> > Sent: Friday, December 07, 2001 7:25 AM
> > To: Prof Brian Ripley
> > Cc: r-help at stat.math.ethz.ch
> > Subject: Re: [R] rbind and data.frame
> >
> >
> > On Fri, 7 Dec 2001, Prof Brian Ripley wrote:
> >
> > > On Fri, 7 Dec 2001, Göran Broström wrote:
> > >
> > > > On Wed, 5 Dec 2001, Göran Broström wrote:
> > > >
> > > > [...]
> > > >
> > > > > My real problem is how to create a data frame in a sequentially growing
> > > > > manner, when I know the final size (no of cases). I want to avoid to
> > > > > call 'rbind' many times, and instead create an 'empty' data frame in
> > > > > one call, and then fill it. Are there better ways of doing this?
> > > >
> > > > Got no answer to this one, so I provide one myself:
> > >
> > > The usual answer is to create a data frame of the desired size and
> > > populate it via indexing. That's in some books I know!
> >
> > I know that book too (thanks!). I did what you suggest, and that took 7
> > hours to run. Definitely.
> >
> > Göran
> >
> > > >
> > > > The answer is: Yes, definitely. I did this, with pure R code, and
> > > > created a new data frame with around 58000 records. It took 7 hours to
> > > > run. I then did it with compiled code (Fortran), and that made a slight
> > > > difference: It took 4.8 seconds(!).
> > > >
> > > > Göran
> > > >
> > > >
> > -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
> > -.-.-.-.-.-.-.-.-
> > > > r-help mailing list -- Read
> > http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> > > > Send "info", "help", or "[un]subscribe"
> > > > (in the "body", not the subject !) To:
> > r-help-request at stat.math.ethz.ch
> > > >
> > _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
> > _._._._._._._._._
> > > >
> > >
> > >
> >
> > --
> > Göran Broström tel: +46 90 786 5223
> > professor fax: +46 90 786 6614
> > Department of Statistics http://www.stat.umu.se/egna/gb/
> > Umeå University
> > SE-90187 Umeå, Sweden e-mail: gb at stat.umu.se
> >
> >
>
>
>
>
>
>
--
Göran Broström tel: +46 90 786 5223
professor fax: +46 90 786 6614
Department of Statistics http://www.stat.umu.se/egna/gb/
Umeå University
SE-90187 Umeå, Sweden e-mail: gb at stat.umu.se