[R] Optimise huge data.frame construction

Daniele Amberti daniele.amberti at ors.it
Wed Feb 24 11:29:25 CET 2010


Thanks Moshe,
I already allocate a matrix and grow it by 5000 rows at a time (I found empirically that there is not much performance gain going above this number). This gives me close to linear behaviour in computational time, but it is still slow. Accessing the DB and the calculations take another 50% of the time (and accessing the matrix and storing the results takes 30% to 40%), so creating the correctly sized object just once does not solve the problem of how to access and store values inside a data.frame/matrix for further data analysis.
Any other suggestion?

Regards
Daniele


-----Original Message-----
From: Moshe Olshansky [mailto:m_olshansky at yahoo.com]
Sent: 24 February 2010 11:09
To: r-help at r-project.org; Daniele Amberti
Subject: Re: [R] Optimise huge data.frame construction

Hi Daniele,

One possibility would be to make two runs. In the first run you are not building the matrix but just calculating the number of rows you need (in a loop). Then you allocate such matrix (only once) and fill it in the second run.
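
A minimal sketch of the two-run idea described above, in case it helps to see it written out. The helpers count_rows() and get_rows() are hypothetical stand-ins for the real database query and calculations (they are not from the original post), and the variable set A to D is taken from the question. Cells for variables that an ID does not have are left at 0 by initialising the matrix with 0.

```r
# Hypothetical stand-ins for the per-ID database work (assumptions, not
# the poster's code): a cheap row count and the full per-ID result.
count_rows <- function(id) 14L + (id * 7L) %% 20L
get_rows <- function(id) {
  n <- count_rows(id)
  data.frame(ID = id, timestamp = seq_len(n),
             val = as.numeric(seq_len(n)), B = 2)
}

ids  <- 1:100
vars <- c("A", "B", "C", "D")

# First run: only count the rows each ID will contribute.
counts <- vapply(ids, count_rows, integer(1))

# Allocate the full matrix once, filled with 0 so that variables
# missing for an ID are already 0.
out <- matrix(0, nrow = sum(counts), ncol = 3 + length(vars),
              dimnames = list(NULL, c("ID", "timestamp", "val", vars)))

# Second run: write each ID's block into its precomputed row range.
ends   <- cumsum(counts)
starts <- ends - counts + 1L
for (i in seq_along(ids)) {
  block <- get_rows(ids[i])
  out[starts[i]:ends[i], names(block)] <- as.matrix(block)
}
```

Whether this pays off depends on how cheap the counting run is compared with the full query; if counting needs the same database round trip, the saving may be small.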

Regards,
Moshe.

--- On Wed, 24/2/10, Daniele Amberti <daniele.amberti at ors.it> wrote:

> From: Daniele Amberti <daniele.amberti at ors.it>
> Subject: [R] Optimise huge data.frame construction
> To: "r-help at r-project.org" <r-help at r-project.org>
> Received: Wednesday, 24 February, 2010, 8:34 PM
> I have data for different items (ID) in a database.
> For each ID I have to get:
>
> - the timestamp of the observation (timestamp);
> - a numerical value (val) that will be my response variable in some
>   kind of model;
> - a variable number of variables from a known set (if the value for a
>   specific variable is not present in the DB, it is 0).
>
> To get the above-mentioned values I have to cycle over IDs, make
> some calculations and store the results to construct a huge
> data.frame for subsequent estimations. The number of rows for each
> ID is random (typically 14 to 200).
>
> My current approach is to construct a matrix like this:
>
> vars <- c('A', 'B', 'C', 'D')
> out <- matrix(-1, 5000, 3 + length(vars),
>               dimnames = list(1:5000, c('ID', 'timestamp', 'val', vars)))
>
> I access the out matrix by numerical index to substitute
> values ( out[1:n,1] <- k ).
> When the matrix is full I add 5000 rows and go on.
> Afterwards I remove the rows with ID still set to -1 and then
> replace all other -1 values with 0.
>
> For my application an ID typically has between 14 and 200
> observations (mean around 50), but I have 15000 IDs ...
> After profiling I realized that accessing the out matrix
> this way is too slow.
>
> Do you have any idea how to speed up this kind of process?
> I think something can be done by creating a data.frame for
> each ID and binding them at the end. Is it a good idea? How can
> I implement that? A list of data.frames? And then?
>
> Below is some code that may be useful if someone would like to
> experiment ...
>
> alist <- vector('list', 3)
> alist[[1]] <- data.frame( ID = 1, timestamp = 1:14, val = rnorm(14),
>                           A = 1, B = 2, C = 3 )
> alist[[2]] <- data.frame( ID = 2, timestamp = 2:15, val = rnorm(14),
>                           B = 2, C = 3, D = 4 )
> alist[[3]] <- data.frame( ID = 3, timestamp = 3:30, val = rnorm(28),
>                           C = 1, D = 2 )
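
One possible way to go from such a list to a single data.frame, sketched in base R under the assumption that the full set of variables (A to D here) is known in advance, as the post states. The helper pad() and the name all_cols are illustrative, not from the original code. Each per-ID data.frame gets its missing variables added as zeros, columns are put in a common order, and everything is bound once at the end.

```r
# The example list from the post: three IDs with different variable sets.
alist <- list(
  data.frame(ID = 1, timestamp = 1:14, val = rnorm(14), A = 1, B = 2, C = 3),
  data.frame(ID = 2, timestamp = 2:15, val = rnorm(14), B = 2, C = 3, D = 4),
  data.frame(ID = 3, timestamp = 3:30, val = rnorm(28), C = 1, D = 2)
)

# Known, fixed set of output columns (assumed known up front).
all_cols <- c("ID", "timestamp", "val", "A", "B", "C", "D")

# Add every variable the data.frame lacks as a column of zeros,
# then return the columns in the common order.
pad <- function(df) {
  for (v in setdiff(all_cols, names(df))) df[[v]] <- 0
  df[all_cols]
}

# Pad each piece, then bind all pieces in a single rbind call.
result <- do.call(rbind, lapply(alist, pad))
```

Binding once at the end avoids repeatedly copying a growing object. With many thousands of pieces, do.call(rbind, ...) on data.frames can itself become slow; data.table's rbindlist(fill = TRUE) does a similar job, though it fills missing columns with NA rather than 0.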
>
>
> Thanks in advance for your valuable help.
> Daniele
>

ORS Srl

Via Agostino Morando 1/3 12060 Roddi (Cn) - Italy
Tel. +39 0173 620211
Fax. +39 0173 620299 / +39 0173 433111
Web Site www.ors.it

------------------------------------------------------------------------------------------------------------------------
Any unauthorised use of this message and its attachments is prohibited and may constitute a criminal offence.
If you have received this message in error, we would be grateful if you would destroy it
and any attachments.
Opinions, conclusions or other information contained in this e-mail that do not relate to the business and/or
corporate mission of O.R.S. Srl are to be understood as not attributable to the company itself, nor do they bind it in any way.


