[R] Agreggating data using external aggregation rules
ANGELO.LINARDI at bancaditalia.it
ANGELO.LINARDI at bancaditalia.it
Fri Jun 6 12:12:36 CEST 2008
Dear R experts,
I am currently facing a tricky problem which I have read a lot about in
the various R mailing lists without finding exactly what I need.
I have a big data frame DF (about 2,000,000 rows) with 7 columns being
variables and 1 being a measure (using reshape package nomeclature).
There are no "duplicates" in it.
Fot each of the variables I have some "rules" to apply, being COD_IN the
value of the variable in the DF, COD_OUT the one to be transformed to;
once obtained the "new codes" in the DF I have to aggregate the "new DF"
(for example summing the measure).
Usually the total transformation (merge+aggregate) really decreases the
number of lines in the data frame, but sometimes it can grows depending
on the rule. Just to give an idea, the first "rule" in v1 maps 820
different values into 7 ones.
Using SQL and a database this can be done in a very straightforward way
(for example on the variable v1):
Select COD_OUT, v2, v3, v4, v5, v6, v7, sum(measure)
>From DF, RULE_v1
Where v1=COD_IN
Group by v2, v3,v4, v5, v6, v7
So the first choice would be using a database; the second one would be
splitting the data frame and then joining the results.
Is there any other possibility to merge+aggregate caused by the merge ?
Thank you in advance
Angelo Linardi
** Le e-mail provenienti dalla Banca d'Italia sono trasmesse in buona fede e non
comportano alcun vincolo ne' creano obblighi per la Banca stessa, salvo che cio' non
sia espressamente previsto da un accordo scritto.
Questa e-mail e' confidenziale. Qualora l'avesse ricevuta per errore, La preghiamo di
comunicarne via e-mail la ricezione al mittente e di distruggerne il contenuto. La
informiamo inoltre che l'utilizzo non autorizzato del messaggio o dei suoi allegati
potrebbe costituire reato. Grazie per la collaborazione.
-- E-mails from the Bank of Italy are sent in good faith but they are neither binding on
the Bank nor to be understood as creating any obligation on its part except where
provided for in a written agreement. This e-mail is confidential. If you have received it
by mistake, please inform the sender by reply e-mail and delete it from your system.
Please also note that the unauthorized disclosure or use of the message or any
attachments could be an offence. Thank you for your cooperation. **
More information about the R-help
mailing list