[R] Sum of column from another df based of row values of df1

Shivam shivamsingh at gmail.com
Sun Sep 9 20:58:49 CEST 2012


Dear All,

I need to sum a column from another dataframe based on the row values
of one dataframe. I am stuck in a loop trying to accomplish it and at
current speed it will take more than 80 hours to complete. Needless to
say I am looking for a more elegant/quicker solution. Really need some
help here. Here is the issue:

I have a dataframe CALL (the dput of head is given below) which has
close to a million rows. There are 2 date columns which are of
importance, DATE and EXPDATE. There is another dataframe, VOL (dput of
head given), which has 2 columns, DATE and VOL. It has the volatility
corresponding to each day and it has a total of 124 records
(corresponding to 6 months). I want to add another column in the CALL
dataframe which would contain the sum of all the volatilities from the
VOL df for the period specified by the interval of DATE and EXPDATE in
each row of CALL df.

For ex: In the first row, DATE is '03-01-2011' and EXPDATE is
'27-01-2011'. So I want the SUM column (A new column in CALL df) to
contain the sum of volatilities of 03-01, 04-01, 05-01 .... till 27-01
from the VOL dataframe.

I have to repeat this process for all the rows in the dataframe. Here
is the for-loop version of the solution:

for (k in 1:nrow(CALL)){
CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE >= CALL$DATE[k] & VOL$DATE
<= CALL$EXPDATE[k]))
}

The loop will run for close to a million times, it has been running
for more than 10 hours and its just 12% complete. It would take more
than 80 hours to complete, not the mention the toll it would take on
my laptop. So is there a better way that I can accomplish this task?
Any input would be greatly appreciated. Below are the dput of the two
dataframes.

One point of note is that there are only 124 DISTINCT values of DATE
and 6 DISTINCT values of EXPDATE, in case it can be used in some way.

> dput(CALL)
structure(list(NAME = c("STK", "STK", "STK", "STK", "STK",
"STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L,
6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30",
"15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3,
101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L,
2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422,
1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct",
"POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c("NAME",
"EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM",
"TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")


> dput(VOL)
structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800,
1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt"
), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05,
5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865,
0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA,
6L), class = "data.frame")

Please do let me know if any more information from my side would help
or if I need to explain the issue more clearly.

Any minor improvement will be great help.

Thanks in advance.

-Shivam


-- 
*Victoria Concordia Crescit*



More information about the R-help mailing list