[R] Sum of column from another df based of row values of df1
jim holtman
jholtman at gmail.com
Sun Sep 9 22:03:19 CEST 2012
How about an improvement to 16 seconds. The first thing to do is to
convert you data to a matrix because accessing data in a dataframe is
very expensive. If you run Rprof on your code you will see that all
the time is spent in retrieving the information. Converting to a
matrix and using matrix accessing is considerably faster. I did
convert the POSIXct to Date. You were also paying a lot in the
constant conversion of POSIXct to Date for your comparisons. I just
replicated your CALL to 1 million rows for testing.
> CALL <-
+ structure(list(NAME = c("STK", "STK", "STK", "STK", "STK",
+ "STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
+ 15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L,
+ 6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30",
+ "15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3,
+ 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L,
+ 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
+ 14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422,
+ 1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct",
+ "POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c("NAME",
+ "EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM",
+ "TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
>
> VOL <-
+ structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800,
+ 1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt"
+ ), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05,
+ 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865,
+ 0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA,
+ 6L), class = "data.frame")
>
> # convert to matrices for faster testing
> mCALL <- cbind(CALL$DATE, CALL$EXPDATE)
> mVOL <- cbind(as.Date(VOL$DATE), VOL$VOL) # convert POSIXct to Date
>
> # create 1M rows in mCALL
> mCALL <- rbind(mCALL, mCALL[rep(1L, 1e6),])
>
> result <- numeric(nrow(mCALL))
> system.time({
+ for (i in 1:nrow(mCALL)){
+ result[i] <- sum(mVOL[(mVOL[, 1L] >= mCALL[i,1L])
+ & (mVOL[, 1L] <= mCALL[i, 2L]), 2L])
+ }
+ })
user system elapsed
15.94 0.00 16.07
>
>
>
>
On Sun, Sep 9, 2012 at 2:58 PM, Shivam <shivamsingh at gmail.com> wrote:
> Dear All,
>
> I need to sum a column from another dataframe based on the row values
> of one dataframe. I am stuck in a loop trying to accomplish it and at
> current speed it will take more than 80 hours to complete. Needless to
> say I am looking for a more elegant/quicker solution. Really need some
> help here. Here is the issue:
>
> I have a dataframe CALL (the dput of head is given below) which has
> close to a million rows. There are 2 date columns which are of
> importance, DATE and EXPDATE. There is another dataframe, VOL (dput of
> head given), which has 2 columns, DATE and VOL. It has the volatility
> corresponding to each day and it has a total of 124 records
> (corresponding to 6 months). I want to add another column in the CALL
> dataframe which would contain the sum of all the volatilities from the
> VOL df for the period specified by the interval of DATE and EXPDATE in
> each row of CALL df.
>
> For ex: In the first row, DATE is '03-01-2011' and EXPDATE is
> '27-01-2011'. So I want the SUM column (A new column in CALL df) to
> contain the sum of volatilities of 03-01, 04-01, 05-01 .... till 27-01
> from the VOL dataframe.
>
> I have to repeat this process for all the rows in the dataframe. Here
> is the for-loop version of the solution:
>
> for (k in 1:nrow(CALL)){
> CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE >= CALL$DATE[k] & VOL$DATE
> <= CALL$EXPDATE[k]))
> }
>
> The loop will run for close to a million times, it has been running
> for more than 10 hours and its just 12% complete. It would take more
> than 80 hours to complete, not the mention the toll it would take on
> my laptop. So is there a better way that I can accomplish this task?
> Any input would be greatly appreciated. Below are the dput of the two
> dataframes.
>
> One point of note is that there are only 124 DISTINCT values of DATE
> and 6 DISTINCT values of EXPDATE, in case it can be used in some way.
>
>> dput(CALL)
> structure(list(NAME = c("STK", "STK", "STK", "STK", "STK",
> "STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
> 15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L,
> 6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30",
> "15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3,
> 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L,
> 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
> 14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422,
> 1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct",
> "POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c("NAME",
> "EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM",
> "TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
>
>
>> dput(VOL)
> structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800,
> 1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt"
> ), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05,
> 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865,
> 0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA,
> 6L), class = "data.frame")
>
> Please do let me know if any more information from my side would help
> or if I need to explain the issue more clearly.
>
> Any minor improvement will be great help.
>
> Thanks in advance.
>
> -Shivam
>
>
> --
> *Victoria Concordia Crescit*
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
More information about the R-help
mailing list