[R] How to speed up interpolation

jim holtman jholtman at gmail.com
Mon Jul 18 04:11:27 CEST 2011


Here is what I did; convert the data to a numeric matrix for faster
processing.  You can convert back to a dataframe since you have the
indices into the levels for the flights and runways.

> # read in data
> source('/temp/df/df')
> # convert to matrix
> df.mat <- cbind(pt = as.numeric(df$PredTime)
+               , dt = as.numeric(df$dt)
+               , rw = as.numeric(df$lrw)  # index into 'levels'
+               , flight = as.numeric(df$flightfact)
+               )
> # create a list of row numbers for each flight for processing
> flgt.list <- split(seq(nrow(df.mat)), df.mat[, 'flight'])
> # remove lists with only 1 entry
> flgt.list <- flgt.list[sapply(flgt.list, length) > 1]
>
> # create the interval we want data for
> interval <- as.numeric(0:60)
>
> # now process the flights
> times <- lapply(flgt.list, function(.flt){
+     interp <- approx(df.mat[.flt, 'pt']
+                     , df.mat[.flt, 'dt']
+                     , xout = interval
+                     , rule = 1
+                     )
+     # return vector
+     cbind(time = interp$x
+       , error = interp$y
+       , runway = df.mat[.flt[1L], 'rw']
+       , flight = df.mat[.flt[1L], 'flight']
+       )
+ })
> # sample output -- is this correct?
> times[[1]]
      time       error runway flight
 [1,]    0          NA      2      1
 [2,]    1          NA      2      1
 [3,]    2 -0.13795380      2      1
 [4,]    3 -0.20726073      2      1
 [5,]    4 -0.27309237      2      1
 [6,]    5 -0.33333333      2      1
 [7,]    6 -0.09322419      2      1
 [8,]    7  0.14688495      2      1
 [9,]    8  0.38699409      2      1
[10,]    9  0.62710323      2      1
[11,]   10  0.86721237      2      1
[12,]   11  1.10732151      2      1
[13,]   12  1.34743065      2      1
[14,]   13  1.58753979      2      1
[15,]   14  1.82764893      2      1
[16,]   15  2.06775807      2      1
[17,]   16  2.30786721      2      1
[18,]   17  2.54797635      2      1
[19,]   18  6.66600000      2      1
[20,]   19  4.82600000      2      1
[21,]   20  3.00436508      2      1
[22,]   21  2.22316562      2      1
[23,]   22  1.34895178      2      1
[24,]   23  0.47473795      2      1
[25,]   24 -0.39947589      2      1
[26,]   25 -1.27368973      2      1
[27,]   26 -2.12478632      2      1
[28,]   27 -1.61196581      2      1
[29,]   28 -1.09914530      2      1
[30,]   29 -0.58632479      2      1
[31,]   30 -0.07350427      2      1
[32,]   31  0.43931624      2      1
[33,]   32  0.95213675      2      1
[34,]   33  1.46495726      2      1
[35,]   34  1.97777778      2      1
[36,]   35  2.49059829      2      1
[37,]   36  3.00341880      2      1
[38,]   37  3.51623932      2      1
[39,]   38  4.02905983      2      1
[40,]   39  4.54188034      2      1
[41,]   40  5.05470085      2      1
[42,]   41  5.53360434      2      1
[43,]   42  5.53766938      2      1
[44,]   43  5.54173442      2      1
[45,]   44  5.54579946      2      1
[46,]   45  5.54986450      2      1
[47,]   46  5.55392954      2      1
[48,]   47  5.55799458      2      1
[49,]   48  5.56205962      2      1
[50,]   49  5.56612466      2      1
[51,]   50  5.57018970      2      1
[52,]   51  5.57425474      2      1
[53,]   52  5.57831978      2      1
[54,]   53  5.58238482      2      1
[55,]   54  5.58644986      2      1
[56,]   55  5.59051491      2      1
[57,]   56  5.59457995      2      1
[58,]   57  5.59864499      2      1
[59,]   58  5.60271003      2      1
[60,]   59  5.60677507      2      1
[61,]   60  5.61084011      2      1


On Sun, Jul 17, 2011 at 6:58 PM, James Rome <jamesrome at gmail.com> wrote:
> I thought I had included the data... Here it is again.
>
> What I want to do is to make box and whisker plots with each flight
> counted the same number of times in each time bin. Hence the
> interpolation to minute time hacks.
>
>
> On 7/17/2011 4:16 PM, jim holtman wrote:
>> It would be nice if you had some sample data included so that we could
>> see how the code worked.  Have you use Rprof on the code to see where
>> you are spending your time?  You might want to use 'matrix' instead of
>> 'data.frames' since there is a big performance impact with dataframes
>> when indexing.  A little more description of the problem you are
>> trying to solve would also be useful.  I tend to ask people "tell me
>> what you want to do, not how you want to do it".
>>
>> On Sun, Jul 17, 2011 at 1:30 PM, James Rome <jamesrome at gmail.com> wrote:
>>> df is a very large data frame with arrival estimates for many flights
>>> (DF$flightfact) at random times (df$PredTime). The error of the estimate
>>> is df$dt.
>>> My problem is that I want to know the prediction error at each minute
>>> before landing. This code works, but is very slow, and dominates
>>> everything. I tried using split(), but that rapidly ate up my 12 GB of
>>> memory. So, is there a better R way of doing this?
>>>
>>> Thanks,
>>> Jim Rome
>>>
>>>    flights = table(df$flightfact[1:dim(df)[1], drop=TRUE])
>>>    nflights = length(flights)
>>>    flights = as.data.frame(flights)
>>>    times = data.frame()
>>>    # Split by flight
>>>    for(i in 1:nflights) {
>>>        tf = df[as.numeric(df$flightfact)==flights[i,1],]    # This flight
>>>        #check for at least 2 entries
>>>        if(dim(tf)[1] < 2) {
>>>            next
>>>        }
>>>        idf = interpolateTimes(tf)
>>>        times = rbind(times, idf)
>>>    }
>>>
>>> # Interpolate the times to every minute for 60 minutes
>>> # Return a new data frame
>>> interpolateTimes = function(df) {
>>>   x = as.numeric(seq(from=0,to=60)) # The times to interpolate to
>>>   dti = approx(as.numeric(df$PredTime), as.numeric(df$dt), x,
>>> method="linear",rule=1:1)
>>>   # Make a new data frame of interpolated values
>>>   idf = data.frame(time=dti$x, error=dti$y,
>>>       runway=rep(df$lrw[1],length(dti$x)),
>>> flight=rep(df$flightfact[1], length(dti$x)))
>>>   return(idf)
>>> }
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>>
>
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?



More information about the R-help mailing list