[R] How to speed up interpolation

Tue Jul 19 00:58:30 CEST 2011

To recover the runways, try:

levels(df$lrw)[times[, 'runway']]

The 'runway' column has the index into 'levels(df$lrw)'

On Mon, Jul 18, 2011 at 4:35 PM, James Rome <jamesrome at gmail.com> wrote:
> There is one problem. No matter what I do, I can't recover the correct
> runway in the final list.
> You had "rw = as.numeric(df$lrw) # index into 'levels' "
>
> I have tried
>   df$lrw = factor(df$lrw, ordered=TRUE)
>    rwys = factor(unique(df$lrw), ordered=TRUE)    # Get the names of
> the runways
>
>  > rwys
>  [1] 04R 27  04L 33L 15R 22L NON
>  Levels: 04L < 04R < 15R < 22L < 27  < 33L < NON
>  > head(df$lrw)
>  [1] 04L 04L 04L 04L 04L 04L
>  Levels: 04L < 04R < 15R < 22L < 27  < 33L < NON
> Which seem to order things the same way.
>  > rn = as.numeric(head(df$lrw))
>  > rn
>  [1] 1 1 1 1 1 1
>
> So I should be able to get back my original runways with
>> rwys[rn]
> [1] 04R 04R 04R 04R 04R 04R
> Levels: 04L < 04R < 15R < 22L < 27  < 33L < NON
>
> So I get 04R instead of 04L
>> rwys[1]
> [1] 04R
> Levels: 04L < 04R < 15R < 22L < 27  < 33L < NON
>> rwys[2]
> [1] 27
> Levels: 04L < 04R < 15R < 22L < 27  < 33L < NON
>
> I note that
>> rwys = as.vector(rwys)
>> rwys
> [1] "04R" "27 " "04L" "33L" "15R" "22L" "NON"
>
> So what dumb thing am I doing here? How do I reorder the original df$lrw
> to match the order in rwys?
>
> Thanks,
> Jim
>
> On 7/17/2011 10:11 PM, jim holtman wrote:
>> Here is what I did; convert the data to a numeric matrix for faster
>> processing.  You can convert back to a dataframe since you have the
>> indices into the levels for the flights and runways.
>>
>>> # read in data
>>> source('/temp/df/df')
>>> # convert to matrix
>>> df.mat <- cbind(pt = as.numeric(df$PredTime)
>> +               , dt = as.numeric(df$dt)
>> +               , rw = as.numeric(df$lrw)  # index into 'levels'
>> +               , flight = as.numeric(df$flightfact)
>> +               )
>>> # create a list of row numbers for each flight for processing
>>> flgt.list <- split(seq(nrow(df.mat)), df.mat[, 'flight'])
>>> # remove lists with only 1 entry
>>> flgt.list <- flgt.list[sapply(flgt.list, length) > 1]
>>>
>>> # create the interval we want data for
>>> interval <- as.numeric(0:60)
>>>
>>> # now process the flights
>>> times <- lapply(flgt.list, function(.flt){
>> +     interp <- approx(df.mat[.flt, 'pt']
>> +                     , df.mat[.flt, 'dt']
>> +                     , xout = interval
>> +                     , rule = 1
>> +                     )
>> +     # return vector
>> +     cbind(time = interp$x
>> +       , error = interp$y
>> +       , runway = df.mat[.flt[1L], 'rw']
>> +       , flight = df.mat[.flt[1L], 'flight']
>> +       )
>> + })
>>> # sample output -- is this correct?
>>> times[[1]]
>>       time       error runway flight
>>  [1,]    0          NA      2      1
>>  [2,]    1          NA      2      1
>>  [3,]    2 -0.13795380      2      1
>>  [4,]    3 -0.20726073      2      1
>>  [5,]    4 -0.27309237      2      1
>>  [6,]    5 -0.33333333      2      1
>>  [7,]    6 -0.09322419      2      1
>>  [8,]    7  0.14688495      2      1
>>  [9,]    8  0.38699409      2      1
>> [10,]    9  0.62710323      2      1
>> [11,]   10  0.86721237      2      1
>> [12,]   11  1.10732151      2      1
>> [13,]   12  1.34743065      2      1
>> [14,]   13  1.58753979      2      1
>> [15,]   14  1.82764893      2      1
>> [16,]   15  2.06775807      2      1
>> [17,]   16  2.30786721      2      1
>> [18,]   17  2.54797635      2      1
>> [19,]   18  6.66600000      2      1
>> [20,]   19  4.82600000      2      1
>> [21,]   20  3.00436508      2      1
>> [22,]   21  2.22316562      2      1
>> [23,]   22  1.34895178      2      1
>> [24,]   23  0.47473795      2      1
>> [25,]   24 -0.39947589      2      1
>> [26,]   25 -1.27368973      2      1
>> [27,]   26 -2.12478632      2      1
>> [28,]   27 -1.61196581      2      1
>> [29,]   28 -1.09914530      2      1
>> [30,]   29 -0.58632479      2      1
>> [31,]   30 -0.07350427      2      1
>> [32,]   31  0.43931624      2      1
>> [33,]   32  0.95213675      2      1
>> [34,]   33  1.46495726      2      1
>> [35,]   34  1.97777778      2      1
>> [36,]   35  2.49059829      2      1
>> [37,]   36  3.00341880      2      1
>> [38,]   37  3.51623932      2      1
>> [39,]   38  4.02905983      2      1
>> [40,]   39  4.54188034      2      1
>> [41,]   40  5.05470085      2      1
>> [42,]   41  5.53360434      2      1
>> [43,]   42  5.53766938      2      1
>> [44,]   43  5.54173442      2      1
>> [45,]   44  5.54579946      2      1
>> [46,]   45  5.54986450      2      1
>> [47,]   46  5.55392954      2      1
>> [48,]   47  5.55799458      2      1
>> [49,]   48  5.56205962      2      1
>> [50,]   49  5.56612466      2      1
>> [51,]   50  5.57018970      2      1
>> [52,]   51  5.57425474      2      1
>> [53,]   52  5.57831978      2      1
>> [54,]   53  5.58238482      2      1
>> [55,]   54  5.58644986      2      1
>> [56,]   55  5.59051491      2      1
>> [57,]   56  5.59457995      2      1
>> [58,]   57  5.59864499      2      1
>> [59,]   58  5.60271003      2      1
>> [60,]   59  5.60677507      2      1
>> [61,]   60  5.61084011      2      1
>>
>>
>> On Sun, Jul 17, 2011 at 6:58 PM, James Rome <jamesrome at gmail.com> wrote:
>>> I thought I had included the data... Here it is again.
>>>
>>> What I want to do is to make box and whisker plots with each flight
>>> counted the same number of times in each time bin. Hence the
>>> interpolation to minute time hacks.
>>>
>>>
>>> On 7/17/2011 4:16 PM, jim holtman wrote:
>>>> It would be nice if you had some sample data included so that we could
>>>> see how the code worked.  Have you use Rprof on the code to see where
>>>> you are spending your time?  You might want to use 'matrix' instead of
>>>> 'data.frames' since there is a big performance impact with dataframes
>>>> when indexing.  A little more description of the problem you are
>>>> trying to solve would also be useful.  I tend to ask people "tell me
>>>> what you want to do, not how you want to do it".
>>>>
>>>> On Sun, Jul 17, 2011 at 1:30 PM, James Rome <jamesrome at gmail.com> wrote:
>>>>> df is a very large data frame with arrival estimates for many flights
>>>>> (DF$flightfact) at random times (df$PredTime). The error of the estimate
>>>>> is df$dt.
>>>>> My problem is that I want to know the prediction error at each minute
>>>>> before landing. This code works, but is very slow, and dominates
>>>>> everything. I tried using split(), but that rapidly ate up my 12 GB of
>>>>> memory. So, is there a better R way of doing this?
>>>>>
>>>>> Thanks,
>>>>> Jim Rome
>>>>>
>>>>>    flights = table(df$flightfact[1:dim(df)[1], drop=TRUE])
>>>>>    nflights = length(flights)
>>>>>    flights = as.data.frame(flights)
>>>>>    times = data.frame()
>>>>>    # Split by flight
>>>>>    for(i in 1:nflights) {
>>>>>        tf = df[as.numeric(df$flightfact)==flights[i,1],]    # This flight
>>>>>        #check for at least 2 entries
>>>>>        if(dim(tf)[1] < 2) {
>>>>>            next
>>>>>        }
>>>>>        idf = interpolateTimes(tf)
>>>>>        times = rbind(times, idf)
>>>>>    }
>>>>>
>>>>> # Interpolate the times to every minute for 60 minutes
>>>>> # Return a new data frame
>>>>> interpolateTimes = function(df) {
>>>>>   x = as.numeric(seq(from=0,to=60)) # The times to interpolate to
>>>>>   dti = approx(as.numeric(df$PredTime), as.numeric(df$dt), x,
>>>>> method="linear",rule=1:1)
>>>>>   # Make a new data frame of interpolated values
>>>>>   idf = data.frame(time=dti$x, error=dti$y,
>>>>>       runway=rep(df$lrw[1],length(dti$x)),
>>>>> flight=rep(df$flightfact[1], length(dti$x)))
>>>>>   return(idf)
>>>>> }
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?