[R] How to speed up interpolation
jim holtman
jholtman at gmail.com
Tue Jul 19 00:58:30 CEST 2011
To recover the runways, try:
levels(df$lrw)[times[, 'runway']]
The 'runway' column has the index into 'levels(df$lrw)'
On Mon, Jul 18, 2011 at 4:35 PM, James Rome <jamesrome at gmail.com> wrote:
> There is one problem. No matter what I do, I can't recover the correct
> runway in the final list.
> You had "rw = as.numeric(df$lrw) # index into 'levels' "
>
> I have tried
> df$lrw = factor(df$lrw, ordered=TRUE)
> rwys = factor(unique(df$lrw), ordered=TRUE) # Get the names of
> the runways
>
> > rwys
> [1] 04R 27 04L 33L 15R 22L NON
> Levels: 04L < 04R < 15R < 22L < 27 < 33L < NON
> > head(df$lrw)
> [1] 04L 04L 04L 04L 04L 04L
> Levels: 04L < 04R < 15R < 22L < 27 < 33L < NON
> Which seem to order things the same way.
> > rn = as.numeric(head(df$lrw))
> > rn
> [1] 1 1 1 1 1 1
>
> So I should be able to get back my original runways with
>> rwys[rn]
> [1] 04R 04R 04R 04R 04R 04R
> Levels: 04L < 04R < 15R < 22L < 27 < 33L < NON
>
> So I get 04R instead of 04L
>> rwys[1]
> [1] 04R
> Levels: 04L < 04R < 15R < 22L < 27 < 33L < NON
>> rwys[2]
> [1] 27
> Levels: 04L < 04R < 15R < 22L < 27 < 33L < NON
>
> I note that
>> rwys = as.vector(rwys)
>> rwys
> [1] "04R" "27 " "04L" "33L" "15R" "22L" "NON"
>
> So what dumb thing am I doing here? How do I reorder the original df$lrw
> to match the order in rwys?
>
> Thanks,
> Jim
>
> On 7/17/2011 10:11 PM, jim holtman wrote:
>> Here is what I did; convert the data to a numeric matrix for faster
>> processing. You can convert back to a dataframe since you have the
>> indices into the levels for the flights and runways.
>>
>>> # read in data
>>> source('/temp/df/df')
>>> # convert to matrix
>>> df.mat <- cbind(pt = as.numeric(df$PredTime)
>> + , dt = as.numeric(df$dt)
>> + , rw = as.numeric(df$lrw) # index into 'levels'
>> + , flight = as.numeric(df$flightfact)
>> + )
>>> # create a list of row numbers for each flight for processing
>>> flgt.list <- split(seq(nrow(df.mat)), df.mat[, 'flight'])
>>> # remove lists with only 1 entry
>>> flgt.list <- flgt.list[sapply(flgt.list, length) > 1]
>>>
>>> # create the interval we want data for
>>> interval <- as.numeric(0:60)
>>>
>>> # now process the flights
>>> times <- lapply(flgt.list, function(.flt){
>> + interp <- approx(df.mat[.flt, 'pt']
>> + , df.mat[.flt, 'dt']
>> + , xout = interval
>> + , rule = 1
>> + )
>> + # return vector
>> + cbind(time = interp$x
>> + , error = interp$y
>> + , runway = df.mat[.flt[1L], 'rw']
>> + , flight = df.mat[.flt[1L], 'flight']
>> + )
>> + })
>>> # sample output -- is this correct?
>>> times[[1]]
>> time error runway flight
>> [1,] 0 NA 2 1
>> [2,] 1 NA 2 1
>> [3,] 2 -0.13795380 2 1
>> [4,] 3 -0.20726073 2 1
>> [5,] 4 -0.27309237 2 1
>> [6,] 5 -0.33333333 2 1
>> [7,] 6 -0.09322419 2 1
>> [8,] 7 0.14688495 2 1
>> [9,] 8 0.38699409 2 1
>> [10,] 9 0.62710323 2 1
>> [11,] 10 0.86721237 2 1
>> [12,] 11 1.10732151 2 1
>> [13,] 12 1.34743065 2 1
>> [14,] 13 1.58753979 2 1
>> [15,] 14 1.82764893 2 1
>> [16,] 15 2.06775807 2 1
>> [17,] 16 2.30786721 2 1
>> [18,] 17 2.54797635 2 1
>> [19,] 18 6.66600000 2 1
>> [20,] 19 4.82600000 2 1
>> [21,] 20 3.00436508 2 1
>> [22,] 21 2.22316562 2 1
>> [23,] 22 1.34895178 2 1
>> [24,] 23 0.47473795 2 1
>> [25,] 24 -0.39947589 2 1
>> [26,] 25 -1.27368973 2 1
>> [27,] 26 -2.12478632 2 1
>> [28,] 27 -1.61196581 2 1
>> [29,] 28 -1.09914530 2 1
>> [30,] 29 -0.58632479 2 1
>> [31,] 30 -0.07350427 2 1
>> [32,] 31 0.43931624 2 1
>> [33,] 32 0.95213675 2 1
>> [34,] 33 1.46495726 2 1
>> [35,] 34 1.97777778 2 1
>> [36,] 35 2.49059829 2 1
>> [37,] 36 3.00341880 2 1
>> [38,] 37 3.51623932 2 1
>> [39,] 38 4.02905983 2 1
>> [40,] 39 4.54188034 2 1
>> [41,] 40 5.05470085 2 1
>> [42,] 41 5.53360434 2 1
>> [43,] 42 5.53766938 2 1
>> [44,] 43 5.54173442 2 1
>> [45,] 44 5.54579946 2 1
>> [46,] 45 5.54986450 2 1
>> [47,] 46 5.55392954 2 1
>> [48,] 47 5.55799458 2 1
>> [49,] 48 5.56205962 2 1
>> [50,] 49 5.56612466 2 1
>> [51,] 50 5.57018970 2 1
>> [52,] 51 5.57425474 2 1
>> [53,] 52 5.57831978 2 1
>> [54,] 53 5.58238482 2 1
>> [55,] 54 5.58644986 2 1
>> [56,] 55 5.59051491 2 1
>> [57,] 56 5.59457995 2 1
>> [58,] 57 5.59864499 2 1
>> [59,] 58 5.60271003 2 1
>> [60,] 59 5.60677507 2 1
>> [61,] 60 5.61084011 2 1
>>
>>
>> On Sun, Jul 17, 2011 at 6:58 PM, James Rome <jamesrome at gmail.com> wrote:
>>> I thought I had included the data... Here it is again.
>>>
>>> What I want to do is to make box and whisker plots with each flight
>>> counted the same number of times in each time bin. Hence the
>>> interpolation to minute time hacks.
>>>
>>>
>>> On 7/17/2011 4:16 PM, jim holtman wrote:
>>>> It would be nice if you had some sample data included so that we could
>>>> see how the code worked. Have you use Rprof on the code to see where
>>>> you are spending your time? You might want to use 'matrix' instead of
>>>> 'data.frames' since there is a big performance impact with dataframes
>>>> when indexing. A little more description of the problem you are
>>>> trying to solve would also be useful. I tend to ask people "tell me
>>>> what you want to do, not how you want to do it".
>>>>
>>>> On Sun, Jul 17, 2011 at 1:30 PM, James Rome <jamesrome at gmail.com> wrote:
>>>>> df is a very large data frame with arrival estimates for many flights
>>>>> (DF$flightfact) at random times (df$PredTime). The error of the estimate
>>>>> is df$dt.
>>>>> My problem is that I want to know the prediction error at each minute
>>>>> before landing. This code works, but is very slow, and dominates
>>>>> everything. I tried using split(), but that rapidly ate up my 12 GB of
>>>>> memory. So, is there a better R way of doing this?
>>>>>
>>>>> Thanks,
>>>>> Jim Rome
>>>>>
>>>>> flights = table(df$flightfact[1:dim(df)[1], drop=TRUE])
>>>>> nflights = length(flights)
>>>>> flights = as.data.frame(flights)
>>>>> times = data.frame()
>>>>> # Split by flight
>>>>> for(i in 1:nflights) {
>>>>> tf = df[as.numeric(df$flightfact)==flights[i,1],] # This flight
>>>>> #check for at least 2 entries
>>>>> if(dim(tf)[1] < 2) {
>>>>> next
>>>>> }
>>>>> idf = interpolateTimes(tf)
>>>>> times = rbind(times, idf)
>>>>> }
>>>>>
>>>>> # Interpolate the times to every minute for 60 minutes
>>>>> # Return a new data frame
>>>>> interpolateTimes = function(df) {
>>>>> x = as.numeric(seq(from=0,to=60)) # The times to interpolate to
>>>>> dti = approx(as.numeric(df$PredTime), as.numeric(df$dt), x,
>>>>> method="linear",rule=1:1)
>>>>> # Make a new data frame of interpolated values
>>>>> idf = data.frame(time=dti$x, error=dti$y,
>>>>> runway=rep(df$lrw[1],length(dti$x)),
>>>>> flight=rep(df$flightfact[1], length(dti$x)))
>>>>> return(idf)
>>>>> }
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
More information about the R-help
mailing list