[R] Corrected - R 3.0.2 How to Split-Apply-Combine using various Columns

jim holtman jholtman at gmail.com
Sun Jan 26 03:55:06 CET 2014


FYI, increased it to 3M rows of data and it took 0.3 seconds to summarize it:

> system.time({
+ # summarize the slot & Class
+ result <- DF[
+     , list(Total = .N  # count of entries
+         , mVel = mean(velocity)  # average velocity
+         )
+     , keyby = 'slot,Class'
+     ]
+ })
   user  system elapsed
   0.30    0.00    0.29
> dim(DF)
[1] 3000000       5
> head(result, 20)
                   slot Class Total     mVel
 1:            [22,322]     1 32449 37.70693
 2:            [22,322]     2 32741 37.49304
 3:            [22,322]     3 32213 37.40232
 4:           (322,622]     1 32257 37.38503
 5:           (322,622]     2 31912 37.35941
 6:           (322,622]     3 32360 37.43757
 7:           (622,922]     1 32359 37.48244
 8:           (622,922]     2 32505 37.57031
 9:           (622,922]     3 31966 37.58736
10:      (922,1.22e+03]     1 32228 37.34592
11:      (922,1.22e+03]     2 32099 37.44663
12:      (922,1.22e+03]     3 32054 37.51975
13: (1.22e+03,1.52e+03]     1 32230 37.50131
14: (1.22e+03,1.52e+03]     2 32353 37.50793
15: (1.22e+03,1.52e+03]     3 32280 37.46094
16: (1.52e+03,1.82e+03]     1 31957 37.51671
17: (1.52e+03,1.82e+03]     2 32030 37.48419
18: (1.52e+03,1.82e+03]     3 32365 37.49100
19: (1.82e+03,2.12e+03]     1 32244 37.59558
20: (1.82e+03,2.12e+03]     2 32309 37.49966
>

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Sat, Jan 25, 2014 at 9:52 PM, jim holtman <jholtman at gmail.com> wrote:
> Here is another way of doing it using the data.table package.  Tried
> using new new 'dplyr' package, but it had problems.
>
>> # create some test data
>> N <- 30000  # rows of data
>> set.seed(1)
>> DF <- data.frame(vehicle = sample(1:5, N, TRUE)
> +             , frame = sample(22:9322, N, TRUE)
> +             , velocity = runif(N, 15, 60)
> +             , Class = sample(1:3, N, TRUE)
> +             )
>> # break into slot (as previous post)
>> DF$slot <- cut(DF$frame, seq(22, 9322, 300), include.lowest = TRUE)
>> str(DF)
> 'data.frame':   30000 obs. of  5 variables:
>  $ vehicle : int  2 2 3 5 2 5 5 4 4 1 ...
>  $ frame   : int  3884 1333 4277 7490 3008 8096 2318 7662 2017 5324 ...
>  $ velocity: num  47.9 44.1 51.6 19.7 30 ...
>  $ Class   : int  3 1 3 3 1 1 3 3 3 3 ...
>  $ slot    : Factor w/ 31 levels "[22,322]","(322,622]",..: 13 5 15 25
> 10 27 8 26 7 18 ...
>>
>> # use the data.table package
>> require(data.table)
>> DF <- data.table(DF)
>>
>> # summarize the slot & Class
>> result <- DF[
> +     , list(Total = .N  # count of entries
> +         , mVel = mean(velocity)  # average velocity
> +         )
> +     , keyby = 'slot,Class'
> +     ]
>>
>> head(result, 20)
>                    slot Class Total     mVel
>  1:            [22,322]     1   317 38.68162
>  2:            [22,322]     2   291 38.20892
>  3:            [22,322]     3   356 38.18092
>  4:           (322,622]     1   313 37.46468
>  5:           (322,622]     2   326 38.35842
>  6:           (322,622]     3   329 37.46077
>  7:           (622,922]     1   327 37.06243
>  8:           (622,922]     2   321 37.01083
>  9:           (622,922]     3   351 37.12343
> 10:      (922,1.22e+03]     1   361 36.76076
> 11:      (922,1.22e+03]     2   343 38.48242
> 12:      (922,1.22e+03]     3   312 37.64318
> 13: (1.22e+03,1.52e+03]     1   304 36.89459
> 14: (1.22e+03,1.52e+03]     2   324 36.60410
> 15: (1.22e+03,1.52e+03]     3   343 36.11219
> 16: (1.52e+03,1.82e+03]     1   313 38.71039
> 17: (1.52e+03,1.82e+03]     2   305 38.29188
> 18: (1.52e+03,1.82e+03]     3   308 38.61945
> 19: (1.82e+03,2.12e+03]     1   308 37.69332
> 20: (1.82e+03,2.12e+03]     2   335 37.38025
>>
>
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
>
> On Sat, Jan 25, 2014 at 4:38 PM, Jeff Newmiller
> <jdnewmil at dcn.davis.ca.us> wrote:
>> Sorry, messed up the second ddply example:
>>
>> dta3 <- ddply( dta2, c("slot","classf"), function(DF){data.frame( Total=nrow(DF),
>> MeanVelocity= mean( DF$TimeMeanVelocity ) ) } )
>>
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------
>> Sent from my phone. Please excuse my brevity.
>>
>> Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>>>While you seem to be making some progress in communicating your
>>>problem, the format is still HTML (so it is a mess) and the subject and
>>>approach of the question are still a poor fit for this list. We are not
>>>here to DO your work for you, yet you seem to have an overly long list
>>>of "needs" that suggest you want a complete solution. What you should
>>>be looking for here are suggestions for how to solve pieces of this
>>>task so that you can do the work of creating your own solution.
>>>Some tools that I find useful for this kind of problem are the cut
>>>function, the plyr package, and the reshape2 package. Others might find
>>>the aggregate function or the sqldf package or the datatable package or
>>>the new dplyr package helpful. Each function and package has
>>>documentation with examples that you should read before using them
>>>(e.g. ?cut).
>>>
>>>Some example calculations are (with dta as your sample data frame):
>>>
>>>library(plyr)
>>>dta$slot <- cut( dta$frame, seq(22,9322,300))
>>>dta$classf <- factor(dta$class, levels=1:3,
>>>labels=c("motorcycle","car","truck"))
>>>dta2 <- ddply( dta, c("slot","classf","vehicle"),
>>>function(DF){data.frame( TimeMeanVelocity=mean(DF$velocity) ) } )
>>>dta3 <- ddply( dta2, c("slot","classf"), function(DF){data.frame(
>>>MeanVelocity=mean( Total=nrow(DF), DF$TimeMeanVelocity ) ) } )
>>>
>>>Then you need to fold the total and mean velocity into wide form using
>>>the dcast function from the reshape2 package (read the documentation)
>>>and merge them with the merge or cbind functions.
>>>
>>>Good luck, and keep working on making your questions clear.
>>>---------------------------------------------------------------------------
>>>Jeff Newmiller                        The     .....       .....  Go
>>>Live...
>>>DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live
>>>Go...
>>>                                     Live:   OO#.. Dead: OO#..  Playing
>>>Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>>/Software/Embedded Controllers)               .OO#.       .OO#.
>>>rocks...1k
>>>---------------------------------------------------------------------------
>>>
>>>Sent from my phone. Please excuse my brevity.
>>>
>>>umair durrani <umairdurrani at outlook.com> wrote:
>>>>Hello everyone,Here is the version using dput. I am sorry for the junk
>>>>I posted before. I have a large vehicle trajectory data of which
>>>>following is a small part:
>>>>structure(list(vehicle = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,2L, 2L),
>>>>frame = c(221L, 222L, 223L, 224L, 115L, 116L, 117L, 118L, 119L, 120L,
>>>>121L), globalx = c(6451259.685, 6451261.244, 6451262.831, 6451264.362,
>>>>6451181.179, 6451183.532, 6451185.884, 6451188.237, 6451190.609,
>>>>6451192.912, 6451195.132), class = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>>>>2L, 2L, 2L), velocity = c(23.37, 23.16, 22.94, 22.85, 35, 35.01,
>>>35.03,
>>>>34.92, 34.49, 33.66, 32.5), lane = c(5L, 5L, 5L, 5L, 4L, 4L, 4L, 4L,
>>>>4L, 4L, 4L)), .Names = c("vehicle", "frame", "globalx", "class",
>>>>"velocity", "lane"), row.names = c(85L, 86L, 87L, 88L, 447L, 448L,
>>>>449L, 450L, 451L, 452L, 453L), class = "data.frame")
>>>>Explanation of Columns:vehicle = unique ID of vehicle. It is repeated
>>>>(in column) for every frame in which it was observed;frame= ID of the
>>>>frame in which the vehicle was observed. One frame is 0.1 seconds
>>>>long;class = class of vehicle i.e. 1=motorcycle, 2=car,
>>>>3=truck;velocity= velocity of vehicle in feet per second;lane= lane
>>>>number in which vehicle is present in a particular frame;
>>>>
>>>>'frame' number can also repeat e.g. in frame 120 the example data
>>>shows
>>>>vehicle 2 was observed but in the original data many more vehicles
>>>>might have been observed in this frame. Similarly, 'class' is defined
>>>>above and all three classes are present in the original data (here
>>>>example data only shows classes 2 and 3 i.e. cars and trucks).
>>>>I need to determine two things:1) Number of vehicles observed in every
>>>>30 seconds i.e. 300 frames 2) Average velocity of each vehicle class
>>>in
>>>>every 30 seconds
>>>>> This means that the first step might be to determine the minimum and
>>>>maximum frame numbers and then divide them in slots so that every slot
>>>>has 300 frames. In my original data I found 22 as min and 9233 as max
>>>>frame number. This makes 30 time slots as 22-322, 322-622, ...,
>>>>9022-9233. I need following columns in one table as an output (note
>>>>that Timeslot column should contain the time intervals as described
>>>>before): TimeSlot, Total-Cars, Total-Trucks, Total-Motorcycles,
>>>>MeanVelocity-Cars, MeanVelocity-Trucks, MeanVelocity-Motorcycles
>>>>
>>>>
>>>>
>>>>      [[alternative HTML version deleted]]
>>>>
>>>>______________________________________________
>>>>R-help at r-project.org mailing list
>>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>>PLEASE do read the posting guide
>>>>http://www.R-project.org/posting-guide.html
>>>>and provide commented, minimal, self-contained, reproducible code.
>>>
>>>______________________________________________
>>>R-help at r-project.org mailing list
>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>PLEASE do read the posting guide
>>>http://www.R-project.org/posting-guide.html
>>>and provide commented, minimal, self-contained, reproducible code.
>>




More information about the R-help mailing list