[R] Corrected - R 3.0.2 How to Split-Apply-Combine using various Columns
jim holtman
jholtman at gmail.com
Sun Jan 26 03:52:21 CET 2014
Here is another way of doing it using the data.table package. Tried
using new new 'dplyr' package, but it had problems.
> # create some test data
> N <- 30000 # rows of data
> set.seed(1)
> DF <- data.frame(vehicle = sample(1:5, N, TRUE)
+ , frame = sample(22:9322, N, TRUE)
+ , velocity = runif(N, 15, 60)
+ , Class = sample(1:3, N, TRUE)
+ )
> # break into slot (as previous post)
> DF$slot <- cut(DF$frame, seq(22, 9322, 300), include.lowest = TRUE)
> str(DF)
'data.frame': 30000 obs. of 5 variables:
$ vehicle : int 2 2 3 5 2 5 5 4 4 1 ...
$ frame : int 3884 1333 4277 7490 3008 8096 2318 7662 2017 5324 ...
$ velocity: num 47.9 44.1 51.6 19.7 30 ...
$ Class : int 3 1 3 3 1 1 3 3 3 3 ...
$ slot : Factor w/ 31 levels "[22,322]","(322,622]",..: 13 5 15 25
10 27 8 26 7 18 ...
>
> # use the data.table package
> require(data.table)
> DF <- data.table(DF)
>
> # summarize the slot & Class
> result <- DF[
+ , list(Total = .N # count of entries
+ , mVel = mean(velocity) # average velocity
+ )
+ , keyby = 'slot,Class'
+ ]
>
> head(result, 20)
slot Class Total mVel
1: [22,322] 1 317 38.68162
2: [22,322] 2 291 38.20892
3: [22,322] 3 356 38.18092
4: (322,622] 1 313 37.46468
5: (322,622] 2 326 38.35842
6: (322,622] 3 329 37.46077
7: (622,922] 1 327 37.06243
8: (622,922] 2 321 37.01083
9: (622,922] 3 351 37.12343
10: (922,1.22e+03] 1 361 36.76076
11: (922,1.22e+03] 2 343 38.48242
12: (922,1.22e+03] 3 312 37.64318
13: (1.22e+03,1.52e+03] 1 304 36.89459
14: (1.22e+03,1.52e+03] 2 324 36.60410
15: (1.22e+03,1.52e+03] 3 343 36.11219
16: (1.52e+03,1.82e+03] 1 313 38.71039
17: (1.52e+03,1.82e+03] 2 305 38.29188
18: (1.52e+03,1.82e+03] 3 308 38.61945
19: (1.82e+03,2.12e+03] 1 308 37.69332
20: (1.82e+03,2.12e+03] 2 335 37.38025
>
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
On Sat, Jan 25, 2014 at 4:38 PM, Jeff Newmiller
<jdnewmil at dcn.davis.ca.us> wrote:
> Sorry, messed up the second ddply example:
>
> dta3 <- ddply( dta2, c("slot","classf"), function(DF){data.frame( Total=nrow(DF),
> MeanVelocity= mean( DF$TimeMeanVelocity ) ) } )
>
> ---------------------------------------------------------------------------
> Jeff Newmiller The ..... ..... Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
> Live: OO#.. Dead: OO#.. Playing
> Research Engineer (Solar/Batteries O.O#. #.O#. with
> /Software/Embedded Controllers) .OO#. .OO#. rocks...1k
> ---------------------------------------------------------------------------
> Sent from my phone. Please excuse my brevity.
>
> Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>>While you seem to be making some progress in communicating your
>>problem, the format is still HTML (so it is a mess) and the subject and
>>approach of the question are still a poor fit for this list. We are not
>>here to DO your work for you, yet you seem to have an overly long list
>>of "needs" that suggest you want a complete solution. What you should
>>be looking for here are suggestions for how to solve pieces of this
>>task so that you can do the work of creating your own solution.
>>Some tools that I find useful for this kind of problem are the cut
>>function, the plyr package, and the reshape2 package. Others might find
>>the aggregate function or the sqldf package or the datatable package or
>>the new dplyr package helpful. Each function and package has
>>documentation with examples that you should read before using them
>>(e.g. ?cut).
>>
>>Some example calculations are (with dta as your sample data frame):
>>
>>library(plyr)
>>dta$slot <- cut( dta$frame, seq(22,9322,300))
>>dta$classf <- factor(dta$class, levels=1:3,
>>labels=c("motorcycle","car","truck"))
>>dta2 <- ddply( dta, c("slot","classf","vehicle"),
>>function(DF){data.frame( TimeMeanVelocity=mean(DF$velocity) ) } )
>>dta3 <- ddply( dta2, c("slot","classf"), function(DF){data.frame(
>>MeanVelocity=mean( Total=nrow(DF), DF$TimeMeanVelocity ) ) } )
>>
>>Then you need to fold the total and mean velocity into wide form using
>>the dcast function from the reshape2 package (read the documentation)
>>and merge them with the merge or cbind functions.
>>
>>Good luck, and keep working on making your questions clear.
>>---------------------------------------------------------------------------
>>Jeff Newmiller The ..... ..... Go
>>Live...
>>DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live
>>Go...
>> Live: OO#.. Dead: OO#.. Playing
>>Research Engineer (Solar/Batteries O.O#. #.O#. with
>>/Software/Embedded Controllers) .OO#. .OO#.
>>rocks...1k
>>---------------------------------------------------------------------------
>>
>>Sent from my phone. Please excuse my brevity.
>>
>>umair durrani <umairdurrani at outlook.com> wrote:
>>>Hello everyone,Here is the version using dput. I am sorry for the junk
>>>I posted before. I have a large vehicle trajectory data of which
>>>following is a small part:
>>>structure(list(vehicle = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,2L, 2L),
>>>frame = c(221L, 222L, 223L, 224L, 115L, 116L, 117L, 118L, 119L, 120L,
>>>121L), globalx = c(6451259.685, 6451261.244, 6451262.831, 6451264.362,
>>>6451181.179, 6451183.532, 6451185.884, 6451188.237, 6451190.609,
>>>6451192.912, 6451195.132), class = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>>>2L, 2L, 2L), velocity = c(23.37, 23.16, 22.94, 22.85, 35, 35.01,
>>35.03,
>>>34.92, 34.49, 33.66, 32.5), lane = c(5L, 5L, 5L, 5L, 4L, 4L, 4L, 4L,
>>>4L, 4L, 4L)), .Names = c("vehicle", "frame", "globalx", "class",
>>>"velocity", "lane"), row.names = c(85L, 86L, 87L, 88L, 447L, 448L,
>>>449L, 450L, 451L, 452L, 453L), class = "data.frame")
>>>Explanation of Columns:vehicle = unique ID of vehicle. It is repeated
>>>(in column) for every frame in which it was observed;frame= ID of the
>>>frame in which the vehicle was observed. One frame is 0.1 seconds
>>>long;class = class of vehicle i.e. 1=motorcycle, 2=car,
>>>3=truck;velocity= velocity of vehicle in feet per second;lane= lane
>>>number in which vehicle is present in a particular frame;
>>>
>>>'frame' number can also repeat e.g. in frame 120 the example data
>>shows
>>>vehicle 2 was observed but in the original data many more vehicles
>>>might have been observed in this frame. Similarly, 'class' is defined
>>>above and all three classes are present in the original data (here
>>>example data only shows classes 2 and 3 i.e. cars and trucks).
>>>I need to determine two things:1) Number of vehicles observed in every
>>>30 seconds i.e. 300 frames 2) Average velocity of each vehicle class
>>in
>>>every 30 seconds
>>>> This means that the first step might be to determine the minimum and
>>>maximum frame numbers and then divide them in slots so that every slot
>>>has 300 frames. In my original data I found 22 as min and 9233 as max
>>>frame number. This makes 30 time slots as 22-322, 322-622, ...,
>>>9022-9233. I need following columns in one table as an output (note
>>>that Timeslot column should contain the time intervals as described
>>>before): TimeSlot, Total-Cars, Total-Trucks, Total-Motorcycles,
>>>MeanVelocity-Cars, MeanVelocity-Trucks, MeanVelocity-Motorcycles
>>>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>>______________________________________________
>>>R-help at r-project.org mailing list
>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>PLEASE do read the posting guide
>>>http://www.R-project.org/posting-guide.html
>>>and provide commented, minimal, self-contained, reproducible code.
>>
>>______________________________________________
>>R-help at r-project.org mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide
>>http://www.R-project.org/posting-guide.html
>>and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list