[R] scaling to multiple data files

Jason Edgecombe jason at rampaginggeek.com
Tue Jan 11 16:47:39 CET 2011


I have logging information for multiple machines, which I am trying to 
summarize and graph. So far, I process each host individually, but I 
would like to summarize the user count across multiple hosts. I want to 
answer the question "how many unique users logged in on a certain day
across a group of machines?"

I'm not quite sure how to scale the data frame and analysis to summarize 
multiple hosts, though. I'm still getting a feel for using R.

Here is a snippet of data for one host. The user_count column is
generated from the users column using my custom function usercount().
Samples are taken roughly once per minute, but a sample is recorded only
when it differs from the previous one (i.e. use na.locf() to uncompress
the data). Samples may occur twice in the same minute and are rarely
aligned on the same time.
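
The usercount() function itself is not shown in this post. As a rough
illustration only, a function of that kind might look like the sketch
below, assuming user names are separated by " & " as in the sample data.
The body is a guess at the behaviour, not the actual implementation.

```r
# Hypothetical stand-in for usercount(): count the distinct names in
# strings such as "user1 & user2 & user3". Assumes " & " separators.
usercount <- function(users) {
  # Split each string on "&", allowing optional spaces around it
  parts <- strsplit(as.character(users), " *& *")
  # Count unique, non-empty names in each element
  vapply(parts, function(p) length(unique(p[nzchar(p)])), integer(1))
}

usercount(c("user1 & user2", "user1 & user2 & user3"))
```

Missing values are not treated specially in this sketch, so real log
data may need an extra check before counting.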

Here is the original data before I turn it into a zoo series and run
na.locf() over it so I can aggregate a single host by day. I'm open to a
better way.
 > foo
                   users            datetime user_count
1         user1 & user2 2007-03-29 19:16:30          2
2         user1 & user2 2007-03-31 00:04:46          2
3         user1 & user2 2007-04-02 11:49:20          2
4         user1 & user2 2007-04-02 12:02:04          2
5         user1 & user2 2007-04-02 12:44:02          2
6 user1 & user2 & user3 2007-04-02 16:34:05          3

 > dput(foo)
structure(list(users = c("user1 & user2", "user1 & user2", "user1 & user2",
"user1 & user2", "user1 & user2", "user1 & user2 & user3"), datetime = structure(c(1175210190,
1175313886, 1175528960, 1175529724, 1175532242, 1175546045), class = c("POSIXt",
"POSIXct"), tzone = "US/Eastern"), user_count = c(2, 2, 2, 2,
2, 3)), .Names = c("users", "datetime", "user_count"), row.names = c(NA,
6L), class = "data.frame")
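
To make the goal concrete, here is a minimal base R sketch of the kind of
multi-host summary described above: stack the per-host data frames,
split the users strings into one row per user, and count distinct users
per day. The data frames host_a and host_b and their contents are made up
for illustration, and the " & " separator is assumed from the sample
data. This is one possible approach, not necessarily the best one.

```r
# Made-up per-host data frames shaped like `foo` (hypothetical values)
host_a <- data.frame(
  users = c("user1 & user2", "user1 & user2 & user3"),
  datetime = as.POSIXct(c("2007-04-02 11:49:20", "2007-04-02 16:34:05"),
                        tz = "US/Eastern"),
  stringsAsFactors = FALSE
)
host_b <- data.frame(
  users = c("user2 & user4", "user4"),
  datetime = as.POSIXct(c("2007-04-02 09:00:00", "2007-04-03 08:15:00"),
                        tz = "US/Eastern"),
  stringsAsFactors = FALSE
)

# Stack the hosts into one data frame, tagging each row with its host
hosts <- list(hostA = host_a, hostB = host_b)
all_hosts <- do.call(rbind, Map(function(d, h) {
  d$host <- h
  d
}, hosts, names(hosts)))

# Calendar day in local time (as.Date() on POSIXct defaults to UTC)
all_hosts$day <- as.Date(all_hosts$datetime, tz = "US/Eastern")

# One row per (day, user): split "a & b" strings into individual names
names_list <- strsplit(all_hosts$users, " *& *")
long <- data.frame(
  day  = rep(all_hosts$day, lengths(names_list)),
  user = unlist(names_list),
  stringsAsFactors = FALSE
)

# Number of distinct users seen per day across all hosts
daily <- tapply(long$user, long$day, function(u) length(unique(u)))
daily
```

With the made-up data above this gives 4 users on 2007-04-02 and 1 on
2007-04-03. It only counts users that appear in a recorded sample on that
day, so the na.locf() step would still be needed first to catch users who
stayed logged in over a day with no recorded change.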

