[R] scaling to multiple data files

jim holtman jholtman at gmail.com
Tue Jan 11 17:39:59 CET 2011


I am not sure exactly what your data represents.  For example, from
looking at the data it appears that user1 and user2 have been logged
on for about 4 days; is that what the data is saying?  If you are
keeping track of users, why not write out a file that has the
start/end time for each user's session?  The first time you see a user,
put an entry in a table, and as soon as they don't show up in a
sample, write out a record for them.  With that information it is easy
to create a report of the number of unique people over time.
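
A rough sketch of that idea in base R follows.  It is only an
illustration under some assumptions: make_sessions() is a hypothetical
helper, the host names and sample values are made up, the end of a
session is taken to be the time of the first sample in which the user
no longer appears, and sessions still open at the last sample are
closed at that sample.

```r
tz <- "US/Eastern"

# Made-up samples for two hosts, in the same shape as 'foo' below.
mk <- function(users, times) {
  data.frame(users = users, datetime = as.POSIXct(times, tz = tz),
             stringsAsFactors = FALSE)
}
hostA <- mk(c("user1 & user2", "user1 & user2 & user3", "user2 & user3", "user3"),
            c("2007-04-02 11:49:20", "2007-04-02 12:02:04",
              "2007-04-02 12:44:02", "2007-04-03 16:34:05"))
hostB <- mk(c("user3 & user4", "user4"),
            c("2007-04-02 09:00:00", "2007-04-03 10:00:00"))

# Hypothetical helper: walk the samples in time order, one row per session.
make_sessions <- function(samples, sep = " & ") {
  samples <- samples[order(samples$datetime), ]
  open <- list()   # user name -> start time of the session still open
  out <- list()    # finished session records
  for (i in seq_len(nrow(samples))) {
    now <- samples$datetime[i]
    present <- strsplit(samples$users[i], sep, fixed = TRUE)[[1]]
    # Users missing from this sample: write out a record for them.
    for (u in setdiff(names(open), present)) {
      out[[length(out) + 1]] <- data.frame(user = u, start = open[[u]],
                                           end = now, stringsAsFactors = FALSE)
      open[[u]] <- NULL
    }
    # Users seen for the first time: put an entry in the table.
    for (u in setdiff(present, names(open))) open[[u]] <- now
  }
  # Assumption: close whatever is still open at the last sample.
  for (u in names(open)) {
    out[[length(out) + 1]] <- data.frame(user = u, start = open[[u]],
                                         end = now, stringsAsFactors = FALSE)
  }
  do.call(rbind, out)
}

# Scaling to several hosts is then just stacking the per-host session tables.
hosts <- list(hostA = hostA, hostB = hostB)
sessions <- do.call(rbind, lapply(names(hosts), function(h) {
  data.frame(host = h, make_sessions(hosts[[h]]), stringsAsFactors = FALSE)
}))

# Expand each session to the calendar days it touches, then count
# distinct users per day across all hosts.
per_day <- do.call(rbind, lapply(seq_len(nrow(sessions)), function(i) {
  data.frame(user = sessions$user[i],
             day = seq(as.Date(sessions$start[i], tz = tz),
                       as.Date(sessions$end[i], tz = tz), by = "day"),
             stringsAsFactors = FALSE)
}))
unique_per_day <- tapply(per_day$user, per_day$day,
                         function(u) length(unique(u)))
print(unique_per_day)
```

Because a session record carries its own start and end, there is no
need to fill in the once-per-minute samples before aggregating, and a
user who is on two hosts on the same day is still counted once.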

On Tue, Jan 11, 2011 at 10:47 AM, Jason Edgecombe
<jason at rampaginggeek.com> wrote:
> Hello,
>
> I have logging information for multiple machines, which I am trying to
> summarize and graph. So far, I process each host individually, but I would
> like to summarize the user count across multiple hosts. I want to answer the
> question "how many unique users logged in on a certain day across a group of
> machines"?
>
> I'm not quite sure how to scale the data frame and analysis to summarize
> multiple hosts, though. I'm still getting a feel for using R.
>
> Here is a snippet of data for one host. The user_count column is generated
> from the users column using my custom function "usercount()". The samples
> are taken roughly once per minute and only unique samples are recorded
> (i.e. use na.locf() to uncompress the data). Samples may occur twice in the
> same minute and are rarely aligned on the same time.
>
> Here is the original data before I turn it into a zoo series and run
> na.locf() over it so I can aggregate a single host by day. I'm open to a
> better way.
>> foo
>                  users            datetime user_count
> 1         user1 & user2 2007-03-29 19:16:30          2
> 2         user1 & user2 2007-03-31 00:04:46          2
> 3         user1 & user2 2007-04-02 11:49:20          2
> 4         user1 & user2 2007-04-02 12:02:04          2
> 5         user1 & user2 2007-04-02 12:44:02          2
> 6 user1 & user2 & user3 2007-04-02 16:34:05          3
>
>> dput(foo)
> structure(list(users = c("user1 & user2", "user1 & user2", "user1 & user2",
> "user1 & user2", "user1 & user2", "user1 & user2 & user3"),
> datetime = structure(c(1175210190, 1175313886, 1175528960, 1175529724,
> 1175532242, 1175546045), class = c("POSIXt", "POSIXct"),
> tzone = "US/Eastern"), user_count = c(2, 2, 2, 2, 2, 3)),
> .Names = c("users", "datetime", "user_count"), row.names = c(NA, 6L),
> class = "data.frame")
>
>
> Thanks,
> Jason



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?


