[R] Newbie: Using R to analyse Apache logs
kzembowe at jhuccp.org
Thu Jan 31 15:21:43 CET 2008
I've been experimenting with R to compute simple statistics from my web
logs, somewhat similar to what you're describing. For instance, I'm
trying to classify each unique IP or domain-name requestor as 'human'
or 'robot' based on the number of seconds between requests for pages.
Given my (elementary) knowledge of R and my (professional) knowledge of
Perl, I've found the easiest workflow is to pre-process the logs with a
Perl program before submitting them to R. Running my Apache web log
through the Perl program produces tab-delimited output like this:
kevinz at cn2:~/weblogstats$ ./weblogtimediff.pl access_log.20071130.sorted
DateTime Source TimeDiff Type
30/Nov/2007 00:00:47 22.214.171.124.sikkanet.com 15 unknown
30/Nov/2007 00:00:48 126.96.36.199.sikkanet.com 1 unknown
30/Nov/2007 00:01:19 188.8.131.52.sikkanet.com 31 unknown
30/Nov/2007 00:01:25 184.108.40.206.sikkanet.com 6 unknown
30/Nov/2007 00:01:29 ip-61-14-181-116.asianetcom.net 15 unknown
30/Nov/2007 00:01:40 220.127.116.11.sikkanet.com 15 unknown
30/Nov/2007 00:01:41 18.104.22.168.sikkanet.com 1 unknown
30/Nov/2007 00:01:44 llf520049.crawl.yahoo.net 14 robot
30/Nov/2007 00:01:46 ip-61-14-181-116.asianetcom.net 17 unknown
kevinz at cn2:~/weblogstats$
In this output, I also make a preliminary classification of each
requestor into 'robot' (because it identified itself as such in the
browser/user-agent field), 'human' (because it submitted a text string
to my internal search engine), or 'unknown'.
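Reading this output back into R could look something like the sketch
below (the file name timediffs.tsv is just an assumption for wherever
the Perl output gets saved, and English month abbreviations are assumed
for the timestamp parsing):

# read the tab-delimited pre-processed log (file name is hypothetical)
logs <- read.delim("timediffs.tsv", stringsAsFactors = FALSE)

# parse the Apache-style timestamp into POSIXct
logs$DateTime <- as.POSIXct(logs$DateTime, format = "%d/%b/%Y %H:%M:%S")

# how many requests fell into each preliminary class?
table(logs$Type)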
Unfortunately, this approach doesn't seem to be working. The
time-interval distributions of both the 'humans' and the 'robots'
appeared to be Poisson on inspection. I therefore created box plots of
log(mean(time intervals)) per requestor, but the 'humans' and the
'robots' were indistinguishable by inspection. Since this isn't exactly
what I'm paid to do and I just play with it in my spare time, I haven't
tried anything else yet.
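In R, that comparison might look something like this (again a sketch,
reusing the assumed 'logs' data frame from the snippet above):

# mean seconds between requests, per source within each class
agg <- aggregate(TimeDiff ~ Source + Type, data = logs, FUN = mean)

# box plots of the log mean interval, split by preliminary class
# (sources whose mean interval is zero would need filtering first)
boxplot(log(TimeDiff) ~ Type, data = agg,
        ylab = "log(mean seconds between requests)")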
If it's of general interest to this group, I'd be happy to publish my
program for this. Otherwise, Raj, if you're interested, I'd be happy to
send it to you privately.
One oddity I noted is that Apache logs are not always in chronological
order. The date/time stamp records when the request occurred, but the
line is written to the log when the request completes. Thus, during a
long download, several shorter subsequent downloads may be requested,
completed, and logged before the earlier, long one. I was confused by
negative time differences from my program until I discovered this. I
now sort my Apache log into chronological order before passing it
through my Perl program.
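If the sort is done in R rather than in the shell, it could be as simple
as this (assuming the parsed 'logs' data frame from the earlier sketch):

# reorder by the parsed request time, since Apache writes each line
# only when the request completes
logs <- logs[order(logs$DateTime), ]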
Hope this helps. Let me know if you have any other questions.
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of Raj Mathur
Sent: Thursday, January 31, 2008 8:31 AM
Subject: [R] Newbie: Using R to analyse Apache logs
I have a requirement to scan Apache logs and discover ``exceptions''.
Exceptions can be of two types:
1. A single IP generating a large amount of traffic within a given time
frame (for definable values of ``large'' and ``time frame'').
2. A single IP hitting a wide set of URLs on the server (indicating a
possible robot or scan), again for definable values of ``wide''.
I'm a complete newbie to R (and to statistics), so the questions are:
- Can R help me generate graphs which would help me identify these
exceptions?
- Has someone already done something like this? If so, where could I
find it?
- If not, can someone help me with the stats (and R) part to achieve
these objectives?
Any software that gets created as a result would be
released under a FOSS license.
Data massaging, tuning, etc. are not an issue. We'd be dealing with
between a hundred thousand and a million records a day.
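A first cut at flagging both kinds of exceptions in R might look like
the sketch below (an illustration only; the data frame 'reqs' and its
columns ip, url, and a POSIXct column time are assumptions about a
parsed log, and the thresholds are placeholders):

# 1. requests per IP per hour; flag IPs over a definable threshold
reqs$hour <- format(reqs$time, "%Y-%m-%d %H")
counts <- aggregate(url ~ ip + hour, data = reqs, FUN = length)
names(counts)[3] <- "hits"
busy <- subset(counts, hits > 1000)           # "large" = 1000 hits/hour

# 2. distinct URLs per IP; flag IPs touching a "wide" set of URLs
breadth <- aggregate(url ~ ip, data = reqs,
                     FUN = function(u) length(unique(u)))
names(breadth)[2] <- "distinct_urls"
wide <- subset(breadth, distinct_urls > 500)  # "wide" = 500 distinct URLs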
Raj Mathur raju at kandalaya.org http://kandalaya.org/
Freedom in Technology & Software || February 2008 || http://freed.in/
GPG: 78D4 FC67 367F 40E2 0DD5 0FEF C968 D0EF CC68 D17F
PsyTrance & Chill: http://schizoid.in/ || It is the mind that moves