[R] Histograms with strings, grouped by repeat count (w/ data)

Matthew Trunnell trunnell at cognix.net
Tue Jun 19 05:14:51 CEST 2007


Aha!  So to expand that from the original expression,

> table(table(d$filename, d$email_addr))

  0   1   2   3
253  20   8   9

I think that is exactly what I'm looking for.  I knew it must be
simple!!!  What does the 0 column represent?
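
On the 0 column: table(d$filename, d$email_addr) has one cell for every
filename/email pair, including pairs that never occur in the data.  With 10
files and 29 emails that is 290 cells, and 253 + 20 + 8 + 9 = 290, so the 0
bin appears to be the 253 pairs with no download at all.  A minimal sketch on
made-up data (the toy data frame below is an assumption, not the real file)
showing the effect and one way to drop the empty pairs:

```r
# Toy data (hypothetical, not the poster's file): one row per download attempt
d <- data.frame(
  filename   = c("file1", "file1", "file2", "file2", "file2", "file3"),
  email_addr = c("a", "a", "a", "b", "b", "c"),
  stringsAsFactors = TRUE
)

# 3 files x 3 emails = 9 cells, most of them empty
counts <- table(d$filename, d$email_addr)

# Includes a "0" bin: pairs of file and email that never occur together
with_zero <- table(counts)

# Keep only the pairs that were actually downloaded at least once
no_zero <- table(counts[counts > 0])

print(with_zero)
print(no_zero)
```

Filtering with counts > 0 works whether or not the columns are factors, which
is why it is used here instead of relying on how the file was read in.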

Also, does this tell me the same thing, filtered by Japan?
> table(table(d$filename, d$email_addr, d$country_residence)[d$country_residence=="Japan"])

  0   1   2   3
958   5   2   1

How does that differ logically from this?

> table(table(d$filename, d$email_addr)[d$country_residence=="Japan"])

 0  1  2  3
51  4  2  1

I don't understand why that produces different results.  The first one
adds a third dimension to the table, but limits that third dimension
to a single element, Japan.  Shouldn't it be the same?  And again,
what's that zero column?
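
A likely explanation for the two different results: d$country_residence=="Japan"
has one element per row of d, while the table has one element per cell.  R
recycles the shorter logical vector across the cells, so both expressions pick
cells by position rather than by country.  If d holds the 63 rows posted below
(14 of them Japan), recycling over 290 cells selects 4*14 + 2 = 58 cells, and
over 290*15 = 4350 cells it selects 69*14 = 966, which matches the totals of
the two outputs above.  A sketch on made-up data (the toy data frame and the
names dj, tj, t3, sl are assumptions) of two ways to restrict to one country:

```r
# Toy data (hypothetical): one row per download attempt
d <- data.frame(
  filename          = c("file1", "file1", "file2", "file2", "file2", "file3"),
  email_addr        = c("a", "a", "a", "b", "b", "c"),
  country_residence = c("Japan", "Japan", "India", "Japan", "Japan", "China"),
  stringsAsFactors  = TRUE
)

# Option 1: subset the rows first, then cross-tabulate what is left
dj <- d[d$country_residence == "Japan", ]
tj <- table(dj$filename, dj$email_addr)
by_rows <- table(tj[tj > 0])      # drop pairs that never occur

# Option 2: build the 3-way table and take the "Japan" slice by name
t3 <- table(d$filename, d$email_addr, d$country_residence)
sl <- t3[, , "Japan"]
by_slice <- table(sl[sl > 0])

# The mismatch behind the recycling: rows of d versus cells of the table
n_rows  <- nrow(d)       # length of the logical vector
n_cells <- length(t3)    # number of cells it gets recycled over

print(by_rows)
print(by_slice)
```

Both options index by the country itself, so they agree with each other, which
the positional [d$country_residence=="Japan"] form does not guarantee.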

Thank you,
Matt


On 6/18/07, jim holtman <jholtman at gmail.com> wrote:
> If you are running on windows, make sure you have 'recording' checked in the
> history window of the graphics.  You can also put the output to a pdf file
> and view it later.
>
> If you use table on the counts matrix:
>
> > table(counts)
> counts
>   0   1   2   3
> 253  20   8   9
> >
>
> this shows that there were 20 single tries, 8 files downloaded twice and 9
> three times.  Is this what you want?
>
> You can also get the indices of the non-zero entries by:
>
> > which(counts != 0, arr.ind=TRUE)
>        row col
> file1    1   1
> file5    6   2
> file1    1   3
> file2    3   3
> file7    8   4
> file8    9   4
> file1    1   5
> file2    3   5
> file2    3   6
> .........
>
>
>
>
> On 6/18/07, Matthew Trunnell <trunnell at cognix.net> wrote:
> > Jim,
> > Thanks for the quick reply!  When I run your code, I end up with a
> > single barplot of one datapoint, file9 vs email20 == 2.0.  I see the
> > call to barplot is inside a for loop... maybe it's zooming through the
> > display of many barplots, but all I see is the last one?
> >
> > In any case, I need to figure out the distribution of the retries, such as
> > No. Retries   Count
> > 1                 6
> > 2                 13
> > 3                 5
> > 4                 3
> > 5                 2
> > 6                 1
> >
> > That is, 6 people retried the download once; 13 people retried the
> > download twice, etc.  So it would be counting the frequency of the
> > email-filename combination, and grouping those together by the number
> > of retries.  Does that make sense?
> >
> > When I look at the counts object from your code, I can see that it's
> > close to what I need.  How do I access the properties of the counts
> > object-- it's a table, right?  If I look at counts[1,1], that returns
> > 1.  But how do I get at the row/col name of that cell?  Is that cell
> > an object?  rownames(counts[1,1]) returns null.
> >
> > Thanks,
> > Matt
> >
> >
> > On 6/18/07, jim holtman <jholtman at gmail.com> wrote:
> > > You should be using barplot and not hist.  I think this produces what
> > > you want:
> > >
> > > x <- "filename,last_modified,email_addr,country_residence
> > > file1,3/4/2006 13:54,email1,Korea (South)
> > > file2,3/4/2006 14:33,email2,United States
> > > file2,3/4/2006 16:03,email2,United States
> > > file2,3/4/2006 16:17,email3,United States
> > > file2,3/4/2006 16:28,email3,United States
> > > file3,3/4/2006 19:13,email4,United States
> > > file2,3/4/2006 21:22,email5,India
> > > file4,3/4/2006 21:46,email6,United States
> > > file1,3/4/2006 22:04,email7,Japan
> > > file2,3/4/2006 22:09,email8,Croatia
> > > file1,3/4/2006 22:22,email7,Japan
> > > file1,3/4/2006 22:29,email9,India
> > > file1,3/4/2006 23:06,email6,United States
> > > file1,3/4/2006 23:33,email6,United States
> > > file5,3/4/2006 23:44,email10,China
> > > file1,3/5/2006 0:13,email9,India
> > > file2,3/5/2006 0:52,email8,Croatia
> > > file2,3/5/2006 0:54,email8,Croatia
> > > file2,3/5/2006 1:10,email5,India
> > > file6,3/5/2006 2:17,email9,India
> > > file2,3/5/2006 2:24,email11,Italy
> > > file7,3/5/2006 2:36,email12,Italy
> > > file8,3/5/2006 2:52,email12,Italy
> > > file2,3/5/2006 3:09,email13,United Kingdom
> > > file2,3/5/2006 4:02,email14,India
> > > file2,3/5/2006 4:07,email14,India
> > > file2,3/5/2006 4:14,email14,India
> > > file2,3/5/2006 4:37,email5,India
> > > file2,3/5/2006 4:44,email15,Belgium
> > > file1,3/5/2006 5:02,email9,India
> > > file1,3/5/2006 5:24,email16,Taiwan
> > > file2,3/5/2006 6:06,email17,Saudi Arabia
> > > file2,3/5/2006 7:32,email17,Saudi Arabia
> > > file2,3/5/2006 8:12,email18,Brazil
> > > file2,3/5/2006 8:26,email18,Brazil
> > > file2,3/5/2006 9:49,email19,United Kingdom
> > > file1,3/5/2006 10:49,email11,Italy
> > > file1,3/5/2006 11:16,email13,United Kingdom
> > > file1,3/5/2006 11:16,email13,United Kingdom
> > > file1,3/5/2006 11:45,email13,United Kingdom
> > > file1,3/5/2006 14:34,email20,Australia
> > > file9,3/5/2006 14:56,email20,Australia
> > > file9,3/5/2006 14:56,email20,Australia
> > > file5,3/5/2006 16:43,email21,United States
> > > file1,3/5/2006 17:17,email7,Japan
> > > file2,3/5/2006 17:26,email22,Japan
> > > file2,3/5/2006 17:27,email22,Japan
> > > file2,3/5/2006 17:33,email23,China
> > > file1,3/5/2006 17:45,email22,Japan
> > > file2,3/5/2006 17:45,email22,Japan
> > > file2,3/5/2006 17:59,email23,China
> > > file1,3/5/2006 18:27,email24,Japan
> > > file1,3/5/2006 18:47,email25,Taiwan
> > > file2,3/5/2006 18:48,email26,New Zealand
> > > file2,3/5/2006 19:15,email27,Canada
> > > file2,3/5/2006 19:23,email28,Canada
> > > file2,3/5/2006 19:24,email28,Canada
> > > file10,3/5/2006 19:49,email29,Japan
> > > file10,3/5/2006 19:52,email29,Japan
> > > file10,3/5/2006 19:57,email29,Japan
> > > file2,3/5/2006 20:01,email29,Japan
> > > file2,3/5/2006 20:02,email29,Japan
> > > file2,3/5/2006 20:06,email29,Japan"
> > > d <- read.csv(textConnection(x))
> > > # plot counts for all the files
> > > barplot(table(d$filename), main="All Files", las=2)
> > > # generate plots for each file name showing which emails used them
> > > counts <- table(d$filename, d$email_addr)
> > > for (i in seq(nrow(counts))){
> > >     .index <- which(counts[i,] > 0)
> > >     barplot(counts[i, .index], las=2,
> > >         names.arg=colnames(counts)[.index], main=rownames(counts)[i])
> > > }
> > >
> > >
> > >
> > > On 6/18/07, Matthew Trunnell <trunnell at cognix.net> wrote:
> > > >
> > > > Hello R gurus,
> > > >
> > > > I just spent my first weekend wrestling with R, but so far have come
> > > > up empty handed.
> > > >
> > > > I have a dataset that represents file downloads; it has 4 dimensions:
> > > > date, filename, email, and country.  (sample data below)
> > > >
> > > > My first goal is to get an idea of the frequency of repeated
> > > > downloads.  Let me explain that.  Some people tend to download
> > > > multiple times, e.g. if the download fails they keep trying over and
> > > > over.  I'm trying to build a histogram that shows the repeat count
> > > > along the x-axis, that is, how many people downloaded once, twice,
> > > > three times, etc.  I plan to compare the median of that before and
> > > > after we switched ISPs.
> > > >
> > > > To accomplish this, I'm assuming that I'll first need to combine the
> > > > email and filename columns so as to represent a single download
> > > > attempt by an individual.  Does that sound right?  Later, it would be
> > > > nice to limit the histogram to a single filename, country, or company.
> > > > I can probably figure that out myself after I understand how to write
> > > > this funky histogram expression.
> > > >
> > > > With the help of Verzani's introductory text, I've learned how to read
> > > > in the CSV data and do some simple tables, like this:
> > > >
> > > > hist(table(d$filename))
> > > > hist(table(d$filename[substring(d$filename, 1, 5)=="file1"]))
> > > > hist(sort(table(d$filename[substring(d$filename, 1, 5)=="file1"])))
> > > >
> > > > Obviously, these commands count the frequency of the files.  What I'd
> > > > like to see are the repeats grouped along the x-axis;  I'd like to
> > > > find, for all files, the distribution of retries.  I hope that makes
> > > > sense. :)
> > > >
> > > > Can someone point me in the right direction?  I'm very new to R and to
> > > > statistics, but I write code for a living.  At this point I'd almost
> > > > be better off writing a program to do this kind of simple counting... but
> > > > I have a feeling R would be so useful if I could just get past the
> > > > initial learning curve.
> > > >
> > > > Thank you in advance,
> > > > Matt
> > > >
> > > > Here's some real data, with the private info replaced :)
> > > >
> > > > d <- read.table(file="C:\\users\\trunnellm\\downloads\\statistics\\downloads.csv",
> > > >                 sep=",", quote="\"", header=TRUE)
> > > >
> > > > filename,last_modified,email_addr,country_residence
> > > > file1,3/4/2006 13:54,email1,Korea (South)
> > > > file2,3/4/2006 14:33,email2,United States
> > > > file2,3/4/2006 16:03,email2,United States
> > > > file2,3/4/2006 16:17,email3,United States
> > > > file2,3/4/2006 16:28,email3,United States
> > > > file3,3/4/2006 19:13,email4,United States
> > > > file2,3/4/2006 21:22,email5,India
> > > > file4,3/4/2006 21:46,email6,United States
> > > > file1,3/4/2006 22:04,email7,Japan
> > > > file2,3/4/2006 22:09,email8,Croatia
> > > > file1,3/4/2006 22:22,email7,Japan
> > > > file1,3/4/2006 22:29,email9,India
> > > > file1,3/4/2006 23:06,email6,United States
> > > > file1,3/4/2006 23:33,email6,United States
> > > > file5,3/4/2006 23:44,email10,China
> > > > file1,3/5/2006 0:13,email9,India
> > > > file2,3/5/2006 0:52,email8,Croatia
> > > > file2,3/5/2006 0:54,email8,Croatia
> > > > file2,3/5/2006 1:10,email5,India
> > > > file6,3/5/2006 2:17,email9,India
> > > > file2,3/5/2006 2:24,email11,Italy
> > > > file7,3/5/2006 2:36,email12,Italy
> > > > file8,3/5/2006 2:52,email12,Italy
> > > > file2,3/5/2006 3:09,email13,United Kingdom
> > > > file2,3/5/2006 4:02,email14,India
> > > > file2,3/5/2006 4:07,email14,India
> > > > file2,3/5/2006 4:14,email14,India
> > > > file2,3/5/2006 4:37,email5,India
> > > > file2,3/5/2006 4:44,email15,Belgium
> > > > file1,3/5/2006 5:02,email9,India
> > > > file1,3/5/2006 5:24,email16,Taiwan
> > > > file2,3/5/2006 6:06,email17,Saudi Arabia
> > > > file2,3/5/2006 7:32,email17,Saudi Arabia
> > > > file2,3/5/2006 8:12,email18,Brazil
> > > > file2,3/5/2006 8:26,email18,Brazil
> > > > file2,3/5/2006 9:49,email19,United Kingdom
> > > > file1,3/5/2006 10:49,email11,Italy
> > > > file1,3/5/2006 11:16,email13,United Kingdom
> > > > file1,3/5/2006 11:16,email13,United Kingdom
> > > > file1,3/5/2006 11:45,email13,United Kingdom
> > > > file1,3/5/2006 14:34,email20,Australia
> > > > file9,3/5/2006 14:56,email20,Australia
> > > > file9,3/5/2006 14:56,email20,Australia
> > > > file5,3/5/2006 16:43,email21,United States
> > > > file1,3/5/2006 17:17,email7,Japan
> > > > file2,3/5/2006 17:26,email22,Japan
> > > > file2,3/5/2006 17:27,email22,Japan
> > > > file2,3/5/2006 17:33,email23,China
> > > > file1,3/5/2006 17:45,email22,Japan
> > > > file2,3/5/2006 17:45,email22,Japan
> > > > file2,3/5/2006 17:59,email23,China
> > > > file1,3/5/2006 18:27,email24,Japan
> > > > file1,3/5/2006 18:47,email25,Taiwan
> > > > file2,3/5/2006 18:48,email26,New Zealand
> > > > file2,3/5/2006 19:15,email27,Canada
> > > > file2,3/5/2006 19:23,email28,Canada
> > > > file2,3/5/2006 19:24,email28,Canada
> > > > file10,3/5/2006 19:49,email29,Japan
> > > > file10,3/5/2006 19:52,email29,Japan
> > > > file10,3/5/2006 19:57,email29,Japan
> > > > file2,3/5/2006 20:01,email29,Japan
> > > > file2,3/5/2006 20:02,email29,Japan
> > > > file2,3/5/2006 20:06,email29,Japan
> > > >
> > > > ______________________________________________
> > > > R-help at stat.math.ethz.ch mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > and provide commented, minimal, self-contained, reproducible code.
> > > >
> > >
> > >
> > >
> > > --
> > > Jim Holtman
> > > Cincinnati, OH
> > > +1 513 646 9390
> > >
> > > What is the problem you are trying to solve?
> >


