[R] Making tapply code more efficient

William Dunlap wdunlap at tibco.com
Mon Mar 9 19:33:24 CET 2009


The sparse xtabs solution is good.  It avoids the overhead
of a long list of small vectors that tapply can require.

Here is another approach, perhaps more like what
a SAS programmer might do.  It can run quickly
and use little memory but also can be tricky to set up.
Sort the data into groups (by student_unique_id here)
and within the groups sort by teacher_unique_id.  order()
does this nicely. Once the sorting is done you only have 
to do comparisons of observations with their neighbors to
identify the first and last observations in a group (now
a run, because of the sorting).  In this case you just
need to ask if the first teacher id in a given run is
the same as the last teach id in that run.  

f2 <- function(qq, needs.sorting = TRUE) {
  # we sort by student_id, breaking ties with teacher_id, unless
  # user asserts this has already been done.
  student_unique_id <- qq$student_unique_id
  teacher_unique_id <- qq$teacher_unique_id
  if (needs.sorting) {
     ord <- order(student_unique_id, teacher_unique_id)
     student_unique_id <- student_unique_id[ord]
     teacher_unique_id <- teacher_unique_id[ord]
  }
  freq <- table(student_unique_id) # table is fast way to count by
groups
  freq <- freq[freq>0]
  # Because of sorting, if teacher_id in first record for a given
student
  # is the same as the teacher_id for the last record for that student,
  # then that student had only 1 teacher.
  # These functions fail if x contains NA's, but sorting has removed
them
  isFirstInRun <- function(x)c(TRUE, x[-1]!=x[-length(x)])
  isLastInRun <- function(x)c(x[-1]!=x[-length(x)], TRUE)
  tch <- teacher_unique_id[isFirstInRun(student_unique_id)] ==
         teacher_unique_id[isLastInRun(student_unique_id)]
  results <- data.frame(
      row.names=dimnames(freq)[[1]],
      Freq=as.vector(freq),
      tch=tch)
  results
}

In this case this uses 50% more time than the xtabs(sparse=TRUE)
approach
but it may be worthwhile to get familiar with this sort of algorithm.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com 
--------------------------------------------------
[R] Making tapply code more efficient

Doran, Harold HDoran at air.org 
Mon Mar 9 15:43:47 CET 2009
Previous message: [R] Making tapply code more efficient
Next message: [R] rcorr.cens Goodman-Kruskal gamma
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I think I might be able to answer my own question here. It turns out in
the step 

tab <- table(dats3$student_unique_id, dats3$teacher_unique_id)

The dimensions of this table in my data are significantly larger than
the simulated data, thus consuming more memory. So, I know that the lme4
package has a method for sparse crosstabs, so I tried this:

> library(lme4)

> tab <- xtabs(~ dats3$student_unique_id + dats3$teacher_unique_id,
sparse = TRUE)

> result <- data.frame(Student = rownames(tab), Freq = rowSums(tab), tch
= rowSums(tab > 0) == 1)

And the world works beautifully.



> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Doran, Harold
> Sent: Monday, March 09, 2009 10:25 AM
> To: ONKELINX, Thierry; jholtman at gmail.com
> Cc: r-help at r-project.org
> Subject: Re: [R] Making tapply code more efficient
> 
> Thierry and Jim:
> 
> Thank you both for your reply. I remain a bit baffled over something.
> Here is the sample data generated by jim and code by Thierry, 
> which works exactly as expected.
> 
> x <- cbind(sample(326397, 800967, TRUE), sample(20, 800967, 
> TRUE)) x <- data.frame(x) names(x)[1:2] <- 
> c('student_unique_id', 'teacher_unique_id') tab <- 
> table(x$student_unique_id, x$teacher_unique_id) result <- 
> data.frame(Student = rownames(tab), Freq = rowSums(tab), tch 
> = rowSums(tab > 0) == 1)
> 
> Now, here is what happens when I run this on my data (called dats3)
> 
> > tab <- table(dats3$student_unique_id, dats3$teacher_unique_id)
> Error: cannot allocate vector of size 942.8 Mb
> 
> So, let's take a look at a couple of things:
> 
> > object.size(dats3) < object.size(x)
> [1] TRUE
> 
> > str(x)
> 'data.frame':   800967 obs. of  2 variables:
>  $ student_unique_id: int  121914 89142 127790 61350 54684 
> 28018 313428
> 27595 316285 173571 ...
>  $ teacher_unique_id: int  17 1 19 20 3 18 15 1 14 15 ...
> > str(dats3)
> 'data.frame':   56204 obs. of  2 variables:
>  $ student_unique_id: int  20504 26172 20504 3609 4313 5058 5363 5669
> 6429 6560 ...
>  $ teacher_unique_id: int  35078 41029 35078 41437 41476 41456 41486
> 35415 41508 35413 ...
> 
> The sample data are smaller in size than my actual data and 
> the structure is exactly the same. Do you see any other 
> reason why the memory issue would arise here?
> 
> Harold
> 
> 
> 
>  
> 
> > -----Original Message-----
> > From: ONKELINX, Thierry [mailto:Thierry.ONKELINX at inbo.be]
> > Sent: Friday, February 27, 2009 10:24 AM
> > To: Doran, Harold; r-help at r-project.org
> > Subject: RE: [R] Making tapply code more efficient
> > 
> > Hi Harold,
> > 
> > What about this? You one have to make the crosstabulation once.
> > 
> > > qq <- data.frame(student = factor(c(1,1,2,2,2)), teacher =
> > factor(c(10,10,20,20,25)))
> > > tab <- table(qq$student, qq$teacher) data.frame(Student = 
> > > rownames(tab), Freq = rowSums(tab), tch =
> > rowSums(tab > 0) == 1)
> >   Student Freq   tch
> > 1       1    2  TRUE
> > 2       2    3 FALSE
> > 
> > HTH,
> > 
> > Thierry
> > 
> > 
> > --------------------------------------------------------------
> > ----------
> > ----
> > ir. Thierry Onkelinx
> > Instituut voor natuur- en bosonderzoek / Research Institute 
> for Nature 
> > and Forest Cel biometrie, methodologie en kwaliteitszorg / Section 
> > biometrics, methodology and quality assurance Gaverstraat 4 9500 
> > Geraardsbergen Belgium tel. + 32
> > 54/436 185 Thierry.Onkelinx at inbo.be www.inbo.be
> > 
> > To call in the statistician after the experiment is done may be no 
> > more than asking him to perform a post-mortem
> > examination: he may be able to say what the experiment died of.
> > ~ Sir Ronald Aylmer Fisher
> > 
> > The plural of anecdote is not data.
> > ~ Roger Brinner
> > 
> > The combination of some data and an aching desire for an 
> answer does 
> > not ensure that a reasonable answer can be extracted from a 
> given body 
> > of data.
> > ~ John Tukey
> > 
> > -----Oorspronkelijk bericht-----
> > Van: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org]
> > Namens Doran, Harold
> > Verzonden: vrijdag 27 februari 2009 15:47
> > Aan: r-help at r-project.org
> > Onderwerp: [R] Making tapply code more efficient
> > 
> > Previously, I posed the question pasted down below to the list and 
> > received some very helpful responses. While the code suggestions 
> > provided in response indeed work, they seem to only work 
> with *very* 
> > small data sets and so I wanted to follow up and see if anyone had 
> > ideas for better efficiency.
> > I was quite embarrased on this as our SAS programmers cranked out 
> > programs that did this in the blink of an eye (with a few 
> variables), 
> > but R was spinning for days on my Ubuntu machine and 
> ultimately I saw 
> > a message that R was "killed".
> > 
> > The data I am working with has 800967 total rows and 31 
> total columns.
> > The ID variable I use as the index variable in tapply() has
> > 326397 unique cases.
> > 
> > > length(unique(qq$student_unique_id))
> > [1] 326397
> > 
> > To give a sense of what my data look like and the actual problem, 
> > consider the following:
> > 
> > qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)), 
> > teacher_unique_id = factor(c(10,10,20,20,25)))
> > 
> > This is a student achievement database where students 
> occupy multiple 
> > rows in the data and the variable teacher_unique_id denotes 
> the class 
> > the student was in. What I am doing is looking to see if 
> the teacher 
> > is the same for each instance of the unique student ID. So, if I 
> > implement the following:
> > 
> > same <- function(x) length( unique(x) ) == 1 results <- data.frame(
> >         freq = tapply(qq$student_unique_id, qq$student_unique_id, 
> > length),
> >         tch = tapply(qq$teacher_unique_id, 
> qq$student_unique_id, same)
> > )
> > 
> > I get the following results. I can see that student 1 
> appears in the 
> > data twice and the teacher is always the same.
> > However, student 2 appears three times and the teacher is 
> not always 
> > the same.
> > 
> > > results
> >   freq   tch
> > 1    2  TRUE
> > 2    3 FALSE
> > 
> > Now, implementing this same procedure to a large data set with the 
> > characteristics described above seems to be problematic in this 
> > implementation.
> > 
> > Does anyone have reactions on how this could be more efficient such 
> > that it can run with large data as I described?
> > 
> > Harold
> > 
> > > sessionInfo()
> > R version 2.8.1 (2008-12-22)
> > x86_64-pc-linux-gnu
> > 
> > locale:
> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLA
> > TE=en_US.U
> > TF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-
> > 8;LC_NAME=
> > C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_ID
> > ENTIFICATI
> > ON=C
> > 
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
> > 
> > 
> > 
> > 
> > ##### Original question posted on 1/13/09 Suppose I have a 
> dataframe 
> > as follows:
> > 
> > dat <- data.frame(id = c(1,1,2,2,2), var1 = 
> c(10,10,20,20,25), var2 = 
> > c('foo', 'foo', 'foo', 'foobar', 'foo'))
> > 
> > Now, if I were to subset by id, such as:
> > 
> > > subset(dat, id==1)
> >   id var1 var2
> > 1  1   10  foo
> > 2  1   10  foo
> > 
> > I can see that the elements in var1 are exactly the same and the 
> > elements in var2 are exactly the same. However,
> > 
> > > subset(dat, id==2)
> >   id var1   var2
> > 3  2   20    foo
> > 4  2   20 foobar
> > 5  2   25    foo
> > 
> > Shows the elements are not the same for either variable in this 
> > instance. So, what I am looking to create is a data frame 
> that would 
> > be like this
> > 
> > id      freq    var1    var2
> > 1       2       TRUE    TRUE   
> > 2       3       FALSE   FALSE
> > 
> > Where freq is the number of times the ID is repeated in the 
> dataframe. 
> > A TRUE appears in the cell if all elements in the column 
> are the same 
> > for the ID and FALSE otherwise. It is insignificant which values 
> > differ for my problem.
> > 
> > The way I am thinking about tackling this is to loop through the ID 
> > variable and compare the values in the various columns of the 
> > dataframe.
> > The problem I am encountering is that I don't think all.equal or 
> > identical are the right functions in this case.
> > 
> > So, say I was wanting to compare the elements of var1 for id ==1. I 
> > would have
> > 
> > x <- c(10,10)
> > 
> > Of course, the following works
> > 
> > > all.equal(x[1], x[2])
> > [1] TRUE
> > 
> > As would a similar call to identical. However, what if I 
> only have a 
> > vector of values (or if the column consists of names) that 
> I want to 
> > assess for equality when I am trying to automate a process over 
> > thousands of cases? As in the example above, the vector may contain 
> > only two values or it may contain many more. The number of 
> values in 
> > the vector differ by id.
> > 
> > Any thoughts?
> > 
> > Harold
> > 
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> > 
> > Dit bericht en eventuele bijlagen geven enkel de visie van de 
> > schrijver weer en binden het INBO onder geen enkel beding, 
> zolang dit 
> > bericht niet bevestigd is door een geldig ondertekend document. The 
> > views expressed in  this message and any annex are purely 
> those of the 
> > writer and may not be regarded as stating an official position of 
> > INBO, as long as the message is not confirmed by a duly signed 
> > document.
> > 
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.




More information about the R-help mailing list