[R] Making tapply code more efficient
William Dunlap
wdunlap at tibco.com
Mon Mar 9 19:33:24 CET 2009
The sparse xtabs solution is good. It avoids the overhead
of a long list of small vectors that tapply can require.
Here is another approach, perhaps more like what
a SAS programmer might do. It can run quickly
and use little memory but also can be tricky to set up.
Sort the data into groups (by student_unique_id here)
and within the groups sort by teacher_unique_id. order()
does this nicely. Once the sorting is done you only have
to do comparisons of observations with their neighbors to
identify the first and last observations in a group (now
a run, because of the sorting). In this case you just
need to ask if the first teacher id in a given run is
the same as the last teach id in that run.
f2 <- function(qq, needs.sorting = TRUE) {
# we sort by student_id, breaking ties with teacher_id, unless
# user asserts this has already been done.
student_unique_id <- qq$student_unique_id
teacher_unique_id <- qq$teacher_unique_id
if (needs.sorting) {
ord <- order(student_unique_id, teacher_unique_id)
student_unique_id <- student_unique_id[ord]
teacher_unique_id <- teacher_unique_id[ord]
}
freq <- table(student_unique_id) # table is fast way to count by
groups
freq <- freq[freq>0]
# Because of sorting, if teacher_id in first record for a given
student
# is the same as the teacher_id for the last record for that student,
# then that student had only 1 teacher.
# These functions fail if x contains NA's, but sorting has removed
them
isFirstInRun <- function(x)c(TRUE, x[-1]!=x[-length(x)])
isLastInRun <- function(x)c(x[-1]!=x[-length(x)], TRUE)
tch <- teacher_unique_id[isFirstInRun(student_unique_id)] ==
teacher_unique_id[isLastInRun(student_unique_id)]
results <- data.frame(
row.names=dimnames(freq)[[1]],
Freq=as.vector(freq),
tch=tch)
results
}
In this case this uses 50% more time than the xtabs(sparse=TRUE)
approach
but it may be worthwhile to get familiar with this sort of algorithm.
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
--------------------------------------------------
[R] Making tapply code more efficient
Doran, Harold HDoran at air.org
Mon Mar 9 15:43:47 CET 2009
Previous message: [R] Making tapply code more efficient
Next message: [R] rcorr.cens Goodman-Kruskal gamma
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I think I might be able to answer my own question here. It turns out in
the step
tab <- table(dats3$student_unique_id, dats3$teacher_unique_id)
The dimensions of this table in my data are significantly larger than
the simulated data, thus consuming more memory. So, I know that the lme4
package has a method for sparse crosstabs, so I tried this:
> library(lme4)
> tab <- xtabs(~ dats3$student_unique_id + dats3$teacher_unique_id,
sparse = TRUE)
> result <- data.frame(Student = rownames(tab), Freq = rowSums(tab), tch
= rowSums(tab > 0) == 1)
And the world works beautifully.
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Doran, Harold
> Sent: Monday, March 09, 2009 10:25 AM
> To: ONKELINX, Thierry; jholtman at gmail.com
> Cc: r-help at r-project.org
> Subject: Re: [R] Making tapply code more efficient
>
> Thierry and Jim:
>
> Thank you both for your reply. I remain a bit baffled over something.
> Here is the sample data generated by jim and code by Thierry,
> which works exactly as expected.
>
> x <- cbind(sample(326397, 800967, TRUE), sample(20, 800967,
> TRUE)) x <- data.frame(x) names(x)[1:2] <-
> c('student_unique_id', 'teacher_unique_id') tab <-
> table(x$student_unique_id, x$teacher_unique_id) result <-
> data.frame(Student = rownames(tab), Freq = rowSums(tab), tch
> = rowSums(tab > 0) == 1)
>
> Now, here is what happens when I run this on my data (called dats3)
>
> > tab <- table(dats3$student_unique_id, dats3$teacher_unique_id)
> Error: cannot allocate vector of size 942.8 Mb
>
> So, let's take a look at a couple of things:
>
> > object.size(dats3) < object.size(x)
> [1] TRUE
>
> > str(x)
> 'data.frame': 800967 obs. of 2 variables:
> $ student_unique_id: int 121914 89142 127790 61350 54684
> 28018 313428
> 27595 316285 173571 ...
> $ teacher_unique_id: int 17 1 19 20 3 18 15 1 14 15 ...
> > str(dats3)
> 'data.frame': 56204 obs. of 2 variables:
> $ student_unique_id: int 20504 26172 20504 3609 4313 5058 5363 5669
> 6429 6560 ...
> $ teacher_unique_id: int 35078 41029 35078 41437 41476 41456 41486
> 35415 41508 35413 ...
>
> The sample data are smaller in size than my actual data and
> the structure is exactly the same. Do you see any other
> reason why the memory issue would arise here?
>
> Harold
>
>
>
>
>
> > -----Original Message-----
> > From: ONKELINX, Thierry [mailto:Thierry.ONKELINX at inbo.be]
> > Sent: Friday, February 27, 2009 10:24 AM
> > To: Doran, Harold; r-help at r-project.org
> > Subject: RE: [R] Making tapply code more efficient
> >
> > Hi Harold,
> >
> > What about this? You one have to make the crosstabulation once.
> >
> > > qq <- data.frame(student = factor(c(1,1,2,2,2)), teacher =
> > factor(c(10,10,20,20,25)))
> > > tab <- table(qq$student, qq$teacher) data.frame(Student =
> > > rownames(tab), Freq = rowSums(tab), tch =
> > rowSums(tab > 0) == 1)
> > Student Freq tch
> > 1 1 2 TRUE
> > 2 2 3 FALSE
> >
> > HTH,
> >
> > Thierry
> >
> >
> > --------------------------------------------------------------
> > ----------
> > ----
> > ir. Thierry Onkelinx
> > Instituut voor natuur- en bosonderzoek / Research Institute
> for Nature
> > and Forest Cel biometrie, methodologie en kwaliteitszorg / Section
> > biometrics, methodology and quality assurance Gaverstraat 4 9500
> > Geraardsbergen Belgium tel. + 32
> > 54/436 185 Thierry.Onkelinx at inbo.be www.inbo.be
> >
> > To call in the statistician after the experiment is done may be no
> > more than asking him to perform a post-mortem
> > examination: he may be able to say what the experiment died of.
> > ~ Sir Ronald Aylmer Fisher
> >
> > The plural of anecdote is not data.
> > ~ Roger Brinner
> >
> > The combination of some data and an aching desire for an
> answer does
> > not ensure that a reasonable answer can be extracted from a
> given body
> > of data.
> > ~ John Tukey
> >
> > -----Oorspronkelijk bericht-----
> > Van: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org]
> > Namens Doran, Harold
> > Verzonden: vrijdag 27 februari 2009 15:47
> > Aan: r-help at r-project.org
> > Onderwerp: [R] Making tapply code more efficient
> >
> > Previously, I posed the question pasted down below to the list and
> > received some very helpful responses. While the code suggestions
> > provided in response indeed work, they seem to only work
> with *very*
> > small data sets and so I wanted to follow up and see if anyone had
> > ideas for better efficiency.
> > I was quite embarrased on this as our SAS programmers cranked out
> > programs that did this in the blink of an eye (with a few
> variables),
> > but R was spinning for days on my Ubuntu machine and
> ultimately I saw
> > a message that R was "killed".
> >
> > The data I am working with has 800967 total rows and 31
> total columns.
> > The ID variable I use as the index variable in tapply() has
> > 326397 unique cases.
> >
> > > length(unique(qq$student_unique_id))
> > [1] 326397
> >
> > To give a sense of what my data look like and the actual problem,
> > consider the following:
> >
> > qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)),
> > teacher_unique_id = factor(c(10,10,20,20,25)))
> >
> > This is a student achievement database where students
> occupy multiple
> > rows in the data and the variable teacher_unique_id denotes
> the class
> > the student was in. What I am doing is looking to see if
> the teacher
> > is the same for each instance of the unique student ID. So, if I
> > implement the following:
> >
> > same <- function(x) length( unique(x) ) == 1 results <- data.frame(
> > freq = tapply(qq$student_unique_id, qq$student_unique_id,
> > length),
> > tch = tapply(qq$teacher_unique_id,
> qq$student_unique_id, same)
> > )
> >
> > I get the following results. I can see that student 1
> appears in the
> > data twice and the teacher is always the same.
> > However, student 2 appears three times and the teacher is
> not always
> > the same.
> >
> > > results
> > freq tch
> > 1 2 TRUE
> > 2 3 FALSE
> >
> > Now, implementing this same procedure to a large data set with the
> > characteristics described above seems to be problematic in this
> > implementation.
> >
> > Does anyone have reactions on how this could be more efficient such
> > that it can run with large data as I described?
> >
> > Harold
> >
> > > sessionInfo()
> > R version 2.8.1 (2008-12-22)
> > x86_64-pc-linux-gnu
> >
> > locale:
> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLA
> > TE=en_US.U
> > TF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-
> > 8;LC_NAME=
> > C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_ID
> > ENTIFICATI
> > ON=C
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> >
> >
> >
> > ##### Original question posted on 1/13/09 Suppose I have a
> dataframe
> > as follows:
> >
> > dat <- data.frame(id = c(1,1,2,2,2), var1 =
> c(10,10,20,20,25), var2 =
> > c('foo', 'foo', 'foo', 'foobar', 'foo'))
> >
> > Now, if I were to subset by id, such as:
> >
> > > subset(dat, id==1)
> > id var1 var2
> > 1 1 10 foo
> > 2 1 10 foo
> >
> > I can see that the elements in var1 are exactly the same and the
> > elements in var2 are exactly the same. However,
> >
> > > subset(dat, id==2)
> > id var1 var2
> > 3 2 20 foo
> > 4 2 20 foobar
> > 5 2 25 foo
> >
> > Shows the elements are not the same for either variable in this
> > instance. So, what I am looking to create is a data frame
> that would
> > be like this
> >
> > id freq var1 var2
> > 1 2 TRUE TRUE
> > 2 3 FALSE FALSE
> >
> > Where freq is the number of times the ID is repeated in the
> dataframe.
> > A TRUE appears in the cell if all elements in the column
> are the same
> > for the ID and FALSE otherwise. It is insignificant which values
> > differ for my problem.
> >
> > The way I am thinking about tackling this is to loop through the ID
> > variable and compare the values in the various columns of the
> > dataframe.
> > The problem I am encountering is that I don't think all.equal or
> > identical are the right functions in this case.
> >
> > So, say I was wanting to compare the elements of var1 for id ==1. I
> > would have
> >
> > x <- c(10,10)
> >
> > Of course, the following works
> >
> > > all.equal(x[1], x[2])
> > [1] TRUE
> >
> > As would a similar call to identical. However, what if I
> only have a
> > vector of values (or if the column consists of names) that
> I want to
> > assess for equality when I am trying to automate a process over
> > thousands of cases? As in the example above, the vector may contain
> > only two values or it may contain many more. The number of
> values in
> > the vector differ by id.
> >
> > Any thoughts?
> >
> > Harold
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> > Dit bericht en eventuele bijlagen geven enkel de visie van de
> > schrijver weer en binden het INBO onder geen enkel beding,
> zolang dit
> > bericht niet bevestigd is door een geldig ondertekend document. The
> > views expressed in this message and any annex are purely
> those of the
> > writer and may not be regarded as stating an official position of
> > INBO, as long as the message is not confirmed by a duly signed
> > document.
> >
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list