[R] Making tapply code more efficient
jim holtman
jholtman at gmail.com
Fri Feb 27 19:59:21 CET 2009
On something the size of your data it took about 30 seconds to
determine the number of unique teachers per student.
> x <- cbind(sample(326397, 800967, TRUE), sample(20, 800967, TRUE))
> # split the data so you have the teachers for each student
> system.time(t.s <- split(x[,2], x[,1]))
   user  system elapsed
   0.92    0.01    0.94
> t.s[1:7] # sample data
$`1`
[1] 16
$`2`
[1] 3
$`3`
[1] 1
$`4`
[1] 17
$`6`
[1] 9 9 19
$`7`
[1] 20
$`9`
[1] 3 16 16 10 8 17
> # count number of unique teachers per student
> system.time(t.a <- sapply(t.s, function(x) length(unique(x))))
   user  system elapsed
  20.17    0.10   20.26
> t.a[1:10]
 1  2  3  4  6  7  9 10 11 12
 1  1  1  1  2  1  5  1  1  1
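
The counts above can be turned into the freq/tch table from the question. Here is a minimal sketch on a small made-up matrix rather than the full-size data; the toy values and the name `results` are only for illustration, and I have not timed this step on 800967 rows.

```r
# toy stand-in for the big matrix: column 1 = student, column 2 = teacher
x <- cbind(c(1, 1, 2, 2, 2), c(10, 10, 20, 20, 25))

# teachers grouped by student, as in the split() call above
t.s <- split(x[, 2], x[, 1])

# number of distinct teachers per student
t.a <- sapply(t.s, function(v) length(unique(v)))

# rows per student, and whether the teacher is always the same
results <- data.frame(freq = sapply(t.s, length), tch = t.a == 1)
results
```

Since `t.a` is already computed, the `tch` column is just a vectorized comparison and adds essentially no extra cost.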
On Fri, Feb 27, 2009 at 9:46 AM, Doran, Harold <HDoran at air.org> wrote:
> Previously, I posed the question pasted down below to the list and
> received some very helpful responses. While the code suggestions
> provided in response indeed work, they seem to only work with *very*
> small data sets and so I wanted to follow up and see if anyone had ideas
> for better efficiency. I was quite embarrassed on this as our SAS
> programmers cranked out programs that did this in the blink of an eye
> (with a few variables), but R was spinning for days on my Ubuntu machine
> and ultimately I saw a message that R was "killed".
>
> The data I am working with has 800967 total rows and 31 total columns.
> The ID variable I use as the index variable in tapply() has 326397
> unique cases.
>
>> length(unique(qq$student_unique_id))
> [1] 326397
>
> To give a sense of what my data look like and the actual problem,
> consider the following:
>
> qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)),
> teacher_unique_id = factor(c(10,10,20,20,25)))
>
> This is a student achievement database where students occupy multiple
> rows in the data and the variable teacher_unique_id denotes the class
> the student was in. What I am doing is looking to see if the teacher is
> the same for each instance of the unique student ID. So, if I implement
> the following:
>
> same <- function(x) length( unique(x) ) == 1
> results <- data.frame(
> freq = tapply(qq$student_unique_id, qq$student_unique_id,
> length),
> tch = tapply(qq$teacher_unique_id, qq$student_unique_id, same)
> )
>
> I get the following results. I can see that student 1 appears in the
> data twice and the teacher is always the same. However, student 2
> appears three times and the teacher is not always the same.
>
>> results
> freq tch
> 1 2 TRUE
> 2 3 FALSE
>
> Now, applying this same procedure to a large data set with the
> characteristics described above seems to be problematic in this
> implementation.
>
> Does anyone have reactions on how this could be more efficient such that
> it can run with large data as I described?
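
One possible way to avoid calling a function once per student is to drop duplicate student/teacher pairs first and then count with table(). This is only a sketch on the small qq example, not something benchmarked on the full data; the names `pairs`, `freq` and `n.tch` are made up here.

```r
qq <- data.frame(student_unique_id = factor(c(1, 1, 2, 2, 2)),
                 teacher_unique_id = factor(c(10, 10, 20, 20, 25)))

# rows per student
freq <- table(qq$student_unique_id)

# keep one row per distinct student/teacher pair, then count pairs per student
pairs <- unique(qq[, c("student_unique_id", "teacher_unique_id")])
n.tch <- table(pairs$student_unique_id)

# both tables are ordered by the same factor levels, so they line up
results <- data.frame(freq = as.vector(freq),
                      tch = as.vector(n.tch) == 1,
                      row.names = names(freq))
results
```

A student with exactly one distinct pair has always had the same teacher, so `n.tch == 1` plays the role of the same() function in the tapply() call.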
>
> Harold
>
>> sessionInfo()
> R version 2.8.1 (2008-12-22)
> x86_64-pc-linux-gnu
>
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
>
>
>
> ##### Original question posted on 1/13/09
> Suppose I have a dataframe as follows:
>
> dat <- data.frame(id = c(1,1,2,2,2), var1 = c(10,10,20,20,25), var2 =
> c('foo', 'foo', 'foo', 'foobar', 'foo'))
>
> Now, if I were to subset by id, such as:
>
>> subset(dat, id==1)
> id var1 var2
> 1 1 10 foo
> 2 1 10 foo
>
> I can see that the elements in var1 are exactly the same and the
> elements in var2 are exactly the same. However,
>
>> subset(dat, id==2)
> id var1 var2
> 3 2 20 foo
> 4 2 20 foobar
> 5 2 25 foo
>
> Shows the elements are not the same for either variable in this
> instance. So, what I am looking to create is a data frame that would be
> like this
>
> id freq var1 var2
> 1 2 TRUE TRUE
> 2 3 FALSE FALSE
>
> Where freq is the number of times the ID is repeated in the dataframe. A
> TRUE appears in the cell if all elements in the column are the same for
> the ID and FALSE otherwise. Which values differ does not matter for
> my problem.
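
For the small dat example, a table of that shape can be sketched with tapply() and a length(unique()) test. This is an illustration under my own naming (`same`, `out`), fine for small data but not tuned for hundreds of thousands of ids.

```r
dat <- data.frame(id = c(1, 1, 2, 2, 2),
                  var1 = c(10, 10, 20, 20, 25),
                  var2 = c('foo', 'foo', 'foo', 'foobar', 'foo'))

# TRUE when every element of v is the same value
same <- function(v) length(unique(v)) == 1

# table() and tapply() both order their groups by sorted id,
# so the columns below line up row by row
out <- data.frame(id   = sort(unique(dat$id)),
                  freq = as.vector(table(dat$id)),
                  var1 = as.vector(tapply(dat$var1, dat$id, same)),
                  var2 = as.vector(tapply(dat$var2, dat$id, same)))
out
```

Because same() works on a whole vector, it does not matter whether an id has two rows or many, or whether the column is numeric or character.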
>
> The way I am thinking about tackling this is to loop through the ID
> variable and compare the values in the various columns of the dataframe.
> The problem I am encountering is that I don't think all.equal or
> identical are the right functions in this case.
>
> So, say I wanted to compare the elements of var1 for id == 1. I
> would have
>
> x <- c(10,10)
>
> Of course, the following works
>
>> all.equal(x[1], x[2])
> [1] TRUE
>
> As would a similar call to identical. However, what if I only have a
> vector of values (or if the column consists of names) that I want to
> assess for equality when I am trying to automate a process over
> thousands of cases? As in the example above, the vector may contain only
> two values or it may contain many more. The number of values in the
> vector differs by id.
>
> Any thoughts?
>
> Harold
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?