jim holtman
jholtman at gmail.com
Fri Feb 27 19:59:21 CET 2009
On simulated data the size of yours, it took under 30 seconds (about 21
in the run below) to determine the number of unique teachers per student.
> x <- cbind(sample(326397, 800967, TRUE), sample(20, 800967, TRUE))
> # split the data so you have the teachers recorded for each student
> system.time(t.s <- split(x[,2], x[,1]))
user system elapsed
0.92 0.01 0.94
> t.s[1:7] # sample data
$`1`
[1] 16
$`2`
[1] 3
$`3`
[1] 1
$`4`
[1] 17
$`6`
[1] 9 9 19
$`7`
[1] 20
$`9`
[1] 3 16 16 10 8 17
> # count number of unique teachers per student
> system.time(t.a <- sapply(t.s, function(x) length(unique(x))))
user system elapsed
20.17 0.10 20.26
> t.a[1:10]
1 2 3 4 6 7 9 10 11 12
1 1 1 1 2 1 5 1 1 1
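Most of the time above goes into calling a function once per student inside sapply(). A possible alternative, offered only as a sketch, is to drop repeated (student, teacher) pairs first and then count rows per student with table(), so no per-student function call is needed. The helper name teacher.summary and the tab separator used to build the pair key are my own choices for illustration (the key assumes the IDs contain no tab characters), and I have not timed this on data of the full size.

```r
# Hypothetical helper (not from the thread): one row per student, with the
# number of records and whether the teacher is the same on all of them.
teacher.summary <- function(student, teacher) {
    # Build one key per (student, teacher) pair; assumes no tabs in the IDs.
    key <- paste(student, teacher, sep = "\t")
    # TRUE for the first occurrence of each distinct pair.
    first <- !duplicated(key)
    # Rows per student in the full data.
    freq <- table(student)
    # Distinct teachers per student: count only first occurrences of pairs.
    n.tch <- table(student[first])
    data.frame(freq = as.vector(freq),
               tch  = as.vector(n.tch) == 1,
               row.names = names(freq))
}

# Harold's small example from the quoted message
qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)),
                 teacher_unique_id = factor(c(10,10,20,20,25)))
results <- teacher.summary(qq$student_unique_id, qq$teacher_unique_id)
print(results)
```

On the small example this gives freq 2 and 3 with tch TRUE and FALSE, the same as the tapply() version in the quoted message. Both table() calls see the same set of students in the same order, which is what lets the two columns be combined directly.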
On Fri, Feb 27, 2009 at 9:46 AM, Doran, Harold <HDoran at air.org> wrote:
> Previously, I posed the question pasted down below to the list and
> received some very helpful responses. While the code suggestions
> provided in response indeed work, they seem to only work with *very*
> small data sets and so I wanted to follow up and see if anyone had ideas
> for better efficiency. I was quite embarrassed by this as our SAS
> programmers cranked out programs that did this in the blink of an eye
> (with a few variables), but R was spinning for days on my Ubuntu machine
> and ultimately I saw a message that R was "killed".
> The data I am working with has 800967 total rows and 31 total columns.
> The ID variable I use as the index variable in tapply() has 326397
> unique cases.
>> length(unique(qq$student_unique_id))
> [1] 326397
> To give a sense of what my data look like and the actual problem,
> consider the following:
> qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)),
> teacher_unique_id = factor(c(10,10,20,20,25)))
> This is a student achievement database where students occupy multiple
> rows in the data and the variable teacher_unique_id denotes the class
> the student was in. What I am doing is looking to see if the teacher is
> the same for each instance of the unique student ID. So, if I implement
> the following:
> same <- function(x) length( unique(x) ) == 1
> results <- data.frame(
> freq = tapply(qq$student_unique_id, qq$student_unique_id,
> length),
> tch = tapply(qq$teacher_unique_id, qq$student_unique_id, same)
> )
> I get the following results. I can see that student 1 appears in the
> data twice and the teacher is always the same. However, student 2
> appears three times and the teacher is not always the same.
>> results
> freq tch
> 1 2 TRUE
> 2 3 FALSE
> Now, applying this same procedure to a large data set with the
> characteristics described above seems to be problematic.
> Does anyone have suggestions on how this could be made more efficient so
> that it can run with large data as I described?
> Harold
>> sessionInfo()
> R version 2.8.1 (2008-12-22)
> x86_64-pc-linux-gnu
> locale:
> ON=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> ##### Original question posted on 1/13/09
> Suppose I have a dataframe as follows:
> dat <- data.frame(id = c(1,1,2,2,2), var1 = c(10,10,20,20,25), var2 =
> c('foo', 'foo', 'foo', 'foobar', 'foo'))
> Now, if I were to subset by id, such as:
>> subset(dat, id==1)
> id var1 var2
> 1 1 10 foo
> 2 1 10 foo
> I can see that the elements in var1 are exactly the same and the
> elements in var2 are exactly the same. However,
>> subset(dat, id==2)
> id var1 var2
> 3 2 20 foo
> 4 2 20 foobar
> 5 2 25 foo
> Shows the elements are not the same for either variable in this
> instance. So, what I am looking to create is a data frame that would be
> like this
> id freq var1 var2
> Where freq is the number of times the ID is repeated in the dataframe. A
> TRUE appears in the cell if all elements in the column are the same for
> the ID and FALSE otherwise. Which values differ does not matter for
> my problem.
> The way I am thinking about tackling this is to loop through the ID
> variable and compare the values in the various columns of the dataframe.
> The problem I am encountering is that I don't think all.equal or
> identical are the right functions in this case.
> So, say I wanted to compare the elements of var1 for id == 1. I
> would have
> x <- c(10,10)
> Of course, the following works
>> all.equal(x[1], x[2])
> [1] TRUE
> As would a similar call to identical. However, what if I only have a
> vector of values (or if the column consists of names) that I want to
> assess for equality when I am trying to automate a process over
> thousands of cases? As in the example above, the vector may contain only
> two values or it may contain many more. The number of values in the
> vector differs by id.
> Any thoughts?
> Harold
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
