[R] Restructure some data

Doran, Harold HDoran at air.org
Fri Feb 26 17:19:19 CET 2010


Thank you both for your replies; both are very useful. The larger issue at hand is that the data will actually be huge, thus the end result will be a very large, sparse data frame.

So, I decided to put all three possible solutions to a timing test and see what they yield. I simulated 15000 students and created an item pool of 300 items that could be selected. I fixed the number of items each student sees at 3, although this will be on the order of 50 in the real-world problem.

So, first the new data for testing all three solutions.

item.pool <- paste("item", 1:300, sep = "")
N <- 15000
set.seed(54321)
dat <- data.frame(id = c(1:N), first.item = sample(item.pool, N, replace=TRUE), 
	second.item = sample(item.pool, N,replace=TRUE), third.item = sample(item.pool, N,replace=TRUE),
	score1 = sample(c(0,1), N,replace=TRUE), score2 = sample(c(0,1), N,replace=TRUE), score3 = sample(c(0,1), N,replace=TRUE))
	
My original loop is now in the function "harold", and I created two new functions, "bill" and "phil", for the other solutions. I modified Bill's code only to reflect my original naming conventions. Timing results for each solution are below.

> system.time(result <- harold(dat))
   user  system elapsed 
1347.85  441.92 1799.75

> system.time(result <- bill(dat))
   user  system elapsed 
   0.04    0.04    0.09

> system.time(result <- phil(dat))
   user  system elapsed 
   4.42    0.00    4.42

The loop timing is laughable, so it is out. Phil clearly wins from the "golf" viewpoint, but Bill's solution is by far the fastest. Phil, it does not matter that the original ordering of the columns is not preserved, since that is easily remedied by reordering the columns afterwards.
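
The reordering mentioned above might look something like the sketch below. The helper name reorder_items and the toy matrix are made up for illustration and are not from the thread; the sketch assumes the wide result is a plain numeric matrix whose column names are item names.

```r
# Hypothetical helper: put the columns of a wide score matrix into the
# order of the item pool, adding all-NA columns for items nobody saw.
reorder_items <- function(wide, pool) {
  out <- matrix(NA_real_, nrow = nrow(wide), ncol = length(pool),
                dimnames = list(rownames(wide), pool))
  seen <- intersect(pool, colnames(wide))  # items present, in pool order
  out[, seen] <- wide[, seen]
  out
}

# Toy input with alphabetically sorted columns, as tapply() would give
pool <- paste("item", 1:10, sep = "")
wide <- matrix(c(NA, 0, 1, NA, 1, NA), nrow = 2,
               dimnames = list(c("1", "2"), c("item1", "item10", "item2")))
res <- reorder_items(wide, pool)
colnames(res)
```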

Again, thank you both.
Harold

harold <- function(dat){
	Nstu <- nrow(dat)
	df <- matrix(NA, ncol = length(item.pool), nrow = Nstu)
	colnames(df) <- item.pool
	for(i in 1:Nstu){
		for(j in 2:4){
			rr <- which(dat[i,j] == colnames(df))
			df[i,rr] <- dat[i, (j+3)]
		}
	}
	df
}
system.time(result <- harold(dat))

bill <- function(dat) {
	L <- length(item.pool)
	items <- as.matrix(dat[, 2:4])
	scores <- as.matrix(dat[, 5:7])
	retval <- matrix(NA_real_, nrow = nrow(dat), ncol = L,
		dimnames = list(character(), item.pool))
	# fill by (row, column) pairs; assumes dat$id runs 1..nrow(dat)
	retval[cbind(dat$id, match(items, item.pool))] <- scores
	retval
}
system.time(result <- bill(dat))
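
Bill's function relies on indexing a matrix with a two-column matrix of (row, column) pairs, so all cells are filled in one vectorized assignment. A minimal sketch on a toy pool of four items follows; the names pool, items, scores, wide, and idx are made up for illustration.

```r
pool <- paste("item", 1:4, sep = "")

# Two students, two items each (matrices are filled column by column):
# student 1 saw item2 then item4, student 2 saw item1 then item3.
items  <- matrix(c("item2", "item1", "item4", "item3"), nrow = 2)
scores <- matrix(c(1, 0, 0, 1), nrow = 2)

wide <- matrix(NA_real_, nrow = 2, ncol = length(pool),
               dimnames = list(NULL, pool))

# One (row, column) pair per administered item
idx <- cbind(as.vector(row(items)), match(items, pool))
wide[idx] <- scores
wide
```

Every cell that is not named in idx keeps its initial NA.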

phil <- function(dat){
	df <- tapply(as.vector(as.matrix(dat[5:7])),
		list(rep(dat$id,3),as.vector(as.matrix(dat[2:4]))),I)
	df
}
system.time(result <- phil(dat))

-----Original Message-----
From: Phil Spector [mailto:spector at stat.berkeley.edu] 
Sent: Thursday, February 25, 2010 5:38 PM
To: Doran, Harold
Cc: r-help at r-project.org
Subject: Re: [R] Restructure some data

Harold -
    Here's what I came up with:

>  tapply(as.vector(as.matrix(dat[5:7])),
+         list(rep(dat$id,3),as.vector(as.matrix(dat[2:4]))),I)
   item1 item10 item2 item3 item4 item5 item7 item9
1    NA     NA     1    NA    NA     1    NA     0
2     0     NA    NA    NA    NA     1     1    NA
3     1     NA     0     1    NA    NA    NA    NA
4    NA     NA    NA     1     0    NA     0    NA
5    NA      1    NA     0     1    NA    NA    NA

I thought there would be a way to use xtabs, but I had
trouble preserving the NAs.

The columns aren't in the right order, and the item6 column is
missing, but it's pretty close.
Thanks for the easily reproducible example, and the interesting
puzzle.
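
One possible way to get the columns in pool order and keep unused items such as item6 is to turn the item names into a factor whose levels are the item pool, since tapply keeps a column for every level. This is only a sketch on made-up toy data (pool, dat, items, scores, and wide are illustrative names), and it assumes no student sees the same item twice.

```r
pool <- paste("item", 1:4, sep = "")
dat <- data.frame(id = 1:2,
                  first.item  = c("item3", "item1"),
                  second.item = c("item1", "item4"),
                  score1 = c(1, 0),
                  score2 = c(0, 1),
                  stringsAsFactors = FALSE)

# Factor levels fix both the column order and the set of columns
items  <- factor(as.vector(as.matrix(dat[2:3])), levels = pool)
scores <- as.vector(as.matrix(dat[4:5]))

wide <- tapply(scores, list(rep(dat$id, 2), items), I)
wide
```

Here item2 is never administered, yet it still appears as an all-NA column between item1 and item3.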

 					- Phil Spector
 					 Statistical Computing Facility
 					 Department of Statistics
 					 UC Berkeley
 					 spector at stat.berkeley.edu


On Thu, 25 Feb 2010, Doran, Harold wrote:

> Suppose I have a data frame like "dat" below. For some context, this is the format that represents students taking a computer adaptive test. first.item is the first item that the student was administered, score1 is the student's response to that item, and so forth.
>
> item.pool <- paste("item", 1:10, sep = "")
> set.seed(54321)
> dat <- data.frame(id = c(1,2,3,4,5), first.item = sample(item.pool, 5, replace=TRUE),
>                second.item = sample(item.pool, 5,replace=TRUE), third.item = sample(item.pool, 5,replace=TRUE),
>                score1 = sample(c(0,1), 5,replace=TRUE), score2 = sample(c(0,1), 5,replace=TRUE), score3 = sample(c(0,1), 5,replace=TRUE))
>
> I need to restructure this into a new format. The new matrix df (after the loop) is exactly what I want in the end. But, I'm annoyed at myself for not thinking of a more efficient way to restructure this without using a loop.
>
> df <- matrix(NA, ncol = length(item.pool), nrow = nrow(dat))
> colnames(df) <- unique(item.pool)
>
> for(i in 1:5){
>                for(j in 2:4){
>                                rr <- which(dat[i,j] == colnames(df))
>                                df[i,rr] <- dat[i, (j+3)]
>                }
> }
>
> Any thoughts?
>
> Harold