[R] Can R replicate this data manipulation in SAS?

Gabor Grothendieck ggrothendieck at gmail.com
Fri Apr 22 17:27:11 CEST 2011


Since this involves time series within each id group I thought it
would be interesting to see if this could be formulated using
zoo series.   The approach is to read it in, and convert it to a long
form by just stacking the start and stop times in a data frame
and converting that to zoo using the concatenation of date and id as
the time index.  It makes use of the fact that
- zoo can use non-standard "times" (here the concatenation of date and id)
- read.zoo can split and aggregate and
- the resulting object can be directly cumsum'd using cumsum.zoo .

Finally we convert it back to the required form.

This was mostly done by looking at the output rather than trying to
follow the SAS code so some differences are possible. It does
seem to give the same output in this case.

Yet another approach might be to use model.matrix to create the
NRTI/NNRTI/PI columns:
	model.matrix(~ drug - 1, long2) * long2$change
but below we stick with zoo.

Some of the code here could be transplanted into Ista's solution
(which uses the reshape and plyr packages) to get a combination
approach.

Lines <- "1004 NRTI  07/24/95 01/05/99
1004 NRTI  11/20/95 12/10/95
1004 NRTI  01/10/96 01/05/99
1004 PI    05/09/96 11/16/97
1004 NRTI  06/01/96 02/01/97
1004 NRTI  07/01/96 03/01/97
9999 PI    01/02/03 .
9999 NNRTI 04/05/06 07/08/09"

library(zoo)

# need na.fill from development version of zoo
source("http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/zoo/R/na.fill.R?revision=831&root=zoo")

# read in data and reshape to long form

DF <- read.table(textConnection(Lines), as.is = TRUE, na.strings = ".",
	col.names = c("id", "drug", "date", "date"), check.names = FALSE)
long <- rbind(cbind(DF[-4], change = 1), cbind(DF[-3], change = -1))

# convert to zoo.   date.id, the concatenation of date & id, becomes index.
# Then take cumulative sums of columns and append regimen.

long2 <- with(long,
	data.frame(date.id = paste(as.Date(date, "%m/%d/%y"), id), drug, change))
z <- read.zoo(long2, index = 1, split = "drug", FUN = identity, aggregate = sum)

z <- cumsum(na.fill(z, fill = 0))
z$regimen <- ave(1:nrow(z), sub(".* ", "", time(z)), FUN = seq_along)

# convert z to data frame for final processing

id <- sub(".* ", "", time(z))
start_date = as.Date(sub(" .*", "", time(z)))
d2 <- data.frame(id, start_date, coredata(z))

# from each group of rows in same id form the desired columns.
# f processes one such group.

f <- function(x) with(x, data.frame(
	id = head(id, -1),
	start_date = head(start_date, -1),
	stop_date = start_date[-1],
	head(cbind(regimen, NRTI, NNRTI, PI, all = NRTI + NNRTI + PI,
		  HAART = as.numeric((NRTI >= 3 & NNRTI==0 & PI==0) |
			(NRTI >= 2 & (NNRTI >= 1 | PI >= 1)) |
            (NRTI == 1 & NNRTI >= 1 & PI >= 1))), -1)))
result <- do.call(rbind, by(d2, id, f))
rownames(result) <- NULL

result

The end result is this:

> result
     id start_date  stop_date regimen NRTI NNRTI PI all HAART
1  1004 1995-07-24 1995-11-20       1    1     0  0   1     0
2  1004 1995-11-20 1995-12-10       2    2     0  0   2     0
3  1004 1995-12-10 1996-01-10       3    1     0  0   1     0
4  1004 1996-01-10 1996-05-09       4    2     0  0   2     0
5  1004 1996-05-09 1996-06-01       5    2     0  1   3     1
6  1004 1996-06-01 1996-07-01       6    3     0  1   4     1
7  1004 1996-07-01 1997-02-01       7    4     0  1   5     1
8  1004 1997-02-01 1997-03-01       8    3     0  1   4     1
9  1004 1997-03-01 1997-11-16       9    2     0  1   3     1
10 1004 1997-11-16 1999-01-05      10    2     0  0   2     0
11 9999 2003-01-02 2006-04-05       1    0     0  1   1     0
12 9999 2006-04-05 2009-07-08       2    0     1  1   2     0
13 9999 2009-07-08       <NA>       3    0     0  1   1     0




-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



More information about the R-help mailing list