[R] Can R replicate this data manipulation in SAS?
Gabor Grothendieck
ggrothendieck at gmail.com
Fri Apr 22 17:27:11 CEST 2011
Since this involves time series within each id group I thought it
would be interesting to see if this could be formulated using
zoo series. The approach is to read it in, and convert it to a long
form by just stacking the start and stop times in a data frame
and converting that to zoo using the concatenation of date and id as
the time index. It makes use of the fact that
- zoo can use non-standard "times" (here the concatenation of date and id)
- read.zoo can split and aggregate and
- the resulting object can be directly cumsum'd using cumsum.zoo .
Finally we convert it back to the required form.
This was mostly done by looking at the output rather than trying to
follow the SAS code so some differences are possible. It does
seem to give the same output in this case.
Yet another approach might be to use model.matrix to create the
NRTI/NNRTI/PI columns:
model.matrix(~ drug - 1, long2) * long2$change
but below we stick with zoo.
Some of the code here could be transplanted into Ista's solution
(which uses the reshape and plyr packages) to get a combination
approach.
Lines <- "1004 NRTI 07/24/95 01/05/99
1004 NRTI 11/20/95 12/10/95
1004 NRTI 01/10/96 01/05/99
1004 PI 05/09/96 11/16/97
1004 NRTI 06/01/96 02/01/97
1004 NRTI 07/01/96 03/01/97
9999 PI 01/02/03 .
9999 NNRTI 04/05/06 07/08/09"
library(zoo)
# need na.fill from development version of zoo
source("http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/zoo/R/na.fill.R?revision=831&root=zoo")
# read in data and reshape to long form
DF <- read.table(textConnection(Lines), as.is = TRUE, na.strings = ".",
col.names = c("id", "drug", "date", "date"), check.names = FALSE)
long <- rbind(cbind(DF[-4], change = 1), cbind(DF[-3], change = -1))
# convert to zoo. date.id, the concatenation of date & id, becomes index.
# Then take cumulative sums of columns and append regimen.
long2 <- with(long,
data.frame(date.id = paste(as.Date(date, "%m/%d/%y"), id), drug, change))
z <- read.zoo(long2, index = 1, split = "drug", FUN = identity, aggregate = sum)
z <- cumsum(na.fill(z, fill = 0))
z$regimen <- ave(1:nrow(z), sub(".* ", "", time(z)), FUN = seq_along)
# convert z to data frame for final processing
id <- sub(".* ", "", time(z))
start_date = as.Date(sub(" .*", "", time(z)))
d2 <- data.frame(id, start_date, coredata(z))
# from each group of rows in same id form the desired columns.
# f processes one such group.
f <- function(x) with(x, data.frame(
id = head(id, -1),
start_date = head(start_date, -1),
stop_date = start_date[-1],
head(cbind(regimen, NRTI, NNRTI, PI, all = NRTI + NNRTI + PI,
HAART = as.numeric((NRTI >= 3 & NNRTI==0 & PI==0) |
(NRTI >= 2 & (NNRTI >= 1 | PI >= 1)) |
(NRTI == 1 & NNRTI >= 1 & PI >= 1))), -1)))
result <- do.call(rbind, by(d2, id, f))
rownames(result) <- NULL
result
The end result is this:
> result
id start_date stop_date regimen NRTI NNRTI PI all HAART
1 1004 1995-07-24 1995-11-20 1 1 0 0 1 0
2 1004 1995-11-20 1995-12-10 2 2 0 0 2 0
3 1004 1995-12-10 1996-01-10 3 1 0 0 1 0
4 1004 1996-01-10 1996-05-09 4 2 0 0 2 0
5 1004 1996-05-09 1996-06-01 5 2 0 1 3 1
6 1004 1996-06-01 1996-07-01 6 3 0 1 4 1
7 1004 1996-07-01 1997-02-01 7 4 0 1 5 1
8 1004 1997-02-01 1997-03-01 8 3 0 1 4 1
9 1004 1997-03-01 1997-11-16 9 2 0 1 3 1
10 1004 1997-11-16 1999-01-05 10 2 0 0 2 0
11 9999 2003-01-02 2006-04-05 1 0 0 1 1 0
12 9999 2006-04-05 2009-07-08 2 0 1 1 2 0
13 9999 2009-07-08 <NA> 3 0 0 1 1 0
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
More information about the R-help
mailing list