[R] Can R replicate this data manipulation in SAS?

Gabor Grothendieck ggrothendieck at gmail.com
Fri Apr 22 18:51:32 CEST 2011


On Fri, Apr 22, 2011 at 11:27 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> Since this involves time series within each id group I thought it
> would be interesting to see if this could be formulated using
> zoo series.   The approach is to read it in, and convert it to a long
> form by just stacking the start and stop times in a data frame
> and converting that to zoo using the concatenation of date and id as
> the time index.  It makes use of the fact that
> - zoo can use non-standard "times" (here the concatenation of date and id)
> - read.zoo can split and aggregate and
> - the resulting object can be directly cumsum'd using cumsum.zoo .
>
> Finally we convert it back to the required form.
>
> This was mostly done by looking at the output rather than trying to
> follow the SAS code so some differences are possible. It does
> seem to give the same output in this case.
>
> Yet another approach might be to use model.matrix to create the
> NRTI/NNRTI/PI columns:
>        model.matrix(~ drug - 1, long2) * long2$change
> but below we stick with zoo.
>
> Some of the code here could be transplanted into Ista's solution
> (which uses the reshape and plyr packages) to get a combination
> approach.
>
> Lines <- "1004 NRTI  07/24/95 01/05/99
> 1004 NRTI  11/20/95 12/10/95
> 1004 NRTI  01/10/96 01/05/99
> 1004 PI    05/09/96 11/16/97
> 1004 NRTI  06/01/96 02/01/97
> 1004 NRTI  07/01/96 03/01/97
> 9999 PI    01/02/03 .
> 9999 NNRTI 04/05/06 07/08/09"
>
> library(zoo)
>
> # need na.fill from development version of zoo
> source("http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/zoo/R/na.fill.R?revision=831&root=zoo")
>
> # read in data and reshape to long form
>
> DF <- read.table(textConnection(Lines), as.is = TRUE, na.strings = ".",
>        col.names = c("id", "drug", "date", "date"), check.names = FALSE)
> long <- rbind(cbind(DF[-4], change = 1), cbind(DF[-3], change = -1))
>
> # convert to zoo.   date.id, the concatenation of date & id, becomes index.
> # Then take cumulative sums of columns and append regimen.
>
> long2 <- with(long,
>        data.frame(date.id = paste(as.Date(date, "%m/%d/%y"), id), drug, change))
> z <- read.zoo(long2, index = 1, split = "drug", FUN = identity, aggregate = sum)
>
> z <- cumsum(na.fill(z, fill = 0))
> z$regimen <- ave(1:nrow(z), sub(".* ", "", time(z)), FUN = seq_along)
>
> # convert z to data frame for final processing
>
> id <- sub(".* ", "", time(z))
> start_date = as.Date(sub(" .*", "", time(z)))
> d2 <- data.frame(id, start_date, coredata(z))
>
> # from each group of rows in same id form the desired columns.
> # f processes one such group.
>

There was a bug in f.  The last part above should have been:

f <- function(x) with(x, data.frame(
	id = head(id, -1),
	start_date = head(start_date, -1),
	stop_date = start_date[-1]-1,
	head(cbind(regimen, NRTI, NNRTI, PI, all = NRTI + NNRTI + PI,
		  HAART = as.numeric((NRTI >= 3 & NNRTI==0 & PI==0) |
			(NRTI >= 2 & (NNRTI >= 1 | PI >= 1)) |
            (NRTI == 1 & NNRTI >= 1 & PI >= 1))), -1)))
result <- do.call(rbind, by(d2, id, f))
rownames(result) <- NULL

result

> result
     id start_date  stop_date regimen NRTI NNRTI PI all HAART
1  1004 1995-07-24 1995-11-19       1    1     0  0   1     0
2  1004 1995-11-20 1995-12-09       2    2     0  0   2     0
3  1004 1995-12-10 1996-01-09       3    1     0  0   1     0
4  1004 1996-01-10 1996-05-08       4    2     0  0   2     0
5  1004 1996-05-09 1996-05-31       5    2     0  1   3     1
6  1004 1996-06-01 1996-06-30       6    3     0  1   4     1
7  1004 1996-07-01 1997-01-31       7    4     0  1   5     1
8  1004 1997-02-01 1997-02-28       8    3     0  1   4     1
9  1004 1997-03-01 1997-11-15       9    2     0  1   3     1
10 1004 1997-11-16 1999-01-04      10    2     0  0   2     0
11 9999 2003-01-02 2006-04-04       1    0     0  1   1     0
12 9999 2006-04-05 2009-07-07       2    0     1  1   2     0
13 9999 2009-07-08       <NA>       3    0     0  1   1     0


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



More information about the R-help mailing list