[R] sas to r

Sat Jul 17 14:36:20 CEST 2004

Greg Adkison wrote:
> I would be incredibly grateful to anyone who'll help me translate some 
> SAS code into R code.
> 
> Say for example that I have a dataset named "dat1" that includes five 
> variables:  wshed, site, species, bda, and sla.  I can calculate with the 
> following SAS code the mean, CV, se, and number of observations of 
> "bda" and "sla" for each combination of "wshed," "species," and "site," 
> restricting the species considered to only three of several species in 
> dat1 (b, c, and p).  Moreover, I can output these calculations and 
> grouping variables to a dataset named "dat2" that will reside in RAM 
> and include the variables  wshed, site, species, mBdA, msla, cBda, 
> sBdA, ssla, nBda, and nsla.
> 
> proc sort data=dat1;
>   by wshed site species;
> proc means data=dat1 noprint mean cv stderr n;
>   by wshed site species;
>   where species in ('b', 'c', 'p');
>   var BdA sla;
>   output out=dat2
>     mean=mBdA msla
>     cv=cBdA csla
>     stderr=sBdA ssla
>     n=nBdA nsla;
> 
> Thanks,
> Greg

The following handles any number of analysis variables, with automatic
naming of all statistics computed from them.  It requires the Hmisc package.

# Generate some data.  Put one NA in sla.
set.seed(1)
dat1 <- expand.grid(wshed=1:2, site=c('A','B'),
                     species=c('a','b','c','p'),
                     reps=1:10)
n <- nrow(dat1)
dat1 <- transform(dat1,
                   BdA = rnorm(n, 100, 20),
                   sla = c(rnorm(n-1, 200, 30), NA))
# Can use upData function in Hmisc in place of transform

# Summarization function, per stratum, for a matrix of analysis
# variables
g <- function(y) {
   n <- apply(y, 2, function(z) sum(!is.na(z)))
   m <- apply(y, 2, mean, na.rm=TRUE)
   s <- apply(y, 2, sd,   na.rm=TRUE)
   cv <- s/m
   se <- s/sqrt(n)
   w <- c(m, cv, se, n)
   names(w) <- t(outer(c('m','c','s','n'), colnames(y), paste, sep=''))
   w
}
library(Hmisc)
dat2 <-  with(dat1,
               summarize(cbind(BdA, sla),
                         llist(wshed, site, species),
                         g,
                         subset=species %in% c('b','c','p'),
                         stat.name='mBdA')
               )

options(digits=3)
dat2  # is a data frame

    wshed site species  mBdA msla  cBdA   csla sBdA  ssla nBdA nsla
1      1    A       b 100.5  195 0.133 0.1813 4.23 11.20   10   10
2      1    A       c  99.7  206 0.101 0.1024 3.17  6.68   10   10
3      1    A       p 101.4  188 0.239 0.1580 7.65  9.39   10   10
4      1    B       b 109.9  203 0.118 0.1433 4.09  9.21   10   10
5      1    B       c  98.4  221 0.193 0.1250 6.01  8.72   10   10
6      1    B       p 102.9  203 0.216 0.1446 7.03  9.29   10   10
7      2    A       b  95.8  195 0.241 0.2011 7.31 12.40   10   10
8      2    A       c  98.7  207 0.194 0.1274 6.04  8.33   10   10
9      2    A       p 102.2  191 0.217 0.1709 7.01 10.31   10   10
10     2    B       b  97.8  191 0.235 0.2079 7.27 12.58   10   10
11     2    B       c 100.9  194 0.164 0.0987 5.24  6.07   10   10
12     2    B       p 103.0  209 0.144 0.0769 4.69  5.35   10    9

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University