[R] combining same-day lab measurements with 'apply'

dylan boyd dylan.boyd at gmail.com
Wed Oct 15 22:45:10 CEST 2008


Another request for help implementing the 'apply' functions to avoid a
loop structure...

I am working with a data set that includes lab measurements taken at
different dates for the subjects, with some subjects having more
results than others.  I would like to average lab results for each
subject that were taken on the same day.  I can do this using a for
loop, but would like to know how to efficiently accomplish the same
thing without looping as I will likely have to do the same with a much
larger data set.

At the end of this post are examples of what I'm starting with and
what I want the result to look like.

I tried another suggestion I saw on this list using a list object for
the index of a call to 'tapply' as in:

> new.x <- tapply(x, list(id, date), mean)

but this produced a matrix with one row for every subject id and one
column for every date in the data set, with NA for each id/date
combination that never occurs.  That is too large for the full data
set and would also require serious re-working (at least with the
tools I know) to return to the original data frame structure.
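
For reference, a minimal sketch of what that two-factor 'tapply' call
returns and one way to flatten it back to long form.  The data frame
below rebuilds the sample data from this post with fixed values; the
names 'wide' and 'long' are just illustrative, and the flattening step
via as.table is only one option, not necessarily the best one for a
large data set.

```r
# Sample data with the same structure as in this post (fixed values)
dum <- data.frame(
  id   = c(2, 2, 2, 3, 4, 4, 4, 4, 5, 5),
  x    = c(21, 22, 23, 17, 18, 27, 25, 24, 26, 20),
  date = as.Date(c("2005-12-15", "2006-01-13", "2006-01-13", "2000-04-05",
                   "2003-05-23", "2003-07-08", "2003-11-30", "2003-11-30",
                   "2001-04-19", "2001-04-19"))
)

# One row per id, one column per distinct date; unobserved pairs are NA
wide <- tapply(dum$x, list(id = dum$id, date = dum$date), mean)
dim(wide)

# Flatten to long form and keep only the id/date pairs that were observed
long <- as.data.frame(as.table(wide), responseName = "x",
                      stringsAsFactors = FALSE)
long <- long[!is.na(long$x), ]

# The dimnames come back as character, so restore the original classes
long$id   <- as.numeric(long$id)
long$date <- as.Date(long$date)
long <- long[order(long$id, long$date), ]
rownames(long) <- NULL
long
```

The intermediate matrix still has (number of ids) x (number of dates)
cells, so this does not remove the size problem described above.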

Another attempt was pasting the id and date together to create a
single indexing vector.  I could get this to work, but it seems clumsy
to be substring'ing the names attribute of the 'tapply' result, and
id's that range from 1 to 3 digits in the full data set complicate
things further:

> new.x <- tapply(x, paste(id, date),mean)
> data.frame(
+   id  = substr(names(new.x),start=1,stop=1),
+   x   = new.x,
+   date  = as.Date(substr(names(new.x),start=3,stop=100)))
             id    x       date
2 2005-12-15  2 21.0 2005-12-15
2 2006-01-13  2 22.5 2006-01-13
3 2000-04-05  3 17.0 2000-04-05
4 2003-05-23  4 18.0 2003-05-23
4 2003-07-08  4 27.0 2003-07-08
4 2003-11-30  4 24.5 2003-11-30
5 2001-04-19  5 23.0 2001-04-19

With subject id's of 1 to 3 digits, the fixed start/stop positions in
the 'substr' calls above would no longer line up for every row
(although it just occurred to me that I could use strsplit on the
names instead..).
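
A rough sketch of the strsplit idea, assuming an explicit separator
("_") is used in the paste call so the key can be split regardless of
how many digits the id has.  The data frame 'dum2', the separator and
the names 'parts' and 'out' are made up for illustration.

```r
# Hypothetical data with ids of 1, 2 and 3 digits
dum2 <- data.frame(
  id   = c(2, 2, 45, 123, 123),
  x    = c(21, 23, 17, 25, 24),
  date = as.Date(c("2006-01-13", "2006-01-13", "2000-04-05",
                   "2003-11-30", "2003-11-30"))
)

# Average within each id/date key; "_" cannot occur in an id or a date
new.x <- tapply(dum2$x, paste(dum2$id, dum2$date, sep = "_"), mean)

# Split each key back into its id part and its date part
parts <- strsplit(names(new.x), "_", fixed = TRUE)
out <- data.frame(
  id   = as.numeric(sapply(parts, `[`, 1)),
  x    = as.vector(new.x),
  date = as.Date(sapply(parts, `[`, 2))
)

# The keys sort as character strings, so reorder numerically by id
out <- out[order(out$id, out$date), ]
rownames(out) <- NULL
out
```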

It may be irrelevant, but the 'date' variable is a Date class object.
I've tried first converting this to a character object but didn't get
anywhere.  Further, I'll use the dates later with difftime to figure
the subjects' age at the onset of their condition, so I'd like to
avoid converting between classes too much.
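
To illustrate the later step, a small sketch of an age calculation
with difftime on Date objects.  The birth date is invented, and
dividing by 365.25 is only an approximation of age in years.

```r
onset <- as.Date("2005-12-15")   # a measurement date from the sample data
birth <- as.Date("1970-06-01")   # hypothetical birth date

# difftime accepts Date objects directly; ask for the result in days
age.days  <- difftime(onset, birth, units = "days")

# Approximate age in years (365.25 days per year is an assumption)
age.years <- as.numeric(age.days) / 365.25
age.years
```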

Any advice would be greatly appreciated.  Here is the code to build
the sample data and the working for loop as well:

> dum <- data.frame(
+   id  = c(2,2,2,3,4,4,4,4,5,5),
+   x   = sample(15:30, 10),  # random draw, one value per row
+   date  = as.Date(c("12/15/2005","1/13/2006","1/13/2006","4/5/2000","5/23/2003",
+     "7/8/2003","11/30/2003","11/30/2003","4/19/2001","4/19/2001"),format="%m/%d/%Y")
+   )
> id.list <- unique(dum$id)
> dum
   id  x       date
1   2 21 2005-12-15
2   2 22 2006-01-13
3   2 23 2006-01-13
4   3 17 2000-04-05
5   4 18 2003-05-23
6   4 27 2003-07-08
7   4 25 2003-11-30
8   4 24 2003-11-30
9   5 26 2001-04-19
10  5 20 2001-04-19
>


> output <- NULL
> for (i in seq(along=id.list)) {
+   sel <- dum$id==id.list[i]
+   x.averaged  <- tapply(dum$x[sel], dum$date[sel], mean, na.rm=TRUE)
+   dat  <-  data.frame(id.list[i], x.averaged, names(x.averaged))
+   output  <- rbind(output, dat)
+ }
> names(output) <- names(dum)
> rownames(output)  <- NULL
> output
  id    x       date
1  2 21.0 2005-12-15
2  2 22.5 2006-01-13
3  3 17.0 2000-04-05
4  4 18.0 2003-05-23
5  4 27.0 2003-07-08
6  4 24.5 2003-11-30
7  5 23.0 2001-04-19
>
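
One loop-free possibility, offered as a sketch rather than a definitive
answer, is the formula interface of 'aggregate', which groups on id and
date together and only returns the combinations that actually occur.
The sample data is rebuilt here with fixed values so the block runs on
its own; 'avg' is an illustrative name.  In current versions of R the
date column should come back as a Date object.

```r
# Sample data with the same structure as in this post (fixed values)
dum <- data.frame(
  id   = c(2, 2, 2, 3, 4, 4, 4, 4, 5, 5),
  x    = c(21, 22, 23, 17, 18, 27, 25, 24, 26, 20),
  date = as.Date(c("12/15/2005", "1/13/2006", "1/13/2006", "4/5/2000",
                   "5/23/2003", "7/8/2003", "11/30/2003", "11/30/2003",
                   "4/19/2001", "4/19/2001"), format = "%m/%d/%Y")
)

# Mean of x within each observed id/date combination
avg <- aggregate(x ~ id + date, data = dum, FUN = mean)

# Restore the original column order and sort by subject, then date
avg <- avg[order(avg$id, avg$date), c("id", "x", "date")]
rownames(avg) <- NULL
avg
```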


