[R] rollmean help (or similar function)

Gabor Grothendieck ggrothendieck at gmail.com
Sat Aug 21 09:01:22 CEST 2010


On Fri, Aug 20, 2010 at 3:47 PM, Hall, Ken (CDC/OSELS/NCPHI)
<kha6 at cdc.gov> wrote:
> I am working on a simple pilot project comparing the capability of SQL,
> SAS and R to perform a rolling mean per the following instructions. I
> have completed the SQL and SAS analysis, so now it's R's turn.
>
> Calculate mean values of x (x=count) for each date in the dataset where
> mean = the average count of days [t-9] through day [t-3] for each
> date/illness combination.
>
> Dataset aggpilot
>
>     date           illness x
> 1    2006/01/01    DERM 319
> 2    2006/01/02    DERM 388
> 3    2006/01/03    DERM 336
> 4    2006/01/04    DERM 255
> 5    2006/01/05    DERM 177
> 6    2006/01/06    DERM 377
> 7    2006/01/07    DERM 113
> 8    2006/01/08    DERM 253
> 9    2006/01/09    DERM 316
> 10   2006/01/10    DERM 187
> 11   2006/01/11    DERM 292
> 12   2006/01/12    DERM 275
> 13   2006/01/13    DERM 355
> ... 3102 total observations
>
> Result of command: str(aggpilot)
>
> 'data.frame':   3102 obs. of  3 variables:
>  $ date   : Factor w/ 1551 levels "2006/01/01","2006/01/02",..: 1 2 3 4
> 5 6 7 8 9 10 ...
>  $ illness: Factor w/ 2 levels "DERM","FEVER": 1 1 1 1 1 1 1 1 1 1 ...
>  $ x      : int  319 388 336 255 177 377 113 253 316 187 ...
>
> The results should look something like this:
>
> DATE    ILLNESS x       MEAN                    NOTES [ROLLING MEAN DATE
> RANGE]
> 1/1/2006        DERM    319
> 1/2/2006        DERM    388
> 1/3/2006        DERM    336
> 1/4/2006        DERM    255     319                     [1/1/2006 -
> 1/1/2006]
> 1/5/2006        DERM    177     353.5                   [1/1/2006 -
> 1/2/2006]
> 1/6/2006        DERM    377     347.6666667             [1/1/2006 -
> 1/3/2006]
> 1/7/2006        DERM    113     324.5                   [1/1/2006 -
> 1/4/2006]
> 1/8/2006        DERM    253     295                     [1/1/2006 -
> 1/5/2006]
> 1/9/2006        DERM    316     308.6666667             [1/1/2006 -
> 1/6/2006]
> 1/10/2006       DERM    187     280.7142857             [1/1/2006 -
> 1/7/2006]
> 1/11/2006       DERM    292     271.2857143             [1/2/2006 -
> 1/8/2006]
> 1/12/2006       DERM    275     261                     [1/3/2006 -
> 1/9/2006]
> 1/13/2006       DERM    355     239.7142857             [1/4/2006 -
> 1/10/2006]
>
>



Lines   <- "date           illness x
   2006/01/01    DERM 319
   2006/01/02    DERM 388
   2006/01/03    DERM 336
   2006/01/04    DERM 255
   2006/01/05    DERM 177
   2006/01/06    DERM 377
   2006/01/07    DERM 113
   2006/01/08    DERM 253
   2006/01/09    DERM 316
   2006/01/10    DERM 187
   2006/01/11    DERM 292
   2006/01/12    DERM 275
   2006/01/13    DERM 355
   2006/01/01    FEVER 3190
   2006/01/02    FEVER 3880
   2006/01/03    FEVER 3360
   2006/01/04    FEVER 2550
   2006/01/05    FEVER 1770
   2006/01/06    FEVER 3770
   2006/01/07    FEVER 1130
   2006/01/08    FEVER 2530
   2006/01/09    FEVER 3160
   2006/01/10    FEVER 1870
   2006/01/11    FEVER 2920
   2006/01/12    FEVER 2750
   2006/01/13    FEVER 3550"

##################################################
# 1. Produces same result from long form data
#    Only uses core R functionality (no packages).
#
#    For each row form ix, the set of row indices
#    over which to average.  Then form the columns.
##################################################

# read in data
DF <- read.table(textConnection(Lines), header = TRUE)
DF$date <- as.Date(DF$date)

# calculate one row of the output
one_row <- function(i, X) with(X, {
	ix <- seq(i-9, i-3)
	ix <- ix[ix > 0]
	ix <- ix[X$illness[ix] == X$illness[i]]
	date <- format(date, "%m/%d/%Y")
	tt <- date[ix]
	NOTES <- if (length(ix) > 0) paste(head(tt, 1), "-", tail(tt, 1)) else ""
	data.frame(DATE = date[i], ILLNESS = illness[i], x = x[i],
		MEAN = mean(x[ix]), NOTES)
})

do.call(rbind, lapply(1:NROW(DF), one_row, DF))

##################################################
# 2. In common situations the partial means at ends
#    would not be calculated since they are not
#    comparable to the other values -- they have
#    higher variance.  Also this data would typically
#    be represented as a multivariate time series,
#    not as long form data.
#    With these changes and making use of the
#    rollmean in the zoo package it simplifies.
#
#    read.zoo with split= reads it into a multivariate
#    time series with one column per illness.
#    We convert dates to chron class since it uses
#    the desired output date format by default.
#    It is converted to zooreg as its a weakly
#    regular series (actually its completely regular).
#    Then we merge it with its rolling mean and
#    form the desired columns.
##################################################

library(zoo)
library(chron)

x <- read.zoo(textConnection(Lines), header = TRUE, split = "illness")
time(x) <- as.chron(time(x))
x <- as.zooreg(x)
zm <- merge(x, mean = lag(rollmean(x, 7, align = "right"), -3), all = FALSE)
fmt <- function(x) format(x, "%m/%d/%Y")
data.frame(DATE = time(zm), coredata(zm),
	notes = paste(time(zm) - 9, "-", time(zm) - 3))

##################################################
# 3. An alternative approach is to use the sqldf R package
#     See http://sqldf.googlecode.com
##################################################

library(sqldf)

sqldf('select
	strftime("%m/%d/%Y", s1.date*24*60*60, \'unixepoch\') "DATE",
	s1.illness "ILLNESS",
	x,
	mean "MEAN",
	coalesce(strftime("%m/%d/%Y", fromdate*24*60*60, \'unixepoch\') ||
		\' - \' ||
	strftime("%m/%d/%Y", fromdate*24*60*60, \'unixepoch\'), \'\') "NOTES"
from DF s1
left join
	(select
		t1.date,
		avg(t2.x) mean,
		min(t2.date) fromdate,
		max(t2.date) todate,
		max(t2.illness) illness
	from  DF t1, DF t2
	where julianday(t1.date) between julianday(t2.date) + 3 AND
julianday(t2.date) + 9
		and t1.illness = t2.illness
	group by t1.illness, t1.date
	order by t1.illness, t1.date) s2
on
	s1.date = s2.date
	and s1.illness = s2.illness')

# You can also explore doing part of it with sqldf and other parts in R.



More information about the R-help mailing list