[R] Dropping "trailing zeroes" in longitudinal data
William Dunlap
wdunlap at tibco.com
Mon Apr 26 22:17:09 CEST 2010
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of David Atkins
> Sent: Monday, April 26, 2010 12:23 PM
> To: r-help at r-project.org
> Subject: [R] Dropping "trailing zeroes" in longitudinal data
>
>
> Background: Our research group collected data from students via the web
> about their drinking habits (alcohol) over the last 90 days. As you
> might guess, some students seem to have lost interest and completed some
> information but not all. Unfortunately, the survey was programmed to
> "pre-populate" the fields with zeroes (to make it easier for students to
> complete).
>
> Obviously, when we see a stretch of zeroes, we've no idea whether this
> is "true" data or not, but we'd like to at least do some sensitivity
> analyses by dropping "trailing zeroes" (i.e., when there are non-zero
> responses for some duration of the data that then "flat line" into all
> zeroes to the end of the time period).
>
> I've included a toy dataset below.
>
> Basically, we have the data in the "long" format, and what I'd like to
> do is subset the data.frame by deleting rows that occur at the end of a
> person's data that are all zeroes. In a nutshell, select rows from a
> person that are continuously zero, up to first non-zero, starting at the
> end of their data (which, below, would be time = 10).
>
> With the toy data, this would be the last 6 rows of ids #10 and #8 (for
> example). I can begin to think about how I might do this via
> grep/regexp but am a bit stumped about how to translate that to this
> type of data.
>
> Any thoughts appreciated.
>
> cheers, Dave
>
> ### toy dataset
> set.seed(123)
> toy.df <- data.frame(id = factor(rep(1:10, each=10)),
>                      time = rep(1:10, 10),
>                      dv = rnbinom(100, mu = 0.5, size = 100))
> toy.df
>
> library(lattice)
>
> xyplot(dv ~ time | id, data = toy.df, type = c("g","l"))

Try using rle (run length encoding) along with either ave()
or lapply(). E.g., define the function
    isInTrailingRunOfZeroes <- function (x, group, minRunLength = 1) {
        # ave() applies FUN to x within each group and returns the
        # results in the original row order
        as.logical(ave(x, group, FUN = function(x) {
            r <- rle(x)
            n <- length(r$values)   # number of runs; the last is the trailing one
            if (n == 0) {
                logical(0)
            } else if (r$values[n] == 0 && r$lengths[n] >= minRunLength) {
                # FALSE for everything before the trailing run, TRUE within it
                rep(c(FALSE, TRUE), c(sum(r$lengths[-n]), r$lengths[n]))
            } else {
                rep(FALSE, sum(r$lengths))
            }
        }))
    }
and use it to drop the trailing runs of 0's with
    xyplot(data=toy.df[!isInTrailingRunOfZeroes(toy.df$dv, toy.df$id),],
           dv~time|id, type=c("g","l"))
or replace them with NA's with
    toy.df.copy <- toy.df
    toy.df.copy[isInTrailingRunOfZeroes(toy.df.copy$dv,
                                        toy.df.copy$id), "dv"] <- NA
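
For anyone who has not used rle() before, here is a small sketch of what it
returns and why the last run is the one that matters. The vector x is made up
for illustration and is not taken from the toy data.

```r
# Hand-made example vector (not from the toy data): ends in three zeroes.
x <- c(0, 2, 0, 0, 1, 0, 0, 0)

r <- rle(x)
r$values   # 0 2 0 1 0  (the value of each run)
r$lengths  # 1 1 2 1 3  (how long each run is)

# The trailing run is the last element of the encoding.
n <- length(r$values)
nTrailing <- if (r$values[n] == 0) r$lengths[n] else 0L

# Keep everything before the trailing run of zeroes.
head(x, length(x) - nTrailing)  # 0 2 0 0 1
```

Note that zeroes in the middle of the series (the run of length 2) are left
alone; only the final run is considered.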
The last argument, minRunLength, lets you say that you only want
to consider the data spurious if the trailing run contains at
least that many zeroes (the default is 1).
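
As a rough sketch of the lapply() route mentioned at the top, and of what
minRunLength does, something like the following should work. The helper name
trailingZeroRun, the wrapper flag() and the data frame d are made up for this
example, and the code assumes there are no NAs in the data.

```r
# Hypothetical helper (name invented for this sketch): TRUE for elements in a
# trailing run of zeroes of length >= minRunLength, FALSE otherwise.
# Assumes x contains no NAs.
trailingZeroRun <- function(x, minRunLength = 1) {
  r <- rle(x)
  n <- length(r$values)
  hit <- n > 0 && r$values[n] == 0 && r$lengths[n] >= minRunLength
  k <- if (hit) r$lengths[n] else 0L
  rep(c(FALSE, TRUE), c(length(x) - k, k))
}

# Made-up data: id "a" ends in two zeroes, id "b" ends in four.
d <- data.frame(id = rep(c("a", "b"), each = 6),
                dv = c(1, 0, 3, 2, 0, 0,
                       2, 1, 0, 0, 0, 0))

# split() / lapply() / unsplit() used in place of ave()
flag <- function(minRunLength) {
  unsplit(lapply(split(d$dv, d$id), trailingZeroRun,
                 minRunLength = minRunLength), d$id)
}

sum(flag(1))  # 6: both trailing runs are flagged
sum(flag(3))  # 4: only the run of four zeroes in "b" is flagged
```

Rows could then be dropped with d[!flag(3), ], in the same way as with the
ave() version.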
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
>
> --
> Dave Atkins, PhD
> Research Associate Professor
> Department of Psychiatry and Behavioral Science
> University of Washington
> datkins at u.washington.edu
>
> Center for the Study of Health and Risk Behaviors (CSHRB)
> 1100 NE 45th Street, Suite 300
> Seattle, WA 98105
> 206-616-3879
> http://depts.washington.edu/cshrb/
> (Mon-Wed)
>
> Center for Healthcare Improvement, for Addictions, Mental Illness,
> Medically Vulnerable Populations (CHAMMP)
> 325 9th Avenue, 2HH-15
> Box 359911
> Seattle, WA 98104?
> 206-897-4210
> http://www.chammp.org
> (Thurs)
>