[R] Tools For Preparing Data For Analysis
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Sun Jun 10 12:25:35 CEST 2007
Douglas Bates wrote:
> Frank Harrell indicated that it is possible to do a lot of difficult
> data transformation within R itself if you try hard enough but that
> sometimes means working against the S language and its "whole object"
> view to accomplish what you want and it can require knowledge of
> subtle aspects of the S language.
>
Actually, I think Frank's point was subtly different: It is *because* of
the differences in view that it sometimes seems difficult to find the
way to do something in R that is apparently straightforward in SAS.
I.e. the solutions exist and are often elegant, but may require some
lateral thinking.
Case in point: Finding the first or the last observation for each
subject when there are multiple records for each subject. The SAS way
would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that
you can compare the subject ID with the one from the previous record,
working with data that are sorted appropriately.
You can do the same thing in R with a for loop, but there are better
ways, e.g.
  subset(df, !duplicated(ID))
and
  subset(df, rev(!duplicated(rev(ID))))
or maybe
  do.call("rbind", lapply(split(df, df$ID), head, 1))
(or tail for the last record). Or something involving aggregate().
(The latter approaches generalize better to other within-subject
functionals like cumulative doses, etc.)
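A minimal sketch of these idioms on a made-up data frame. The columns visit and dose are invented for illustration, and the rows are assumed to be sorted within subject. The ave() call is just one possible way to get a within-subject cumulative dose.

```r
# Hypothetical example data: several visits per subject, sorted by ID and visit
df <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 3),
  visit = c(1, 2, 3, 1, 2, 1),
  dose  = c(10, 20, 30, 5, 5, 7)
)

# First record per subject: keep rows whose ID has not been seen before
first1 <- subset(df, !duplicated(ID))

# Last record per subject: same idea, scanning the IDs from the end
last1 <- subset(df, rev(!duplicated(rev(ID))))

# Split by subject, take head/tail of each piece, stack the pieces again
first2 <- do.call("rbind", lapply(split(df, df$ID), head, 1))
last2  <- do.call("rbind", lapply(split(df, df$ID), tail, 1))

# A within-subject functional: cumulative dose computed separately per ID
df$cumdose <- ave(df$dose, df$ID, FUN = cumsum)
```

The duplicated() versions keep the original row names, whereas the split() versions end up with row names taken from the ID values.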
The hardest cases that I know of are the ones where you need to turn one
record into many, such as occurs in survival analysis with
time-dependent, piecewise constant covariates. This may require
"transposing the problem", i.e. for each interval you find out which
subjects contribute and with what, whereas the SAS way would be a
within-subject loop over intervals containing an OUTPUT statement.
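As a rough sketch of what "transposing the problem" might look like, assume every subject is followed from time 0 to a time exit, with status 1 marking an event. The data, the cut points and all variable names below are made up for illustration. The loop runs over intervals, not subjects, and each pass picks out the subjects who contribute to that interval.

```r
# Hypothetical data: follow-up runs from 0 to 'exit'; status 1 = event, 0 = censored
subj <- data.frame(ID = 1:3, exit = c(2.5, 0.7, 4.0), status = c(1, 0, 1))

# Hypothetical cut points where the covariate may change; last interval is open-ended
cuts  <- c(0, 1, 3)
upper <- c(cuts[-1], Inf)

# For each interval, find the subjects still under observation at its start
pieces <- lapply(seq_along(cuts), function(i) {
  at_risk <- subj[subj$exit > cuts[i], ]
  if (nrow(at_risk) == 0) return(NULL)
  data.frame(
    ID       = at_risk$ID,
    interval = i,
    start    = cuts[i],
    stop     = pmin(at_risk$exit, upper[i]),
    # an event is credited to the interval in which follow-up ends
    event    = as.numeric(at_risk$status == 1 & at_risk$exit <= upper[i])
  )
})

# Drop empty intervals, stack the pieces, and order by subject and time
long <- do.call("rbind", Filter(Negate(is.null), pieces))
long <- long[order(long$ID, long$start), ]
```

Here long has one row per subject per interval at risk, ready to be merged with interval-specific covariate values. The survSplit() function in the survival package addresses the same task.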
Also, there are some really weird data formats, where e.g. the input
format is different in different records. Back in the 80's, when
punched-card input was still common, it was quite popular to have one
card with background information on a patient plus several cards
detailing visits, and you'd get a stack of cards containing both kinds.
In R you would most likely split on the card type using grep() and then
read the two kinds separately and merge() them later.
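A small sketch of that approach, assuming a made-up layout in which the first field gives the card type ("P" for patient background, "V" for a visit) and the second field is the patient ID. The field names and values are invented for illustration.

```r
# Hypothetical mixed-format input: "P" cards hold ID, sex, birth year;
# "V" cards hold ID, visit day, systolic blood pressure
cards <- c(
  "P 1 F 1950",
  "V 1 10 120",
  "V 1 40 125",
  "P 2 M 1947",
  "V 2 15 140"
)
# With a real file one would start from readLines() on that file instead

# Split the stack by card type
patient_lines <- grep("^P", cards, value = TRUE)
visit_lines   <- grep("^V", cards, value = TRUE)

# Read each kind with its own layout; colClasses keeps "F" from being
# interpreted as logical FALSE
patients <- read.table(text = patient_lines,
                       col.names  = c("type", "ID", "sex", "birthyear"),
                       colClasses = c("character", "integer", "character", "integer"))
visits   <- read.table(text = visit_lines,
                       col.names  = c("type", "ID", "day", "sbp"),
                       colClasses = c("character", "integer", "integer", "integer"))

# Drop the card-type column and attach background data to every visit
both <- merge(patients[-1], visits[-1], by = "ID")
```

The result has one row per visit, with the patient's background information repeated on each row.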