[R] Problem using reshape with missing values in idvar
Bing Ho
2bingho at stanford.edu
Mon Oct 31 20:54:48 CET 2005
Hello everybody,
I have been recently using reshape to convert "long" data to "wide"
data. Everything was going well until I reached some problematic
datasets. It has taken me a couple of weeks to finally figure out
what might be happening.
The problem is reproducible with test cases, and on two versions of R
(Windows 2.2.0 and x86-64 Fedora Core 3 R 2.2.0).
The data started out in Microsoft Excel 2003 before being saved as a
.csv file. The data stores the records of study participants which
may return a variable number of times for follow up (between one to
several dozen follow up visits). The research staff would start a row
of data for each study participant, and each follow up visit would be
row underneath the previous row for that study participant. Because
of the variable nature of follow up, many fields will be "NA" since
some participants may have only one study.
A sample is as follows (note that the ID appears consecutively for
each follow up, or in other words, all the ID are grouped chronologically)
ID DOB GENDER ETHNICITY TESTDATE TESTRESULT
1 1/1/1900 1 1 1/1/2005 100
1 1/1/1900 1 1 1/2/2005 110
2 8/1/1930 2 1 2/1/2005 80
3 12/1/1990 2 2 3/1/2005 200
3 12/1/1990 2 2 3/2/2005 205
3 12/1/1990 2 2 3/3/2005 220
My code is as follows:
df <- read.csv("df.csv") # Read .csv file into R
df.tt <- sequence(rle(df$ID)$length) # Create a sequence vector tt
based on the number of times ID appears
# Then reshape from long into wide format, with only the time-varying
variables repeated
df_wide <- reshape(cbind(df.tt,df),
idvar("ID","DOB","GENDER","ETHNICITY"), timevar="tt",direction="wide")
This testcase works fine.
Now taking a similar test case, with some missing values in the
idvar, like so,
ID DOB GENDER ETHNICITY TESTDATE TESTRESULT
1 1/1/1900 1 1 1/1/2005 100
1 1/1/1900 1 1 1/2/2005 110
2 8/1/1930 NA NA 2/1/2005 80
3 12/1/1990 NA NA 3/1/2005 200
3 12/1/1990 NA NA 3/2/2005 205
3 12/1/1990 NA NA 3/3/2005 220
Will result with a wide dataframe that only has id 1 and 2 (3 is dropped).
It took me some time to figure out that missing values in idvar will
result in the problem. As long as the idvar does not have any missing
values, all works out well.
Is there a way to use reshape to handle missing values in the idvar?
I'm just trying to avoid unnecessary expansion of my dataset with the
reshape command by holding the six dozen or so demographic variables
in my actual datasets constant.
Thank you for your help!
More information about the R-help
mailing list