[R] Problem using reshape with missing values in idvar

Mon Oct 31 20:54:48 CET 2005

Hello everybody,

I have been recently using reshape to convert "long" data to "wide" 
data. Everything was going well until I reached some problematic 
datasets. It has taken me a couple of weeks to finally figure out 
what might be happening.

The problem is reproducible with test cases, and on two versions of R 
(Windows 2.2.0 and x86-64 Fedora Core 3 R 2.2.0).

The data started out in Microsoft Excel 2003 before being saved as a 
.csv file. The data stores the records of study participants which 
may return a variable number of times for follow up (between one to 
several dozen follow up visits). The research staff would start a row 
of data for each study participant, and each follow up visit would be 
row underneath the previous row for that study participant. Because 
of the variable nature of follow up, many fields will be "NA" since 
some participants may have only one study.

A sample is as follows (note that the ID appears consecutively for 
each follow up, or in other words, all the ID are grouped chronologically)
ID  DOB  GENDER  ETHNICITY  TESTDATE TESTRESULT
1 1/1/1900 1 1 1/1/2005 100
1 1/1/1900 1 1 1/2/2005 110
2 8/1/1930 2 1 2/1/2005 80
3 12/1/1990 2 2 3/1/2005 200
3 12/1/1990 2 2 3/2/2005 205
3 12/1/1990 2 2 3/3/2005 220

My code is as follows:
df <- read.csv("df.csv")  # Read .csv file into R
df.tt <- sequence(rle(df$ID)$length)  # Create a sequence vector tt 
based on the number of times ID appears
# Then reshape from long into wide format, with only the time-varying 
variables repeated
df_wide <- reshape(cbind(df.tt,df), 
idvar("ID","DOB","GENDER","ETHNICITY"), timevar="tt",direction="wide")

This testcase works fine.

Now taking a similar test case, with some missing values in the 
idvar, like so,
ID  DOB  GENDER  ETHNICITY  TESTDATE TESTRESULT
1 1/1/1900 1 1 1/1/2005 100
1 1/1/1900 1 1 1/2/2005 110
2 8/1/1930 NA NA 2/1/2005 80
3 12/1/1990 NA NA 3/1/2005 200
3 12/1/1990 NA NA 3/2/2005 205
3 12/1/1990 NA NA 3/3/2005 220

Will result with a wide dataframe that only has id 1 and 2 (3 is dropped).

It took me some time to figure out that missing values in idvar will 
result in the problem. As long as the idvar does not have any missing 
values, all works out well.

Is there a way to use reshape to handle missing values in the idvar? 
I'm just trying to avoid unnecessary expansion of my dataset with the 
reshape command by holding the six dozen or so demographic variables 
in my actual datasets constant.

Thank you for your help!