[R] Converting SAS Data code to R.

Sun Sep 27 18:10:13 CEST 2009

On Sep 27, 2009, at 11:49 AM, Douglas Bates wrote:

> On Sat, Sep 26, 2009 at 11:33 PM, David Winsemius
> <dwinsemius at comcast.net> wrote:
>> I am contemplating bringing in and merging three NHANES-III datasets
>> from the National Center for Health Statistics that are fixed format
>> with record length=3348, line counts around 20,000 and described by
>> SAS DATA steps. I have downloaded and linked similar datasets from the
>> Continuous NHANES public data releases, but never ones with this many
>> variables at once. In the prior effort I managed the task by some
>> cut-paste-editing from the SAS code file into a corresponding read.fwf
>> R call, but the earlier NHANES-III data is far more voluminous than
>> the more recent "Continuous" version. I am wondering if anyone has
>> experience with such a process and would be willing to share some
>> advice? The SAS code can be seen here:
>
>> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.sas
>
>> The main code file Data step starts out...
>>    FILENAME ADULT "D:\Questionnaire\DAT\ADULT.DAT" LRECL=3348;
>>    *** LRECL includes 2 positions for CRLF, assuming use of PC SAS;
>>    DATA WORK;
>>      INFILE ADULT MISSOVER;
>>      LENGTH
>>        SEQN      7
>>        DMPFSEQ   5
>>        DMPSTAT   3
>>        DMARETHN  3
>>        DMARACER  3
>>        DMAETHNR  3
>>        HSSEX     3
>> The corresponding positions in the INPUT section are
>>     INPUT
>>        SEQN     1-5
>>        DMPFSEQ  6-10
>>        DMPSTAT  11
>>        DMARETHN 12
>>        DMARACER 13
>>        DMAETHNR 14
>>        HSSEX    15
>> The note about CRLF appears to be implying that those characters are
>> being counted as part of the length of the first variable, SEQN, but
>> that there are only 5 meaningful positions. I suppose I can find out by
>> trial and error how to read such files, but it would save me some time
>> if anyone in the audience has worked through this on this data before.
>> One thought would be to import the data with the SAS work-alike
>> program, WKS, (which I have not used before) and then to read in with
>> read.xport from the foreign library. That would obviate the need to
>> understand the character position issue, but probably has a time
>> commitment to get it up and running and learn how to use it.
>> Another thought would be to parse the fixed width SAS Data step code
>> into pieces and build a data.frame from which I then extract the
>> row.names, col.names, and colClasses from that centralized structure.
>
> Are the data available to the public somewhere or could just a few
> records be made available?

Yes. Just trim the file name and the CDC ftp server accepts the path  
specification:

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/

The file that goes with that SAS code is adult.dat

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.dat
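
For anyone who wants to check the record layout before committing to a
full read, here is a minimal sketch. It assumes adult.dat has been
downloaded locally; peek_record_lengths is a made-up helper name, and
the demonstration below uses a synthetic file rather than the real
data. If the SAS comment is right that LRECL=3348 includes 2 positions
for CRLF, each record should report 3346 characters, because readLines
drops the line terminator.

```r
# Hypothetical helper (not from any package): character counts of the
# first n records. readLines() strips the LF or CRLF terminator.
peek_record_lengths <- function(path, n = 5L) {
  nchar(readLines(path, n = n), type = "bytes")
}

# Synthetic stand-in for adult.dat: two 3346-character records with CRLF.
tmp <- tempfile(fileext = ".dat")
con <- file(tmp, "wb")   # binary mode, so "\r\n" is written unchanged
writeLines(rep(strrep("1", 3346), 2), con, sep = "\r\n")
close(con)

lens <- peek_record_lengths(tmp)
bytes_per_record <- file.size(tmp) / length(lens)
print(lens)              # data columns per record, terminator excluded
print(bytes_per_record)  # bytes per record on disk, terminator included
```

On the real file, a mismatch between the two numbers other than 2 would
suggest a different line terminator or ragged records.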

>
> The reason I ask is because I imagine there are a lot of missing data
> in each record (the data are arranged in the "wide" format for
> longitudinal data and includes follow-up questions that will not apply
> to most respondents).  The missing data indicator, if any, and the
> format of the other fields will be important in deciding how to split
> the data.

Thanks for that. It was not designed as a longitudinal study, but
rather as a cross-sectional study that was spaced over several years.
They did a re-exam of some sort, but that was not the primary purpose,
nor will it be my particular interest. I have tried to determine by
examination whether "." or " " is the missing value indicator, and it
appears that both may be used, although there are many more spaces. Most
of the input suggests to my 15-year-old memories of SAS that the data
is numeric, but there are 17 variables where the input spec is "$nn":

 > varLines[grep("[[:punct:]]", varLines)]
 [1] "        HAX11AG  $6"  "        HAX11AH  $6"  "        HAX11AI  $6"
 [4] "        HAX11AJ  $6"  "        HAX11AK  $6"  "        HAX11AL  $6"
 [7] "        HAX11AM  $6"  "        HAX11AN  $6"  "        HAX11AO  $6"
[10] "        HAX11AP  $6"  "        HAX11AQ  $6"  "        HAX11AR  $6"
[13] "        HAX11AS  $6"  "        HAX11AT  $6"  "        HAX11AU  $6"
[16] "        HAX11AV  $6"  "        HAZA1CC  $30"
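
The idea of building one central layout table from the SAS code can be
sketched roughly as follows. This is only an illustration:
parse_sas_input is a made-up helper, it assumes simple column-style
lines of the form "NAME start-end", "NAME col" or "NAME $ start-end",
and the two records are invented, not taken from adult.dat. It uses the
INPUT column ranges rather than the LENGTH values since, as far as I
recall, LENGTH gives the storage size in bytes for numeric variables
and not the field width. Both "." and blanks are mapped to NA.

```r
# Hypothetical helper: turn column-style SAS INPUT lines into a layout table.
parse_sas_input <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]
  is_char <- grepl("$", lines, fixed = TRUE)       # "$" marks character variables
  name <- sub("[[:space:]].*$", "", lines)         # first token is the variable name
  # Remainder with "$" and blanks removed, e.g. "1-5" or "11"
  cols <- gsub("[$[:space:]]", "", sub("^[^[:space:]]+", "", lines))
  start <- as.integer(sub("-.*$", "", cols))
  end <- as.integer(sub("^.*-", "", cols))         # equals start for single columns
  data.frame(name = name, start = start, end = end,
             width = end - start + 1L,
             class = ifelse(is_char, "character", "numeric"),
             stringsAsFactors = FALSE)
}

# The first seven variables of the INPUT section quoted above.
input_lines <- c("SEQN     1-5", "DMPFSEQ  6-10", "DMPSTAT  11",
                 "DMARETHN 12", "DMARACER 13", "DMAETHNR 14", "HSSEX    15")
layout <- parse_sas_input(input_lines)

# Two invented 15-character records: one blank field and one "." field.
tmp <- tempfile()
writeLines(c("0000300001211 1", "00004000022.332"), tmp)

dat <- read.fwf(tmp, widths = layout$width, col.names = layout$name,
                colClasses = layout$class,
                na.strings = c(".", ""), strip.white = TRUE)
print(layout)
print(dat)
```

This only works as written if the variables are contiguous and in
column order. For the full file it would be worth checking that
layout$start always equals the previous end + 1, and passing negative
widths to read.fwf for any gaps to be skipped.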

-- 
David Winsemius, MD
Heritage Laboratories
West Hartford, CT