[R] Converting SAS Data code to R.

Mon Sep 28 00:01:43 CEST 2009

On Sep 27, 2009, at 12:10 PM, David Winsemius wrote:

>
> On Sep 27, 2009, at 11:49 AM, Douglas Bates wrote:
>
>> On Sat, Sep 26, 2009 at 11:33 PM, David Winsemius
>> <dwinsemius at comcast.net> wrote:
>>> I am contemplating bringing in and merging three NHANES-III datasets
>>> from the National Center for Health Statistics that are fixed format
>>> with record length=3348, line counts around 20,000 and described by
>>> SAS DATA steps. I have downloaded and linked similar datasets from the
>>> Continuous NHANES public data releases, but never ones with this many
>>> variables at once. In the prior effort I managed the task by some
>>> cut-paste-editing from the SAS code file into a corresponding read.fwf
>>> R call, but the earlier NHANES-III data is far more voluminous than
>>> the more recent "Continuous" version. I am wondering if anyone has
>>> experience with such a process and would be willing to share some
>>> advice? The SAS code can be seen here:
>>
>>> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.sas
>>
>>> The main code file Data step starts out...
>>>   FILENAME ADULT "D:\Questionnaire\DAT\ADULT.DAT" LRECL=3348;
>>>   *** LRECL includes 2 positions for CRLF, assuming use of PC SAS;
>>>   DATA WORK;
>>>     INFILE ADULT MISSOVER;
>>>     LENGTH
>>>       SEQN      7
>>>       DMPFSEQ   5
>>>       DMPSTAT   3
>>>       DMARETHN  3
>>>       DMARACER  3
>>>       DMAETHNR  3
>>>       HSSEX     3
>>> The corresponding positions in the INPUT section are
>>>    INPUT
>>>       SEQN     1-5
>>>       DMPFSEQ  6-10
>>>       DMPSTAT  11
>>>       DMARETHN 12
>>>       DMARACER 13
>>>       DMAETHNR 14
>>>       HSSEX    15
>>> The note about CRLF appears to be implying that those characters are
>>> being counted as part of the length of the first variable, SEQN, but
>>> that there are only 5 meaningful positions. I suppose I can find out
>>> by trial and error how to read such files, but it would save me some
>>> time if anyone in the audience has worked through this on this data
>>> before.
>>> One thought would be to import the data with the SAS work-alike
>>> program, WKS, (which I have not used before) and then to read in with
>>> read.xport from the foreign library. That would obviate the need to
>>> understand the character position issue, but probably has a time
>>> commitment to get it up and running and learn how to use it.
>>> Another thought would be to parse the fixed width SAS Data step code
>>> into pieces and build a data.frame from which I then extract the
>>> row.names, col.names, and colClasses from that centralized structure.
>>
>> Are the data available to the public somewhere or could just a few
>> records be made available?
>
> Yes. Just trim the file name and the CDC ftp server accepts the path
> specification:
>
> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/
>
> The file that goes with that SAS code is adult.dat
>
> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.dat
>
>>
>> The reason I ask is because I imagine there are a lot of missing data
>> in each record (the data are arranged in the "wide" format for
>> longitudinal data and include follow-up questions that will not apply
>> to most respondents).  The missing data indicator, if any, and the
>> format of the other fields will be important in deciding how to split
>> the data.
>
> Thanks for that. It was not designed as a longitudinal study, but
> rather as a cross-sectional study that was spaced over several years.
> They did a re-exam of some sort, but that was not the primary purpose,
> nor will it be my particular interest. I have tried to determine by
> examination whether "." or " " is the missing value indicator, and it
> appears that both may be used, although there are many more spaces.
> Most of the input suggests to my 15-year-old memories of SAS that the
> data is numeric, but there are 17 variables where the input spec is
> "$nn":
>
> > varLines[grep("[[:punct:]]", varLines)]
>  [1] "        HAX11AG  $6"  "        HAX11AH  $6"  "        HAX11AI  $6"
>  [4] "        HAX11AJ  $6"  "        HAX11AK  $6"  "        HAX11AL  $6"
>  [7] "        HAX11AM  $6"  "        HAX11AN  $6"  "        HAX11AO  $6"
> [10] "        HAX11AP  $6"  "        HAX11AQ  $6"  "        HAX11AR  $6"
> [13] "        HAX11AS  $6"  "        HAX11AT  $6"  "        HAX11AU  $6"
> [16] "        HAX11AV  $6"  "        HAZA1CC  $30"

My progress on this effort so far consists of having figured out how
to extract the variable names and their associated lengths so I can
set up a call to read.fwf(). This is what I did on the section of the
SAS code following INPUT that contains those elements:

trim.ws <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)
# courtesy of a Grothendieck r-help posting of a couple or three years ago.

# Split each trimmed line on an arbitrary number of spaces and keep the
# first two fields (row 1 = variable name, row 2 = length spec).
parts <- sapply(strsplit(trim.ws(varLines), " +"), "[", 1:2)
adult.var <- data.frame(varnames = parts[1, ], varlen = parts[2, ])

 > adult.var[,][1:5,]
   varnames varlen
1     SEQN      7
2  DMPFSEQ      5
3  DMPSTAT      3
4 DMARETHN      3
5 DMARACER      3
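
One caveat worth checking: the lengths printed above (SEQN 7, DMPFSEQ 5, DMPSTAT 3) match the LENGTH statement in the quoted SAS code, while the INPUT statement gives column ranges (SEQN 1-5, DMPFSEQ 6-10, DMPSTAT 11). For numeric variables a SAS LENGTH is a storage size in bytes, not a field width, and read.fwf() needs field widths. What follows is only a sketch of deriving widths from the INPUT ranges. The names inputLines and sample_records are hypothetical, the two records are made up rather than taken from adult.dat, and using "." as a full-field missing marker is an assumption.

```r
# INPUT-section lines as quoted earlier in the thread.
inputLines <- c("      SEQN     1-5",
                "      DMPFSEQ  6-10",
                "      DMPSTAT  11",
                "      DMARETHN 12")

# Split each line into a variable name and a column range.
fields    <- strsplit(trimws(inputLines), " +")
var_names <- vapply(fields, `[`, character(1), 1)
ranges    <- vapply(fields, `[`, character(1), 2)

# A range is either "first-last" or a single column such as "11".
bounds <- strsplit(ranges, "-", fixed = TRUE)
first  <- as.integer(vapply(bounds, function(b) b[1], character(1)))
last   <- as.integer(vapply(bounds, function(b) b[length(b)], character(1)))
widths <- last - first + 1L

# Two made-up 12-character records (NOT real NHANES data).
sample_records <- c("000030000112",
                    "00004000022.")

con <- textConnection(sample_records)
dat <- read.fwf(con, widths = widths, col.names = var_names,
                na.strings = ".")
close(con)
```

The same widths vector could then be passed to read.fwf() on the real file, provided the INPUT ranges are contiguous. Any gaps between ranges would need to be skipped with negative widths.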

 > adult.var[grep("\\$", adult.var$varlen),][1:5,]
      varnames varlen
1064  HAX11AG     $6
1069  HAX11AH     $6
1074  HAX11AI     $6
1079  HAX11AJ     $6
1084  HAX11AK     $6

I still have a small number of "varlen" which have the form "$nn" but  
I suspect that won't be much of a challenge to substitute "" for "$".  
I think I will first create a column that is "numeric" for all the  
rows without "$" and "character" for all the ones with "$".
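
The plan in the last paragraph might be sketched roughly as follows, on a four-row stand-in for adult.var built from values shown earlier in this thread. The column names colclass and len are hypothetical additions, and stringsAsFactors = FALSE is set explicitly so that varlen is a character vector regardless of the R version's default.

```r
# Small stand-in for adult.var, using values that appear in the thread.
adult.var <- data.frame(
  varnames = c("SEQN", "DMPFSEQ", "HAX11AG", "HAZA1CC"),
  varlen   = c("7", "5", "$6", "$30"),
  stringsAsFactors = FALSE
)

# Rows whose length spec carries a "$" are character variables.
is_char <- grepl("$", adult.var$varlen, fixed = TRUE)
adult.var$colclass <- ifelse(is_char, "character", "numeric")

# Strip the "$" and store the remaining digits as an integer.
adult.var$len <- as.integer(sub("$", "", adult.var$varlen, fixed = TRUE))
```

Using fixed = TRUE avoids having to escape "$", which is otherwise an end-of-string anchor in a regular expression. The colclass column can then be supplied as the colClasses argument that read.fwf() passes through to read.table().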

-- 

David Winsemius, MD
Heritage Laboratories
West Hartford, CT