[R] Converting SAS Data code to R.

Mon Sep 28 04:00:08 CEST 2009

On Sep 27, 2009, at 6:01 PM, David Winsemius wrote:

>
> On Sep 27, 2009, at 12:10 PM, David Winsemius wrote:
>
>>
>> On Sep 27, 2009, at 11:49 AM, Douglas Bates wrote:
>>
>>> On Sat, Sep 26, 2009 at 11:33 PM, David Winsemius
>>> <dwinsemius at comcast.net> wrote:
>>>> I am contemplating bringing in and merging three NHANES-III  
>>>> datasets from
>>>> the National Center for Health Statistics that are fixed format  
>>>> with record
>>>> length=3348, line counts around 20,000 and described by SAS DATA  
>>>> steps. I
>>>> have downloaded and linked similar datasets from the Continuous  
>>>> NHANES
>>>> public data releases, but never ones with this many variables at  
>>>> once. In
>>>> the prior effort I managed the task by some cut-paste-editing  
>>>> from the SAS
>>>> code file into a corresponding read.fwf R call, but the earlier  
>>>> NHANES-III
>>>> data is far more voluminous than the more recent "Continuous"  
>>>> version. I am
>>>> wondering if anyone has experience with such a process and would  
>>>> be willing
>>>> to share some advice? The SAS code can be seen here:
>>>
>>>> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.sas
>>>
>>>> The main code file Data step starts out...
>>>>  FILENAME ADULT "D:\Questionnaire\DAT\ADULT.DAT" LRECL=3348;
>>>>  *** LRECL includes 2 positions for CRLF, assuming use of PC SAS;
>>>>  DATA WORK;
>>>>    INFILE ADULT MISSOVER;
>>>>    LENGTH
>>>>      SEQN      7
>>>>      DMPFSEQ   5
>>>>      DMPSTAT   3
>>>>      DMARETHN  3
>>>>      DMARACER  3
>>>>      DMAETHNR  3
>>>>      HSSEX     3
>>>> The corresponding positions in the INPUT section are
>>>>   INPUT
>>>>      SEQN     1-5
>>>>      DMPFSEQ  6-10
>>>>      DMPSTAT  11
>>>>      DMARETHN 12
>>>>      DMARACER 13
>>>>      DMAETHNR 14
>>>>      HSSEX    15
>>>> The note about CRLF appears to be implying that those characters  
>>>> are being
>>>> counted as part of the length of the first variable, SEQN, but  
>>>> that there
>>>> are only 5 meaningful positions. I suppose I can find out by  
>>>> trial and error
>>>> how to read such files, but it would save me some time if anyone  
>>>> in the
>>>> audience has worked through this on this data before.
>>>> One thought would be to import the data with the SAS work-alike  
>>>> program,
>>>> WKS, (which I have not used before) and then to read in with  
>>>> read.xport from
>>>> the foreign library. That would obviate the need to understand  
>>>> the character
>>>> position issue, but probably has a time commitment to get it up  
>>>> and running
>>>> and learn how to use it.
>>>> Another thought would be to parse the fixed width SAS Data step  
>>>> code into
>>>> pieces and build a data.frame from which I then extract the  
>>>> row.names,
>>>> col.names, and colClasses from that centralized structure.
>>>
>>> Are the data available to the public somewhere or could just a few
>>> records be made available?
>>
>> Yes. Just trim the file name and the CDC ftp server accepts the  
>> path specification:
>>
>> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/
>>
>> The file that goes with that SAS code is adult.dat
>>
>> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.dat
>>
>>>
>>> The reason I ask is because I imagine there are a lot of missing  
>>> data
>>> in each record (the data are arranged in the "wide" format for
>>> longitudinal data and includes follow-up questions that will not  
>>> apply
>>> to most respondents).  The missing data indicator, if any, and the
>>> format of the other fields will be important in deciding how to  
>>> split
>>> the data.
>>
>> Thanks for that. It was not designed as a longitudinal study, but  
>> rather as cross-sectional study that was spaced over several years.  
>> They did a re-exam of some sort, but that was not the primary  
>> purpose, nor will it be my particular interest. I have tried to  
>> determine by examination whether "." or " " is the missing value  
>> indicator and it appears that both may used although there are many  
>> more spaces. Most of the input suggests to my 15-year-old memories  
>> of SAS that the data is numeric but there are 17 variables where  
>> input spec is "$nn"
>>
>> > varLines[grep("[[:punct:]]", varLines)]
>> [1] "        HAX11AG  $6"  "        HAX11AH  $6"  "        HAX11AI   
>> $6"
>> [4] "        HAX11AJ  $6"  "        HAX11AK  $6"  "        HAX11AL   
>> $6"
>> [7] "        HAX11AM  $6"  "        HAX11AN  $6"  "        HAX11AO   
>> $6"
>> [10] "        HAX11AP  $6"  "        HAX11AQ  $6"  "         
>> HAX11AR  $6"
>> [13] "        HAX11AS  $6"  "        HAX11AT  $6"  "         
>> HAX11AU  $6"
>> [16] "        HAX11AV  $6"  "        HAZA1CC  $30"
>
>
> My progress on this effort so far consists of having figured out how  
> to extract the variable names and their associated lengths so I can  
> set up a call to read.fwf(). This is waht I did on hte section of  
> the SAS code following INPUT that contains those elements:
>
> trim.ws <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "",x)
> # courtesy of a Grothendieck r-help posting of a couple or three  
> years ago.
>
> adult.var <- data.frame(varnames =  
> sapply( strsplit(trim.ws(varLines) , " +") ,  "[", 1:2)[1,], varlen=  
> sapply( strsplit(trim.ws(varLines) , " +") ,  "[", 1:2)[2,])
> #so that I can split the trimmed strings on an arbitrary number of  
> spaces.
>
> > adult.var[,][1:5,]
>  varnames varlen
> 1     SEQN      7
> 2  DMPFSEQ      5
> 3  DMPSTAT      3
> 4 DMARETHN      3
> 5 DMARACER      3

As it turned out the "LENGTH" numbers in the SAS code are not the  
number of characters. I needed to calculate the number of characters  
by subtracting the starting and ending positions following the INPUT  
statements (pasted them in from my editor) :

varlen.ss <- scan(textConnection("SEQN     1-5
         DMPFSEQ  6-10
         DMPSTAT  11
         DMARETHN 12
snipped hundreds of varialbe names and postion...
        HAZMNK1R 3337-3339
         HAZNOK1R 3340-3341
         HAZMNK5R 3342-3344
         HAZNOK5R 3345-3346") ,what=c("character", "character") )

varlen.ssm <- matrix(varlen.ss, ncol=2, byrow=TRUE)
adult.var$vname2 <- varlen.ssm[ ,1]
adult.var$vlen2 <- varlen.ssm[ ,2]

adult.var$end <- unlist(lapply( strsplit( adult.var[,"vlen2"], "-"),  
tail, 1))
adult.var$start <- unlist(lapply( strsplit( adult.var[,"vlen2"], "-"),  
head, 1))
adult.var$varlen3 <- as.numeric(adult.var$end)-as.numeric(adult.var 
$start) + 1
>
> > adult.var[grep("\\$", adult.var$varlen),][1:5,]
>     varnames varlen
> 1064  HAX11AG     $6
> 1069  HAX11AH     $6
> 1074  HAX11AI     $6
> 1079  HAX11AJ     $6
> 1084  HAX11AK     $6
>
> I still have a small number of "varlen" which have the form "$nn"  
> but I suspect that won't be much of a challenge to substitute "" for  
> "$". I think I will first create a column that is "numeric" for all  
> the rows without "$" and "character" for all the ones with "$".

adult.var$charvar <- NA
  adult.var$charvar[grep("\\$", adult.var$varlen)] <- "character"
adult.var$charvar[-grep("\\$", adult.var$varlen)] <- "numeric"

So the successful read operation then followed:

adult.read <- with( adult.var, read.fwf( "/Users/davidwinsemius/ 
Documents/Mortality/NHANES3/NH3.adult.dat",  widths=varlen3,  
colClasses=charvar, col.names=varnames) )

 > str(adult.read)
'data.frame':	20050 obs. of  1238 variables:
  $ SEQN    : num  3 4 9 10 11 19 34 40 44 45 ...
  $ DMPFSEQ : num  3872 4115 4064 5386 8142 ...
  $ DMPSTAT : num  2 2 2 2 2 2 2 2 3 3 ...
  $ DMARETHN: num  3 3 1 1 3 2 2 3 1 2 ...
  $ DMARACER: num  1 1 1 1 1 2 2 1 1 2 ...
  $ DMAETHNR: num  1 1 3 3 1 3 3 1 3 3 ...
  $ HSSEX   : num  1 2 2 1 1 1 2 2 2 2 ...
.....snipped the rest of the variables....

-- 

David Winsemius, MD
Heritage Laboratories
West Hartford, CT