[R] Converting SAS Data code to R.
David Winsemius
dwinsemius at comcast.net
Mon Sep 28 04:00:08 CEST 2009
On Sep 27, 2009, at 6:01 PM, David Winsemius wrote:
>
> On Sep 27, 2009, at 12:10 PM, David Winsemius wrote:
>
>>
>> On Sep 27, 2009, at 11:49 AM, Douglas Bates wrote:
>>
>>> On Sat, Sep 26, 2009 at 11:33 PM, David Winsemius
>>> <dwinsemius at comcast.net> wrote:
>>>> I am contemplating bringing in and merging three NHANES-III
>>>> datasets from
>>>> the National Center for Health Statistics that are fixed format
>>>> with record
>>>> length=3348, line counts around 20,000 and described by SAS DATA
>>>> steps. I
>>>> have downloaded and linked similar datasets from the Continuous
>>>> NHANES
>>>> public data releases, but never ones with this many variables at
>>>> once. In
>>>> the prior effort I managed the task by some cut-paste-editing
>>>> from the SAS
>>>> code file into a corresponding read.fwf R call, but the earlier
>>>> NHANES-III
>>>> data is far more voluminous than the more recent "Continuous"
>>>> version. I am
>>>> wondering if anyone has experience with such a process and would
>>>> be willing
>>>> to share some advice? The SAS code can be seen here:
>>>
>>>> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.sas
>>>
>>>> The main code file Data step starts out...
>>>> FILENAME ADULT "D:\Questionnaire\DAT\ADULT.DAT" LRECL=3348;
>>>> *** LRECL includes 2 positions for CRLF, assuming use of PC SAS;
>>>> DATA WORK;
>>>> INFILE ADULT MISSOVER;
>>>> LENGTH
>>>> SEQN 7
>>>> DMPFSEQ 5
>>>> DMPSTAT 3
>>>> DMARETHN 3
>>>> DMARACER 3
>>>> DMAETHNR 3
>>>> HSSEX 3
>>>> The corresponding positions in the INPUT section are
>>>> INPUT
>>>> SEQN 1-5
>>>> DMPFSEQ 6-10
>>>> DMPSTAT 11
>>>> DMARETHN 12
>>>> DMARACER 13
>>>> DMAETHNR 14
>>>> HSSEX 15
>>>> The note about CRLF appears to be implying that those characters
>>>> are being
>>>> counted as part of the length of the first variable, SEQN, but
>>>> that there
>>>> are only 5 meaningful positions. I suppose I can find out by
>>>> trial and error
>>>> how to read such files, but it would save me some time if anyone
>>>> in the
>>>> audience has worked through this on this data before.
>>>> One thought would be to import the data with the SAS work-alike
>>>> program,
>>>> WKS, (which I have not used before) and then to read in with
>>>> read.xport from
>>>> the foreign library. That would obviate the need to understand
>>>> the character
>>>> position issue, but probably has a time commitment to get it up
>>>> and running
>>>> and learn how to use it.
>>>> Another thought would be to parse the fixed width SAS Data step
>>>> code into
>>>> pieces and build a data.frame from which I then extract the
>>>> row.names,
>>>> col.names, and colClasses from that centralized structure.
>>>
>>> Are the data available to the public somewhere or could just a few
>>> records be made available?
>>
>> Yes. Just trim the file name and the CDC ftp server accepts the
>> path specification:
>>
>> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/
>>
>> The file that goes with that SAS code is adult.dat
>>
>> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.dat
>>
>>>
>>> The reason I ask is because I imagine there are a lot of missing
>>> data
>>> in each record (the data are arranged in the "wide" format for
>>> longitudinal data and includes follow-up questions that will not
>>> apply
>>> to most respondents). The missing data indicator, if any, and the
>>> format of the other fields will be important in deciding how to
>>> split
>>> the data.
>>
>> Thanks for that. It was not designed as a longitudinal study, but
>> rather as cross-sectional study that was spaced over several years.
>> They did a re-exam of some sort, but that was not the primary
>> purpose, nor will it be my particular interest. I have tried to
>> determine by examination whether "." or " " is the missing value
>> indicator and it appears that both may used although there are many
>> more spaces. Most of the input suggests to my 15-year-old memories
>> of SAS that the data is numeric but there are 17 variables where
>> input spec is "$nn"
>>
>> > varLines[grep("[[:punct:]]", varLines)]
>> [1] " HAX11AG $6" " HAX11AH $6" " HAX11AI
>> $6"
>> [4] " HAX11AJ $6" " HAX11AK $6" " HAX11AL
>> $6"
>> [7] " HAX11AM $6" " HAX11AN $6" " HAX11AO
>> $6"
>> [10] " HAX11AP $6" " HAX11AQ $6" "
>> HAX11AR $6"
>> [13] " HAX11AS $6" " HAX11AT $6" "
>> HAX11AU $6"
>> [16] " HAX11AV $6" " HAZA1CC $30"
>
>
> My progress on this effort so far consists of having figured out how
> to extract the variable names and their associated lengths so I can
> set up a call to read.fwf(). This is waht I did on hte section of
> the SAS code following INPUT that contains those elements:
>
> trim.ws <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "",x)
> # courtesy of a Grothendieck r-help posting of a couple or three
> years ago.
>
> adult.var <- data.frame(varnames =
> sapply( strsplit(trim.ws(varLines) , " +") , "[", 1:2)[1,], varlen=
> sapply( strsplit(trim.ws(varLines) , " +") , "[", 1:2)[2,])
> #so that I can split the trimmed strings on an arbitrary number of
> spaces.
>
> > adult.var[,][1:5,]
> varnames varlen
> 1 SEQN 7
> 2 DMPFSEQ 5
> 3 DMPSTAT 3
> 4 DMARETHN 3
> 5 DMARACER 3
As it turned out the "LENGTH" numbers in the SAS code are not the
number of characters. I needed to calculate the number of characters
by subtracting the starting and ending positions following the INPUT
statements (pasted them in from my editor) :
varlen.ss <- scan(textConnection("SEQN 1-5
DMPFSEQ 6-10
DMPSTAT 11
DMARETHN 12
snipped hundreds of varialbe names and postion...
HAZMNK1R 3337-3339
HAZNOK1R 3340-3341
HAZMNK5R 3342-3344
HAZNOK5R 3345-3346") ,what=c("character", "character") )
varlen.ssm <- matrix(varlen.ss, ncol=2, byrow=TRUE)
adult.var$vname2 <- varlen.ssm[ ,1]
adult.var$vlen2 <- varlen.ssm[ ,2]
adult.var$end <- unlist(lapply( strsplit( adult.var[,"vlen2"], "-"),
tail, 1))
adult.var$start <- unlist(lapply( strsplit( adult.var[,"vlen2"], "-"),
head, 1))
adult.var$varlen3 <- as.numeric(adult.var$end)-as.numeric(adult.var
$start) + 1
>
> > adult.var[grep("\\$", adult.var$varlen),][1:5,]
> varnames varlen
> 1064 HAX11AG $6
> 1069 HAX11AH $6
> 1074 HAX11AI $6
> 1079 HAX11AJ $6
> 1084 HAX11AK $6
>
> I still have a small number of "varlen" which have the form "$nn"
> but I suspect that won't be much of a challenge to substitute "" for
> "$". I think I will first create a column that is "numeric" for all
> the rows without "$" and "character" for all the ones with "$".
adult.var$charvar <- NA
adult.var$charvar[grep("\\$", adult.var$varlen)] <- "character"
adult.var$charvar[-grep("\\$", adult.var$varlen)] <- "numeric"
So the successful read operation then followed:
adult.read <- with( adult.var, read.fwf( "/Users/davidwinsemius/
Documents/Mortality/NHANES3/NH3.adult.dat", widths=varlen3,
colClasses=charvar, col.names=varnames) )
> str(adult.read)
'data.frame': 20050 obs. of 1238 variables:
$ SEQN : num 3 4 9 10 11 19 34 40 44 45 ...
$ DMPFSEQ : num 3872 4115 4064 5386 8142 ...
$ DMPSTAT : num 2 2 2 2 2 2 2 2 3 3 ...
$ DMARETHN: num 3 3 1 1 3 2 2 3 1 2 ...
$ DMARACER: num 1 1 1 1 1 2 2 1 1 2 ...
$ DMAETHNR: num 1 1 3 3 1 3 3 1 3 3 ...
$ HSSEX : num 1 2 2 1 1 1 2 2 2 2 ...
.....snipped the rest of the variables....
--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
More information about the R-help
mailing list