[R] Converting SAS Data code to R.
David Winsemius
dwinsemius at comcast.net
Mon Sep 28 00:01:43 CEST 2009
On Sep 27, 2009, at 12:10 PM, David Winsemius wrote:
>
> On Sep 27, 2009, at 11:49 AM, Douglas Bates wrote:
>
>> On Sat, Sep 26, 2009 at 11:33 PM, David Winsemius
>> <dwinsemius at comcast.net> wrote:
>>> I am contemplating bringing in and merging three NHANES-III
>>> datasets from
>>> the National Center for Health Statistics that are fixed format
>>> with record
>>> length=3348, line counts around 20,000 and described by SAS DATA
>>> steps. I
>>> have downloaded and linked similar datasets from the Continuous
>>> NHANES
>>> public data releases, but never ones with this many variables at
>>> once. In
>>> the prior effort I managed the task by some cut-paste-editing from
>>> the SAS
>>> code file into a corresponding read.fwf R call, but the earlier
>>> NHANES-III
>>> data is far more voluminous than the more recent "Continuous"
>>> version. I am
>>> wondering if anyone has experience with such a process and would
>>> be willing
>>> to share some advice? The SAS code can be seen here:
>>
>>> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.sas
>>
>>> The main code file Data step starts out...
>>> FILENAME ADULT "D:\Questionnaire\DAT\ADULT.DAT" LRECL=3348;
>>> *** LRECL includes 2 positions for CRLF, assuming use of PC SAS;
>>> DATA WORK;
>>> INFILE ADULT MISSOVER;
>>> LENGTH
>>> SEQN 7
>>> DMPFSEQ 5
>>> DMPSTAT 3
>>> DMARETHN 3
>>> DMARACER 3
>>> DMAETHNR 3
>>> HSSEX 3
>>> The corresponding positions in the INPUT section are
>>> INPUT
>>> SEQN 1-5
>>> DMPFSEQ 6-10
>>> DMPSTAT 11
>>> DMARETHN 12
>>> DMARACER 13
>>> DMAETHNR 14
>>> HSSEX 15
>>> The note about CRLF appears to be implying that those characters
>>> are being
>>> counted as part of the length of the first variable, SEQN, but
>>> that there
>>> are only 5 meaningful positions. I suppose I can find out by trial
>>> and error
>>> how to read such files, but it would save me some time if anyone
>>> in the
>>> audience has worked through this on this data before.
>>> One thought would be to import the data with the SAS work-alike
>>> program,
>>> WKS, (which I have not used before) and then to read in with
>>> read.xport from
>>> the foreign library. That would obviate the need to understand the
>>> character
>>> position issue, but probably has a time commitment to get it up
>>> and running
>>> and learn how to use it.
>>> Another thought would be to parse the fixed width SAS Data step
>>> code into
>>> pieces and build a data.frame from which I then extract the
>>> row.names,
>>> col.names, and colClasses from that centralized structure.
>>
>> Are the data available to the public somewhere or could just a few
>> records be made available?
>
> Yes. Just trim the file name and the CDC ftp server accepts the path
> specification:
>
> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/
>
> The file that goes with that SAS code is adult.dat
>
> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.dat
>
>>
>> The reason I ask is because I imagine there are a lot of missing data
>> in each record (the data are arranged in the "wide" format for
>> longitudinal data and includes follow-up questions that will not
>> apply
>> to most respondents). The missing data indicator, if any, and the
>> format of the other fields will be important in deciding how to split
>> the data.
>
> Thanks for that. It was not designed as a longitudinal study, but
> rather as cross-sectional study that was spaced over several years.
> They did a re-exam of some sort, but that was not the primary
> purpose, nor will it be my particular interest. I have tried to
> determine by examination whether "." or " " is the missing value
> indicator and it appears that both may used although there are many
> more spaces. Most of the input suggests to my 15-year-old memories
> of SAS that the data is numeric but there are 17 variables where
> input spec is "$nn"
>
> > varLines[grep("[[:punct:]]", varLines)]
> [1] " HAX11AG $6" " HAX11AH $6" " HAX11AI
> $6"
> [4] " HAX11AJ $6" " HAX11AK $6" " HAX11AL
> $6"
> [7] " HAX11AM $6" " HAX11AN $6" " HAX11AO
> $6"
> [10] " HAX11AP $6" " HAX11AQ $6" " HAX11AR
> $6"
> [13] " HAX11AS $6" " HAX11AT $6" " HAX11AU
> $6"
> [16] " HAX11AV $6" " HAZA1CC $30"
My progress on this effort so far consists of having figured out how
to extract the variable names and their associated lengths so I can
set up a call to read.fwf(). This is waht I did on hte section of the
SAS code following INPUT that contains those elements:
trim.ws <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "",x)
# courtesy of a Grothendieck r-help posting of a couple or three years
ago.
adult.var <- data.frame(varnames =
sapply( strsplit(trim.ws(varLines) , " +") , "[", 1:2)[1,], varlen=
sapply( strsplit(trim.ws(varLines) , " +") , "[", 1:2)[2,])
#so that I can split the trimmed strings on an arbitrary number of
spaces.
> adult.var[,][1:5,]
varnames varlen
1 SEQN 7
2 DMPFSEQ 5
3 DMPSTAT 3
4 DMARETHN 3
5 DMARACER 3
> adult.var[grep("\\$", adult.var$varlen),][1:5,]
varnames varlen
1064 HAX11AG $6
1069 HAX11AH $6
1074 HAX11AI $6
1079 HAX11AJ $6
1084 HAX11AK $6
I still have a small number of "varlen" which have the form "$nn" but
I suspect that won't be much of a challenge to substitute "" for "$".
I think I will first create a column that is "numeric" for all the
rows without "$" and "character" for all the ones with "$".
--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
More information about the R-help
mailing list