[R] Converting SAS Data code to R.
David Winsemius
dwinsemius at comcast.net
Sun Sep 27 18:10:13 CEST 2009
On Sep 27, 2009, at 11:49 AM, Douglas Bates wrote:
> On Sat, Sep 26, 2009 at 11:33 PM, David Winsemius
> <dwinsemius at comcast.net> wrote:
>> I am contemplating bringing in and merging three NHANES-III datasets from
>> the National Center for Health Statistics that are fixed format with record
>> length=3348, line counts around 20,000 and described by SAS DATA steps. I
>> have downloaded and linked similar datasets from the Continuous NHANES
>> public data releases, but never ones with this many variables at once. In
>> the prior effort I managed the task by some cut-paste-editing from the SAS
>> code file into a corresponding read.fwf R call, but the earlier NHANES-III
>> data is far more voluminous than the more recent "Continuous" version. I am
>> wondering if anyone has experience with such a process and would be willing
>> to share some advice? The SAS code can be seen here:
>
>> ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.sas
>
>> The main code file Data step starts out...
>> FILENAME ADULT "D:\Questionnaire\DAT\ADULT.DAT" LRECL=3348;
>> *** LRECL includes 2 positions for CRLF, assuming use of PC SAS;
>> DATA WORK;
>> INFILE ADULT MISSOVER;
>> LENGTH
>> SEQN 7
>> DMPFSEQ 5
>> DMPSTAT 3
>> DMARETHN 3
>> DMARACER 3
>> DMAETHNR 3
>> HSSEX 3
>> The corresponding positions in the INPUT section are
>> INPUT
>> SEQN 1-5
>> DMPFSEQ 6-10
>> DMPSTAT 11
>> DMARETHN 12
>> DMARACER 13
>> DMAETHNR 14
>> HSSEX 15
>> The note about CRLF appears to be implying that those characters are being
>> counted as part of the length of the first variable, SEQN, but that there
>> are only 5 meaningful positions. I suppose I can find out by trial and error
>> how to read such files, but it would save me some time if anyone in the
>> audience has worked through this on this data before.
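A minimal sketch of the position-to-width arithmetic, assuming the column ranges in the INPUT section (rather than the LENGTH values, which in SAS give storage bytes per variable) define the fields, and that the CRLF note refers to the record length as a whole. The two records below are made up for illustration and are not taken from adult.dat.

```r
# Column ranges copied from the INPUT section quoted above.
spec <- data.frame(
  name  = c("SEQN", "DMPFSEQ", "DMPSTAT", "DMARETHN",
            "DMARACER", "DMAETHNR", "HSSEX"),
  start = c(1, 6, 11, 12, 13, 14, 15),
  end   = c(5, 10, 11, 12, 13, 14, 15),
  stringsAsFactors = FALSE
)
# Field width is end - start + 1.
spec$width <- spec$end - spec$start + 1

# Two hypothetical 15-character records, written to a temporary file.
tmp <- tempfile(fileext = ".dat")
writeLines(c("000030000121231", "000040000221112"), tmp)

# read.fwf splits each record by the widths and names the columns.
adult_demo <- read.fwf(tmp, widths = spec$width, col.names = spec$name)
unlink(tmp)
```

If this reading is right, the line terminator should not need to be counted in any field width, since R drops it when reading each line.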
>> One thought would be to import the data with the SAS work-alike program,
>> WKS, (which I have not used before) and then to read in with read.xport from
>> the foreign library. That would obviate the need to understand the character
>> position issue, but probably has a time commitment to get it up and running
>> and learn how to use it.
>> Another thought would be to parse the fixed width SAS Data step code into
>> pieces and build a data.frame from which I then extract the row.names,
>> col.names, and colClasses from that centralized structure.
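The parsing idea could be sketched roughly as follows. This assumes each INPUT line has the form "NAME start-end" or "NAME start", with an optional "$" marking a character field; the real adult.sas may be laid out differently. The HAX11AG column range is invented for the example, and the names sas_input, pat and spec are hypothetical.

```r
# A few lines in the style of the INPUT section.
# The HAX11AG range (101-106) is made up for illustration.
sas_input <- c(
  "INPUT",
  "  SEQN 1-5",
  "  DMPFSEQ 6-10",
  "  DMPSTAT 11",
  "  HSSEX 15",
  "  HAX11AG $ 101-106",
  ";"
)

# Groups: name, optional "$", start, optional "-end", end.
pat <- paste0("^[[:space:]]*([A-Za-z_][A-Za-z0-9_]*)[[:space:]]+",
              "(\\$?)[[:space:]]*([0-9]+)(-([0-9]+))?[[:space:]]*$")
m <- regmatches(sas_input, regexec(pat, sas_input))
m <- m[lengths(m) > 0]            # drop lines that did not match
parts <- do.call(rbind, m)        # one row per variable

spec <- data.frame(
  name  = parts[, 2],
  start = as.integer(parts[, 4]),
  # A single column such as "DMPSTAT 11" has no end; reuse the start.
  end   = as.integer(ifelse(parts[, 6] == "", parts[, 4], parts[, 6])),
  class = ifelse(parts[, 3] == "$", "character", "numeric"),
  stringsAsFactors = FALSE
)
spec$width <- spec$end - spec$start + 1L

# read.fwf skips columns given as negative widths, so encode any gaps.
gap <- spec$start - c(1L, head(spec$end, -1) + 1L)
widths <- as.vector(rbind(-gap, spec$width))
widths <- widths[widths != 0]
```

With a table like this, spec$name and spec$class could be passed as col.names and colClasses, and widths as the widths argument of read.fwf.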
>
> Are the data available to the public somewhere or could just a few
> records be made available?
Yes. Just trim the file name and the CDC ftp server accepts the path
specification:
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/
The file that goes with that SAS code is adult.dat
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/1A/adult.dat
>
> The reason I ask is because I imagine there are a lot of missing data
> in each record (the data are arranged in the "wide" format for
> longitudinal data and includes follow-up questions that will not apply
> to most respondents). The missing data indicator, if any, and the
> format of the other fields will be important in deciding how to split
> the data.
Thanks for that. It was not designed as a longitudinal study, but
rather as a cross-sectional study that was spaced over several years.
They did a re-exam of some sort, but that was not the primary purpose,
nor will it be my particular interest. I have tried to determine by
examination whether "." or " " is the missing value indicator, and it
appears that both may be used, although there are many more spaces. Most
of the input suggests to my 15-year-old memories of SAS that the data
is numeric, but there are 17 variables where the input spec is "$nn":
> varLines[grep("[[:punct:]]", varLines)]
 [1] " HAX11AG $6"  " HAX11AH $6"  " HAX11AI $6"
 [4] " HAX11AJ $6"  " HAX11AK $6"  " HAX11AL $6"
 [7] " HAX11AM $6"  " HAX11AN $6"  " HAX11AO $6"
[10] " HAX11AP $6"  " HAX11AQ $6"  " HAX11AR $6"
[13] " HAX11AS $6"  " HAX11AT $6"  " HAX11AU $6"
[16] " HAX11AV $6"  " HAZA1CC $30"
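If both "." and blanks do mark missing values, something like the following might handle them, since read.fwf passes na.strings, colClasses and strip.white on to read.table. This is only a sketch on made-up records; the field names NUMVAR and TXTVAR and the widths are hypothetical, not taken from adult.sas.

```r
# Three made-up records: a 5-column id, a 3-column numeric field,
# and a 6-column character field (as a "$6" spec would describe).
tmp <- tempfile(fileext = ".dat")
writeLines(c("00003 72ABC   ",
             "00004  .      ",
             "00005   XYZ123"), tmp)

na_demo <- read.fwf(
  tmp,
  widths     = c(5, 3, 6),
  col.names  = c("SEQN", "NUMVAR", "TXTVAR"),
  colClasses = c("numeric", "numeric", "character"),
  na.strings = c(".", ""),   # treat "." and empty fields as missing
  strip.white = TRUE         # trim padding so all-blank fields become ""
)
unlink(tmp)

# Which entries ended up missing in each column.
is.na(na_demo$NUMVAR)
is.na(na_demo$TXTVAR)
```

Declaring the "$" variables as character in colClasses keeps them from being converted to numbers or factors.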
--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT