[R] SAS XPORT data format [was Re: SAS or R software]
Douglas Bates
bates at stat.wisc.edu
Sat Dec 18 17:00:31 CET 2004
Frank E Harrell Jr wrote:
... much discussion deleted ...
> Regarding CDISC, the SAS transport format that is now accepted by FDA is
> deficient because there is no place for certain metadata (e.g., units of
> measurement, value labels are remote from the datasets, variable names
> are truncated to 8 characters). The preferred format for CDISC will
> become XML.
Since you brought up the SAS XPORT data format I have to respond with my
usual rant about it.
<rant>
When it comes to the SAS XPORT data format, those are at best third or
fourth order deficiencies in the metadata. The first order deficiency
in the metadata is that it does not contain the number of records in a
data set. In this format a file can contain more than one data set and
a data set consists of an unknown number of fixed-length records.
Because of the potential of more than one data set you can't just read
to the end of the file or use the file size and the record size to
calculate the number of records. You must read through the file
examining each group of 80 characters (Why 80 characters? Those of us
who remember punched cards can tell you why.) and for each such group
try to determine if this is the beginning of another record in the
current data set or the beginning of a new data set. How is the
beginning of a new data set indicated? By a magic string of characters.
What if, either perversely or accidentally, this magic string of
characters were included as a text field at the beginning of a record?
You wouldn't be able to tell if you have a new record or a new data set.
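
To make the collision concrete, here is a rough Python sketch of such a
scanner. The marker MAGIC is a made-up stand-in, not the string the real
format uses, and the 80-character records and the function name are
likewise only assumptions for illustration.

```python
CARD = 80  # size of each group of characters, as described above

# Hypothetical marker: a stand-in for the real magic string, which is
# not reproduced here.
MAGIC = b"*** NEW DATA SET ***"


def split_data_sets(stream):
    """Walk the stream in 80-byte groups and start a new data set
    whenever a group begins with MAGIC."""
    data_sets = []
    for offset in range(0, len(stream), CARD):
        group = stream[offset:offset + CARD]
        if group.startswith(MAGIC):
            data_sets.append([])          # looks like a new data set
        elif data_sets:
            data_sets[-1].append(group)   # looks like another record
    return data_sets


header = MAGIC.ljust(CARD)

# One data set holding two ordinary 80-character records.
honest = header + b"alice".ljust(CARD) + b"bob".ljust(CARD)

# One data set holding three records, the second of which happens to
# start with the magic string as its text field.
perverse = (header + b"alice".ljust(CARD)
            + MAGIC.ljust(CARD) + b"bob".ljust(CARD))

print(len(split_data_sets(honest)))    # one data set found
print(len(split_data_sets(perverse)))  # the scanner reports two
```

The writer put one data set into each stream, but the scanner has no
way of knowing that about the second one.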
Even better than that, there are situations in which the number of
records in a data set is not well-defined due to the requirement of
padding the last 80 character group with blanks. (After all, when you
create a punch card deck from your data set you want to get an integer
number of punched cards.) For example, if you are writing an odd number
of 40 character records then you must pad the last 80 character group
with blanks. When reading this data set how can you distinguish the odd
number of records padded with blanks from an even number of records in
which the last record happened to be all blanks? You can't.
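
The 40 character example can be checked in a few lines of Python. This
is only a sketch of the padding rule as described above, not of the
full format; the helper names are invented for the illustration.

```python
CARD = 80        # records are written out in 80-character groups
RECORD_LEN = 40  # record length used in the example above


def rec(text):
    """Build one fixed-length record, blank-padded to RECORD_LEN."""
    return text.encode("ascii").ljust(RECORD_LEN)


def write_records(records):
    """Concatenate the records, then pad the final group with blanks."""
    body = b"".join(records)
    padding = (-len(body)) % CARD
    return body + b" " * padding


def read_records(data):
    """Naive reader: treat every RECORD_LEN bytes as a record."""
    return [data[i:i + RECORD_LEN]
            for i in range(0, len(data), RECORD_LEN)]


three = [rec("a"), rec("b"), rec("c")]
four = three + [rec("")]  # the fourth record is all blanks

# Both data sets produce exactly the same bytes on output ...
print(write_records(three) == write_records(four))

# ... so no reader can recover the original record count.
print(len(read_records(write_records(three))))
```

Writing three records and reading them back yields four, which is the
sense in which writing and reading are not inverses.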
When I first encountered this, I thought that I must not understand the
format properly. I thought that SAS (and, through SAS, the FDA)
couldn't really be using a format in which the number of records in a
data set can be ambiguous. This would mean that the operations of
writing the XPORT data set and reading it are not guaranteed to be
inverses. I started reading material on the SAS web site and discovered
that SAS indeed was aware of this problem and had a solution: users
should not create data sets that exhibit this ambiguity. That's it.
Their solution is "don't do that".
</rant>
I think that replacing the SAS XPORT data format with XML will be a step
forward.