[R] SAS "datalines" or "cards" statement equivalent in R?

Mon Dec 7 22:45:49 CET 2009

On Dec 7, 2009, at 12:37 PM, Marshall Feldman wrote:

> I totally agree with Barry, although it's sometimes convenient to
> include data with analysis code for debugging and/or documentation  
> purposes.
>
> However, the example actually applies equally to separate data files.
> In fact, the example is from the U.S. Bureau of Labor Statistics at
> ftp://ftp.bls.gov/pub/time.series/sm/, which contains nothing but data
> and documentation files. At issue is not where the data come from, but
> rather how to parse relatively complex data organized inconsistently.
> SAS has the built-in ability to parse five different organizations of
> data: list (delimited), modified list, column, formatted, and mixed
> (see http://www.masil.org/sas/input.html). It seems R can parse such
> data, but only with considerable work by the user. It would be great
> to have a function/package that implements something as easy (hah!)
> and flexible as SAS.

>    Marsh
>
> Barry Rowlingson wrote:
>> On Mon, Dec 7, 2009 at 3:53 PM, Marshall Feldman <marsh at uri.edu> wrote:
>>
>>> Regarding the various methods people have suggested, what if a
>>> typical tab-delimited data line looks like:
>>>
>>>    SMS11000000000000001 1990 M01 688.0
>>>
>>> and the SAS INPUT statement is
>>>
>>>  INPUT survey $ 1-2 seasonal $ 3 state $ 4-5 area $ 6-10
>>>        supersector $ 11-12 @13 industry $8. datatype $ 21-22
>>>        year period $ value footnote $ ;

I was thinking of passing a FWF "chopped" input to scan to handle the  
tabs but discovered that read.fwf will parse trailing tab-separated  
fields.

First a bit of experimentation:
 > testdat <- "45678\t567\t45\t6"
 > read.fwf(textConnection(testdat), c(5,100))
      V1 V2  V3 V4 V5
1 45678 NA 567 45  6
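
The reason this works, as far as I can tell from the documentation: read.fwf cuts each line at the given widths, writes the pieces to a temporary file joined by its sep argument (a tab by default), and hands that file to read.table. Tabs already sitting inside the last wide field are therefore treated as further separators. The empty V2 above is the tab that immediately follows the first five characters. A minimal sketch, assuming that tab can simply be skipped with a negative width (the value -1 is my addition, not part of the original experiment):

```r
testdat <- "45678\t567\t45\t6"

# Widths: take 5 characters, skip 1 (the tab), then take up to 100 more.
# A negative width tells read.fwf to drop those columns.
con <- textConnection(testdat)
res <- read.fwf(con, widths = c(5, -1, 100))
close(con)

print(res)
```

With the tab skipped, the result should have four columns and no all-NA column.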

Then the test on your data source:

 > testin <- read.fwf(url("ftp://ftp.bls.gov/pub/time.series/sm/sm.data.1.Alabama",
 +                        open="r"),
 +                    c(2,1,2,5,2,8,2,100), header=FALSE, n=100, skip=1)

# The header line has to be thrown away (skip=1), since its fields no
# longer line up once the "series_id" field is cut into the subfields
# you described.

 > str(testin)
'data.frame':	100 obs. of  12 variables:
  $ V1 : Factor w/ 1 level "SM": 1 1 1 1 1 1 1 1 1 1 ...
  $ V2 : Factor w/ 1 level "S": 1 1 1 1 1 1 1 1 1 1 ...
  $ V3 : int  1 1 1 1 1 1 1 1 1 1 ...
  $ V4 : int  0 0 0 0 0 0 0 0 0 0 ...
  $ V5 : int  0 0 0 0 0 0 0 0 0 0 ...
  $ V6 : int  1 1 1 1 1 1 1 1 1 1 ...
  $ V7 : logi  NA NA NA NA NA NA ...
  $ V8 : logi  NA NA NA NA NA NA ...
  $ V9 : int  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
  $ V10: Factor w/ 12 levels "M01","M02","M03",..: 1 2 3 4 5 6 7 8 9 10 ...
  $ V11: num  1625 1625 1624 1635 1639 ...
  $ V12: logi  NA NA NA NA NA NA ...

Note that the leading 7 fwf fields were parsed, followed by the
trailing tab-separated fields, and that the floating point field is
also complete:

 > testin$V11
   [1] 1625.0 1625.1 1623.7 1634.8 1639.1 1643.5 1641.0 1639.4 1641.2 1636.9 1639.8 1639.3 1636.2
  [14] 1633.8 1637.0 1635.5 1638.3 1639.5 1643.5 1645.3 1647.0 1648.2 1649.3 1650.5 1657.9 1660.4
snipped
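
To get closer to what the SAS INPUT statement delivers (named variables, character fields as factors, a numeric value), something along these lines might do. It is only a sketch on the single sample line quoted in this thread, assuming a 20-character series_id followed directly by tab-separated fields. The real BLS file evidently has padding and a footnote column, so the widths would need adjusting there. The column names are taken from the INPUT statement, and colClasses is passed through to read.table so that codes such as "00000" keep their leading zeros.

```r
# Sample record from the thread (assumed layout: 20-char id, then tabs).
sample_line <- "SMS11000000000000001\t1990\tM01\t688.0"

# Six subfields of series_id, skip the tab (-1), then the tab-separated rest.
widths <- c(2, 1, 2, 5, 2, 8, -1, 100)
fields <- c("survey", "seasonal", "state", "area", "supersector",
            "industry", "year", "period", "value")

con <- textConnection(sample_line)
sm <- read.fwf(con, widths = widths, col.names = fields,
               colClasses = "character")
close(con)

# Convert after reading: codes become factors, year and value become numbers.
code_cols <- c("survey", "seasonal", "state", "area", "supersector",
               "industry", "period")
sm[code_cols] <- lapply(sm[code_cols], factor)
sm$year  <- as.integer(sm$year)
sm$value <- as.numeric(sm$value)

str(sm)
```

Reading everything as character first and converting afterwards avoids having read.table guess that the zero-padded codes are integers.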
-- 
David
>>>
>>> Note that most data lines have no footnote item, as in the sample.
>>>
>>> Here (I think) we'd want all the character variables to be read as
>>> factors, possibly "year" as a date, and "value" as numeric.
>>>
>>
>> Actually I'm surprised that nobody has yet said what a clearly
>> bonkers thing it is to mix up your data and your analysis code in a
>> single file. Now suppose you have another set of data you want to
>> analyse with the same code? Are you going to create a new file and
>> paste the new data in? You've now got two copies of your analysis
>> code - good luck keeping corrections to that code synchronised.
>>
>> This just seems like horrendously bad practice, which is one reason
>> it's kludgy in R. If it was good practice, someone would surely have
>> written a way to do it neatly.
>>
>> Keep your data in data files, and your functions in .R function
>> files. You'll thank me later.
>>
>> Barry
>>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT