[R] Getting information encoded in a SAS, SPSS or Stata command file into R.

Thu Nov 15 00:48:18 CET 2012

On Nov 14, 2012, at 2:33 PM, andrewH wrote:

> Dear Anthony – 
> 
> On closer examination, what I am talking about is not factor levels, but
> something different (but analogous). The data that is categorical all has
> integer codes, so the file is entirely numeric. The SAS proc format then
> gives text strings for each code for each categorical variable. Like this:
> 
> value REGION_f
>  11 = "New England Division"
>  12 = "Middle Atlantic Division"
>  21 = "East North Central Division" 
>  22 = "West North Central Division"
>  31 = "South Atlantic Division"
>  32 = "East South Central Division"
>  33 = "West South Central Division"
>  41 = "Mountain Division"
>  42 = "Pacific Division"
>  97 = "State not identified"

A semicolon marks the end of the <integer = quoted-value> pairs.

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473472.htm

I agree it might be nice to have a function that would take this and use a match() function something like:

newfac <- val_strings [ match(REGION_f, convtbl) ]

###----code----#
> conv <- read.table(text='11 = "New England Division"
+  12 = "Middle Atlantic Division"
+  21 = "East North Central Division" 
+  22 = "West North Central Division"
+  31 = "South Atlantic Division"
+  32 = "East South Central Division"
+  33 = "West South Central Division"
+  41 = "Mountain Division"
+  42 = "Pacific Division"
+  97 = "State not identified"', sep="=", stringsAsFactors=FALSE)

> conv[[2]] [ match(c(11,97,42,31), conv[[1]] )] 
[1] " New England Division"    " State not identified"    " Pacific Division"        " South Atlantic Division"
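The leading blanks in that output come from the space after each equals sign. A minimal sketch of stripping them and then turning a vector of codes into a factor might look like the following. The column names "code" and "label" and the vector "region" are made up for illustration, and the table is shortened to three rows so the snippet runs on its own:

```r
# Rebuild a three-row version of the conversion table from above.
# The column names "code" and "label" are illustrative assumptions.
conv <- read.table(text='11 = "New England Division"
 42 = "Pacific Division"
 97 = "State not identified"', sep="=", stringsAsFactors=FALSE,
                   col.names=c("code", "label"))

# Strip the blanks left over around the labels.
conv$label <- gsub("^\\s+|\\s+$", "", conv$label)

# Made-up integer codes standing in for one column of the real data.
region <- c(11, 97, 42, 11)

# Codes become the levels, text strings become the labels.
region_f <- factor(region, levels=conv$code, labels=conv$label)
```

Any code that does not appear in the conversion table would come out as NA in the factor, which is one quick way to spot codes the format block does not cover.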

To pretty it up, you could give names to the columns of the conversion table, and perhaps readLines could be set up to start at "value" and end at the next semicolon.
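A rough sketch of that idea, assuming each "value" statement sits on its own line and no label contains a semicolon or an embedded double quote. The function name parse_value_block is hypothetical, and the character vector below stands in for what readLines() would return from the real command file:

```r
# Stand-in for the result of readLines() on the SAS command file;
# only a few of the REGION_f rows are reproduced here.
sas_lines <- c(
  'proc format;',
  'value REGION_f',
  ' 11 = "New England Division"',
  ' 12 = "Middle Atlantic Division"',
  ' 97 = "State not identified"',
  ';',
  'run;')

# Hypothetical helper: pull the code/label pairs for one format name.
parse_value_block <- function(lines, fmt) {
  # Line holding "value <fmt>"
  start <- grep(paste0("^\\s*value\\s+", fmt, "\\s*$"), lines,
                ignore.case=TRUE)[1]
  stopifnot(!is.na(start))
  # First line after it that contains a semicolon
  semis <- grep(";", lines, fixed=TRUE)
  end   <- semis[semis > start][1]
  body  <- lines[(start + 1):end]
  # Keep only lines that look like: <integer> = "<text>"
  pairs <- grep('^\\s*[0-9]+\\s*=\\s*"', body, value=TRUE)
  data.frame(code  = as.integer(sub("^\\s*([0-9]+).*$", "\\1", pairs)),
             label = sub('^[^"]*"([^"]*)".*$', "\\1", pairs),
             stringsAsFactors = FALSE)
}

region_tbl <- parse_value_block(sas_lines, "REGION_f")
```

Looping that over every format name in the file would give one lookup table per categorical variable.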

> 
> So it would make sense to have a lookup table of these codes linked to the
> variables. I’m not sure if it makes more sense to have that table live in R
> or in the database. For R purposes, I imagine it would make sense to convert
> these integer-valued variables into factors. 
> 
> What I do not understand is how SAS knows where the variables begin and end.
> I managed to break off a little hunk of the beginning of my file and look at
> it in an editor, and it is numbers without any obvious delimiters. Is the
> delimiter a particular numeric string?

Probably fixed field format.
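If so, read.fwf() can split the records once the field widths are known. Those would have to come from somewhere else, typically a SAS INPUT statement or the data documentation. The records, widths and variable names below are invented purely to show the mechanics:

```r
# Two made-up records with no delimiters:
# columns 1-2 = region code, 3-5 = age, 6-10 = income (all hypothetical)
raw <- c("1104512000",
         "9702300850")

# widths gives the number of characters in each field, in order.
dat <- read.fwf(textConnection(raw), widths=c(2, 3, 5),
                col.names=c("region", "age", "income"))
```

For the real file you would pass the file name instead of the textConnection().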

> I thought the SAS command file would
> contain the starting location for each of the fixed-length fields, but I do
> not see anything in the file that could be interpreted that way – just a
> little wraparound code and then a long list of variable names followed by
> triplets of a code, an equals sign, and a text string, terminating with a
> semicolon. 
> 
Exactly 

> I’m sorry if I am being obtuse. When I said before that I had saved the SAS
> files as flat files, what I really meant was that I had an intern do it.
> When I was doing my own analysis, I mainly used TSP, before I switched to R
> about a year ago. I’ve never used SAS. 
> 
> I find your data project very interesting.  Very.   It is not actually
> necessary to wait for BLS to release the older CEX files, if you can lay
> your hands on the CDs. I spoke to the BLS data products office about  2
> years ago, and they have no problem with people republishing purchased data
> in any format they like, including simple duplication.  In fact, they seemed
> to like the idea.  I think the sale of data was forced on them by some kind
> of mandate from above. 

Your legal status will not depend on conversations with staff, but rather on your user-agreement.

> 
> I'll be playing with your code (which is a model of readability, and a
> lesson to me on same, BTW) and keep you posted on my progress. 
> 
> Warmly, Andrew
> 
-- 
David Winsemius, MD
Alameda, CA, USA