[R] Getting codebook data into R

Fri Feb 10 08:19:31 CET 2012

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of barny
> Sent: Thursday, February 09, 2012 12:52 PM
> To: r-help at r-project.org
> Subject: [R] Getting codebook data into R
> 
> I've been trying to get some data from the National Survey of Family
> Growth into R. However, the data are in a .dat file, and the fields I need
> are not separated by spaces or commas. Instead, you have to look in the
> codebook to find which columns of each line hold the data you need.
> The data I want are the following, where 1,12,int means that the field I'm
> interested in starts in column 1, finishes in column 12, and is an integer.
> 
>             ('caseid', 1, 12, int),
>             ('nbrnaliv', 22, 22, int),
>             ('babysex', 56, 56, int),
>             ('birthwgt_lb', 57, 58, int),
>             ('birthwgt_oz', 59, 60, int),
>             ('prglength', 275, 276, int),
>             ('outcome', 277, 277, int),
>             ('birthord', 278, 279, int),
>             ('agepreg', 284, 287, int),
>             ('finalwgt', 423, 440, float)
> 
> How can I do this using R? I've written a python programme which basically
> does it but it'd be nicer if I could skip the Python bit and just do it
> using R. Cheers for any help.
> 

I didn't have time at work to look at this, but here is one possible approach.  I did not look at how the codebook file is actually structured; I just took what you presented above and cleaned it up a bit, like this:

'caseid',1,12,int
'nbrnaliv',22,22,int
'babysex',56,56,int
'birthwgt_lb',57,58,int
'birthwgt_oz',59,60,int
'prglength',275,276,int
'outcome',277,277,int
'birthord',278,279,int
'agepreg',284,287,int
'finalwgt',423,440,float

I then copied that to the clipboard and read it in using the following syntax:

## read in data layout ('clipboard' as a file name works on Windows;
## other platforms may need a different source)
codebook <- read.table('clipboard', sep=',', as.is=TRUE)
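
If the clipboard is not convenient, the same layout could probably be read from a character string instead.  This is only a sketch: the name layout_text is made up for illustration, and it assumes a version of R in which read.table has the text argument.

```r
## alternative sketch: read the layout from a string rather than the clipboard
## (layout_text is a hypothetical name used only for this example)
layout_text <- "
'caseid',1,12,int
'nbrnaliv',22,22,int
'babysex',56,56,int
'birthwgt_lb',57,58,int
'birthwgt_oz',59,60,int
'prglength',275,276,int
'outcome',277,277,int
'birthord',278,279,int
'agepreg',284,287,int
'finalwgt',423,440,float
"
## the single quotes around the names are stripped by read.table's default quote setting
codebook <- read.table(text = layout_text, sep = ',', as.is = TRUE)
```

Either way, codebook ends up as a data frame with one row per field and four unnamed columns (V1 to V4).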

I will leave it to you to decide how to get the codebook into your R session.  Having done this, one can compute the field widths and the number of columns to skip between fields, and then build a command to read in the data.  Something like this should get you started:

## get number of rows in code book
nr <- nrow(codebook)
## provide names for codebook layout data frame
names(codebook) <- c('variable','begin','end','type')

## compute number of columns to read (and skip) for each variable
## store in the vector read.col
# compute field widths
codebook$width <- codebook$end - codebook$begin + 1

# compute columns to skip between end of one field and 
# beginning of next field
codebook$skip <- c(codebook$begin[-1]-codebook$end[-nr]-1,0)

## build the widths vector required by read.fwf: positive entries are
## columns to read, negative entries are columns to skip
read.col <- numeric()
for(i in seq_len(nr)){
  read.col <- c(read.col,codebook$width[i])
  if(codebook$skip[i] > 0) read.col <- c(read.col,-codebook$skip[i])
}
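
As a quick sanity check on the layout arithmetic, the same vector can also be built without a loop.  This is just an alternative sketch; the begin and end columns are typed in directly so that it runs on its own, and widths.alt is a name made up for the example.  The absolute values should add up to the last column of the last field, 440 here.

```r
## begin/end columns taken from the layout listed above
begin <- c(1, 22, 56, 57, 59, 275, 277, 278, 284, 423)
end   <- c(12, 22, 56, 58, 60, 276, 277, 279, 287, 440)
width <- end - begin + 1
## gap after each field; the last field has nothing after it
skip  <- c(begin[-1] - end[-length(end)] - 1, 0)
## interleave widths with negated gaps, then drop the zero-length gaps
widths.alt <- as.vector(rbind(width, -skip))
widths.alt <- widths.alt[widths.alt != 0]
## total number of columns covered by reads and skips
sum(abs(widths.alt))
```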

## recode type values to R classes
codebook$Rtype <- ifelse(codebook$type %in% c('int','float'),'numeric', 'character')

## now read in the data
fwfdata <- read.fwf('c:/tmp/testpreg.txt', col.names=codebook$variable, 
                     widths=read.col, colClasses=codebook$Rtype)

The code is clearly not bulletproof, and there is no error checking.  However, it does the job, provided the information you gave is accurate.  If you wanted, you could wrap it all up in a function and pass the data filename and codebook name as parameters.
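
A minimal sketch of what such a wrapper might look like follows.  The function name read_codebook_fwf and its arguments are hypothetical, and it takes the codebook as a data frame (columns variable, begin, end, type) rather than a file name.  It adds a few basic checks and, unlike the code above, also skips any columns before the first field.  The small test file is invented purely to show the call.

```r
## hypothetical helper; the name and arguments are illustrative only
read_codebook_fwf <- function(datafile, codebook) {
  ## codebook: data frame with columns variable, begin, end, type
  stopifnot(all(c('variable', 'begin', 'end', 'type') %in% names(codebook)))
  ## fields are read left to right, so sort by starting column
  codebook <- codebook[order(codebook$begin), ]
  nr <- nrow(codebook)
  stopifnot(nr > 0, all(codebook$end >= codebook$begin))
  width <- codebook$end - codebook$begin + 1
  ## gap before each field (covers a first field that does not start in column 1)
  gap <- codebook$begin - c(0, codebook$end[-nr]) - 1
  stopifnot(all(gap >= 0))   # overlapping fields are not handled
  ## interleave negated gaps with widths, then drop zero-length gaps
  widths <- as.vector(rbind(-gap, width))
  widths <- widths[widths != 0]
  ## recode type values to R classes
  Rtype <- ifelse(codebook$type %in% c('int', 'float'), 'numeric', 'character')
  read.fwf(datafile, widths = widths, col.names = codebook$variable,
           colClasses = Rtype)
}

## made-up three-field layout and two-line data file, just to exercise the function
layout <- data.frame(variable = c('id', 'sex', 'wgt'),
                     begin = c(1, 6, 8), end = c(3, 6, 11),
                     type = c('int', 'int', 'float'),
                     stringsAsFactors = FALSE)
tmp <- tempfile()
writeLines(c('001xx1x12.5', '002xx2x7.25'), tmp)
res <- read_codebook_fwf(tmp, layout)
```

Sorting by begin means the columns of the result follow the file order, which may differ from the order of rows in the codebook.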

Hope this is helpful,

Dan

Daniel Nordlund
Bothell, WA USA