[R] variable labels to accompany data.frame

Jacob Wegelin jacobwegelin at fastmail.fm
Wed Oct 28 18:27:53 CET 2009


Often it is useful to keep a "codebook" to document the contents of a dataset. (By "dataset" I mean
a rectangular structure such as a dataframe.)

The codebook has as many rows as the dataset has columns (variables, fields).  The columns (fields)
of the codebook may include:

 	•       variable name

 	•       type (character, factor, integer, etc.)

 	•       variable label (e.g., a variable called "bmi2" might be labeled "BMI hand-input by
 	clinic personnel, must be checked")

 	•       permissible values

 	•       which values indicate missing (and potentially different kinds of missing)

Some statistics software (e.g., SPSS and Stata) provides at least a subset of this kind of
information automatically in a convenient form. For instance, in Stata one can define a "label" for
a variable and it is thenceforth linked to the variable. In output from certain modeling and
graphics functions, Stata by default uses the label rather than the variable name.

Furthermore, in Stata, if "myvariable" is labeled numeric (in R lingo, a factor), and I type

codebook myvariable

then Stata tells me, among other things, the "levels" of myvariable.

Does a tool of this sort exist in R?

The prompt() function is related to this, but prompt(someDataFrame) creates a text file on disk. The
text file is associated with, but not unambiguously linked to, someDataFrame.

The epicalc function codebook() provides a summary of a dataframe similar to that created by
summary() but easier to read. But this is not a way to define and keep track of labels that are
linked to variables.

To link a dataframe to its codebook, one could do the following "by hand": Create a list, say,
"somedata", where somedata$DATA is a dataframe that contains the data, and somedata$VARIABLE is also
a dataframe, but serves as the codebook. For instance, the following function creates a template
that one could subsequently edit, inserting variable labels, and then use as somedata$VARIABLE.

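fnJunk <- function( THESEDATA ) {
#  From a dataframe, make the start of a codebook.
    if(!is.data.frame(THESEDATA)) stop("THESEDATA must be a data.frame")
    data.frame(
       Variable=names(THESEDATA)
       # collapse multi-valued classes (e.g. ordered factors) into one string
       , class=sapply(THESEDATA, function(x) paste(class(x), collapse="/"))
       , type=sapply(THESEDATA, typeof)
       , label=""
       , comment=""
       # keep label and comment as character so they can be edited later
       , stringsAsFactors=FALSE
       )
}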
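
As a minimal sketch of this by-hand approach, assuming a made-up dataframe (the names "toy", "bmi2" and "sex" are purely illustrative), the template can be filled in and paired with the data. The template step is repeated inline here only so that the snippet runs by itself:

```r
# Toy data; the variable names are invented for illustration only.
toy <- data.frame(bmi2 = c(21.5, 30.2, NA),
                  sex  = factor(c("f", "m", "f")))

# Same fields as the template above, built inline so the snippet is standalone.
codebook <- data.frame(
    Variable = names(toy),
    class    = sapply(toy, function(x) paste(class(x), collapse = "/")),
    type     = sapply(toy, typeof),
    label    = "",
    comment  = "",
    stringsAsFactors = FALSE
)

# Fill in a label by hand.
codebook$label[codebook$Variable == "bmi2"] <-
    "BMI hand-input by clinic personnel, must be checked"

# Pair the data with its codebook in one list.
somedata <- list(DATA = toy, VARIABLE = codebook)
str(somedata$VARIABLE)
```

Nothing here ties the two components together beyond their sitting in the same list, which is the limitation described next.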


But the following automatic behavior would be nice:

 	•       We should be able to treat somedata exactly as we treat a dataframe, so that the
 	fact that it possesses a "codebook" is merely an added benefit, not an interference with the
 	usual tasks.

 	•       If we delete a column of somedata$DATA, the associated row of somedata$VARIABLE
 	should be automatically deleted.

 	•       If we add a column to somedata$DATA, the associated row should be inserted in
 	somedata$VARIABLE, and some of the fields, such as variable name and type, automatically
 	populated.

It could get fancier. For instance:

 	•       If we try to add a value to a field in somedata$DATA which is not permitted by the
 	"permissible values" listed for this field in somedata$VARIABLE, we get an error.
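
To make the wish list concrete, here is a rough sketch of how the deletion, insertion and permissible-values behaviors might be approximated with plain functions on the list, without defining a class. The helper names drop_variable() and add_variable(), and the "permitted" argument, are hypothetical and not taken from any existing package; the check on permissible values happens only at the moment a column is added.

```r
# Hypothetical helper: remove a column and its codebook row together.
drop_variable <- function(x, name) {
    stopifnot(is.data.frame(x$DATA), name %in% names(x$DATA))
    x$DATA[[name]] <- NULL
    x$VARIABLE <- x$VARIABLE[x$VARIABLE$Variable != name, , drop = FALSE]
    x
}

# Hypothetical helper: add a column and a matching codebook row.
# If 'permitted' is given, non-missing values outside it raise an error.
add_variable <- function(x, name, values, label = "", permitted = NULL) {
    stopifnot(is.data.frame(x$DATA), length(values) == nrow(x$DATA))
    if (!is.null(permitted)) {
        bad <- setdiff(unique(values[!is.na(values)]), permitted)
        if (length(bad) > 0)
            stop("values not permitted for ", name, ": ",
                 paste(bad, collapse = ", "))
    }
    x$DATA[[name]] <- values
    new_row <- data.frame(
        Variable = name,
        class    = paste(class(values), collapse = "/"),
        type     = typeof(values),
        label    = label,
        comment  = "",
        stringsAsFactors = FALSE
    )
    # Replace any existing codebook row for this variable, then append.
    keep <- x$VARIABLE$Variable != name
    x$VARIABLE <- rbind(x$VARIABLE[keep, , drop = FALSE], new_row)
    rownames(x$VARIABLE) <- NULL
    x
}

# Usage with made-up data (all names below are illustrative).
toy <- data.frame(id = 1:3, bmi2 = c(21.5, 30.2, NA))
somedata <- list(
    DATA = toy,
    VARIABLE = data.frame(
        Variable = names(toy),
        class    = sapply(toy, function(v) paste(class(v), collapse = "/")),
        type     = sapply(toy, typeof),
        label    = "",
        comment  = "",
        stringsAsFactors = FALSE
    )
)
somedata <- add_variable(somedata, "sex", c("f", "m", "f"),
                         label = "Sex", permitted = c("f", "m"))
somedata <- drop_variable(somedata, "id")
names(somedata$DATA)         # "bmi2" "sex"
somedata$VARIABLE$Variable   # "bmi2" "sex"
```

This does not satisfy the first wish: somedata is still a list, so ordinary dataframe operations such as somedata$DATA$bmi2 <- NULL bypass the helpers and leave the codebook out of date. A class with its own methods would presumably be needed for that.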

Has anyone already thought this through, maybe defined a class and associated methods?

Thanks

Jacob A. Wegelin
Assistant Professor
Department of Biostatistics
Virginia Commonwealth University
730 East Broad Street Room 3006
P. O. Box 980032
Richmond VA 23298-0032
U.S.A. 
E-mail: jwegelin at vcu.edu 
URL: http://www.people.vcu.edu/~jwegelin

