[R] Keep value lables with data frame manipulation

Thu Jul 13 10:59:39 CEST 2006

At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:
>On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>> Dear R,
>> 
>> I import data from spss into a R data.frame. On this rawdata I do some
>> data processing (selection of observations, normalization, recoding of
>> variables etc..). The result is stored in a new data.frame, however, in
>> this new data.frame the value labels are lost.
>> 
>> Example of what I do in code:
>> 
>> # read raw data from spss
>> rawdata <- read.spss("./data/T50937.SAV",
>> 	use.value.labels=FALSE,to.data.frame=TRUE)
>> 
>> # select the observations that we need
>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 |
>> rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 |
>>  			rawdata$D22==24 | rawdata$D22==33,]
>> 
>> The result is that rawdata$D22 has value labels and that diarydata$D22
>> is numeric without value labels.
>> 
>> Question: How can I prevent this from happening?
>> 
>> Thanks in advance!
>> Groeten,
>> Arne
>
>Two things:
>
>1. With respect to your subsetting, your lengthy code can be replaced
>with the following:
>
>  diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>
>See ?subset and ?"%in%" for more information.
>
>
>2. With respect to keeping the label related attributes, the
>'value.labels' attribute and the 'variable.labels' attribute will not by
>default survive the use of "[".data.frame in R (see ?Extract
>and ?"[.data.frame").
>
>On the other hand, based upon my review of ?read.spss, the SPSS value
>labels should be converted to the factor levels of the respective
>columns when 'use.value.labels = TRUE' and these would survive a
>subsetting.
>
>If you want to consider a solution to the attribute subsetting issue,
>you might want to review the following post by Gabor Grothendieck in
>May, which provides a possible solution:
>
>  https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>
>and this post by me, for an explanation of what is happening in Gabor's
>solution:
>
>  https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>
>HTH,
>
>Marc Schwartz
>
Hello Mark and Arne,

I worked on the suggestions of Gabor and Mark and programmed some functions
in this way, but they are very, very preliminary (see below).
In my view there is a lack of convenient possibilities in R to document
empirical data by variable labels, value labels, etc. I would prefer to
have these possibilities in the "standard" configuration.
So I sketched a concept, but in my view it would only be useful, if there
was some acceptance by the core developers of R.

The concept would be to define a class. For now I call it "source.data".
To design it more flexible than the Hmisc class "labelled" I would define a
related option "source.data.attributes" with default c('value.labels',
'variable.name', 'label')). This option contains all attributes that should
persist in subsetting/indexing.

I made only some very, very preliminary tests with these functions, mainly
because I am not happy with defining a new class. Instead I would prefer,
if this functionality could be integrated in the Hmisc class "labelled",
since this is in my view the best known starting point for data
documentation in R.

I would be happy, if there were some discussion about the wishes/needs of
other Rusers concerning data documentation.

Greetings,

Heinz

### intention and concept
#   There should be a convenient possibility to keep source data numerical
#   coded and at the same time have labelled categories.
#   Such labelled categorical numerical data should be easily converted
#   to factors.
#   Indexing/subsetting should preserve the concerned attributes of this data.

### description of (intended!!!) functionality
#   - a class source.data is defined. It is intended only for atomic objects.
#   - option source.data.attributes defines which attributes will be copied
#     in indexing/subsetting objects of class source.data
#   - option source.data.is.ordered sets defining factors as ordered, when
#     built from objects of class source.data by the function factsd
#   - function 'value.labels<-' assigns an attribute value.labels and sets
#     class source.data
#   - function value.labels reads the attribute value.labels
#   - the indexing method '[.source.data' defines indexing for source.data
#   - the print method print.source.data ignores source.data.attributes in
#     printing
#   - the as.data.frame method as.data.frame.source.data enables inclusion
#     of objects of class source.data in data.frames
#   - function factsd should in general behave as function factor but should
#     in case of an object of class source.data by default use the
value.labels
#     as levels and the names(value.labels) as the labels of the new built
#     factor.
#     If the parameter ordered is NULL it should create ordered factors
#     according to the option source.data.is.ordered.

### set option for source.data.attributes
options(source.data.attributes=c('value.labels', 'variable.name', 'label'))
### set option for converting source.data class in ordered factors
options(source.data.is.ordered=TRUE)

### function to assign value.labels
'value.labels<-' <- function (x, value)
  ## adapted from Hmisc function label 30.6.2006
{
  if(!is.atomic(x)) stop('value.labels<- is applicabel to atomic objects
only')
  structure(x, value.labels = value, class = c("source.data",
               attr(x, "class")[attr(x, "class") != "source.data"]))
}

### function to read value.labels
value.labels <- function (x) { attr(x, 'value.labels') }

### definition of indexing method for class=source.data
##  source.data.attributes shall be conserved
"[.source.data" <- function(x, ...)
{
  atr <- attributes(x)
  atr.names <- names(atr)
  sda <- options()$'source.data.attributes'
  sda.match <- match(atr.names, sda)
  sda.match <- sda.match[!is.na(sda.match)]
  x <- NextMethod("[")
  ## assign source.data.attributes to result
  if(length(sda.match))
    for (i in sda.match) attr(x, sda[i]) <- atr[[sda[i]]]
  ## assign class source.data to result
  class(x) <- c('source.data', attr(x, "class")[attr(x, "class")
                                                != "source.data"])       
  x
}

### print method for source.data
'print.source.data' <- function (x, ...) 
{
  ## adapted from Hmisc print.labelled 31.5.2006
  x.orig <- x
  ## look if there are source.data.attributes
  sda <- options()$'source.data.attributes'
  sda.match <- match(names(attributes(x)), sda)
  sda.match <- sda.match[!is.na(sda.match)]
  ## delete source.data.attributes for printing
  if(length(sda.match))
    for (i in sda.match) attr(x, sda[i]) <- NULL
  ## delete class source.data for printing
  class(x) <- if (length(class(x)) == 1 && class(x) == "source.data") 
    NULL
  else class(x)[class(x) != "source.data"]
  NextMethod("print")
  invisible(x.orig)
}

### Define function as.data.frame.source.data (copy from as.data.frame.vector)
#   many as.data.frame methods are identical to this
##  different functions as.data.frame are besides others:
#   as.data.frame.list, as.data.frame.default, as.data.frame.data.frame,
#   as.data.frame.character, as.data.frame.AsIs, as.data.frame.array,

as.data.frame.source.data <- 
  function (x, row.names = NULL, optional = FALSE)
  ## copy from as.data.frame.vector 1.6.2006
{
  nrows <- length(x)
  nm <- paste(deparse(substitute(x), width.cutoff = 500), collapse = " ")
  if (is.null(row.names)) {
    if (nrows == 0) 
      row.names <- character(0)
    else if (length(row.names <- names(x)) == nrows &&
!any(duplicated(row.names))) {
    }
    else if (optional) 
      row.names <- character(nrows)
    else row.names <- as.character(1:nrows)
  }
  names(x) <- NULL
  value <- list(x)
  if (!optional) 
    names(value) <- nm
  attr(value, "row.names") <- row.names
  class(value) <- "data.frame"
  value
}

### function to create factor from source.data class applying variable.labels
#   and copying all source.data.attributes
#   remark: factor(factsd(x)) drops unused factor levels and source.data class
#           factsd(x)[, drop=TRUE] drops unused factor levels but keeps
#           source.data class and attributes

factsd <- function(x = character(),
                   levels = sort(unique.default(x), na.last = TRUE),
                   labels = levels, exclude = NA, ordered = NULL)
{
  ## check if is of class source.data
  if ('source.data' %in% class(x))
    {
      if(is.null(ordered)) ordered <- options()$source.data.is.ordered
      fx <- factor(x = x, levels = value.labels(x),
                   labels = names(value.labels(x)),
                   exclude = exclude,
                   ordered = ordered)
      ## copy source.data.attributes
      atr <- attributes(x)
      atr.names <- names(atr)
      sda <- options()$'source.data.attributes'
      sda.match <- match(atr.names, sda)
      sda.match <- sda.match[!is.na(sda.match)]
      ## assign source.data.attributes to result
      if(length(sda.match))
        for (i in sda.match) attr(fx, sda[i]) <- atr[[sda[i]]]
      ## add class source.data to result
      class(fx) <- c('source.data', attr(fx, 'class'))       
    }
  else {
    if(is.null(ordered)) ordered <- is.ordered(x)
    fx <- factor(x = x, levels = levels, labels = labels,
                 exclude = exclude, ordered = ordered)
  }
  fx
}