[R] Keep value lables with data frame manipulation

Thu Jul 13 15:11:52 CEST 2006

Heinz Tuechler wrote:
> At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:
>> On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>>> Dear R,
>>>
>>> I import data from spss into a R data.frame. On this rawdata I do some
>>> data processing (selection of observations, normalization, recoding of
>>> variables etc..). The result is stored in a new data.frame, however, in
>>> this new data.frame the value labels are lost.
>>>
>>> Example of what I do in code:
>>>
>>> # read raw data from spss
>>> rawdata <- read.spss("./data/T50937.SAV",
>>> 	use.value.labels=FALSE,to.data.frame=TRUE)
>>>
>>> # select the observations that we need
>>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 |
>>> rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 |
>>>  			rawdata$D22==24 | rawdata$D22==33,]
>>>
>>> The result is that rawdata$D22 has value labels and that diarydata$D22
>>> is numeric without value labels.
>>>
>>> Question: How can I prevent this from happening?
>>>
>>> Thanks in advance!
>>> Groeten,
>>> Arne
>> Two things:
>>
>> 1. With respect to your subsetting, your lengthy code can be replaced
>> with the following:
>>
>>  diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>>
>> See ?subset and ?"%in%" for more information.
>>
>>
>> 2. With respect to keeping the label related attributes, the
>> 'value.labels' attribute and the 'variable.labels' attribute will not by
>> default survive the use of "[".data.frame in R (see ?Extract
>> and ?"[.data.frame").
>>
>> On the other hand, based upon my review of ?read.spss, the SPSS value
>> labels should be converted to the factor levels of the respective
>> columns when 'use.value.labels = TRUE' and these would survive a
>> subsetting.
>>
>> If you want to consider a solution to the attribute subsetting issue,
>> you might want to review the following post by Gabor Grothendieck in
>> May, which provides a possible solution:
>>
>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>>
>> and this post by me, for an explanation of what is happening in Gabor's
>> solution:
>>
>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>>
>> HTH,
>>
>> Marc Schwartz
>>
> Hello Mark and Arne,
> 
> I worked on the suggestions of Gabor and Mark and programmed some functions
> in this way, but they are very, very preliminary (see below).
> In my view there is a lack of convenient possibilities in R to document
> empirical data by variable labels, value labels, etc. I would prefer to
> have these possibilities in the "standard" configuration.
> So I sketched a concept, but in my view it would only be useful, if there
> was some acceptance by the core developers of R.
> 
> The concept would be to define a class. For now I call it "source.data".
> To design it more flexible than the Hmisc class "labelled" I would define a
> related option "source.data.attributes" with default c('value.labels',
> 'variable.name', 'label')). This option contains all attributes that should
> persist in subsetting/indexing.
> 
> I made only some very, very preliminary tests with these functions, mainly
> because I am not happy with defining a new class. Instead I would prefer,
> if this functionality could be integrated in the Hmisc class "labelled",
> since this is in my view the best known starting point for data
> documentation in R.
> 
> I would be happy, if there were some discussion about the wishes/needs of
> other Rusers concerning data documentation.
> 
> Greetings,
> 
> Heinz

I feel that separating variable labels and value labels and just using 
factors for value labels works fine, and I would urge you not to create 
a new system that will not benefit from the many Hmisc functions that 
use variable labels and units.  [.data.frame in Hmisc keeps all attributes.

Frank

> 
> 
> ### intention and concept
> #   There should be a convenient possibility to keep source data numerical
> #   coded and at the same time have labelled categories.
> #   Such labelled categorical numerical data should be easily converted
> #   to factors.
> #   Indexing/subsetting should preserve the concerned attributes of this data.
> 
> ### description of (intended!!!) functionality
> #   - a class source.data is defined. It is intended only for atomic objects.
> #   - option source.data.attributes defines which attributes will be copied
> #     in indexing/subsetting objects of class source.data
> #   - option source.data.is.ordered sets defining factors as ordered, when
> #     built from objects of class source.data by the function factsd
> #   - function 'value.labels<-' assigns an attribute value.labels and sets
> #     class source.data
> #   - function value.labels reads the attribute value.labels
> #   - the indexing method '[.source.data' defines indexing for source.data
> #   - the print method print.source.data ignores source.data.attributes in
> #     printing
> #   - the as.data.frame method as.data.frame.source.data enables inclusion
> #     of objects of class source.data in data.frames
> #   - function factsd should in general behave as function factor but should
> #     in case of an object of class source.data by default use the
> value.labels
> #     as levels and the names(value.labels) as the labels of the new built
> #     factor.
> #     If the parameter ordered is NULL it should create ordered factors
> #     according to the option source.data.is.ordered.
> 
> ### set option for source.data.attributes
> options(source.data.attributes=c('value.labels', 'variable.name', 'label'))
> ### set option for converting source.data class in ordered factors
> options(source.data.is.ordered=TRUE)
> 
> ### function to assign value.labels
> 'value.labels<-' <- function (x, value)
>   ## adapted from Hmisc function label 30.6.2006
> {
>   if(!is.atomic(x)) stop('value.labels<- is applicabel to atomic objects
> only')
>   structure(x, value.labels = value, class = c("source.data",
>                attr(x, "class")[attr(x, "class") != "source.data"]))
> }
> 
> ### function to read value.labels
> value.labels <- function (x) { attr(x, 'value.labels') }
> 
> ### definition of indexing method for class=source.data
> ##  source.data.attributes shall be conserved
> "[.source.data" <- function(x, ...)
> {
>   atr <- attributes(x)
>   atr.names <- names(atr)
>   sda <- options()$'source.data.attributes'
>   sda.match <- match(atr.names, sda)
>   sda.match <- sda.match[!is.na(sda.match)]
>   x <- NextMethod("[")
>   ## assign source.data.attributes to result
>   if(length(sda.match))
>     for (i in sda.match) attr(x, sda[i]) <- atr[[sda[i]]]
>   ## assign class source.data to result
>   class(x) <- c('source.data', attr(x, "class")[attr(x, "class")
>                                                 != "source.data"])       
>   x
> }
> 
> ### print method for source.data
> 'print.source.data' <- function (x, ...) 
> {
>   ## adapted from Hmisc print.labelled 31.5.2006
>   x.orig <- x
>   ## look if there are source.data.attributes
>   sda <- options()$'source.data.attributes'
>   sda.match <- match(names(attributes(x)), sda)
>   sda.match <- sda.match[!is.na(sda.match)]
>   ## delete source.data.attributes for printing
>   if(length(sda.match))
>     for (i in sda.match) attr(x, sda[i]) <- NULL
>   ## delete class source.data for printing
>   class(x) <- if (length(class(x)) == 1 && class(x) == "source.data") 
>     NULL
>   else class(x)[class(x) != "source.data"]
>   NextMethod("print")
>   invisible(x.orig)
> }
> 
> ### Define function as.data.frame.source.data (copy from as.data.frame.vector)
> #   many as.data.frame methods are identical to this
> ##  different functions as.data.frame are besides others:
> #   as.data.frame.list, as.data.frame.default, as.data.frame.data.frame,
> #   as.data.frame.character, as.data.frame.AsIs, as.data.frame.array,
> 
> as.data.frame.source.data <- 
>   function (x, row.names = NULL, optional = FALSE)
>   ## copy from as.data.frame.vector 1.6.2006
> {
>   nrows <- length(x)
>   nm <- paste(deparse(substitute(x), width.cutoff = 500), collapse = " ")
>   if (is.null(row.names)) {
>     if (nrows == 0) 
>       row.names <- character(0)
>     else if (length(row.names <- names(x)) == nrows &&
> !any(duplicated(row.names))) {
>     }
>     else if (optional) 
>       row.names <- character(nrows)
>     else row.names <- as.character(1:nrows)
>   }
>   names(x) <- NULL
>   value <- list(x)
>   if (!optional) 
>     names(value) <- nm
>   attr(value, "row.names") <- row.names
>   class(value) <- "data.frame"
>   value
> }
> 
> ### function to create factor from source.data class applying variable.labels
> #   and copying all source.data.attributes
> #   remark: factor(factsd(x)) drops unused factor levels and source.data class
> #           factsd(x)[, drop=TRUE] drops unused factor levels but keeps
> #           source.data class and attributes
> 
> factsd <- function(x = character(),
>                    levels = sort(unique.default(x), na.last = TRUE),
>                    labels = levels, exclude = NA, ordered = NULL)
> {
>   ## check if is of class source.data
>   if ('source.data' %in% class(x))
>     {
>       if(is.null(ordered)) ordered <- options()$source.data.is.ordered
>       fx <- factor(x = x, levels = value.labels(x),
>                    labels = names(value.labels(x)),
>                    exclude = exclude,
>                    ordered = ordered)
>       ## copy source.data.attributes
>       atr <- attributes(x)
>       atr.names <- names(atr)
>       sda <- options()$'source.data.attributes'
>       sda.match <- match(atr.names, sda)
>       sda.match <- sda.match[!is.na(sda.match)]
>       ## assign source.data.attributes to result
>       if(length(sda.match))
>         for (i in sda.match) attr(fx, sda[i]) <- atr[[sda[i]]]
>       ## add class source.data to result
>       class(fx) <- c('source.data', attr(fx, 'class'))       
>     }
>   else {
>     if(is.null(ordered)) ordered <- is.ordered(x)
>     fx <- factor(x = x, levels = levels, labels = labels,
>                  exclude = exclude, ordered = ordered)
>   }
>   fx
> }
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
> 

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University