[R] Keep value labels with data frame manipulation
Heinz Tuechler
tuechler at gmx.at
Thu Jul 13 10:59:39 CEST 2006
At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:
>On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>> Dear R,
>>
>> I import data from spss into a R data.frame. On this rawdata I do some
>> data processing (selection of observations, normalization, recoding of
>> variables etc..). The result is stored in a new data.frame, however, in
>> this new data.frame the value labels are lost.
>>
>> Example of what I do in code:
>>
>> # read raw data from spss
>> rawdata <- read.spss("./data/T50937.SAV",
>> use.value.labels=FALSE,to.data.frame=TRUE)
>>
>> # select the observations that we need
>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 |
>> rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 |
>> rawdata$D22==24 | rawdata$D22==33,]
>>
>> The result is that rawdata$D22 has value labels and that diarydata$D22
>> is numeric without value labels.
>>
>> Question: How can I prevent this from happening?
>>
>> Thanks in advance!
>> Groeten,
>> Arne
>
>Two things:
>
>1. With respect to your subsetting, your lengthy code can be replaced
>with the following:
>
> diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>
>See ?subset and ?"%in%" for more information.
>
>
>2. With respect to keeping the label related attributes, the
>'value.labels' attribute and the 'variable.labels' attribute will not by
>default survive the use of "[".data.frame in R (see ?Extract
>and ?"[.data.frame").
>
>On the other hand, based upon my review of ?read.spss, the SPSS value
>labels should be converted to the factor levels of the respective
>columns when 'use.value.labels = TRUE' and these would survive a
>subsetting.
>
>If you want to consider a solution to the attribute subsetting issue,
>you might want to review the following post by Gabor Grothendieck in
>May, which provides a possible solution:
>
> https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>
>and this post by me, for an explanation of what is happening in Gabor's
>solution:
>
> https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>
>HTH,
>
>Marc Schwartz
>
Hello Marc and Arne,
I worked on the suggestions of Gabor and Marc and programmed some functions
along those lines, but they are very, very preliminary (see below).
In my view R lacks convenient ways to document empirical data with variable
labels, value labels, etc. I would prefer to have these possibilities in the
"standard" configuration.
So I sketched a concept, but in my view it would only be useful if it found
some acceptance among the core developers of R.
The concept would be to define a class. For now I call it "source.data".
To make it more flexible than the Hmisc class "labelled", I would define a
related option "source.data.attributes" with default c('value.labels',
'variable.name', 'label'). This option lists all attributes that should
persist in subsetting/indexing.
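To illustrate why such an option is needed, here is a minimal sketch of the
problem in base R. The data and the labels are invented for this example;
only the column name D22 is taken from Arne's post.

```r
# Invented codes and labels, for illustration only.
d22 <- c(2, 3, 17, 2, 18)
attr(d22, "value.labels") <- c(work = 2, school = 3, shopping = 17, sport = 18)

# Plain "[" on a vector keeps names, dim and dimnames only,
# so the custom attribute is gone after subsetting.
sub <- d22[d22 %in% c(2, 3)]
is.null(attr(sub, "value.labels"))

# The same happens to a column when rows of a data frame are selected.
df <- data.frame(id = 1:5)
df$D22 <- d22
is.null(attr(df$D22, "value.labels"))                 # still attached here
is.null(attr(df[df$D22 == 2, ]$D22, "value.labels"))  # lost after row selection
```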
I have run only some very, very preliminary tests with these functions, mainly
because I am not happy with defining a new class. Instead I would prefer
this functionality to be integrated into the Hmisc class "labelled",
since this is in my view the best known starting point for data
documentation in R.
I would be happy if there were some discussion about the wishes/needs of
other R users concerning data documentation.
Greetings,
Heinz
### intention and concept
# There should be a convenient way to keep source data numerically
# coded and at the same time have labelled categories.
# Such labelled, numerically coded categorical data should be easy to
# convert to factors.
# Indexing/subsetting should preserve the concerned attributes of this data.
### description of (intended!!!) functionality
# - a class source.data is defined. It is intended only for atomic objects.
# - option source.data.attributes defines which attributes will be copied
# in indexing/subsetting objects of class source.data
# - option source.data.is.ordered determines whether factors built from
# objects of class source.data by the function factsd are ordered
# - function 'value.labels<-' assigns an attribute value.labels and sets
# class source.data
# - function value.labels reads the attribute value.labels
# - the indexing method '[.source.data' defines indexing for source.data
# - the print method print.source.data ignores source.data.attributes in
# printing
# - the as.data.frame method as.data.frame.source.data enables inclusion
# of objects of class source.data in data.frames
# - function factsd should in general behave as function factor but should
# in case of an object of class source.data by default use the value.labels
# as levels and the names(value.labels) as the labels of the newly built
# factor.
# If the parameter ordered is NULL it should create ordered factors
# according to the option source.data.is.ordered.
### set option for source.data.attributes
options(source.data.attributes=c('value.labels', 'variable.name', 'label'))
### set option for converting source.data class into ordered factors
options(source.data.is.ordered=TRUE)
### function to assign value.labels
'value.labels<-' <- function (x, value)
## adapted from Hmisc function label 30.6.2006
{
    if (!is.atomic(x))
        stop('value.labels<- is applicable to atomic objects only')
    ## put source.data first in the class vector, without duplicating it
    structure(x, value.labels = value, class = c("source.data",
        attr(x, "class")[attr(x, "class") != "source.data"]))
}
### function to read value.labels
value.labels <- function (x) { attr(x, 'value.labels') }
### definition of indexing method for class=source.data
## source.data.attributes shall be conserved
"[.source.data" <- function(x, ...)
{
    atr <- attributes(x)
    atr.names <- names(atr)
    sda <- options()$'source.data.attributes'
    sda.match <- match(atr.names, sda)
    sda.match <- sda.match[!is.na(sda.match)]
    x <- NextMethod("[")
    ## assign source.data.attributes to result
    if (length(sda.match))
        for (i in sda.match) attr(x, sda[i]) <- atr[[sda[i]]]
    ## assign class source.data to result
    class(x) <- c('source.data',
                  attr(x, "class")[attr(x, "class") != "source.data"])
    x
}
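The pattern used in this method (remember the attributes, let the next
method do the indexing, then put the attributes back) can be tried in
isolation with the following minimal sketch. The class name "kept", the
option name demo.keep.attributes and the data are hypothetical and exist
only for this illustration; they are not part of the functions in this post.

```r
# Hypothetical option listing the attributes that should survive "[".
options(demo.keep.attributes = c("value.labels", "label"))

"[.kept" <- function(x, ...) {
    # attributes of x that are listed in the option
    keep  <- intersect(names(attributes(x)), getOption("demo.keep.attributes"))
    saved <- attributes(x)[keep]
    cls   <- oldClass(x)
    # the next method does the actual indexing and drops the attributes
    y <- NextMethod("[")
    # restore the remembered attributes and the class
    for (nm in keep) attr(y, nm) <- saved[[nm]]
    oldClass(y) <- cls
    y
}

# Invented example data.
x <- structure(c(2, 3, 17, 3),
               value.labels = c(work = 2, school = 3, shopping = 17),
               label = "activity",
               class = "kept")
y <- x[c(1, 2, 4)]
attr(y, "value.labels")   # the labels are still attached
attr(y, "label")
```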
### print method for source.data
'print.source.data' <- function (x, ...)
{
    ## adapted from Hmisc print.labelled 31.5.2006
    x.orig <- x
    ## look if there are source.data.attributes
    sda <- options()$'source.data.attributes'
    sda.match <- match(names(attributes(x)), sda)
    sda.match <- sda.match[!is.na(sda.match)]
    ## delete source.data.attributes for printing
    if (length(sda.match))
        for (i in sda.match) attr(x, sda[i]) <- NULL
    ## delete class source.data for printing
    class(x) <- if (length(class(x)) == 1 && class(x) == "source.data")
        NULL
    else class(x)[class(x) != "source.data"]
    NextMethod("print")
    invisible(x.orig)
}
### Define function as.data.frame.source.data (copy from as.data.frame.vector)
# many as.data.frame methods are identical to this
## other as.data.frame methods include:
# as.data.frame.list, as.data.frame.default, as.data.frame.data.frame,
# as.data.frame.character, as.data.frame.AsIs, as.data.frame.array
as.data.frame.source.data <-
function (x, row.names = NULL, optional = FALSE, ...)
## copy from as.data.frame.vector 1.6.2006
## '...' is accepted (and ignored) so that extra arguments passed on by
## the generic do not cause an error
{
    nrows <- length(x)
    nm <- paste(deparse(substitute(x), width.cutoff = 500), collapse = " ")
    if (is.null(row.names)) {
        if (nrows == 0)
            row.names <- character(0)
        else if (length(row.names <- names(x)) == nrows &&
                 !any(duplicated(row.names))) {
        }
        else if (optional)
            row.names <- character(nrows)
        else row.names <- as.character(1:nrows)
    }
    names(x) <- NULL
    value <- list(x)
    if (!optional)
        names(value) <- nm
    attr(value, "row.names") <- row.names
    class(value) <- "data.frame"
    value
}
### function to create factor from source.data class applying value.labels
# and copying all source.data.attributes
# remark: factor(factsd(x)) drops unused factor levels and source.data class
# factsd(x)[, drop=TRUE] drops unused factor levels but keeps
# source.data class and attributes
factsd <- function(x = character(),
                   levels = sort(unique.default(x), na.last = TRUE),
                   labels = levels, exclude = NA, ordered = NULL)
{
    ## check if is of class source.data
    if ('source.data' %in% class(x))
    {
        if (is.null(ordered)) ordered <- options()$source.data.is.ordered
        fx <- factor(x = x, levels = value.labels(x),
                     labels = names(value.labels(x)),
                     exclude = exclude,
                     ordered = ordered)
        ## copy source.data.attributes
        atr <- attributes(x)
        atr.names <- names(atr)
        sda <- options()$'source.data.attributes'
        sda.match <- match(atr.names, sda)
        sda.match <- sda.match[!is.na(sda.match)]
        ## assign source.data.attributes to result
        if (length(sda.match))
            for (i in sda.match) attr(fx, sda[i]) <- atr[[sda[i]]]
        ## add class source.data to result
        class(fx) <- c('source.data', attr(fx, 'class'))
    }
    else {
        if (is.null(ordered)) ordered <- is.ordered(x)
        fx <- factor(x = x, levels = levels, labels = labels,
                     exclude = exclude, ordered = ordered)
    }
    fx
}
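The core of factsd, turning numeric codes into a factor by way of a named
vector of value labels, can be tried with base R alone. The following is
only a sketch with invented codes and labels; it does not use the functions
above.

```r
# Invented codes and value labels, for illustration only.
codes <- c(2, 3, 17, 3, 2)
vl <- c(work = 2, school = 3, shopping = 17)

# The values of vl become the levels, the names of vl become the labels.
f <- factor(codes, levels = vl, labels = names(vl), ordered = TRUE)
levels(f)         # "work" "school" "shopping"
as.character(f)   # "work" "school" "shopping" "school" "work"

# A code without a value label becomes NA in the factor.
g <- factor(c(2, 99), levels = vl, labels = names(vl))
is.na(g)          # FALSE TRUE
```

The last lines show one consequence of this approach: every code that should
survive the conversion needs an entry in the value labels.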