[R] Keep value labels with data frame manipulation

Heinz Tuechler tuechler at gmx.at
Tue Jul 18 00:41:07 CEST 2006

At 20:39 14.07.2006 -0500, Frank E Harrell Jr wrote:
>Heinz Tuechler wrote:
>> At 11:02 13.07.2006 -0500, Frank E Harrell Jr wrote:
>>> Heinz Tuechler wrote:
>>>> At 08:11 13.07.2006 -0500, Frank E Harrell Jr wrote:
>>>>> Heinz Tuechler wrote:
>>>>>> At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:
>>>>>>> On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>>>>>>>> Dear R,
>>>>>>>> I import data from SPSS into an R data.frame. On this rawdata I do
>>>>>>>> data processing (selection of observations, normalization, recoding of
>>>>>>>> variables etc.). The result is stored in a new data.frame; however, in
>>>>>>>> this new data.frame the value labels are lost.
>>>>>>>> Example of what I do in code:
>>>>>>>> # read raw data from spss
>>>>>>>> rawdata <- read.spss("./data/T50937.SAV",
>>>>>>>> 	use.value.labels=FALSE,to.data.frame=TRUE)
>>>>>>>> # select the observations that we need
>>>>>>>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 |
>>>>>>>> 			rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 |
>>>>>>>> 			rawdata$D22==24 | rawdata$D22==33,]
>>>>>>>> The result is that rawdata$D22 has value labels and that diarydata$D22
>>>>>>>> is numeric without value labels.
>>>>>>>> Question: How can I prevent this from happening?
>>>>>>>> Thanks in advance!
>>>>>>>> Groeten,
>>>>>>>> Arne
>>>>>>> Two things:
>>>>>>> 1. With respect to your subsetting, your lengthy code can be replaced
>>>>>>> with the following:
>>>>>>>  diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>>>>>>> See ?subset and ?"%in%" for more information.
>>>>>>> 2. With respect to keeping the label related attributes, the
>>>>>>> 'value.labels' attribute and the 'variable.labels' attribute will not by
>>>>>>> default survive the use of "[".data.frame in R (see ?Extract
>>>>>>> and ?"[.data.frame").
>>>>>>> On the other hand, based upon my review of ?read.spss, the SPSS value
>>>>>>> labels should be converted to the factor levels of the respective
>>>>>>> columns when 'use.value.labels = TRUE' and these would survive a
>>>>>>> subsetting.
>>>>>>> If you want to consider a solution to the attribute subsetting issue,
>>>>>>> you might want to review the following post by Gabor Grothendieck in
>>>>>>> May, which provides a possible solution:
>>>>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>>>>>>> and this post by me, for an explanation of what is happening in that
>>>>>>> solution:
>>>>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>>>>>>> HTH,
>>>>>>> Marc Schwartz
>>>>>> Hello Marc and Arne,
>>>>>> I worked on the suggestions of Gabor and Marc and programmed some functions
>>>>>> in this way, but they are very, very preliminary (see below).
>>>>>> In my view there is a lack of convenient possibilities in R to document
>>>>>> empirical data by variable labels, value labels, etc. I would prefer to
>>>>>> have these possibilities in the "standard" configuration.
>>>>>> So I sketched a concept, but in my view it would only be useful if there
>>>>>> was some acceptance by the core developers of R.
>>>>>> The concept would be to define a class. For now I call it
>>>>>> To design it more flexibly than the Hmisc class "labelled" I would define a
>>>>>> related option "source.data.attributes" with default c('value.labels',
>>>>>> 'variable.name', 'label'). This option contains all attributes that should
>>>>>> persist in subsetting/indexing.
>>>>>> I made only some very, very preliminary tests with these functions, mainly
>>>>>> because I am not happy with defining a new class. Instead I would prefer
>>>>>> if this functionality could be integrated in the Hmisc class "labelled",
>>>>>> since this is in my view the best known starting point for data
>>>>>> documentation in R.
>>>>>> I would be happy if there were some discussion about the wishes/needs of
>>>>>> other R users concerning data documentation.
>>>>>> Greetings,
>>>>>> Heinz
>>>>> I feel that separating variable labels and value labels and just using
>>>>> factors for value labels works fine, and I would urge you not to create
>>>>> a new system that will not benefit from the many Hmisc functions that
>>>>> use variable labels and units.  [.data.frame in Hmisc keeps all attributes.
>>>>> Frank
>>>> Frank,
>>>> of course I agree with you about the importance of Hmisc and as I said, I
>>>> do not want to define a new class, but in my view factors are no good
>>>> substitute for value labels.
>>>> As the language definition (version 2.3.1 (2006-06-05) Draft, page 7) states:
>>>> "Factors are currently implemented using an integer array to specify the
>>>> actual levels and a second array of names that are mapped to the integers.
>>>> Rather unfortunately users often make use of the implementation in order to
>>>> make some calculations easier."
>>>> So, in my view, the levels represent the "values" of the factor.
>>>> This is inconvenient if you want to use value labels in different
>>>> languages. Further, I do not see a simple method to label numerical
>>>> variables. I often encounter discrete, but still metric data, e.g.
>>>> scores. Usually it would be nice to use them in their original coding,
>>>> which may include zero or decimal places, and to label them at the same time.
>>>> Personally, at the moment I try to solve this problem by following a
>>>> suggestion of Martin, Dimitris and others to use names instead. I doubt,
>>>> however, that this is a good solution, but at least it makes it possible to
>>>> have the source data numerically coded and in this sense "language free"
>>>> (see first attempts of functions below).
>>>> Heinz
>>> Those are excellent points Heinz.  I addressed that problem partially in 
>>> sas.get - see the sascodes attribute.
>>> Frank
>> Frank, I looked at your function sas.get. You solved the problem with a lot
>> of effort. Don't you think that it would be easier to create just one new
>> class, say "documented", which offers the possibility to represent the
>> original data as it is and to add all the useful descriptions like variable
>> labels, value labels, units, special missing values, and maybe others?
>> If I remember correctly, SAS, SPSS and BMDP have offered these possibilities
>> for many years, and in my view for good reason. I have been thinking about
>> these questions since I started using R about two years ago and I wonder why
>> there seems to be so little interest in them.
>> In my work good documentation of the _unchanged_ data is very important,
>> also because it eases checking the data for errors.
>> Heinz
>>>> ...snip...
>Heinz - the code is quite small and simple, not much effort.  And
>variable labels need to be attributes of individual variables, otherwise
>plotting, latex, and other functions can't get access to them (e.g.,
>in Hmisc xYplot(y ~ x) labels for x and y, and units of measurement, get
>plotted on axes).  I've had all the SAS, SPSS, and BMDP
>capabilities you've mentioned in R/S-Plus (plus units attributes not
>available in those) for years.
>What would make all this even easier is for R to be told a list of
>attribute names that would always carry with subsetting, so that
>special subsetting methods such as [.labelled would not be necessary.
>Frank E Harrell Jr   Professor and Chair           School of Medicine
>                      Department of Biostatistics   Vanderbilt University

Frank - maybe I did not understand you correctly, but it seems that you propose
exactly what I did initially. Yes, I agree with you that it would ease the
situation if there were a list of respected attributes. However, I suspect
that it could be a computational burden to copy these attributes in every
case. So I would suggest defining a class that typically would be assigned
to raw data and defining an option that lists all the attributes which
should be copied.
Do you think this issue could/should be discussed in r-devel?
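
A minimal sketch of the class-plus-option idea might look like the
following. The class name "documented", the option name
"documented.attributes", the helper documented() and the value labels are
placeholders chosen for illustration; none of them exist in base R or Hmisc.

```r
# Hypothetical option: names of the attributes that "[" should carry along.
options(documented.attributes = c("value.labels", "variable.name", "label"))

# Hypothetical helper: attach descriptive attributes and mark the vector
# as "documented" so that the subsetting method below is dispatched.
documented <- function(x, ...) {
  extra <- list(...)
  for (nm in names(extra)) attr(x, nm) <- extra[[nm]]
  oldClass(x) <- unique(c("documented", oldClass(x)))
  x
}

# Subsetting method: let the default "[" do the work, then copy back only
# the attributes named in the option and restore the class.
"[.documented" <- function(x, ...) {
  keep <- intersect(getOption("documented.attributes"), names(attributes(x)))
  saved <- attributes(x)[keep]
  cls <- oldClass(x)
  y <- NextMethod("[")
  for (nm in keep) attr(y, nm) <- saved[[nm]]
  oldClass(y) <- cls
  y
}

# Made-up data; the codes follow the D22 example, the labels are invented.
rawdata <- data.frame(id = 1:5)
rawdata$D22 <- documented(c(2, 3, 17, 2, 33),
                          value.labels = c(labelA = 2, labelB = 3, labelC = 17),
                          label = "example variable")

# Row selection goes through "[.data.frame", which subsets each column
# with "[" and therefore dispatches to "[.documented".
rows <- as.vector(rawdata$D22) %in% c(2, 3)
diarydata <- rawdata[rows, ]
attr(diarydata$D22, "value.labels")
```

Only columns of this class pay the cost of copying, and only for the
attributes listed in the option; for example, setting
options(documented.attributes = "label") would carry along the variable
label alone.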

