[R] Keep value labels with data frame manipulation

Sat Jul 15 03:39:42 CEST 2006

Heinz Tuechler wrote:
> At 11:02 13.07.2006 -0500, Frank E Harrell Jr wrote:
>> Heinz Tuechler wrote:
>>> At 08:11 13.07.2006 -0500, Frank E Harrell Jr wrote:
>>>> Heinz Tuechler wrote:
>>>>> At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:
>>>>>> On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>>>>>>> Dear R,
>>>>>>>
>>>>>>> I import data from spss into a R data.frame. On this rawdata I do some
>>>>>>> data processing (selection of observations, normalization, recoding of
>>>>>>> variables etc..). The result is stored in a new data.frame, however, in
>>>>>>> this new data.frame the value labels are lost.
>>>>>>>
>>>>>>> Example of what I do in code:
>>>>>>>
>>>>>>> # read raw data from spss
>>>>>>> rawdata <- read.spss("./data/T50937.SAV",
>>>>>>> 	use.value.labels=FALSE,to.data.frame=TRUE)
>>>>>>>
>>>>>>> # select the observations that we need
>>>>>>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 |
>>>>>>> 			rawdata$D22==17 | rawdata$D22==18 | rawdata$D22==20 |
>>>>>>> 			rawdata$D22==22 | rawdata$D22==24 | rawdata$D22==33,]
>>>>>>>
>>>>>>> The result is that rawdata$D22 has value labels and that diarydata$D22
>>>>>>> is numeric without value labels.
>>>>>>>
>>>>>>> Question: How can I prevent this from happening?
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>> Groeten,
>>>>>>> Arne
>>>>>> Two things:
>>>>>>
>>>>>> 1. With respect to your subsetting, your lengthy code can be replaced
>>>>>> with the following:
>>>>>>
>>>>>>  diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>>>>>>
>>>>>> See ?subset and ?"%in%" for more information.
>>>>>>
>>>>>>
>>>>>> 2. With respect to keeping the label related attributes, the
>>>>>> 'value.labels' attribute and the 'variable.labels' attribute will not by
>>>>>> default survive the use of "[".data.frame in R (see ?Extract
>>>>>> and ?"[.data.frame").
>>>>>>
>>>>>> On the other hand, based upon my review of ?read.spss, the SPSS value
>>>>>> labels should be converted to the factor levels of the respective
>>>>>> columns when 'use.value.labels = TRUE' and these would survive a
>>>>>> subsetting.
>>>>>>
>>>>>> If you want to consider a solution to the attribute subsetting issue,
>>>>>> you might want to review the following post by Gabor Grothendieck in
>>>>>> May, which provides a possible solution:
>>>>>>
>>>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>>>>>>
>>>>>> and this post by me, for an explanation of what is happening in Gabor's
>>>>>> solution:
>>>>>>
>>>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>>>>>>
>>>>>> HTH,
>>>>>>
>>>>>> Marc Schwartz
>>>>>>
>>>>> Hello Marc and Arne,
>>>>>
>>>>> I worked on the suggestions of Gabor and Marc and programmed some functions
>>>>> in this way, but they are very, very preliminary (see below).
>>>>> In my view there is a lack of convenient possibilities in R to document
>>>>> empirical data by variable labels, value labels, etc. I would prefer to
>>>>> have these possibilities in the "standard" configuration.
>>>>> So I sketched a concept, but in my view it would only be useful, if there
>>>>> was some acceptance by the core developers of R.
>>>>>
>>>>> The concept would be to define a class. For now I call it "source.data".
>>>>> To make it more flexible than the Hmisc class "labelled", I would define a
>>>>> related option "source.data.attributes" with default c('value.labels',
>>>>> 'variable.name', 'label'). This option contains all attributes that should
>>>>> persist in subsetting/indexing.
>>>>>
>>>>> I made only some very, very preliminary tests with these functions, mainly
>>>>> because I am not happy with defining a new class. Instead I would prefer
>>>>> that this functionality be integrated into the Hmisc class "labelled",
>>>>> since this is in my view the best known starting point for data
>>>>> documentation in R.
>>>>>
>>>>> I would be happy if there were some discussion about the wishes/needs of
>>>>> other R users concerning data documentation.
>>>>>
>>>>> Greetings,
>>>>>
>>>>> Heinz
>>>> I feel that separating variable labels and value labels and just using 
>>>> factors for value labels works fine, and I would urge you not to create 
>>>> a new system that will not benefit from the many Hmisc functions that 
>>>> use variable labels and units.  [.data.frame in Hmisc keeps all attributes.
>>>> Frank
>>>>
>>> Frank,
>>>
>>> of course I agree with you about the importance of Hmisc and, as I said, I
>>> do not want to define a new class, but in my view factors are not a good
>>> substitute for value labels.
>>> As the language definition (version 2.3.1 (2006-06-05) Draft, page 7) says:
>>> "Factors are currently implemented using an integer array to specify the
>>> actual levels and a second array of names that are mapped to the integers.
>>> Rather unfortunately users often make use of the implementation in order to
>>> make some calculations easier." 
>>> So, in my view, the levels represent the "values" of the factor.
>>> This is inconvenient if you want to use value labels in different
>>> languages. Further, I do not see a simple method to label numerical
>>> variables. I often encounter discrete, but still metric data, e.g. risk
>>> scores. Usually it would be nice to use them in their original coding,
>>> which may include zero or decimal places, and to label them at the same time.
>>> Personally at the moment I try to solve this problem by following a
>>> suggestion of Martin, Dimitis and others to use names instead. I doubt,
>>> however, that this is a good solution, but at least it makes it possible to
>>> have the source data numerically coded and in this sense "language free"
>>> (see first attempts of functions below).
>>>
>>> Heinz
>>>
>> Those are excellent points Heinz.  I addressed that problem partially in 
>> sas.get - see the sascodes attribute.
>>
>> Frank
>>
> 
> Frank, I looked at your function sas.get. You solved the problem with a lot
> of effort. Don't you think that it would be easier to create just one new
> class, say "documented", which offers the possibility to represent the
> original data as it is and to add all the useful descriptions like variable
> labels, value labels, units, special missing values, and maybe others?
> If I remember correctly, SAS, SPSS and BMDP have offered these possibilities
> for many years, and in my view for good reason. I have been thinking about
> these questions since I started using R about two years ago and I wonder why
> there seems to be so little interest in them.
> In my work good documentation of the _unchanged_ data is very important,
> also because it eases checking the data for errors.
> 
> Heinz
> 
> 
>>> ...snip...
> 
> 
> 

Heinz - the code is quite small and simple, not much effort.  And
variable labels need to be attributes of individual variables, otherwise
plotting, latex, and other functions can't get access to them (e.g.,
in Hmisc xYplot(y ~ x) labels for x and y, and units of measurement, get
plotted on axes).  I've had all the SAS, SPSS, and BMDP
capabilities you've mentioned in R/S-Plus (plus units attributes not
available in those) for years.
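
For readers following the thread, here is a minimal base R sketch of the
behaviour under discussion. It assumes nothing beyond base R; the
attribute name "label" mirrors the Hmisc convention, but no Hmisc
function is called, and the data are made up for illustration.

```r
# Illustrative data only; "label" is just an attribute name chosen
# to match the Hmisc convention.
d <- data.frame(score = c(0, 1.5, 3), grp = factor(c("a", "b", "a")))
attr(d$score, "label") <- "Risk score"

# Row subsetting with base "[.data.frame"
sub <- d[d$score > 0, ]

attr(d$score, "label")    # still present on the original column
attr(sub$score, "label")  # NULL: base row subsetting drops it
levels(sub$grp)           # factor levels are kept after subsetting
```

This is why a per-variable attribute needs a subsetting method of its
own if it is to persist, whereas factor levels persist without one.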

What would make all this even easier is for R to be told a list of
attribute names that would always carry with subsetting, so that
special subsetting methods such as [.labelled would not be necessary.
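
Until something like that exists, the idea can be approximated at user
level. The following is only a rough sketch, not Hmisc code and not a
proposal for R itself: the class name "keepattr", the option name
"keepattr.names" and the helper as_keepattr() are all hypothetical
names invented for this example.

```r
# Hypothetical names throughout: class "keepattr", option "keepattr.names".
options(keepattr.names = c("label", "units", "value.labels"))

# Mark a vector so that the attributes listed in the option survive "[".
as_keepattr <- function(x) {
  oldClass(x) <- unique(c("keepattr", oldClass(x)))
  x
}

"[.keepattr" <- function(x, ...) {
  keep  <- intersect(getOption("keepattr.names"), names(attributes(x)))
  saved <- attributes(x)[keep]
  out   <- NextMethod("[")        # default subsetting drops the attributes
  for (nm in keep) attr(out, nm) <- saved[[nm]]
  oldClass(out) <- oldClass(x)    # keep the class so later subsets work too
  out
}

score <- as_keepattr(c(0, 1.5, 3, 1.5))
attr(score, "label")        <- "Risk score"
attr(score, "value.labels") <- c(low = 0, medium = 1.5, high = 3)
attr(score, "note")         <- "not listed in the option, so not kept"

d <- data.frame(id = 1:4)
d$score <- score
sub <- d[d$score > 0, ]
attributes(sub$score)       # label, value.labels and class remain
```

The sketch covers "[" only; other operations such as c() or arithmetic
would need their own handling, which is part of why built-in support
would be more convenient.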

Frank

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University