[R] Keep value labels with data frame manipulation

Tue Jul 18 04:47:09 CEST 2006

Heinz Tuechler wrote:
> At 20:39 14.07.2006 -0500, Frank E Harrell Jr wrote:
>> Heinz Tuechler wrote:
>>> At 11:02 13.07.2006 -0500, Frank E Harrell Jr wrote:
>>>> Heinz Tuechler wrote:
>>>>> At 08:11 13.07.2006 -0500, Frank E Harrell Jr wrote:
>>>>>> Heinz Tuechler wrote:
>>>>>>> At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:
>>>>>>>> On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>>>>>>>>> Dear R,
>>>>>>>>>
>>>>>>>>> I import data from SPSS into an R data.frame. On this raw data I do some
>>>>>>>>> data processing (selection of observations, normalization, recoding of
>>>>>>>>> variables, etc.). The result is stored in a new data.frame; however, in
>>>>>>>>> this new data.frame the value labels are lost.
>>>>>>>>>
>>>>>>>>> Example of what I do in code:
>>>>>>>>>
>>>>>>>>> # read raw data from spss
>>>>>>>>> rawdata <- read.spss("./data/T50937.SAV",
>>>>>>>>> 	use.value.labels=FALSE,to.data.frame=TRUE)
>>>>>>>>>
>>>>>>>>> # select the observations that we need
>>>>>>>>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 |
>>>>>>>>> 			rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 |
>>>>>>>>> 			rawdata$D22==24 | rawdata$D22==33,]
>>>>>>>>>
>>>>>>>>> The result is that rawdata$D22 has value labels and that diarydata$D22
>>>>>>>>> is numeric without value labels.
>>>>>>>>>
>>>>>>>>> Question: How can I prevent this from happening?
>>>>>>>>>
>>>>>>>>> Thanks in advance!
>>>>>>>>> Groeten,
>>>>>>>>> Arne
>>>>>>>> Two things:
>>>>>>>>
>>>>>>>> 1. With respect to your subsetting, your lengthy code can be replaced
>>>>>>>> with the following:
>>>>>>>>
>>>>>>>>  diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>>>>>>>>
>>>>>>>> See ?subset and ?"%in%" for more information.
>>>>>>>>
>>>>>>>>
>>>>>>>> 2. With respect to keeping the label-related attributes, the
>>>>>>>> 'value.labels' attribute and the 'variable.labels' attribute will not by
>>>>>>>> default survive the use of "[.data.frame" in R (see ?Extract
>>>>>>>> and ?"[.data.frame").
>>>>>>>>
>>>>>>>> On the other hand, based upon my review of ?read.spss, the SPSS value
>>>>>>>> labels should be converted to the factor levels of the respective
>>>>>>>> columns when 'use.value.labels = TRUE' and these would survive a
>>>>>>>> subsetting.
>>>>>>>>
>>>>>>>> If you want to consider a solution to the attribute subsetting issue,
>>>>>>>> you might want to review the following post by Gabor Grothendieck in
>>>>>>>> May, which provides a possible solution:
>>>>>>>>
>>>>>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>>>>>>>>
>>>>>>>> and this post by me, for an explanation of what is happening in Gabor's
>>>>>>>> solution:
>>>>>>>>
>>>>>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>>>>>>>>
>>>>>>>> HTH,
>>>>>>>>
>>>>>>>> Marc Schwartz
>>>>>>>>
>>>>>>> Hello Marc and Arne,
>>>>>>>
>>>>>>> I worked on the suggestions of Gabor and Marc and programmed some functions
>>>>>>> in this way, but they are very, very preliminary (see below).
>>>>>>> In my view R lacks convenient ways to document
>>>>>>> empirical data by variable labels, value labels, etc. I would prefer to
>>>>>>> have these possibilities in the "standard" configuration.
>>>>>>> So I sketched a concept, but in my view it would only be useful if there
>>>>>>> was some acceptance by the core developers of R.
>>>>>>>
>>>>>>> The concept would be to define a class. For now I call it "source.data".
>>>>>>> To make it more flexible than the Hmisc class "labelled" I would define a
>>>>>>> related option "source.data.attributes" with default c('value.labels',
>>>>>>> 'variable.name', 'label'). This option contains all attributes that should
>>>>>>> persist in subsetting/indexing.
>>>>>>>
>>>>>>> I made only some very, very preliminary tests with these functions, mainly
>>>>>>> because I am not happy with defining a new class. Instead I would prefer
>>>>>>> this functionality to be integrated in the Hmisc class "labelled",
>>>>>>> since this is in my view the best known starting point for data
>>>>>>> documentation in R.
>>>>>>>
>>>>>>> I would be happy if there were some discussion about the wishes/needs of
>>>>>>> other R users concerning data documentation.
>>>>>>>
>>>>>>> Greetings,
>>>>>>>
>>>>>>> Heinz
>>>>>> I feel that separating variable labels and value labels and just using
>>>>>> factors for value labels works fine, and I would urge you not to create
>>>>>> a new system that will not benefit from the many Hmisc functions that
>>>>>> use variable labels and units.  [.data.frame in Hmisc keeps all attributes.
>>>>>> Frank
>>>>>>
>>>>> Frank,
>>>>>
>>>>> of course I agree with you about the importance of Hmisc and, as I said, I
>>>>> do not want to define a new class, but in my view factors are no good
>>>>> substitute for value labels.
>>>>> As the language definition (version 2.3.1 (2006-06-05) Draft, page 7) says:
>>>>> "Factors are currently implemented using an integer array to specify the
>>>>> actual levels and a second array of names that are mapped to the integers.
>>>>> Rather unfortunately users often make use of the implementation in order to
>>>>> make some calculations easier."
>>>>> So, in my view, the levels represent the "values" of the factor.
>>>>> This is inconvenient if you want to use value labels in different
>>>>> languages. Further, I do not see a simple method to label numerical
>>>>> variables. I often encounter discrete but still metric data, e.g. risk
>>>>> scores. Usually it would be nice to use them in their original coding,
>>>>> which may include zero or decimal places, and to label them at the same time.
>>>>> Personally, at the moment I try to solve this problem by following a
>>>>> suggestion of Martin, Dimitis and others to use names instead. I doubt,
>>>>> however, that this is a good solution, but at least it makes it possible to
>>>>> have the source data numerically coded and in this sense "language free"
>>>>> (see first attempts of functions below).
>>>>>
>>>>> Heinz
>>>>>
>>>> Those are excellent points Heinz.  I addressed that problem partially in 
>>>> sas.get - see the sascodes attribute.
>>>>
>>>> Frank
>>>>
>>> Frank, I looked at your function sas.get. You solved the problem with a lot
>>> of effort. Don't you think that it would be easier to create just one new
>>> class, say "documented", which offers the possibility to represent the
>>> original data as it is and to add all the useful descriptions like variable
>>> labels, value labels, units, special missing values, and maybe others?
>>> If I remember correctly, SAS, SPSS and BMDP have offered these possibilities
>>> for many years, and in my view for good reason. I have been thinking about
>>> these questions since I started using R about two years ago, and I wonder why
>>> there seems to be so little interest in them.
>>> In my work good documentation of the _unchanged_ data is very important,
>>> also because it eases checking the data for errors.
>>>
>>> Heinz
>>>
>>>
>>>>> ...snip...
>>>
>>>
>> Heinz - the code is quite small and simple, not much effort.  And
>> variable labels need to be attributes of individual variables, otherwise
>> plotting, latex, and other functions can't get access to them (e.g.,
>> in Hmisc xYplot(y ~ x), labels for x and y, and units of measurement, get
>> plotted on axes).  I've had all the SAS, SPSS, and BMDP
>> capabilities you've mentioned in R/S-Plus (plus units attributes not
>> available in those) for years.
>>
>> What would make all this even easier is for R to be told a list of
>> attribute names that would always carry over with subsetting, so that
>> special subsetting methods such as [.labelled would not be necessary.
>>
>> Frank
>>
>> -- 
>> Frank E Harrell Jr   Professor and Chair           School of Medicine
>>                      Department of Biostatistics   Vanderbilt University
>>
>>
> 
> Frank - maybe I did not understand you right, but it seems that you propose
> exactly what I did initially. Yes, I agree with you that it would ease the
> situation if there were a list of respected attributes. However, I suspect
> that it could be a computational burden to copy these attributes in every
> case. So I would suggest defining a class that typically would be assigned
> to raw data and defining an option that sets all the attributes which
> should be copied.
> Do you think this issue could/should be discussed in r-devel?

Yes, r-devel would be the place.  In retrospect a single attribute such
as varExtras would have been good - it could contain label, units, etc.
But my functions are too well established for me to change now.  I'd
have to change too much code.

Frank
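
For illustration only, the single-attribute idea mentioned above might look
roughly like this. The attribute name "varExtras" comes from the message; the
helper functions set_extras() and get_extra() and the example variable are
hypothetical and do not exist in Hmisc or base R.

```r
# Store all documentation for a variable in one list-valued attribute.
# set_extras() and get_extra() are made-up helper names for this sketch.
set_extras <- function(x, ...) {
  old <- attr(x, "varExtras")
  if (is.null(old)) old <- list()
  # modifyList() merges the new entries into any existing ones
  attr(x, "varExtras") <- utils::modifyList(old, list(...))
  x
}

get_extra <- function(x, what) attr(x, "varExtras")[[what]]

# Invented example data
sbp <- c(120, 135, 142)
sbp <- set_extras(sbp, label = "Systolic blood pressure", units = "mmHg")

get_extra(sbp, "label")
get_extra(sbp, "units")
```

With one attribute to carry along, a subsetting method would only need to
copy "varExtras" rather than track each documentation attribute by name.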

> 
> Heinz
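
A minimal sketch of the mechanism discussed in this thread, under stated
assumptions: the class name "source.data" and the idea of a list of attributes
to preserve follow Heinz's description, but the code below is not his
implementation (which was snipped), and the vector keep_attrs, the codes and
the labels are invented for the example. It shows that default subsetting
drops custom attributes, and that a "[" method for a class can copy a chosen
set of them back.

```r
# Attributes that should persist through subsetting (assumed list)
keep_attrs <- c("value.labels", "variable.name", "label")

# Subsetting method for the hypothetical class "source.data"
"[.source.data" <- function(x, ...) {
  a <- attributes(x)
  saved <- a[intersect(names(a), keep_attrs)]
  out <- NextMethod("[")             # default subsetting drops the attributes
  for (nm in names(saved)) attr(out, nm) <- saved[[nm]]
  class(out) <- class(x)             # restore the class as well
  out
}

# Invented example variable with value labels and a variable label
d22 <- c(2, 3, 17, 5, 18)
attr(d22, "value.labels") <- c("code A" = 2, "code B" = 3, "code C" = 17,
                               "code D" = 5, "code E" = 18)
attr(d22, "label") <- "Example variable D22"
idx <- d22 %in% c(2, 3, 17)

# Plain numeric vector: the attributes are lost
plain <- d22[idx]
is.null(attributes(plain))

# Same vector with the class set: the listed attributes survive
class(d22) <- "source.data"
kept <- d22[idx]
attr(kept, "label")

# Row subsetting of a data frame subsets each column with "[",
# so the method is applied there too
df <- data.frame(id = 1:5)
df$D22 <- d22
sub <- df[df$id <= 3, ]
attr(sub$D22, "value.labels")
```

Note that the value labels are copied whole, including labels for codes that
no longer occur in the subset; whether that is desirable is a design choice.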