[R] Keep value labels with data frame manipulation

Heinz Tuechler tuechler at gmx.at
Tue Jul 18 00:41:07 CEST 2006


At 20:39 14.07.2006 -0500, Frank E Harrell Jr wrote:
>Heinz Tuechler wrote:
>> At 11:02 13.07.2006 -0500, Frank E Harrell Jr wrote:
>>> Heinz Tuechler wrote:
>>>> At 08:11 13.07.2006 -0500, Frank E Harrell Jr wrote:
>>>>> Heinz Tuechler wrote:
>>>>>> At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:
>>>>>>> On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>>>>>>>> Dear R,
>>>>>>>>
>>>>>>>> I import data from SPSS into an R data.frame. On this raw data I do
>>>>>>>> some data processing (selection of observations, normalization,
>>>>>>>> recoding of variables, etc.). The result is stored in a new
>>>>>>>> data.frame; however, in this new data.frame the value labels are
>>>>>>>> lost.
>>>>>>>>
>>>>>>>> Example of what I do in code:
>>>>>>>>
>>>>>>>> # read raw data from spss
>>>>>>>> rawdata <- read.spss("./data/T50937.SAV",
>>>>>>>> 	use.value.labels=FALSE,to.data.frame=TRUE)
>>>>>>>>
>>>>>>>> # select the observations that we need
>>>>>>>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 |
>>>>>>>>  			rawdata$D22==17 | rawdata$D22==18 |
>>>>>>>>  			rawdata$D22==20 | rawdata$D22==22 |
>>>>>>>>  			rawdata$D22==24 | rawdata$D22==33,]
>>>>>>>>
>>>>>>>> The result is that rawdata$D22 has value labels and that
>>>>>>>> diarydata$D22 is numeric without value labels.
>>>>>>>>
>>>>>>>> Question: How can I prevent this from happening?
>>>>>>>>
>>>>>>>> Thanks in advance!
>>>>>>>> Groeten,
>>>>>>>> Arne
>>>>>>> Two things:
>>>>>>>
>>>>>>> 1. With respect to your subsetting, your lengthy code can be replaced
>>>>>>> with the following:
>>>>>>>
>>>>>>>  diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>>>>>>>
>>>>>>> See ?subset and ?"%in%" for more information.
>>>>>>>
>>>>>>>
>>>>>>> 2. With respect to keeping the label related attributes, the
>>>>>>> 'value.labels' attribute and the 'variable.labels' attribute will
>>>>>>> not by default survive the use of "[.data.frame" in R (see ?Extract
>>>>>>> and ?"[.data.frame").
>>>>>>>
>>>>>>> On the other hand, based upon my review of ?read.spss, the SPSS value
>>>>>>> labels should be converted to the factor levels of the respective
>>>>>>> columns when 'use.value.labels = TRUE' and these would survive a
>>>>>>> subsetting.
>>>>>>>
>>>>>>> If you want to consider a solution to the attribute subsetting issue,
>>>>>>> you might want to review the following post by Gabor Grothendieck in
>>>>>>> May, which provides a possible solution:
>>>>>>>
>>>>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>>>>>>>
>>>>>>> and this post by me, for an explanation of what is happening in
>>>>>>> Gabor's solution:
>>>>>>>
>>>>>>>  https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>>>>>>>
>>>>>>> HTH,
>>>>>>>
>>>>>>> Marc Schwartz
>>>>>>>
>>>>>> Hello Marc and Arne,
>>>>>>
>>>>>> I worked on the suggestions of Gabor and Marc and programmed some
>>>>>> functions in this way, but they are very, very preliminary (see below).
>>>>>> In my view there is a lack of convenient possibilities in R to document
>>>>>> empirical data by variable labels, value labels, etc. I would prefer to
>>>>>> have these possibilities in the "standard" configuration.
>>>>>> So I sketched a concept, but in my view it would only be useful if
>>>>>> there was some acceptance by the core developers of R.
>>>>>>
>>>>>> The concept would be to define a class. For now I call it
>>>>>> "source.data". To make it more flexible than the Hmisc class
>>>>>> "labelled", I would define a related option "source.data.attributes"
>>>>>> with default c('value.labels', 'variable.name', 'label'). This option
>>>>>> contains all attributes that should persist in subsetting/indexing.
>>>>>>
>>>>>> I made only some very, very preliminary tests with these functions,
>>>>>> mainly because I am not happy with defining a new class. Instead I
>>>>>> would prefer that this functionality be integrated in the Hmisc class
>>>>>> "labelled", since this is in my view the best known starting point for
>>>>>> data documentation in R.
>>>>>>
>>>>>> I would be happy if there were some discussion about the wishes/needs
>>>>>> of other R users concerning data documentation.
>>>>>>
>>>>>> Greetings,
>>>>>>
>>>>>> Heinz
>>>>> I feel that separating variable labels and value labels and just using 
>>>>> factors for value labels works fine, and I would urge you not to create 
>>>>> a new system that will not benefit from the many Hmisc functions that 
>>>>> use variable labels and units.  [.data.frame in Hmisc keeps all
>>>>> attributes.
>>>>> Frank
>>>>>
>>>> Frank,
>>>>
>>>> of course I agree with you about the importance of Hmisc and, as I
>>>> said, I do not want to define a new class, but in my view factors are
>>>> no good substitute for value labels.
>>>> As the language definition (version 2.3.1 (2006-06-05) Draft, page 7)
>>>> says:
>>>> "Factors are currently implemented using an integer array to specify the
>>>> actual levels and a second array of names that are mapped to the
>>>> integers. Rather unfortunately users often make use of the
>>>> implementation in order to make some calculations easier."
>>>> So, in my view, the levels represent the "values" of the factor.
>>>> This is inconvenient if you want to use value labels in different
>>>> languages. Further, I do not see a simple method to label numerical
>>>> variables. I often encounter discrete, but still metric data, e.g.
>>>> risk scores. Usually it would be nice to use them in their original
>>>> coding, which may include zero or decimal places, and to label them at
>>>> the same time.
>>>> Personally, at the moment I try to solve this problem by following a
>>>> suggestion of Martin, Dimitis and others to use names instead. I doubt,
>>>> however, that this is a good solution, but at least it makes it
>>>> possible to have the source data numerically coded and in this sense
>>>> "language free" (see first attempts of functions below).
>>>>
>>>> Heinz
>>>>
>>> Those are excellent points Heinz.  I addressed that problem partially in 
>>> sas.get - see the sascodes attribute.
>>>
>>> Frank
>>>
>> 
>> Frank, I looked at your function sas.get. You solved the problem with a lot
>> of effort. Don't you think that it would be easier to create just one new
>> class, say "documented", which offers the possibility to represent the
>> original data as it is and to add all the useful descriptions like variable
>> labels, value labels, units, special missing values, and maybe others?
>> If I remember correctly, SAS, SPSS and BMDP have offered these possibilities
>> for many years, and in my view for good reason. I have been thinking about
>> these questions since I started using R about two years ago and I wonder why
>> there seems to be so little interest in them.
>> In my work good documentation of the _unchanged_ data is very important,
>> also because it eases checking the data for errors.
>> 
>> Heinz
>> 
>> 
>>>> ...snip...
>> 
>> 
>> 
>
>Heinz - the code is quite small and simple, not much effort.  And 
>variable labels need to be attributes of individual variables, otherwise 
>plotting, latex, and other functions can't get access to them (e.g., 
>in Hmisc xYplot(y ~ x), labels for x and y, and units of measurement, get 
>plotted on axes).  I've had all the SAS, SPSS, and BMDP 
>capabilities you've mentioned in R/S-Plus (plus units attributes not 
>available in those) for years.
>
>What would make all this even easier is for R to be told a list of 
>attribute names that would always carry with subsetting, so that 
>special subsetting methods such as [.labelled would not be necessary.
>
>Frank
>
>-- 
>Frank E Harrell Jr   Professor and Chair           School of Medicine
>                      Department of Biostatistics   Vanderbilt University
>
>

Frank - maybe I did not understand you correctly, but it seems that you
propose exactly what I did initially. Yes, I agree with you that it would
ease the situation if there were a list of respected attributes. However, I
suspect that it could be a computational burden to copy these attributes in
every case. So I would suggest defining a class that would typically be
assigned to raw data, and an option that names all the attributes which
should be copied.
Do you think this issue could/should be discussed on r-devel?
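
To make the class-plus-option idea concrete, here is a minimal sketch, not a
finished implementation. It assumes the class name "source.data" and the
option "source.data.attributes" proposed earlier in this thread; the helper
as.source.data(), the example data and the example labels are made up for
illustration only. It is meant for plain vectors used as data frame columns.

```r
# Hypothetical option: names of the attributes that should survive "[".
options(source.data.attributes = c("value.labels", "variable.name", "label"))

# Hypothetical helper: tag a vector as documented raw data.
as.source.data <- function(x) {
  oldClass(x) <- unique(c("source.data", oldClass(x)))
  x
}

# Subsetting method: let the default "[" do the work, then copy back
# only the attributes named in the option, plus the class itself.
"[.source.data" <- function(x, ...) {
  keep  <- getOption("source.data.attributes")
  saved <- attributes(x)[intersect(keep, names(attributes(x)))]
  cls   <- oldClass(x)
  y <- NextMethod("[")
  for (nm in names(saved)) attr(y, nm) <- saved[[nm]]
  oldClass(y) <- cls
  y
}

# Made-up example data; the labels below are invented, not from T50937.SAV.
rawdata <- data.frame(D22 = c(2, 3, 5, 17, 40), id = 1:5)
d22 <- rawdata$D22
attr(d22, "label")        <- "Diary type"
attr(d22, "value.labels") <- c(daily = 2, weekly = 3, monthly = 17)
attr(d22, "note")         <- "not listed in the option, so not kept"
rawdata$D22 <- as.source.data(d22)

# Row subsetting of the data frame calls "[" on each column, so the
# method above is used for D22.
diarydata <- subset(rawdata, D22 %in% c(2, 3, 17))
attr(diarydata$D22, "value.labels")
```

Only columns carrying the class pay the cost of copying attributes, and
changing the option changes which attributes persist without touching the
method. The column is assigned with `$<-` rather than passed to data.frame(),
because this sketch defines no as.data.frame method for the class.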

Heinz


