[Rd] 1954 from NA

Mon May 24 15:18:47 CEST 2021

Hi Adrian, 

Have a look at the vctrs package: it has low-level primitives that might simplify your life a bit. I think you can get quite far by creating a custom type that stores NAs in an attribute and uses the vctrs proxy functionality to preserve these attributes across different operations. Going that route will likely give you a much more flexible and robust solution. 
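
To make the attribute idea concrete, here is a rough base-R sketch (deliberately not using vctrs, so it runs without extra packages). The class name "tagged_dbl", the constructor new_tagged() and the attribute name "na_reason" are made up for illustration; they are not part of vctrs, haven or base R. The only point is that a `[` method can keep the per-element reasons aligned with the values, so ordering and subsetting need no separate index:

```r
# Hypothetical class "tagged_dbl": a double vector plus a parallel
# character vector giving the reason for each missing value.
new_tagged <- function(x, reason = rep(NA_character_, length(x))) {
  stopifnot(is.double(x), is.character(reason),
            length(x) == length(reason))
  x[!is.na(reason)] <- NA_real_   # a tagged element is always NA
  structure(x, na_reason = reason, class = "tagged_dbl")
}

# Subset the values and the reasons with the same index, so they
# cannot drift apart.
`[.tagged_dbl` <- function(x, i, ...) {
  new_tagged(unclass(x)[i], attr(x, "na_reason")[i])
}

income <- new_tagged(c(1200, NA, 800, NA),
                     c(NA, "refused", NA, "dont_know"))

# order() sees the plain numeric values; `[` carries the reasons along.
sorted <- income[order(income)]
attr(sorted, "na_reason")
```

A real implementation would also need methods for c(), format(), arithmetic and so on, which is the part vctrs is meant to take care of.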

Best, 

Taras

> On 24 May 2021, at 15:09, Adrian Dușa <dusa.adrian using gmail.com> wrote:
> 
> Dear Alex,
> 
> Thanks for piping in, I am learning with each new message.
> The problem is clear, the solution escapes me though. I've already tried
> the attributes route: it is going to triple the data size: along with the
> additional (logical) variable that specifies which level is missing, one
> also needs to store an index such that sorting the data would still
> maintain the correct information.
> 
> One also needs to think about subsetting (subset the attributes as well),
> splitting (the same), aggregating multiple datasets (even more attention),
> creating custom vectors out of multiple variables... complexity quickly
> grows towards infinity.
> 
> R factors are nice indeed, but:
> - there are numerical variables which can hold multiple missing values (for
> instance income)
> - factors convert the original questionnaire values: if a missing value was
> coded 999, turning that into a factor would convert that value into
> something else
> 
> I really, and wholeheartedly, do appreciate all advice: but please be
> assured that I have been thinking about this for more than 10 years and
> still haven't found a satisfactory solution.
> 
> Which makes it even more intriguing, since other software like SAS or Stata
> have solved this for decades: what is their implementation, and how come
> they don't seem to be affected by the new M1 architecture?
> When package "haven" introduced the tagged NA values I said: ah-haa... so
> that is how it's done... only to learn that implementation is just as
> fragile as the R internals.
> 
> There really should be a robust solution for this seemingly mundane
> problem, but apparently it is far from mundane...
> 
> Best wishes,
> Adrian
> 
> 
> On Mon, May 24, 2021 at 3:29 PM Bertram, Alexander <alex using bedatadriven.com>
> wrote:
> 
>> Dear Adrian,
>> I just wanted to pipe in and underscore Thomas' point: the payload bits of
>> IEEE 754 floating point values are no place to store data that you care
>> about or need to keep. That is not only related to the R APIs, but also how
>> processors handle floating point values and signaling and non-signaling
>> NaNs. It is very difficult to reason about when and under which
>> circumstances these bits are preserved. I spent a lot of time working on
>> Renjin's handling of these values and I can assure you that any such scheme
>> will end in tears.
>> 
>> A far, far better option is to use R's attributes to store this kind of
>> metadata. This is exactly what this language feature is for. There is
>> already a standard 'levels' attribute that holds the labels of factors like
>> "Yes", "No", "Refused", "Interviewer error", etc. In the past, I've worked
>> on projects where we stored an additional attribute like "missingLevels"
>> that stores extra metadata on which levels should be used in which kind of
>> analysis. That way, you can preserve all the information, and then write a
>> utility function which automatically applies certain logic to a whole
>> dataframe just before passing the data to an analysis function. This is
>> also important because in surveys like this, different values should be
>> excluded at different times. For example, you might want to include all
>> responses in a data quality report, but exclude interviewer error and
>> refusals when conducting a PCA or fitting a model.
>> 
>> Best,
>> Alex
>> 
>> On Mon, May 24, 2021 at 2:03 PM Adrian Dușa <dusa.adrian using gmail.com> wrote:
>> 
>>> On Mon, May 24, 2021 at 1:31 PM Tomas Kalibera <tomas.kalibera using gmail.com>
>>> wrote:
>>> 
>>>> [...]
>>>> 
>>>> For the reasons I explained, I would be against such a change. Keeping the
>>>> data on the side, as also recommended by others on this list, would allow
>>>> for a reliable implementation. I don't want to support fragile package
>>>> code building on unspecified R internals, and in this case particularly
>>>> internals that themselves have not stood the test of time, so are at high
>>>> risk of change.
>>>> 
>>> I understand, and it makes sense.
>>> We'll have to wait for the R internals to settle (this really is
>>> surprising, I wonder how other software have solved this). In the meantime,
>>> I will probably go ahead with NaNs.
>>> 
>>> Thank you again,
>>> Adrian
>>> 
>>>        [[alternative HTML version deleted]]
>>> 
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> 
>> 
>> 
>> --
>> Alexander Bertram
>> Technical Director
>> *BeDataDriven BV*
>> 
>> Web: http://bedatadriven.com
>> Email: alex using bedatadriven.com
>> Tel. Nederlands: +31(0)647205388
>> Skype: akbertram
>> 
> 