[Rd] [External] Re: 1954 from NA

Avi Gross avigross at verizon.net
Tue May 25 21:01:59 CEST 2021


That helps get more understanding of what you want to do, Adrian. Getting anyone to switch is always a challenge, but changing R enough to tempt them may be a bigger one. This is an old story. I was the first adopter of C++ in my area and at first had to have my code built within an all-C project, making me reinvent some wheels so the same “make” system knew how to build the two compatibly and link them. Of course, they all eventually had to join me in a later release, but I had moved forward by then.

 

I have changed (or more accurately added) lots of languages in my life and continue to do so. The biggest challenge is not to just adapt and use it similarly to the previous ones already mastered but to understand WHY someone designed the language this way and what kind of idioms are common and useful even if that means a new way of thinking. But, of course, any “older” language has evolved and often drifted in multiple directions. Many now borrow heavily from others even when the philosophy is different and often the results are not pretty. Making major changes in R might have serious impacts on existing programs including just by making them fail as they run out of memory.

 

If you look at R, there is plenty you can do in base R, sometimes by standing on your head. Yet you see package after package coming along that offers not just new things but sometimes a reworking and even a remodeling of old things. R has a base graphics system I now rarely use, and another called lattice I have no reason to use again, because I can do so much quite easily in ggplot. Similarly, the evolving tidyverse group of packages approaches things from an interesting direction, to the point where many people mainly use it and not base R. So if someone were to teach a class on how to gather your data, analyze it and draw pretty pictures, the students might walk away thinking they had learned R but actually have learned these packages.

 

Your scenario seems related to a common one: how to signal values beyond some range in an out-of-band manner. Years ago we had functions in languages like C that would return a -1 on failure when only non-negative results were otherwise possible. That can work fine but fails when any possible value in the range can be returned. Some languages deal with this kind of thing using error-handling constructs like exceptions. Sometimes you bundle multiple items into a structure and return that, with one element of the structure holding some kind of return status and another holding the payload. A variation on this theme, as in languages like Go, is to have functions return multiple values, one of them being nil on success and an error value on failure.
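The two conventions above can be sketched side by side. This is a minimal illustration in Python with invented function names, not code from any library:

```python
def find_index_c_style(haystack, needle):
    """C convention: -1 means 'not found'; only safe when -1 cannot be a real result."""
    for i, x in enumerate(haystack):
        if x == needle:
            return i
    return -1

def find_index_go_style(haystack, needle):
    """Go convention: return (value, error); error is None on success."""
    for i, x in enumerate(haystack):
        if x == needle:
            return i, None
    return 0, ValueError(f"{needle!r} not found")

idx, err = find_index_go_style(["a", "b"], "z")
print(err is not None)  # True: failure is signalled out of band, not in the value
```

The second style keeps the whole value range available for real results, at the cost of making every caller check the error slot.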

 

The situation here that seems to be of concern to you is that you would like each item in a structure to have attributes that are recognized and propagated as it is being processed. Older languages tended not to even have such a concept, so basic types simply existed, and two instances of the number 5 might even be the same underlying one, or two strings with the same contents, and so on. You could of course play the game of making a struct, as mentioned above, but then you needed your own code to do all the handling, as nothing else knew it contained multiple items or which ones had which purpose.

 

R did add generalized attributes, and some are fairly well integrated, or at least partially so. “Names” were discussed as not being easy to keep around. Factors use their own tagging method that seems to work fairly well, but probably not everywhere. But what you want may be more general and not built on similar foundations.

 

I look at languages like Python that are arguably more object-oriented now than R is and in some ways can be extended better, albeit not in others. If I wanted to create an object to hold the number 5, I could add methods that allow it to participate in various ways with other objects, sometimes using the hidden numeric payload and sometimes using other attached data. I might pair it with the string “five”, but also with dozens of other strings for the word representing 5 in many languages. So I might have it act like a number in numerical situations and like text when someone is using it in writing a novel in any of many languages.
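A rough sketch of that wrapped-number idea, with an invented class name and a made-up translation table:

```python
class Numeral:
    """A number that also knows how to present itself as a word."""
    TRANSLATIONS = {"en": "five", "ro": "cinci", "fr": "cinq"}  # illustrative data

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # Participate in arithmetic via the hidden payload,
        # unboxing the other operand if it is also wrapped.
        return self.value + getattr(other, "value", other)

    def as_word(self, lang="en"):
        # Act like text when the context calls for it.
        return self.TRANSLATIONS[lang]

five = Numeral(5)
print(five + 2)            # 7: behaves numerically
print(five.as_word("ro"))  # cinci: behaves textually
```

The catch, as noted below, is that everything else in the ecosystem must agree to box and unbox these objects, which is exactly where edge cases multiply.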

 

You seem to want the original text that gives a reason something is missing (or something like that) to remain visible, but have the software TREAT it as missing in calculations. In effect, you want is.na() to be a bit more like is.numeric() or is.character() and care more about the TYPE of what is being stored. An item may contain a 999 and yet be seen not as a number but as an NA. The problem I see is that you may also want the item to be a string like “DELETED” and yet include it in a vector that R insists can only hold integers. R does have a built-in data structure called a list that indeed allows that. You can easily store data as a list of lists rather than a list of vectors, and many other structures. Some of those structures might handle your needs BUT may only work properly if you build your own packages, as with the tidyverse, and break as soon as any other function encounters them!
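To make the “visible reason, but treated as missing” idea concrete, here is a hypothetical sketch in Python (the class and its behavior are invented for illustration; this is not how R's NA is implemented):

```python
import math

class TaggedNA(float):
    """A value that shows its reason when printed but is NaN in arithmetic."""
    def __new__(cls, reason):
        obj = super().__new__(cls, math.nan)  # the numeric payload is NaN
        obj.reason = reason                   # the human-readable explanation
        return obj

    def __repr__(self):
        return f"NA<{self.reason}>"

x = TaggedNA("DELETED")
print(repr(x))        # NA<DELETED>: the original explanation stays visible
print(math.isnan(x))  # True: calculations see an ordinary missing value
```

Because `TaggedNA` subclasses `float`, any arithmetic that touches it simply propagates NaN, while code that asks for the reason can still read it.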

 

But then you would arguably no longer be in R but in your own universe based on R.

 

I have written much code that does things a bit sideways. For example, I might have a treelike structure in which you do some form of search till you encounter a leaf node and return that value to be used in a calculation. To perform a calculation using multiple trees such as taking an average, you always use find_value(tree) and never hand over the tree itself. As I think I pointed out earlier, you can do things like that in many places and hand over a variation of your data. In the ggplot example, you might have:

 

ggplot(data=mydata, aes(x=abs(col1), y=convert_string_to_numeric(col2))) + …

 

ggplot would not use the original data in plotting but the view it is asked to use. The function I made up above would know which values are some form of NA and convert all others, like “12.3”, to numeric form. BUT it would not act as simply or smoothly as when your data is already in the format everyone else uses.
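The find_value(tree) idea mentioned above might look roughly like this, sketched in Python with an invented tree layout (nested dicts standing in for the treelike structure):

```python
def find_value(node):
    """Descend through the tree until a leaf (non-dict) is found, return it."""
    while isinstance(node, dict):
        node = node["left"]  # some search rule; here we simply always go left
    return node

# Callers never average the trees themselves, only the extracted leaf values:
trees = [{"left": {"left": 10}}, {"left": 20}]
print(sum(find_value(t) for t in trees) / len(trees))  # 15.0
```

The point is the discipline: every consumer goes through the accessor, so the odd internal representation never leaks into the calculation.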

 

So how does R know what something is? Presumably there is some overhead associated with a vector, or some table that records the type. A list presumably depends on each internal item having such a type. So maybe what you want is for each item in a vector to have a type, where one type is some form of NA. But as noted, R often does not give a damn about an NA and happily uses it to create more nonsense. The mean of a bunch of numbers that includes one or more copies of things like NA (or NaN or Inf) is polluted by them all. Generally R is not designed to give a darn. When people complain, they may be told to call mean() with na.rm=TRUE, to remove the NAs some way before asking for a mean, or perhaps to reset them to something like zero.
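That pollution is easy to demonstrate with floating-point NaN, the nearest Python analogue of R's NA; filtering first is the moral equivalent of na.rm=TRUE:

```python
import math

data = [1.0, 2.0, math.nan, 4.0]
print(sum(data) / len(data))     # nan: a single NaN poisons the whole mean

# The equivalent of mean(x, na.rm = TRUE): drop the missing values first.
clean = [x for x in data if not math.isnan(x)]
print(sum(clean) / len(clean))   # 2.333...
```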

 

So if you want to leave your variables in place with assorted meanings but a tag saying they are to be treated as NA, much in R might have to change. Your suggested approach is not yet clear to me, but it might mean doing something analogous to using extra bits and hoping nobody will notice.

 

So, the solution is both blindingly obvious and even more blindingly stupid. Use complex numbers! All normal content shall be stored as numbers like 5.3+0i and any variant on NA shall be stored as something like 0+3i where 3 means an NA of type 3.

 

OK, humor aside: since the social sciences do not tend to even know what complex numbers are, this should provide another dimension to hide lots of meaningless info. Heck, you could convert a message like “LATE” into some numeric form. Assuming an English-centered world (which I do not!), you could store it with L replaced by 12, A by 01 and so on, so the imaginary component might look like 0+12012005i and easily be decoded back into LATE when needed. Again, not a serious proposal. The storage would probably be twice the size of a numeric, albeit you can extract the real part when needed for normal calculations and the imaginary part when you want to know about the NA type or whatever.
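For fun, the letters-to-digits packing can be sketched in a few lines (Python here; the function names are invented, and the scheme assumes A=01 through Z=26 with two digits per letter):

```python
def encode_tag(word):
    """Pack 'LATE' into the integer 12012005 (L=12, A=01, T=20, E=05)."""
    return int("".join(f"{ord(c) - ord('A') + 1:02d}" for c in word.upper()))

def decode_tag(n):
    """Unpack 12012005 back into 'LATE'."""
    digits = f"{n:02d}"
    if len(digits) % 2:           # restore a leading zero lost by int()
        digits = "0" + digits
    return "".join(chr(int(digits[i:i + 2]) + ord('A') - 1)
                   for i in range(0, len(digits), 2))

value = complex(0, encode_tag("LATE"))  # 0+12012005j: missing, reason "LATE"
print(decode_tag(int(value.imag)))      # LATE
```

The real part stays free for ordinary numeric content, which is the whole (unserious) trick.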

 

What R is really missing is quaternions and octonions, which are the only two further variations on complex numbers that are possible (among the normed division algebras) and are sort of complex numbers on steroids, with either three or seven distinct imaginary units, so they allow storage along additional axes in other dimensions.

 

Yes, I am sure someone wrote a package for that! LOL!

 

Ah, here is one: https://cran.r-project.org/web/packages/onion/onion.pdf

 

I will end by saying my experience is that enticing people to do something new is just a start. After they start, you often get lots of complaints and requests for help, and even requests to help them move back! Unless you make some popular package everyone runs to, NOBODY else will be able to help them on some things. The reality is that some of the more common tasks these people do are sometimes already optimized for them and often do not require them to know more. I have had to use these systems, and for some common tasks they are easy. Dialog boxes can pop up and let you check off various options, and off you go. No need to learn lots of programming details like the names of the various functions that do a Tukey test, what arguments they need, what errors might have to be handled, and so on. I know SPSS often produces LOTS of output, including many things you do not want, and then lets you remove the parts you don't need or don't even know the meaning of. Sure, R can have similar functionality, but often you are expected to stitch various parts together as well as ADD your own bits. I love that and value being able to be creative. In my experience, most normal people just want to get the job done, be fairly certain others accept the results, and then do other activities they are better suited for, or at least think they are.

 

There are intermediate approaches I have used, where I let them do various kinds of processing in SPSS and save the result in some format I can read into R for additional processing. The latter may not be stuff that requires keeping track of multiple NA equivalents. Of course, if you want to save the results and move them back, that is a challenge. Hybrid approaches may tempt them to try something and maybe later do more and more and move over.

 

From: Adrian Dușa <dusa.adrian using unibuc.ro> 
Sent: Tuesday, May 25, 2021 2:17 AM
To: Avi Gross <avigross using verizon.net>
Cc: r-devel <r-devel using r-project.org>
Subject: Re: [Rd] [External] Re: 1954 from NA

 

Dear Avi,

 

Thank you so much for the extended messages, I read them carefully.

While they partially offer a solution (I've already been there), they create additional work for the user, some of which is unnecessary.

 

What I am trying to achieve is best described in this draft vignette:

 

devtools::install_github("dusadrian/mixed")

vignette("mixed")

 

Once a value is declared to be missing, the user should not do anything else about it. Despite being present, the value should automatically be treated as missing by the software. That is the way it's done in all major statistical packages like SAS, Stata and even SPSS.

 

My end goal is to make R attractive for my faculty peers (and beyond), almost all of whom are massively using SPSS and sometimes Stata. But in order to convince them to (finally) make the switch, I need to provide similar functionality, not additional work.

 

Re. your first part of the message, I am definitely not trying to change the R internals. The NA will still be NA, exactly as currently defined.

My initial proposal was based on the observation that the 1954 payload was stored as an unsigned int (thus occupying 32 bits) when it is obvious it doesn't need more than 16. That was the only proposed modification, and everything else stays the same.

 

I now learned, thanks to all contributors in this list, that building something around that payload is risky because we do not know exactly what the compilers will do. One possible solution that I can think of, while (still) maintaining the current functionality around the NA, is to use a different high word for the NA that would not trigger compilation issues. But I have absolutely no idea what that implies for the other inner workings of R.

 

I very much trust that R core will eventually find a robust solution; they've solved much more complicated problems than this. I just hope the current thread will put the idea of tagged NAs on the table, for when they discuss this.

 

Once that is solved, and despite the current advice discouraging this route, I believe tagging NAs is a valuable idea that should not be discarded.

After all, the NA is nothing but a tagged NaN.
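That "tagged NaN" can be illustrated at the bit level (in Python for convenience; the 0x7FF00000 high word and 1954 low word follow the definition discussed in this thread, and, as noted above, real code must not rely on the payload surviving arbitrary compilers and FPUs):

```python
import math
import struct

# Build the NA_real_ bit pattern discussed in this thread: an IEEE-754
# double whose high word is 0x7FF00000 (a NaN) and whose low word holds
# the payload 1954. A sketch only -- payload survival is platform-dependent.
na_bits = (0x7FF00000 << 32) | 1954
na_real = struct.unpack(">d", struct.pack(">Q", na_bits))[0]

print(math.isnan(na_real))  # behaves as an ordinary NaN in arithmetic

# Recover the low-word payload by reinterpreting the bits:
payload = struct.unpack(">Q", struct.pack(">d", na_real))[0] & 0xFFFFFFFF
print(payload)  # 1954, on platforms where the round trip preserves NaN bits
```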

 

All the best,

Adrian

 

 

On Tue, May 25, 2021 at 7:05 AM Avi Gross via R-devel <r-devel using r-project.org <mailto:r-devel using r-project.org> > wrote:

I was thinking about how one does things in a language that is properly object-oriented versus R that makes various half-assed attempts at being such.

Clearly in some such languages you can make an object that is a wrapper that allows you to save an item that is the main payload as well as anything else you want. You might need a way to convince everything else to allow you to make things like lists and vectors and other collections of the objects and perhaps automatically unbox them for many purposes. As an example in a language like Python, you might provide methods so that adding A and B actually gets the value out of A and/or B and adds them properly.  But there may be too many edge cases to handle and some software may not pay attention to what you want including some libraries written in other languages.

I mention Python for the odd reason that it is now possible to combine Python and R in the same program and sort of switch back and forth between data representations. This may provide some openings for preserving and accessing metadata when needed.

Realistically, if R were being designed from scratch TODAY, many things might be done differently. But I recall it being developed at Bell Labs for purposes where it was sort of revolutionary at the time (back when it was S), designed to do things in a vectorized way and probably primarily for the kinds of scientific and mathematical operations where a single NA (of several types depending on the data) was enough when augmented by a few things like NaN, Inf and -Inf. I doubt they seriously saw a need for an unlimited number of NAs that were all the same AND also all different that they felt had to be built in. As noted, had they had a reason to make it fully object-oriented too, and made the base types such as integer into full-fledged objects with room for additional metadata, then things might be different. I note I have seen languages which have both a data type called integer in lower case and Integer in upper case. One of them is regularly boxed and unboxed automagically when used in a context that needs the other. As far as efficiency goes, this invisibly adds many steps. So do languages that sometimes take a variable that is a pointer and invisibly dereference it to provide the underlying field rather than make you do extra typing, and so on.

So is there any reason only an NA should have such metadata? Why not have reasons associated with Inf stating it was an Inf because you asked for one or because of a calculation such as dividing by zero (albeit maybe that might be a NaN), and so on? Maybe I could annotate integers with whether they are prime, or even versus odd, or a factor of 144, or anything else I can imagine. But at some point, the overhead from allowing all this can become substantial. I was amused at how Python allows a function to be annotated, including by itself, since it is an object. So it can store such metadata, perhaps in an attached dictionary, so that a complex, costly calculation can have its results cached: when you ask for the same thing in the same session, it checks whether it has done it already and just returns the stored result. But after a while, how many cached results can there be?
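The caching idea described above is a few lines in Python; `functools.lru_cache` attaches exactly that kind of per-function result store (the `expensive` function here is invented for illustration):

```python
from functools import lru_cache

calls = 0  # count how many times the body actually runs

@lru_cache(maxsize=None)
def expensive(n):
    """Stand-in for a costly computation whose results are worth caching."""
    global calls
    calls += 1
    return sum(i * i for i in range(n))

expensive(1000)
expensive(1000)  # second call with the same argument is served from the cache
print(calls)     # 1: the body ran only once
```

And the "how many cached results can there be?" worry is exactly what the `maxsize` argument bounds: a finite value makes the cache evict least-recently-used entries.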

-----Original Message-----
From: R-devel <r-devel-bounces using r-project.org <mailto:r-devel-bounces using r-project.org> > On Behalf Of luke-tierney using uiowa.edu <mailto:luke-tierney using uiowa.edu> 
Sent: Monday, May 24, 2021 9:15 AM
To: Adrian Dușa <dusa.adrian using unibuc.ro <mailto:dusa.adrian using unibuc.ro> >
Cc: Greg Minshall <minshall using umich.edu <mailto:minshall using umich.edu> >; r-devel <r-devel using r-project.org <mailto:r-devel using r-project.org> >
Subject: Re: [Rd] [External] Re: 1954 from NA

On Mon, 24 May 2021, Adrian Dușa wrote:

> On Mon, May 24, 2021 at 2:11 PM Greg Minshall <minshall using umich.edu <mailto:minshall using umich.edu> > wrote:
>
>> [...]
>> if you have 500 columns of possibly-NA'd variables, you could have 
>> one column of 500 "bits", where each bit has one of N values, N being 
>> the number of explanations the corresponding column has for why the 
>> NA exists.
>>

PLEASE DO NOT DO THIS!

It will not work reliably, as has been explained to you ad nauseam in this thread.

If you distribute code that does this it will only lead to bug reports on R that will waste R-core time.

As Alex explained, you can use attributes for this. If you need operations to preserve attributes across subsetting you can define subsetting methods that do that.

If you are dead set on doing something in C you can try to develop an ALTREP class that provides augmented missing value information.

Best,

luke



>
> The mere thought of implementing something like that gives me shivers. 
> Not to mention such a solution should also be robust when subsetting, 
> splitting, column and row binding, etc. and everything can be lost if 
> the user deletes that particular column without realising its importance.
>
> Social science datasets are much more alive and complex than one might 
> first think: there are multi-wave studies with tens of countries, and 
> aggregating such data is already a complex process to add even more 
> complexity on top of that.
>
> As undocumented as they may be, or even subject to change, I think the 
> R internals are much more reliable that this.
>
> Best wishes,
> Adrian
>
>

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu <mailto:luke-tierney using uiowa.edu> 
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
______________________________________________
R-devel using r-project.org <mailto:R-devel using r-project.org>  mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel







-- 

Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania

https://adriandusa.eu




