[Rd] [External] Re: 1954 from NA

Duncan Murdoch
Wed May 26 01:27:02 CEST 2021


You've already been told how to solve this:  just add attributes to the 
objects. Use the standard NA to indicate that there is some kind of 
missingness, and the attribute to describe exactly what it is.  Stick a 
class on those objects and define methods so that subsetting and 
arithmetic preserve the extra info you've added. If you do some 
operation that turns those NAs into NaNs, big deal:  the attribute will 
still be there, and is.na(NaN) still returns TRUE.
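
A minimal sketch of what that could look like. The class name "tagged_na", the attribute name "na_reason" and the example codes are all made up for illustration; none of them come from base R or from any existing package.

```r
# Hypothetical constructor: every position with a non-NA reason becomes a
# standard NA, and the reasons travel along in an attribute.
tagged_na <- function(x, reason) {
  stopifnot(length(x) == length(reason))
  x[!is.na(reason)] <- NA
  structure(x, na_reason = reason, class = "tagged_na")
}

# Subsetting: subset the data and the reasons with the same index.
`[.tagged_na` <- function(x, i) {
  structure(unclass(x)[i],
            na_reason = attr(x, "na_reason")[i],
            class = "tagged_na")
}

# Arithmetic and comparison: compute on the bare values, then reattach the
# reasons from whichever operand is tagged (the first one, if both are).
Ops.tagged_na <- function(e1, e2) {
  strip <- function(v) { attributes(v) <- NULL; v }
  tagged <- if (inherits(e1, "tagged_na")) e1 else e2
  op <- get(.Generic)
  out <- if (missing(e2)) op(strip(e1)) else op(strip(e1), strip(e2))
  structure(out,
            na_reason = rep_len(attr(tagged, "na_reason"), length(out)),
            class = "tagged_na")
}

x <- tagged_na(c(1, -99, 3, -98),
               reason = c(NA, "refused", NA, "not applicable"))
y <- (x + 1)[c(2, 4)]
is.na(y)              # the standard test still sees plain NAs
attr(y, "na_reason")  # the reasons survived the arithmetic and the subsetting
```

This only covers `[` and the Ops group; a real package would presumably also want methods for `[<-`, c(), print() and so on.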

Base R doesn't need anything else.

You complained that users shouldn't need to know about attributes, and 
they won't:  you, as the author of the package that does this, will 
handle all those details.  Working in your subject area you know all the 
different kinds of NAs that people care about, and how they code them in 
input data, so you can make it all totally transparent.  If you do it 
well, someone in some other subject area with a completely different set 
of kinds of missingness will be able to adapt your code to their use.
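
As a rough sketch of the "totally transparent" part: a package could offer one function that maps the codes used in the input data to reasons, and one accessor to read them back. The names declare_missing() and na_reason(), the class name, and the codes -99 and -98 are all invented for this example.

```r
# Hypothetical helper: turn declared codes into standard NAs and remember
# which code each one came from.
declare_missing <- function(x, codes) {
  hit <- match(x, codes)        # index of the matching code, NA for real data
  reason <- names(codes)[hit]   # NA_character_ where the value is real data
  x[!is.na(hit)] <- NA          # the standard NA marks the missingness
  structure(x, na_reason = reason, class = "tagged_na")
}

# Hypothetical accessor, so users never have to call attr() themselves.
na_reason <- function(x) attr(x, "na_reason")

income <- declare_missing(c(52000, -99, 48000, -98),
                          codes = c(refused = -99, "not applicable" = -98))
is.na(income)      # missing in the ordinary R sense
na_reason(income)  # why each value is missing
```

Someone with a different set of codes would only need to pass a different `codes` vector.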

I imagine this has all been done in one of the thousands of packages on 
CRAN, but if it hasn't been done well enough for you, do it better.

Duncan Murdoch

On 25/05/2021 7:01 p.m., Adrian Dușa wrote:
> Dear Avi,
> 
> That was quite a lengthy email...
> What you write makes sense of course. I try hard not to deviate from the
> base R, and thought my solution does just that but apparently no such luck.
> 
> I suspect, however, that something will have to eventually change: since
> one of the R building blocks (such as an NA) is questioned by compilers, it
> is serious enough to attract attention from the R core and maintainers.
> And if that happens, my fingers are crossed the solution would allow users
> to declare existing values as missing.
> 
> The importance of that, for the social sciences, cannot be stressed enough.
> 
> Best wishes, thanks once again to everyone,
> Adrian
> 
> On Tue, May 25, 2021 at 10:03 PM Avi Gross via R-devel <
> r-devel using r-project.org> wrote:
> 
>> That helps get more understanding of what you want to do, Adrian. Getting
>> anyone to switch is always a challenge but changing R enough to tempt them
>> may be a bigger challenge. This is an old story. I was the first adopter for
>> C++ in my area and at first had to have my code be built with an all C
>> project making me reinvent some wheels so the same “make” system knew how
>> to build the two compatibly and link them. Of course, they all eventually
>> had to join me in a later release but I had moved forward by then.
>>
>>
>>
>> I have changed (or more accurately added) lots of languages in my life and
>> continue to do so. The biggest challenge is not to just adapt and use it
>> similarly to the previous ones already mastered but to understand WHY
>> someone designed the language this way and what kind of idioms are common
>> and useful even if that means a new way of thinking. But, of course, any
>> “older” language has evolved and often drifted in multiple directions. Many
>> now borrow heavily from others even when the philosophy is different and
>> often the results are not pretty. Making major changes in R might have
>> serious impacts on existing programs including just by making them fail as
>> they run out of memory.
>>
>>
>>
>> If you look at R, there is plenty you can do in base R, sometimes by
>> standing on your head. Yet you see package after package coming along that
>> offers not just new things but sometimes a reworking and even remodeling of
>> old things. R has a base graphics system I now rarely use and another
>> called lattice I have no reason to use again because I can do so much quite
>> easily in ggplot. Similarly, the evolving tidyverse group of packages
>> approaches things from an interesting direction to the point where many
>> people mainly use it and not base R. So if they were to teach a class in
>> how to gather your data and analyze it and draw pretty pictures, the
>> students might walk away thinking they had learned R but actually have
>> learned these packages.
>>
>>
>>
>> Your scenario seems related to a common scenario of how we can have values
>> that signal beyond some range in an out-of-band manner. Years ago we had
>> functions in languages like C that would return a -1 on failure when only
>> non-negative results were otherwise possible. That can work fine but fails
>> in cases when any possible value in the range can be returned. We have
>> languages that deal with this kind of thing using error handling constructs
>> like exceptions.  Sometimes you bundle up multiple items into a structure
>> and return that with one element of the structure holding some kind of
>> return status and another holding the payload. A variation on this theme,
>> as in languages like Go, is to have functions that return multiple values
>> with one of them containing nil on success and an error structure on
>> failure.
>>
>>
>>
>> The situation we have here that seems to be of concern to you is that you
>> would like each item in a structure to have attributes that are recognized
>> and propagated as it is being processed. Older languages tended not to even
>> have such a concept, so basic types simply existed and two instances of the
>> number 5 might even be the same underlying one or two strings with the same
>> contents and so on. You could of course play the game of making a struct,
>> as mentioned above, but then you needed your own code to do all the
>> handling as nothing else knew it contained multiple items and which ones
>> had which purpose.
>>
>>
>>
>> R did add generalized attributes and some are fairly well integrated or at
>> least partially. “Names” were discussed as not being easy to keep around.
>> Factors used their own tagging method that seems to work fairly well but
>> probably not everywhere. But what you want may be more general and not
>> built on similar foundations.
>>
>>
>>
>> I look at languages like Python that are arguably more object-oriented now
>> than R is and in some ways can be extended better, albeit not in others. If
>> I wanted to create an object to hold the number 5 and I add methods to the
>> object that allow it to participate in various ways with other objects
>> using the visible value but also sometimes using the hidden payload, then
>> I might pair it with the string “five” but also with dozens of other
>> strings for the word representing 5 in many languages. So I might have it
>> act like a number in numerical situations and like text when someone is
>> using it in writing a novel in any of many languages.
>>
>>
>>
>> You seem to want to have the original text visible that gives a reason
>> something is missing (or something like that) but have the software TREAT
>> it like it is missing in calculations. In effect, you want is.na() to be
>> a bit more like is.numeric() or is.character() and care more about the TYPE
>> of what is being stored. An item may contain a 999 and yet not be seen as a
>> number but as an NA. The problem I see is that you also may want the item
>> to be a string like “DELETED” and yet include it in the vector that R
>> insists can only hold integers. R does have a built-in data structure
>> called a list that indeed allows that. You can easily store data as a list
>> of lists rather than a list of vectors and many other structures. Some of
>> those structures might handle your needs BUT may only work properly if you
>> build your own packages, as with the tidyverse, and break as soon as any
>> other functions encounter them!
>>
>>
>>
>> But then you would arguably no longer be in R but in your own universe
>> based on R.
>>
>>
>>
>> I have written much code that does things a bit sideways. For example, I
>> might have a treelike structure in which you do some form of search till
>> you encounter a leaf node and return that value to be used in a
>> calculation. To perform a calculation using multiple trees such as taking
>> an average, you always use find_value(tree) and never hand over the tree
>> itself. As I think I pointed out earlier, you can do things like that in
>> many places and hand over a variation of your data. In the ggplot example,
>> you might have:
>>
>>
>>
>> ggplot(data=mydata, aes(x=abs(col1), y=convert_string_to_numeric(col2)) …
>>
>>
>>
>> ggplot would not use the original data in plotting but the view it is
>> asked to use. The function I made up above would know what values are some
>> form of NA and convert all others like “12.3” to numeric form. BUT it would
>> not act as simply or smoothly as when your data is already in the format
>> everyone else uses.
>>
>>
>>
>> So how does R know what something is? Presumably there is some overhead
>> associated with a vector or some table that records the type. A list
>> presumably depends on each internal item to have such a type. So maybe what
>> you want is for each item in a vector to have a type where one type is some
>> form of NA. But as noted, R often does not give a damn about an NA and
>> happily uses it to create more nonsense. The mean of a bunch of numbers
>> that includes one or more copies of things like NA (or NaN or Inf) can
>> pollute them all. Generally R is not designed to give a darn. When people
>> complain, they may be told to add na.rm=TRUE to mean(), or to remove them
>> some way before asking for a mean, or perhaps to reset them to something
>> like zero.
>>
>>
>>
>> So if you want to leave your variables in place with assorted meanings but
>> a tag saying they are to be treated as NA, much in R might have to change.
>> Your suggested approach though is not yet clear but might mean doing
>> something analogous to using extra bits and hoping nobody will notice.
>>
>>
>>
>> So, the solution is both blindingly obvious and even more blindingly
>> stupid. Use complex numbers! All normal content shall be stored as numbers
>> like 5.3+0i and any variant on NA shall be stored as something like 0+3i
>> where 3 means an NA of type 3.
>>
>>
>>
>> OK, humor aside, since the social sciences do not tend to even know what
>> complex numbers are, this should provide another dimension to hide lots of
>> meaningless info. Heck, you could convert a message like “LATE” into some
>> numeric form. Assuming an English centered world (which I do not!) you
>> could store it with L replaced by 12 and A by 01 and so on so the imaginary
>> component might look like 0+12012005i and be easily decoded back into LATE
>> when needed. Again, not a serious proposal. The storage probably would be
>> twice the size of a numeric albeit you can extract the real part when
>> needed for normal calculations and the imaginary part when you want to know
>> about NA type or whatever.
>>
>>
>>
>> What R really is missing is quaternions and octonions which are the only
>> two other variations on complex numbers that are possible and are sort of
>> complex numbers on steroids with either three or seven distinct square
>> roots of minus-one so they allow storage along additional axes in other
>> dimensions.
>>
>>
>>
>> Yes, I am sure someone wrote a package for that! LOL!
>>
>>
>>
>> Ah, here is one: https://cran.r-project.org/web/packages/onion/onion.pdf
>>
>>
>>
>> I will end by saying my experience is that enticing people to do something
>> new is just a start. After they start, you often get lots of complaints and
>> requests for help and even requests to help them move back! Unless you make
>> some popular package everyone runs to, NOBODY else will be able to help
>> them on some things. The reality is that some of the more common tasks
>> these people do are sometimes already optimized for them and often do not
>> make them know more. I have had to use these systems and for some common
>> tasks they are easy. Dialog boxes can pop up and let you check off various
>> options and off you go. No need to learn lots of programming details like
>> the names of various functions that do a Tukey test and what arguments they
>> need and what errors might have to be handled and so on. I know SPSS often
>> produces LOTS of output including many things you do not want and then lets
>> you remove parts you don’t need or even know what they mean. Sure, R can
>> have similar functionality but often you are expected to sort of stitch
>> various parts together as well as ADD your own bits. I love that and value
>> being able to be creative. In my experience, most normal people just want
>> to get the job done and be fairly certain others accept the results and then
>> do other activities they are better suited for, or at least think they are.
>>
>>
>>
>> There are intermediates I have used where I let them do various kinds of
>> processing on SPSS and save the result in some format I can read into R for
>> additional processing. The latter may not be stuff that requires keeping
>> track of multiple NA equivalents. Of course if you want to save the results
>> and move them back, that is a challenge. Hybrid approaches may tempt them
>> to try something and maybe later do more and more and move over.
>>
> 


