[R] Creating NA equivalent

Wed Dec 22 02:36:27 CET 2021

Jim,

there are indeed many mathematical areas where data are not quite fixed. Consider inequalities, such as a value that can be higher than some number but lower than another. A grade of A can often mean a score between 90 and 100 (no extra credit). An event deemed to be "significant at the 95% level of probability" can be in a 5% range or, based on various errors, may not even be in the range. In some places you can have infinitesimals or things approaching infinity and yet sometimes cancel things out without having an exact number.

The list of such things is vast and, as was already pointed out here, many such cases have some info, even USEFUL info, that is lost if you declare them to be an NA or an Inf, or if you choose, say, to view an A as exactly 95. If a student has straight A's, there is an excellent chance many of those A's came from scores above 95. A student with an overall C average may be more likely to have the single A be in the low 90's.

R was not necessarily designed to work this way. For some purposes, you may want to use a variable that is more of a range. When I make plots in ggplot, I often use Inf or -Inf to specify one end of a range, so that, for example, whatever the data makes ggplot choose for upper and lower bounds, something I draw in the background will extend to that border. 

But there is a difference between how we store info, and how we use it. Many R functions have a feature like saying na.rm=TRUE that may not make sense if you store a value as an NA whose meaning is "between 95 and 100". You might want to write code that makes two copies of any vector which has an NA value associated with a range, and do something like place the minimum value(s) in one and the maximum in the other and then do some complex calculation.
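
A rough sketch of the two-copies idea, using made-up scores and made-up bounds (the names na_lo and na_hi are just my invention for illustration, not anything R provides):

```r
# Hypothetical scores: NA marks a grade known only as a range.
x     <- c(72, NA, 85, NA)
# Assumed bounds for each NA (a bare "A" is 90..100, a "high A" is 95..100).
na_lo <- c(NA, 90, NA, 95)
na_hi <- c(NA, 100, NA, 100)

# Two copies of the vector: one with the minimum plugged in, one with the maximum.
lo <- ifelse(is.na(x), na_lo, x)
hi <- ifelse(is.na(x), na_hi, x)

# The mean is then itself a range rather than a single number.
mean_range <- c(lower = mean(lo), upper = mean(hi))
print(mean_range)

# Compare with simply discarding the ranged values.
print(mean(x, na.rm = TRUE))
```

With na.rm=TRUE the two A grades vanish entirely and drag the average down, which is exactly the kind of lost info I mean.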

Or consider a value like measuring a room with a ruler accurate only to 1/4 inch. If a side is 100 inches, the real value can be between 99.75 and 100.25 inches. Each measurement can be stored as a number and a plus/minus. To calculate the volume of a room, you might multiply all the low values to get one number and the high values to get another, and store that as a range or whatever else makes sense, like averaging the two.
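
Something like the following, where the three side lengths are invented and only the 1/4 inch tolerance comes from the example above:

```r
# Hypothetical room dimensions in inches, each measured to +/- 0.25.
side <- c(100, 120, 96)
tol  <- 0.25

# Multiply all the low values for one bound and all the high values for the other.
vol_lo <- prod(side - tol)
vol_hi <- prod(side + tol)

# One possible single-number summary: the midpoint of the two bounds.
vol_mid <- (vol_lo + vol_hi) / 2

print(c(low = vol_lo, mid = vol_mid, high = vol_hi))
```

This only works so simply because all the lengths are positive; with subtraction or division you would have to think about which end of each range gives the extreme.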

Still, some of that is normally ignored or done some other way, without inventing new meanings for NA. I noted earlier that programs outside R will often store out-of-band info that, when imported into R, is always treated as NA. Some things may be unavailable because the person did not show up, others because they had horrible handwriting and the one who typed it in guessed what it said, and others because the person refused to answer. It may be that much of your program should treat all those as NA but other parts might want to record that some percent of the responders did this or that. As noted, Adrian Dusa and others had such needs and have a package that in some way annotates NA values when asked. I have played with it but currently have no need for it. And, just FYI, Adrian tried other things first, as there already are multiple bit patterns that mean specific variations on an NA, such as NA_integer_ (note the two underscores) and the variants NA_real_, NA_character_ and NA_complex_. In a bizarre way, you can play games and test them as in:

  > a=NA_integer_
  > b=NA_character_
  > identical(a, NA_integer_)
  [1] TRUE
  > identical(a, NA_character_)
  [1] FALSE
  > identical(a, a)
  [1] TRUE
  > identical(a, b)
  [1] FALSE
  > identical(a, NA)
  [1] FALSE

So, in THEORY, you might get away with using these oddball bit-pattern variations, or adding to them, but they do not survive well in vectors, which must in some sense contain only one type. I have had some minor success making a list and testing the contents, which normally show all versions as NA but clearly retain subtle differences:

  > temp=list(1, NA_integer_, 2, NA_character_, 3, NA)
  > temp
  [[1]]
  [1] 1

  [[2]]
  [1] NA

  [[3]]
  [1] 2

  [[4]]
  [1] NA

  [[5]]
  [1] 3

  [[6]]
  [1] NA

  > temp[[2]]
  [1] NA
  > identical(temp[[2]], NA_integer_)
  [1] TRUE
  > identical(temp[[2]], NA_character_)
  [1] FALSE
  > identical(temp[[4]], NA_character_)
  [1] TRUE

So, yes, I can imagine a subtle window of opportunity for re-using some of these NA variants to act like an NA but also be able to carefully signal some other opportunities. But as noted, vectors break the scheme so your data.frame might need to use list columns, which is doable. I bet many tools you use, especially ones that make copies or conversions, will break the scheme.
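
To show what I mean by vectors breaking the scheme, here is a small sketch in base R (the data.frame and its column names are made up):

```r
# Combining different NA variants in an atomic vector coerces them to one type.
v <- c(1L, NA_integer_, NA_character_)
print(typeof(v))
print(identical(v[2], NA_integer_))    # the integer flavor is gone
print(identical(v[2], NA_character_))  # everything is now a character NA

# A list column keeps each element's own type, so the variants survive.
df <- data.frame(id = 1:3)
df$value <- list(1, NA_integer_, NA_character_)
na_types <- vapply(df$value, typeof, character(1))
print(na_types)
```

The list column works, but most functions expecting an ordinary numeric column will not know what to do with it.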

Please note that for ME, the above discussion is academic and a reaction to the ideas raised by others. I am not in any way suggesting R is deficient for not being designed for things like this, nor that wanting some such feature is a bad thing. What Adrian provided is sort of in between, as real NA values are stored but some attributes also record what each NA is supposed to represent.
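
To be clear, the following is NOT the interface of Adrian's package, just my own bare-bones illustration in base R of the general idea of a real NA plus an attribute saying why. The attribute name "na_reason" and the reason labels are invented:

```r
# Hypothetical survey answers; the NA values are genuine NAs.
x <- c(4.2, NA, 7.9, NA, NA)

# A parallel attribute records why each value is missing (NA where not missing).
attr(x, "na_reason") <- c(NA, "no_show", NA, "illegible", "refused")

# Most of the program can treat them all as plain NA ...
avg <- mean(x, na.rm = TRUE)
print(avg)

# ... while other parts can still count the reasons.
reason_counts <- table(attr(x, "na_reason"))
print(reason_counts)

# Caveat: ordinary subsetting makes a copy that drops the attribute.
sub <- x[2:3]
print(is.null(attr(sub, "na_reason")))
```

The last line is the catch I mentioned: anything that copies or converts the vector can silently lose the annotation, which is why a real package has to do considerably more work than this.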

-----Original Message-----
From: Jim Lemon <drjimlemon using gmail.com> 
Sent: Tuesday, December 21, 2021 5:00 PM
To: Avi Gross <avigross using verizon.net>
Cc: r-help mailing list <r-help using r-project.org>; Adrian Dușa <dusa.adrian using unibuc.ro>
Subject: Re: [R] Creating NA equivalent

Please pardon a comment that may be off-target as well as off-topic.
This appears similar to a number of things like fuzzy logic, where an instance can take incompatible truth values.

It is known that an instance may have an attribute with a numeric value, but that value cannot be determined.

It seems to me that an appropriate designation for the value is Unk, perhaps with an associated probability of determination to distinguish it from NA (it is definitely not known).

Jim

On Wed, Dec 22, 2021 at 6:55 AM Avi Gross via R-help <r-help using r-project.org> wrote:
>
> I wonder if the package Adrian Dușa created might be helpful or point you along the way.
>
> It was eventually named "declared"
>
> https://cran.r-project.org/web/packages/declared/index.html
>
> With a vignette here:
>
> https://cran.r-project.org/web/packages/declared/vignettes/declared.pdf
>
> I do not know if it would easily satisfy your needs but it may be a step along the way. A package called Haven was part of the motivation and Adrian wanted a way to import data from external sources that had more than one category of NA that sounds a bit like what you want. His functions should allow the creation of such data within R, as well. I am including him in this email if you want to contact him or he has something to say.
>
>
> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Duncan Murdoch
> Sent: Tuesday, December 21, 2021 5:26 AM
> To: Marc Girondot <marc_grt using yahoo.fr>; r-help using r-project.org
> Subject: Re: [R] Creating NA equivalent
>
> On 20/12/2021 11:41 p.m., Marc Girondot via R-help wrote:
> > Dear members,
> >
> > I work on dosage and some values are below the detection limit.
> > I would like to create new "numbers" like LDL (to represent lower than
> > detection limit) and UDL (above the detection limit) that behave
> > like NA, with the possibility to test them using for example
> > is.LDL() or is.UDL().
> >
> > Note that NA is not the same as LDL or UDL: NA represents missing data.
> > Here the data is available as LDL or UDL.
> >
> > NA is built very deep into the R language... is there any option to
> > create a new NA-equivalent?
> >
>
> There was a discussion of this back in May.  Here's a link to one approach that I suggested:
>
>    https://stat.ethz.ch/pipermail/r-devel/2021-May/080776.html
>
> Read the followup messages, I made at least one suggested improvement.
> I don't know if anyone has packaged this, but there's a later version of the code here:
>
>    https://stackoverflow.com/a/69179441/2554330
>
> Duncan Murdoch
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see 
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.