[Rd] 1954 from NA

Tue May 25 04:51:53 CEST 2021

Adrian,

This is an aside. I note that many machine-learning algorithms actually do something along the lines being discussed. They may take an item like a paragraph of words or an email message and add thousands of columns, each one a Boolean specifying whether a particular word is in that item or not. They may then run an analysis trying to heuristically match known SPAM items so as to be able to predict whether new items might be SPAM. Some may even have a column for words taken two or more at a time, such as “must” followed by “have”, or “Your”, “last”, “chance”, resulting in even more columns. The software that does the analysis can work on remarkably large collections of this kind, in some cases taking multiple approaches to the same problem and choosing among them in some way.
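As a rough illustration of the word-column idea, here is a minimal base R sketch. The messages, the helper has_pair() and all column names are made up for the example; real spam filters are far more elaborate than this.

```r
# Hypothetical toy messages; not real data.
messages <- c(
  "Your last chance to win",
  "You must have this now",
  "Meeting moved to Tuesday"
)

# Split each message into lower-case words.
words <- strsplit(tolower(messages), "[^a-z]+")

# One Boolean column per distinct word: TRUE if the word occurs in the message.
vocab <- sort(unique(unlist(words)))
word_cols <- vapply(
  vocab,
  function(w) vapply(words, function(ws) w %in% ws, logical(1)),
  logical(length(messages))
)

# Hypothetical helper: TRUE if `first` is immediately followed by `second`.
has_pair <- function(ws, first, second) {
  n <- length(ws)
  n >= 2 && any(ws[-n] == first & ws[-1] == second)
}

# An extra column for a two-word sequence.
must_have <- vapply(words, has_pair, logical(1),
                    first = "must", second = "have")

features <- data.frame(word_cols, must_have = must_have, check.names = FALSE)
```

Each row of features is one message and each column one word (or word pair), which is the wide Boolean layout described above.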

In your case, yes, adding lots of columns seems like added work. But in data science, often the easiest way to do some complex things is to loop over selected existing columns and create multiple sets of additional columns that simplify later calculations, which can then use these values rather than some multi-line complex condition. As an example, I have run statistical analyses where I kept a Boolean column recording whether the analysis failed (as in, I caught it using try(), or else it would have killed my process), another recording whether I was told it did not converge properly, and yet another recording whether it failed some post-tests. That simplified queries that excluded rows where any one of the above was TRUE. I also stored columns for metrics like RMSEA and chi-squared values, sometimes dozens. And for each of the above, I actually had a set of columns for various models such as linear versus quadratic and more. Worse, as the analysis continued, more derived columns were added as various measures of the above results were compared to each other, so the different models could be compared as in how often each was better. Careful choices of naming conventions and nice features of the tidyverse made it fairly simple to operate on many columns in the same way, such as all columns whose names start with a string or end with …
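A minimal sketch of the flag-column approach, in base R. The function fit_one() is a hypothetical stand-in for a real model fit, and the column names are invented for the example; only try() and inherits() are the actual base R tools being relied on.

```r
# Hypothetical stand-in for a model fit: returns a list, or stops on bad input.
fit_one <- function(x) {
  if (x < 0) stop("cannot fit negative input")
  list(estimate = sqrt(x), converged = x != 0)
}

inputs <- c(4, -1, 0, 9)

# Wrap each fit in try() so one failure does not kill the whole run.
fits <- lapply(inputs, function(x) try(fit_one(x), silent = TRUE))

results <- data.frame(input = inputs)

# One Boolean column per kind of problem.
results$failed <- vapply(fits, inherits, logical(1), what = "try-error")
results$not_converged <- vapply(
  fits,
  function(f) !inherits(f, "try-error") && !f$converged,
  logical(1)
)

# A metric column, NA where the fit failed outright.
results$estimate <- vapply(
  fits,
  function(f) if (inherits(f, "try-error")) NA_real_ else f$estimate,
  numeric(1)
)

# Exclude any row where at least one flag is TRUE.
usable <- results[!(results$failed | results$not_converged), ]
```

The final query stays a one-liner however many flag columns are added, which is the simplification meant above.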

And, yes, for some efficiency, I often made a narrower version of the above with just the fields I needed and was careful not to remove what I might need later.

So it can be done, and fairly trivially, if you know what you are doing. If the names of all your original columns that behave this way look like *.orig and others look different, you can ask for a function to be applied to just those, producing another set with the same prefixes but named *.converted and yet another called *.annotation and so on. You may want to remove the originals to save space, but you get the idea. The fact that there are six hundred means little with such a design, as the above can probably be done to all of them at once in a dozen lines of code.
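To make that concrete, here is a small base R sketch. Everything about the data is an assumption for illustration: the survey data frame, the use of negative numbers as missing-value codes, and the reasons lookup are all invented. The same loop would work unchanged for six hundred *.orig columns.

```r
# Hypothetical survey data: negative codes stand for reasons a value is missing.
survey <- data.frame(
  id = 1:4,
  age.orig = c(34, -1, 51, -2),
  income.orig = c(-2, 2500, -1, 3100)
)

# Assumed coding of the missing-value reasons.
reasons <- c("-1" = "refused", "-2" = "not applicable")

# Select only the columns whose names end in ".orig".
orig_cols <- grep("\\.orig$", names(survey), value = TRUE)

for (col in orig_cols) {
  prefix <- sub("\\.orig$", "", col)
  x <- survey[[col]]
  is_missing <- x < 0

  # *.converted: the value with every missing code replaced by a plain NA.
  survey[[paste0(prefix, ".converted")]] <- ifelse(is_missing, NA_real_, x)

  # *.annotation: the reason for the NA, or NA where the value is present.
  survey[[paste0(prefix, ".annotation")]] <- unname(reasons[as.character(x)])
}
```

After the loop, survey has age.converted, age.annotation, income.converted and income.annotation alongside the originals, which could then be dropped.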

For me, the above is way less complex than what you want to do, and it can have benefits. For example, if you make a graph of points from my larger tibble/data.frame using ggplot(), you can do things like specify what color to use for a point using a variable that contains the reason the data was missing (albeit that assumes the missing part is not what is being graphed), or add text giving the reason just above each such point. With your method of faking multiple things that YOU claim are an NA, the above example may not be doable.
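A hedged sketch of that kind of plot follows. The data frame d and its columns are invented for the example, and the plot is only built if the ggplot2 package happens to be installed.

```r
# Hypothetical data: y is observed for every point; x_reason records why some
# other measurement was missing ("none" when it was present).
d <- data.frame(
  time = 1:5,
  y = c(2.1, 2.8, 3.0, 3.9, 4.4),
  x_reason = c("none", "refused", "none", "not applicable", "none")
)

# Rows that carry a reason and should get a text label.
labelled <- d[d$x_reason != "none", ]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(x = time, y = y, colour = x_reason)) +
    geom_point(size = 3) +
    # Write the reason just above each point that has one.
    geom_text(data = labelled, aes(label = x_reason),
              vjust = -1, show.legend = FALSE)
}
```

Printing p draws the points colored by reason, with the reason written above the affected ones.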

From: Adrian Dușa <dusa.adrian using unibuc.ro>
Sent: Monday, May 24, 2021 8:18 AM
To: Greg Minshall <minshall using umich.edu>
Cc: Avi Gross <avigross using verizon.net>; r-devel <r-devel using r-project.org>
Subject: Re: [Rd] 1954 from NA

On Mon, May 24, 2021 at 2:11 PM Greg Minshall <minshall using umich.edu> wrote:

[...]
if you have 500 columns of possibly-NA'd variables, you could have one
column of 500 "bits", where each bit has one of N values, N being the
number of explanations the corresponding column has for why the NA
exists.

The mere thought of implementing something like that gives me shivers. Not to mention such a solution should also be robust when subsetting, splitting, column and row binding, etc. and everything can be lost if the user deletes that particular column without realising its importance.

Social science datasets are much more alive and complex than one might first think: there are multi-wave studies with tens of countries, and aggregating such data is already too complex a process to add even more complexity on top of it.

As undocumented as they may be, or even subject to change, I think the R internals are much more reliable than this.

Best wishes,

Adrian

-- 

Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania

https://adriandusa.eu
