[R] Hierarchical factors

Thu May 6 13:13:45 CEST 2010

On 5/5/10 11:29 PM, David Winsemius wrote:
> I think you are perhaps unintentionally obscuring two issues. One is 
> whether R might have the statistical functions to deal with such an 
> arrangement, and here "mixed models" would be the phrase you ought to 
> be watching for, while the other would be whether it would have 
> pre-written data management functions that would directly support the 
> particular data layout you might be getting from public-access gov't 
> files. The second is what I _thought_ you were soliciting in your 
> original posting. I was a bit surprised that no one mentioned the 
> survey package, since I have seen it used in such situations, but I 
> cannot track down the citation at the moment. You might want to look 
> at Gelman's blogs:
>
> http://www.stat.columbia.edu/~cook/movabletype/archives/2009/07/my_class_on_sur.html 
>
>
> See also work on nested case within cohort designs:
> http://aje.oxfordjournals.org/cgi/content/full/kwp055v1
>
> And Damico's article:
> "Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis 
> Techniques in Health Policy Data"
> R Journal, 2009, n 2.
> http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf
>
First, I apologize for my last, somewhat incoherent post. I was 
composing it late at night, grew too tired to think, and thought I left 
it open to finish this morning. Looks as if I should have quit about an 
hour earlier since apparently the garbled message went out anyway.

Dave, you're right, although I would describe my question as combining 
rather than obscuring two issues. My thinking is that first one would 
want the data structure (actually a data type or class). A set of 
functions could then handle conversion to factors, etc. that would allow 
easy use of most existing statistical functions. New statistical 
functions could then be designed, or old ones retrofitted, to handle the 
new data type internally. Eventually, it would be great to integrate it 
into the formula language.

The data type would have an inheritance pattern sort of like this: 
factor -> hierarchy -> specific system. By "specific system" I mean 
either a standard or user-defined coding system that extends the 
hierarchy class. For example, NAICS would be a data type and any 
variable in this class would be both hierarchical and map to the labels 
associated with the industry definitions. The hierarchy class would be 
what I was describing, with information on how to parse individual 
character strings at various levels of aggregation. Finally, although my 
idea would extend R's factor data type, strictly speaking this would not 
be inheritance. Real factors replicate and include labels in the storage 
associated with individual variables. Most hierarchical systems are very 
large, including hundreds of levels and long labels. So factors would 
usually be a very inefficient way to handle them. Imagine, for example, 
an application analyzing Internet routing or airline traffic, with each 
node on a route having a spatial hierarchical code 
(country.state.county.city) and a separate variable for each node. Ugh!
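
A minimal sketch of how such a class layering might look, assuming R's S3 
classes. The names hierarchy(), geo_code(), and depth() are hypothetical 
and invented for illustration, as are the codes and labels. Note that the 
label table is stored once as an attribute and the class does not extend 
factor.

```r
# Hypothetical constructor for a generic hierarchical code vector.
# Codes stay plain character strings; the label table is stored once
# as an attribute instead of being repeated for every variable.
hierarchy <- function(codes, sep = ".", labels = character()) {
  structure(as.character(codes), sep = sep, labels = labels,
            class = "hierarchy")
}

# Hypothetical "specific system" that extends the hierarchy class.
geo_code <- function(codes, labels = character()) {
  x <- hierarchy(codes, sep = ".", labels = labels)
  class(x) <- c("geo_code", class(x))
  x
}

# A generic with a method written once for the hierarchy class.
depth <- function(x, ...) UseMethod("depth")
depth.hierarchy <- function(x, ...) {
  # Number of components in each code, split on the stored separator.
  lengths(strsplit(as.character(x), attr(x, "sep"), fixed = TRUE))
}

x <- geo_code(c("US.RI.Providence.Providence", "US.MA.Suffolk.Boston"),
              labels = c("US.RI" = "Rhode Island",
                         "US.MA" = "Massachusetts"))

inherits(x, "hierarchy")   # the specific system is also a hierarchy
depth(x)                   # method inherited from the hierarchy class
```

Because geo_code objects carry "hierarchy" in their class vector, any 
method written for the hierarchy class applies to every specific system 
built on top of it.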

Instead, my idea would be to use an approach similar to SAS's formats, 
where the labels are stored separately and the individual codes map 
through a few relatively simple algorithms. SAS, for example, maps codes 
to labels either 1:1 (a character representation of the code maps to a 
label) or by evaluating the code and mapping it according to a 
predefined range of values. SAS recently implemented a feature that 
allows 1:many mapping so that, for instance, an AGE variable could 
simultaneously map to "Adult" and "Senior Citizen." Some statistical 
procedures in SAS will now repeat the analysis for all the mappings, so 
a single call to describe a variable generates counts of both adults and 
seniors.
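
The three kinds of mapping described above can be approximated in base R. 
This is only a sketch under assumed data: the variables, cut points, and 
labels are made up, and it does not reproduce SAS's actual format 
machinery.

```r
# 1:1 mapping: labels are stored once in a named character vector and
# looked up by the character representation of the code.
sex_fmt <- c("1" = "Male", "2" = "Female")
sex <- c(2, 1, 1, 2)
sex_label <- unname(sex_fmt[as.character(sex)])

# Range mapping: cut() assigns each value to one predefined interval.
# The cut points here are assumptions chosen for the example.
age <- c(15, 34, 70, 18, 65)
age_group <- cut(age, breaks = c(0, 18, 65, Inf), right = FALSE,
                 labels = c("Minor", "Adult", "Senior Citizen"))

# 1:many mapping: each label has its own range and ranges may overlap,
# so one value can belong to several labels at once.
multi_fmt <- list("Adult" = c(18, Inf), "Senior Citizen" = c(65, Inf))
membership <- sapply(multi_fmt, function(r) age >= r[1] & age < r[2])

# One count per label, analogous to repeating an analysis per mapping.
colSums(membership)
```

The logical membership matrix has one column per label, so an analysis 
can be repeated over its columns to cover every mapping.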

While something similar to SAS formats would itself be a useful addition 
to R (and has been discussed before), my idea extends this by adding the 
ability to parse a hierarchical code at its various levels. This could 
then be integrated into appropriate statistical functions, or the 
analyst could write a function to deparse the code into its levels and 
then call the statistical function as needed. At a minimum, the 
hierarchy class would have to include an as.factor() function.
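
As a rough illustration of deparsing a code into its levels before 
calling an existing function, the sketch below truncates dotted codes to 
a chosen level and converts the result to an ordinary factor. The helper 
truncate_code() is hypothetical and stands in for the as.factor() support 
described above; the codes are invented.

```r
# Hypothetical helper: keep only the first `level` components of each code.
truncate_code <- function(codes, level, sep = ".") {
  parts <- strsplit(as.character(codes), sep, fixed = TRUE)
  vapply(parts,
         function(p) paste(p[seq_len(min(level, length(p)))], collapse = sep),
         character(1))
}

# Made-up country.state.county.city codes.
codes <- c("US.RI.Providence.Providence",
           "US.RI.Kent.Warwick",
           "US.MA.Suffolk.Boston",
           "CA.ON.York.Markham")

# Aggregate to the state level, then hand an ordinary factor to any
# existing statistical function.
state <- factor(truncate_code(codes, 2))
table(state)

# The same codes aggregated to the country level.
country <- factor(truncate_code(codes, 1))
table(country)
```

The same vector of codes yields a different factor at each level of 
aggregation, without storing a separate variable per level.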

Given R's thousands of packages, I sent my post to find out if something 
like this already existed.

Thanks to everyone for your feedback. This list is great! The answer to 
my question is:

 > answer <- little.red.hen(question)

Marsh Feldman