[R] Hierarchical factors

Thu May 6 14:21:43 CEST 2010

On May 6, 2010, at 7:13 AM, Marshall Feldman wrote:

> On 5/5/10 11:29 PM, David Winsemius wrote:
>> I think you are perhaps unintentionally obscuring two issues. One  
>> is whether R might have the statistical functions to deal with such  
>> an arrangement, and here "mixed models" would be the phrase you  
>> ought to be watching for, while the other would be whether it would  
>> have pre-written data management functions that would directly  
>> support the particular data layout you might be getting from  
>> public-access gov't files. The second is what I _thought_ you were  
>> soliciting in your original posting. I was a bit surprised that no  
>> one mentioned the survey package, since I have seen it used in such  
>> situations,  but I cannot track down the citation at the moment.  
>> You might want to look at Gelman's blogs:
>>
>> http://www.stat.columbia.edu/~cook/movabletype/archives/2009/07/my_class_on_sur.html
>>
>> See also work on nested case within cohort designs:
>> http://aje.oxfordjournals.org/cgi/content/full/kwp055v1
>>
>> And Damico's article:
>> "Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis  
>> Techniques in Health Policy Data"
>> R Journal, 2009, n 2.
>> http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf
>>
> First, I apologize for my last, somewhat incoherent post. I was  
> composing it late at night, grew too tired to think, and thought I  
> left it open to finish this morning. Looks as if I should have quit  
> about an hour earlier since apparently the garbled message went out  
> anyway.
>
> Dave, you're right, although I would describe my question as  
> combining rather than obscuring two issues. My thinking is that  
> first one would want the data structure (actually a data type or  
> class). A set of functions could then handle conversion to factors,  
> etc. that would allow easy use of most existing statistical  
> functions. New statistical functions could then be designed, or old  
> ones retrofitted, to handle the new data type internally.  
> Eventually, it would be great to integrate it into the formula  
> language.
>
> The data type would have an inheritance pattern sort of like this:  
> factor -> hierarchy -> specific system. By "specific system" I mean  
> either a standard or user-defined coding system that extends the  
> hierarchy class. For example, NAICS would be a data type and any  
> variable in this class would be both hierarchical and map to the  
> labels associated with the industry definitions. The hierarchy class  
> would be what I was describing, with information on how to parse  
> individual character strings at various levels of aggregation.  
> Finally, although my idea would extend R's factor data type,  
> strictly speaking this would not be inheritance. Real factors  
> replicate and include labels in the storage associated with  
> individual variables. Most hierarchical systems are very large,  
> including hundreds of levels and long labels. So factors would  
> usually be a very inefficient way to handle them. Imagine, for  
> example, an application analyzing Internet routing or airline  
> traffic, with each node on a route having a spatial hierarchical  
> code (country.state.county.city) and a separate variable for each  
> node. Ugh!
>
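
A minimal sketch of the storage idea described above, in plain R. Nothing here is an existing package: hierarchy(), hier_depth(), geo_labels and the example codes are all hypothetical names made up for illustration. The point is only that the labels can sit in one lookup table that several variables share, while each variable carries just its codes.

```r
# Hypothetical constructor: class and field names are illustrative only.
hierarchy <- function(codes, sep = ".") {
  structure(list(codes = as.character(codes), sep = sep), class = "hierarchy")
}

# Labels live in one lookup table, stored once and shared by every variable.
# (Made-up example codes, not an official coding system.)
geo_labels <- c("US"               = "United States",
                "US.RI"            = "Rhode Island",
                "US.RI.Providence" = "Providence County")

# Depth of each code = number of separator-delimited parts.
hier_depth <- function(h) {
  lengths(strsplit(h$codes, h$sep, fixed = TRUE))
}

# Two variables (e.g. two nodes on a route) that reuse the same label table.
origin <- hierarchy(c("US.RI.Providence", "US.RI", "US"))
dest   <- hierarchy(c("US", "US.RI.Providence", "US.RI"))

hier_depth(origin)
unname(geo_labels[dest$codes])
```

Keeping the codes as plain character strings is just one choice; integer codes indexing the shared table would save more space.
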
> Instead, my idea would be to use an approach similar to SAS's  
> formats, where the labels are stored separately and the individual  
> codes map through a few relatively simple algorithms. SAS, for  
> example, maps codes to labels either 1:1 (a character representation  
> of the code maps to a label) or by evaluating the code and mapping  
> it according to a predefined range of values. SAS recently  
> implemented a feature that allows 1:many mapping so that, for  
> instance, an AGE variable could simultaneously map to "Adult"  
> and "Senior Citizen." Some statistical procedures in SAS will now  
> repeat the analysis for all the mappings, so a single call to  
> describe a variable generates counts of both adults and seniors.
>
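
The three kinds of mapping described above can be roughly imitated with base R. This is only a sketch and not a port of SAS formats: the variable names, the cutoffs 18 and 65, and the labels other than "Adult" and "Senior Citizen" are assumptions chosen for the example.

```r
# 1:1 mapping: a named character vector acts as the separately stored "format".
sex_fmt <- c("1" = "Male", "2" = "Female")
sex     <- c(2, 1, 2)
sex_lab <- unname(sex_fmt[as.character(sex)])

# Range mapping: cut() assigns each value to a predefined interval.
# Intervals are [0,18), [18,65), [65,Inf) because right = FALSE.
age     <- c(15, 30, 70)
age_grp <- cut(age, breaks = c(0, 18, 65, Inf),
               labels = c("Child", "Working age", "Senior Citizen"),
               right = FALSE)

# 1:many mapping: each label has its own rule, so one value can match several.
age_multi <- list("Adult"          = function(a) a >= 18,
                  "Senior Citizen" = function(a) a >= 65)

# Repeat the same summary (a count) once per mapping.
counts <- sapply(age_multi, function(rule) sum(rule(age)))
counts
```

Here the 70-year-old is counted under both "Adult" and "Senior Citizen", which is the behaviour an ordinary factor cannot express.
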
> While something similar to SAS formats would itself be a useful  
> addition to R (and has been discussed before), my idea extends this  
> by adding the ability to parse a hierarchical code at its various  
> levels. This could then be integrated into appropriate statistical  
> functions, or the analyst could write a function to deparse the code  
> into its levels and then call the statistical function as needed. At  
> a minimum, the hierarchy class would have to include an as.factor()  
> function.
>
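
For the parsing step, something along these lines might serve as a starting point. It is a sketch under assumptions: hier_truncate() and hier_as_factor() are hypothetical helpers, the codes are made up, and the separator is assumed to be a literal ".". As far as I know base as.factor() is an ordinary function rather than an S3 generic, so the sketch uses a separate helper instead of an as.factor method.

```r
# Hypothetical helper: keep only the first `level` parts of each code.
hier_truncate <- function(codes, level, sep = ".") {
  parts <- strsplit(as.character(codes), sep, fixed = TRUE)
  vapply(parts,
         function(p) paste(head(p, level), collapse = sep),
         character(1))
}

# Hypothetical helper: an ordinary factor at the requested level of aggregation,
# which can then be passed to existing statistical functions.
hier_as_factor <- function(codes, level, sep = ".") {
  factor(hier_truncate(codes, level, sep))
}

# Made-up country.state.county.city codes.
city <- c("US.RI.Providence.Providence",
          "US.RI.Kent.Warwick",
          "US.MA.Suffolk.Boston")

state <- hier_as_factor(city, level = 2)
table(state)
```

The same vector can be aggregated at level 1, 2 or 3 without storing a separate variable for each level.
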

I have seen statements that R and ROOT can be compiled together on the  
same machine. ROOT is an object-oriented data analysis framework  
developed at CERN (also where the WWW started) that supports  
hierarchical organization of data:

http://en.wikipedia.org/wiki/ROOT

The BioConductor "project" ought to be considered as a potential  
source of coding, and the geospatial interest group as well.

See for instance the xps package in BioC
http://bioconductor.org/packages/release/bioc/html/xps.html
http://www.iscb.org/uploaded/css/G04Stratowa.pdf

You might try corresponding with the xps author Christian Stratowa.

> Given R's thousands of packages, I sent my post to find out if  
> something like this already existed.
>
> Thanks to everyone for your feedback. This list is great! The answer  
> to my question is:
>
> > answer <- little.red.hen(question)
>
> Marsh Feldman

David Winsemius, MD
West Hartford, CT