[R] Hierarchical factors
Marshall Feldman
marsh at uri.edu
Thu May 6 13:13:45 CEST 2010
On 5/5/10 11:29 PM, David Winsemius wrote:
> I think you are perhaps unintentionally obscuring two issues. One is
> whether R might have the statistical functions to deal with such an
> arrangement, and here "mixed models" would be the phrase you ought to
> be watching for, while the other would be whether it would have
> pre-written data management functions that would directly support the
> particular data layout you might be getting from public-access gov't
> files. The second is what I _thought_ you were soliciting in your
> original posting. I was a bit surprised that no one mentioned the
> survey package, since I have seen it used in such situations, but I
> cannot track down the citation at the moment. You might want to look
> at Gelman's blogs:
>
> http://www.stat.columbia.edu/~cook/movabletype/archives/2009/07/my_class_on_sur.html
>
>
> See also work on nested case within cohort designs:
> http://aje.oxfordjournals.org/cgi/content/full/kwp055v1
>
> And Damico's article:
> "Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis
> Techniques in Health Policy Data"
> R Journal, 2009, n 2.
> http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf
>
First, I apologize for my last, somewhat incoherent post. I was
composing it late at night, grew too tired to think, and thought I had left
it open to finish this morning. Looks as if I should have quit about an
hour earlier since apparently the garbled message went out anyway.
Dave, you're right, although I would describe my question as combining
rather than obscuring two issues. My thinking is that first one would
want the data structure (actually a data type or class). A set of
functions could then handle conversion to factors, etc. that would allow
easy use of most existing statistical functions. New statistical
functions could then be designed, or old ones retrofitted, to handle the
new data type internally. Eventually, it would be great to integrate it
into the formula language.
The data type would have an inheritance pattern sort of like this:
factor -> hierarchy -> specific system. By "specific system" I mean
either a standard or user-defined coding system that extends the
hierarchy class. For example, NAICS would be a data type and any
variable in this class would be both hierarchical and map to the labels
associated with the industry definitions. The hierarchy class would be
what I was describing, with information on how to parse individual
character strings at various levels of aggregation. Finally, although my
idea would extend R's factor data type, strictly speaking this would not
be inheritance. Real factors replicate and include labels in the storage
associated with individual variables. Most hierarchical systems are very
large, including hundreds of levels and long labels. So factors would
usually be a very inefficient way to handle them. Imagine, for example,
an application analyzing Internet routing or airline traffic, with each
node on a route having a spatial hierarchical code
(country.state.county.city) and a separate variable for each node. Ugh!
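A minimal sketch of how such a class might look with R's S3 system. The names hierarchy(), "geocode", and the "labels" and "sep" attributes are hypothetical and exist in no package; the codes and labels are illustrative only. The point is that the labels are stored once per object rather than being replicated as factor levels.

```r
# Hypothetical constructor: none of these names exist in base R.
# 'codes' are separator-delimited strings; 'labels' is a named character
# vector kept once as an attribute instead of being copied as factor levels.
hierarchy <- function(codes, labels, sep = ".", subclass = NULL) {
  structure(as.character(codes),
            labels = labels,
            sep = sep,
            class = c(subclass, "hierarchy"))
}

# A "specific system" extends the generic hierarchy class.
geo <- hierarchy(c("US.RI.Washington.Kingston", "US.MA.Suffolk.Boston"),
                 labels = c("US" = "United States",
                            "US.RI" = "Rhode Island",
                            "US.MA" = "Massachusetts"),
                 subclass = "geocode")

inherits(geo, "hierarchy")      # TRUE
attr(geo, "labels")[["US.RI"]]  # "Rhode Island"
```

Methods written for "hierarchy" would then apply to every specific system, while a system such as "geocode" could override them where its coding rules differ.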
Instead, my idea would be to use an approach similar to SAS's formats,
where the labels are stored separately and the individual codes map
through a few relatively simple algorithms. SAS, for example, maps codes
to labels either 1:1 (a character representation of the code maps to a
label) or by evaluating the code and mapping it according to a
predefined range of values. SAS recently implemented a feature that
allows 1:many mapping so that, for instance, an AGE variable could map
simultaneously to both "Adult" and "Senior Citizen." Some statistical
procedures in SAS will now repeat the analysis for all the mappings, so
a single call to describe a variable generates counts of both adults and
seniors.
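The three kinds of mapping described above can be approximated in base R. This is only a rough sketch, not a reimplementation of SAS formats; sex_fmt, age_fmt, and multi_label() are hypothetical names, and the cutoffs (18 and 65) are assumptions chosen for illustration.

```r
# 1:1 mapping: the labels live in one lookup vector, not in each variable.
sex_fmt <- c("1" = "Male", "2" = "Female")
sex <- c(2, 1, 1, 2)
sex_lab <- unname(sex_fmt[as.character(sex)])

# Range mapping: cut() assigns each value to exactly one interval.
age <- c(12, 30, 70)
age_lab <- cut(age, breaks = c(0, 18, Inf),
               labels = c("Minor", "Adult"), right = FALSE)

# 1:many mapping (hypothetical helper): the ranges overlap, so a single
# value can carry several labels at once.
age_fmt <- list("Adult" = c(18, Inf), "Senior Citizen" = c(65, Inf))
multi_label <- function(x, fmt) {
  lapply(x, function(v) {
    hit <- vapply(fmt, function(r) v >= r[1] && v < r[2], logical(1))
    names(fmt)[hit]
  })
}
age_multi <- multi_label(age, age_fmt)

# Counting every label reproduces the "adults and seniors" tabulation.
table(unlist(age_multi))
```

Because multi_label() returns a list, a value can have zero, one, or several labels, which is what distinguishes it from a factor.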
While something similar to SAS formats would itself be a useful addition
to R (and has been discussed before), my idea extends this by adding the
ability to parse a hierarchical code at its various levels. This could
then be integrated into appropriate statistical functions, or the
analyst could write a function to split the code into its levels and
then call the statistical function as needed. At a minimum, the
hierarchy class would have to include an as.factor() function.
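As a rough illustration of the manual route, the sketch below truncates dot-separated codes to a chosen level and converts the result to an ordinary factor. truncate_code() is a hypothetical helper, the codes are made up for the example, and the dot separator is an assumption; other coding systems would need different parsing rules.

```r
# Hypothetical helper: keep only the first 'level' parts of each code.
truncate_code <- function(codes, level, sep = ".") {
  parts <- strsplit(codes, sep, fixed = TRUE)
  vapply(parts,
         function(p) paste(head(p, level), collapse = sep),
         character(1))
}

codes <- c("US.RI.Washington.Kingston",
           "US.RI.Providence.Providence",
           "US.MA.Suffolk.Boston")

# Aggregate to the state level, then hand an ordinary factor to any
# existing statistical function.
state <- factor(truncate_code(codes, 2))
levels(state)
table(state)
```

The same call with level = 3 would give county-level factors, so one stored variable can serve several levels of aggregation.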
Given R's thousands of packages, I sent my post to find out if something
like this already existed.
Thanks to everyone for your feedback. This list is great! The answer to
my question is:
> answer <- little.red.hen(question)
Marsh Feldman