[R] building a formula for glm() with 30,000 independent variables
jaitchis at hwy.com.au
Mon Nov 11 08:19:35 CET 2002
Moderator / list boss: should this be taken off list? I am interested
in the current thinking on this issue (of the acceptability of
methodological questions), since I myself recently posted a
statistical/methodological question. imho, a certain amount of this
sort of posting is a "good thing" - it enlivens the list, and
implementation and theoretical matters are often intertwined - but
then I have to declare an interest .. I'd like to be able to post
interesting or challenging problems<g>. So, perhaps some general
guidance as to the admissibility of such enquiries could be
(re)posted.
Re Ben Liblit's problem:
I don't quite understand how you get 30,000 variables out of 10,000
code instrumentation points, but no matter .. 10K, 30K, same
difference.
You have a simple 0/1 dependent variable (runs OK, crashes)
which has a mean of about 0.2. That's OK .. it means that the
crash is not such a rare event that you need to consider differential
subsampling, not immediately anyway.
Why you would want to fit an "additive" model in this context is
beyond me .. do you believe that variable x contributes something
and variable z something else, and that if you add those effects
together you are likely to get a better prediction (of a crash)? I
can't imagine it.
Consider a simple circuit in which there are several switches along
several paths and a lamp which either glows or does not (success or
failure). You MIGHT be able to build a model implying that switch X
contributes a certain amount to "glowingness", and that model MIGHT
fit the data OK, but other than as a guide to further exploration it
is not much use.
Surely most crashes have a single cause (or perhaps a chain of
causes - some interaction effects), or else there is a multiplicity
of disparate causes. If your focus is on fault isolation - and
assuming you only have this post hoc dataset to work with, rather
than having control over the process - then fitting additive models
does not, imho, make a lot of sense.
So why not "screen" your predictors for "significance" in the first
instance? This is essentially what recursive partitioning
procedures do .. look at rpart. Use a simple t test or some such
and throw out all those that appear to have no influence .. if you
end up with more than a handful of variables after this weeding-out
stage, then I suggest that you either have a very buggy program or
the process is inherently too complex to analyze. Worried about
throwing out interaction effects if you discard all variables that
appear non-influential at the first cut? Don't be. The likelihood of
an interaction effect being substantive when the main effects are
not is quite small.
> Murray Jorgensen wrote:
>
> > You have not really given enough background to enable much help to be
> > given.
>
> In a way, that was intentional. I was hoping that my problem was merely
> a matter of proper R usage. But several of you have politely pointed
> out that my underlying thinking about the statistics itself is flawed.
> With 30,000 predictors and an order of magnitude *fewer* observations, I
> should expect to find a bogus but perfectly predictive model even if
> everything were random noise.
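That expectation is easy to confirm in R. A toy sketch with made-up
sizes - pure-noise predictors, more columns than rows - and glm()
still "predicts" perfectly in-sample:

## With p > n, even pure-noise predictors separate the data perfectly.
set.seed(1)
n <- 50; p <- 100                      # made-up sizes, p > n
X <- matrix(rnorm(n * p), n, p)        # noise "predictors"
y <- rbinom(n, 1, 0.2)                 # outcome unrelated to X
fit <- glm(y ~ X, family = binomial)   # warns: fitted probabilities 0 or 1
table(fitted(fit) > 0.5, y)            # perfect, and perfectly meaningless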
>
> > Knowledge of any structure on the predictors may suggest strategies
> > for choosing representative predictors.
>
> Understood. Since my statistics background is so weak, perhaps it would
> be wise at this point to explain what exactly I am trying to accomplish,
> and thereby better leverage this list's expertise.
>
> The "predictors" here are randomly sampled observations of the behavior
> of a running program. We decide in advance what things to observe. For
> example, we might decide to check whether a particular pointer on a
> particular line is null. So that would give us two counters: one
> telling us how many times it was seen to be null, and one telling us how
> many times it was seen to be not null.
>
> A similar sort of instrumentation would be to guess that a pair of
> program variables might be related in some way. At random intervals we
> check their values and increment one of three counters depending on
> whether the first is less than, equal to, or greater than the second.
> So that would give us a trio of related predictors.
>
> We don't update these counters every time a given line of code executes,
> though: we randomly sample perhaps 1/100 or 1/1000. The samples are
> fair in the sense that each sampling opportunity is taken or skipped
> randomly and independently of every other opportunity.
>
> The "dependent outcome" is whether the program ultimately crashes or
> exits successfully. The goal is to identify those program behaviors
> which are strongly predictive of an eventual crash. For example, if the
> program has a single buffer overrun bug, we might discover that the
> "(index > limit) on line 196" counter is nonzero every time we crash,
> but is zero for most runs that do not crash.
>
> (Most, but not all. Sometimes you can overrun a buffer but not crash.
> "Getting lucky" is part of what we're trying to express in our model.)
>
> In my current experiment, I have about 10,000 pairs of program variables
> being compared with a "less", "equal", and "greater" counter for each.
> Thus, 30,000 predictors. Almost all of these should be irrelevant. And
> they certainly are not independent of each other. Looking along the
> other axis, I've got about 3300 distinct program runs, of which roughly
> one fifth crash. I have complete and perfect counter information for
> all of these runs, which I can easily postprocess to simulate sampled
> counters with any desired sampling density.
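Since each sampling opportunity is taken or skipped independently,
simulating a sampling density p from the complete counts is just
binomial thinning. A short sketch, assuming (hypothetically) that
full holds the exact counter matrix:

## Each of the full[i, j] opportunities is kept independently with
## probability p, so the sampled count is Binomial(full[i, j], p).
thin <- function(full, p) {
    sampled <- full                    # keep dimensions and names
    sampled[] <- rbinom(length(full), size = as.vector(full), prob = p)
    sampled
}
X100 <- thin(full, p = 1/100)          # simulate 1-in-100 sampling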
>
> I'm getting the distinct impression that a standard logistic regression
> with 30,000 predictors is *not* a practical approach. What should I be
> using instead? I'm frustrated by the fact that while the problem seems
> conceptually simple enough, I just don't have the statistics background
> required to know how to solve it correctly. If any of you have any
> suggestions, I certainly welcome them.
>
> Thank you, one and all. You've been quite generous with your advice
> already, and I certainly do appreciate it.
>
John Aitchison
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._