[R-sig-ME] Efficient mixed logistic reg w 500k individuals

Mitchell Maltenfort mmalten at gmail.com
Sat Dec 26 21:56:06 CET 2020


As the OP, I should clarify: the hierarchical structure is only part of
the problem.  The other part is that random forests are opaque, and I
need an interpretable model.  P-values aren’t an issue, though.  I’d try
a Bayesian approach, except I suspect that’s even more computationally
demanding.

On Sat, Dec 26, 2020 at 3:05 PM Phillip Alday <me at phillipalday.com> wrote:

> The problem with random forests is that they don't respect the
> hierarchical nature of the data, which depending on the OP's goals may
> or may not be a problem. That's in addition to the differences between
> random forests vs. logistic regression even in a
> non-hierarchical/multilevel context.
>
> Also, I think the spurious/unstable-relationships bit requires some
> qualification. Yes, if you're looking at p-values, then with that much
> data you'll typically be able to detect even trivially small effects as
> significant. But the solution then is not to focus on p-values.
>
> (Not saying random forests and the like aren't useful -- quite the
> contrary. But the motivations here are a bit of a red herring.)
>
> Phillip
>
> On 26/12/20 7:14 am, sree datta wrote:
> > With such a large dataset, I would recommend exploring interactions among
> > variables using ensemble methods such as Random Forests or Extreme
> > Gradient Boosting (since you have a binary dependent variable).
> > These methods also offer some protection here: with such a large N, a
> > regression may turn up many spurious and unstable relationships (in both
> > main effects and interaction effects).
> > In terms of processing efficiency, have you tried the *parallel* package
> > in R? (I would also suggest the *foreach* and *doParallel* packages to
> > improve processing speed.) For a more detailed description of
> > parallelism as implemented in R, see this article:
> > https://www.jigsawacademy.com/handling-big-data-using-r/ (a good
> > summary of packages).
> >
> > On Wed, Dec 23, 2020 at 8:20 PM Mitchell Maltenfort <mmalten at gmail.com>
> > wrote:
> >
> >> Here’s a fun one for you (I hope)
> >>
> >> I’m mucking about with a logistic regression that may have about 30
> >> million records for half a million individuals.
> >>
> >> Yes, I have a large-RAM machine (64 GB), and I’ve used nAGQ = 0 and
> >> other recommendations from
> >>
> >> http://angrystatistician.blogspot.com/2015/10/mixed-models-in-r-bigger-faster-stronger.html?m=1
> >>
> >> which should be reasonable for data this large.
> >>
> >> It works, but I’d still be interested in tweaks to improve speed or
> >> accuracy.  Any ideas?
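[Archive note: for readers finding this thread later, a minimal sketch of the
nAGQ = 0 setup discussed above, with simulated toy data standing in for the
real 30-million-row dataset. With nAGQ = 0, glmer optimizes only the faster
penalized least-squares approximation rather than the full Laplace
approximation (nAGQ = 1), which is why it speeds up large fits at some cost
in accuracy.]

```r
library(lme4)

# Simulated toy data: binary outcome y, one predictor x, random intercept
# per individual id (stand-in for the real half-million-individual data).
set.seed(42)
n_id <- 100
dat <- data.frame(
  id = rep(seq_len(n_id), each = 20),
  x  = rnorm(n_id * 20)
)
u <- rnorm(n_id)                                # individual-level intercepts
dat$y <- rbinom(nrow(dat), 1, plogis(0.5 * dat$x + u[dat$id]))

fit <- glmer(y ~ x + (1 | id), data = dat, family = binomial,
             nAGQ = 0,                          # faster, less accurate
             control = glmerControl(calc.derivs = FALSE))
fixef(fit)                                      # fixed-effect estimates
```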
> >> --
> >> Sent from Gmail Mobile
> >>
> >>
> >> _______________________________________________
> >> R-sig-mixed-models at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> >>
> >
>
-- 
Sent from Gmail Mobile




More information about the R-sig-mixed-models mailing list