[R-sig-ME] Efficient mixed logistic reg w 500k individuals

Phillip Alday
Sat Dec 26 21:04:53 CET 2020


The problem with random forests is that they don't respect the
hierarchical nature of the data, which may or may not be a problem,
depending on the OP's goals. That's in addition to the differences
between random forests and logistic regression even in a
non-hierarchical/multilevel context.

Also, I think the spurious/unstable relationships bit requires some
qualification. Yes, if you're looking at p-values, then with that much
data even trivial effects will typically come out as statistically
significant. But the solution is then not to focus on p-values, but on
the size of the estimated effects.
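A quick simulated sketch of that point (the numbers here are made up for illustration):

```r
## With very large N, even a practically negligible effect reaches
## "significance"; the estimate itself stays tiny. Simulated example:
set.seed(1)
n <- 500000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.01 * x))  # true log-odds slope of just 0.01

fit <- glm(y ~ x, family = binomial)
coef(summary(fit))["x", ]  # tiny estimate, tiny standard error
```

The thing to report and interpret is the effect size (here an odds ratio of roughly exp(0.01), i.e. about 1.01 -- negligible), not the p-value.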

(Not saying random forests and the like aren't useful -- quite the
contrary. But the motivations here are a bit of a red herring.)

Phillip

On 26/12/20 7:14 am, sree datta wrote:
> With such a large dataset, I would recommend exploring interactions among
> variables using ensemble methods such as Random Forests and Extreme
> Gradient Boosting (since you have a binary dependent variable).
> These models also guard against bias, since with such a large N you
> may end up finding many spurious and unstable relationships (in both
> main effects and interaction effects).
> In terms of processing efficiency, have you tried the *parallel* package
> in R? (I would also suggest the *foreach* and *doParallel* packages to
> improve processing speed.) For a more detailed description of
> parallelism in R, see this article:
> https://www.jigsawacademy.com/handling-big-data-using-r/ (a good summary
> of packages).
> 
> 
> 
> On Wed, Dec 23, 2020 at 8:20 PM Mitchell Maltenfort <mmalten using gmail.com>
> wrote:
> 
>> Here’s a fun one for you (I hope)
>>
>> I’m mucking about with a logistic regression that may have about 30 million
>> records for half a million individuals.
>>
>> Yes, I have a large-RAM machine - 64 GB. And I've used nAGQ = 0 and
>> other recommendations from
>>
>> http://angrystatistician.blogspot.com/2015/10/mixed-models-in-r-bigger-faster-stronger.html?m=1
>> which should be reasonable for large data.
>>
>> It works but I’d still be interested in tweaks to improve speed or
>> accuracy.  Any ideas?
>> --
>> Sent from Gmail Mobile
>>
>>
>> _______________________________________________
>> R-sig-mixed-models using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>
> 
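Since the thread is about speed: the nAGQ = 0 approach Mitchell mentions can be sketched with lme4 as below. The data and variable names are simulated placeholders, not the OP's actual model, and the control settings are common speed tweaks rather than a definitive recipe:

```r
library(lme4)

## Simulated stand-in data: 200 individuals, 10 records each
set.seed(1)
n_id <- 200
dat <- data.frame(id = rep(seq_len(n_id), each = 10),
                  x1 = rnorm(10 * n_id))
u <- rnorm(n_id)  # random intercepts, sd = 1
dat$outcome <- rbinom(nrow(dat), 1, plogis(0.5 * dat$x1 + u[dat$id]))

## nAGQ = 0 estimates the fixed effects inside the cheap penalized
## iteratively reweighted least squares (PIRLS) step and optimizes only
## over the variance parameters, which is much faster at the cost of
## some accuracy in the fixed-effect estimates. Skipping the finite-
## difference derivative check (calc.derivs = FALSE) saves further time.
fit <- glmer(outcome ~ x1 + (1 | id), data = dat, family = binomial,
             nAGQ = 0,
             control = glmerControl(optimizer = "nloptwrap",
                                    calc.derivs = FALSE))
fixef(fit)
```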


