# [R] How long to wait for process?

john polo jpolo at mail.usf.edu
Thu Jul 27 15:26:42 CEST 2017

```Michael,

Thank you for the suggestion. I will take your advice and look more
critically at the covariates.

John

On 7/27/2017 8:08 AM, Michael Friendly wrote:
> Rather than go to a penalized GLM, you might be better off
> investigating the sources of quasi-perfect separation and simplifying
> the model to avoid or reduce it.  In your data set you have several
> factors with a large number of levels, making the data sparse for all
> their combinations.
>
> Like multicollinearity, near-perfect separation is a data problem, and
> is often better solved by careful thought about the model, rather than
> wrapping the data in a computationally intensive band aid.
>
> -Michael
>
> On 7/26/2017 10:14 AM, john polo wrote:
>> UseRs,
>>
>> I have a dataframe with 2547 rows and several hundred columns in R
>> 3.1.3. I am trying to run a small logistic regression with a subset
>> of the data.
>>
>> know_fin ~
>> comp_grp2+age+gender+education+employment+income+ideol+home_lot+home+county
>>
>>      > str(knowf3)
>>      'data.frame':   2033 obs. of  18 variables:
>>      $ userid    : Factor w/ 2542 levels "FNCNM1639","FNCNM1642",..:
>> 1857 157 965 1967 164 315 849 1017 699 189 ...
>>      $ round_id   : Factor w/ 1 level "Round 11": 1 1 1 1 1 1 1 1 1 1
>> ...
>>      $ age       : int  67 66 44 27 32 67 36 76 70 66 ...
>>      $ county: Factor w/ 80 levels "Adair","Alfalfa",..: 75 75 75 75
>> 75 75 64 64 64 64 ...
>>      $ gender    : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 2 ...
>>      $ education : Factor w/ 8 levels "1","2","3","4",..: 6 7 6 8 2 4
>> 2 4 2 6 ...
>>      $ employment: Factor w/ 9 levels "1","2","3","4",..: 8 4 4 4 3 8
>> 5 8 4 4 ...
>>      $ income    : num  550000 80000 90000 19000 42000 30000 18000
>> 50000 800000 10000 ...
>>      $ home: num  0 0 0 0 0 0 0 0 0 0 ...
>>      $ ideol     : Factor w/ 7 levels "1","2","3","4",..: 2 7 4 3 2 4
>> 2 3 2 6 ...
>>      $ home_lot  : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 3 3 1
>> 2 ...
>>      $ hispanic  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
>>      $ comp_grp2 : Factor w/ 16 levels "Cr_Gr","Cr_Ot",..: 13 13 13
>> 13 13 13 10 10 10 10 ...
>>      $ know_fin : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2
>> ...
>>
>>
>> With the regular glm() function, I get a warning about "perfect or
>> quasi-perfect separation"[1]. I looked for a method to deal with this
>> and a penalized GLM is an accepted method[2]. This is implemented in
>> logistf(). I used the default settings for the function.
>>
>> Just before I run the model, memory.size() for my session is ~4500
>> (MB). memory.limit() is ~25500. When I start the model, R immediately
>> becomes non-responsive. This is in a Windows environment and in Task
>> Manager, the instance of R is, and has been, using ~13% of CPU and
>> ~4997 MB of RAM. It's been ~24 hours now in that state and I don't
>> have any idea of how long this should take. If I run the same model
>> in the same setting with the base glm(), the model runs in about 60
>> seconds. Is there a way to know if the process is going to produce
>> something useful after all this time or if it's hanging on some kind
>> of problem?
>>
>>
>>    [1]:
>> https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression#68917
>>
>>    [2]:
>>
>>
>>
>

--
Men occasionally stumble
over the truth, but most of them
pick themselves up and hurry off