[R] Prediction from a rank deficient fit may be misleading

Fri Mar 11 01:07:09 CET 2016

> On Mar 10, 2016, at 2:21 PM, Michael Artz <michaeleartz at gmail.com> wrote:
> 
> Here are the results of the logistic regression model.  Is it because of the
> NA values?

It's unclear. The InternetServiceNo value (and the other "No ..." levels) could well be the cause. Questionnaires are often encoded in a manner that produces complete collinearity; the glm function then "aliases" the redundant levels and displays NA for their coefficients. I don't remember the predict function emitting that warning in such cases, but it seems possible, and naming the aliased columns in the warning would be well-mannered behavior for software. At any rate, I don't see the absurd sorts of coefficients (such as 10 or 20) that I associate with severe numerical pathology.
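
A minimal sketch of how that aliasing arises, using made-up toy data rather than your churn_training data. The names toy, fit and aliased are hypothetical. Every customer without internet service gets the level "No internet service" in the add-on factor, so that dummy column duplicates the InternetServiceNo column exactly:

```r
# Toy data (hypothetical, NOT the poster's churn data).
set.seed(1)
n <- 200
InternetService <- factor(sample(c("DSL", "Fiber optic", "No"), n, replace = TRUE))

# The add-on question is only meaningful for customers with internet service,
# so "No internet service" occurs exactly when InternetService == "No".
OnlineBackup <- factor(ifelse(InternetService == "No",
                              "No internet service",
                              sample(c("No", "Yes"), n, replace = TRUE)))

y <- rbinom(n, 1, 0.3)  # arbitrary 0/1 outcome, unrelated to the predictors
toy <- data.frame(y, InternetService, OnlineBackup)

fit <- glm(y ~ InternetService + OnlineBackup, family = binomial, data = toy)

# Coefficients reported as NA are the aliased ones.
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)

# alias() reports how each aliased column depends on the retained columns.
print(alias(fit)$Complete)

# The rank of the fit is smaller than the number of coefficients.
print(c(rank = fit$rank, n_coef = length(coef(fit))))
```

Which member of a redundant set is shown as NA depends on the order of the terms in the formula. That matches your output, where DeviceProtection comes first and keeps its "No internet service" estimate while InternetServiceNo and the later "No internet service" levels are NA.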

> 
> Call:
> glm(formula = TARGET_A ~ Contract + Dependents + DeviceProtection +
>    gender + InternetService + MonthlyCharges + MultipleLines +
>    OnlineBackup + OnlineSecurity + PaperlessBilling + Partner +
>    PaymentMethod + PhoneService + SeniorCitizen + StreamingMovies +
>    StreamingTV + TechSupport + tenure + TotalCharges,
>    family = binomial(link = "logit"), data = churn_training)
> 
> Deviance Residuals:
>    Min       1Q   Median       3Q      Max
> -1.8943  -0.6867  -0.2863   0.7378   3.4259
> 
> Coefficients: (7 not defined because of singularities)
>                                       Estimate Std. Error z value Pr(>|z|)
> (Intercept)                           1.0664928  1.7195494   0.620   0.5351
> ContractOne year                     -0.6874005  0.1314227  -5.230 1.69e-07 ***
> ContractTwo year                     -1.2775385  0.2101193  -6.080 1.20e-09 ***
> DependentsYes                        -0.1485301  0.1095348  -1.356   0.1751
> DeviceProtectionNo internet service  -1.5547306  0.9661837  -1.609   0.1076
> DeviceProtectionYes                   0.0459115  0.2114253   0.217   0.8281
> genderMale                           -0.0350970  0.0776896  -0.452   0.6514
> InternetServiceFiber optic            1.4800374  0.9545398   1.551   0.1210
> InternetServiceNo                            NA         NA      NA       NA
> MonthlyCharges                       -0.0324614  0.0379646  -0.855   0.3925
> MultipleLinesNo phone service         0.0808745  0.7736359   0.105   0.9167
> MultipleLinesYes                      0.3990450  0.2131343   1.872   0.0612 .
> OnlineBackupNo internet service              NA         NA      NA       NA
> OnlineBackupYes                      -0.0328892  0.2081145  -0.158   0.8744
> OnlineSecurityNo internet service            NA         NA      NA       NA
> OnlineSecurityYes                    -0.2760602  0.2132917  -1.294   0.1956
> PaperlessBillingYes                   0.3509944  0.0890884   3.940 8.15e-05 ***
> PartnerYes                            0.0306815  0.0940650   0.326   0.7443
> PaymentMethodCredit card (automatic) -0.0710923  0.1377252  -0.516   0.6057
> PaymentMethodElectronic check         0.3074078  0.1137939   2.701   0.0069 **
> PaymentMethodMailed check            -0.0201076  0.1377539  -0.146   0.8839
> PhoneServiceYes                              NA         NA      NA       NA
> SeniorCitizen                         0.1856454  0.1023527   1.814   0.0697 .
> StreamingMoviesNo internet service           NA         NA      NA       NA
> StreamingMoviesYes                    0.5260087  0.3899615   1.349   0.1774
> StreamingTVNo internet service               NA         NA      NA       NA
> StreamingTVYes                        0.4781321  0.3905777   1.224   0.2209
> TechSupportNo internet service               NA         NA      NA       NA
> TechSupportYes                       -0.2511197  0.2181612  -1.151   0.2497
> tenure                               -0.0702813  0.0077113  -9.114  < 2e-16 ***
> TotalCharges                          0.0004276  0.0000874   4.892 9.97e-07 ***
> 
> On Thu, Mar 10, 2016 at 4:05 PM, David Winsemius <dwinsemius at comcast.net>
> wrote:
> 
>> 
>>> On Mar 10, 2016, at 8:08 AM, Michael Artz <michaeleartz at gmail.com> wrote:
>>> 
>>> Hi all,
>>> I have the following error -
>>>> resultVector <- predict(logitregressmodel, dataset1, type='response')
>>> Warning message:
>>> In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
>>> prediction from a rank-deficient fit may be misleading
>> 
>> It wasn't an R error. It was an R warning. Was the `summary` output on
>> logitregressmodel informative? Does the resultVector look sensible given
>> its inputs?
>> 
>> 
>>> I have seen on internet that there may be some collinearity in the data and
>>> this is causing that.  How can I be sure?
>> 
>> Do some diagnostics. Look carefully at the output of
>> summary(logitregressmodel), and perhaps summary(dataset1) if it was the
>> original input to the modeling functions. Then you could move on to
>> cross-correlations on the variables you think are continuous, crosstabs
>> on the factor variables, and the condition number of the full data
>> matrix.
>> 
>> Lots of material turns up on a search for "detecting collinearity
>> condition number in r"
>> 
>>> 
>>> Thanks
>>> 
>>> 
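
The diagnostics suggested in my earlier message (crosstabs on factors, correlations on continuous variables, and the rank and condition number of the design matrix) could be sketched roughly as follows. This again uses made-up toy data, not dataset1 or churn_training. The names toy, xt and X are hypothetical, and TotalCharges is assumed here to be tenure times MonthlyCharges purely for illustration:

```r
# Toy data (hypothetical) with one exactly redundant factor level.
set.seed(2)
n <- 200
InternetService <- factor(sample(c("DSL", "Fiber optic", "No"), n, replace = TRUE))
TechSupport <- factor(ifelse(InternetService == "No",
                             "No internet service",
                             sample(c("No", "Yes"), n, replace = TRUE)))
tenure <- sample(1:72, n, replace = TRUE)
MonthlyCharges <- runif(n, 20, 120)
TotalCharges <- tenure * MonthlyCharges  # illustrative assumption only
toy <- data.frame(InternetService, TechSupport, tenure,
                  MonthlyCharges, TotalCharges)

# 1. Crosstab of two factors: a level that occurs in only one row of the
#    table is completely determined by the other factor.
xt <- table(toy$InternetService, toy$TechSupport)
print(xt)

# 2. Pairwise correlations among the continuous predictors.
print(round(cor(toy[, c("tenure", "MonthlyCharges", "TotalCharges")]), 2))

# 3. Rank and condition number of the full design matrix.
X <- model.matrix(~ InternetService + TechSupport + tenure +
                    MonthlyCharges + TotalCharges, data = toy)
print(c(rank = qr(X)$rank, columns = ncol(X)))  # rank < columns: deficient
print(kappa(X, exact = TRUE))                   # very large when near-singular
```

A rank below the number of columns confirms exact collinearity, and the crosstab shows which levels are responsible.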

David Winsemius
Alameda, CA, USA