[R] Prediction from a rank deficient fit may be misleading
David Winsemius
dwinsemius at comcast.net
Fri Mar 11 01:07:09 CET 2016
> On Mar 10, 2016, at 2:21 PM, Michael Artz <michaeleartz at gmail.com> wrote:
>
> Here are the results of the logistic regression model. Is it because of the
> NA values?
It's unclear. The InternetServiceNo level (and the other "No ..." levels) could well be the cause. Questionnaires are often encoded in a manner that causes complete collinearity; the glm function then "aliases" those levels and displays an NA result for their coefficients. I don't remember the predict function then emitting that warning, but it seems possible, and naming the columns of the aliased factors would be well-mannered behavior for software. At any rate, I don't see the absurd sorts of coefficients (such as 10 or 20) that I associate with severe numerical pathology.
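
A minimal sketch of how this kind of aliasing can arise, using made-up data rather than the churn data: the names `toy`, `internet`, `backup` and `churn` are hypothetical. Customers with no internet service necessarily answer "No internet service" for every add-on, so two dummy columns in the model matrix are identical and glm reports NA for one of them.

```r
# Hypothetical toy data (NOT the poster's churn_training data).
set.seed(1)
n <- 200
internet <- factor(sample(c("DSL", "Fiber optic", "No"), n, replace = TRUE))
# Add-on service: forced to "No internet service" when internet == "No".
backup <- factor(ifelse(internet == "No", "No internet service",
                        sample(c("No", "Yes"), n, replace = TRUE)))
churn <- rbinom(n, 1, 0.3)   # outcome unrelated to the predictors
toy <- data.frame(churn, internet, backup)

fit <- glm(churn ~ internet + backup, family = binomial, data = toy)

# The dummy for internet == "No" and the dummy for
# backup == "No internet service" are the same column, so one
# coefficient cannot be estimated and is shown as NA.
print(coef(fit))
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)

# The fitted rank is smaller than the number of model-matrix columns.
rank_deficient <- fit$rank < ncol(model.matrix(fit))
print(rank_deficient)

# Predictions are still computed from the estimable coefficients.
# Whether predict() warns here depends on the R version and on the
# new data, so no particular warning is assumed.
pred <- predict(fit, newdata = toy, type = "response")
```

In the summary output quoted below, the same pattern appears: which of the duplicated columns gets the NA depends on the order of the terms in the formula.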
>
> Call:
> glm(formula = TARGET_A ~ Contract + Dependents + DeviceProtection +
>     gender + InternetService + MonthlyCharges + MultipleLines +
>     OnlineBackup + OnlineSecurity + PaperlessBilling + Partner +
>     PaymentMethod + PhoneService + SeniorCitizen + StreamingMovies +
>     StreamingTV + TechSupport + tenure + TotalCharges,
>     family = binomial(link = "logit"), data = churn_training)
>
> Deviance Residuals:
>     Min       1Q   Median       3Q      Max
> -1.8943  -0.6867  -0.2863   0.7378   3.4259
>
> Coefficients: (7 not defined because of singularities)
>                                        Estimate Std. Error z value Pr(>|z|)
> (Intercept)                           1.0664928  1.7195494   0.620   0.5351
> ContractOne year                     -0.6874005  0.1314227  -5.230 1.69e-07 ***
> ContractTwo year                     -1.2775385  0.2101193  -6.080 1.20e-09 ***
> DependentsYes                        -0.1485301  0.1095348  -1.356   0.1751
> DeviceProtectionNo internet service  -1.5547306  0.9661837  -1.609   0.1076
> DeviceProtectionYes                   0.0459115  0.2114253   0.217   0.8281
> genderMale                           -0.0350970  0.0776896  -0.452   0.6514
> InternetServiceFiber optic            1.4800374  0.9545398   1.551   0.1210
> InternetServiceNo                            NA         NA      NA       NA
> MonthlyCharges                       -0.0324614  0.0379646  -0.855   0.3925
> MultipleLinesNo phone service         0.0808745  0.7736359   0.105   0.9167
> MultipleLinesYes                      0.3990450  0.2131343   1.872   0.0612 .
> OnlineBackupNo internet service              NA         NA      NA       NA
> OnlineBackupYes                      -0.0328892  0.2081145  -0.158   0.8744
> OnlineSecurityNo internet service            NA         NA      NA       NA
> OnlineSecurityYes                    -0.2760602  0.2132917  -1.294   0.1956
> PaperlessBillingYes                   0.3509944  0.0890884   3.940 8.15e-05 ***
> PartnerYes                            0.0306815  0.0940650   0.326   0.7443
> PaymentMethodCredit card (automatic) -0.0710923  0.1377252  -0.516   0.6057
> PaymentMethodElectronic check         0.3074078  0.1137939   2.701   0.0069 **
> PaymentMethodMailed check            -0.0201076  0.1377539  -0.146   0.8839
> PhoneServiceYes                              NA         NA      NA       NA
> SeniorCitizen                         0.1856454  0.1023527   1.814   0.0697 .
> StreamingMoviesNo internet service           NA         NA      NA       NA
> StreamingMoviesYes                    0.5260087  0.3899615   1.349   0.1774
> StreamingTVNo internet service               NA         NA      NA       NA
> StreamingTVYes                        0.4781321  0.3905777   1.224   0.2209
> TechSupportNo internet service               NA         NA      NA       NA
> TechSupportYes                       -0.2511197  0.2181612  -1.151   0.2497
> tenure                               -0.0702813  0.0077113  -9.114  < 2e-16 ***
> TotalCharges                          0.0004276  0.0000874   4.892 9.97e-07 ***
>
> On Thu, Mar 10, 2016 at 4:05 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
>>
>>> On Mar 10, 2016, at 8:08 AM, Michael Artz <michaeleartz at gmail.com> wrote:
>>>
>>> HI all,
>>> I have the following error -
>>>> resultVector <- predict(logitregressmodel, dataset1, type='response')
>>> Warning message:
>>> In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
>>> prediction from a rank-deficient fit may be misleading
>>
>> It wasn't an R error. It was an R warning. Was the `summary` output on
>> logitregressmodel informative? Does the resultVector look sensible given
>> its inputs?
>>
>>
>>> I have seen on internet that there may be some collinearity in the data and
>>> this is causing that. How can I be sure?
>>
>> Do some diagnostics. Start by looking carefully at the output of
>> summary(logitregressmodel), and perhaps summary(dataset1) if it was the
>> original input to the modeling functions. Then you could move on to
>> looking at cross-correlations on the variables you think are continuous,
>> crosstabs on the factor variables, and the condition number of the full
>> data matrix.
>>
>> Lots of material turns up on a search for "detecting collinearity
>> condition number in r"
>>
>>>
>>> Thanks
>>>
>>>
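
The diagnostics suggested in the quoted reply (crosstabs on factors, correlations on continuous variables, and the condition number of the full data matrix) could be sketched roughly as follows. The data are simulated and every name here (`toy`, `phone`, `multi`, `tenure`, `monthly`, `total`) is hypothetical; this is one plausible way to run the checks, not necessarily the exact procedure intended.

```r
# Hypothetical toy data with one built-in structural dependency:
# multi == "No phone service" exactly when phone == "No".
set.seed(2)
n <- 200
phone <- factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.2, 0.8)))
multi <- factor(ifelse(phone == "No", "No phone service",
                       sample(c("No", "Yes"), n, replace = TRUE)))
tenure  <- rpois(n, 30)
monthly <- round(runif(n, 20, 110), 2)
total   <- tenure * monthly
toy <- data.frame(phone, multi, tenure, monthly, total)

# 1. Crosstabs on factor pairs: cells that are structurally zero
#    reveal levels of one factor that determine levels of another.
xt <- table(phone = toy$phone, multi = toy$multi)
print(xt)

# 2. Cross-correlations among the continuous variables.
print(cor(toy[, c("tenure", "monthly", "total")]))

# 3. Rank and condition number of the full model matrix.
X <- model.matrix(~ phone + multi + tenure + monthly + total, data = toy)
x_rank <- qr(X)$rank          # numerical rank
print(c(rank = x_rank, columns = ncol(X)))

sv <- svd(X)$d                # singular values
cond <- max(sv) / min(sv)     # huge (or Inf) when columns are dependent
print(cond)
```

A rank below the column count, or a condition number many orders of magnitude above what the scaling of the variables would explain, points to the same exact collinearity that glm reports as NA coefficients.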
David Winsemius
Alameda, CA, USA