[R-sig-eco] Pred function - miss understanding?

Mon Aug 30 09:02:09 CEST 2010

> I have generated a ROC plot and a calibration curve (attached)
>
> Also calculated the AUC 0.7201762
>
> However, if I am honest, I am unsure where to go from here.
>
> 1. How does this tell me how effective the model is at predicting the 
> response?

An AUC of 0.72 means that, among all possible pairs of one threatened and 
one unthreatened species in the training data, your model assigns the higher 
probability to the unthreatened species in about 28% of pairs (and to the 
threatened species in about 72%). Performance is likely to be poorer in 
another sample, because estimates of model performance computed on the 
training data are considerably optimistic. Try to evaluate the model with an 
independent sample or, if that is not possible, try internal validation by 
bootstrap using the validate function from the Design package. 
Bootstrap-based validation performs better than other approaches (see 
Steyerberg et al., 2001. Internal validation of predictive models: 
Efficiency of some procedures for logistic regression analysis. Journal of 
Clinical Epidemiology, 54 (8): 774-781).
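
A minimal base-R sketch of the two ideas above: the pairwise reading of AUC, 
and a bootstrap estimate of how optimistic the training-data AUC is. The 
data are simulated, and the names dat, x1, x2, threat and auc are 
hypothetical, not taken from the original model. It only illustrates the 
principle and is not a substitute for the validate function.

```r
# Illustrative only: simulated data and hypothetical variable names.
set.seed(1)
n <- 300
dat <- data.frame(
  x1 = rnorm(n),
  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
lp <- -0.5 + 0.8 * dat$x1 + 0.6 * (dat$x2 == "b")
dat$threat <- rbinom(n, 1, plogis(lp))

# AUC as the proportion of threatened/unthreatened pairs in which the
# threatened species gets the higher prediction (ties count one half).
auc <- function(p, y) {
  pos <- p[y == 1]
  neg <- p[y == 0]
  d <- outer(pos, neg, "-")
  mean((d > 0) + 0.5 * (d == 0))
}

fit <- glm(threat ~ x1 + x2, data = dat, family = binomial)
auc_apparent <- auc(fitted(fit), dat$threat)

# Bootstrap optimism: refit the model in each resample, then compare its AUC
# on the resample with its AUC on the original data.
B <- 200
optimism <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  boot <- dat[idx, ]
  fit_b <- glm(threat ~ x1 + x2, data = boot, family = binomial)
  auc(fitted(fit_b), boot$threat) -
    auc(predict(fit_b, newdata = dat, type = "response"), dat$threat)
})

# Optimism-corrected AUC: apparent AUC minus the average optimism.
auc_corrected <- auc_apparent - mean(optimism)
print(c(apparent = auc_apparent, corrected = auc_corrected))
```

The corrected value is the one to report if no independent sample is 
available; the whole model-fitting procedure (including any variable 
selection) should be repeated inside the bootstrap loop.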

> 2. How can I use this information to predict a response from my test data 
> set, i.e. if I only have the factors and I want to know whether a species 
> is threatened or not?

If you want to transform predictions from probabilities to binary data, you 
have to choose a probability threshold. A simple approach is to use the 
prevalence in the training data as the threshold. There are other 
approaches, but I haven't used them (see Liu et al., 2005. Selecting 
thresholds of occurrence in the prediction of species distributions. 
Ecography, 28: 385-393, or Jimenez-Valverde & Lobo, 2007. Threshold criteria 
for conversion of probability of species presence to either-or 
presence-absence. Acta Oecologica, 31 (3): 361-369).
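
A minimal sketch of the prevalence threshold described above, again on 
simulated data. The names make_data, train, test, x1 and threat are 
hypothetical. In a real test set the threat status would be unknown, so the 
final cross-tabulation is only possible here because the data are simulated.

```r
# Illustrative only: simulated training and test data, hypothetical names.
set.seed(2)
make_data <- function(n) {
  d <- data.frame(x1 = rnorm(n))
  d$threat <- rbinom(n, 1, plogis(-1 + d$x1))
  d
}
train <- make_data(300)
test <- make_data(100)

fit <- glm(threat ~ x1, data = train, family = binomial)

# Threshold = proportion of threatened species in the training data.
threshold <- mean(train$threat)

# Predicted probabilities for the new species, then binary classes.
p_test <- predict(fit, newdata = test, type = "response")
pred_class <- as.integer(p_test > threshold)

# Cross-tabulate predictions against the (here simulated) true status.
print(table(predicted = pred_class, observed = test$threat))
```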

Hope this helps,

Aitor

--------------------------------------------------
From: "Chris Mcowen" <chrismcowen at gmail.com>
Sent: Friday, August 27, 2010 3:47 PM
To: "Aitor GastónGonzález" <aitor.gaston at upm.es>
Subject: Re: [R-sig-eco] Pred function - miss understanding?

> Aitor,
>
> Thanks very much for this, I am very grateful.
>
> I have generated a ROC plot and a calibration curve (attached)
>
> Also calculated the AUC 0.7201762
>
> However, if I am honest, I am unsure where to go from here.
>
> 1. How does this tell me how effective the model is at predicting the 
> response?
>
> 2. How can I use this information to predict a response from my test data 
> set, i.e. if I only have the factors and I want to know whether a species 
> is threatened or not?
>
> Thanks very much
>
> Chris
>
> On 27 Aug 2010, at 00:07, Aitor GastónGonzález wrote:
>
> Chris,
>
> The predicted probabilities of a binomial GLM (i.e., logistic regression)
> should not be interpreted as absolute values; they largely depend on
> the prevalence in the training sample (the proportion of threatened
> species in your case).
>
> I understand that you are interested in evaluating the predictive
> performance of the model. There are many statistics to evaluate the
> predictive performance of a logistic regression model. If you want to
> use the predictions to rank species according to extinction risk you may
> focus on discrimination, e.g. AUC (area under ROC curve). AUC may be
> interpreted as the probability that the prediction for a threatened
> species chosen at random is larger than the prediction for a
> non-threatened species chosen at random. If you are concerned with the
> reliability of the predictions (i.e., level of agreement between
> predicted and actual probabilities) you may evaluate calibration (e.g.
> calibration slope). If your model is well calibrated, you should find
> approximately 50% of threatened species among those that yielded a
> predicted probability of 0.5, 30% among those that yielded 0.3 and so on.
>
> You can try the val.prob function of the Design package to calculate
> discrimination and calibration measures. You will find useful advice on
> evaluating the predictive performance of logistic regression models in
> either of these books:
>
> Harrell, F.E., 2001. Regression Modeling Strategies: With Applications to
> Linear Models, Logistic Regression, and Survival Analysis. Springer, New
> York, NY, USA, 568 pp.
>
> Steyerberg, E.W., 2009. Clinical Prediction Models: A Practical Approach
> to Development, Validation, and Updating. Springer, New York, NY, USA,
> 497 pp.
>
> If your sample is not very large, you may want to consider a simpler
> model. If the factors used as predictors have several levels and the
> training sample size is limited, your model may be overfitted. At least
> 10 events (the number of threatened species, or of unthreatened species
> if they are less frequent) per estimated parameter are recommended (note
> that each factor with k levels will "spend" k-1 parameters).
>
> Hope this helps,
>
> Aitor
>
>
>> Dear List,
>>
>> I am trying to predict the extinction risk of a species based on its life 
>> history. I will detail my method below and would welcome comments as to 
>> why the results are not as I expected.
>>
>>
>> First I fit my model:
>>
>>> model1 <- glm(THREAT~ HAB*BS + FR + WO + SEA + PD, data=traits, 
>>> family="binomial")
>>
>> Where THREAT is TRUE (1) / FALSE (0).
>>
>> Where BS, FR etc are factors with multiple levels.
>>
>>
>> I then predicted the probability of a species being threatened or not 
>> using
>>
>>> print(predict(model1, type = "response"))
>>
>> example output:-
>>
>>      1          2          3          4          5          6          7
>> 0.44659200 0.65221495 0.71357243 0.71357243 0.71357243 0.71357243 
>> 0.71357243
>>      8          9         10         11         12         13         14
>> 0.71357243 0.65221495 0.65221495 0.65221495 0.65221495 0.65221495 
>> 0.65221495
>>
>> I interpret this as species 1 having a 45% chance (probability) of being 
>> threatened, etc.
>>
>> I then wanted to see how this relates to the "true" threat level, so I 
>> looked at species 1 and it was classed as threatened, which disagrees 
>> with the predict results, although marginally. In fact most of the 
>> predict results do not agree with the "real" threat level; some species 
>> have a probability of 0.17, which to me says they are non-threatened, but 
>> in reality they are classed as threatened.
>>
>> This is important: if these do not match, at least most of the 
>> time, then how can I confidently predict the response of a species when I 
>> don't know its "real" response?
>>
>> I hope this makes sense.
>>
>> Chris