[R] ROCR predictions

Frank Harrell f.harrell at vanderbilt.edu
Thu Aug 19 17:45:23 CEST 2010


At the heart of this you have a problem of incomplete conditioning.
You are computing things like Prob(X > x) when you know X = x.  Working
with a statistician who is well versed in probability models will
undoubtedly help.

Frank

Frank E Harrell Jr   Professor and Chairman        School of Medicine
                      Department of Biostatistics   Vanderbilt University

On Thu, 19 Aug 2010, Assa Yeroslaviz wrote:

> Hello everybody,
>
> yes, I'm sorry. I can see it is not so easy to understand.
> I'll try to explain a bit more. The experiment was used to compare two
> (protein domain) databases and find out whether or not the results found
> in one are comparable to those in the second DB.
> The first column shows the list of the various inputs in the DB, and the
> second lists the various domains found for each gene. The p-value column
> gives the probability that the domain was found by chance, column four
> (Expected) lists the expected domain, and column five (Is Expected) says
> whether the found domain is the expected one.
> The calculation of TP, TN, FP, FN was made many times, each time with a
> different p-value (from p = 1, ..., p = 10E-12) as the threshold used to
> derive the various values of TP, TN, etc.
>
> The goal of this calculation was to find the optimal p-value with a
> maximum of TP and a minimum of FP.
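>
> In R terms, the calculation for a single threshold amounts to roughly the
> following (sketch only, ignoring a few special cases; 'tab' stands for the
> table I sent, with p.value read as numeric and Is.Expected as TRUE/FALSE):
>
>   cutoff <- 0.001                       # example threshold
>   hit <- tab$p.value <= cutoff          # domain call accepted at this cutoff
>   TP  <- sum( hit &  tab$Is.Expected)
>   FP  <- sum( hit & !tab$Is.Expected)
>   FN  <- sum(!hit &  tab$Is.Expected)
>   TN  <- sum(!hit & !tab$Is.Expected)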
>
> To do so I thought about making the column of p-values my predictions and
> the values in the column Is.Expected (TRUE, FALSE) my labels.
> This is how I calculated my first ROC curve:
>> library(ROCR)
>> pValue <- read.delim(file = "p=1.txt", as.is = TRUE)
>> desc1 <- pValue[["p.value"]]
>> label1 <- pValue[["Is.Expected"]]  # recoded to TRUE = 0, FALSE = 1
>
>> pred <- prediction(desc1, label1)
>> perf <- performance(pred, "tpr", "fpr")
>> plot(perf, colorize = TRUE)
>
> My questions are as follows:
> 1. Am I right in my way of thinking that the p-values here are
> predictions?
> I know you said I need to decide that for myself, but I'm not sure. If
> they are, then I will have the same predictions for each and every ROCR
> calculation. Will that make any difference to the prediction?
> 2. How can I calculate the other p-value thresholds? Do I need to do each
> one separately, or is there a way of combining them?
>
> I hope you can still help me with some hints or further advice.
>
> Thanks
>
> Assa
>
> On Wed, Aug 18, 2010 at 07:55, Claudia Beleites <cbeleites at units.it> wrote:
>
>> Dear Assa,
>>
>> you need to call prediction with continuous predictions and a _binary_ true
>> class label.
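>>
>> For instance, roughly like this (untested sketch; 'score' stands for any
>> continuous prediction and 'truth' for a logical or 0/1 vector of the true
>> classes):
>>
>>   library(ROCR)
>>   pred <- prediction(score, truth)
>>   perf <- performance(pred, "tpr", "fpr")
>>   plot(perf, colorize = TRUE)   # the cutoff is colour-coded along the curve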
>>
>> You are the only one who can tell whether the p-values are actually
>> predictions and what the class labels are. For the list readers, p is just
>> the name of some variable; you didn't even vaguely say what you are trying
>> to classify, nor did you offer any explanation of what the columns are.
>>
>> The only information we get from your table is that p-value has small and
>> continuous values. From what I see the p-values could also be fitting errors
>> of the predictions (e.g. expressed as a probability that the similarity to
>> the predicted class is random).
>>
>> Claudia
>>
>> Assa Yeroslaviz wrote:
>>
>>> Dear Claudia,
>>>
>>> Thank you for your fast answer.
>>> I am adding the table of the data again as an example.
>>>
>>> Protein ID   Pfam Domain      p-value      Expected         Is Expected  TP  FN  FP  TN
>>> NP_000011.2  APH              1.15E-05     APH              TRUE          1   0   0   0
>>> NP_000011.2  MutS_V           0.0173       APH              FALSE         0   0   1   0
>>> NP_000062.1  CBS              9.40E-08     CBS              TRUE          1   0   0   0
>>> NP_000066.1  APH              3.83E-06     APH              TRUE          1   0   0   0
>>> NP_000066.1  CobU             0.009        APH              FALSE         0   0   1   0
>>> NP_000066.1  FeoA             0.3975       APH              FALSE         0   0   1   0
>>> NP_000066.1  Phage_integr_N   0.0219       APH              FALSE         0   0   1   0
>>> NP_000161.2  Beta_elim_lyase  6.25E-12     Beta_elim_lyase  TRUE          1   0   0   0
>>> NP_000161.2  Glyco_hydro_6    0.002        Beta_elim_lyase  FALSE         0   0   1   0
>>> NP_000161.2  SurE             0.0059       Beta_elim_lyase  FALSE         0   0   1   0
>>> NP_000161.2  SapB_2           0.0547       Beta_elim_lyase  FALSE         0   0   1   0
>>> NP_000161.2  Runt             0.1034       Beta_elim_lyase  FALSE         0   0   1   0
>>> NP_000204.3  EGF              0.004666118  EGF              TRUE          1   0   0   0
>>> NP_000229.1  PAS              3.13E-06     PAS              TRUE          1   0   0   0
>>> NP_000229.1  zf-CCCH          0.2067       PAS              FALSE         0   1   1   0
>>> NP_000229.1  E_raikovi_mat    0.0206       PAS              FALSE         0   0   0   0
>>> NP_000388.2  NAD_binding_1    8.21E-24     NAD_binding_1    TRUE          1   0   0   0
>>> NP_000388.2  ABM              1.40E-08     NAD_binding_1    FALSE         0   0   1   0
>>> NP_000483.3  MMR_HSR1         1.98E-05     MMR_HSR1         TRUE          1   0   0   0
>>> NP_000483.3  DEAD             2.30E-05     MMR_HSR1         FALSE         0   0   1   0
>>> NP_000483.3  APS_kinase       1.80E-09     MMR_HSR1         FALSE         0   0   1   0
>>> NP_000483.3  CbiA             0.0003       MMR_HSR1         FALSE         0   0   1   0
>>> NP_000483.3  CoaE             1.28E-07     MMR_HSR1         FALSE         0   0   1   0
>>> NP_000483.3  FMN_red          4.61E-08     MMR_HSR1         FALSE         0   0   1   0
>>> NP_000483.3  Fn_bind          0.3855       MMR_HSR1         FALSE         0   0   1   0
>>> NP_000483.3  Invas_SpaK       0.2431       MMR_HSR1         FALSE         0   0   1   0
>>> NP_000483.3  PEP-utilizers    0.127        MMR_HSR1         FALSE         0   0   1   0
>>> NP_000483.3  NIR_SIR_ferr     0.1661       MMR_HSR1         FALSE         0   0   1   0
>>> NP_000483.3  AAA              0.0031       MMR_HSR1         FALSE         0   0   1   0
>>> NP_000483.3  DUF448           0.0021       MMR_HSR1         FALSE         0   0   1   0
>>> NP_000483.3  CBF_beta         0.1201       MMR_HSR1         FALSE         0   0   1   0
>>> NP_000483.3  zf-C3HC4         0.0959       MMR_HSR1         FALSE         0   0   1   0
>>> NP_000560.5  ig               5.69E-39     ig               TRUE          1   0   0   0
>>> NP_000704.1  Epimerase        4.40E-21     Epimerase        TRUE          1   0   0   0
>>> NP_000704.1  Lipase_GDSL      6.63E-11     Epimerase        FALSE         0   0   1   0
>>>
>>> ...
>>>
>>> This is a shortened list from one of the 10 lists I have for the
>>> different p-values.
>>>
>>> As you can see I have separate p-value experiments and probably need to
>>> calculate a separate ROC for each of them. But I don't know how to
>>> calculate these characteristics for the p-values.
>>> How do I assign the predictions to each of the single p-value experiments?
>>>
>>> I would appreciate any help
>>>
>>> Thanks
>>> Assa
>>>
>>>
>>> On Tue, Aug 17, 2010 at 12:55, Claudia Beleites <cbeleites at units.it> wrote:
>>>
>>>    Dear Assa,
>>>
>>>
>>>
>>>        I am having a problem building a ROC curve with my data using
>>>        the ROCR
>>>        package.
>>>
>>>        I have 10 lists of proteins such as attached (proteinlist.xls).
>>>        Each of the
>>>
>>>    your file didn't make it to the list.
>>>
>>>
>>>
>>>        lists was calculated with a different p-value.
>>>        The goal is to find the optimal p-value for the highest number
>>>        of true positives as well as the lowest number of false positives.
>>>
>>>
>>>        As far as I understood the explanations in the ROCR vignette, my
>>>        TP and FP data are the labels for the prediction function. But I
>>>        don't know how to assign the right predictions to these labels.
>>>
>>>
>>>    I assume the p-values are different cutoffs that you use for
>>>    "hardening" (= making yes/no predictions) from some soft (=
>>>    continuous class membership) output of your classifier.
>>>
>>>    Usually, ROCR calculates the curves as a function of the
>>>    cutoff/threshold itself from the continuous predictions. If you have
>>>    these soft predictions, let ROCR do the calculation for you.
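>>>
>>>    Roughly like this (untested sketch; 'p.val' stands for the continuous
>>>    p-values and 'truth' for the TRUE/FALSE reference; small p-values mean
>>>    "positive", so the direction has to be flipped):
>>>
>>>      pred <- prediction(-log10(p.val), truth)  # large score = small p
>>>      perf <- performance(pred, "tpr", "fpr")
>>>      # the slots are lists (one element per run), hence the [[1]]:
>>>      cutoffs <- perf@alpha.values[[1]]
>>>      fpr     <- perf@x.values[[1]]
>>>      tpr     <- perf@y.values[[1]]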
>>>
>>>    If you don't have them, ROCR can calculate your characteristics
>>>    (sens, spec, precision, recall, whatever) for each of the p-values.
>>>    While you could combine the results "by hand" into a
>>>    ROCR performance object and let ROCR do the plotting, it is probably
>>>    easier to do the plotting directly yourself.
>>>
>>>    Don't be shy about looking into the prediction and performance
>>>    objects; I find them pretty obvious. Maybe start with the objects
>>>    produced by the examples.
>>>
>>>    Also, note that ROCR works with binary validation data only. If your
>>>    data has more than one class, you need to make two-class problems
>>>    first (e.g. protein xy vs. not protein xy).
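>>>
>>>    For example (sketch, hypothetical object names):
>>>
>>>      truth <- domain == "xy"      # TRUE for protein xy, FALSE for the rest
>>>      pred  <- prediction(score, truth)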
>>>
>>>
>>>
>>>        BTW, is there a way of finding the optimum on the curve? I mean
>>>        finding the exact value on the ROC curve (see sheet 2 in the
>>>        Excel file for the ROC curve).
>>>
>>>
>>>    Someone asked about the optimum on a ROC curve a couple of months
>>>    ago; an RSiteSearch of the mailing list with ROC and optimal or
>>>    optimum should get you answers.
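>>>
>>>    One common criterion (just one possibility) is Youden's J = tpr - fpr;
>>>    with a tpr/fpr perf object as above, roughly:
>>>
>>>      j    <- perf@y.values[[1]] - perf@x.values[[1]]  # tpr - fpr per cutoff
>>>      best <- which.max(j)
>>>      perf@alpha.values[[1]][best]                     # cutoff with maximal J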
>>>
>>>
>>>
>>>        I would like to thank for any help in advance
>>>
>>>    You're welcome.
>>>
>>>    Claudia
>>>
>>>    --
>>>    Claudia Beleites
>>>    Dipartimento dei Materiali e delle Risorse Naturali
>>>    Università degli Studi di Trieste
>>>    Via Alfonso Valerio 6/a
>>>    I-34127 Trieste
>>>
>>>    phone: +39 0 40 5 58-37 68
>>>    email: cbeleites at units.it
>>>
>>>
>>>
>>
>> --
>> Claudia Beleites
>> Dipartimento dei Materiali e delle Risorse Naturali
>> Università degli Studi di Trieste
>> Via Alfonso Valerio 6/a
>> I-34127 Trieste
>>
>> phone: +39 0 40 5 58-37 68
>> email: cbeleites at units.it
>>
>
>
>

