[R] ROCR predictions
Frank Harrell
f.harrell at vanderbilt.edu
Thu Aug 19 17:45:23 CEST 2010
At the heart of this you have a problem in incomplete conditioning.
You are computing things like Prob(X > x) when you know X=x. Working
with a statistician who is well versed in probability models will
undoubtedly help.
Frank
Frank E Harrell Jr, Professor and Chairman
Department of Biostatistics, School of Medicine, Vanderbilt University
On Thu, 19 Aug 2010, Assa Yeroslaviz wrote:
> Hello everybody,
>
> Yes, I'm sorry. I can see it is not so easy to understand.
> I'll try to explain a bit more. The experiment was used to compare two
> (protein domain) databases and find out whether or not the results found
> in one are comparable to those in the second DB.
> The first column shows the list of the various inputs in the DB, the second
> lists the various domains for each gene. The p-value column gives the
> probability that the domain in column four (Expected) was found by chance.
> Column five lists the expected values.
> The calculation of the TP,TN,FP,FN was made many times, each time with a
> different p-value (from p=1,...,p=10E-12) as a threshold to calculate the
> various values of TP,TN, etc.
>
> The goal of this calculation was to find the optimal p-value with a maximum
> of TP and a minimum of FP.
>
> To do so I thought about making the column of p-values my predictions and
> the values in the column Is.Expected (TRUE, FALSE) my labels.
> This is how I calculated my first ROC curve:
>> library(ROCR)
>> pValue <- read.delim(file = "p=1.txt", as.is = TRUE)
>> desc1 <- pValue[["p.value"]]
>> # after changing the values to TRUE = 0, FALSE = 1
>> label1 <- pValue[["Is.Expected"]]
>
>> pred <- prediction(desc1, label1)
>> perf <- performance(pred, "tpr", "fpr")
>> plot(perf, colorize = TRUE)
>
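A minimal sketch of the call above, assuming the p-value really is used as the score. ROCR's prediction() treats larger scores as evidence for the positive class, so one option is to turn a small p-value into a large score with -log10 and keep TRUE as the positive label, rather than swapping the label coding. The toy values are the first seven rows of the table quoted further down; the names p_value, is_expected, score and labels are illustrative, not from the original script.

```r
# Toy data: first seven rows of the quoted table (p-value, Is Expected)
p_value     <- c(1.15e-05, 0.0173, 9.40e-08, 3.83e-06, 0.009, 0.3975, 0.0219)
is_expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

# Small p-value = strong evidence, so flip the scale: larger score = "expected"
score  <- -log10(p_value)
labels <- as.integer(is_expected)   # 1 = expected domain (positive), 0 = not

# ROCR is optional here; the block still runs if the package is not installed
if (requireNamespace("ROCR", quietly = TRUE)) {
  library(ROCR)
  pred <- prediction(score, labels)
  perf <- performance(pred, "tpr", "fpr")
  plot(perf, colorize = TRUE)       # colour encodes the score cutoff
}
```

With this approach one prediction object covers every cutoff at once, so the ten per-threshold files would not each need their own ROC curve.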
> My questions are as follows:
> 1. Am I right in my way of thinking, that the p-values here are
> predictions?
> I know you said I need to decide it for myself, but I'm not sure. If they
> are, then I will have the same predictions for each and every calculation of
> ROCR. Will it make any difference to the prediction?
> 2. How can I calculate the other p-value thresholds? Do I need to do each
> separately, or is there a way of combining them?
>
> I hope you can still help me with some hints or further advice.
>
> Thanks
>
> Assa
>
> On Wed, Aug 18, 2010 at 07:55, Claudia Beleites <cbeleites at units.it> wrote:
>
>> Dear Assa,
>>
>> you need to call prediction with continuous predictions and a _binary_ true
>> class label.
>>
>> You are the only one who can tell whether the p-values are actually
>> predictions and what the class labels are. For the list readers p is just
>> the name of whatever variable, and you didn't even vaguely say what you try
>> to classify, nor did you offer any explanation of what the columns are.
>>
>> The only information we get from your table is that p-value has small and
>> continuous values. From what I see the p-values could also be fitting errors
>> of the predictions (e.g. expressed as a probability that the similarity to
>> the predicted class is random).
>>
>> Claudia
>>
>> Assa Yeroslaviz wrote:
>>
>>> Dear Claudia,
>>>
>>> thank you for your fast answer.
>>> I add again the table of the data as an example.
>>>
>>> Protein ID   Pfam Domain      p-value      Expected         Is Expected  TP  FN  FP  TN
>>> NP_000011.2  APH              1.15E-05     APH              TRUE         1   0   0   0
>>> NP_000011.2  MutS_V           0.0173       APH              FALSE        0   0   1   0
>>> NP_000062.1  CBS              9.40E-08     CBS              TRUE         1   0   0   0
>>> NP_000066.1  APH              3.83E-06     APH              TRUE         1   0   0   0
>>> NP_000066.1  CobU             0.009        APH              FALSE        0   0   1   0
>>> NP_000066.1  FeoA             0.3975       APH              FALSE        0   0   1   0
>>> NP_000066.1  Phage_integr_N   0.0219       APH              FALSE        0   0   1   0
>>> NP_000161.2  Beta_elim_lyase  6.25E-12     Beta_elim_lyase  TRUE         1   0   0   0
>>> NP_000161.2  Glyco_hydro_6    0.002        Beta_elim_lyase  FALSE        0   0   1   0
>>> NP_000161.2  SurE             0.0059       Beta_elim_lyase  FALSE        0   0   1   0
>>> NP_000161.2  SapB_2           0.0547       Beta_elim_lyase  FALSE        0   0   1   0
>>> NP_000161.2  Runt             0.1034       Beta_elim_lyase  FALSE        0   0   1   0
>>> NP_000204.3  EGF              0.004666118  EGF              TRUE         1   0   0   0
>>> NP_000229.1  PAS              3.13E-06     PAS              TRUE         1   0   0   0
>>> NP_000229.1  zf-CCCH          0.2067       PAS              FALSE        0   1   1   0
>>> NP_000229.1  E_raikovi_mat    0.0206       PAS              FALSE        0   0   0   0
>>> NP_000388.2  NAD_binding_1    8.21E-24     NAD_binding_1    TRUE         1   0   0   0
>>> NP_000388.2  ABM              1.40E-08     NAD_binding_1    FALSE        0   0   1   0
>>> NP_000483.3  MMR_HSR1         1.98E-05     MMR_HSR1         TRUE         1   0   0   0
>>> NP_000483.3  DEAD             2.30E-05     MMR_HSR1         FALSE        0   0   1   0
>>> NP_000483.3  APS_kinase       1.80E-09     MMR_HSR1         FALSE        0   0   1   0
>>> NP_000483.3  CbiA             0.0003       MMR_HSR1         FALSE        0   0   1   0
>>> NP_000483.3  CoaE             1.28E-07     MMR_HSR1         FALSE        0   0   1   0
>>> NP_000483.3  FMN_red          4.61E-08     MMR_HSR1         FALSE        0   0   1   0
>>> NP_000483.3  Fn_bind          0.3855       MMR_HSR1         FALSE        0   0   1   0
>>> NP_000483.3  Invas_SpaK       0.2431       MMR_HSR1         FALSE        0   0   1   0
>>> NP_000483.3  PEP-utilizers    0.127        MMR_HSR1         FALSE        0   0   1   0
>>> NP_000483.3  NIR_SIR_ferr     0.1661       MMR_HSR1         FALSE        0   0   1   0
>>> NP_000483.3  AAA              0.0031       MMR_HSR1         FALSE        0   0   1   0
>>> NP_000483.3  DUF448           0.0021       MMR_HSR1         FALSE        0   0   1   0
>>> NP_000483.3  CBF_beta         0.1201       MMR_HSR1         FALSE        0   0   1   0
>>> NP_000483.3  zf-C3HC4         0.0959       MMR_HSR1         FALSE        0   0   1   0
>>> NP_000560.5  ig               5.69E-39     ig               TRUE         1   0   0   0
>>> NP_000704.1  Epimerase        4.40E-21     Epimerase        TRUE         1   0   0   0
>>> NP_000704.1  Lipase_GDSL      6.63E-11     Epimerase        FALSE        0   0   1   0
>>>
>>> (TP = True Positive, FN = False Negative, FP = False Positive, TN = True Negative)
>>>
>>> ...
>>>
>>> This is a shortened list from one of the 10 lists I have for different
>>> p-values.
>>>
>>> As you can see I have separate p-value experiments and probably need to
>>> calculate for each of them a separate ROC. But I don't know how to calculate
>>> these characteristics for the p-values.
>>> How do I assign the predictions to each of the single p-value experiments?
>>>
>>> I would appreciate any help
>>>
>>> Thanks
>>> Assa
>>>
>>>
>>> On Tue, Aug 17, 2010 at 12:55, Claudia Beleites <cbeleites at units.it> wrote:
>>>
>>> Dear Assa,
>>>
>>>
>>>
>>> I am having a problem building a ROC curve with my data using the ROCR
>>> package.
>>>
>>> I have 10 lists of proteins such as attached (proteinlist.xls).
>>> each of the
>>>
>>> your file didn't make it to the list.
>>>
>>>
>>>
>>> lists was calculated with a different p-value.
>>> The goal is to find the optimal p-value for the highest number of true
>>> positives as well as the lowest number of false positives.
>>>
>>>
>>> As far as I understood the explanations from the vignette of ROCR, my
>>> data of TP and FP are the labels of the prediction function. But I
>>> don't know how to assign the right predictions to these labels.
>>>
>>>
>>> I assume the p-values are different cutoffs that you use for
>>> "hardening" (= making yes/no predictions) from some soft (=
>>> continuous class membership) output of your classifier.
>>>
>>> Usually, ROCR calculates the curves as a function of the
>>> cutoff/threshold itself from the continuous predictions. If you have
>>> these soft predictions, let ROCR do the calculation for you.
>>>
>>> If you don't have them, ROCR can calculate your characteristics
>>> (sens, spec, precision, recall, whatever) for each of the p-values.
>>> While you could combine the results "by hand" into a
>>> ROCR-performance object and let ROCR do the plotting, it is then
>>> probably easier if you plot directly yourself.
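The "plot directly yourself" route can be sketched in base R, without ROCR: harden the p-values at each cutoff, count TP/FP/FN/TN, and plot TPR against FPR. This is only an illustration under assumptions: confusion_at() is a hypothetical helper, a hit is called positive when its p-value is at or below the cutoff, the cutoff grid 1, 0.1, ..., 1e-12 is a guess at the thresholds mentioned in the thread, and the toy data are the first seven rows of the quoted table.

```r
# Hypothetical helper: confusion counts and rates at one p-value cutoff.
# A hit counts as "predicted positive" when its p-value is <= cutoff.
confusion_at <- function(cutoff, p_value, is_expected) {
  called <- p_value <= cutoff
  tp <- sum(called & is_expected)
  fp <- sum(called & !is_expected)
  fn <- sum(!called & is_expected)
  tn <- sum(!called & !is_expected)
  c(cutoff = cutoff, TP = tp, FP = fp, FN = fn, TN = tn,
    TPR = tp / (tp + fn), FPR = fp / (fp + tn))
}

# Toy data: first seven rows of the quoted table
p_value     <- c(1.15e-05, 0.0173, 9.40e-08, 3.83e-06, 0.009, 0.3975, 0.0219)
is_expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

# Assumed cutoff grid: 1, 0.1, ..., 1e-12
cutoffs <- 10^(0:-12)

# One row per cutoff
roc_table <- as.data.frame(t(sapply(cutoffs, confusion_at,
                                    p_value = p_value,
                                    is_expected = is_expected)))

# ROC points drawn by hand
plot(roc_table$FPR, roc_table$TPR, type = "b",
     xlab = "False positive rate", ylab = "True positive rate")
```

Each row of roc_table corresponds to one of the per-threshold lists, so the separately computed TP/FP counts can be checked against it.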
>>>
>>> Don't be shy to look into the prediction and performance objects, I
>>> find them pretty obvious. Maybe start with the objects produced by
>>> the examples.
>>>
>>> Also, note ROCR works with binary validation data only. If your data
>>> has more than one class, you need to make two-class-problems first
>>> (e.g. protein xy ./. not protein xy).
>>>
>>>
>>>
>>> BTW, is there a way of finding the optimum in the curve? I mean to find
>>> the exact value in the ROC curve (see sheet 2 in the Excel file for the
>>> ROC curve).
>>>
>>>
>>> Someone asked for optimum on ROC a couple of months ago, RSiteSearch
>>> on the mailing list with ROC and optimal or optimum should get you
>>> answers.
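One common criterion for an "optimal" point, not necessarily the one recommended in the earlier list discussion, is Youden's J: the cutoff that maximises TPR - FPR. A minimal base-R sketch, assuming the p-value is the score, a hit is called positive when its p-value is at or below the cutoff, and using the first seven rows of the quoted table as toy data; all variable names are illustrative.

```r
# Toy data: first seven rows of the quoted table
p_value     <- c(1.15e-05, 0.0173, 9.40e-08, 3.83e-06, 0.009, 0.3975, 0.0219)
is_expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

# Candidate cutoffs: every observed p-value
cutoffs <- sort(unique(p_value))

# Rates when a hit is called positive for p-value <= cutoff
tpr <- sapply(cutoffs, function(k) mean(p_value[is_expected] <= k))
fpr <- sapply(cutoffs, function(k) mean(p_value[!is_expected] <= k))

# Youden's J = TPR - FPR; pick the cutoff where it is largest
youden <- tpr - fpr
best   <- cutoffs[which.max(youden)]
print(best)
```

Other criteria (for example weighting false positives more heavily than false negatives) give different optima, so the choice depends on the relative costs of the two error types.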
>>>
>>>
>>>
>>> I would like to thank for any help in advance
>>>
>>> You're welcome.
>>>
>>> Claudia
>>>
>>> -- Claudia Beleites
>>> Dipartimento dei Materiali e delle Risorse Naturali
>>> Università degli Studi di Trieste
>>> Via Alfonso Valerio 6/a
>>> I-34127 Trieste
>>>
>>> phone: +39 0 40 5 58-37 68
>>> email: cbeleites at units.it
>>>
>>>
>>>
>>
>> --
>> Claudia Beleites
>> Dipartimento dei Materiali e delle Risorse Naturali
>> Università degli Studi di Trieste
>> Via Alfonso Valerio 6/a
>> I-34127 Trieste
>>
>> phone: +39 0 40 5 58-37 68
>> email: cbeleites at units.it
>>
>