[R] convenient way to calculate specificity, sensitivity and accuracy from raw data
drflxms
drflxms at googlemail.com
Tue Sep 2 11:00:09 CEST 2008
Hello Dimitris, Hello Gabor,
absolutely incredible! I can't tell you how happy I am about your code
which worked out of the box and saved me from days of boring and stupid
Excel-handwork. Thank you a thousand times!
Just for other newbies who might face a similar problem, I'd like to
add a few closing remarks on how I do the calculation now:
The read.table command is not necessary in my case, because I already
have ready-to-use data.frames that I created with the "reshape" package.
So I started with the line:
pairs<-data.frame(pred=factor(unlist(input.frame[2:21])),ref=factor(input.frame[,22]))
# explanation for other newbies: creates a data.frame named pairs with
two columns. The column pred(iction) holds the values from columns 2-21
of the original data.frame "input.frame", stacked on top of each other;
in my specific case these are all the observations the medical doctors
made. The column ref(erence) holds the observations from the
gold-standard (column 22), which are assumed to be the truth; they are
repeated once per doctor, so that every prediction is paired with its
reference value.
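A small sketch may make the stacking and recycling easier to see. The data frame `toy` below (two raters, one gold-standard column, three videos) is invented purely for illustration and is not the study data; only base R is used.

```r
# Hypothetical toy data: 3 videos, 2 raters (r1, r2), 1 gold-standard column
toy <- data.frame(video = 1:3,
                  r1    = c(0, 1, 0),
                  r2    = c(1, 1, 0),
                  gold  = c(0, 1, 1))

# unlist() concatenates the rater columns into one vector of length 6;
# data.frame() recycles the 3 gold-standard values twice to match
toy.pairs <- data.frame(pred = factor(unlist(toy[2:3])),
                        ref  = factor(toy[, 4]))

nrow(toy.pairs)               # one row per (rater, video) pair
as.character(toy.pairs$pred)  # r1's answers followed by r2's answers
as.character(toy.pairs$ref)   # gold standard, repeated once per rater
```

The recycling only works silently because the number of stacked predictions is a whole multiple of the number of gold-standard values; otherwise data.frame() stops with an error.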
pred <- pairs$pred
# saves column "pred" of the "pairs" data.frame as a vector named "pred"
lab <- pairs$ref
# saves column "ref" of the "pairs" data.frame as a vector named "lab"
library(caret)
# loads the "caret" package
confusionMatrix(pred, lab, positive="1")
# creates a confusion matrix with sensitivity, specificity, accuracy,
kappa and much more. The second argument is the reference vector "lab"
created above; positive="1" names the factor level that counts as a
positive result. Please see the documentation (?confusionMatrix) for
details.
Example output for the data.frame I sent with my original question:
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 656 122
         1  24  38

               Accuracy : 0.8262
                 95% CI : (0.7988, 0.8512)
    No Information Rate : 0.8095
    P-Value [Acc > NIR] : 0.117
                  Kappa : 0.264
            Sensitivity : 0.2375
            Specificity : 0.9647
         Pos Pred Value : 0.6129
         Neg Pred Value : 0.8432
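To show where these numbers come from, here is a minimal base-R sketch that recomputes them from the four counts printed in the confusion matrix, with "1" treated as the positive class. The variable names (tab, TP, TN, FP, FN, stats) are chosen for illustration only; this is the textbook arithmetic, not caret's internal code.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
tab <- matrix(c(656, 24, 122, 38), nrow = 2,
              dimnames = list(Prediction = c("0", "1"),
                              Reference  = c("0", "1")))

TP <- tab["1", "1"]   # predicted 1, reference 1
TN <- tab["0", "0"]   # predicted 0, reference 0
FP <- tab["1", "0"]   # predicted 1, reference 0
FN <- tab["0", "1"]   # predicted 0, reference 1

stats <- c(accuracy    = (TP + TN) / sum(tab),
           sensitivity = TP / (TP + FN),
           specificity = TN / (TN + FP),
           ppv         = TP / (TP + FP),
           npv         = TN / (TN + FN))
round(stats, 4)
# accuracy sensitivity specificity ppv    npv
# 0.8262   0.2375      0.9647      0.6129 0.8432
```

The "No Information Rate" is simply the share of the most frequent reference class, here 680/840 = 0.8095.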
This works not only for input data that consists of 2 result-classes
like true or false, but for data with multiple categories/result-classes
as well! See example output:
Confusion Matrix and Statistics

          Reference
Prediction   0   1  10  11 100 101 110
       0   349  31  60  40  66   1  15
       1    25  80   1  22   3  17   3
       10    0   1  24   8   3   1  10
       11    1   6   5   3   1   0   2
       100   3   1   6   7  24   0   5
       101   0   0   0   0   0   1   0
       110   2   1   4   0   3   0   5

Overall Statistics

               Accuracy : 0.5786
                 95% CI : (0.5444, 0.6122)
    No Information Rate : 0.4524
    P-Value [Acc > NIR] : 1.506e-13
                  Kappa : 0.3571

Statistics by Class:

           Sensitivity Specificity Pos Pred Value Neg Pred Value
Class: 0        0.9184      0.5370         0.6210         0.8885
Class: 1        0.6667      0.9014         0.5298         0.9419
Class: 10       0.2400      0.9689         0.5106         0.9042
Class: 11       0.0375      0.9803         0.1667         0.9063
Class: 100      0.2400      0.9703         0.5217         0.9043
Class: 101      0.0500      1.0000         1.0000         0.9774
Class: 110      0.1250      0.9875         0.3333         0.9576
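The per-class figures follow a "one class against all others" logic: for each class, the diagonal cell counts as true positives and everything else is collapsed into a 2x2 table. A base-R sketch of that arithmetic, using the counts printed above, could look like this (the names lev, tab and by.class are illustrative, and this is not caret's own implementation):

```r
lev <- c("0", "1", "10", "11", "100", "101", "110")

# Counts copied row by row from the confusion matrix above
tab <- matrix(c(349, 31, 60, 40, 66,  1, 15,
                 25, 80,  1, 22,  3, 17,  3,
                  0,  1, 24,  8,  3,  1, 10,
                  1,  6,  5,  3,  1,  0,  2,
                  3,  1,  6,  7, 24,  0,  5,
                  0,  0,  0,  0,  0,  1,  0,
                  2,  1,  4,  0,  3,  0,  5),
              nrow = 7, byrow = TRUE,
              dimnames = list(Prediction = lev, Reference = lev))

n  <- sum(tab)
TP <- diag(tab)            # predicted class k, reference class k
FN <- colSums(tab) - TP    # reference is class k, prediction is not
FP <- rowSums(tab) - TP    # prediction is class k, reference is not
TN <- n - TP - FN - FP     # neither prediction nor reference is class k

by.class <- cbind(Sensitivity  = TP / (TP + FN),
                  Specificity  = TN / (TN + FP),
                  PosPredValue = TP / (TP + FP),
                  NegPredValue = TN / (TN + FN))
round(by.class, 4)

sum(TP) / n   # overall accuracy: share of cells on the diagonal
```

Rounded to four digits, the first row reproduces the values reported for class 0 (0.9184, 0.5370, 0.6210, 0.8885), and the overall accuracy is 486/840 = 0.5786.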
This is much more than I had ever expected! (Thank you to Max Kuhn, the
creator of the "caret" package!)
The code from Dimitris (see below) perfectly reproduces the way I did
the calculation in Excel by hand. Wow! This is very instructive for me.
I had never thought about real programming, because I always believed it
was far beyond me. But as I now try to understand the code that
solves "my problem", I'll re-think this. It is still "magic" to me, but
magic one can learn ;-). I would definitely like to become a so(u)rcerer's
apprentice :-).
So again thank you for your quick and efficient help! Great software,
great community. I am really happy that I decided "against all odds"
and advice from colleagues not to use SPSS or SAS, but to learn R. I
had never thought that I might succeed in evaluating the results of our
small study in just a few weeks on my own using R.
Cheers,
Felix.
Dimitris Rizopoulos wrote:
> try something like this:
>
>
> dat <- read.table(textConnection("video 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
> 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
> 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
> 6 6 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
> 7 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 8 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
> 9 9 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 0 0 0 1 0
> 10 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 11 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 12 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 13 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 14 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 15 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 16 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 17 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 18 18 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
> 19 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 20 20 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 21 21 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
> 22 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 23 23 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0 0 1 0 0 0 0
> 24 24 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 1
> 25 25 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0
> 26 26 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
> 27 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 28 28 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 29 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 30 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 31 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 32 32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 33 33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 34 34 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 35 35 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 36 36 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 37 37 0 1 1 0 1 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 1
> 38 38 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 39 39 0 1 0 0 1 0 0 1 0 1 1 0 1 1 0 0 1 1 0 1 1
> 40 40 1 1 1 1 1 0 1 0 0 0 0 1 1 1 1 0 0 1 0 0 1
> 41 41 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
> 42 42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"),
> header = TRUE)
> closeAllConnections()
>
> goldstand <- dat$X21
> prev <- sum(goldstand)
> cprev <- sum(!goldstand)
> n <- prev + cprev
> lapply(dat[-1], function(x){
> tab <- table(x, goldstand)
> cS <- colSums(tab)
> if(nrow(tab) > 1 && ncol(tab) > 1) {
> out <- c(sp = tab[1,1], sn = tab[2,2]) / cS
> c(out, ac = (out[1] * cprev + out[2] * prev) / n)
> }
> })
>
>
> I hope it helps.
>
> Best,
> Dimitris
>
> Quoting drflxms <drflxms at googlemail.com>:
>
>> Dear R-colleagues,
>>
>> this is a question from a R-newbie medical doctor:
>>
>> I am evaluating data on inter-observer-reliability in endoscopy. 20
>> medical doctors judged 42 videos filling out a multiple choice survey
>> for each video. The overall-data is organized in a classical way:
>> observations (items from the multiple choice survey) as columns, each
>> case (identified by the two columns "number of medical doctor" and
>> "number of video") in a row. In addition there is a medical doctor
>> number 21 who is assumed to be a gold-standard.
>>
>> As measure of inter-observer-agreement I calculated kappa according to
>> Fleiss and simple agreement in percent using the routines
>> "kappam.fleiss" and "agree" from the irr-package. Everything worked fine
>> so far.
>>
>> Now I'd like to calculate specificity, sensitivity and accuracy for each
>> item (compared to the gold-standard), as these are well-known and easy
>> to understand quantities for medical doctors.
>>
>> Unfortunately I haven't found a feasible way to do this in R so far. All
>> solutions I found, describe calculation of specificity, sensitivity and
>> accuracy from a contingency-table / confusion-matrix only. For me it is
>> very difficult to create such contingency-tables / confusion-matrices
>> from the raw data I have.
>>
>> So I started to do it in Excel by hand - a lot of work! If I keep
>> on doing this, I'll miss the deadline. So maybe someone can help me out:
>>
>> It would be very convenient, if there were a way to calculate specificity,
>> sensitivity and accuracy from the very same data.frames I created for
>> the calculation of kappa and agreement. In these data.frames, which were
>> generated from the overall-data-table described above using the
>> "reshape" package, we have the judging medical doctor in the columns and
>> the videos in the rows. In the cells there are the coded answer-options
>> from the multiple choice survey. Please see a simple example with
>> answer-options 0/1 (copied from R console) below:
>>
>> video 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
>> 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
>> 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
>> 6 6 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
>> 7 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 8 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
>> 9 9 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 0 0 0 1 0
>> 10 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 11 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 12 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 13 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 14 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 15 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 16 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 17 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 18 18 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
>> 19 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 20 20 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 21 21 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
>> 22 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 23 23 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0 0 1 0 0 0 0
>> 24 24 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 1
>> 25 25 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0
>> 26 26 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
>> 27 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 28 28 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 29 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 30 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 31 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 32 32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 33 33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 34 34 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 35 35 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 36 36 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 37 37 0 1 1 0 1 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 1
>> 38 38 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> 39 39 0 1 0 0 1 0 0 1 0 1 1 0 1 1 0 0 1 1 0 1 1
>> 40 40 1 1 1 1 1 0 1 0 0 0 0 1 1 1 1 0 0 1 0 0 1
>> 41 41 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
>> 42 42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>
>> What I did in Excel is: Creating the very same tables using
>> pivot-charts. Comparing columns 1-20 to column 21 (gold-standard),
>> summing up the count of values that are identical to 21. I repeated this
>> for each answer-option. From the results, one can easily calculate
>> specificity, sensitivity and accuracy.
>>
>> How to do this, or something similar leading to the same results in R?
>> I'd appreciate any kind of help very much!
>>
>> Greetings from Munich,
>> Felix
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.