[R] merge a list of data frames
Sam Steingold
sds at gnu.org
Thu Sep 6 19:53:03 CEST 2012
> * David Winsemius <qjvafrzvhf at pbzpnfg.arg> [2012-09-06 10:30:16 -0700]:
>
>> these are the results of applying a model to the test data.
>> the first column is the ID
>
> In which case you should be using the 'by' argument to `merge`
I already do! See my initial message!
>> 3. sort by the sum/mean of the V3 columns and evaluate the combined
>> model using the lift quality metric
>> (http://dl.acm.org/citation.cfm?id=380995.381018)
>
> That's going to require more background (or more money, since they want $15.00 for a pdf).
:-)
That I have already implemented, and it works just fine:
proficiency <- function (actual, prediction) {
  proficiency1(ea = entropy(table(actual)),
               ep = entropy(table(prediction)),
               ej = entropy(table(actual,prediction)))
}
proficiency1 <- function (ea, ep, ej) {
  mi <- ea + ep - ej
  list(joint = ej, actual = ea, prediction = ep, mutual = mi,
       proficiency = mi / ea, dependency = mi / ej)
}
detector.statistics <- function (tp,fn,fp,tn) {
  observationCount <- tp + fn + fp + tn
  predictedPositive <- tp + fp
  predictedNegative <- fn + tn
  actualPositive <- tp + fn
  actualNegative <- fp + tn
  correct <- tp + tn
  list(baseRate = actualPositive / observationCount,
       precision = if (tp == 0) 0 else tp / predictedPositive,
       specificity = if (tn == 0) 0 else tn / actualNegative,
       recall = if (tp == 0) 0 else tp / actualPositive,
       accuracy = correct / observationCount,
       lift = (tp * observationCount) / (predictedPositive * actualPositive),
       f1score = if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn),
       proficiency = proficiency1(ej = entropy(c(tp,fn,fp,tn)),
                                  ea = entropy(c(actualPositive,actualNegative)),
                                  ep = entropy(c(predictedPositive,predictedNegative))))
}
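
A small worked example of the quantities above may help. Note that
entropy() is not shown in this message; the sketch below assumes it is
the plug-in Shannon entropy of a vector or table of counts, which may
differ from the definition actually used. The confusion-matrix counts
are made up for illustration.

```r
## Assumption: entropy() is not defined in this message. Here it is taken to
## be the plug-in Shannon entropy (in nats) of a vector or table of counts.
entropy <- function (counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                # treat 0 * log(0) as 0
  -sum(p * log(p))
}

## Hypothetical confusion matrix, for illustration only
tp <- 40; fn <- 10; fp <- 20; tn <- 130
n <- tp + fn + fp + tn                        # 200 observations

lift <- (tp * n) / ((tp + fp) * (tp + fn))    # 8000 / 3000
ea <- entropy(c(tp + fn, fp + tn))            # entropy of the actual labels
ep <- entropy(c(tp + fp, fn + tn))            # entropy of the predictions
ej <- entropy(c(tp, fn, fp, tn))              # joint entropy
mi <- ea + ep - ej                            # mutual information
proficiency <- mi / ea                        # share of actual entropy explained
```

With this entropy(), proficiency falls between 0 (prediction independent
of the actual labels) and 1 (prediction determines the actual labels).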
## v should be a vector of 0s and 1s, sorted according to some model
## Gregory Piatetsky-Shapiro, Samuel Steingold
## "Measuring Lift Quality in Database Marketing"
## http://sds.podval.org/data/l-quality.pdf
## http://www.sigkdd.org/explorations/issues/2-2-2000-12/piatetsky-shapiro.pdf
## SIGKDD Explorations, Vol. 2:2, (2000), 81-86
## tests: lift.quality(rbinom(10000,size=1,prob=0.1)) ==> ~0
## lift.quality(rev(round((1:10000)/12000))) ==> 1
lift.quality <- function (v, plot = TRUE, file = NULL, main = "lift curve",
                          thresholds = NULL) {
  target.count <- sum(v)
  total.count <- length(v)
  base.rate <- target.count / total.count
  target.level <- cumsum(v)/target.count
  lq <- ((2*sum(target.level) - 1)/total.count - 1) / (1 - base.rate)
  if (plot) {
    if (!is.null(file)) {
      pdf(file = file)
      on.exit(dev.off())
    }
    plot(x=(1:total.count)/total.count,y=target.level,type="l",
         main=paste(main,"( lift quality ",lq,")"),
         xlab="% cutoff",ylab="cumulative % hit")
  }
  if (is.null(thresholds)) thresholds <- c(base.rate)
  list(lift.quality = lq,
       detector.statistics = sapply(thresholds, function (l) {
         cutoff <- round(l * total.count)
         tp <- round(target.level[cutoff] * target.count) # = sum(v[1:cutoff])
         fn <- target.count - tp
         fp <- cutoff - tp
         tn <- total.count - target.count - cutoff + tp
         detector.statistics(tp, fn, fp, tn)
       }))
}
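
The two tests named in the comments above can be checked in isolation.
The sketch below copies only the lift-quality formula into a hypothetical
helper, lq.only (not part of the code above), so that it runs without
entropy() or any plotting; the seed is arbitrary.

```r
## Hypothetical helper: only the lift-quality formula from lift.quality(),
## without the plot or the detector statistics
lq.only <- function (v) {
  target.level <- cumsum(v) / sum(v)
  ((2 * sum(target.level) - 1) / length(v) - 1) / (1 - mean(v))
}

set.seed(1)                                       # arbitrary, for reproducibility
random.order  <- rbinom(10000, size = 1, prob = 0.1)
perfect.order <- rev(round((1:10000) / 12000))    # all 1s ahead of all 0s

lq.random  <- lq.only(random.order)    # expected to be close to 0
lq.perfect <- lq.only(perfect.order)   # expected to be 1, up to rounding error
```

A randomly ordered vector gives a cumulative hit curve close to the
diagonal, hence a value near 0; a perfectly ordered one reaches 100% of
the hits at the base-rate cutoff, hence 1.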
>> I have many more score files (not just 4), so it is not practical for me
>> to rename the column to something unique.
>
> Which column?
the 3rd ("score") column.
Meanwhile I realised that the fastest way is actually the shell:
sort+cut+paste produced a csv file which can be loaded into R much
faster than the individual score files, so this issue is now purely
academic. However, I appreciate the replies I got so far; they were quite
educational, thanks!
(I also appreciate comments on the code above)
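
For the record, the score column can also be renamed programmatically
inside R, so the number of files does not matter. A minimal sketch,
assuming each score file has already been read into a data frame with the
ID in V1 and the score in V3; the toy data and the generated names
score1, score2, ... are made up for illustration:

```r
## Made-up stand-ins for the score files: ID, an unused column, and the score
ids <- c("a", "b", "c")
frames <- list(
  data.frame(V1 = ids,      V2 = 0, V3 = c(0.9, 0.2, 0.5)),
  data.frame(V1 = rev(ids), V2 = 0, V3 = c(0.4, 0.1, 0.8)),
  data.frame(V1 = ids,      V2 = 0, V3 = c(0.7, 0.3, 0.6)))

## Keep only ID and score, giving each score column a unique generated name
scores <- lapply(seq_along(frames), function (i) {
  setNames(frames[[i]][, c("V1", "V3")], c("V1", paste0("score", i)))
})

## Fold the list into one data frame, joining on the ID column
merged <- Reduce(function (x, y) merge(x, y, by = "V1"), scores)

## Mean score per ID, then order best-first
merged$mean.score <- rowMeans(merged[, -1])
merged <- merged[order(merged$mean.score, decreasing = TRUE), ]
```

The 0/1 outcome column for the rows of merged, in this order, is then the
kind of vector that lift.quality() expects.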
--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://www.memritv.org http://truepeace.org
http://openvotingconsortium.org http://ffii.org http://mideasttruth.com
Save your burned out bulbs for me, I'm building my own dark room.