[R] Confused - better empirical results with error in data
Noah Silverman
noah at smartmediacorp.com
Mon Sep 7 22:39:19 CEST 2009
You both make good points.
Ideally, it would be nice to know WHY it works.
Without digging into too much verbiage, the system is designed to
predict the outcome of certain events. The "broken" model predicts
outcomes correctly much more frequently than one with the broken data
withheld. So, to answer Mark's question, we say it's "better" because we
see much better results with our "broken" model when applied to
real-world data used for testing.
I have one theory.
The data is listed in our CSV file from newest to oldest. We are
supposed to calculate a value that is an "average" of some items. We
loop through some queries to our database and increment two variables,
$total_found and $total_score. The final value is simply $total_score /
$total_found.
Our programmer forgot to reset both $total_score and $total_found back
to zero for each record we process. So both grow.
I think that this may, in a way, be some warped form of a
recency-weighted score. The newer records will have a score more affected by
their "contribution" to the wrongly growing totals. A record that is
closer to the end of the data set will be starting with HUGE values for
$total_score and $total_found, so addition of its values will have very
little effect.
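
To make the theory concrete, here is a minimal sketch in Python (not our
actual code). The function name, the reset flag, and the numbers are all
made up for illustration; the only things taken from our system are the
two accumulators and the final division. With reset=True it computes the
intended per-record average; with reset=False it reproduces the bug, so
each record's value becomes the average of everything seen so far.

```python
def per_record_averages(records, reset=True):
    """Return one "average" per record.

    records: list of lists of item scores (a hypothetical stand-in for
    the database query results belonging to each record).
    reset=True  -> intended behaviour: totals start at zero per record.
    reset=False -> the bug: totals carry over from record to record.
    """
    total_score = 0.0
    total_found = 0
    averages = []
    for items in records:
        if reset:
            total_score = 0.0
            total_found = 0
        for score in items:
            total_score += score
            total_found += 1
        # Guard against a record with no items and nothing accumulated yet.
        averages.append(total_score / total_found if total_found else float("nan"))
    return averages


# Made-up data, ordered newest to oldest as in the CSV file.
records = [[10, 20], [30, 40], [50, 60], [70, 80]]

correct = per_record_averages(records, reset=True)
buggy = per_record_averages(records, reset=False)

for i, (c, b) in enumerate(zip(correct, buggy)):
    print(f"record {i}: correct={c:.2f} buggy={b:.2f}")
```

In this toy data the first (newest) record gets the same value either
way, while each later (older) record's value is pulled toward the running
average of all the records before it, and its own items count for less
and less as the totals grow.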
We've done the following so far today (Note: scores are relative and
only indicate performance; higher is better):
1) Run with "bad" data = 6.9
2) Run with "bad" data missing = 5.5
3) Run with "correct" data = ?? (We're running now, will take a few
hours to compute.)
I might also try to plot the bad data. It would be interesting to see
what shape it has...
On 9/7/09 1:05 PM, Mark Knecht wrote:
> On Mon, Sep 7, 2009 at 12:33 PM, Noah Silverman<noah at smartmediacorp.com> wrote:
> <SNIP>
>
>> So, this is really a philosophical question. Do we:
>> 1) Shrug and say, "who cares", the SVM figured it out and likes that bad
>> data item for some inexplicable reason
>> 2) Tear into the math and try to figure out WHY the SVM is predicting
>> more accurately
>>
>> Any opinions??
>>
>> Thanks!
>>
>>
> Boy, I'd sure think you'd want to know why it worked with the 'wrong'
> calculations. It's not that the math is wrong, really, but rather that
> it wasn't what you thought it was. I cannot see why you wouldn't want
> to know why this mistake helped. Won't future projects benefit?
>
> Just my 2 cents,
> Mark
>