[R] Floating point "fuzz" and rpart?
feldesmanm at pdx.edu
Wed Jul 25 21:54:52 CEST 2001
I've been using rpart with R (1.3.0 Windows) for some time. I recently ran
one of my research data sets through the rpart routine and produced a
classification tree. I tried to replicate the results of the rpart
analysis on another machine of mine and discovered some startling
differences in the results. Puzzled, I went back to the raw data residing
on both machines. I printed out both versions of the data, ran summary
statistics, plotted histograms, boxplots, and anything else I could think
of. On the surface, the datasets are identical. Since the file attributes
were completely different, I infer that the two versions originated from
the same source but were moved into R via different mechanisms.
Ultimately, I read both files in as .csv tables using read.csv().
Perplexed, I gave the files different names and read both into a single
R 1.3.0 session. I ran rpart on each file and got the same two differing
results as when I ran the two files on separate machines. So, I decided to
do variable-by-variable comparisons using the all.equal.numeric() function.
On one machine, all.equal.numeric() returns TRUE for every variable when
the two files are compared, while on the second machine 9 of 10 variables
return answers like the following (all are approximately 2.6e-07):
"Mean relative difference: 2.628787e-07"
So, clearly the two "identical files" are different somewhere in the outer
reaches of floating point representation. (The two machines are identical
Dell PIII XPS T700s; one has 256MB RAM, the other 512MB RAM.)
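For anyone who wants to reproduce this kind of check, here is a minimal sketch of a column-by-column comparison. The data frames `a` and `b`, their column names, and the helper `compare_columns()` are hypothetical stand-ins, not my actual data, and the perturbation of about 2.6e-07 is put in by hand purely for illustration. The sketch shows that the verdict from all.equal() depends on the tolerance used, while identical() tests for exact equality.

```r
# Hypothetical stand-ins for the two versions of the data set.
a <- data.frame(len_mm = c(12.3, 45.6, 78.9), count = c(3, 7, 11))
b <- a
# Assumption: perturb the mm column by a relative amount of about 2.6e-07.
b$len_mm <- a$len_mm * (1 + 2.6e-07)

# Hypothetical helper: TRUE per column when all.equal() accepts the pair
# at the given tolerance.
compare_columns <- function(x, y, tolerance) {
  sapply(names(x), function(nm) {
    isTRUE(all.equal(x[[nm]], y[[nm]], tolerance = tolerance))
  })
}

# Exact comparison, column by column.
exact <- sapply(names(a), function(nm) identical(a[[nm]], b[[nm]]))
# Tolerance-based comparison at a tight and at a loose tolerance.
tight <- compare_columns(a, b, tolerance = 1e-8)
loose <- compare_columns(a, b, tolerance = 1e-6)

print(rbind(exact = exact, tight = tight, loose = loose))
```

The integer-valued column passes every test, while the perturbed column passes only at the looser tolerance.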
1. Both machines have the same versions of R (with default options) and
rpart, and I used one machine to propagate duplicate copies of each file to
the other machine. Why would one machine report all.equal.numeric() to be
TRUE for all variables, while the other machine reports 9 of 10 as different in
the outer floating point regions? (Interestingly enough, the one variable
reported to be "exactly" equal is the only variable of 10 recorded to the
nearest "integer" - although stored as a floating point number; the other 9
were measured in mm and recorded to the nearest tenth of a mm.)
2. Even with differences only beyond the 7th decimal place, why would
rpart report such demonstrably different results with the "same" data
set? Does floating point "fuzz" really make that much
difference? (rhetorical question! The answer is obvious here.)
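To make the second question concrete, here is a small sketch of how a difference of this size can change which observations fall on each side of a cut point. This is not rpart's actual splitting code, only an illustration of the mechanism. The vectors `x` and `y` and the cut point `cut_point` are hypothetical values chosen for the example.

```r
# Hypothetical measurements recorded to the nearest tenth of a mm.
x <- c(12.2, 12.3, 12.3, 12.4)
# Assumption: the same values, with one element perturbed by a relative
# amount of about 2.6e-07.
y <- x
y[3] <- x[3] * (1 + 2.6e-07)

# Tied values in x are no longer tied in y, so the number of distinct
# values (and hence of possible cut points between them) changes.
n_distinct_x <- length(unique(x))
n_distinct_y <- length(unique(y))

# A comparison against a cut point that coincides with a data value can
# send the perturbed observation to the other side.
cut_point <- 12.3
left_x <- sum(x <= cut_point)
left_y <- sum(y <= cut_point)

print(c(n_distinct_x = n_distinct_x, n_distinct_y = n_distinct_y,
        left_x = left_x, left_y = left_y))
```

Once a single observation changes sides at one split, every split below it is computed on a different subset, so the rest of the tree can differ as well.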
Thoughts, insights, suggestions for further explorations welcome.
Dr. Marc R. Feldesman
Professor and Chairman
Portland State University
1721 SW Broadway
Portland, Oregon 97201
email: feldesmanm at pdx.edu
PGP Key Available On Request
"Anyway, no drug, not even alcohol, causes the fundamental ills of society.
If we're looking for the source of our troubles, we shouldn't test people
for drugs, we should test them for stupidity, ignorance, greed and love of
power." P.J. O'Rourke
Powered by Optiplochoerus and Windows 2000 (scary isn't it?)