[R] Floating point "fuzz" and rpart?

Wed Jul 25 21:54:52 CEST 2001

I've been using rpart with R (1.3.0 Windows) for some time.  I recently ran 
one of my research data sets through the rpart routine and produced a 
classification tree.  I tried to replicate the results of the rpart 
analysis on another machine of mine and discovered some startling 
differences in the results.  Puzzled, I went back to the raw data residing 
on both machines.  I printed out both versions of the data, ran summary 
statistics, plotted histograms, boxplots, and anything else I could think 
of.  On the surface, the datasets are identical.  Because the file attributes 
were completely different, I know that the two versions, although they may 
have originated from the same source, were moved to R via different 
mechanisms.  Ultimately, I read both files in as .csv tables using read.csv().

Perplexed, I gave the files different names and read both into a single 
R 1.3.0 session.  I ran rpart on each file, and each file reproduced the 
result it had given on its own machine.  So, I decided to do 
variable-by-variable comparisons using the all.equal.numeric() function.

On one machine, all.equal.numeric() returns TRUE for the same set of 
variables in both files, while on the second machine 9 of 10 variables 
return answers like the following (all are approximately 2.6......e-07):

"Mean relative  difference: 2.628787e-07"

So, clearly the two "identical files" are different somewhere in the outer 
reaches of floating point representation.  (The two machines are identical 
Dell PIII XPS T700s - one has 256MB RAM, the other 512MB RAM.)
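
A minimal sketch of this kind of variable-by-variable check, for anyone who wants to reproduce it.  The data frames `a` and `b` and their columns are hypothetical stand-ins for the two files (in practice they would come from read.csv()), and the perturbation is injected by hand purely to mimic the size of the difference reported above.  all.equal() compares within a tolerance, while identical() demands exact equality.

```r
# Hypothetical stand-ins for the two versions of the data set.
a <- data.frame(len = c(12.3, 45.6, 78.9), count = c(3, 7, 11))
b <- a
b$len <- b$len * (1 + 2.6e-07)  # inject a relative difference of about 2.6e-07

# Tolerance-based comparison, one variable at a time.
approx_same <- sapply(names(a), function(v) isTRUE(all.equal(a[[v]], b[[v]])))

# Exact comparison, one variable at a time.
exact_same <- sapply(names(a), function(v) identical(a[[v]], b[[v]]))

# Largest absolute difference per variable, to see how big the discrepancy is.
max_diff <- sapply(names(a), function(v) max(abs(a[[v]] - b[[v]])))

print(approx_same)
print(exact_same)
print(max_diff)
```

Running the same script on both machines, on the same pair of files, would show whether the comparison itself behaves differently or only the data do.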

Questions:

1.  Both machines have the same versions of R (with default options) and 
rpart, and I used one machine to propagate duplicate copies of each file to 
the other machine.  Why would one machine report all.equal.numeric() to be 
TRUE for all variables, while the other machine reports 9 of 10 as different in 
the outer floating point regions?  (Interestingly enough, the one variable 
reported to be "exactly" equal is the only variable of 10 recorded to the 
nearest "integer" - although stored as a floating point number; the other 9 
were measured in mm and recorded to the nearest tenth of a mm.)
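
One way to see why whole-number and tenth-of-a-mm variables can behave differently: whole numbers of modest size are exactly representable in binary floating point, whereas most decimal tenths are not, so any step that reduces precision perturbs the latter but not the former.  The sketch below round-trips values through 4-byte (single-precision) storage using base R's writeBin() and readBin().  That such a step occurred for these files is purely an assumption for illustration, the helper name `to_single` and the sample values are made up, and the error it produces is of a similar order to, but not the same as, the difference reported above.

```r
# Round-trip a double vector through 4-byte (single-precision) storage.
to_single <- function(x) {
  readBin(writeBin(x, raw(), size = 4), "numeric", n = length(x), size = 4)
}

tenths   <- c(12.3, 45.6, 78.9)  # decimal tenths: not exact in binary
integers <- c(12, 45, 78)        # small whole numbers: exact in binary

ints_survive   <- identical(to_single(integers), integers)
tenths_survive <- identical(to_single(tenths), tenths)

# Relative error introduced in the tenths by the reduced precision.
rel_err <- abs(to_single(tenths) - tenths) / tenths

print(ints_survive)
print(tenths_survive)
print(rel_err)
```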

2.  Even with differences only beyond the 7th decimal place, why would 
rpart report such demonstrably different results with the "same" data 
set?  Does floating point "fuzz" really make that much 
difference?  (rhetorical question!   The answer is obvious here.)
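
A small base-R sketch of one mechanism by which such tiny differences could change a tree: observations that are tied on a variable in one file stop being tied in the other, which creates a new candidate cut point.  This is a deliberately simplified single split scored by misclassification count, not rpart's actual splitting criterion, and the data and the helper names `cut_points` and `best_errors` are hypothetical.

```r
x_clean <- c(10.1, 12.3, 12.3, 12.3, 15.0)
y       <- c("A", "A", "B", "B", "B")

# Same data, with one tied value nudged by a relative 2.6e-07.
x_fuzzy <- x_clean
x_fuzzy[2] <- x_fuzzy[2] * (1 - 2.6e-07)

# Candidate cut points lie midway between adjacent distinct sorted values.
cut_points <- function(x) {
  u <- sort(unique(x))
  (head(u, -1) + tail(u, -1)) / 2
}

# Fewest misclassifications over all splits of the form "x < cut -> A, else B".
best_errors <- function(x, y) {
  min(sapply(cut_points(x), function(cp) sum((x < cp) != (y == "A"))))
}

n_cuts_clean <- length(cut_points(x_clean))
n_cuts_fuzzy <- length(cut_points(x_fuzzy))
err_clean    <- best_errors(x_clean, y)
err_fuzzy    <- best_errors(x_fuzzy, y)

print(c(n_cuts_clean, n_cuts_fuzzy))
print(c(err_clean, err_fuzzy))
```

In the unperturbed data the three tied observations cannot be separated, so the best split still misclassifies one case; after the nudge a cut point appears inside the former tie and the split becomes perfect.  A change like that at the root would alter everything below it.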

Thoughts, insights, suggestions for further explorations welcome.

Thanks.

=====================
Dr. Marc R. Feldesman
Professor and Chairman
Anthropology Department
Portland State University
1721 SW Broadway
Portland, Oregon 97201
email:  feldesmanm at pdx.edu
phone:  503-725-3081
fax:    503-725-3905
http://web.pdx.edu/~h1mf
PGP Key Available On Request
======================

"Anyway, no drug, not even alcohol, causes the fundamental ills of society.
If we're looking for the source of our troubles, we shouldn't test people
for drugs, we should test them for stupidity, ignorance, greed and love of
power."   P.J. O'Rourke

Powered by Optiplochoerus and Windows 2000 (scary isn't it?)
