[R] Floating point "fuzz" and rpart?

Thu Jul 26 09:12:12 CEST 2001

On Wed, 25 Jul 2001, Marc Feldesman wrote:

> I've been using rpart with R (1.3.0 Windows) for some time.  I recently ran
> one of my research data sets through the rpart routine and produced a
> classification tree.  I tried to replicate the results of the rpart
> analysis on another machine of mine and discovered some startling
> differences in the results.  Puzzled, I went back to the raw data residing
> on both machines.  I printed out both versions of the data, ran summary
> statistics, plotted histograms, boxplots, and anything else I could think
> of.  On the surface, the datasets are identical.  Since the file attributes
> were completely different, I know that the two versions may have originated
> from the same source but had been moved to R via different
> mechanisms.  Ultimately, I read both files in as .csv tables using read.csv().
>
> Perplexed, I gave the files different names and read both into a single
> version of R 1.3.0.  I ran rpart on each file and got the same results as
> when I ran the two files on separate machines.  So, I decided to do
> variable-by-variable comparisons using the all.equal.numeric() function.
>
> On one machine, all.equal.numeric() returns TRUE for the same set of
> variables in both files, while on the second machine 9 of 10 variables
> return answers like the following (all are approximately 2.6......e-07):
>
> "Mean relative  difference: 2.628787e-07"
>
> So, clearly the two "identical files" are different somewhere in the outer
> reaches of floating point representation.  (The two machines are identical
> Dell PIII, XPS T700's - one has 256MB RAM, the other 512MB RAM).
>
> Questions:
>
> 1.  Both machines have the same versions of R (with default options) and
> rpart, and I used one machine to propagate duplicate copies of each file to
> the other machine.  Why would one machine report all.equal.numeric() to be
> TRUE for all variables, while the other machine report 9 of 10 different in
> the outer floating point regions?  (Interestingly enough, the one variable
> reported to be "exactly" equal is the only variable of 10 recorded to the
> nearest "integer" - although stored as a floating point number; the other 9
> were measured in mm and recorded to the nearest tenth of a mm.)

Windows has a load of DLLs providing the run-time system, notably
msvcrt.dll.  I suspect different versions of msvcrt.dll.
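
For scale, here is a rough sketch (in Python, to keep it self-contained) of
the kind of figure that "Mean relative difference" reports.  The formula is
my reading of what all.equal.numeric() computes, the tolerance of about
1.5e-8 (the square root of double-precision epsilon) is assumed to be the
default in use, and the sample values are made up for illustration.

```python
def mean_relative_difference(target, current):
    """Sum of absolute differences, scaled by the sum of |target|."""
    num = sum(abs(t - c) for t, c in zip(target, current))
    den = sum(abs(t) for t in target)
    return num / den

# Assumed default tolerance: roughly sqrt(double-precision epsilon).
TOLERANCE = 1.5e-8

# Hypothetical measurements in mm, recorded to the nearest tenth.
target = [12.3, 45.6, 78.9]
# The same values perturbed by a relative error of 2.6e-7.
current = [t * (1 + 2.6e-7) for t in target]

mrd = mean_relative_difference(target, current)
print("mean relative difference:", mrd)
print("treated as equal:", mrd <= TOLERANCE)
```

A discrepancy of 2.6e-07 is more than ten times that tolerance, so under
these assumptions the comparison reports a difference rather than TRUE.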

> 2.  Even with differences only beyond the 7th decimal place, why would
> rpart report such demonstrably different results with the "same" data
> set?  Does floating point "fuzz" really make that much
> difference?  (rhetorical question!   The answer is obvious here.)

That is a little surprising (most of rpart runs in double precision in R,
but in single precision in S).  But differences of that size do matter for
`unstable' methods (in Breiman's terminology), and CART is one of the most
unstable (hence bagging).
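
To illustrate the instability, here is a minimal sketch in Python of a
one-variable Gini split search.  It is not rpart's algorithm, and the
function names and the four-point data set are hypothetical.  It only shows
that a relative perturbation of about 2.6e-7 can break a tie between two
recorded values and so change the split that is chosen.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(x, y):
    """Return (threshold, weighted impurity) of the best binary split on x.

    Candidate thresholds are midpoints between distinct adjacent sorted
    values; on a tie in impurity the first candidate found is kept.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot split between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or impurity < best[1]:
            best = ((lo + hi) / 2, impurity)
    return best

y = [0, 0, 1, 1]
x_exact = [1.0, 2.0, 2.0, 3.0]                   # two tied values
x_fuzzy = [1.0, 2.0, 2.0 * (1 + 2.6e-7), 3.0]    # tie broken by "fuzz"

print(best_split(x_exact, y))  # no split can separate the tied pair
print(best_split(x_fuzzy, y))  # a split now falls between the two
```

With the tied values no threshold can separate the two classes; once the
tie is broken, a threshold just above 2.0 separates them perfectly.  A
different first split then changes everything below it in the tree.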

I should say that rpart_3.0-0 (the version in R 1.3.0) has a few problems
(as the first of a new major revision), although I am not aware of anything
giving incorrect results outside the survival area (where the author
convinced himself the new results were right, and has now changed his
mind).  He is getting all the new features in now, in anticipation of rpart
shipping with S-PLUS.

Brian

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
