[R] = vs. ==?

Tue Apr 15 15:36:35 CEST 2008

On 15-Apr-08 12:28:53, Linn wrote:
> 
> Hi
> Could anyone please explain to me the difference between
> the = and the ==?
> I'm quite new to R and I've tried to find out but didn't
> get any wiser...
> 
> Thanks

While these are indeed documented in ?"=" and ?"==", as
Gabor Csardi has pointed out, these particular help pages
(especially ?"=") devote so much attention to deep issues
in the implementation of R that they are unlikely to give
much to a newcomer to R. (Though ?"==" is not too bad).

Putting it simply:

"==" is a comparison operator. If 'x' and 'y' are two
variables of the same type, then "x==y" has value TRUE
if 'x' and 'y' have the same value, and FALSE if they
do not.

There are a couple of traps here, which even beginners
should take care to be aware of.

One is that "NA" is not a value. Its logical status is,
in effect, "value not known". Therefore, when 'y' is "NA",
"x==y" cannot have a definite resolution, since it is
possible for the unknown value of 'y' to be equal to the
value of 'x'; and equally possible that it may not be.
Hence the value of "x==y" is itself "NA". Similarly
the value of "x==y" is "NA" when both of 'x' and 'y'
are "NA". The function to use for testing whether (say)
'x' is "NA" is is.na(x).
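
A small sketch of the NA behaviour just described may help.
The variable names and values here are made up purely for
illustration:

```r
x <- 3
y <- NA

x == y      # NA: the comparison cannot be resolved
y == NA     # NA as well, so "==" is no test for missingness
is.na(y)    # TRUE: this is the proper test

# The same applies element by element in a vector
v <- c(1, NA, 3)
v == 1      # TRUE, NA, FALSE
is.na(v)    # FALSE, TRUE, FALSE
```

Note in particular that "y == NA" does not tell you whether
'y' is missing; only is.na(y) does.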

The other is that the comparison of two floating-point
numbers which (mathematically) should be equal may be
FALSE, since their internal binary representations may
be different. Floating-point arithmetic in fixed-precision
computers is almost always approximate (though, in R,
to a very close degree of approximation). Thus, for instance,

  x <- sqrt(2)
  x^2 == 2
# [1] FALSE

and the reason for this is

  2 - sqrt(2)^2
# [1] -4.440892e-16

But, as pointed out in ?"==", a better test for this kind
of "equality" is the function all.equal():

  all.equal(x^2,2)
# [1] TRUE

since all.equal(x,y) considers x and y to be "equal" if
the numerical values corresponding to their representations
do not differ by more than a certain "tolerance" which
has a default value, but can be changed by the user.
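
As a sketch of how the tolerance comes into play (the numbers
1 and 1.001 are arbitrary choices for illustration): when the
values differ by more than the tolerance, all.equal() returns
a description of the difference rather than FALSE, so it is
usually wrapped in isTRUE() when a plain TRUE/FALSE is wanted.

```r
x <- sqrt(2)

# Differs from 2 only by rounding error, well inside the tolerance
isTRUE(all.equal(x^2, 2))                      # TRUE

# A relative difference of about 0.001 exceeds the default
# tolerance (roughly 1.5e-8), so this is not "equal" ...
isTRUE(all.equal(1, 1.001))                    # FALSE

# ... unless the user loosens the tolerance
isTRUE(all.equal(1, 1.001, tolerance = 0.01))  # TRUE
```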

So much for "==". Where "=" is concerned, it functions
rather like an assignment, but with complications. All that
incomprehensible stuff in ?"=" has to do with the complications.

In R, use "<-" rather than "=" for assigning a value to
a variable. Using "=" works at the top level of your code,
but inside the argument list of a function call it is read
as naming an argument rather than as an assignment, which
is one of those complications. So write "x <- sqrt(2)" as
above, rather than "x = sqrt(2)" -- though in this case
that works as you would expect:

  y=sqrt(2)
  x==y
# [1] TRUE

In programming in R, it is a useful rule of thumb to
think "use something I know will work" rather than
thinking "use something which will work unless ... ";
unless, of course, you know all about those "..."!
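
One of those "..." cases can be sketched as follows. The
variable names and numbers are invented for illustration;
mean() is used only because its first argument happens to
be called 'x'.

```r
# At top level both forms assign
a <- sqrt(2)
b = sqrt(2)
a == b                          # TRUE

# Inside a function call, "<-" still assigns (as a side effect)
m1 <- mean(z <- c(1, 2, 3))
z                               # 1 2 3: z now exists
m1                              # 2

# Inside a function call, "=" names an argument instead;
# the variable x in your workspace is left untouched
x <- 1
m2 <- mean(x = c(10, 20, 30))
m2                              # 20
x                               # still 1
```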

Where you will routinely use "=" is in naming elements
of lists and dataframes, and in assigning values to named
parameters in functions.
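
A minimal sketch of both routine uses, with made-up names
and values: "=" first names the elements of a list, then
names the parameters in calls to round() and seq().

```r
# Naming elements of a list
p <- list(label = "trial", score = 4.5)
names(p)                        # "label" "score"
p$score                         # 4.5

# Assigning values to named parameters in function calls
r <- round(3.14159, digits = 2)
r                               # 3.14
s <- seq(from = 1, to = 10, by = 3)
s                               # 1 4 7 10
```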

Thus, if you already have vectors X and Y and you want
to make a dataframe in which you want X to play the role
of the "independent variable" in a subsequent regression,
and Y the role of the "dependent variable", then you
could write

  MyData <- data.frame(Indep=X, Depend=Y)

and then, later, execute the linear modelling function lm()
in the form:

  lm(Depend ~ Indep, data=MyData)

This executes lm() using what it finds in "MyData" with
name "Depend" as the dependent variable, and what it
finds in "MyData" with name "Indep" as the independent
variable.

This lm() call, in turn, illustrates the other typical
use of "=", namely assigning a value to a parameter in a
function call, since the lm() function has a parameter
called "data", and "data=MyData" then tells it which
dataframe to use as the parameter "data" in this call
of lm().
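
Put together, the whole thing might look like the sketch
below. The numbers in X and Y are invented purely so that
the example runs; nothing about them comes from real data.

```r
# Hypothetical vectors, made up for illustration
X <- c(1, 2, 3, 4, 5)
Y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# "=" names the columns of the dataframe
MyData <- data.frame(Indep = X, Depend = Y)
names(MyData)                   # "Indep" "Depend"

# "=" passes MyData as the parameter called "data"
fit <- lm(Depend ~ Indep, data = MyData)
coef(fit)                       # intercept, and slope for Indep
```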

Not that you necessarily *have* to do it that way,
of course, since often you may simply write

  lm(Y ~ X)

without reference to a dataframe, just referring to
variables you happen to have around at the time. But
the lm(...,data=...) form is useful in two kinds of
context: one, where the data come to you as a dataframe
in the first place, and it then saves you explicitly
extracting the variables from the dataframe; the other,
where the call (e.g., as above)

  lm(Depend ~ Indep, data=MyData)

is in some "generic" part of your code, and you do not
want to change it. Then it makes sense to change the
contents of "MyData", but keeping the names "Depend"
and "Indep", so that whatever you actually put in as
X and Y will be used in the same way.
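
As a sketch of that second context: the function name
fit_line and all the data below are hypothetical, invented
only to show the "generic" lm() call being reused unchanged
while the contents of the dataframe vary.

```r
# "Generic" part of the code: written once, never edited
fit_line <- function(MyData) {
  lm(Depend ~ Indep, data = MyData)
}

# Two unrelated (made-up) pairs of variables
height <- c(150, 160, 170, 180)
weight <- c(52, 58, 67, 75)
dose   <- c(1, 2, 4, 8)
resp   <- c(0.9, 2.2, 3.8, 8.1)

fit1 <- fit_line(data.frame(Indep = height, Depend = weight))
fit2 <- fit_line(data.frame(Indep = dose,   Depend = resp))

# The generic call gives the same coefficients as fitting
# the bare variables directly
all.equal(unname(coef(fit1)), unname(coef(lm(weight ~ height))))
# [1] TRUE
```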

Hoping this helps!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 15-Apr-08                                       Time: 14:36:30
------------------------------ XFMail ------------------------------