[R] Coercion of percentages by as.numeric
Marc Schwartz (via MN)
mschwartz at mn.rr.com
Mon Nov 14 19:02:31 CET 2005
On Mon, 2005-11-14 at 19:07 +0200, Brandt, T. (Tobias) wrote:
>
> >-----Original Message-----
> >From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
> >Sent: 14 November 2005 06:21 PM
> >
> >On 11/14/05, Brandt, T. (Tobias) <TobiasBr at taquanta.com> wrote:
> >> Hi
> >>
> >> Given that things like the following work
> >>
> >> > a <- c("-.1"," 2.7 ","B")
> >> > a
> >> [1] "-.1" " 2.7 " "B"
> >> > as.numeric(a)
> >> [1] -0.1 2.7 NA
> >> Warning message:
> >> NAs introduced by coercion
> >> >
> >>
> >> I naively expected that the following would behave differently.
> >>
> >> > b <- c('10%', '-20%', '30.0%', '.40%')
> >> > b
> >> [1] "10%" "-20%" "30.0%" ".40%"
> >> > as.numeric(b)
> >> [1] NA NA NA NA
> >> Warning message:
> >> NAs introduced by coercion
> >
> >Try this:
> >
> >as.numeric(sub("%", "e-2", b))
> >
>
> Thank you, that accomplishes what I had intended.
>
> I would have thought though that the expression "53%" would be a fairly
> standard representation of the number 0.53 and might be handled as such. Is
> there a specific reason for avoiding this behaviour?

"53%" is a shorthand character representation of a mathematical
concept: a fraction with 100 as the denominator (i.e., 53 / 100). The
symbol '%' can be replaced by the word "percent", as in "53 percent",
which is also a character representation.

0.53, in context, is a numeric representation of a proportion in the
range 0 to 1.0.

> I can imagine that it might add unnecessary overhead to routines like
> "as.numeric" which one would like to keep as fast as possible.
>
> Perhaps there are other areas though where it might be desirable? For
> example I'm thinking of the read.table function for reading in csv files
> since I have many of these that have been saved from excel and now contain
> numbers in the "%" format.

In Excel, the '%' belongs to how a number is displayed. The internal
representation (how the value is actually stored in the program) is
still a floating point value, without the '%'.

The same distinction between a stored value and its formatted display
can be seen in R:

> a <- 53
> a
[1] 53
> sprintf("%.0f%%", a)
[1] "53%"
> is.numeric(a)
[1] TRUE
> is.numeric(sprintf("%.0f%%", a))
[1] FALSE

Unfortunately (depending upon your perspective), Excel and similar
programs tend to export the displayed values rather than their
internal representations. Thus, as Gabor pointed out, you will need to
edit the values before using them in R. You can do this either in
Excel, by removing the "%" formatting before export, or post-import in
R, as Gabor has described.
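
The post-import route might look like the sketch below. The CSV
content and the column names (id, growth) are made up for
illustration; the conversion itself is Gabor's sub() approach, anchored
to a trailing '%'.

```r
# Made-up data standing in for a CSV file exported from Excel.
csv_text <- "id,growth\n1,10%\n2,-20%\n3,30.0%\n"

# colClasses = "character" keeps every column as plain strings on import,
# so nothing is silently turned into a factor or otherwise altered.
dat <- read.csv(text = csv_text, colClasses = "character")

# Post-import edit: replace the trailing '%' with "e-2", then coerce.
dat$growth <- as.numeric(sub("%$", "e-2", dat$growth))
dat$id <- as.integer(dat$id)

str(dat)
```

Reading the columns as character first is a deliberate choice here: it
leaves the decision about which columns hold percentages to you rather
than to the import routine.
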

You need to keep the internal representation of a value separate from
its printed or displayed, human-readable representation.

as.numeric() does basically one thing, and it does it well and
properly. It is up to the user to ensure that it is passed proper
values. When that is not the case, it issues an appropriate warning
message and returns NA.
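
Since the failure shows up as NA, you can locate the offending inputs
afterwards. A small sketch (the variable names are only illustrative):

```r
x <- c("-.1", " 2.7 ", "53%", "B")

# suppressWarnings() silences the coercion warning; the NAs remain.
num <- suppressWarnings(as.numeric(x))

# Inputs that were not NA to begin with but could not be coerced.
failed <- is.na(num) & !is.na(x)
x[failed]
```

Here x[failed] picks out "53%" and "B", the two strings that
as.numeric() cannot interpret.
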

Of course, using Gabor's hint, you can also write your own variation of
as.numeric(): a function that takes percent-formatted values and
converts them as you require. One of the many strengths of R is that
you can extend it to meet your own requirements when the base functions
do not.
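
A minimal sketch of such a function follows. The name
percent_to_numeric is hypothetical (it is not part of base R), and the
sketch assumes that a percentage is marked by a single trailing '%'.
Values without a '%' are coerced as as.numeric() would coerce them.

```r
# Hypothetical helper, not a base R function: convert strings such as
# "53%" to proportions such as 0.53.
percent_to_numeric <- function(x) {
  x <- trimws(as.character(x))
  is_pct <- grepl("%$", x)              # TRUE where the value ends in '%'
  out <- as.numeric(sub("%$", "", x))   # NA, with a warning, if not numeric
  out[is_pct] <- out[is_pct] / 100      # rescale only the percent values
  out
}

b <- c("10%", "-20%", "30.0%", ".40%")
percent_to_numeric(b)
```

Note that ".40%" converts to 0.004, not 0.4, and that anything that is
still not a number after the '%' is removed comes back as NA with the
usual warning.
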
HTH,
Marc Schwartz