[R] numerical accuracy, dumb question
Tony Plate
tplate at blackmesacapital.com
Sat Aug 14 15:42:31 CEST 2004
At Friday 08:41 PM 8/13/2004, Marc Schwartz wrote:
>Part of that decision may depend upon how big the dataset is and what is
>intended to be done with the ID's:
>
> > object.size(1011001001001)
>[1] 36
>
> > object.size("1011001001001")
>[1] 52
>
> > object.size(factor("1011001001001"))
>[1] 244
>
>
>They will by default, as Andy indicates, be read and stored as doubles.
>They are too large for integers, at least on my system:
>
> > .Machine$integer.max
>[1] 2147483647
>
>Converting to a character might make sense, with only a minimal memory
>penalty. However, using a factor results in a notable memory penalty, if
>the attributes of a factor are not needed.
That depends on how long the vectors are. A factor carries a fixed memory
overhead per vector (its levels and attributes), but each additional element
costs only 4 bytes, provided its level already appears. Character data, by
contrast, costs memory per element: there is no amortization for repeated
values.
> object.size(factor("1011001001001"))
[1] 244
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))
[1] 308
> # bytes per element in factor, for length 4:
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))/4
[1] 77
> # bytes per element in factor, for length 1000:
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250)))/1000
[1] 4.292
> # bytes per element in character data, for length 1000:
> object.size(as.character(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250))))/1000
[1] 20.028
So, for long vectors with relatively few distinct values, storage as a
factor is far more memory efficient: the character data is stored only once
per level, and each element is stored as a 4-byte integer code. (The above
was run on Windows 2000.)
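To see where the saving comes from, it may help to look inside a factor.
The following is only a minimal sketch: the vector name `ids` and the object
name `f` are made up for illustration, and the sizes reported by
object.size() depend on the R version and platform, so no expected sizes
are shown.

```r
# Hypothetical ID vector: 4 distinct values repeated to length 1000,
# mirroring the example in this message.
ids <- rep(c("1011001001001", "111001001001",
             "001001001001", "011001001001"), 250)
f <- factor(ids)

typeof(f)            # the per-element storage type of the factor
levels(f)            # each distinct string appears once, as a level
head(as.integer(f))  # per-element integer codes that index into levels(f)

# Compare total sizes; the numbers vary by R version and platform.
object.size(f)
object.size(ids)

# The original strings can be recovered from the codes and the levels.
identical(levels(f)[as.integer(f)], ids)
```

The last line checks that nothing is lost: the codes plus the levels
reproduce the original character vector.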
-- Tony Plate
>If any mathematical operations are to be performed with the ID's then
>leaving them as doubles makes most sense.
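
On the accuracy question itself, a quick check along the following lines
may be reassuring. It is a sketch, not a definitive recipe: `big_id`, `txt`
and `d` are names invented for illustration, and the small inline table
stands in for the real data file. A double represents every integer up to
2^53 exactly, so IDs of this size survive as doubles; reading the column as
character is an alternative when no arithmetic is needed and leading zeros
must be kept.

```r
# Largest ID mentioned in the thread, stored as a double.
big_id <- 1011001001001

big_id > .Machine$integer.max   # too large for an R integer
big_id < 2^53                   # within the exactly representable range
big_id + 1 == 1011001001002     # integer arithmetic is still exact here

# Assumed stand-in for the real file: a one-column table of IDs.
txt <- "id\n1001001001001\n1001001001002\n001001001001\n"

# colClasses = "character" keeps the IDs as strings, so a leading zero
# is preserved instead of being dropped by numeric conversion.
d <- read.table(text = txt, header = TRUE, colClasses = "character")
str(d)
```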
>
>Dan, more information on the numerical characteristics of your system
>can be found by using:
>
>.Machine
>
>See ?.Machine and ?object.size for more information.
>
>HTH,
>
>Marc Schwartz
>
>
>On Fri, 2004-08-13 at 21:02, Liaw, Andy wrote:
> > If I'm not mistaken, numerics are read in as doubles, so that shouldn't
> > be a problem. However, I'd try using factor or character.
> >
> > Andy
> >
> > > From: Dan Bolser
> > >
> > > I store an id as a big number, could this be a problem?
> > >
> > > > Should I convert to a string when I use read.table(...
> > >
> > > example id's
> > >
> > > 1001001001001
> > > 1001001001002
> > > ...
> > > 1002001002005
> > >
> > >
> > > > Biggest is probably
> > >
> > > 1011001001001
> > >
> > > Ta,
> > > Dan.
> > >
>