[R] numerical accuracy, dumb question
Marc Schwartz
MSchwartz at MedAnalytics.com
Sat Aug 14 22:53:11 CEST 2004
On Sat, 2004-08-14 at 13:19, Prof Brian Ripley wrote:
> On Sat, 14 Aug 2004, Marc Schwartz wrote:
>
> > > object.size("a")
> > [1] 44
> >
> > > object.size(letters)
> > [1] 340
> >
> > In the second case, as Tony has noted, the size of letters (a character
> > vector) is not 26 * 44.
>
> Of course not. Both are character vectors, so each has the overhead of
> any R object, plus an allocation for pointers to the elements, plus an
> amount for each element of the vector (see the end).
>
> These calculations differ on 32-bit and 64-bit machines. For a 32-bit
> machine, storage is in units of either 28 bytes (Ncells) or 8 bytes
> (Vcells), so single-letter characters are wasteful, viz
>
> > object.size("aaaaaaa")
> [1] 44
>
> That is 1 Ncell and 2 Vcells, 1 for the string (7 bytes plus terminator)
> and 1 for the pointer.
>
> Whereas
>
> > object.size(letters)
> [1] 340
>
> has 1 Ncell and 39 Vcells, 26 for the strings and 13 for the pointers
> (which fit two to a Vcell).
>
> Note that repeated character strings may share storage, so for example
>
> > object.size(rep("a", 26))
> [1] 340
>
> is wrong (the true figure is 140, I think). And that makes comparisons
> with factors depend on exactly how they were created; for a character
> vector there is probably a lot of sharing.
>
> I have a feeling that these calculations are off for character vectors, as
> each element is a CHARSXP and so may have an Ncell not accounted for by
> object.size. ('May' because of potential sharing.) Would anyone who is
> sure like to confirm or deny this?
>
> It ought to be possible to improve the estimates for character vectors a
> bit as we can detect sharing amongst the elements.
Prof. Ripley,
Thanks for the clarifications.
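
If I follow the arithmetic correctly, a quick back-of-envelope check in
R reproduces both figures, assuming the 32-bit sizes you quote (one
28-byte Ncell for the header, 8-byte Vcells, 4-byte pointers). The
size32 helper below is just my own sketch, not anything in R itself:

  size32 <- function(x) {
    ## each string needs its characters plus a terminating NUL,
    ## rounded up to whole 8-byte Vcells; the 4-byte element
    ## pointers pack two to a Vcell
    str_vcells <- sum(ceiling((nchar(x) + 1) / 8))
    ptr_vcells <- ceiling(length(x) * 4 / 8)
    28 + (str_vcells + ptr_vcells) * 8  # 28-byte Ncell for the header
  }
  size32("aaaaaaa")  # 28 + (1 + 1) * 8 = 44, matching object.size
  size32(letters)    # 28 + (26 + 13) * 8 = 340, matching object.size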
I'll need to spend some time reading through R-exts.pdf and
Rinternals.h.
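
In the meantime, adjusting the same sketch to count each distinct
string only once (the sharing you describe) does give your 140 for
rep("a", 26):

  size32_shared <- function(x) {
    ## count each distinct string once, on the assumption that
    ## repeated elements share one CHARSXP; the element pointers
    ## are still one per element
    str_vcells <- sum(ceiling((nchar(unique(x)) + 1) / 8))
    ptr_vcells <- ceiling(length(x) * 4 / 8)
    28 + (str_vcells + ptr_vcells) * 8
  }
  size32_shared(rep("a", 26))  # 28 + (1 + 13) * 8 = 140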
Regards,
Marc