[Rd] incorrect output and segfaults from sprintf with %*d (PR#13667)
maechler at stat.math.ethz.ch
maechler at stat.math.ethz.ch
Fri Apr 24 12:45:24 CEST 2009
>>>>> "MM" == Martin Maechler <maechler at stat.math.ethz.ch>
>>>>> on Thu, 23 Apr 2009 12:40:22 +0200 (CEST) writes:
>>>>> "vQ" == Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>
>>>>> on Thu, 23 Apr 2009 11:49:54 +0200 writes:
vQ> maechler at stat.math.ethz.ch wrote:
>>>
vQ> sprintf has a documented limit on strings included in the output using the
vQ> format '%s'. It appears that there is a limit on the length of strings included
vQ> with, e.g., the format '%d' beyond which surprising things happen (output
vQ> modified for conciseness):
>>>
vQ> ... and this limit is *not* documented.
MM> well, it is basically (+ a few bytes ?)
MM> the same 8192 limit that *is* documented.
indeed, I was right with that..
vQ> gregexpr('1', sprintf('%9000d', 1))
vQ> # [1] 9000 9801
>>>
vQ> gregexpr('1', sprintf('%9000d', 1))
vQ> # [1] 9000 9801 10602
>>>
vQ> gregexpr('1', sprintf('%9000d', 1))
vQ> # [1] 9000 9801 10602 11403
>>>
vQ> gregexpr('1', sprintf('%9000d', 1))
vQ> # [1] 9000 9801 10602 11403 12204
>>>
vQ> ...
>>>
vQ> Note that not only more than one '1' is included in the
vQ> output, but also that the same functional expression (no
vQ> side effects used beyond the interface) gives different
vQ> results on each execution. Analogous behaviour can be
vQ> observed with '%nd' where n > 8200.
vQ> The actual output above is consistent across separate sessions.
>>>
vQ> With sufficiently large field width values, R segfaults:
>>>
vQ> sprintf('%*d', 10^5, 1)
vQ> # *** caught segfault ***
vQ> # address 0xbfcfc000, cause 'memory not mapped'
vQ> # Segmentation fault
>>>
>>>
>>> Thank you, Wacek.
>>> That's all ``interesting'' ... unfortunately,
>>>
>>> my version of 'man 3 sprintf' contains
>>>
>>>
>>>>> BUGS
>>>>> Because sprintf() and vsprintf() assume an arbitrarily
>>>>> long string, callers must be careful not to overflow the
>>>>> actual space; this is often impossible to assure. Note
>>>>> that the length of the strings produced is
>>>>> locale-dependent and difficult to predict. Use
>>>>> snprintf() and vsnprintf() instead (or asprintf() and vasprintf).
vQ> yes, but this is c documentation, not r documentation.
MM> Of course! ... and I *do* apply it to R's C code [sprintf.c]
MM> and hence am even concurring with you ..
vQ> it's applicable
vQ> to a degree, since ?sprintf does say that sprintf is "a wrapper for the
vQ> C function 'sprintf'". however, in c you use a buffer and you usually
vQ> have control over it's capacity, while in r this is a hidden
vQ> implementational detail, which should not be visible to the user, or
vQ> should cause an attempt to overflow the buffer to fail more gracefully
vQ> than with a segfault.
vQ> in r, sprintf('%9000d', 1) will produce a confused output with a count
vQ> of 1's variable (!) across runs (while sprintf('%*d', 9000, 1) seems to
vQ> do fine):
vQ> gregexpr('1', sprintf('%*d', 9000, 1))
vQ> # [1] 9000
vQ> gregexpr('1', sprintf('%9000d', 1))
vQ> # [1] 9000 9801 ..., variable across executions
vQ> on one execution in a series i actually got this:
vQ> Warning message:
vQ> In gregexpr("1", sprintf("%9000d", 1)) :
vQ> input string 1 is invalid in this locale
vQ> while the very next execution, still in the same session, gave
vQ> # [1] 9000 9801 10602
vQ> with sprintf('%*d', 10000, 1) i got segfaults on some executions but
vQ> correct output on others, while sprintf('%10000d', 1) is confused again.
>>> (note the "impossible" part above)
>>>
vQ> yes, but it does also say "must be careful", and it seems that someone
vQ> has not been careful enough.
>>> and we haven't used snprintf() yet, probably because it
>>> requires the C99 C standard, and AFAIK, we have only relatively
>>> recently started to more or less rely on C99 in the R sources.
>>>
vQ> while snprintf would help avoid buffer overflow, it may not be a
vQ> solution to the issue of confused output.
MM> I think it would / will. We would be able to give warnings and
MM> errors, by checking the snprintf() return codes.
My current working code gives an error for all the above
examples, e.g.,
> sprintf('%9999d', 1)
Error in sprintf("%9999d", 1) :
required resulting string length 9999 is > maximal 8191
it passes 'make check-devel' and I am inclined to commit that
code to R-devel (e.g. tomorrow).
Yes, the documentation will also have to be amended, but apart
from that, would people see a big problem with the "8192" limit
which now is suddenly of greater importance
{{as I said all along; hence my question to Wacek (and the
R-develers) if anybody found that limit too low}}
Of course, one could realloc()ate longer strings when needed
inside R's sprintf() code, but I reluctant to do that (render
the code yet more complicated ...) if nobody sees a need.
Martin
>>> More precisely, I see that some windows-only code relies on
>>> snprintf() being available whereas in at least on non-Windows
>>> section, I read /* we cannot assume snprintf here */
>>>
>>> Now such platform dependency issues and corresponding configure
>>> settings I do typically leave to other R-corers with a much
>>> wider overview about platforms and their compilers and C libraries.
>>>
vQ> it looks like src/main/sprintf.c is just buggy, and it's plausible that
vQ> the bug could be repaired in a platform-independent manner.
MM> definitely.
MM> In the mean time, I've actually found that what I first said on
MM> the usability of snprintf() in R's code base was only partly correct.
MM> There are other parts of R code where we use snprintf() for all
MM> platforms, hence we rely on its presence (and correct
MM> implementation!) and so we can and I think should use it in
MM> place of sprintf() in quite a few places inside R's sprintf.c
>>> BTW,
>>> 1) sprintf("%n %g", 1,1) also seg.faults
>>>
vQ> as do
vQ> sprintf('%n%g', 1, 1)
vQ> sprintf('%n%')
vQ> etc., while
vQ> sprintf('%q%g', 1, 1)
vQ> sprintf('%q%')
vQ> work just fine. strange, because per ?sprintf 'n' is not recognized as
vQ> a format specifier, so the output from the first two above should be as
vQ> from the last two above, respectively. (and likewise in the %S case,
vQ> discussed and bug-reported earlier.)
MM> I have now fixed these bugs at least;
MM> the more subtle "%<too_large_n>d" ones are different, and
MM> as I said, I'm convinced that a nice & clean fix for those will
MM> start using snprintf().
>>> 2) Did you have a true use case where the 8192 limit was an
>>> undesirable limit?
vQ> how does it matter?
MM> well, we could increase it, if it did matter.
MM> {{ you *could* have been more polite here, no?
MM> it *was* after all a serious question that I asked! }}
vQ> if you set a limit, be sure to consistently enforce
vQ> it and warn the user on attempts to exceed it. or write clearly in the
vQ> docs that such attempts will cause the output to be silently truncated.
MM> Sure, I'm not at all disagreeing on that, and if you read this into my
MM> posting, you misunderstand.
MM> Martin
[..........................]
More information about the R-devel
mailing list