[Rd] String encoding problem

Duncan Murdoch murdoch.duncan at gmail.com
Thu Jul 7 21:23:59 CEST 2016


On 07/07/2016 12:51 PM, peter dalgaard wrote:
> > On 07 Jul 2016, at 18:15 , Hadley Wickham <h.wickham at gmail.com> wrote:
> >
> > Right - I'm aware of that.  But to me, it doesn't seem correct to
> > print a string that is not a valid R string. Why is an unknown
> > encoding printed like UTF-8?
> >
>
> It isn't -- no UTF-8 would have the \xbf. I may be flogging a dead horse, but it seems to me that there are three alternatives:
>
> - refuse the input (x <- "\xc9\x82\xbf" gives "sorry, not a UTF-8 string" or so)
> - refuse to print it (print(x) gives "cannot print non-UTF-8 string")
> - what happens now
>
> and a fourth one might be to actually allow mixing of \u0007 and \x07 and \007, but I suspect that there are demons down the line, which is why it is not happening now. (Does it ring a bell with anyone?)

A fifth option would be to use only hex escapes whenever invalid UTF-8 is
found.  That would echo back the input in this case.  I have no idea whether
it would cause other problems.
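
To make the idea concrete, here is a rough sketch of that escaping rule, written in Python rather than against R's actual print code. The function name escape_for_print is made up for illustration, and the choice to escape the backslash byte as well is an assumption, not something R does today.

```python
def escape_for_print(raw: bytes) -> str:
    """Hypothetical printer: text if valid UTF-8, otherwise hex escapes only."""
    try:
        # Valid UTF-8 is shown as ordinary text.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Invalid UTF-8 somewhere in the string: do not mix decoded
        # characters with escapes. Keep printable ASCII (except the
        # backslash, an assumption made to keep the output unambiguous)
        # and write every other byte as \xNN.
        return "".join(
            chr(b) if 0x20 <= b < 0x7F and b != 0x5C else f"\\x{b:02x}"
            for b in raw
        )


# The string from this thread: c9 82 is a valid two-byte sequence,
# but the trailing bf is a lone continuation byte.
print(escape_for_print(b"\xc9\x82\xbf"))
```

With this rule the three bytes come back as the same three hex escapes that were typed in, instead of one decoded character followed by \xbf.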

Duncan Murdoch


