[Rd] as.Date (and strptime?) does not recognize "&nbsp;" as a blank
Spencer Graves
spencer.graves at prodsyse.com
Thu Jul 7 14:25:08 CEST 2022
Thanks, Prof. Ripley, for your further analysis of this issue. sg
On 7/7/22 6:59 AM, Prof Brian Ripley wrote:
> There is some misunderstanding here. The space is part of the format
> specified by SG to as.Date(), which passes it to strptime(). So SG asked
> to match a space and complained that a different character is not matched!
>
> Reading the documentation of strptime shows
>
> ‘%n’ Newline on output, arbitrary whitespace on input.
> ‘%t’ Tab on output, arbitrary whitespace on input.
>
> so one might hope that one could use those to specify whitespace instead
> of ASCII space in the format. But unfortunately whether a Unicode
> no-break space (U+00A0) is whitespace is a matter of opinion -- for
> example the PCRE author changed his a few years back.
>
> We don't have a reproducible example, but my attempt at reproduction
> suggests that U+00A0 is not regarded as whitespace on the system I used.
> We know this to be platform-specific (it uses the C function
> iswspace): glibc does not regard this as whitespace and the replacement
> functions used by R on macOS and Windows have followed suit.
>
> In short, ASCII space matches only itself, and the interpretation of
> 'blank' (in regexps) or 'whitespace' (in strptime or regexps) is
> platform-specific and liable to change.
>
>
> On 25/06/2022 14:13, Spencer Graves wrote:
>> Hi, Maxim et al.:
>>
>>
>> On 6/25/22 6:10 AM, Maxim Nazarov wrote:
>>> Hello,
>>>
>>>> When is a space not a space?
>>> I guess the answer is when it is a non-breaking one?..
>>>
>>> We can observe:
>>> > charToRaw(textutils::HTMLdecode("&nbsp;"))
>>> [1] c2 a0
>>> > charToRaw(" ")
>>> [1] 20
>>> So one can argue that everything works correctly - `textutils`
>>> function converts HTML's non-breaking space '&nbsp;' into R's
>>> non-breaking space '\xa0', while %e format of as.Date expects a
>>> 'normal' space.
>>> But this is obviously not user-friendly especially since both symbols
>>> are displayed the same way on the console.
>>> So your options might be to either:
>>> * manually change all 'weird' spaces into normal ones with
>>> something like gsub("\\h", " ", ..., perl = TRUE) - for the list of
>>> other weird spaces see
>>> https://www.pcre.org/original/doc/html/pcrepattern.html#genericchartypes
>>> * persuade textutils author to change &nbsp; into a normal space
>>> (they seem to be working with a simple lookup table -
>>> https://github.com/enricoschumann/textutils/blob/b813c7bd4b55daef5fa7612e3fbfe82962711940/R/char_refs.R#L1465-L1466)
>>>
>>> * persuade R-Core (or submit a PR) to relax expectations of
>>> as.Date/strptime
>>>
>>
>> Thanks for the reply. Since "this is obviously not
>> user-friendly", as you noted, I felt a need to bring it to the
>> attention of this group, and let them decide what if anything they
>> would want to do about it.
>>
>>
>> In any event, I found a fix for my immediate problem. It's not
>> as elegant as yours, but it works.
>>
>> Best Wishes,
>> Spencer
>>
>>
>>
>>
>>> Kind regards,
>>> Maxim Nazarov
>>>
>>> ----- On Jun 25, 2022, at 8:37 AM, Spencer Graves
>>> spencer.graves using prodsyse.com wrote:
>>>
>>>> Hello, All:
>>>>
>>>>
>>>> When is a space not a space?
>>>>
>>>>
>>>> Consider the following:
>>>>
>>>>
>>>>> (pblmDate <- textutils::HTMLdecode("&nbsp;2 Mar 2018"))
>>>> [1] " 2 Mar 2018"
>>>>> as.Date(pblmDate, format='%e %b %Y')
>>>> [1] NA
>>>>> as.Date(' 2 Mar 2018', format='%e %b %Y')
>>>> [1] "2018-03-02"
>>>>
>>>>
>>>> Is this a feature or a bug?
>>>>
>>>>
>>>> I can work around it, now that I know what it is, but it took me a
>>>> few hours to diagnose.
>>>>
>>>>
>>>> Thanks,
>>>> Spencer Graves
>>>>
>>>>
>>>> p.s. I got this from scraping a website with code that had worked for
>>>> me roughly 20 months ago. I suspect that in the interim, someone
>>>> probably replaced ' 2 Mar 2018' with "&nbsp;2 Mar 2018".
>>>>
>>>> ______________________________________________
>>>> R-devel using r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>
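For readers landing on this thread, here is a minimal sketch of the workaround discussed above: replace U+00A0 with an ASCII space before calling as.Date(). The helper name normalize_nbsp is hypothetical (it is not part of base R or textutils), and the input string is built by hand rather than with textutils::HTMLdecode() so the example has no package dependency. Parsing "%b" depends on the session's LC_TIME setting, so the last line assumes English month names.

```r
# Hypothetical helper (not in base R or textutils): swap each no-break
# space (U+00A0) for an ASCII space so that a literal " " in a strptime
# format can match it.
normalize_nbsp <- function(x) {
  gsub("\u00a0", " ", x, fixed = TRUE)
}

# Hand-built stand-in for the scraped string: U+00A0 followed by the date.
pblmDate <- paste0("\u00a0", "2 Mar 2018")

cleaned <- normalize_nbsp(pblmDate)

# The first character is now an ordinary space (raw byte 20).
charToRaw(substr(cleaned, 1, 1))

# Month-name parsing depends on LC_TIME; this assumes an English locale.
as.Date(cleaned, format = "%e %b %Y")
```

The fixed-string replacement only targets U+00A0. The gsub("\\h", " ", ..., perl = TRUE) suggestion quoted above is aimed at a wider set of horizontal whitespace characters.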