[Rd] as.Date (and strptime?) does not recognize "&nbsp;" as a blank

Prof Brian Ripley ripley using stats.ox.ac.uk
Thu Jul 7 13:59:57 CEST 2022


There is some misunderstanding here.  The space is part of the format 
specified by SG to as.Date(), which passes it to strptime(). So SG asked 
to match a space and complained that a different character is not matched!

Reading the documentation of strptime shows

      ‘%n’ Newline on output, arbitrary whitespace on input.
      ‘%t’ Tab on output, arbitrary whitespace on input.

so one might hope that one could use those to specify whitespace instead 
of ASCII space in the format.  But unfortunately whether a Unicode 
no-break space (U+00A0) is whitespace is a matter of opinion -- for 
example the PCRE author changed his a few years back.

We don't have a reproducible example, but my attempt at reproduction 
suggests that U+00A0 is not regarded as whitespace on the system I used.
We know this to be platform-specific (it uses the C function 
iswspace): glibc does not regard this as whitespace and the replacement 
functions used by R on macOS and Windows have followed suit.
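
Since U+00A0 and ASCII space print identically, one way to tell them apart
is to look at code points rather than at the console output.  The following
is only a sketch: nbsp_positions() is a hypothetical helper, not part of
base R, and the final grepl() call is left without a stated result because
that is exactly the platform-specific part.

```r
# Hypothetical helper (not in base R): for each string, return the
# character positions that hold U+00A0 (no-break space).
nbsp_positions <- function(x) {
  lapply(enc2utf8(x), function(s) which(utf8ToInt(s) == 0xA0L))
}

x <- c("\u00a02 Mar 2018", " 2 Mar 2018")
nbsp_positions(x)   # first string: position 1; second string: none

# Whether U+00A0 counts as whitespace here depends on the platform and
# regex engine, so no result is claimed:
grepl("^[[:space:]]", x)
```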

In short, ASCII space matches only itself, and the interpretation of 
'blank' (in regexps) or 'whitespace' (in strptime or regexps) is 
platform-specific and liable to change.
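
Given that, a portable approach is to normalise the input before parsing
rather than rely on what the platform calls whitespace.  A minimal sketch,
assuming only U+00A0 needs replacing; nbsp_to_space() is a hypothetical
helper, and this is a narrower alternative to the "\\h" regexp suggested in
the quoted message below.

```r
# Hypothetical helper: replace every U+00A0 with an ASCII space.
# fixed = TRUE avoids any dependence on regex whitespace classes.
nbsp_to_space <- function(x) gsub("\u00a0", " ", enc2utf8(x), fixed = TRUE)

raw_date <- "\u00a02 Mar 2018"   # leading no-break space, as in the report
clean    <- nbsp_to_space(raw_date)

# "%b" uses the locale's month abbreviations; an English locale is assumed.
as.Date(clean, format = "%e %b %Y")
```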


On 25/06/2022 14:13, Spencer Graves wrote:
> Hi, Maxim et al.:
> 
> 
> On 6/25/22 6:10 AM, Maxim Nazarov wrote:
>> Hello,
>>
>>> When is a space not a space?
>> I guess the answer is when it is a non-breaking one?..
>>
>> We can observe:
>>   > charToRaw(textutils::HTMLdecode("&nbsp;"))
>>   [1] c2 a0
>>   > charToRaw(" ")
>>   [1] 20
>> So one can argue that everything works correctly - `textutils` 
>> function converts HTML's non-breaking space '&nbsp;' into R's 
>> non-breaking space '\xa0', while %e format of as.Date expects a 
>> 'normal' space.
>> But this is obviously not user-friendly especially since both symbols 
>> are displayed the same way on the console.
>> So your options might be to either:
>>   * manually change all 'weird' spaces into normal ones with something 
>> like gsub("\\h", " ", ..., perl = TRUE) - for the list of other weird 
>> spaces see 
>> https://www.pcre.org/original/doc/html/pcrepattern.html#genericchartypes
>>   * persuade textutils author to change &nbsp; into a normal space 
>> (they seem to be working with a simple lookup table - 
>> https://github.com/enricoschumann/textutils/blob/b813c7bd4b55daef5fa7612e3fbfe82962711940/R/char_refs.R#L1465-L1466) 
>>
>>   * persuade R-Core (or submit a PR) to relax expectations of 
>> as.Date/strptime
>>
> 
>        Thanks for the reply.  Since "this is obviously not 
> user-friendly", as you noted, I felt a need to bring it to the attention 
> of this group, and let them decide what if anything they would want to 
> do about it.
> 
> 
>        In any event, I found a fix for my immediate problem.  It's not 
> as elegant as yours, but it works.
> 
>        Best Wishes,
>        Spencer
> 
> 
> 
> 
>> Kind regards,
>> Maxim Nazarov
>>
>> ----- On Jun 25, 2022, at 8:37 AM, Spencer Graves 
>> spencer.graves using prodsyse.com wrote:
>>
>>> Hello, All:
>>>
>>>
>>>       When is a space not a space?
>>>
>>>
>>>       Consider the following:
>>>
>>>
>>>> (pblmDate <- textutils::HTMLdecode("&nbsp;2 Mar 2018"))
>>> [1] " 2 Mar 2018"
>>>> as.Date(pblmDate, format='%e %b %Y')
>>> [1] NA
>>>> as.Date(' 2 Mar 2018', format='%e %b %Y')
>>> [1] "2018-03-02"
>>>
>>>
>>>       Is this a feature or a bug?
>>>
>>>
>>>       I can work around it, now that I know what it is, but it took me a
>>> few hours to diagnose.
>>>
>>>
>>>       Thanks,
>>>       Spencer Graves
>>>
>>>
>>> p.s.  I got this from scraping a website with code that had worked for
>>> me roughly 20 months ago.  I suspect that in the interim, someone
>>> probably replaced ' 2 Mar 2018' with "&nbsp;2 Mar 2018".
>>>
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 


-- 
Brian D. Ripley,                  ripley using stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford


