[Rd] as.Date (and strptime?) does not recognize "&nbsp;" as a blank
Gabriel Becker
g@bembecker @end|ng |rom gm@||@com
Thu Jul 7 19:42:34 CEST 2022
Depends a bit on what you mean by "automatically". This seems to work for
me (note this has NOT been extensively tested on different OSes or even in
different locales/encodings):
library(XML)
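# First data cell holds a non-breaking space entity, the second an ordinary space
myhtml <- paste0("<html><body><table id='hiya'><tr><th>colname</th></tr>",
                 "<tr><td>&nbsp;</td></tr><tr><td> </td></tr>",
                 "</table></body></html>")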
doc <- htmlParse(myhtml, asText = TRUE)
oldway <- readHTMLTable(doc, trim = FALSE)
identical(oldway$hiya$colname[1], oldway$hiya$colname[2]) # FALSE :(
decode_nbsp <- function(x) gsub(rawToChar(as.raw(c(0xc2, 0xa0))), " ", x,
fixed = TRUE, useBytes = TRUE)
fancypants <- function(node) decode_nbsp(xmlValue(node))
newandfancy <- readHTMLTable(doc, trim = FALSE, elFun = fancypants)
identical(newandfancy$hiya$colname[1], newandfancy$hiya$colname[2]) # TRUE :D
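
The same replacement should also help with the as.Date() case from the original post. A minimal sketch, assuming the text is UTF-8 (where the non-breaking space is U+00A0, i.e. bytes c2 a0); nbsp_to_space and pblm_date are names made up here for illustration, not functions or objects from any package, and this has not been tested across platforms or locales:

```r
# Assumption: input is UTF-8; U+00A0 is the non-breaking space.
nbsp <- "\u00a0"

# Hypothetical helper: swap non-breaking spaces for ordinary spaces.
nbsp_to_space <- function(x) gsub(nbsp, " ", x, fixed = TRUE)

# Stand-in for the scraped value from the original post.
pblm_date <- paste0(nbsp, "2 Mar 2018")

# The code point shows why it is not an ordinary blank.
utf8ToInt(pblm_date)[1]  # 160 (an ordinary space is 32)

fixed_date <- nbsp_to_space(pblm_date)
identical(fixed_date, " 2 Mar 2018")

# Note that %b depends on the current locale's month abbreviations.
as.Date(fixed_date, format = "%e %b %Y")
```

Checking utf8ToInt() (or charToRaw()) on a suspicious string is a quick way to tell whether a "space" is really a space.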
Best,
~G
On Fri, Jun 24, 2022 at 11:48 PM Spencer Graves <spencer.graves using prodsyse.com>
wrote:
> p.s. Is there a way to get XML::readHTMLTable to automatically convert
> "&nbsp;" to a normal blank space?
>
>
> On 6/25/22 1:37 AM, Spencer Graves wrote:
> > Hello, All:
> >
> >
> > When is a space not a space?
> >
> >
> > Consider the following:
> >
> >
> > > (pblmDate <- textutils::HTMLdecode("&nbsp;2 Mar 2018"))
> > [1] " 2 Mar 2018"
> > > as.Date(pblmDate, format='%e %b %Y')
> > [1] NA
> > > as.Date(' 2 Mar 2018', format='%e %b %Y')
> > [1] "2018-03-02"
> >
> >
> > Is this a feature or a bug?
> >
> >
> > I can work around it, now that I know what it is, but it took me
> > a few hours to diagnose.
> >
> >
> > Thanks,
> > Spencer Graves
> >
> >
> > p.s. I got this from scraping a website with code that had worked for
> > me roughly 20 months ago. I suspect that in the interim, someone
> > probably replaced ' 2 Mar 2018' with "&nbsp;2 Mar 2018".
> >
> > ______________________________________________
> > R-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel