[R] [External] Re: help with web scraping
Rasmus Liland
jr@| @end|ng |rom po@teo@no
Sat Jul 25 18:30:52 CEST 2020
On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> Dear Rasmus et al.:
It is LILAND et al., is it not? I do
not belong to a large Confucian family
structure (putting the hunter-gatherer
horse-rider tribe name first in all-caps
in the email), else it's customary to
put a comma in there, isn't it? ...
right, moving on:
On 2020-07-25 04:10, Rasmus Liland wrote:
>
> ?????
It might be a better idea to write the
reply in plain-text utf-8 or at least
Western or Eastern-European ISO euro
encoding instead of us-ascii (maybe
KOI8, ¯\_(ツ)_/¯) ... something in your
email got string-replaced by "?????" and
also "«" got replaced by "?".
Please research using Thunderbird, Claws
mail, or some other sane e-mail client;
they are great, I promise.
> Please excuse: Before my last post, I
> had written code to do all that.
Good!
> In brief, the political offices are
> "h3" tags.
Yes, some type of header element at
least, in between the various tables,
all of them children of the div in the
element tree.
> I used "strsplit" to split the string
> at "<h3>". I then wrote a
> function to find "</h3>", extract the
> political office and pass the rest to
> "XML::readHTMLTable", adding columns
> for party and political office.
Yes, doing that for the political office
is also possible, but the party is
inside the table's caption tag, which
ends up as the name of the table in the
XML::readHTMLTable list ...
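
For anyone following along, the
split-at-"<h3>" step described above
could look roughly like this in base R.
This is only a sketch: the HTML string
and the helper name split_office are
made up for illustration and are not
taken from the actual page or from
Spencer's code.

```r
# Hypothetical stand-in for the scraped page (not the real HTML)
html <- paste0(
  "<div>",
  "<h3>Governor</h3><table><tr><td>A. Smith<br/>Party X</td></tr></table>",
  "<h3>Senator</h3><table><tr><td>B. Jones</td></tr></table>",
  "</div>"
)

# Split at every opening <h3>; fixed = TRUE matches the text literally
chunks <- strsplit(html, "<h3>", fixed = TRUE)[[1]]
chunks <- chunks[-1]  # drop whatever came before the first <h3>

# Hypothetical helper: separate the office name from the rest of a chunk
split_office <- function(chunk) {
  end <- as.integer(regexpr("</h3>", chunk, fixed = TRUE))
  list(
    office = substr(chunk, 1, end - 1),
    rest   = substr(chunk, end + nchar("</h3>"), nchar(chunk))
  )
}

parts   <- lapply(chunks, split_office)
offices <- vapply(parts, function(p) p$office, character(1))
```

Each parts[[i]]$rest then holds the
table markup that follows the header and
can be handed on to a table reader.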
> However, this suppressed "<br/>"
> everywhere.
Why is that? Please explain.
> I thought there should be
> an option with something like
> "XML::readHTMLTable" that would not
> delete "<br/>" everywhere, but I
> couldn't find it.
No, there is not, AFAIK. If anyone else
knows of one, please say so *echoes in
the forest*
> If you aren't aware of one, I can
> gsub("<br/>", "\n", ...) on the string
> for each political office before
> passing it to "XML::readHTMLTable". I
> just tested this: It works.
Such a great hack! IMHO, this is much
more flexible than using
xml2::read_html, rvest::html_table,
dplyr::mutate like here[1].
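
A minimal sketch of that gsub step,
assuming the XML package is installed.
The table snippet is invented for
illustration, and only the literal
"<br/>" spelling is handled (not "<br>"
or "<br />").

```r
# Hypothetical chunk for one political office (not the real HTML)
chunk <- paste0(
  "<table><tr><th>Candidate</th></tr>",
  "<tr><td>A. Smith<br/>Party X</td></tr></table>"
)

# Turn each literal <br/> into a newline before parsing, so the
# line break survives as text inside the cell
chunk_nl <- gsub("<br/>", "\n", chunk, fixed = TRUE)

# Parse only if the XML package is available (assumption: it is installed)
if (requireNamespace("XML", quietly = TRUE)) {
  doc  <- XML::htmlParse(chunk_nl, asText = TRUE)
  tabs <- XML::readHTMLTable(doc, stringsAsFactors = FALSE)
  print(tabs[[1]])
}
```

The newline can later be split on with
strsplit(x, "\n", fixed = TRUE) if the
cell parts are wanted as separate
columns.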
> I have other web scraping problems in
> my work plan for the few days.
Maybe, idk ...
> I will definitely try
> XML::htmlTreeParse, etc., as you
> suggest.
I wish you good luck,
Rasmus
[1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells