[R] [External] Re: help with web scraping

Sat Jul 25 18:30:52 CEST 2020

On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> Dear Rasmus et al.:

It is LILAND et al., is it not?  I do 
not belong to a large Confucian family 
structure (putting the hunter-gatherer 
horse-rider tribe name first in all-caps 
in the email), else it's customary to 
put a comma in there, isn't it? ... 
right, moving on:

On 2020-07-25 04:10, Rasmus Liland wrote:
> 
>  ????? 

It might be a better idea to write the 
reply in plain-text utf-8 or at least 
Western or Eastern-European ISO euro 
encoding instead of us-ascii (maybe 
KOI8, ¯\_(ツ)_/¯) ...  something in your 
email got string-replaced by "?????" and 
also "«" got replaced by "?".

Please look into Thunderbird, Claws 
Mail, or some other sane e-mail client; 
they are great, I promise.

> Please excuse:  Before my last post, I 
> had written code to do all that.

Good!

> In brief, the political offices are 
> "h3" tags.

Yes, some type of header element at 
least, in between the various tables, 
all of them children of the div in the 
element tree.

> I used "strsplit" to split the string 
> at "<h3>".  I then wrote a 
> function to find "</h3>", extract the 
> political office and pass the rest to 
> "XML::readHTMLTable", adding columns 
> for party and political office.

Yes, doing that for the political office 
is also possible, but the party is 
inside the table's caption tag, which 
ends up as the name of the table in the 
XML::readHTMLTable list ...
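
For the archive, a minimal sketch of the 
splitting approach described above, in 
base R only.  The HTML fragment, the 
helper name split_office, and the regex 
that pulls out the caption are all made 
up for illustration; the real page is 
not shown in this thread and may well 
need more careful handling.

```r
# Hypothetical page fragment: offices in <h3>, party in each table's <caption>.
page <- paste0(
  "<div>",
  "<h3>Governor</h3><table><caption>Party A</caption><tr><td>x</td></tr></table>",
  "<h3>Senator</h3><table><caption>Party B</caption><tr><td>y</td></tr></table>",
  "</div>")

# Split at every opening <h3>; the first piece precedes the first header, so drop it.
pieces <- strsplit(page, "<h3>", fixed = TRUE)[[1]][-1]

# Hypothetical helper: separate the office name from the HTML that follows it.
split_office <- function(piece) {
  end <- regexpr("</h3>", piece, fixed = TRUE)
  list(office = substr(piece, 1, end - 1),
       rest   = substr(piece, end + nchar("</h3>"), nchar(piece)))
}

sections <- lapply(pieces, split_office)
offices  <- vapply(sections, function(s) s$office, character(1))

# Assumes exactly one <caption> per piece; a regex on HTML is brittle.
parties <- vapply(sections, function(s) {
  sub(".*<caption>(.*?)</caption>.*", "\\1", s$rest, perl = TRUE)
}, character(1))

print(data.frame(office = offices, party = parties))
```

Each "rest" string could then be handed 
to a table parser, with office and party 
added as extra columns afterwards.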

> However, this suppressed "<br/>" 
> everywhere.

Why is that?  Please explain.

> I thought there should be 
> an option with something like 
> "XML::readHTMLTable" that would not 
> delete "<br/>" everywhere, but I 
> couldn't find it.

No, there is not, AFAIK.  If anyone 
else knows of one, please say so *echoes 
in the forest*

> If you aren't aware of one, I can 
> gsub("<br/>", "\n", ...) on the string 
> for each political office before 
> passing it to "XML::readHTMLTable".  I 
> just tested this:  It works.

Such a great hack!  IMHO, this is much 
more flexible than using 
xml2::read_html, rvest::html_table, 
dplyr::mutate like here[1].
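
A minimal sketch of that gsub step, 
assuming a made-up table fragment (the 
real page content is not in this 
thread).  The parsing part only runs if 
the XML package is installed; that the 
newline then shows up inside the cell is 
what was reported above, not something 
this sketch guarantees.

```r
# Hypothetical HTML fragment with a line break inside one cell.
html <- paste0(
  "<table><caption>Example Party</caption>",
  "<tr><th>Name</th><th>Address</th></tr>",
  "<tr><td>Jane Doe</td><td>1 Main St<br/>Springfield</td></tr>",
  "</table>")

# Replace every literal "<br/>" with a newline before parsing.
# fixed = TRUE treats the pattern as a plain string, not a regular expression.
html_nl <- gsub("<br/>", "\n", html, fixed = TRUE)

# Parse only when the XML package is available.
if (requireNamespace("XML", quietly = TRUE)) {
  doc  <- XML::htmlParse(html_nl, asText = TRUE)
  tabs <- XML::readHTMLTable(doc, stringsAsFactors = FALSE)
  print(tabs)
}
```

Note that this only catches the exact 
spelling "<br/>"; variants such as 
"<br>" or "<br />" would need their own 
pattern.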

> I have other web scraping problems in 
> my work plan for the few days.

Maybe, idk ... 

> I will definitely try 
> XML::htmlTreeParse, etc., as you 
> suggest.

I wish you good luck,
Rasmus

[1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells
