[R] [External] Re: help with web scraping

Spencer Graves
Sat Jul 25 19:43:20 CEST 2020


Dear Rasmus Liland et al.:


On 2020-07-25 11:30, Rasmus Liland wrote:
> On 2020-07-25 09:56 -0500, Spencer Graves wrote:
>> Dear Rasmus et al.:
> 
> It is LILAND et al., is it not?  ... else it's customary to
> put a comma in there, isn't it? ...


The APA Style recommends "Sharp et al., 2007":


https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html


	  Regarding Confucius, I'm confused.



> right, moving on:
> 
> On 2020-07-25 04:10, Rasmus Liland wrote:
>>

<snip>

> 
> Please research using Thunderbird, Claws
> mail, or some other sane e-mail client;
> they are great, I promise.


Thanks.  I researched it and turned off HTML.  Please excuse:  I noticed 
it was a problem, but hadn't prioritized time to research and fix it 
until your comment.  Thanks.

> 
>> Please excuse:  Before my last post, I
>> had written code to do all that.
> 
> Good!
> 
>> In brief, the political offices are
>> "h3" tags.
> 
> Yes, some type of header element at
> least, in-between the various tables,
> everything children of the div in the
> element tree.
> 
>> I used "strsplit" to split the string
>> at "<h3>".  I then wrote a
>> function to find "</h3>", extract the
>> political office and pass the rest to
>> "XML::readHTMLTable", adding columns
>> for party and political office.
> 
> Yes, doing that for the political office
> is also possible, but the party is
> inside the table's caption tag, which
> ends up as the name of the table in the
> XML::readHTMLTable list ...
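
To make the splitting step concrete, here is a minimal sketch of what I described. The page fragment, the office names, and the helper split_office are all invented for illustration; the real page's markup is not reproduced here, and the party (which sits in the table caption, as you note) is not handled. The XML calls are guarded because that package may not be installed.

```r
# Hypothetical page fragment; office names and table contents are invented.
page <- paste0(
  "<div>",
  "<h3>Governor</h3>",
  "<table><tr><th>Name</th></tr><tr><td>A. Smith</td></tr></table>",
  "<h3>Attorney General</h3>",
  "<table><tr><th>Name</th></tr><tr><td>B. Jones</td></tr></table>",
  "</div>"
)

# Split at each opening <h3>; the first piece precedes any office, so drop it.
pieces <- strsplit(page, "<h3>", fixed = TRUE)[[1]][-1]

# Hypothetical helper: separate the office name from the HTML after "</h3>".
split_office <- function(piece) {
  end <- as.integer(regexpr("</h3>", piece, fixed = TRUE))
  list(
    office = substr(piece, 1, end - 1),
    rest   = substr(piece, end + nchar("</h3>"), nchar(piece))
  )
}

parts <- lapply(pieces, split_office)
offices <- vapply(parts, function(p) p$office, character(1))
print(offices)

# Pass each remainder to XML::readHTMLTable and add the office as a column.
if (requireNamespace("XML", quietly = TRUE)) {
  tabs <- lapply(parts, function(p) {
    doc <- XML::htmlParse(p$rest, asText = TRUE)
    tab <- XML::readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
    tab$office <- p$office
    tab
  })
  combined <- do.call(rbind, tabs)
  print(combined)
}
```
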
> 
>> However, this suppressed "<br/>"
>> everywhere.
> 
> Why is that, please explain.
> 

	  I don't know why the Missouri Secretary of State's web site includes 
"<br/>" to signal a new line, but it does.  I also don't know why 
XML::readHTMLTable suppressed "<br/>" everywhere it occurred, but it did 
that.  After I used gsub to replace "<br/>" with "\n", I found that 
XML::readHTMLTable did not replace "\n", so I got what I wanted.
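
	  A minimal sketch of that workaround, assuming a made-up table fragment (the candidate name and address are invented, and parse_first_table is a hypothetical helper, not something from the XML package).  It parses the same fragment with and without the gsub step so the two results can be compared.  The XML calls are guarded in case the package is not installed.

```r
# Hypothetical HTML fragment; the real page's markup is not shown here.
html <- paste0(
  "<table>",
  "<tr><th>Candidate</th><th>Address</th></tr>",
  "<tr><td>Jane Doe</td><td>1 Main St<br/>Springfield, MO</td></tr>",
  "</table>"
)

# Replace each literal "<br/>" with a newline character before parsing.
html_nl <- gsub("<br/>", "\n", html, fixed = TRUE)

if (requireNamespace("XML", quietly = TRUE)) {
  # Hypothetical helper: parse an HTML string and return its first table.
  parse_first_table <- function(x) {
    doc <- XML::htmlParse(x, asText = TRUE)
    XML::readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
  }

  raw_tab   <- parse_first_table(html)     # <br/> contributes no text
  fixed_tab <- parse_first_table(html_nl)  # newline survives inside the cell

  print(raw_tab[[2]])
  print(fixed_tab[[2]])
}
```

	  The newline can later be split on or replaced with whatever separator suits the downstream analysis.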


>> I thought there should be
>> an option with something like
>> "XML::readHTMLTable" that would not
>> delete "<br/>" everywhere, but I
>> couldn't find it.
> 
> No, there is not, AFAIK.  Please, if
> anyone else knows, please say so *echoes
> in the forest*
> 
>> If you aren't aware of one, I can
>> gsub("<br/>", "\n", ...) on the string
>> for each political office before
>> passing it to "XML::readHTMLTable".  I
>> just tested this:  It works.
> 
> Such a great hack!  IMHO, this is much
> more flexible than using
> xml2::read_html, rvest::read_table,
> dplyr::mutate like here[1]
> 
>> I have other web scraping problems in
>> my work plan for the next few days.
> 
> Maybe, idk ...
> 
>> I will definitely try
>> XML::htmlTreeParse, etc., as you
>> suggest.
> 
> I wish you good luck,
> Rasmus
> 
> [1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells


	  I have added my solution to this problem to that Stack Overflow thread.


	  Thanks again,
	  Spencer


