[R] [External] Re: help with web scraping
Spencer Graves
spencer.graves sending from effectivedefense.org
Sat Jul 25 16:56:38 CEST 2020
Dear Rasmus et al.:
On 2020-07-25 04:10, Rasmus Liland wrote:
> On 2020-07-24 10:28 -0500, Spencer Graves wrote:
>> Dear Rasmus:
>>
>>> Dear Spencer,
>>>
>>> I unified the party tables after the
>>> first summary table like this:
>>>
>>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>> M_sos <- RCurl::getURL(url)
>>> saveRDS(object=M_sos, file="dcp.rds")
>>> dat <- XML::readHTMLTable(M_sos)
>>> idx <- 2:length(dat)
>>> cn <- unique(unlist(lapply(dat[idx], colnames)))
>> This is useful for this application.
>>
>>> dat <- do.call(rbind,
>>> sapply(idx, function(i, dat, cn) {
>>> x <- dat[[i]]
>>> x[,cn[!(cn %in% colnames(x))]] <- NA
>>> x <- x[,cn]
>>> x$Party <- names(dat)[i]
>>> return(list(x))
>>> }, dat=dat, cn=cn))
>>> dat[,"Date Filed"] <-
>>> as.Date(x=dat[,"Date Filed"],
>>> format="%m/%d/%Y")
>> This misses something extremely
>> important for this application: the
>> political office. That's buried in
>> the HTML or whatever it is. I'm using
>> something like the following to find
>> that:
>>
>> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
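[A hedged aside: the gregexpr() call above returns the match offsets inside the raw HTML string. A minimal, self-contained sketch of that idea, where toy_html is an invented stand-in for M_sos:]

```r
## Toy stand-in for M_sos; the real string comes from RCurl::getURL().
toy_html <- "<h3>Governor</h3><table>...</table><h3>Lieutenant Governor</h3>"
hit <- gregexpr("Lieutenant Governor", toy_html)[[1]]
## gregexpr() returns the start offset(s) of each match,
## with the match widths stored in the "match.length" attribute.
start <- hit[1]
width <- attr(hit, "match.length")[1]
substring(toy_html, start, start + width - 1)  # "Lieutenant Governor"
```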
> Dear Spencer,
>
> I came up with a solution, but it is not
> very elegant. Instead of showing you
> the solution, hoping you understand
> everything in it, I instead want to give
> you some emphatic hints to see if you
> can come up with a solution on your own.
>
> - XML::htmlTreeParse(M_sos)
> - *Gandalf voice*: climb the tree
> until you find the content you are
> looking for flat out at the level of
> "The Children of the Div", *uuuUUU*
> - you only want to keep the table and
> header tags at this level
> - Use XML::xmlValue to extract the
> values of all the headers (the
> political positions)
> - Observe that all the tables on the
> page you were able to extract
> previously using XML::readHTMLTable
> are at this level, interleaved with
> the political position header tags.
> This means you can extract the
> political position and party
> affiliation using a for loop, if
> statements, typeof, names, and [] and
> [[]] to grab different things from the
> list (content or the bag itself).
> XML::readHTMLTable strips away the
> line break tags from the Mailing
> address, so if you find a better way
> of extracting the tables, tell me,
> e.g. you get
>
> 8805 HUNTER AVEKANSAS CITY MO 64138
>
> and not
>
> 8805 HUNTER AVE<br/>KANSAS CITY MO 64138
>
> When you've completed this "programming
> quest", you're back at the level of the
> previous email, i.e. you have the
> same tables, but with political position
> and party affiliation added to them.
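[A hedged sketch of those hints, using a toy page in place of the real Secretary of State page; the div id, office names, and table contents are invented:]

```r
library(XML)
## Toy page shaped like the layout described above: office headers (<h3>)
## interleaved with candidate tables inside one <div>.
toy <- '<div id="main">
  <h3>Governor</h3>
  <table><tr><th>Name</th></tr><tr><td>A. CANDIDATE</td></tr></table>
  <h3>Lieutenant Governor</h3>
  <table><tr><th>Name</th></tr><tr><td>B. CANDIDATE</td></tr></table>
</div>'
doc <- htmlTreeParse(toy, useInternalNodes = TRUE)
## Keep only the h3 and table children at this level of the div.
kids <- getNodeSet(doc, "//div[@id='main']/*[self::h3 or self::table]")
office <- NA_character_
out <- list()
for (k in kids) {
  if (xmlName(k) == "h3") {
    office <- xmlValue(k)          # a political position header
  } else {
    tab <- readHTMLTable(k, header = TRUE, stringsAsFactors = FALSE)
    tab$Office <- office           # tag the table with its header
    out[[length(out) + 1]] <- tab
  }
}
do.call(rbind, out)
```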
      Please excuse: Before my last post, I had written code to do all
that. In brief, the political offices are "h3" tags. I used "strsplit"
to split the string at "<h3>". I then wrote a function to find "</h3>",
extract the political office, and pass the rest to "XML::readHTMLTable",
adding columns for party and political office.
      However, this suppressed "<br/>" everywhere. I thought there
should be an option with something like "XML::readHTMLTable" that would
not delete "<br/>" everywhere, but I couldn't find it. If you aren't
aware of one, I can gsub("<br/>", "\n", ...) on the string for each
political office before passing it to "XML::readHTMLTable". I just
tested this: it works.
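[That gsub-then-parse fix might be sketched like this, on a toy candidate table with an invented name and the address from the example above:]

```r
library(XML)
## Toy candidate table with a <br/> inside the mailing address.
toy <- '<table>
  <tr><th>Name</th><th>Mailing Address</th></tr>
  <tr><td>JANE DOE</td><td>8805 HUNTER AVE<br/>KANSAS CITY MO 64138</td></tr>
</table>'
## readHTMLTable() drops the <br/>, fusing the address lines;
## substituting a newline first preserves the break in the cell text.
fixed <- gsub("<br/>", "\n", toy)
tab <- readHTMLTable(fixed, header = TRUE, stringsAsFactors = FALSE)[[1]]
tab[1, 2]  # the address now contains a "\n" where the <br/> was
```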
      I have other web scraping problems in my work plan for the next
few days. I will definitely try XML::htmlTreeParse, etc., as you
suggest.
      Thanks again.
      Spencer Graves
>
> Best,
> Rasmus
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.