[R] [External] Re: help with web scraping
Spencer Graves
spencer.graves sending from effectivedefense.org
Sat Jul 25 16:56:38 CEST 2020
Dear Rasmus et al.:
On 2020-07-25 04:10, Rasmus Liland wrote:
> On 2020-07-24 10:28 -0500, Spencer Graves wrote:
>> Dear Rasmus:
>>
>>> Dear Spencer,
>>>
>>> I unified the party tables after the
>>> first summary table like this:
>>>
>>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>> M_sos <- RCurl::getURL(url)
>>> saveRDS(object=M_sos, file="dcp.rds")
>>> dat <- XML::readHTMLTable(M_sos)
>>> idx <- 2:length(dat)
>>> cn <- unique(unlist(lapply(dat[idx], colnames)))
>> This is useful for this application.
>>
>>> dat <- do.call(rbind,
>>> sapply(idx, function(i, dat, cn) {
>>> x <- dat[[i]]
>>> x[,cn[!(cn %in% colnames(x))]] <- NA
>>> x <- x[,cn]
>>> x$Party <- names(dat)[i]
>>> return(list(x))
>>> }, dat=dat, cn=cn))
>>> dat[,"Date Filed"] <-
>>> as.Date(x=dat[,"Date Filed"],
>>> format="%m/%d/%Y")
>> This misses something extremely
>> important for this application: the
>> political office. That's buried in
>> the HTML or whatever it is. I'm using
>> something like the following to find
>> that:
>>
>> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
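[A hedged aside: the gregexpr() call above returns the match offsets inside the raw HTML string. A minimal, self-contained sketch of that idea, where toy_html is an invented stand-in for M_sos:]

```r
## Toy stand-in for M_sos; the real string comes from RCurl::getURL().
toy_html <- "<h3>Governor</h3><table>...</table><h3>Lieutenant Governor</h3>"
hit <- gregexpr("Lieutenant Governor", toy_html)[[1]]
## gregexpr() returns the start offset(s) of each match,
## with the match widths stored in the "match.length" attribute.
start <- hit[1]
width <- attr(hit, "match.length")[1]
substring(toy_html, start, start + width - 1)  # "Lieutenant Governor"
```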
> Dear Spencer,
>
> I came up with a solution, but it is not
> very elegant. Instead of showing you
> the solution, hoping you understand
> everything in it, I instead want to give
> you some emphatic hints to see if you
> can come up with a solution on your own.
>
> - XML::htmlTreeParse(M_sos)
> - *Gandalf voice*: climb the tree
> until you find the content you are
> looking for flat out at the level of
> "The Children of the Div", *uuuUUU*
> - you only want to keep the table and
> header tags at this level
> - Use XML::xmlValue to extract the
> values of all the headers (the
> political positions)
> - Observe that all the tables on the
> page you were able to extract
> previously using XML::readHTMLTable
> are at this level, interleaved with
> the political position header tags.
> This means you can extract the
> political position and party
> affiliation using a for loop, if
> statements, typeof, names, and [] and
> [[]] to grab different things from the
> list (content or the bag itself).
> XML::readHTMLTable strips away the
> line break tags from the Mailing
> address, so if you find a better way
> of extracting the tables, tell me,
> e.g. you get
>
> 8805 HUNTER AVEKANSAS CITY MO 64138
>
> and not
>
> 8805 HUNTER AVE<br/>KANSAS CITY MO 64138
>
> When you've completed this "programming
> quest", you're back at the level of the
> previous email, i.e. you have the
> same tables, but with political position
> and party affiliation added to them.
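[A hedged sketch of those hints, using a toy page in place of the real Secretary of State page; the div id, office names, and table contents are invented:]

```r
library(XML)
## Toy page shaped like the layout described above: office headers (<h3>)
## interleaved with candidate tables inside one <div>.
toy <- '<div id="main">
  <h3>Governor</h3>
  <table><tr><th>Name</th></tr><tr><td>A. CANDIDATE</td></tr></table>
  <h3>Lieutenant Governor</h3>
  <table><tr><th>Name</th></tr><tr><td>B. CANDIDATE</td></tr></table>
</div>'
doc <- htmlTreeParse(toy, useInternalNodes = TRUE)
## Keep only the h3 and table children at this level of the div.
kids <- getNodeSet(doc, "//div[@id='main']/*[self::h3 or self::table]")
office <- NA_character_
out <- list()
for (k in kids) {
  if (xmlName(k) == "h3") {
    office <- xmlValue(k)          # a political position header
  } else {
    tab <- readHTMLTable(k, header = TRUE, stringsAsFactors = FALSE)
    tab$Office <- office           # tag the table with its header
    out[[length(out) + 1]] <- tab
  }
}
do.call(rbind, out)
```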
      Please excuse: Before my last post, I had written code to do all
that. In brief, the political offices are "h3" tags. I used "strsplit"
to split the string at "<h3>". I then wrote a function to find "</h3>",
extract the political office, and pass the rest to "XML::readHTMLTable",
adding columns for party and political office.
      However, this suppressed "<br/>" everywhere. I thought there
should be an option with something like "XML::readHTMLTable" that would
not delete "<br/>" everywhere, but I couldn't find it. If you aren't
aware of one, I can gsub("<br/>", "\n", ...) on the string for each
political office before passing it to "XML::readHTMLTable". I just
tested this: it works.
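[That gsub-then-parse fix might be sketched like this, on a toy candidate table with an invented name and the address from the example above:]

```r
library(XML)
## Toy candidate table with a <br/> inside the mailing address.
toy <- '<table>
  <tr><th>Name</th><th>Mailing Address</th></tr>
  <tr><td>JANE DOE</td><td>8805 HUNTER AVE<br/>KANSAS CITY MO 64138</td></tr>
</table>'
## readHTMLTable() drops the <br/>, fusing the address lines;
## substituting a newline first preserves the break in the cell text.
fixed <- gsub("<br/>", "\n", toy)
tab <- readHTMLTable(fixed, header = TRUE, stringsAsFactors = FALSE)[[1]]
tab[1, 2]  # the address now contains a "\n" where the <br/> was
```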
      I have other web scraping problems in my work plan for the next
few days. I will definitely try XML::htmlTreeParse, etc., as you
suggest.
      Thanks again.
      Spencer Graves
>
> Best,
> Rasmus
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.