[R] [External] Re: help with web scraping
William Michels
wjm1 @end|ng |rom c@@@co|umb|@@edu
Sat Jul 25 19:58:12 CEST 2020
Dear Spencer Graves (and Rasmus Liland),
I've had some luck just using gsub() to alter the offending "<br/>"
tags, appending a "___" marker after each instance of "<br/>" (first I
checked the text to make sure it didn't contain any pre-existing
instances of "___"). See the output snippet below:
> library(RCurl)
> library(XML)
> sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> sosChars <- getURL(sosURL)
> sosChars2 <- gsub("<br/>", "<br/>___", sosChars)
> MOcan <- readHTMLTable(sosChars2)
> MOcan[[2]]
                  Name                           Mailing Address Random Number Date Filed
1       Raleigh Ritter       4476 FIVE MILE RD___SENECA MO 64865           185  2/25/2020
2          Mike Parson          1458 E 464 RD___BOLIVAR MO 65613           348  2/25/2020
3 James W. (Jim) Neely             PO BOX 343___CAMERON MO 64429           477  2/25/2020
4     Saundra McDowell  3854 SOUTH AVENUE___SPRINGFIELD MO 65807                3/31/2020
>
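Once the "___" markers are in place, the addresses can be split back
into street and city/state/ZIP parts. A minimal sketch (literal
strings below, standing in for the "Mailing Address" column of
MOcan[[2]]):

```r
## Split each "___"-marked address into its two components.
addr <- c("4476 FIVE MILE RD___SENECA MO 64865",
          "PO BOX 343___CAMERON MO 64429")
parts  <- strsplit(addr, "___", fixed = TRUE)
street <- vapply(parts, `[`, character(1), 1)  # part before "___"
city   <- vapply(parts, `[`, character(1), 2)  # part after "___"
street  # "4476 FIVE MILE RD" "PO BOX 343"
city    # "SENECA MO 64865"   "CAMERON MO 64429"
```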
It's true, there's one 'section' of the MOcan output that contains
odd-looking characters (see the "Total" line of MOcan[[1]]). But my
guess is you'll be deleting that line anyway--and recalculating totals in R.
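For the recalculation, something like this should do (made-up data and
column names, just to show the idea):

```r
## Drop the scraped "Total" row and recompute the total in R.
d <- data.frame(Name = c("A", "B", "Total"), N = c(10, 20, 30))
d <- d[d$Name != "Total", ]  # delete the summary line
sum(d$N)                     # 30
```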
Now that you have a comprehensive list object, you should be able to
pull out districts/races of interest. You might want to take a look at
the "rlist" package, to see if it can make your work a little easier:
https://CRAN.R-project.org/package=rlist
https://renkun-ken.github.io/rlist-tutorial/index.html
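For pulling out races of interest, rlist's list.filter() is one route;
the same idea is also a one-liner in base R. A sketch on toy data (not
the real MOcan object):

```r
## Keep only the list elements (races) with more than one candidate.
races <- list(gov   = data.frame(Name = c("A", "B")),
              ltgov = data.frame(Name = "C"))
contested <- Filter(function(x) nrow(x) > 1, races)
names(contested)  # "gov"
```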
HTH, Bill.
W. Michels, Ph.D.
On Sat, Jul 25, 2020 at 7:56 AM Spencer Graves
<spencer.graves using effectivedefense.org> wrote:
>
> Dear Rasmus et al.:
>
>
> On 2020-07-25 04:10, Rasmus Liland wrote:
> > On 2020-07-24 10:28 -0500, Spencer Graves wrote:
> >> Dear Rasmus:
> >>
> >>> Dear Spencer,
> >>>
> >>> I unified the party tables after the
> >>> first summary table like this:
> >>>
> >>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> >>> M_sos <- RCurl::getURL(url)
> >>> saveRDS(object=M_sos, file="dcp.rds")
> >>> dat <- XML::readHTMLTable(M_sos)
> >>> idx <- 2:length(dat)
> >>> cn <- unique(unlist(lapply(dat[idx], colnames)))
> >> This is useful for this application.
> >>
> >>> dat <- do.call(rbind,
> >>> sapply(idx, function(i, dat, cn) {
> >>> x <- dat[[i]]
> >>> x[,cn[!(cn %in% colnames(x))]] <- NA
> >>> x <- x[,cn]
> >>> x$Party <- names(dat)[i]
> >>> return(list(x))
> >>> }, dat=dat, cn=cn))
> >>> dat[,"Date Filed"] <-
> >>> as.Date(x=dat[,"Date Filed"],
> >>> format="%m/%d/%Y")
> >> This misses something extremely
> >> important for this application: the
> >> political office. That's buried in
> >> the HTML or whatever it is. I'm using
> >> something like the following to find
> >> that:
> >>
> >> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
> > Dear Spencer,
> >
> > I came up with a solution, but it is not
> > very elegant. Instead of showing you
> > the solution, hoping you understand
> > everything in it, I instead want to give
> > you some emphatic hints to see if you
> > can come up with a solution on your own.
> >
> > - XML::htmlTreeParse(M_sos)
> > - *Gandalf voice*: climb the tree
> > until you find the content you are
> > looking for flat out at the level of
> > «The Children of the Div», *uuuUUU*
> > - you only want to keep the table and
> > header tags at this level
> > - Use XML::xmlValue to extract the
> > values of all the headers (the
> > political positions)
> > - Observe that all the tables on the
> > page you were able to extract
> > previously using XML::readHTMLTable,
> > are at this level, shuffled between
> > the political position header tags,
> > this means you extract the political
> > position and party affiliation by
> > using a for loop, if statements,
> > typeof, names, and [] and [[]] to grab
> > different things from the list
> > (content or the bag itself).
> > XML::readHTMLTable strips away the
> > line break tags from the Mailing
> > address, so if you find a better way
> > of extracting the tables, tell me,
> > e.g. you get
> >
> > 8805 HUNTER AVEKANSAS CITY MO 64138
> >
> > and not
> >
> > 8805 HUNTER AVE<br/>KANSAS CITY MO 64138
> >
> > When you've completed this «programming
> > quest», you're back at the level of the
> > previous email, i.e. you have the
> > same tables, but with political position
> > and party affiliation added to them.
>
>
> Please excuse: Before my last post, I had written code to do all
> that. In brief, the political offices are "h3" tags. I used "strsplit"
> to split the string at "<h3>". I then wrote a function to find "</h3>",
> extract the political office and pass the rest to "XML::readHTMLTable",
> adding columns for party and political office.
>
>
> However, this suppressed "<br/>" everywhere. I thought there
> should be an option with something like "XML::readHTMLTable" that would
> not delete "<br/>" everywhere, but I couldn't find it. If you aren't
> aware of one, I can gsub("<br/>", "\n", ...) on the string for each
> political office before passing it to "XML::readHTMLTable". I just
> tested this: It works.
>
>
> I have other web scraping problems in my work plan for the next few
> days. I will definitely try XML::htmlTreeParse, etc., as you suggest.
>
>
> Thanks again.
> Spencer Graves
> >
> > Best,
> > Rasmus
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
>