[R] [External] Re: help with web scraping
Spencer Graves
spencer.graves using effectivedefense.org
Fri Jul 24 17:28:12 CEST 2020
Dear Rasmus:
On 2020-07-24 09:16, Rasmus Liland wrote:
> On 2020-07-24 08:20 -0500, luke-tierney using uiowa.edu wrote:
>> On Fri, 24 Jul 2020, Spencer Graves wrote:
>>> On 2020-07-23 17:46, William Michels wrote:
>>>> On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
>>>> <spencer.graves using effectivedefense.org> wrote:
>>>>> Hello, All:
>>>>>
>>>>> I've failed with multiple
>>>>> attempts to scrape the table of
>>>>> candidates from the website of
>>>>> the Missouri Secretary of
>>>>> State:
>>>>>
>>>>> https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
>>>> Hi Spencer,
>>>>
>>>> I tried the code below on an older
>>>> R-installation, and it works fine.
>>>> Not a full solution, but it's a
>>>> start:
>>>>
>>>>> library(RCurl)
>>>> Loading required package: bitops
>>>>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>>>> M_sos <- getURL(url)
>>> Hi Bill et al.:
>>>
>>> That broke the dam: It gave me a
>>> character vector of length 1
>>> consisting of 218 KB. I fed that to
>>> XML::readHTMLTable and
>>> purrr::map_chr, both of which
>>> returned lists of 337 data.frames.
>>> The former retained names for all
>>> the tables, absent from the latter.
>>> The columns of the former are all
>>> character; that's not true for the
>>> latter.
>>>
>>> Sadly, it's not quite what I want:
>>> It's one table for each office-party
>>> combination, but it's lost the
>>> office designation. However, I'm
>>> confident I can figure out how to
>>> hack that.
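      A minimal sketch of that parsing step, assuming M_sos holds the
page source returned by RCurl::getURL(url) as above (stringsAsFactors
is passed only to force character columns):

tbls <- XML::readHTMLTable(M_sos, stringsAsFactors = FALSE)
length(tbls)  # one data.frame per office/party table on this page
names(tbls)   # readHTMLTable retains a name for each table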
>> Maybe try something like this:
>>
>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>> h <- xml2::read_html(url)
>> tbl <- rvest::html_table(h)
> Dear Spencer,
>
> I unified the party tables after the
> first summary table like this:
>
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> M_sos <- RCurl::getURL(url)
> saveRDS(object=M_sos, file="dcp.rds")
> dat <- XML::readHTMLTable(M_sos)
> idx <- 2:length(dat)
> cn <- unique(unlist(lapply(dat[idx], colnames)))
      This is useful for this application.
> dat <- do.call(rbind,
>   sapply(idx, function(i, dat, cn) {
>     x <- dat[[i]]
>     x[, cn[!(cn %in% colnames(x))]] <- NA  # add the columns this table lacks
>     x <- x[, cn]                           # put columns in a common order
>     x$Party <- names(dat)[i]               # the table's name encodes the party
>     return(list(x))
>   }, dat=dat, cn=cn))
> dat[,"Date Filed"] <- as.Date(x=dat[,"Date Filed"],
>                               format="%m/%d/%Y")
      This misses something extremely important for this application:
the political office. That's buried in the HTML. I'm using something
like the following to find it:
str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
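      The same idea extends to a vector of offices; a sketch, where the
office names below are examples only, not the full list (note that
"Governor" also matches inside "Lieutenant Governor", so the search
strings may need care):

offices <- c("Lieutenant Governor", "Attorney General",
             "Secretary of State")
office_pos <- sapply(offices, function(o)
  gregexpr(o, M_sos, fixed=TRUE)[[1]][1])
office_pos  # character offset of each office heading in the raw HTML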
      After I figure this out, I will use something like your code to
combine it all into separate tables for each office, and then probably
combine those into one table for the offices I'm interested in. For my
present purposes, I don't want all the offices in Missouri, only the
executive positions and those representing parts of the Kansas City
metro area in the Missouri legislature.
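      One way that combination might go, assuming a hypothetical vector
table_pos holding the character offset of each parsed table in M_sos
(found, e.g., by searching for each table's first candidate name):
assign each table to the nearest office heading preceding it.

office_for_table <- sapply(table_pos, function(p)
  names(which.max(office_pos[office_pos < p])))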
      Thanks again,
      Spencer Graves
> write.table(dat, file="dcp.tsv", sep="\t",
>             row.names=FALSE, quote=TRUE, na="N/A")
>
> Best,
> Rasmus
>