[R] [External] Re: help with web scraping
Rasmus Liland
jr@| @end|ng |rom po@teo@no
Fri Jul 24 16:16:18 CEST 2020
On 2020-07-24 08:20 -0500, luke-tierney using uiowa.edu wrote:
> On Fri, 24 Jul 2020, Spencer Graves wrote:
> > On 2020-07-23 17:46, William Michels wrote:
> > > On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
> > > <spencer.graves using effectivedefense.org> wrote:
> > > > Hello, All:
> > > >
> > > > I've failed with multiple
> > > > attempts to scrape the table of
> > > > candidates from the website of
> > > > the Missouri Secretary of
> > > > State:
> > > >
> > > > https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
> > >
> > > Hi Spencer,
> > >
> > > I tried the code below on an older
> > > R-installation, and it works fine.
> > > Not a full solution, but it's a
> > > start:
> > >
> > > > library(RCurl)
> > > Loading required package: bitops
> > > > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > > > M_sos <- getURL(url)
> >
> > Hi Bill et al.:
> >
> > That broke the dam: It gave me a
> > character vector of length 1
> > consisting of 218 KB. I fed that to
> > XML::readHTMLTable and
> > purrr::map_chr, both of which
> > returned lists of 337 data.frames.
> > The former retained names for all
> > the tables, absent from the latter.
> > The columns of the former are all
> > character; that's not true for the
> > latter.
> >
> > Sadly, it's not quite what I want:
> > It's one table for each office-party
> > combination, but it's lost the
> > office designation. However, I'm
> > confident I can figure out how to
> > hack that.
>
> Maybe try something like this:
>
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> h <- xml2::read_html(url)
> tbl <- rvest::html_table(h)

Dear Spencer,

I unified the party tables that follow the
first summary table like this:

    url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
    M_sos <- RCurl::getURL(url)
    saveRDS(object=M_sos, file="dcp.rds")  # keep a copy of the raw HTML
    dat <- XML::readHTMLTable(M_sos)
    idx <- 2:length(dat)  # skip the first (summary) table
    # union of the column names across the remaining tables
    cn <- unique(unlist(lapply(dat[idx], colnames)))
    dat <- do.call(rbind,
      lapply(idx, function(i, dat, cn) {
        x <- dat[[i]]
        x[,cn[!(cn %in% colnames(x))]] <- NA  # pad missing columns
        x <- x[,cn]                           # same column order
        x$Party <- names(dat)[i]              # label rows by table name
        return(x)
      }, dat=dat, cn=cn))
    dat[,"Date Filed"] <-
      as.Date(x=dat[,"Date Filed"],
              format="%m/%d/%Y")
    write.table(dat, file="dcp.tsv", sep="\t",
                row.names=FALSE,
                quote=TRUE, na="N/A")

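For anyone who wants to try the pad-and-rbind step without hitting the
site, here is a minimal sketch on made-up data. The table names, columns
and values below are invented for illustration and are not taken from
the Missouri page; only the idea (take the union of the column names,
fill the missing ones with NA, then rbind) is the same as above.

```r
# Toy stand-ins for the per-party tables (hypothetical names and values).
tabs <- list(
  PartyA = data.frame(Name = c("A. One", "B. Two"),
                      `Date Filed` = c("2/25/2020", "3/3/2020"),
                      check.names = FALSE, stringsAsFactors = FALSE),
  PartyB = data.frame(Name = "C. Three",
                      `Mailing Address` = "1 Main St",
                      `Date Filed` = "3/31/2020",
                      check.names = FALSE, stringsAsFactors = FALSE)
)

# Union of all column names, in order of first appearance.
cn <- unique(unlist(lapply(tabs, colnames)))

padded <- lapply(names(tabs), function(nm) {
  x <- tabs[[nm]]
  # Add each column this table lacks, filled with NA.
  for (missing_col in setdiff(cn, colnames(x))) x[[missing_col]] <- NA
  x <- x[, cn, drop = FALSE]  # same column order in every table
  x$Party <- nm               # remember which table the rows came from
  x
})

combined <- do.call(rbind, padded)
combined[["Date Filed"]] <- as.Date(combined[["Date Filed"]],
                                    format = "%m/%d/%Y")
print(combined)
```

Padding first matters because rbind on data frames refuses inputs
whose column names differ.
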
Best,
Rasmus