[R] [External] Re: help with web scraping
luke-tierney using uiowa.edu
Fri Jul 24 15:20:09 CEST 2020
Maybe try something like this:
url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
h <- xml2::read_html(url)
tbl <- rvest::html_table(h)
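
As a rough offline sketch of what those two calls return, assuming the xml2 and rvest packages are installed: the HTML string below is made up for illustration and is not the structure of the Missouri page. It only shows that read_html() also accepts literal HTML text (such as the string returned by RCurl::getURL) and that html_table() on a whole document gives a list with one data frame per table element.

```r
# Minimal sketch; the HTML is a hypothetical stand-in, not the SOS page.
library(xml2)
library(rvest)

html <- "<html><body>
<table><tr><th>Name</th><th>Party</th></tr>
<tr><td>A. Smith</td><td>X</td></tr></table>
<table><tr><th>Name</th><th>Party</th></tr>
<tr><td>B. Jones</td><td>Y</td></tr>
<tr><td>C. Lee</td><td>Y</td></tr></table>
</body></html>"

h <- xml2::read_html(html)     # literal HTML text is parsed directly
tbls <- rvest::html_table(h)   # list with one data frame per <table>

length(tbls)                   # number of tables found
sapply(tbls, nrow)             # rows in each table
```

The list elements are unnamed, so any label that sits outside a table (such as a heading before it) has to be extracted separately and matched up by position.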
Best,
luke
On Fri, 24 Jul 2020, Spencer Graves wrote:
> Hi Bill et al.:
>
>
> That broke the dam: It gave me a character vector of length 1
> consisting of 218 KB. I fed that to XML::readHTMLTable and purrr::map_chr,
> both of which returned lists of 337 data.frames. The former retained names
> for all the tables, absent from the latter. The columns of the former are
> all character; that's not true for the latter.
>
>
> Sadly, it's not quite what I want: It's one table for each
> office-party combination, but it's lost the office designation. However, I'm
> confident I can figure out how to hack that.
>
>
> Thanks,
> Spencer Graves
>
>
> On 2020-07-23 17:46, William Michels wrote:
>> Hi Spencer,
>>
>> I tried the code below on an older R-installation, and it works fine.
>> Not a full solution, but it's a start:
>>
>>> library(RCurl)
>> Loading required package: bitops
>>> url <-
>>> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>> M_sos <- getURL(url)
>>> print(M_sos)
>> [1] "\r\n<!DOCTYPE html>\r\n\r\n<html
>> lang=\"en-us\">\r\n<head><title>\r\n\tSOS, Missouri - Elections:
>> Offices Filed in Candidate Filing\r\n</title><meta name=\"viewport\"
>> content=\"width=device-width, initial-scale=1.0\" [...remainder
>> truncated].
>>
>> HTH, Bill.
>>
>> W. Michels, Ph.D.
>>
>>
>>
>> On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
>> <spencer.graves using effectivedefense.org> wrote:
>>> Hello, All:
>>>
>>>
>>> I've failed with multiple attempts to scrape the table of
>>> candidates from the website of the Missouri Secretary of State:
>>>
>>>
>>> https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
>>>
>>>
>>> I've tried base::url, base::readLines, xml2::read_html, and
>>> XML::readHTMLTable; see summary below.
>>>
>>>
>>> Suggestions?
>>> Thanks,
>>> Spencer Graves
>>>
>>>
>>> sosURL <-
>>> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>>
>>> str(baseURL <- base::url(sosURL))
>>> # this might give me something, but I don't know what
>>>
>>> sosRead <- base::readLines(sosURL) # 404 Not Found
>>> sosRb <- base::readLines(baseURL) # 404 Not Found
>>>
>>> sosXml2 <- xml2::read_html(sosURL) # HTTP error 404.
>>>
>>> sosXML <- XML::readHTMLTable(sosURL)
>>> # List of 0; does not seem to be XML
>>>
>>> sessionInfo()
>>>
>>> R version 4.0.2 (2020-06-22)
>>> Platform: x86_64-apple-darwin17.0 (64-bit)
>>> Running under: macOS Catalina 10.15.5
>>>
>>> Matrix products: default
>>> BLAS:
>>> /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
>>> LAPACK:
>>> /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets
>>> [6] methods base
>>>
>>> loaded via a namespace (and not attached):
>>> [1] compiler_4.0.2 tools_4.0.2 curl_4.3
>>> [4] xml2_1.3.2 XML_3.99-0.3
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>
--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa Phone: 319-335-3386
Department of Statistics and Fax: 319-335-3017
Actuarial Science
241 Schaeffer Hall email: luke-tierney using uiowa.edu
Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu