[R] [External] Re: help with web scraping
Rasmus Liland
jr@| @end|ng |rom po@teo@no
Sun Jul 26 17:43:59 CEST 2020
Dear William Michels,
On 2020-07-25 10:58 -0700, William Michels wrote:
>
> Dear Spencer Graves (and Rasmus Liland),
>
> I've had some luck just using gsub()
> to alter the offending "<br/>"
> tags, appending a "___" marker at
> each instance of "<br/>" (first I
> checked the text to make sure it
> didn't contain any pre-existing
> instances of "___"). See the output
> snippet below:
>
> > library(RCurl)
> > library(XML)
> > sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > sosChars <- getURL(sosURL)
> > sosChars2 <- gsub("<br/>", "<br/>___", sosChars)
> > MOcan <- readHTMLTable(sosChars2)
> > MOcan[[2]]
>                   Name
> 1       Raleigh Ritter
> 2          Mike Parson
> 3 James W. (Jim) Neely
> 4     Saundra McDowell
>                            Mailing Address
> 1      4476 FIVE MILE RD___SENECA MO 64865
> 2         1458 E 464 RD___BOLIVAR MO 65613
> 3            PO BOX 343___CAMERON MO 64429
> 4 3854 SOUTH AVENUE___SPRINGFIELD MO 65807
>   Random Number Date Filed
> 1           185  2/25/2020
> 2           348  2/25/2020
> 3           477  2/25/2020
> 4                3/31/2020
> >
>
> It's true, there's one 'section' of
> MOcan output that contains odd-looking
> characters (see the "Total" line of
> MOcan[[1]]). But my guess is you'll be
> deleting this 'line' anyway--and
> recalculating totals in R.
Perhaps it's this table you mean?
                Offices Republican
1              Governor          4
2   Lieutenant Governor          4
3    Secretary of State          1
4       State Treasurer          1
5      Attorney General          1
6   U.S. Representative         24
7         State Senator         28
8  State Representative        187
9         Circuit Judge         18
10                Total 268\r\n___
   Democratic Libertarian    Green
1           5           1        1
2           2           1        1
3           1           1        1
4           1           1        1
5           2           1        0
6          16           9        0
7          22           2        1
8         137           6        2
9           1           0        0
10 187\r\n___   22\r\n___ 7\r\n___
   Constitution      Total
1             0         11
2             0          8
3             1          5
4             0          4
5             0          4
6             0         49
7             0         53
8             1        333
9             0         19
10     2\r\n___ 486\r\n___
Yes, somehow the Windows[1] carriage
return character "0xD" shows up as
"\r\n" after your gsub; the "<br/>"
itself is still ignored. There is no
"0xD" inside the td.AddressCol cells in
the tables we are interested in.
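
If the stray characters only need to go away, a minimal sketch like
the one below might do. The values are typed in by hand from the
output above rather than scraped, and the names totals, totals_clean,
addr and addr_parts are made up for illustration.

```r
# Values copied by hand from the "Total" row above (not scraped here)
totals <- c("268\r\n___", "187\r\n___", "22\r\n___", "7\r\n___",
            "2\r\n___", "486\r\n___")

# Drop everything that is not a digit, then convert to integer
totals_clean <- as.integer(gsub("[^0-9]", "", totals))
print(totals_clean)

# In the address cells the "___" marker can be used to split the
# two address lines into separate columns
addr <- c("4476 FIVE MILE RD___SENECA MO 64865",
          "PO BOX 343___CAMERON MO 64429")
addr_parts <- do.call(rbind, strsplit(addr, "___", fixed = TRUE))
colnames(addr_parts) <- c("street", "city_state_zip")
print(addr_parts)
```

This assumes every address cell contains exactly one "___"; cells
with more or fewer line breaks would need extra handling before the
rbind.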
> Now that you have a comprehensive list
> object, you should be able to pull out
> districts/races of interest. You might
> want to take a look at the "rlist"
> package, to see if it can make your
> work a little easier:
>
> https://CRAN.R-project.org/package=rlist
> https://renkun-ken.github.io/rlist-tutorial/index.html
Thank you, this package seems useful.
Could you give a hint as to which of its
many functions you were thinking of?
E.g. something to replace a for loop
over the index of the list of headers
and tables, which checks whether the
typeof of each element is list or
character, and updates variables so the
political position is written into each
table.
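
For reference, the loop I describe could perhaps be written without
for in plain base R, roughly like the sketch below. This is not
rlist, and the list x is a made-up stand-in for the scraped result;
it assumes the list starts with a header and that every table
belongs to the nearest header before it.

```r
# Hypothetical stand-in for the scraped list: character headers,
# each followed by the tables that belong to it
x <- list(
  "Governor",
  data.frame(Name = c("A", "B")),
  data.frame(Name = "C"),
  "Lieutenant Governor",
  data.frame(Name = c("D", "E"))
)

# typeof() is "character" for headers and "list" for data frames
is_header <- vapply(x, function(el) typeof(el) == "character",
                    logical(1))

# cumsum() gives each element the index of the latest header so far
office <- unlist(x[is_header])[cumsum(is_header)]

# Write the office into each table, then stack the tables
tables <- Map(function(tab, off) cbind(Office = off, tab),
              x[!is_header], office[!is_header])
result <- do.call(rbind, tables)
print(result)
```

The cumsum() over the header flags takes the place of the variable
that gets updated inside the loop.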
V
r
[1] https://stackoverflow.com/questions/5843495/what-does-m-character-mean-in-vim