[R] web scraping tables generated in multiple server pages
David Winsemius
dwinsemius at comcast.net
Wed May 11 19:48:23 CEST 2016
> On May 10, 2016, at 1:11 PM, boB Rudis <bob at rudis.net> wrote:
>
> Unfortunately, it's a wretched, vile, SharePoint-based site. That
> means it doesn't use traditional encoding methods to do the pagination
> and one of the only ways to do this effectively is going to be to use
> RSelenium:
>
> library(RSelenium)
> library(rvest)
> library(dplyr)
> library(pbapply)
>
> URL <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"
>
> checkForServer()
> startServer()
> remDr <- remoteDriver$new()
> remDr$open()
Thanks Bob/hrbrmstr;
At this point I got an error:
> startServer()
> remDr <- remoteDriver$new()
> remDr$open()
[1] "Connecting to remote server"
Undefined error in RCurl call.
Error in queryRD(paste0(serverURL, "/session"), "POST", qdata = toJSON(serverOpts)) :
Running R 3.0.0 on a Mac (El Cap) in the R.app GUI.
$ java -version
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
I asked myself: what additional information is needed to debug this? But then I thought I had a responsibility to search for earlier reports of this error on a Mac, and there were many. After reading this thread: https://github.com/ropensci/RSelenium/issues/54 I decided to try creating an "alias" (Mac-speak for a symlink) and put that symlink in my working directory (with no further chmod security efforts). I restarted R and re-ran the code, which opened a Firefox browser window and then proceeded to page through many pages. Eventually, however, it errored out with this message:
> pblapply(1:69, function(i) {
+
+ if (i %in% seq(1, 69, 10)) {
+ pg <- read_html(remDr$getPageSource()[[1]])
+ ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
+
+ } else {
+ ref <- remDr$findElements("xpath",
+ sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']",
+ i))
+ ref[[1]]$clickElement()
+ pg <- read_html(remDr$getPageSource()[[1]])
+ ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
+
+ }
+ if ((i %% 10) == 0) {
+ ref <- remDr$findElements("xpath", ".//a[.='...']")
+ ref[[length(ref)]]$clickElement()
+ }
+
+ ret
+
+ }) -> tabs
|+++++++++++ | 22% ~54s Error in html_nodes(pg, "table")[[3]] : subscript out of bounds
>
> final_dat <- bind_rows(tabs)
Error in bind_rows(tabs) : object 'tabs' not found
There doesn't seem to be any trace of the objects from all the downloading efforts that I could find. When I changed both instances of '69' to '30', it no longer errored out. Is there supposed to be an initial step of finding out how many pages are actually there before setting the two iteration limits? I'm also wondering if that code could be modified to return some intermediate values that would be amenable to further assembly efforts in the event of errors?
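Both questions can probably be handled with small changes to the loop. A rough sketch follows, untested against the live site; `scrape_page` and `n` are placeholder names rather than objects from Bob's code, and the XPath is a guess at this pager's markup:

```r
library(rvest)

# 1. Read the highest page number visible in the pager instead of
#    hard-coding 69. Caveat: a pager that renders links in blocks of ten
#    only exposes the current block, so this is the highest *visible*
#    page number, not necessarily the grand total.
count_visible_pages <- function(pg) {
  labels <- html_text(html_nodes(pg, xpath = ".//a[contains(@href, '__doPostBack')]"))
  nums <- suppressWarnings(as.integer(labels))  # "..." and friends become NA
  max(nums, na.rm = TRUE)
}

# 2. Wrap each page scrape in tryCatch() so one bad page yields NULL
#    instead of killing the whole pblapply() run; partial results survive.
safe_page <- function(i, scrape_page) {
  tryCatch(scrape_page(i), error = function(e) {
    message("page ", i, " failed: ", conditionMessage(e))
    NULL
  })
}

# Usage sketch (scrape_page stands in for the body of Bob's function):
#   tabs <- pblapply(1:n, safe_page, scrape_page = ...)
#   final_dat <- bind_rows(Filter(Negate(is.null), tabs))

# Self-contained demo with fake pager markup:
demo <- read_html('<a href="javascript:__doPostBack(1)">1</a>
                   <a href="javascript:__doPostBack(2)">2</a>
                   <a href="javascript:__doPostBack(n)">...</a>')
count_visible_pages(demo)               # 2
safe_page(1, function(i) stop("boom"))  # NULL, with a message
```

The demo lines at the bottom exercise both helpers on fake input, so the sketch can be checked without a running Selenium session.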
Sincerely;
David.
> remDr$navigate(URL)
>
> pblapply(1:69, function(i) {
>
> if (i %in% seq(1, 69, 10)) {
>
> # the first item on the page is not a link but we can just grab the page
>
> pg <- read_html(remDr$getPageSource()[[1]])
> ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
>
> } else {
>
> # we can get the rest of them by the link text directly
>
> ref <- remDr$findElements("xpath",
> sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']",
> i))
> ref[[1]]$clickElement()
> pg <- read_html(remDr$getPageSource()[[1]])
> ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
>
> }
>
> # we have to move to the next actual page of data after every 10 links
>
> if ((i %% 10) == 0) {
> ref <- remDr$findElements("xpath", ".//a[.='...']")
> ref[[length(ref)]]$clickElement()
> }
>
> ret
>
> }) -> tabs
>
> final_dat <- bind_rows(tabs)
> final_dat <- final_dat[, c(1, 2, 5, 7, 8, 13, 14)] # the cols you want
> final_dat <- final_dat[complete.cases(final_dat),] # take care of NAs
>
> remDr$quit()
>
>
> Prbly good ref code to have around, but you can grab the data & code
> here: https://gist.github.com/hrbrmstr/ec35ebb32c3cf0aba95f7bad28df1e98
>
> (anything to help a fellow parent out :-)
>
> -Bob
>
> On Tue, May 10, 2016 at 2:45 PM, Michael Friendly <friendly at yorku.ca> wrote:
>> This is my first attempt to try R web scraping tools, for a project my
>> daughter is working on. It concerns a data base of projects in Sao
>> Paulo, Brazil, listed at
>> http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx,
>> but spread out over 69 pages accessed through a javascript menu at the
>> bottom of the page.
>>
>> Each web page contains 3 HTML tables, of which only the last contains
>> the relevant data. In this, only a subset of columns are of interest.
>> I tried using the XML package as illustrated on several tutorial pages,
>> as shown below. I have no idea how to automate this to extract these
>> tables from multiple web pages. Is there some other package better
>> suited to this task? Can someone help me solve this and other issues?
>>
>> # Goal: read the data tables contained on 69 pages generated by the link
>> below, where
>> # each page is generated by a javascript link in the menu of the bottom
>> of the page.
>> #
>> # Each "page" contains 3 html tables, with names "Table 1", "Table 2",
>> and the only one
>> # of interest with the data, "grdRelSitGeralProcessos"
>> #
>> # From each such table, extract the following columns:
>> #- Processo
>> #- Endereço
>> #- Distrito
>> #- Area terreno (m2)
>> #- Valor contrapartida ($)
>> #- Area excedente (m2)
>>
>> # NB: All of the numeric fields use "." as comma-separator and "," as
>> the decimal separator,
>> # but because of this are read in as character
>>
>>
>> library(XML)
>> link <-
>> "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"
>>
>> saopaulo <- htmlParse(link)
>> saopaulo.tables <- readHTMLTable(saopaulo, stringsAsFactors = FALSE)
>> length(saopaulo.tables)
>>
>> # its the third table on this page we want
>> sp.tab <- saopaulo.tables[[3]]
>>
>> # columns wanted
>> wanted <- c(1, 2, 5, 7, 8, 13, 14)
>> head(sp.tab[, wanted])
>>
>>> head(sp.tab[, wanted])
>>   Proposta         Processo                                            Endereço        Distrito
>> 1        1 2002-0.148.242-4 R. DOMINGOS LOPES DA SILVA X R. CORNÉLIO VAN CLEVE    VILA ANDRADE
>> 2        2 2003-0.129.667-3                     AV. DR. JOSÉ HIGINO, 200 E 216       AGUA RASA
>> 3        3 2003-0.065.011-2                      R. ALIANÇA LIBERAL, 980 E 990 VILA LEOPOLDINA
>> 4        4 2003-0.165.806-0                      R. ALIANÇA LIBERAL, 880 E 886 VILA LEOPOLDINA
>> 5        5 2003-0.139.053-0               R. DR. JOSÉ DE ANDRADE FIGUEIRA, 111    VILA ANDRADE
>> 6        6 2003-0.200.692-0                               R. JOSÉ DE JESUS, 66      VILA SONIA
>>   Área Terreno (m2) Área Excedente (m2) Valor Contrapartida (R$)
>> 1              0,00            1.551,14               127.875,98
>> 2              0,00            3.552,13               267.075,77
>> 3              0,00              624,99                70.212,93
>> 4              0,00              395,64                44.447,18
>> 5              0,00              719,68                41.764,46
>> 6              0,00              446,52                85.152,92
>>
>> thanks,
>>
>>
>> --
>> Michael Friendly Email: friendly AT yorku DOT ca
>> Professor, Psychology Dept. & Chair, Quantitative Methods
>> York University Voice: 416 736-2100 x66249 Fax: 416 736-5814
>> 4700 Keele Street Web:http://www.datavis.ca
>> Toronto, ONT M3J 1P3 CANADA
>>
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
David Winsemius
Alameda, CA, USA
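One loose end from Michael's original note: the numeric columns arrive as character because of the Brazilian separators ("." for thousands, "," for decimals). A small conversion helper, as a sketch (the sample values are taken from the head() output above):

```r
# Convert Brazilian-formatted number strings to numeric:
# strip the "." thousands separators, then turn the "," decimal into ".".
br_num <- function(x) {
  as.numeric(gsub(",", ".", gsub(".", "", x, fixed = TRUE)))
}

br_num(c("1.551,14", "0,00", "127.875,98"))  # 1551.14 0.00 127875.98
```

Applied column-wise (e.g. `final_dat[num_cols] <- lapply(final_dat[num_cols], br_num)`, where `num_cols` names the character columns) this would make the area and value columns usable as numbers.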