[R] web scraping tables generated in multiple server pages

Marco Silva marco.prado.bs at gmail.com
Tue May 10 22:08:56 CEST 2016


Excerpts from Michael Friendly's message of 2016-05-10 14:45:28 -0400:
> This is my first attempt to try R web scraping tools, for a project my 
> daughter is working on.  It concerns a data base of projects in Sao 
> Paulo, Brazil, listed at 
> http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx, 
> but spread out over 69 pages accessed through a javascript menu at the 
> bottom of the page.
> 
> Each web page contains 3 HTML tables, of which only the last contains 
> the relevant data.  In this, only a subset of columns are of interest.  
> I tried using the XML package as illustrated on several tutorial pages, 
> as shown below.  I have no idea how to automate this to extract these 
> tables from multiple web pages.  Is there some other package better 
> suited to this task?  Can someone help me solve this and other issues?
> 
> # Goal: read the data tables contained on 69 pages generated by the link 
> below, where
> # each page is generated by a javascript link in the menu of the bottom 
> of the page.
> #
> # Each "page" contains 3 html tables, with names "Table 1", "Table 2", 
> and the only one
> # of interest with the data, "grdRelSitGeralProcessos"
> #
> # From each such table, extract the following columns:
> #- Processo
> #- Endereço
> #- Distrito
> #- Area terreno (m2)
> #- Valor contrapartida ($)
> #- Area excedente (m2)
> 
> # NB: All of the numeric fields use "." as comma-separator and "," as 
> the decimal separator,
> #   but because of this are read in as character
> 
> 
> library(XML)
> link <- 
> "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"
> 
> saopaulo <- htmlParse(link)
> saopaulo.tables <- readHTMLTable(saopaulo, stringsAsFactors = FALSE)
> length(saopaulo.tables)
> 
> # its the third table on this page we want
> sp.tab <- saopaulo.tables[[3]]
> 
> # columns wanted
> wanted <- c(1, 2, 5, 7, 8, 13, 14)
> head(sp.tab[, wanted])
> 
>  > head(sp.tab[, wanted])
>    Proposta Processo Endereço        Distrito
> 1        1 2002-0.148.242-4 R. DOMINGOS LOPES DA SILVA X R. CORNÉLIO 
> VAN CLEVE    VILA ANDRADE
> 2        2 2003-0.129.667-3                      AV. DR. JOSÉ HIGINO, 
> 200 E 216       AGUA RASA
> 3        3 2003-0.065.011-2                       R. ALIANÇA LIBERAL, 
> 980 E 990 VILA LEOPOLDINA
> 4        4 2003-0.165.806-0                       R. ALIANÇA LIBERAL, 
> 880 E 886 VILA LEOPOLDINA
> 5        5 2003-0.139.053-0                R. DR. JOSÉ DE ANDRADE 
> FIGUEIRA, 111    VILA ANDRADE
> 6        6 2003-0.200.692-0                                R. JOSÉ DE 
> JESUS, 66      VILA SONIA
>    Área Terreno (m2) Área Excedente (m2) Valor Contrapartida (R$)
> 1               0,00             1.551,14 127.875,98
> 2               0,00             3.552,13 267.075,77
> 3               0,00               624,99 70.212,93
> 4               0,00               395,64 44.447,18
> 5               0,00               719,68 41.764,46
> 6               0,00               446,52 85.152,92
> 
> thanks,
> 
> 
> -- 
> Michael Friendly     Email: friendly AT yorku DOT ca
> Professor, Psychology Dept. & Chair, Quantitative Methods
> York University      Voice: 416 736-2100 x66249 Fax: 416 736-5814
> 4700 Keele Street    Web:http://www.datavis.ca
> Toronto, ONT  M3J 1P3 CANADA
> 
> 
# what is missing to you
?gsub
# aliasing
df <- sp.tab[, wanted]

# convert to double
as.double(                                      # convert to double
    gsub(',', '.',                              # makes the ',' to become '.'
    gsub('\\.', '', df$"Área Excedente (m2)"))  # get rid of the dot

You can easily put the names of the columns and use lapply on them to
convert all of them in same manner, that is left as an exercise.


-- 
Marco Arthur @ (M)arco Creatives



More information about the R-help mailing list