[R] web scraping tables generated in multiple server pages
Michael Friendly
friendly at yorku.ca
Tue May 10 20:45:28 CEST 2016
This is my first attempt to try R web scraping tools, for a project my
daughter is working on. It concerns a data base of projects in Sao
Paulo, Brazil, listed at
http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx,
but spread out over 69 pages accessed through a javascript menu at the
bottom of the page.
Each web page contains 3 HTML tables, of which only the last contains
the relevant data. In this, only a subset of columns are of interest.
I tried using the XML package as illustrated on several tutorial pages,
as shown below. I have no idea how to automate this to extract these
tables from multiple web pages. Is there some other package better
suited to this task? Can someone help me solve this and other issues?
# Goal: read the data tables contained on 69 pages generated by the link
below, where
# each page is generated by a javascript link in the menu of the bottom
of the page.
#
# Each "page" contains 3 html tables, with names "Table 1", "Table 2",
and the only one
# of interest with the data, "grdRelSitGeralProcessos"
#
# From each such table, extract the following columns:
#- Processo
#- Endereço
#- Distrito
#- Area terreno (m2)
#- Valor contrapartida ($)
#- Area excedente (m2)
# NB: All of the numeric fields use "." as comma-separator and "," as
the decimal separator,
# but because of this are read in as character
library(XML)
link <-
"http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"
saopaulo <- htmlParse(link)
saopaulo.tables <- readHTMLTable(saopaulo, stringsAsFactors = FALSE)
length(saopaulo.tables)
# its the third table on this page we want
sp.tab <- saopaulo.tables[[3]]
# columns wanted
wanted <- c(1, 2, 5, 7, 8, 13, 14)
head(sp.tab[, wanted])
> head(sp.tab[, wanted])
Proposta Processo Endereço Distrito
1 1 2002-0.148.242-4 R. DOMINGOS LOPES DA SILVA X R. CORNÉLIO
VAN CLEVE VILA ANDRADE
2 2 2003-0.129.667-3 AV. DR. JOSÉ HIGINO,
200 E 216 AGUA RASA
3 3 2003-0.065.011-2 R. ALIANÇA LIBERAL,
980 E 990 VILA LEOPOLDINA
4 4 2003-0.165.806-0 R. ALIANÇA LIBERAL,
880 E 886 VILA LEOPOLDINA
5 5 2003-0.139.053-0 R. DR. JOSÉ DE ANDRADE
FIGUEIRA, 111 VILA ANDRADE
6 6 2003-0.200.692-0 R. JOSÉ DE
JESUS, 66 VILA SONIA
Ãrea Terreno (m2) Ãrea Excedente (m2) Valor Contrapartida (R$)
1 0,00 1.551,14 127.875,98
2 0,00 3.552,13 267.075,77
3 0,00 624,99 70.212,93
4 0,00 395,64 44.447,18
5 0,00 719,68 41.764,46
6 0,00 446,52 85.152,92
thanks,
--
Michael Friendly Email: friendly AT yorku DOT ca
Professor, Psychology Dept. & Chair, Quantitative Methods
York University Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele Street Web:http://www.datavis.ca
Toronto, ONT M3J 1P3 CANADA
[[alternative HTML version deleted]]
More information about the R-help
mailing list