[R-es] Function to automatically download information from a website
javier.ruben.marcuzzi at gmail.com
Sat Jun 24 00:45:02 CEST 2017
Dear Wilmer,
I am cleaning up old files and found one whose contents I am copying and pasting here, because it may be useful for your case.
library(rvest)
# read_html() replaces the deprecated html() function
htmlpage <- read_html("http://forecast.weather.gov/MapClick.php?lat=42.31674913306716&lon=-71.42487878862437&site=all&smap=1#.VRsEpZPF84I")
forecasthtml <- html_nodes(htmlpage, "#detailed-forecast-body b , .forecast-text")
forecast <- html_text(forecasthtml)
paste(forecast, collapse = " ")
#http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
# lego_movie now holds the whole page
lego_movie %>%                 # from this page,
  html_node("strong span") %>% # find the node in the DOM
  html_text() %>%              # extract the text it contains
  as.numeric()                 # convert to numeric; otherwise it stays as text
lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
lego_movie %>%
  html_nodes("table") %>%
  .[[3]] %>%
  html_table()
##############################################################
library(rvest)
url <- "http://www.perfectgame.org/"  ## page to spider
pgsession <- html_session(url)        ## create session
pgform <- html_form(pgsession)[[1]]   ## pull form from session
# Note the new variable assignment
filled_form <- set_values(pgform,
  `ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com",
  `ctl00$Header2$HeaderTop1$tbPassword` = "mypassword")
submit_form(pgsession, filled_form)
#################################################
url <- "http://mobile.bahn.de/bin/mobil/query.exe/dox?country=DEU&rt=1&use_realtime_filter=1&webview=&searchMode=NORMAL"
sitzung <- html_session(url)
p1.form <- html_form(sitzung)[[1]]
p2 <- submit_form(sitzung, p1.form, submit = 'advancedProductMode')
p2.form <- html_form(p2)[[1]]
form.mod <- set_values(p2.form,
  REQ0JourneyStopsS0G = "HH",
  REQ0JourneyStopsZ0G = "F")
final <- submit_form(sitzung, form.mod, submit = 'start')
###############
# http://stat4701.github.io/edav/2015/04/02/rvest_tutorial/
# Scraping indeed.com for jobs
# Submit the form on indeed.com for a job description and location using html_form() and set_values()
query <- "data science"
loc <- "New York"
session <- html_session("http://www.indeed.com")
form <- html_form(session)[[1]]  # printing the form shows:
# <form> 'jobsearch' (GET /jobs)
#   <input text> 'q':
#   <input text> 'l':
#   <input submit> '': Find Jobs
form <- set_values(form, q = query, l = loc)
submit_form(session, form)
library(httr)
url <- "http://www.indeed.com"
fd <- list(
  submit = "ctl00$Header2$HeaderTop1$Button1",
  `ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com",
  `ctl00$Header2$HeaderTop1$tbPassword` = "mypassword")
resp <- POST(url, body = fd, encode = "form")
content(resp)
# The rvest submit_form function is still under construction: it does not work for
# sites that build query-string URLs (i.e. GET requests), though it does work for
# POST requests.
#url <- submit_form(session, indeed)
# Version 1 of our submit_form function
library(XML)  # for getRelativeURL()
submit_form2 <- function(session, form) {
  url <- XML::getRelativeURL(form$url, session$url)
  url <- paste(url, '?', sep = '')
  values <- as.vector(rvest:::submit_request(form)$values)
  att <- names(values)
  # drop a trailing unnamed submit entry, if present
  if (tail(att, n = 1) == "NULL") {
    values <- values[-length(values)]
    att <- att[-length(att)]
  }
  q <- paste(att, values, sep = '=')
  q <- paste(q, collapse = '&')
  q <- gsub(" ", "+", q)
  url <- paste(url, q, sep = '')
  html_session(url)
}
# Version 2 of our submit_form function
library(httr)
# Appends the elements of one list to another without changing the type of x
# The build_url function from the httr package requires a variable of class "url"
appendList <- function(x, val) {
  stopifnot(is.list(x), is.list(val))
  xnames <- names(x)
  for (v in names(val)) {
    x[[v]] <- if (v %in% xnames && is.list(x[[v]]) && is.list(val[[v]]))
      appendList(x[[v]], val[[v]])
    else
      c(x[[v]], val[[v]])
  }
  x
}
# Simulating submit_form for GET requests
submit_geturl <- function(session, form) {
  query <- rvest:::submit_request(form)
  query$method <- NULL
  query$encode <- NULL
  query$url <- NULL
  names(query) <- "query"
  relativeurl <- XML::getRelativeURL(form$url, session$url)
  basepath <- parse_url(relativeurl)
  fullpath <- appendList(basepath, query)
  build_url(fullpath)
}
# Submit form and get new url
session1 <- submit_form2(session, form)
# Get reviews of last company using follow_link()
session2 <- follow_link(session1, css = "#more_9 li:nth-child(3) a")
reviews <- session2 %>% html_nodes(".description") %>% html_text()
reviews
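For the page in your question (an ASP.NET form), a minimal sketch, assuming hypothetical field names, would loop over the values of interest and resubmit the form each time. The control name `ddlMunicipio` and the municipality values below are placeholders; print the parsed form first to see the real names.

```r
library(rvest)

url <- "http://www.upme.gov.co/GeneradorConsultas/Consulta_SuiConsumo.aspx?IdModulo=2&Servicio=4"
upme_session <- html_session(url)
upme_form <- html_form(upme_session)[[1]]  # print(upme_form) shows the real control names

# Hypothetical municipality values; replace them with the options listed in the form
municipios <- c("Bogota", "Medellin", "Cali")

resultados <- lapply(municipios, function(m) {
  # `ddlMunicipio` is a placeholder for the actual <select> name on the page
  filled <- set_values(upme_form, ddlMunicipio = m)
  respuesta <- submit_form(upme_session, filled)
  html_table(html_nodes(respuesta, "table"), fill = TRUE)
})
names(resultados) <- municipios
```

ASP.NET pages also post hidden __VIEWSTATE fields; submit_form() carries those along because they are part of the parsed form.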
Javier Rubén Marcuzzi
From: WILMER CONTRERAS SEPULVEDA
Sent: Wednesday, June 21, 2017 19:55
To: r-help-es at r-project.org
Subject: [R-es] Function to automatically download information from a website
Good afternoon.
I would like to know whether there is a function, script, or something similar in R for downloading data automatically from a web page.
What I want to do is download the information presented on this site:
http://www.upme.gov.co/GeneradorConsultas/Consulta_SuiConsumo.aspx?IdModulo=2&Servicio=4
for different kinds of input, for example varying the "municipio" (municipality) field and also varying the field that selects the query.
Thank you very much.
Wilmer Contreras S.
_______________________________________________
R-help-es mailing list
R-help-es at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-help-es