[R-es] Function to automatically download information from a website

javier.ruben.marcuzzi at gmail.com
Sat Jun 24 00:45:02 CEST 2017


Dear Wilmer,

I am cleaning up old files, and I am pasting the contents of one of them below because it may be useful for your case.


library("rvest")
# read_html() replaces the deprecated html() function
htmlpage <- read_html("http://forecast.weather.gov/MapClick.php?lat=42.31674913306716&lon=-71.42487878862437&site=all&smap=1#.VRsEpZPF84I")
forecasthtml <- html_nodes(htmlpage, "#detailed-forecast-body b , .forecast-text")
forecast <- html_text(forecasthtml)
paste(forecast, collapse =" ")
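
The same selector logic can be tried offline before pointing it at a live site. The following is a minimal sketch, assuming a made-up HTML fragment in place of the real forecast page (the fragment and its text are hypothetical). It uses read_html(), which accepts a literal HTML string as well as a URL.

```r
library(rvest)

# Hypothetical fragment standing in for the downloaded forecast page
page <- read_html('<div id="detailed-forecast-body">
  <b>Tonight</b><div class="forecast-text">Clear, low around 55.</div>
  <b>Saturday</b><div class="forecast-text">Sunny, high near 80.</div>
</div>')

# Same CSS selector as above: period labels (<b>) plus forecast texts
nodes    <- html_nodes(page, "#detailed-forecast-body b , .forecast-text")
forecast <- html_text(nodes)

# Collapse the pieces into one line of text
forecast_line <- paste(forecast, collapse = " ")
forecast_line
```

Checking a selector against a small fragment like this makes it easier to tell whether a later failure comes from the selector or from the site having changed its markup.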


#http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
# lego_movie holds the whole parsed page
lego_movie %>%                            # starting from this page
  html_node("strong span") %>%          # find the node in the DOM
  html_text() %>%                       # extract the text stored at that node
  as.numeric()                          # convert it to numeric; otherwise it stays as text

lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()

lego_movie %>%
  html_nodes("table") %>%
  .[[3]] %>%
  html_table()
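
To see what the html_nodes("table") / html_table() combination returns without depending on the live IMDb page, here is a minimal sketch on a hypothetical one-table fragment (the table contents are invented for illustration). The code above picks the third table on the page with .[[3]]; this fragment has only one, so the sketch uses .[[1]].

```r
library(rvest)

# Hypothetical page containing a single small table
page <- read_html('<table>
  <tr><th>Actor</th><th>Role</th></tr>
  <tr><td>Person A</td><td>Hero</td></tr>
  <tr><td>Person B</td><td>Villain</td></tr>
</table>')

cast_table <- page %>%
  html_nodes("table") %>%   # all <table> nodes on the page
  .[[1]] %>%                # keep the first (and only) one
  html_table()              # convert it to a data frame

cast_table
```

The index passed to .[[ ]] depends entirely on the page layout, so it is worth inspecting length(html_nodes(page, "table")) first.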




##############################################################
library(rvest)

url       <-"http://www.perfectgame.org/"   ## page to spider
pgsession <-html_session(url)               ## create session
pgform    <-html_form(pgsession)[[1]]       ## pull form from session

# Note the new variable assignment 

filled_form <- set_values(pgform,
                          `ctl00$Header2$HeaderTop1$tbUsername` = "myemail@gmail.com", 
                          `ctl00$Header2$HeaderTop1$tbPassword` = "mypassword")

submit_form(pgsession,filled_form)
#################################################

url     <- "http://mobile.bahn.de/bin/mobil/query.exe/dox?country=DEU&rt=1&use_realtime_filter=1&webview=&searchMode=NORMAL"
sitzung <- html_session(url)
p1.form <- html_form(sitzung)[[1]]
p2      <- submit_form(sitzung, p1.form, submit='advancedProductMode')
p2.form <- html_form(p2)[[1]]
form.mod<- set_values( p2.form
                       ,REQ0JourneyStopsS0G     = "HH"
                       ,REQ0JourneyStopsZ0G     = "F"
)
final   <- submit_form(sitzung, form.mod, submit='start')


###############
#   http://stat4701.github.io/edav/2015/04/02/rvest_tutorial/ 

# Scraping indeed.com for jobs

# Submit the form on indeed.com for a job description and location using html_form() and set_values()
query = "data science"
loc = "New York"
session <- html_session("http://www.indeed.com")
form <- html_form(session)[[1]]     # printing the form gives the following
#<form> 'jobsearch' (GET /jobs)
#  <input text> 'q': 
#  <input text> 'l': 
#  <input submit> '': Find Jobs
form <- set_values(form, q = query, l = loc)

# submit_form() takes the session first, then the filled form
submit_form(session, form)


library(httr)

url <- "http://www.indeed.com"

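# Form fields to send. Note: these field names are the ones from the perfectgame.org login form above.
fd <- list(
  submit = "ctl00$Header2$HeaderTop1$Button1",   # quoted: backticks here would make R look up a variable
  `ctl00$Header2$HeaderTop1$tbUsername`  = "myemail@gmail.com",
  `ctl00$Header2$HeaderTop1$tbPassword`  = "mypassword")

resp <- POST(url, body = fd, encode = "form")
content(resp)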


# The rvest submit_form function is still under construction and does not work for web sites that build URLs (i.e. GET requests). It does seem to work for POST requests.
#url <- submit_form(session, indeed)

# Version 1 of our submit_form function
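# Note: rvest:::submit_request() is an internal rvest function, so it may differ between versions
submit_form2 <- function(session, form){
  # resolve the form's action against the session URL
  url <- XML::getRelativeURL(form$url, session$url)
  url <- paste(url,'?',sep='')
  values <- as.vector(rvest:::submit_request(form)$values)
  att <- names(values)
  # drop the last entry when it is named "NULL"
  if (tail(att, n=1) == "NULL"){
    values <- head(values, -1)
    att <- head(att, -1)
  }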
  q <- paste(att,values,sep='=')
  q <- paste(q, collapse = '&')
  q <- gsub(" ", "+", q)
  url <- paste(url, q, sep = '')
  html_session(url)
}

# Version 2 of our submit_form function
library(httr)
# Appends element of a list to another without changing variable type of x
# build_url function uses the httr package and requires a variable of the url class
appendList <- function (x, val)
{
  stopifnot(is.list(x), is.list(val))
  xnames <- names(x)
  for (v in names(val)) {
    x[[v]] <- if (v %in% xnames && is.list(x[[v]]) && is.list(val[[v]]))
      appendList(x[[v]], val[[v]])
    else c(x[[v]], val[[v]])
  }
  x
}

# Simulating submit_form for GET requests
submit_geturl <- function (session, form)
{
  query <- rvest:::submit_request(form)
  query$method <- NULL
  query$encode <- NULL
  query$url <- NULL
  names(query) <- "query"
  
  relativeurl <- XML::getRelativeURL(form$url, session$url)
  basepath <- parse_url(relativeurl)
  
  fullpath <- appendList(basepath,query)
  fullpath <- build_url(fullpath)
  fullpath
}
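
The two helpers above assemble the query string of a GET form by hand. As a possible alternative, httr's modify_url() can attach a query list to a URL directly. This is a minimal sketch, assuming the form's action is /jobs and its fields are q and l, as in the form printout earlier in this message. It only builds the URL and sends no request.

```r
library(httr)

# Assumed search terms; "q" and "l" are the field names shown in the form printout
search_url <- modify_url(
  "http://www.indeed.com/jobs",
  query = list(q = "data science", l = "New York")
)

# Parse the URL back to check that the query parameters round-trip
parsed <- parse_url(search_url)
parsed$query$q
parsed$query$l
```

The resulting search_url could then be passed to html_session() or read_html(). Whether the site still accepts these parameters is not something this sketch checks.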


# Submit form and get new url
session1 <- submit_form2(session, form)

# Get reviews of last company using follow_link()
session2 <- follow_link(session1, css = "#more_9 li:nth-child(3) a")
reviews <- session2 %>% html_nodes(".description") %>% html_text()
reviews

Javier Rubén Marcuzzi

From: WILMER CONTRERAS SEPULVEDA
Sent: Wednesday, June 21, 2017 19:55
To: r-help-es at r-project.org
Subject: [R-es] Function to automatically download information from a website

Good afternoon.

I would like to know whether there is a function, script, or something similar
in R for automatically downloading data from a web page.

What I want to do is download the information presented on this page:

http://www.upme.gov.co/GeneradorConsultas/Consulta_SuiConsumo.aspx?IdModulo=2&Servicio=4


for different inputs, for example varying the "municipio" (municipality) field
and also varying the field that selects the query.

Many thanks.

Wilmer Contreras S.


_______________________________________________
R-help-es mailing list
R-help-es at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-help-es

