[R] webscraping a multi-level website
Ilio Fornasero
iliofornasero at hotmail.com
Thu Apr 18 10:35:36 CEST 2019
Hello.
I am trying to scrape a website that contains links from which I need to extract information. I have been working on this for a few days.
## So far, I can get the page and the URLs I am interested in:
library(rvest)

url <- "http://www.fao.org/countryprofiles/en/"
webscrape <- read_html(url)

urls <- webscrape %>%
  html_nodes(".linkcountry") %>%
  html_attr("href") %>%
  as.character()
## This gives me the full links:
urls <- paste0("http://www.fao.org", urls)
## Nevertheless, I prefer this option:
urls <- paste0("http://www.fao.org", urls_country <- data.frame(country=character(), country_url=character()))
## Then I loop over the links to get the News items:
for (i in urls) {
  webscrape1 <- read_html(i)
  country <- webscrape1 %>%
    html_nodes(".#newsItems") %>%
    html_text() %>%
    as.character()
  country_url <- webscrape1 %>%
    html_nodes(".#newsItems") %>%
    html_attr("href") %>%
    as.character()
  temp_fao <- data.frame(country, country_url)
  urls_country <- rbind(urls_country, temp_fao)
  cat("*")
}
Either way, I get the following error message:
Error in open.connection(x, "rb") :
Could not resolve host: www.fao.orginteger(0)
Any hint?
Thanks in advance