[R] Authentication and Web Site Scraping
G.Maubach at gmx.de
Sat Jan 21 11:19:44 CET 2017
Hi All,
I would like to learn how to scrape a web site that is password protected. I am practising on my own Delicious web site, and I will obey all applicable rules and legislation.
The Delicious export API was shut down, and I assume the web site itself will be shut down in the foreseeable future. In my Coursera course I learned that it is possible to scrape web sites and extract the information in them. I would like to use this approach to download my bookmark pages and extract the bookmarks with their accompanying tags, as an alternative to the no-longer-existing export API.
I started with
-- cut --
url_base <- "https://del.icio.us/gmaubach?&page="
date_created <- as.character(Sys.Date())
filename_base <- paste0(date_created, "_Delicious_Page_")
page_start <- 1
page_end <- 670

# Download each bookmark page to a dated file.
# (page_start:page_end is clearer than seq_along(page_start:page_end),
# which happened to give the same 1..670 sequence here.)
for (page in page_start:page_end) {
  download.file(
    url = paste0(url_base, page),
    destfile = paste0(filename_base, page))
}
-- cut --
This way approx. 1000 bookmarks are not loaded, because without logging in only the public bookmarks are shown. I know that it is possible to authenticate using something like
-- cut --
library(httr)

page <- GET("https://del.icio.us",
            authenticate("user", "password"))
-- cut --
To avoid authenticating over and over again, it is possible to use a handle, like
-- cut --
delicious <- handle("https://del.icio.us")
-- cut --
I do not know how to put it all together. What would be a statement sequence for getting all stored bookmarks on pages 1..670 using authentication?
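One possible way to combine the pieces above might look like the sketch below, assuming the httr package and assuming that del.icio.us accepts HTTP basic authentication (that is an assumption on my part; if the site uses a login form instead, a POST to the form's action URL would be needed before fetching the pages):

```r
library(httr)  # assumed to be installed

# Build the full list of page URLs up front.
url_base <- "https://del.icio.us/gmaubach?&page="
urls <- paste0(url_base, 1:670)

# Fetch one page through a shared handle so the authenticated session
# (cookies) is reused across requests instead of logging in every time.
# NOTE: basic authentication is an assumption here, not confirmed for
# this site.
fetch_page <- function(url, h, user, password) {
  resp <- GET(url, handle = h, authenticate(user, password))
  stop_for_status(resp)  # abort on HTTP errors such as 401/404
  content(resp, as = "text", encoding = "UTF-8")
}

# Usage (not run here):
# h <- handle("https://del.icio.us")
# filename_base <- paste0(Sys.Date(), "_Delicious_Page_")
# for (i in seq_along(urls)) {
#   html <- fetch_page(urls[i], h, "user", "password")
#   writeLines(html, paste0(filename_base, i, ".html"))
#   Sys.sleep(1)  # be polite to the server
# }
```

The shared handle is what keeps the session alive between requests; passing `authenticate()` on each GET merely re-sends the same credentials over that session.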
Kind regards
Georg