[R] R help - Web Scraping of Google News using R
Kumar Gauraw
string.gauraw at gmail.com
Tue May 24 17:21:38 CEST 2016
Hello Experts,
I am trying to scrape data from Google News for a particular topic using the
XML and RCurl packages in R. I am able to extract the summary part of the
news through XPath, but when I try to extract the titles and links of the
news items in the same way, it does not work. Please note this work is just
for proof-of-concept purposes, and I would make a maximum of 500 requests
per day so that Google's TOS remains intact.
library(XML)
library(RCurl)

getGoogleURL <- function(search.term, domain = '.co.in', quotes = TRUE) {
    # URL-encode spaces and optionally wrap the term in quotes
    search.term <- gsub(' ', '%20', search.term)
    if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
    paste('http://www.google', domain,
          '/search?hl=en&gl=in&tbm=nws&authuser=0&q=', search.term, sep = '')
}
search.term <- "IPL 2016"
quotes <- FALSE
search.url <- getGoogleURL(search.term = search.term, quotes = quotes)
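For reference, the constructed URL can be inspected before making any
request:

# Print the query URL that will be fetched
cat(search.url)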
getGoogleSummary <- function(google.url) {
    doc <- getURL(google.url, httpheader = c("User-Agent" = "R(2.10.0)"))
    html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...) {})
    # Summary text sits in <div class="st"> nodes on the results page
    nodes <- getNodeSet(html, "//div[@class='st']")
    sapply(nodes, xmlValue)
}
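This part works as expected; I call it like so (how many results come back
naturally depends on what Google serves):

summaries <- getGoogleSummary(search.url)
head(summaries)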
# The problem is with this part of the code:
getGoogleTitle <- function(google.url) {
    doc <- getURL(google.url, httpheader = c("User-Agent" = "R(2.10.0)"))
    html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...) {})
    nodes <- getNodeSet(html, "//a[@class='l _HId']")   # returns no nodes
    sapply(nodes, xmlValue)
}
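As a sketch of an alternative I considered: one could avoid matching the
exact class string, since the 'l _HId' value is just what my browser showed
and may not be what Google serves to R. Assuming the title anchors sit
inside <h3> headings, as on the regular results page (I have not confirmed
this against the current Google News markup, and getGoogleTitleLinks is
just a name I made up), something like this might work:

getGoogleTitleLinks <- function(google.url) {
    doc <- getURL(google.url, httpheader = c("User-Agent" = "R(2.10.0)"))
    html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...) {})
    # Assumed structure: result titles are <a> elements under <h3>;
    # this sidesteps the brittle exact-class match above
    nodes <- getNodeSet(html, "//h3/a")
    list(titles = sapply(nodes, xmlValue),
         links = sapply(nodes, function(x) xmlGetAttr(x, "href")))
}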
Kindly help me understand where I am going wrong so that I can rectify the
code and get the correct output.
Thank you.
With Regards,
Kumar Gauraw