[R] R help - Web Scraping of Google News using R
boB Rudis
bob at rudis.net
Wed May 25 06:16:52 CEST 2016
What you are doing wrong is attempting to violate Google's Terms of
Service, and asking others to help you do so, which (amongst other
things) risks getting your IP banned, along with that of anyone who
aids you (or worse). Please don't.
Just because something can be done does not mean it should be done.
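
If the goal is only to learn XPath extraction for a proof of concept,
the same mechanics can be practised on HTML you are permitted to parse.
Below is a minimal sketch, assuming a made-up HTML fragment: the class
names (title, main, summary) and the example.com URLs are invented for
illustration and do not describe any real site's markup. It also shows
one general XPath point: a test such as @class='a b' only matches when
the whole attribute string is exactly 'a b', so matching a single class
token is usually more robust.

```r
library(XML)

# Hypothetical HTML fragment; class names and URLs are invented for
# illustration and do not reflect any real site's markup.
page <- '
<html><body>
  <div class="item">
    <a class="title main" href="http://example.com/a">First headline</a>
    <div class="summary">Summary of the first item.</div>
  </div>
  <div class="item">
    <a class="title main" href="http://example.com/b">Second headline</a>
    <div class="summary">Summary of the second item.</div>
  </div>
</body></html>'

# Parse the string directly; no network request is made.
doc <- htmlParse(page, asText = TRUE)

# Match anchors carrying the class token "title", whatever other
# classes share the attribute. @class='title' alone would match
# nothing here, because the full attribute value is "title main".
links <- getNodeSet(
  doc,
  "//a[contains(concat(' ', normalize-space(@class), ' '), ' title ')]"
)

titles    <- sapply(links, xmlValue)            # link text
urls      <- sapply(links, xmlGetAttr, "href")  # link target
summaries <- xpathSApply(doc, "//div[@class='summary']", xmlValue)

result <- data.frame(title = titles, url = urls, summary = summaries,
                     stringsAsFactors = FALSE)
print(result)
```

xmlValue() returns the text of a node and xmlGetAttr() returns a named
attribute, so titles and links come from the same node set.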
On Tue, May 24, 2016 at 11:21 AM, Kumar Gauraw <string.gauraw at gmail.com> wrote:
> Hello Experts,
>
> I am trying to scrape data from Google News for a particular topic using
> the XML and RCurl packages of R. I am able to extract the summary part of
> the news through *XPath*, but when I try to extract the titles and links
> of the news in a similar way, it does not work. Please note this work is
> only a proof of concept (POC), and I would make a maximum of 500 requests
> per day so that the Google TOS remains intact.
>
>
> library(XML)
> library(RCurl)
>
> # Build the Google News search URL for a search term.
> getGoogleURL <- function(search.term, domain = '.co.in', quotes = TRUE) {
>   search.term <- gsub(' ', '%20', search.term)
>   if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
>   # Return the URL visibly rather than assigning it to a local variable.
>   paste('http://www.google', domain,
>         '/search?hl=en&gl=in&tbm=nws&authuser=0&q=', search.term, sep = '')
> }
>
> search.term <- "IPL 2016"
> quotes <- FALSE  # logical, not the string "FALSE"
> search.url <- getGoogleURL(search.term = search.term, quotes = quotes)
>
> # Extract the summary text of each result.
> getGoogleSummary <- function(google.url) {
>   doc <- getURL(google.url, httpheader = c("User-Agent" = "R(2.10.0)"))
>   html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...) {})
>   nodes <- getNodeSet(html, "//div[@class='st']")
>   sapply(nodes, xmlValue)
> }
>
> # Problem is with this part of the code
> getGoogleTitle <- function(google.url) {
>   doc <- getURL(google.url, httpheader = c("User-Agent" = "R(2.10.0)"))
>   html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...) {})
>   # This is the line that does not return the expected nodes.
>   nodes <- getNodeSet(html, "//a[@class='l _HId']")
>   sapply(nodes, xmlValue)
> }
>
> Kindly help me understand where I am going wrong so that I can fix
> the code and get the correct output.
>
> Thank you.
>
> With Regards,
> Kumar Gauraw