[R] Developing a web crawler / R "webkit" or something similar?

Mike Marchywka marchywka at hotmail.com
Thu Mar 3 15:07:19 CET 2011

> Date: Thu, 3 Mar 2011 01:22:44 -0800
> From: antujsrv at gmail.com
> To: r-help at r-project.org
> Subject: [R] Developing a web crawler
>
> Hi,
>
> I wish to develop a web crawler in R. I have been using the functionalities
> available under the RCurl package.
> I am able to extract the HTML content of the site but I don't know how to go

In general this can be a big effort, but there may be things in the
text-processing packages that you could adapt to handle HTML and JavaScript.
However, I guess what I'd be looking for is something like a "webkit"
package, or some other open-source browser engine, with or without an "R"
interface. That may actually be an ideal solution for a lot of things, since
you get all the content handlers of at least one browser.


Now that you mention it, I wonder if there are browser plugins to handle
"R" content (I'd have to give this some thought: put a script up as
a web page with MIME type "text/R" and have the browser execute it in R).



> about analyzing the html formatted document.
> I wish to know the frequency of a word in the document. I am only acquainted
> with analyzing data sets.
> So how should I go about analyzing data that is not available in table
> format?
>
> A few chunks of code that I wrote:
> w <-
> getURL("http://www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B003DZ1Y8Q/ref=dp_reviewsanchor#FullQuotes")
> write.table(w,"test.txt")
> t <- readLines(w)
>
> readLines also didn't prove to be of any help.
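
On the word-count question: getURL() already returns the page as a
character string, so readLines(w) fails because readLines() expects a file
name or connection rather than the text itself. A minimal base-R sketch of
counting words in such a string follows. The names word_freq and sample_html
are made up for illustration, and the regex-based tag stripping is a crude
assumption, not real HTML parsing.

```r
# Hypothetical helper: count word frequencies in an HTML string.
# Base R only; stripping tags with regular expressions is a rough
# approximation and will miss edge cases a real parser would handle.
word_freq <- function(html) {
  # Accept either one string or a vector of lines.
  html <- paste(html, collapse = "\n")
  # Drop <script> and <style> blocks together with their contents.
  txt <- gsub("(?is)<(script|style)\\b.*?</\\1\\s*>", " ", html, perl = TRUE)
  # Drop the remaining tags.
  txt <- gsub("<[^>]+>", " ", txt)
  # Drop character entities such as &amp; or &#160;.
  txt <- gsub("&[#A-Za-z0-9]+;", " ", txt)
  # Lower-case, then split on anything that is not a letter or apostrophe.
  words <- strsplit(tolower(txt), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  # Frequency table, most common words first.
  sort(table(words), decreasing = TRUE)
}

# Small inline sample so the sketch runs without network access.
sample_html <- paste0(
  "<html><head><style>p {color: red}</style></head><body>",
  "<p>Kindle is a reader.</p>",
  "<p>The Kindle reader &amp; the Kindle store.</p>",
  "</body></html>"
)
freq <- word_freq(sample_html)
print(head(freq, 5))
```

With RCurl loaded, the same function should apply to a fetched page as
freq <- word_freq(getURL(url)). For anything beyond a quick count, parsing
the page with the XML package (htmlParse with asText = TRUE, then xmlValue
on the text nodes) is probably sturdier than regular expressions.
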
>
> Any help would be highly appreciated. Thanks in advance.

