[R] Scrap java scripts and styles from an html document
antujsrv at gmail.com
Thu Apr 7 13:15:50 CEST 2011
I am working on developing a web crawler.
What I want is a cleaned HTML document with only the HTML tags and textual
content, so that I can figure out the pattern of the web page. This is being done to
extract information from the webpage, like comments for a particular product.
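
A minimal sketch of this cleaning step, assuming the xml2 package is installed. The sample page and the helper name clean_html are made up for illustration; a real crawler would pass in the downloaded HTML instead.

```r
library(xml2)

# Hypothetical sample page standing in for a downloaded document.
page <- '<html><head><style>p { color: red; }</style>
<script>var x = 1;</script></head>
<body><div><p>Great product</p><script>track();</script></div></body></html>'

# clean_html is an illustrative helper, not a function provided by xml2.
clean_html <- function(html) {
  doc <- read_html(html)
  # Remove every <script> and <style> element, wherever it appears in the tree.
  xml_remove(xml_find_all(doc, "//script | //style"))
  doc
}

cleaned <- clean_html(page)

# Print the remaining markup: only tags and their text should be left.
cat(as.character(cleaned))
```

The XPath union `//script | //style` selects both kinds of element in one pass, so the rest of the tree (tags and text) is left untouched.
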
For example, amazon.com has all such comments within the
occurring for breaks. So the tags which appear the most help us in
locating the required information. Different websites have different
structures, but it's more likely that the tags that occur the most will have the
relevant information enclosed in them.
So, once the HTML page is cleaned, it would be easy to roll up the tags and,
knowing their frequency of occurrence, target the information.
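
The roll-up described above could be sketched as follows, again assuming the xml2 package. The sample page is hypothetical and stands in for an already cleaned document.

```r
library(xml2)

# Hypothetical cleaned page, used only to illustrate the counting step.
page <- '<html><body>
<div><p>First comment</p><br/><p>Second comment</p><br/><p>Third comment</p></div>
</body></html>'

doc <- read_html(page)

# Collect the name of every element in the document.
tags <- xml_name(xml_find_all(doc, "//*"))

# Tabulate the names and sort so the most frequent tags come first.
tag_counts <- sort(table(tags), decreasing = TRUE)
print(tag_counts)
```

The top entries of tag_counts are then the candidate tags to inspect for the repeated content, such as product comments.
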
Should there be any suggestions to help, please let me know. I would be more
View this message in context: http://r.789695.n4.nabble.com/Scrap-java-scripts-and-styles-from-an-html-document-tp3413894p3433052.html
Sent from the R help mailing list archive at Nabble.com.