[R] Scrap java scripts and styles from an html document

Duncan Temple Lang duncan at wald.ucdavis.edu
Tue Mar 29 17:57:09 CEST 2011



On 3/28/11 11:38 PM, antujsrv wrote:
> Hi,
> 
> I am working on developing a web crawler in R and I need some help with
> removing JavaScript and style sheets from the HTML document of a web page.
> 
> I tried using the XML package, specifically the function xpathApply:
> 
>   library(XML)
>   txt = xpathApply(html,
>           "//body//text()[not(ancestor::script)][not(ancestor::style)]",
>           xmlValue)
> 
> The output comes out as text lines, without any HTML tags. I want the HTML
> tags to remain intact and scrap only the JavaScript and styles from it.

Well then, you would be best served by doing exactly that, i.e.
find the nodes named script and style and then remove them from
the tree. Then you have the document as a single object
rather than a bunch of individual elements.

So

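 # find the script and style nodes under body
 nodes = xpathApply(html, "//body//script | //body//style")
 # remove them from the tree; this modifies html in place
 removeNodes(nodes)

 # serialize the remaining document, with the other tags intact
 saveXML(html)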
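
For a self-contained illustration of the steps above, here is a minimal
sketch. The sample page and the variable names page and out are made up
for the example, and it assumes the document is parsed with htmlParse();
getNodeSet() is used here simply as the XML package's function for
fetching nodes without applying a function to them.

```r
library(XML)

# Hypothetical sample page, used only for illustration
page = '<html><head><title>t</title></head><body>
<p>Hello <b>world</b></p>
<script type="text/javascript">alert("hi");</script>
<style>p { color: red; }</style>
</body></html>'

# Parse the string into an internal (C-level) document
html = htmlParse(page, asText = TRUE)

# Collect the script and style elements under body, then detach them.
# The document is modified in place.
nodes = getNodeSet(html, "//body//script | //body//style")
removeNodes(nodes)

# Serialize the remaining tree; the other tags are left intact
out = saveXML(html)
```

Dropping the //body prefix from the two XPath steps should extend the
same removal to script and style elements in the head as well, if that
is what is wanted.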


But you don't say what you want to end up with or what you are doing with
the resulting content or why you have to remove the JavaScript content, etc.

  D.

> 
> Any help would be highly appreciated.
> Thanks in advance.
> 
> 
> --
> View this message in context: http://r.789695.n4.nabble.com/Scrap-java-scripts-and-styles-from-an-html-document-tp3413894p3413894.html
> Sent from the R help mailing list archive at Nabble.com.
