[R] Using xpathApply or getNodeSet to get text between two distinct tags

Simon Kiss sjkiss at gmail.com
Fri May 11 19:14:33 CEST 2012


Hello:
 
The following code extracts the links to the daily transcripts of Canada's House of Commons. 'links' is a matrix of URLs (ncol=1), each of which points to one day's transcripts.

If you inspect the source of scrape(links[1]), you will find that periodically an italic tag appears inside a paragraph tag (<p some text><i>Translation</i></p>). At this point, the speaker is speaking French.

Then there are some <div> tags that contain the speech text, and then, after the speaker has returned to English, you get the same pattern as above: <p some text><i>English</i></p><div> some speech </div><div>Some Speech </div>
Ultimately, what I'd like to do is count the words between the <i> tags 'Translation' and 'English'.
I'm pretty sure I can get the text into the tm package to do the word counts. What I really don't know how to do is return the text between 'Translation' and 'English' so that I can mark it as French, and then return the text between 'English' and 'Translation' and mark it as English.
Does anyone have any suggestions? Yours truly,
Simon J. Kiss


#Necessary libraries
library(XML)
library(scrapeR)
#URL for links to 2012 transcripts
hansard<-c('http://www.parl.gc.ca/housechamberbusiness/ChamberSittings.aspx?View=H&Language=E&Mode=1&Parl=41&Ses=1')
#Scrape the page with the links
doc<-scrape(url=hansard, parse=TRUE, follow=TRUE)
#scrape() appears to return a list with one parsed document per URL; keep the first (and only) one
doc<-doc[[1]]
#Get the root node of the parsed document
doc<-  xmlRoot(doc)
#Get nodes that contain only the links to each day's transcripts
links<-  getNodeSet(doc, "//a[@class='PublicationCalendarLink']/@href")
links<-matrix(links)
#Paste those href links to the root URL
links<-apply(links, 1, function(x) paste('http://www.parl.gc.ca', x, sep=''))
#Inspect
links[1]
#Scrape text from first URL in 'links'
oneday<-scrape(links[1])[[1]]

#Return the <i> elements nested in <p> elements from 'oneday'
getNodeSet(oneday, "//p/i")
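
To make the goal concrete, here is a minimal sketch of the labelling I am after, run on a made-up fragment rather than the real page. The HTML string, the variable names, and the assumption that a sitting opens in English are all hypothetical; the real Hansard markup may well differ. It selects the marker paragraphs and the <div> blocks in a single XPath step, walks them in order, and switches the language label at each marker.

```r
library(XML)

# Hypothetical stand-in for one day's page; the real markup may differ
html <- paste0(
  "<html><body>",
  "<p class='x'><i>Translation</i></p>",
  "<div>Bonjour monsieur le president</div>",
  "<div>Merci beaucoup</div>",
  "<p class='x'><i>English</i></p>",
  "<div>Thank you Mr. Speaker</div>",
  "</body></html>")
page <- htmlParse(html, asText = TRUE)

# One XPath step that matches both the marker paragraphs and the speech
# blocks, so the nodes come back in the order they appear on the page
nodes <- getNodeSet(page,
  "//*[self::div or (self::p and (i='Translation' or i='English'))]")

current <- "English"      # assumption: speech starts in English
lang    <- character(0)   # language label for each speech block
speech  <- character(0)   # text of each speech block

for (k in seq_along(nodes)) {
  node <- nodes[[k]]
  if (xmlName(node) == "p") {
    # 'Translation' marks the start of French, 'English' the return to English
    marker  <- xmlValue(node[["i"]])
    current <- if (marker == "Translation") "French" else "English"
  } else {
    lang   <- c(lang, current)
    speech <- c(speech, xmlValue(node))
  }
}

# Crude whitespace-based word count per block, then totals per language
words  <- sapply(strsplit(speech, "[[:space:]]+"), length)
counts <- sapply(split(words, lang), sum)
print(counts)
```

The labelled text in 'speech' and 'lang' could presumably be handed to tm in place of the crude whitespace count used here.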

#sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/en_US.UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] scrapeR_0.1.6  RCurl_1.91-1   bitops_1.0-4.1 XML_3.9-4     

loaded via a namespace (and not attached):
[1] tools_2.15.0
*********************************
Simon J. Kiss, PhD
Assistant Professor, Wilfrid Laurier University
73 George Street
Brantford, Ontario, Canada
N3T 2C9
Cell: +1 905 746 7606


