[R] Webscraping - How to Scrape Out Text Into R As If Copied & Pasted From Webpage?

Henrique Dallazuanna wwwhsd at gmail.com
Thu Oct 27 02:04:37 CEST 2011


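Use an XPath query. Parse the page with useInternalNodes = TRUE so that
xpathApply() can be run on the result:

library(XML)

# web.pg on the right-hand side is the character vector from readLines()
web.pg <- htmlTreeParse(file=web.pg, ignoreBlanks=TRUE, useInternalNodes = TRUE)

# Job title
xpathApply(web.pg, "//span[@class='normal']//b", xmlValue)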
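
For the broader goal in the question (all of the visible text, not only the job titles), one possible approach is to select every text node under <body> and skip the contents of <script> and <style>. The following is only a sketch under stated assumptions: the inline HTML is a hypothetical stand-in for the real page, and the names html, doc, txt and page.text are illustrative, not from the original code.

```r
library(XML)

# Hypothetical stand-in for the downloaded page; in practice this would be
# the character vector returned by readLines() on the real URL.
html <- c(
  "<html><head><title>Jobs</title><style>b {color: red}</style></head>",
  "<body><span class='normal'><b>Research Analyst</b></span>",
  "<p>Analyze institutional data.</p>",
  "<script>var x = 1;</script></body></html>"
)

# Parse from text into an internal document so XPath can be used
doc <- htmlTreeParse(paste(html, collapse = "\n"), asText = TRUE,
                     useInternalNodes = TRUE)

# All text nodes under <body>, skipping script and style contents
txt <- xpathSApply(doc,
                   "//body//text()[not(ancestor::script)][not(ancestor::style)]",
                   xmlValue)

# Trim surrounding whitespace and drop strings that end up empty
txt <- gsub("^\\s+|\\s+$", "", txt)
txt <- txt[nzchar(txt)]

# One string, roughly what "select all, copy, paste" would give
page.text <- paste(txt, collapse = "\n")
cat(page.text, "\n")
```

The resulting page.text can then be split into words for text mining. How closely it matches a real copy and paste depends on the page markup.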

On Wed, Oct 26, 2011 at 9:36 PM, Moser, Gary <Gary_Moser at heald.edu> wrote:
> Greetings,
>
> I am trying to get all of the text from a web page as if I "selected
> all" on the page, pasted into a text file, and then read in the text
> file with read.csv().
>
> # this is the actual page I'm trying to acquire text from:
> web.pg <- readLines("http://www.airweb.org/?page=574")
>
> # then parsed in hopes of an easier structure to work with:
> web.pg <- htmlTreeParse(file=web.pg, ignoreBlanks=TRUE)
>
> Now I have a lovely html tree, but don't know the best way to get just
> the text components (job descriptions, job titles, etc...) as they
> appear on the web site. I'd like to do a little text mining and make a
> wordcloud using the text. Can anybody suggest a method to achieve this
> result?
>
> Thank you,
>
> Gary R. Moser
> Institutional Research Analyst
> Heald College
> p <- 415.808.1533
> f <- 415.808.1598
> gary_moser at heald.edu
>



-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O


