[R] Extracting a website text content using R

Steven McKinney smckinney at bccrc.ca
Thu Aug 2 02:53:40 CEST 2007

>-----Original Message-----
>From: r-help-bounces at stat.math.ethz.ch on behalf of Am Stat
>Sent: Wed 8/1/2007 2:19 PM
>To: r-help at stat.math.ethz.ch
>Subject: [R] Extracting a website text content using R
>Dear useR,

>Just wondering whether there is any function in R that could
>let me get the text contents of a certain website.

>Thanks a lot!




Is this what you had in mind?

> foo <- scan(url("http://cran.r-project.org/"), what = "character")
Read 69 items
> paste(unlist(foo), collapse = " ")
[1] "<!DOCTYPE HTML PUBLIC -//IETF//DTD HTML//EN > <html> <head> <title>The Comprehensive R Archive Network</title> <link rel=\"icon\" href=\"favicon.ico\" type=\"image/x-icon\"> <link rel=\"shortcut icon\" href=\"favicon.ico\" type=\"image/x-icon\"> <link rel=\"stylesheet\" type=\"text/css\" href=\"R.css\"> </head> <FRAMESET cols=\"1*, 4*\" border=0> <FRAMESET rows=\"120, 1*\"> <FRAME src=\"logo.html\" name=\"logo\" frameborder=0> <FRAME src=\"navbar.html\" name=\"contents\" frameborder=0> </FRAMESET> <FRAME src=\"banner.shtml\" name=\"banner\" frameborder=0> <noframes> <h1>The Comprehensive R Archive Network</h1> Your browser seems not to support frames, here is the <A href=\"navbar.html\">contents page</A> of CRAN. </noframes> </FRAMESET>"
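Note that the result above is the raw HTML source, markup included. If only the readable text is wanted, a rough sketch in base R is to strip anything that looks like a tag with a regular expression. The helper name strip_tags is hypothetical (it is not an existing R function), and the approach is crude: it does not decode entities such as &amp; and it will mishandle scripts or comments containing ">". A real HTML parser would be more robust.

```r
# Hypothetical helper: crude tag stripping with base R regular expressions.
strip_tags <- function(html) {
  text <- gsub("<[^>]*>", " ", html)        # replace each <...> tag with a space
  text <- gsub("[[:space:]]+", " ", text)   # collapse runs of whitespace
  gsub("^ +| +$", "", text)                 # trim leading and trailing spaces
}

# Offline example on a small literal string (no network access needed)
page <- "<html> <head> <title>The Comprehensive R Archive Network</title> </head> </html>"
strip_tags(page)
```

With the session above, strip_tags(paste(foo, collapse = " ")) would be the corresponding call on the downloaded page.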

Try the search phrase

cran scan url

in Google to find more information about R functions that can deal with URLs.

In R try

> apropos("URL")
 [1] "contourLines"   "URLdecode"      "URLencode"      "browseURL"      "contrib.url"    "main.help.url"  "url.show"      
 [8] "loadURL"        "read.table.url" "scan.url"       "source.url"     "url"           
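Of the functions listed, url() is the one used above to open the connection. As an alternative to scan(), which splits the page on whitespace, readLines() reads a connection line by line and so keeps the page's own line structure. A minimal sketch, assuming a hypothetical wrapper named read_page (not part of R) that accepts any connection:

```r
# Hypothetical wrapper: read all lines from a connection into one string.
read_page <- function(con) {
  on.exit(close(con))                     # always release the connection
  lines <- readLines(con, warn = FALSE)   # one element per line of the page
  paste(lines, collapse = "\n")
}

# Offline demonstration using a text connection in place of a URL
demo <- read_page(textConnection(c("<html>", "<body>Hello</body>", "</html>")))
cat(demo, "\n")
```

For a live page the call would be read_page(url("http://cran.r-project.org/")), which needs network access.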


