[R] Web scraping - getURL with delay

Jeff Newmiller jdnewmil at dcn.davis.CA.us
Mon Aug 13 18:00:04 CEST 2012


Perhaps ?Sys.sleep between scrapes. If this slows things down too much, you may be able to parallelize by host site with ?mclapply.
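
For example, a minimal sketch of such a delayed loop, assuming the csv has a
single column named "Link" (as described below) and that a failed fetch
either throws an error or returns a page containing "Server overloaded"; the
2-second pause is arbitrary:

library(RCurl)

addresses <- read.csv("~/Extract post - forum.csv", stringsAsFactors = FALSE)
outpath   <- "~/forum - RawData"
failed    <- integer(0)                # indices to retry in a later "redo" pass

for (i in seq_along(addresses$Link)) {
  page <- tryCatch(getURL(addresses$Link[i]),
                   error = function(e) NA_character_)
  if (is.na(page) || grepl("Server overloaded", page, fixed = TRUE)) {
    failed <- c(failed, i)             # remember the bad one for later
  } else {
    text <- gsub("<.+?>", "", page)    # crude tag stripping, as in the code below
    write(text, file = file.path(outpath, paste0(i, ".txt")))
  }
  Sys.sleep(2)                         # pause between requests to spare the server
}

And if the URLs span more than one host, something along these lines could
fetch each host's list on its own core (mclapply is Unix-only; on Windows a
serial loop or parLapply would be needed):

library(parallel)

hosts   <- sub("^https?://([^/]+).*$", "\\1", addresses$Link)
by.host <- split(addresses$Link, hosts)   # one vector of URLs per host
pages   <- mclapply(by.host,
                    function(urls) lapply(urls, function(u) { Sys.sleep(2); getURL(u) }),
                    mc.cores = min(4L, length(by.host)))
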
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

Kasper Christensen <kasper2304 at gmail.com> wrote:

>Hi R people.
>
>I'm currently trying to construct a piece of R code that can retrieve a
>list of webpages I have stored in a csv file and save the content of
>each webpage into a separate txt file. I want to retrieve a total of
>6000 threads posted on a forum, to try to build/train a classifier that
>can tell me whether a thread contains valuable information.
>
>*So far* I have managed to get the following code to work:
>
>> library(foreign)
>> library(RCurl)
>Loading required package: bitops
>> addresses <- read.csv("~/Extract post - forum.csv")
>> for (i in addresses) full.text <- getURL(i)
>> text.sub <- gsub("<.+?>", "", full.text)
>> text <- data.frame(text.sub)
>> outpath <- "~/forum - RawData"
>> x <- 1:nrow(text)
>> for (i in x) {
>+   write(as.character(text[i,1]),
>+         file = paste(outpath, "/", i, ".txt", sep=""))
>+ }
>(I have both Mac OS X and Windows)
>
>This piece of code is not my own work, so a warm thank you to Christopher
>Gandrud and co-authors for providing it.
>
>*The problem*
>The code works like a charm, looking up all the different addresses I
>have stored in my csv file. The csv file is constructed as:
>
>Link
>"webaddress 1"
>"webaddress 2"
>"webaddress n"
>
>The problem is that I get empty output files and files saying "Server
>overloaded". However, I do also get files that contain the intended
>information. The pattern of "bad" and "good" files differs each time I
>run the code over the full n, telling me that it is not the code that
>is the problem. Needless to say, it is probably my many requests that
>are causing the overload, and as I am pretty new to the area I did not
>expect this to be a problem. When I realized that it WAS a problem, I
>tried reducing the number of requests to 100 at a time, which gave me
>text files that all contained the info I wanted.
>
>Therefore I am looking for some kind of solution to this problem. My
>own best idea would be to build something into the code that makes it
>send x requests at a given interval z (5 seconds maybe), until I have
>retrieved the total n of webpages in the csv file. If it fails to
>retrieve a webpage, it would be nice to sort the "bad" text files into
>a "redo" folder, which could then be run afterwards.
>
>Any type of solution is welcome. As said, I am pretty new to R coding,
>but I have some coding experience with VBA.
>
>Best
>Kasper
>


