[R] Extracting a chunk of text from a pdf file

Victor vdemart at gmail.com
Sun Sep 18 20:16:19 CEST 2011


That's exactly the way I work. Here is a chunk of my script.
In a nutshell, I'm already extracting the web addresses - by means of grep and gsub from indweb (luckily an HTML file) - such as http://www.terna.it/LinkClick.aspx?fileticket=TTQuOPUf%2fs0%3d&tabid=435&mid=3072 and the like; these point to pdf files (unfortunately for me).
That's why I need to "translate" each pdf into a txt file.
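A minimal sketch of that conversion step, assuming the pdftotext utility (from poppler-utils/xpdf) is installed and on the PATH; the local file names and the "Totale" pattern are illustrative:

```r
# Download one of the scraped PDF links, convert it to plain text with
# pdftotext, and pull out the "Totale" figure with a regular expression.
url <- "http://www.terna.it/LinkClick.aspx?fileticket=TTQuOPUf%2fs0%3d&tabid=435&mid=3072"
download.file(url, "confronto.pdf", mode = "wb")
system("pdftotext -layout confronto.pdf confronto.txt")
righe <- readLines("confronto.txt")
# Keep only the line containing the total, e.g. "Totale: 1,025,823"
tot <- grep("Totale:", righe, value = TRUE)[1]
# Strip everything but the digits and convert to a number
as.numeric(gsub("[^0-9]", "", tot))
```

The -layout flag preserves the page's column layout, which makes line-oriented regex matching with grep much more reliable.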
 
Ciao
Vittorio

==============================
indweb<-"http://www.terna.it/default/Home/SISTEMA_ELETTRICO/dispacciamento/dati_esercizio/dati_giornalieri/confronto.aspx"
testo<-readLines(indweb)


k<-grep("^(.)+dnn_ctr3072_DocumentTerna_grdDocuments_(.)+CategoryCell\">(\\d\\d)/(\\d\\d)/201(\\d)",testo)
n<-length(k)
# Since the dates appear in decreasing order, reverse k so they come out increasing
k<-k[order(k,decreasing=TRUE)]

for (i in 1:length(k) ) {

	data<-gsub("^(.)+dnn_ctr3072_DocumentTerna_grdDocuments_(.)+CategoryCell\">","",testo[k[i]])
	# Reorder "dd/mm/yyyy" into "yyyy-mm-dd"
	data<-paste(substr(data,7,10), substr(data,4,5), substr(data,1,2), sep="-")
	mysel<-paste("select count(*) from richiesta where data=\"",data,"\";",sep="")
	dataesiste<-as.integer(dbGetQuery(con,mysel))

	if (dataesiste == 0) {
		
		rif<-gsub("\">Confronto Giornaliero(.)+","",testo[k[i]])
 		rif<-gsub("^(.)+href=\"","",rif)
		pag<-paste("http://www.terna.it",rif,sep="")
		pagina<-readLines(pag)
		# ... (rest of the loop omitted)
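As an aside, the substr gymnastics above for turning "dd/mm/yyyy" into "yyyy-mm-dd" can also be written with as.Date; a small sketch (the sample date is illustrative):

```r
# Parse a scraped Italian-style date and reformat it as ISO yyyy-mm-dd
d <- as.Date("18/09/2011", format = "%d/%m/%Y")
format(d, "%Y-%m-%d")   # "2011-09-18"
```

Because the format string is purely numeric, this is locale-independent.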

On 18 Sep 2011, at 18:25, Joshua Wiley wrote:

> On Sun, Sep 18, 2011 at 7:44 AM, Victor <vdemart at gmail.com> wrote:
>> Unfortunately pdf2text doesn't seem to exist either on Linux or Mac OS X.
> 
> I think Jeff's main point was to search for software specific for your
> task (convert a pdf to text).  Formatting will be lost, so once you get
> your text files, I would look at regular expressions to try to find
> the right part of text to grab.  Some general functions that seem like
> they might be relevant:
> 
> ## for getting the text into R
> ?readLines
> ?scan
> ## for finding the part you need
> ?regex
> ?grep
> 
> Cheers,
> 
> Josh
> 
> 
>> Ciao Vittorio
>> 
>> On 17 Sep 2011, at 21:00, Jeff Newmiller wrote:
>> 
>>> Doesn't seem like an R task, but see pdf2text? (From pdftools, UNIX command line tools)
>>> ---------------------------------------------------------------------------
>>> Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
>>> Research Engineer (Solar/Batteries/Software/Embedded Controllers)
>>> ---------------------------------------------------------------------------
>>> Sent from my phone. Please excuse my brevity.
>>> 
>>> Victor <vdemart at gmail.com> wrote:
>>> In an R script I need to extract some figures from many web pages in pdf format. As an example, see http://www.terna.it/LinkClick.aspx?fileticket=TTQuOPUf%2fs0%3d&tabid=435&mid=3072, from which I would like to extract the figure "Totale: 1,025,823".
>>> Is there any solution?
>>> Ciao
>>> Vittorio
>>> 
>>> 
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> Joshua Wiley
> Ph.D. Student, Health Psychology
> Programmer Analyst II, ATS Statistical Consulting Group
> University of California, Los Angeles
> https://joshuawiley.com/


