[R] readPDF() -- unsure how to install xpdf to make this work?
Uwe Ligges
ligges at statistik.tu-dortmund.de
Sun Nov 16 21:34:12 CET 2008
Tony Breyal wrote:
> Hi,
>
> Uwe -- ahh, thank you kindly. After reading your post above, I was able
> to do a web search and find a guide on how to set the path in Windows
> (I wasn't aware that this is how a file is made available to the
> system). I haven't got it to work yet, but at least I'm on the right
> track! Also, just after reading your post, I discovered the system()
> function in R. What a wonderful thing that is!
>
> Clair -- I'm still working on getting the files to be accessible to
> the system, but in the meantime I have just discovered the system()
> function in R, which is a workaround for the moment. So, using your
> example, you could do:
> ## R code
> > system(paste('"C:/Program Files/xpdf/pdftotext.exe"', '"C:/Documents and Settings/clair/Desktop/test/r-intro.pdf"'), wait=FALSE)
>
> The above will create a new text document (r-intro.txt) in your c:/../test folder.
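
The system() workaround above can also be written with system2(), which takes the executable and its arguments separately. This is only a sketch under stated assumptions: pdf_to_text() is a hypothetical helper name, the default exe path is the install location mentioned in this thread, and system2() was added to R after the 2.8.0 release used here, so it needs a newer R.

```r
# Hypothetical helper: convert one PDF to text by calling pdftotext directly.
# `exe` is the full path to pdftotext (assumed location; adjust as needed).
pdf_to_text <- function(pdf, exe = "C:/Program Files/xpdf/pdftotext.exe") {
  if (!file.exists(exe)) stop("pdftotext not found at: ", exe)
  if (!file.exists(pdf)) stop("PDF not found: ", pdf)

  # Write the output next to the input, swapping the extension.
  txt <- sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)

  # shQuote() protects the spaces in paths such as "Documents and Settings".
  status <- system2(exe, args = c(shQuote(pdf), shQuote(txt)))
  if (status != 0) stop("pdftotext exited with status ", status)

  txt
}
```

With xpdf installed in that location, pdf_to_text("C:/Documents and Settings/clair/Desktop/test/r-intro.pdf") should return the path of the new .txt file, which readLines() can then read.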
>
>
> Now obviously, we want to use the readPDF() function in the tm package.
> So on my uni laptop, running Windows XP, this is what I have done:
>
> 1. Click through: start >> control panel >> system
> 2. Click the Advanced tab.
> 3. Click Environment variables.
> 4. Click New (under 'system') to add a new variable name and value.
> 4a. name: pdftotext
> 4b. value: C:\Program Files\xpdf\pdftotext.exe
> 5. Click New (under 'system') to add a new variable name and value.
> 5a. name: pdfinfo
> 5b. value: C:\Program Files\xpdf\pdfinfo.exe
>
No, instead of 4 and 5, edit the existing environment variable PATH and
append the xpdf directory to its value:
name:  PATH
value: ...[everything that is already in there]...;C:\Program Files\xpdf
Uwe Ligges
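
The PATH change can also be tried from inside R before touching the Control Panel. The following is a minimal sketch, not the only way to do it: add_to_path() is a hypothetical helper, and xpdf_dir is the install location assumed in this thread. Sys.setenv() only affects the running R session and the programs it starts, so the system-wide setting is still needed for a permanent fix.

```r
# Hypothetical helper: append a directory to PATH for this R session only.
add_to_path <- function(dir) {
  sep <- .Platform$path.sep                      # ";" on Windows, ":" elsewhere
  entries <- strsplit(Sys.getenv("PATH"), sep, fixed = TRUE)[[1]]
  if (!dir %in% entries) {
    Sys.setenv(PATH = paste(Sys.getenv("PATH"), dir, sep = sep))
  }
  invisible(Sys.getenv("PATH"))
}

xpdf_dir <- "C:\\Program Files\\xpdf"            # assumed install location
if (dir.exists(xpdf_dir)) add_to_path(xpdf_dir)

# Sys.which() returns "" for any tool that cannot be found on PATH.
print(Sys.which(c("pdftotext", "pdfinfo")))
```

If both entries come back as non-empty paths, the tools are visible to R and to anything R launches.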
> In theory, I think, that should work. However, so far it hasn't, so I'm
> not quite sure what to do. But at least in the meantime we have the
> system() function as a workaround. If you can figure out what I'm doing
> wrong (probably something obvious, knowing me!) please do let me know.
>
> Cheers,
> Tony Breyal
>
>
>
> On 16 Nov, 18:14, Uwe Ligges <lig... at statistik.tu-dortmund.de> wrote:
>> clair.crossup... at googlemail.com wrote:
>>> I never said it *should* work.
>>> I was simply trying something out that works on other types of files
>>> I've needed in the past (eg: html, csv, dat, etc.). I don't know the
>>> details of the pdf format, but I thought it was worth a try, certainly
>>> no harm in experimenting, and hence I learned that pdfs aren't stored
>>> in the same way that other files i've used in the past are. that's
>>> fine, good to learn new things.
>>> As for trying the readPDF() function, yes, I have downloaded and used
>>> xpdf to convert pdfs into plain text since reading the OP email.
>>> However, how can you make xpdf available to the system so that
>>> readPDF() works in R? I don't know, hence why I posted in this thread.
>>> You clearly seem to have a solution, fancy sharing?
>> Sure, I thought that could not be a real question:
>> Set your environment variable PATH so that it additionally points to the
>> directory where these tools are installed. As you would do for any other
>> software that is to be called without knowledge where it is installed.
>>
>> Uwe Ligges
>>
>>
>>
>>> Clair Crossupton xx
>>> On 16 Nov, 12:34, Uwe Ligges <lig... at statistik.tu-dortmund.de> wrote:
>>>> clair.crossup... at googlemail.com wrote:
>>>>> Hello, I was just wondering if you had found a solution? I am having
>>>>> the same difficulty of converting pdf's into plain text documents in
>>>>> R. I originally thought I could use the readLines() function, but as
>>>>> you can see below that did not work.
>>>> Why the hell should it? It is designed to read *text* files. And what
>>>> you get below is exactly what your PDF file looks like if you read it as
>>>> text, which it is NOT. Why do you not also go the readPDF() way? (And yes,
>>>> it is not always possible or reliable to go that way.)
>>>> Uwe Ligges
>>>>> R> my.destfile <- "C:\\Documents and Settings\\clair\\Desktop\\test\\r-intro.pdf"
>>>>> R> my.url <- "http://cran.r-project.org/doc/manuals/R-intro.pdf"
>>>>> R> download.file(url = my.url, destfile=my.destfile, mode='wb')
>>>>> R> txt <- readLines(my.destfile)
>>>>> R> txt
>>>>> [1] "%PDF-1.4"
>>>>> [2] "%ÐÔÅØ"
>>>>> [3] "1 0 obj <<"
>>>>> [4] "/Length 587 "
>>>>> [5] "/Filter /FlateDecode"
>>>>> [6] ">>"
>>>>> [7] "stream"
>>>>> [8] "xÚmTM ¢@\020½ó+z\017&ÎÁ±?\024tBL\020$ñ°ãd4›½*´.‰\002\001<øï·_•èÌf
>>>>> \017’W¯_wÕ«îrðãc;Šòê`GæUŠOÛV×&³£øç¾ö\006ƒ¤Ê®\027[vïÖæ6ïWÛ7ñÑTÙÖvb
>>>>> \030¯“uYt/N¼.³ó5·½êÿ¢¥=\025åS‚<b¸³¿G› "
>>>>> Warm Regards,
>>>>> Clair
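
The output above shows why readLines() is the wrong tool: after a few readable header lines, the stream is compressed (the /Filter /FlateDecode entry) and is not text at all. A small sketch of checking for this up front, assuming a hypothetical helper named is_pdf(), is to read the first bytes with readBin() and compare them with the "%PDF-" signature seen in element [1]:

```r
# Hypothetical helper: TRUE if the file starts with the PDF signature "%PDF-".
is_pdf <- function(path) {
  magic <- readBin(path, what = "raw", n = 5L)   # first five bytes, unparsed
  identical(magic, charToRaw("%PDF-"))
}

# Small self-contained demonstration using temporary files.
pdf_like <- tempfile(fileext = ".pdf")
writeBin(charToRaw("%PDF-1.4\n"), pdf_like)
txt_like <- tempfile(fileext = ".txt")
writeLines("just some text", txt_like)

is_pdf(pdf_like)   # TRUE
is_pdf(txt_like)   # FALSE
```

A file that passes this check should go through pdftotext or readPDF() rather than readLines().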
>>>>> On 13 Nov, 15:10, Tony Breyal <tony.bre... at googlemail.com> wrote:
>>>>>> Dear R-Help,
>>>>>> I need to convert a set of '.pdf' files into an equivalent set of
>>>>>> '.txt' files. This is so that i can do some text mining on the
>>>>>> content.
>>>>>> In the latest R-News letter
>>>>>> (http://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf), the package
>>>>>> 'tm' for text mining is mentioned. In
>>>>>> that lovely package, there is a function called 'readPDF()'. In order
>>>>>> to use this, ?readPDF says
>>>>>> "Note that this PDF reader needs both the tools pdftotext and
>>>>>> pdfinfo installed and accessable on your system."
>>>>>> These tools are available from http://www.foolabs.com/xpdf/download.html
>>>>>> I am able to download this and use it easily from a dos window to
>>>>>> convert a pdf file into a txt file.
>>>>>> Question: how do i make these tools available to R, so that i can use
>>>>>> the readPDF() function?
>>>>>> Thank you in advance for any help, and I hope the above made sense.
>>>>>> Tony Breyal
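
For the original goal of converting a whole set of '.pdf' files into '.txt' files, one possible sketch is a loop over list.files(). The function name convert_pdfs() is hypothetical, the code assumes pdftotext can be found on PATH via Sys.which(), and it relies on system2(), which is only available in R releases newer than the 2.8.0 shown in this message.

```r
# Hypothetical helper: run pdftotext on every PDF in `dir`.
# Returns the paths of the text files that were actually created.
convert_pdfs <- function(dir, exe = unname(Sys.which("pdftotext"))) {
  if (!nzchar(exe)) stop("pdftotext was not found on PATH")

  pdfs <- list.files(dir, pattern = "\\.pdf$", ignore.case = TRUE,
                     full.names = TRUE)
  txts <- sub("\\.pdf$", ".txt", pdfs, ignore.case = TRUE)

  for (i in seq_along(pdfs)) {
    # Quote both paths so that spaces in folder names survive.
    status <- system2(exe, args = c(shQuote(pdfs[i]), shQuote(txts[i])))
    if (status != 0) warning("pdftotext failed for ", pdfs[i])
  }

  txts[file.exists(txts)]
}
```

The returned file names can be passed to readLines() one at a time, for example with lapply(convert_pdfs("some/folder"), readLines), where "some/folder" stands for your own directory.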
>>>>>> ### OS = Windows Vista Ultimate
>>>>>> > sessionInfo()
>>>>>> R version 2.8.0 (2008-10-20)
>>>>>> i386-pc-mingw32
>>>>>> locale:
>>>>>> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>>>>>> attached base packages:
>>>>>> [1] grid stats graphics grDevices utils datasets methods base
>>>>>> other attached packages:
>>>>>> [1] tm_0.3-1 XML_1.98-1 Snowball_0.0-3 RWeka_0.3-14 rJava_0.6-0 Matrix_0.999375-16 lattice_0.17-15 filehash_2.0
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] proxy_0.4-1
>>>>>> ______________________________________________
>>>>>> R-h... at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.