[R] readPDF() -- unsure how to install xpdf to make this work?
Gabor Grothendieck
ggrothendieck at gmail.com
Sun Nov 16 22:42:57 CET 2008
PATH is a list of directories, not file names. Windows searches
those directories for the program you ask it to run.
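As a minimal sketch of what that means from inside R: the helper below, add_to_path(), is a hypothetical name, and the xpdf directory is the one mentioned further down in this thread, so adjust it to your own install. It adds a directory to PATH for the current R session only and refuses an entry that names an .exe file. It assumes that programs R launches inherit the session's environment.

```r
# Append a directory (not a file) to PATH for the current R session only.
# add_to_path() is a hypothetical helper, not part of base R or tm.
add_to_path <- function(dir) {
  # Guard against the common mistake of listing the .exe itself
  if (grepl("\\.exe$", dir, ignore.case = TRUE)) {
    stop("PATH entries must be directories, not files: ", dir)
  }
  sep <- .Platform$path.sep   # ";" on Windows, ":" elsewhere
  entries <- strsplit(Sys.getenv("PATH"), sep, fixed = TRUE)[[1]]
  if (!(dir %in% entries)) {
    Sys.setenv(PATH = paste(c(entries, dir), collapse = sep))
  }
  invisible(Sys.getenv("PATH"))
}

# Assumed install location, taken from the thread; change as needed
add_to_path("C:\\Program Files\\xpdf")
```

The change is lost when R exits; to make it permanent you still have to edit PATH in Windows itself.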
The easiest method discussed on the batchfiles site is to
use the Redmond utility.
You may or may not need to exit R and start it up again once you
have set your path.
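One way to find out whether a restart is needed is to ask the running session directly. A small sketch, assuming the tool names pdftotext and pdfinfo from the ?readPDF note quoted below; tools_found() is a hypothetical helper, while Sys.which() is base R.

```r
# Report which of the xpdf tools the current R session can see on PATH.
# tools_found() is a hypothetical helper name.
tools_found <- function(tools = c("pdftotext", "pdfinfo")) {
  located <- Sys.which(tools)   # "" for a program that is not found
  !is.na(located) & located != ""
}

tools_found()
```

If either value is FALSE after you changed PATH in Windows, the session is probably still using the old value, so restart R and check again.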
If your programs are self-contained, you may be able to just
put them into an existing directory that is already on the path,
which avoids having to set the path at all.
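To see which directories are candidates, you can list what is currently on the path from R. This is only a sketch: path_dirs() is a hypothetical helper, and dir.exists() needs R 3.2.0 or later.

```r
# List the directories on PATH that actually exist on this machine.
# path_dirs() is a hypothetical helper name.
path_dirs <- function() {
  dirs <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
  dirs <- unique(dirs[nzchar(dirs)])   # drop empty and duplicate entries
  dirs[dir.exists(dirs)]
}

print(path_dirs())

# Copying a tool into one of them could then look like this
# (paths are assumptions; you may need administrator rights):
# file.copy("C:\\Program Files\\xpdf\\pdftotext.exe", path_dirs()[1])
```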
On Sun, Nov 16, 2008 at 3:20 PM, Tony Breyal <tony.breyal at googlemail.com> wrote:
> Hi Gabor, yes, i used the link from that website to figure out the
> steps to setting my path: http://www.computerhope.com/issues/ch000549.htm
>
> but i'm still doing something wrong it seems (see my last post in
> response to Joris).
>
> Cheers,
> Tony Breyal
>
>
> On 16 Nov, 20:07, "Gabor Grothendieck" <ggrothendi... at gmail.com>
> wrote:
>> There is information on http://batchfiles.googlecode.com
>> on setting your PATH.
>>
>>
>>
>> On Sun, Nov 16, 2008 at 2:41 PM, Tony Breyal <tony.bre... at googlemail.com> wrote:
>> > Hi,
>>
>> > Uwe -- ahh, thank you kindly, I was able to do a web search after
>> > reading your post above in order to find a guide on how to set the
>> > path in windows (i wasn't aware that this is how a file is made
>> > available to the system). I haven't got it to work yet, but at least
>> > i'm on the right track! also just after reading your post, i've
>> > discovered the system() function in R, what a wonderful thing that is!
>>
>> > Clair -- I'm still working on getting the files to be accessible to
>> > the system, but in the mean time i have just discovered the system()
>> > function in R which is a workaround for the moment... so using your
>> > example, you could do:
>> > ## R code
>> >> system(paste('"C:/Program Files/xpdf/pdftotext.exe"', '"C:/Documents and Settings/clair/Desktop/test/r-intro.pdf"'), wait=FALSE)
>>
>> > the above will create a new text document in your c:/../test folder.
>>
>> > Now obviously, we want to use the readPDF() function in package: tm.
>> > so on my uni laptop, running windows XP, this is what i have done:
>>
>> > 1. Click through: start >> control panel >> system
>> > 2. Click the Advanced tab.
>> > 3. Click Environment variables.
>> > 4. Click New (under 'system') to add a new variable name and value.
>> > 4a. name: pdftotext
>> > 4b. value: C:\Program Files\xpdf\pdftotext.exe
>> > 5. Click New (under 'system') to add a new variable name and value.
>> > 5a. name: pdfinfo
>> > 5b. value: C:\Program Files\xpdf\pdfinfo.exe
>>
>> > In theory, i think, that should work. however so far it hasn't, so not
>> > quite sure what to do. but at least in the mean time we have the
>> > system() function as a workaround. If you can figure out what i'm doing
>> > wrong (probably something obvious knowing me!) please do let me know.
>>
>> > Cheers,
>> > Tony Breyal
>>
>> > On 16 Nov, 18:14, Uwe Ligges <lig... at statistik.tu-dortmund.de> wrote:
>> >> clair.crossup... at googlemail.com wrote:
>> >> > I never said it *should* work.
>>
>> >> > I was simply trying something out that works on other types of files
>> >> > I've needed in the past (eg: html, csv, dat, etc.). I don't know the
>> >> > details of the pdf format, but I thought it was worth a try, certainly
>> >> > no harm in experimenting, and hence I learned that pdfs aren't stored
>> >> > in the same way that other files i've used in the past are. that's
>> >> > fine, good to learn new things.
>>
>> >> > As for trying the readPDF() function, yes, I have downloaded and used
>> >> > xpdf to convert pdfs into plain text since reading the OP email.
>> >> > However, how can you make xpdf available to the system so that
>> >> > readPDF() works in R? i don't know, hence why I posted in this thread.
>>
>> >> > You clearly seem to have a solution, fancy sharing?
>>
>> >> Sure, I thought that could not be a real question:
>> >> Set your environment variable PATH so that it additionally points to the
>> >> directory where these tools are installed. As you would do for any other
>> >> software that is to be called without knowledge where it is installed.
>>
>> >> Uwe Ligges
>>
>> >> > Clair Crossupton xx
>>
>> >> > On 16 Nov, 12:34, Uwe Ligges <lig... at statistik.tu-dortmund.de> wrote:
>> >> >> clair.crossup... at googlemail.com wrote:
>> >> >>> Hello, I was just wondering if you had found a solution? I am having
>> >> >>> the same difficulty of converting pdf's into plain text documents in
>> >> >>> R. I originally thought I could use the readLines() function, but as
>> >> >>> you can see below that did not work.
>> >> >> Why the hell should it? It is designed to read *text* files. And what
>> >> >> you get below is exactly what your PDF file looks like if you read it as
>> >> >> text, which it is NOT. Why do you not also go the readPDF() way (and yes,
>> >> >> it is not always possible nor reliable to go that way)?
>>
>> >> >> Uwe Ligges
>>
>> >> >>> R> my.destfile <- "C:\\Documents and Settings\\clair\\Desktop\\test\\r-intro.pdf"
>> >> >>> R> my.url <- "http://cran.r-project.org/doc/manuals/R-intro.pdf"
>> >> >>> R> download.file(url = my.url, destfile=my.destfile, mode='wb')
>> >> >>> R> txt <- readLines(my.destfile)
>> >> >>> R> txt
>> >> >>> [1] "%PDF-1.4"
>> >> >>> [2] "%ÐÔÅØ"
>> >> >>> [3] "1 0 obj <<"
>> >> >>> [4] "/Length 587 "
>> >> >>> [5] "/Filter /FlateDecode"
>> >> >>> [6] ">>"
>> >> >>> [7] "stream"
>> >> >>> [8] "xÚmTM ¢@\020½ó+z\017&ÎÁ±?\024tBL\020$ñ°ãd4›½*´.‰\002\001<øï·_•èÌf
>> >> >>> \017'W¯_wÕ«îrðãc;Šòê`GæUŠOÛV×&³£øç¾ö\006ƒ¤Ê(R)\027[vïÖæ6ïWÛ7ñÑTÙÖvb
>> >> >>> \030¯"uYt/N¼.³ó5·½êÿ¢¥=\025åS‚<b¸³¿G› "
>> >> >>> Warm Regards,
>> >> >>> Clair
>> >> >>> On 13 Nov, 15:10, Tony Breyal <tony.bre... at googlemail.com> wrote:
>> >> >>>> Dear R-Help,
>> >> >>>> I need to convert a set of '.pdf' files into an equivalent set of
>> >> >>>> '.txt' files. This is so that i can do some text mining on the
>> >> >>>> content.
>> >> >>>> In the latest R-News letter
>> >> >>>> (http://cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf), the package
>> >> >>>> 'tm' for text mining is mentioned. In that lovely package, there is a
>> >> >>>> function called 'readPDF()'. In order to use this, ?readPDF says
>> >> >>>> "Note that this PDF reader needs both the tools pdftotext and
>> >> >>>> pdfinfo installed and accessable on your system."
>> >> >>>> These tools are available from http://www.foolabs.com/xpdf/download.html
>> >> >>>> I am able to download this and use it easily from a dos window to
>> >> >>>> convert a pdf file into a txt file.
>> >> >>>> Question: how do i make these tools available to R, so that i can use
>> >> >>>> the readPDF() function?
>> >> >>>> Thank you in advance for any help, and I hope the above made sense.
>> >> >>>> Tony Breyal
>> >> >>>> ###OS = Windows Vista Ultimate
>> >> >>>> > sessionInfo()
>> >> >>>> R version 2.8.0 (2008-10-20)
>> >> >>>> i386-pc-mingw32
>> >> >>>> locale:
>> >> >>>> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>> >> >>>> attached base packages:
>> >> >>>> [1] grid stats graphics grDevices utils datasets methods base
>> >> >>>> other attached packages:
>> >> >>>> [1] tm_0.3-1 XML_1.98-1 Snowball_0.0-3 RWeka_0.3-14 rJava_0.6-0 Matrix_0.999375-16 lattice_0.17-15 filehash_2.0
>> >> >>>> loaded via a namespace (and not attached):
>> >> >>>> [1] proxy_0.4-1
>> >> >>>> ______________________________________________
>> >> >>>> R-h... at r-project.org mailing list
>> >> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> >>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> >> >>>> and provide commented, minimal, self-contained, reproducible code.