[R] Error in Corpus() in tm package

Sun Aug 18 19:18:20 CEST 2013

On Sunday, 18 August 2013 at 09:19 -0700, Ajinkya Kale wrote:
> I did exactly what you mentioned: I tried subsets of these documents
> and found out there were some junk non-txt files which were causing
> this issue. Everything worked fine with DirSource() once I deleted them
> from the directory.
> But I feel these functions should also tell which file they are failing
> at. I have ended up debugging with subsets of input one too many
> times.
Good. Could you send us (or maybe privately to me) at least an excerpt
of the file that is enough to reproduce the bug? Indeed it would be nice
to get a more explicit error message from tm if possible.
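
Until tm reports the offending file itself, a small wrapper can do it.
The sketch below is only an illustration: find_failing_files() and
build_one_doc_corpus() are hypothetical helper names, not part of tm, and
it assumes that building a corpus from a temporary directory holding a
single copied file triggers the same error as the full directory does.
VCorpus() is used in place of the Corpus() call from the original post.

```r
# Hypothetical helper: call build() on each file separately and collect
# the error message of every file for which it fails.
find_failing_files <- function(files, build) {
  failed <- character(0)
  for (f in files) {
    msg <- tryCatch({
      build(f)
      NULL                      # no error for this file
    }, error = function(e) conditionMessage(e))
    if (!is.null(msg)) failed[f] <- msg
  }
  failed                        # named vector: file path -> error message
}

# Hypothetical builder: copy one file into an empty temporary directory and
# build a corpus from that directory alone (requires the tm package).
build_one_doc_corpus <- function(f) {
  tmp <- tempfile("one_doc_")
  dir.create(tmp)
  on.exit(unlink(tmp, recursive = TRUE))
  file.copy(f, tmp)
  tm::VCorpus(tm::DirSource(tmp), readerControl = list(language = "lat"))
}

# Usage, with sourceDir as in the original post:
# files <- list.files(sourceDir, full.names = TRUE)
# find_failing_files(files, build_one_doc_corpus)
```

The result is a named character vector, so the names give the files to
inspect and the values give the error each one raised.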

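If the junk files can be told apart by their extension, they do not have
to be deleted: DirSource() accepts a pattern argument. A minimal sketch,
assuming every wanted document ends in ".txt"; split_by_extension() and
build_txt_corpus() are hypothetical names introduced for illustration.

```r
# Assumption: the converted documents all end in ".txt".
txt_pattern <- "\\.txt$"

# Base R preview of which entries the pattern would keep and which it
# would skip, so the filter can be checked before building the corpus.
split_by_extension <- function(dir, pattern = txt_pattern) {
  files <- list.files(dir)
  keep <- grepl(pattern, files, ignore.case = TRUE)
  list(kept = files[keep], skipped = files[!keep])
}

# Build the corpus from the matching files only (requires the tm package).
build_txt_corpus <- function(dir) {
  tm::VCorpus(
    tm::DirSource(dir, pattern = txt_pattern, ignore.case = TRUE),
    readerControl = list(language = "lat")
  )
}

# Usage, with sourceDir as in the original post:
# split_by_extension(sourceDir)$skipped
# ovid <- build_txt_corpus(sourceDir)
```

This only helps when the extension is a reliable signal; a binary file
that happens to be named .txt would still get through.
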

Regards

> 
> On Aug 18, 2013 9:01 AM, "Milan Bouchet-Valat" <nalimilan at club.fr>
> wrote:
>         On Saturday, 17 August 2013 at 11:16 -0700, Ajinkya Kale wrote:
>         > It contains all text files which were converted from doc, docx, ppt
>         > etc. using LibreOffice.
>         > Some of them are non-English text documents.
>         >
>         >
>         > Sorry, I cannot share the corpus, but if someone can shed light on
>         > what might cause this error then I can try to eliminate those
>         > documents if some specific docs are causing it.
>         I think you should go the other way round: try with only one document
>         and see if it works, and do enough attempts to find out in what cases it
>         works and in what cases it fails. If it always fails, try with examples
>         provided by tm, and then with parts of your documents.
>
>         I don't think it makes sense to try to use VectorSource() as it would
>         imply reimplementing DirSource().
>         
>         
>         Regards
>         
>         > On Sat, Aug 17, 2013 at 9:55 AM, Milan Bouchet-Valat
>         > <nalimilan at club.fr> wrote:
>         >         On Friday, 16 August 2013 at 19:35 -0700, Ajinkya Kale wrote:
>         >         > I am trying to use the text mining package ... I keep
>         >         > getting this error :
>         >         >
>         >         > rm(list=ls())
>         >         > library(tm)
>         >         > sourceDir <- "Z:\\projectk_viz\\docs_to_index"
>         >         > ovid <- Corpus(DirSource(sourceDir),readerControl = list(language = "lat"))
>         >         >
>         >         > Error in if (vectorized && (length <= 0)) stop("vectorized sources must
>         >         > have positive length") : missing value where TRUE/FALSE needed
>         >         >
>         >         > I am not sure what it means.
>         >
>         >         The posting guide asks for a reproducible example. If you
>         >         cannot make available to us the contents of sourceDir, at
>         >         least you should tell us what kind of files it contains.
>         >         Have you tried with only some of the files the directory
>         >         contains?
>         >
>         >
>         >         Regards
>         >
>         >         > --ajinkya
>         >         >
>         >
>         > --
>         >
>         > Sincerely,
>         > Ajinkya
>         > http://ajinkya.info
>         >
>