[R] text analysis errors

Jim Lemon drj|m|emon @end|ng |rom gm@||@com
Thu Jan 7 01:34:21 CET 2021


Hi Gordon,
It looks to me as though you may have to extract the text from the Word
files first, i.e. export them as plain text.
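
Once the files are exported, a quick encoding check may help, because the "invalid multibyte string" error means some string could not be interpreted in the encoding R expected. The element number in that message is the position within the vector passed to nchar() (here the token column), not the position of a file in the folder. Below is a minimal base-R sketch, assuming the exported files end in .txt and are meant to be UTF-8. The helper names find_bad_files and clean_utf8 are made up for illustration and are not part of koRpus.

```r
# Hypothetical helper: list the exported .txt files in `dir` that contain
# at least one line that is not valid UTF-8.
find_bad_files <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  bad <- vapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    any(!validUTF8(lines))
  }, logical(1))
  files[bad]
}

# Hypothetical helper: drop bytes that are invalid in UTF-8.
# This is lossy; re-exporting with the right encoding is preferable
# when the dropped characters matter.
clean_utf8 <- function(x) {
  iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
}
```

Running find_bad_files() on each exported folder should narrow the problem down to specific files, which can then be re-exported or cleaned before being passed to tokenize(). If I remember correctly, tokenize() also accepts a fileEncoding argument, so check its help page if the files turn out to be in some other encoding.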

Jim

On Thu, Jan 7, 2021 at 10:40 AM Gordon Ballingrud
<gob.allingrud using gmail.com> wrote:
>
> Hello all,
>
> I have asked this question on many forums without response. And although
> I've made progress myself, I am stuck as to how to respond to a particular
> error message.
>
> I have a question about text-analysis packages and code. The general idea
> is that I am trying to perform readability analyses on a collection of
> about 4,000 Word files. I would like to run any of a number of such
> analyses, but the problem right now is getting R to recognize the
> uploaded files as data ready for analysis; I keep getting error
> messages. Let me show what I have done so far. I have three separate
> commands because I split the folder of 4,000 files into three roughly
> equal folders; evidently the full set was too large to be read in one
> go. The folders are called ‘WPSCASES’ one through three. Here is my
> code, with the error messages for each command recorded below:
>
> token <- tokenize("/Users/Gordon/Desktop/WPSCASESONE/", lang="en", doc_id="sample")
>
> The code for the other folders is identical apart from the folder
> name.
>
> The error message reads:
>
> Error in nchar(tagged.text[, "token"], type = "width") : invalid multibyte
> string, element 348
>
> The error messages are the same for the other two commands, except that
> the 'element' number differs: it's 925 for the second folder and 4302 for
> the third.
>
> token2 <- tokenize("/Users/Gordon/Desktop/WPSCASES2/", lang="en", doc_id="sample")
>
> token3 <- tokenize("/Users/Gordon/Desktop/WPSCASES3/", lang="en", doc_id="sample")
>
> These are the other commands if that's helpful.
>
> I’ve tried to work out whether the ‘element’ that the error message
> mentions corresponds to the file at that position in the folder. But
> since folder 3 does not have 4,300 files in it, I think that is
> unlikely. Please let me know if you can figure out how to fix this so
> that I can start to use ‘koRpus’ commands, like ‘readability’ and its
> progeny.
>
> Thank you,
> Gordon