[R] Help request: Parsing docx files for key words and appending to a spreadsheet

Roy Mendelssohn - NOAA Federal
Fri Dec 29 19:25:16 CET 2023


Hi Andy:

I don’t have an answer, but I do have what I hope is some friendly advice. Generally, the more information you provide, the more likely you are to get useful help. You say that you tried several packages and that they didn’t do what you wanted. Posting that code, along with a specific description of how it fell short of what you wanted, would greatly facilitate things.

Happy new year,

-Roy


> On Dec 29, 2023, at 10:14 AM, Andy <phaedrusv using gmail.com> wrote:
> 
> Hello
> 
> I am trying to work through a problem, but feel like I've gone down a rabbit hole. I'd very much appreciate any help.
> 
> The task: I have several directories, some containing 2,500+ *.docx files (newspaper articles downloaded from Lexis+). I want to iterate through them and append to a spreadsheet only those articles that satisfy a condition (i.e., a specific keyword is present with >= 50% coverage of the subject matter). Lexis+ documents have a very specific structure, and keywords are given in the row "Subject".
> 
> I'd like to be able to accomplish the following:
> 
> (1) Append the title, the month, the author, the number of words, and page number(s) to a spreadsheet
> 
> (2) Read each article and extract its keywords. In the docs, these are listed in the 'Subject' section as a list of keywords, each with a percentage showing the extent to which the keyword features in the article (e.g., FAST FASHION (72%)). Append the keyword and the % coverage to the same row in the spreadsheet. However, I want to ensure that the keyword coverage meets the threshold of >= 50%; if not, then move on to the next article in the directory. Rinse and repeat for the entire directory.
> 
> So far, I've tried working through some Stack Overflow-based solutions, but most seem to use the textreadr package, which is now deprecated; others use either the officer or the officedown packages. However, these packages don't appear to do what I want the program to do, at least not in any of the examples I have found, nor in the vignettes and relevant package manuals I've looked at.
> 
> My first question: is what I intend to do even possible using R? If it is, where do I start? If these docx files were converted to UTF-8 plain text, would that make the task easier?
> 
> I am not a confident coder, and am really only just getting my head around R, so I appreciate there is a steep learning curve ahead. Of course, I don't know what I don't know, so any pointers in the right direction would be a big help.
> 
> Many thanks in anticipation
> 
> Andy
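
As a starting point for step (2) of the quoted message, here is a minimal sketch in base R of the keyword filter. It is not a tested solution for real Lexis+ files. It assumes the "Subject" text has already been extracted as a single string and that entries look like "KEYWORD (NN%)" separated by semicolons; the separator, the function names parse_subject() and meets_threshold(), and the sample string are all assumptions made for illustration.

```r
# Hypothetical helper: split a "Subject" string into keyword/coverage pairs.
# Assumes entries of the form "KEYWORD (NN%)" separated by ";".
parse_subject <- function(subject) {
  pattern <- "[^;()]+\\([0-9]+%\\)"
  m <- regmatches(subject, gregexpr(pattern, subject))[[1]]
  if (length(m) == 0) {
    return(data.frame(keyword = character(0), coverage = integer(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(
    # Drop the trailing "(NN%)" and surrounding whitespace to get the keyword.
    keyword = trimws(sub("\\([0-9]+%\\)$", "", m)),
    # Keep only the digits inside the parentheses.
    coverage = as.integer(sub("^.*\\(([0-9]+)%\\)$", "\\1", m)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: TRUE if `keyword` is listed with coverage >= threshold.
# Matching is case-insensitive; the default threshold of 50 follows the
# ">= 50%" condition in the quoted message.
meets_threshold <- function(subject, keyword, threshold = 50) {
  kw <- parse_subject(subject)
  any(toupper(kw$keyword) == toupper(keyword) & kw$coverage >= threshold)
}

# Made-up example string, not taken from a real article.
subject <- "FAST FASHION (72%); RETAILERS (55%); TEXTILES (30%)"
print(parse_subject(subject))
print(meets_threshold(subject, "fast fashion"))
print(meets_threshold(subject, "textiles"))
```

Getting the "Subject" string out of each .docx file is a separate step. One possible route is officer::read_docx() followed by officer::docx_summary(), which returns the document's paragraphs as a data frame with a text column; how the "Subject" row appears in that data frame depends on the Lexis+ layout and would need to be checked against a real file.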
