Help request: Parsing docx files for key words and appending to a spreadsheet

Fri Dec 29 21:25:06 CET 2023

I would also look at pandoc (https://pandoc.org), which can
convert docx files to a number of other formats...

And for spreadsheets, https://github.com/jqnatividad/qsv is my
go-to tool.  It can also read and write XLSX and other formats.

A sample document or two would always be helpful...


On 29/12/2023 21:01, CALUM POLWART wrote:
> It sounded like he had looked at officer, but I would agree:
> content <- officer::docx_summary(officer::read_docx("filename.docx"))
> would get the text content into an object called content.
> That object is a data.frame, so you can then manipulate it.
> To be more specific, we might need an example of the DF.
>> On Fri, Dec 29, 2023 at 10:14 AM Andy <phaedrusv using gmail.com>
>> wrote:
>>> I'd like to be able to accomplish the following:
>>> (1) Append the title, the month, the author, the number of
>>> words, and page number(s) to a spreadsheet
>>> (2) Read each article and extract keywords. In the docs,
>>> these are listed in the 'Subject' section as a list of
>>> keywords, each with a percentage showing the extent to which
>>> the keyword features in the article, e.g., FAST FASHION (72%).
>>> Append the keyword and the % coverage to the same
>>> row in the spreadsheet.  However, I want to ensure that
>>> the keyword coverage meets the threshold of >= 50%; if
>>> not, then pass on to the next article in the directory.
>>> Rinse and repeat for the entire directory.
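
For the keyword step in (2) and the appending in (1), here is a minimal
base-R sketch, not a tested solution for the real files. It assumes the
'Subject' text has already been pulled out as one string, with any leading
"Subject:" label removed, and that the items are separated by semicolons;
both are guesses without a sample document. The function names
extract_keywords and append_row and the column names are hypothetical,
and a CSV file stands in for the spreadsheet.

```r
# Hypothetical helper: parse "KEYWORD (NN%)" items from a Subject string
# and keep only those at or above the coverage threshold.
# Assumes items are separated by semicolons.
extract_keywords <- function(subject, threshold = 50) {
  # A run of characters other than ; ( ) followed by "(NN%)"
  pattern <- "[^;()]+\\([0-9]+%\\)"
  hits <- regmatches(subject, gregexpr(pattern, subject))[[1]]
  # Keyword is everything before the trailing "(NN%)"
  keyword <- trimws(sub("\\([0-9]+%\\)$", "", hits))
  # Coverage is the number inside the trailing parentheses
  coverage <- as.numeric(sub("^.*\\(([0-9]+)%\\)$", "\\1", hits))
  out <- data.frame(keyword = keyword, coverage = coverage,
                    stringsAsFactors = FALSE)
  out[out$coverage >= threshold, , drop = FALSE]
}

# Hypothetical helper: append a one-row data.frame to a CSV file,
# writing the header only when the file does not exist yet.
append_row <- function(row, csv_path) {
  new_file <- !file.exists(csv_path)
  write.table(row, csv_path, sep = ",", row.names = FALSE,
              col.names = new_file, append = !new_file,
              qmethod = "double")
}

kw <- extract_keywords(
  "FAST FASHION (72%); RETAILERS (45%); SUSTAINABILITY (50%)"
)
print(kw)
```

If extract_keywords() returns zero rows for an article, that article has
no keyword at or above the threshold and the loop over the directory can
simply skip it; otherwise the rows can be combined with the title, month,
author and so on and passed to append_row().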

