[R] Help request: Parsing docx files for key words and appending to a spreadsheet
Dr Eberhard Lisse
no@p@m @end|ng |rom ||@@e@NA
Wed Jan 3 13:26:16 CET 2024
If you do something like this
for i in $(pandoc --list-output-formats); do
    pandoc -f docx -t "$i" -o "test.$i" \
        "Now they want us to charge our electric cars from litter bins.docx"
done
you get approximately 65 formats, from which you can pick one that you can
write a little parser for. The dokuwiki one, for example, uses long lines,
which makes parsing easier.
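
A minimal sketch of such a parser, in Python rather than R, might look like
the following. It assumes the article has been converted to text (for
example with pandoc -t plain) and that the labelled fields come out as
"Label: value" lines, with subjects separated by semicolons and each
followed by "(NN%)". I have not checked this against real Lexis+ output, so
the labels and all function names here are hypothetical. Title, Newspaper
and Date are left out because the thread does not show how they are laid
out.

```python
import csv
import re

# Hypothetical field labels; adjust to whatever the converted text uses.
LABELS = ("Section", "Length", "Byline", "Subject")
THRESHOLD = 50  # minimum subject coverage, in percent


def parse_fields(text):
    """Collect 'Label: value' lines into a dict, keeping only known labels."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() in LABELS:
            fields[label.strip()] = value.strip()
    return fields


def subjects_over_threshold(subject_line, threshold=THRESHOLD):
    """Return the subjects whose '(NN%)' score is at least the threshold."""
    hits = []
    for name, pct in re.findall(r"([^;()]+?)\s*\((\d+)%\)", subject_line):
        if int(pct) >= threshold:
            hits.append(f"{name.strip(' ,')} ({pct}%)")
    return hits


def article_row(text):
    """Build one spreadsheet row; Subject is 'nil' if nothing qualifies."""
    fields = parse_fields(text)
    subjects = subjects_over_threshold(fields.get("Subject", ""))
    row = [fields.get(label, "") for label in LABELS[:-1]]
    row.append("; ".join(subjects) if subjects else "nil")
    return row


def append_row(csv_path, row):
    """Append one row to a CSV file that Calc or Excel can open."""
    with open(csv_path, "a", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerow(row)
```

Calling append_row("articles.csv", article_row(text)) once per converted
file would build the spreadsheet up one article at a time.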
el
On 2023-12-30 13:57 , Andy wrote:
> Good idea, El - thanks.
>
> The link is
> https://docs.google.com/document/d/1QwuaWZk6tYlWQXJ3WLczxC8Cda6zVERk/edit?usp=sharing&ouid=103065135255080058813&rtpof=true&sd=true
>
> This is helpful.
>
> From the article, which is typical of Lexis+ output, I want to
> extract the following fields and append to a Calc/ Excel spreadsheet.
> Given the volume of articles I have to work through, if this can be
> iterative and semi-automatic, that would be a godsend and I might be
> able to do some actual research on the articles before I reach my
> pensionable age. :-)
>
> Title, Newspaper, Date, Section and page number, Length, Byline, Subject
> (only if the threshold of coverage for a specific subject, >=50%, is
> reached (e.g. Greenwashing (51%)); if not, enter 'nil' and move on to
> the next article in the folder)
>
> This is the ambition. I am clearly a long way short of that though.
>
> Many thanks. Andy