[R] tm: custom reader for readPlain
Simon Kiss
sjkiss at gmail.com
Tue Jan 8 23:03:12 CET 2013
Hmm... Thanks a lot! That seems like really useful stuff. It might be a bit over my head, but I'll look into it.
The articles are all contained in one text file, but they are clearly delimited, either by a series of -------- or by a line matching the regular expression ^Document.[0-9].
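
For what it's worth, here is a rough base R sketch of splitting such a file into one chunk per article. The function name split_articles and the sample lines are made up for illustration, and the header pattern is a stricter variant of ^Document.[0-9] that assumes every article starts with a "Document N of M" line, as in the excerpt quoted below. Separator lines made of dashes or underscores are dropped.

```r
# Hypothetical helper: split the lines of an export file into one
# character vector per article. Uses base R only, no tm functions.
split_articles <- function(lines) {
  # Assumption: each article starts with a line like "Document 1 of 40".
  is_header <- grepl("^Document [0-9]+ of [0-9]+", lines)
  # cumsum() labels every line with the index of the most recent header;
  # lines before the first header get 0 and are discarded.
  article_id <- cumsum(is_header)
  # Separator rules: five or more dashes or underscores on their own line.
  is_rule <- grepl("^[-_]{5,}[[:space:]]*$", lines)
  keep <- article_id > 0 & !is_rule
  unname(split(lines[keep], article_id[keep]))
}

# Made-up input; in practice this would come from readLines() on the file.
example_lines <- c(
  "____________________________________________________________",
  "",
  "Document 1 of 2",
  "First Nation agrees not to block trains",
  "Author: SHAWN BERRY Legislature Bureau",
  "",
  "____________________________________________________________",
  "",
  "Document 2 of 2",
  "A second, made-up headline",
  "Author: A. N. Other"
)

articles <- split_articles(example_lines)
length(articles)   # one list element per article
articles[[2]]      # the lines belonging to the second article
```

Each element of the resulting list could then be handed to a reader function that builds one document per article.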
Simon
On 2013-01-08, at 4:44 PM, Milan Bouchet-Valat wrote:
> Le mardi 08 janvier 2013 à 15:56 -0500, Simon Kiss a écrit :
>> Hello:
>> I have a series of newspaper articles from a Canadian newspaper
>> database (Canadian Newsstand) that look just like below.
>>
>> I've read through this vignette
>> (http://cran.r-project.org/web/packages/tm/vignettes/extensions.pdf)
>> about creating a custom reader to extract meta-data, but I can't
>> understand how to apply this in the context of a text document, rather
>> than in the tabular format as in the vignette. You can see there's
>> all kinds of valuable information in each document -Author, page
>> number, publication year, section, publication title....
>> Can anyone provide some suggestions to someone unfamiliar with the tm
>> package as to how to go about creating a custom reader for this
>> situation?
> You should create a reader function that takes as input the text
> content you pasted at the end of your message, parses it as
> appropriate, and returns a PlainTextDocument. The metadata can be set
> using the meta() function on the document object before returning it.
> You can see how this process works by looking at the readFactivaHTML.R
> file from my tm.plugin.factiva package, and probably from other packages
> too (do not use readFactivaXML.R as it uses a method that only works for
> XML input). Of course, parsing the input will take some work, but it
> shouldn't be too hard if you split each line into a field identifier
> (the part before ":") and the value of the field, and create a character
> vector from that.
>
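
A minimal sketch of the field-splitting idea described above, in base R only. The name parse_fields is hypothetical, and the pattern assumes that a field line has the form "Field name: value" with only letters and spaces before the colon. Continuation lines of a multi-line field (for example a long Full Text) are not handled, and a body line that happens to look like "Some words: text" would be mistaken for a field, so treat this as a starting point rather than a finished parser.

```r
# Hypothetical helper: turn the "Field name: value" lines of one article
# into a named character vector. Base R only.
parse_fields <- function(article_lines) {
  # Assumption: the field identifier contains only letters and spaces
  # and is followed by a colon and a space. Bare URLs ("http://...")
  # do not match because their colon is not followed by a space.
  pattern <- "^([A-Za-z][A-Za-z ]*): (.*)$"
  field_lines <- grep(pattern, article_lines, value = TRUE)
  # The first capture group is the identifier, the second is the value.
  values <- sub(pattern, "\\2", field_lines)
  names(values) <- sub(pattern, "\\1", field_lines)
  values
}

# A few lines taken from the sample article in this thread.
article <- c(
  "Document 1 of 40",
  "First Nation agrees not to block trains",
  "Author: SHAWN BERRY Legislature Bureau",
  "Publication info: Daily Gleaner [Fredericton, N.B] 07 Jan 2013: A.3.",
  "http://remote.libproxy.wlu.ca/login?url=http://search.proquest.com/docview/1266701269?accountid=15090",
  "Publication year: 2013",
  "Section: Main"
)

fields <- parse_fields(article)
names(fields)
fields[["Author"]]
```

Inside a reader function, the entries of such a vector are what would be attached to the document with meta(); the exact call depends on the installed tm version, so it is left out here.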
> One piece of information you did not give us is how the articles you
> need to import are distributed. If each one is in a separate file, you
> can adapt DirSource() from tm so that it calls your reader function on
> each file. If they are all in one file, you need to create a custom
> source that will read the file, split it and call the reader function
> on the part corresponding to each article; this latter way is
> illustrated by the HTML part of the FactivaSource.R file (again, skip
> the XML part).
>
> Finally, maybe you can extract the articles in a different format,
> ideally in XML, which is easier to use? Or maybe this newspaper is
> available on Factiva, in which case my package will work for you?
>
>
> Hope this helps
>
>
>> Yours truly,
>> Simon Kiss
>>
>> ____________________________________________________________
>>
>> Document 1 of 40
>> First Nation agrees not to block trains
>> Author: SHAWN BERRY Legislature Bureau
>> Publication info: Daily Gleaner [Fredericton, N.B] 07 Jan 2013: A.3.
>> http://remote.libproxy.wlu.ca/login?url=http://search.proquest.com/docview/1266701269?accountid=15090
>> Abstract: Participants are also concerned about Chief Theresa Spence who stopped eating solid food on Dec. 11 in a bid to secure a meeting between First Nations leaders, Prime Minister Stephen Harper and Gov. Gen. David Johnston to discuss the treaty relationship.
>> Links: null
>> Full Text: A bunch of text about a story here
>> Subject: Railroads; Native North Americans; Meetings; Injunctions
>> Title: First Nation agrees not to block trains
>> Publication title: Daily Gleaner
>> First page: A.3
>> Publication year: 2013
>> Publication date: Jan 7, 2013
>> Year: 2013
>> Section: Main
>> Publisher: Infomart, a division of Postmedia Network Inc.
>> Place of publication: Fredericton, N.B.
>> Country of publication: Canada
>> Journal subject: GENERAL INTEREST PERIODICALS--UNITED STATES
>> ISSN: 08216983
>> Source type: Newspapers
>> Language of publication: English
>> Document type: News
>> ProQuest document ID: 1266701269
>> Document URL: http://remote.libproxy.wlu.ca/login?url=http://search.proquest.com/docview/1266701269?accountid=15090
>> Copyright: (Copyright (c) 2013 The Daily Gleaner (Fredericton))
>> Last updated: 2013-01-07
>> Database: Canadian Newsstand Complete
>>
>>
>> *********************************
>> Simon J. Kiss, PhD
>> Assistant Professor, Wilfrid Laurier University
>> 73 George Street
>> Brantford, Ontario, Canada
>> N3T 2C9
>> Cell: +1 905 746 7606
>>
>> Please avoid sending me Word, PowerPoint or Excel attachments. Sending these documents puts pressure on many people to use Microsoft software and helps to deny them any other choice. In effect, you become a buttress of the Microsoft monopoly.
>>
>> To convert to plain text choose Text Only or Text Document as the Save As Type. Your computer may also have a program to convert to PDF format. Select File, then Print. Scroll through available printers and select the PDF converter. Click on the Print button and enter a name for the PDF file when requested.
>>