[R] tm package, custom reader

Milan Bouchet-Valat nalimilan at club.fr
Sat Jan 14 15:20:34 CET 2012


On Friday 13 January 2012 at 09:00 -0800, pl.rudy at gmail.com wrote:
> I need help with creating a custom XML reader for use with the tm package.
> The objective is to create a corpus for analysis. The files I'm working
> with come from Solr and are in a funky XML format; nevertheless, I'm able
> to parse the XML files using the solrDocs.R functions provided by Duncan
> Temple Lang.
> 
> The problem is that once I have parsed a document, I need to create a
> custom reader that is compatible with the tm package.
> 
> If someone has built a custom reader for the tm package, or has some ideas
> on how to go about this, I would greatly appreciate the help.

I wrote a custom XML source for tm just a few days ago, so I think
I can help. First, tm has a document explaining how to write an XML
reader [1], and it's relatively easy.

That said, I don't think you should base your tm reader on the functions
in solrDocs.R, since they aren't structured the way tm expects. But you
can probably adapt the code from there.

To sum up how tm extensions work, you need two functions. The first
parses the XML file and returns one XML string for each document in the
corpus: this is the source. The second parses each of these per-document
XML strings and fills the document's body and metadata from the XML
tags: this is the reader. I think your code can be simpler than
solrDocs.R, since you probably know beforehand which tags are useful to
you, which aren't, and what their types are.
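
As a rough illustration of that two-function split, here is a minimal
sketch that uses only the XML package and does not call tm's own API. The
sample Solr response, the field names (id, title, text), and the helper
names split_docs() and read_doc() are all assumptions made up for this
example; how to plug such functions into tm is what the vignette [1]
covers.

```r
library(XML)  # assumed to be installed

# Hypothetical Solr-style response with two documents (invented sample data).
solr_xml <- '<response>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">doc-1</str>
      <str name="title">First title</str>
      <str name="text">Body of the first document.</str>
    </doc>
    <doc>
      <str name="id">doc-2</str>
      <str name="title">Second title</str>
      <str name="text">Body of the second document.</str>
    </doc>
  </result>
</response>'

# Source side: parse the whole file and return one XML string per <doc>.
split_docs <- function(xml_text) {
  tree <- xmlParse(xml_text, asText = TRUE)
  nodes <- getNodeSet(tree, "//doc")
  vapply(nodes, saveXML, character(1))
}

# Reader side: parse one per-document string and pick out the known tags.
read_doc <- function(doc_xml) {
  doc <- xmlParse(doc_xml, asText = TRUE)
  field <- function(name) {
    xpathSApply(doc, sprintf("/doc/str[@name='%s']", name), xmlValue)
  }
  list(id = field("id"), heading = field("title"), content = field("text"))
}

doc_strings <- split_docs(solr_xml)
parsed <- lapply(doc_strings, read_doc)
```

Because the tags of interest are known in advance, read_doc() only needs
one XPath expression per field rather than a generic type-driven parser.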

Feel free to ask for help with any specific issues you run into, but
please provide a short XML example (and, if possible, your code). Also,
when you're done, please consider making your reader available, either
through tm itself or in a new package, if it could be useful to others.


Regards

1: http://cran.r-project.org/web/packages/tm/vignettes/extensions.pdf


