CSVSource in tm Package

Armin Goralczyk agoralczyk at gmail.com
Sun Jan 6 16:53:40 CET 2008


I tried to use CSVSource with the TextDocCol function in the tm package, but
ran into two problems:
a) data from several columns is concatenated into one entry, and
b) data in a large text column is broken into several entries.
I had hoped it would be possible to assign columns as metadata to each
entry, with one specific column holding the original text to analyze.

Here is an example from the vignette (the backslashes in the output are
not in the original data):

> cars <- system.file("texts", "cars.csv", package = "tm");
> tdc <- TextDocCol(CSVSource(cars))
Read 5 items
> inspect(tdc)
A text document collection with 5 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:

[1] "1997,\"Ford\",\"Mustang\",\"3000.00\""

[1] "1999,\"Chevy\",\"Venture\",4900.00"

[1] "1996,\"Chrylser\",\"Cherokee\",\"4799.00\""

[1] "2005,\"Ferrari\",\"Modena\",\"80999.00\""

[1] "1973,\"Tank\",\"\",\"9900.00\""
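
For comparison, here is a rough sketch in base R of the kind of result I
was hoping for. It does not use tm at all. The rows are copied from the
output above, the column names are my own assumption (the file does not
appear to have a header row), and treating "model" as the text column is
purely for illustration:

```r
# Rows copied from the inspect() output above, without the escaping backslashes.
csv_lines <- c(
  '1997,"Ford","Mustang","3000.00"',
  '1999,"Chevy","Venture",4900.00',
  '1996,"Chrylser","Cherokee","4799.00"',
  '2005,"Ferrari","Modena","80999.00"',
  '1973,"Tank","","9900.00"'
)

# Parse the lines as CSV so that each column becomes its own variable.
# The column names are assumptions made for this example only.
con <- textConnection(csv_lines)
cars_df <- read.csv(con, header = FALSE,
                    col.names = c("year", "make", "model", "price"),
                    stringsAsFactors = FALSE)
close(con)

# Pick one column as the text to analyze (hypothetical choice) and
# keep the remaining columns as per-document metadata.
text_col <- "model"
docs <- cars_df[[text_col]]
meta <- cars_df[, setdiff(names(cars_df), text_col), drop = FALSE]

# One text entry per CSV row, each with its own row of metadata.
stopifnot(length(docs) == nrow(meta))
print(docs)
print(meta)
```

That is, one document per CSV row, with the other columns attached as
metadata instead of being pasted into the text.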

Also, I have a question about the best workflow for text mining/analysis:

My original data is in a MySQL table. Is it possible to import the
data directly into TextDocCol without creating an intermediate CSV
file?

I am using

> R.Version()
$platform
[1] "powerpc-apple-darwin8.10.1"

$arch
[1] "powerpc"

$os
[1] "darwin8.10.1"

$system
[1] "powerpc, darwin8.10.1"

$status
[1] ""

$major
[1] "2"

$minor
[1] "6.1"

$year
[1] "2007"

$month
[1] "11"

$day
[1] "26"

$`svn rev`
[1] "43537"

$language
[1] "R"

$version.string
[1] "R version 2.6.1 (2007-11-26)"

Armin Goralczyk, M.D.
Universitätsmedizin Göttingen
Abteilung Allgemein- und Viszeralchirurgie
Rudolf-Koch-Str. 40
39099 Göttingen
Dept. of General Surgery
University of Göttingen
Göttingen, Germany

