[R] reading formatted txt file into a data frame

Thu May 6 18:14:42 CEST 2010

Hi Tony,

On Thu, May 6, 2010 at 9:58 AM, Tony B <tony.breyal at googlemail.com> wrote:
> Dear all
>
> Let's say I have a plain text file as follows:
>
>> cat(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
> +       "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
> +       "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]"),
> +       sep = "\n", file = "tmp.txt")
>
> I would somehow like to read this file into R and convert it into a
> data frame like this:
>
>> DF <- data.frame(ID = c("001", "002", "003"),
> +                 Writer = c("Steven Moffat", "Joss Whedon", "J. Michael Straczynski"),
> +                 Rating = c("8.9", "8.8", "7.4"),
> +                 Text = c("Doctor Who", "Buffy", "Babylon [5]"),
> +                 stringsAsFactors = FALSE)
>
>
> My initial thoughts were to use readLines on the text file and maybe
> do some regular expressions and also use strsplit(..); but having
> confused myself after several attempts I was wondering if there is a
> way, perhaps using maybe read.table instead?  My end goal is to
> hopefully convert DF into an XML structure.

I can't think of an easy way to do it with a simple read.table call.

As you suggested, I'd whip this into shape by loading the file into a
character vector with readLines, then using strsplit and regular
expressions.

If your data really is this well behaved, why not try splitting your
lines on "]" and then cleaning up the pieces?

For instance:
## Simulate a readLines on your file
lines <- c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
           "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
           "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]")

## Create an empty data.frame to hold the cleaned-up fields
df <- data.frame(id = character(length(lines)),
                 writer = character(length(lines)),
                 rating = numeric(length(lines)),
                 text = character(length(lines)),
                 stringsAsFactors = FALSE)

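## Split on every "]". Note this also splits on a "]" inside the text,
## so "Babylon [5]" comes back as " Babylon [5"
pieces <- strsplit(lines, "]", fixed=TRUE)

## Store into their separate pieces for more processing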
ids <- sapply(pieces, '[[', 1)
writers <- sapply(pieces, '[[', 2)
ratings <- sapply(pieces, '[[', 3)
texts <- sapply(pieces, '[[', 4)

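## You can use regexes again, or strsplit judiciously
clean.ids <- sapply(strsplit(ids, ' '), '[', 2)
## The writer keeps its surrounding spaces after the split, so trim them
clean.writers <- sapply(strsplit(writers, ':', fixed=TRUE), '[', 2)
clean.writers <- gsub("^ +| +$", "", clean.writers)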
...
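
One way the remaining cleanup might be finished is to skip strsplit and
capture all four fields with a single regular expression. This is only
a sketch, and it assumes every line follows the
[ID: ...] [Writer: ...] [Rating: ...] text layout exactly; the names
"pattern" and "grab" are made up for illustration. Unlike splitting on
"]", it leaves brackets in the trailing text (as in "Babylon [5]") alone.

```r
## Same sample lines as above
lines <- c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
           "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
           "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]")

## One capture group per field; the lazy (.*?) plus \\s* drops the
## padding spaces inside the brackets (assumed layout, see above)
pattern <- paste0("^\\[ID:\\s*(.*?)\\s*\\]\\s*",
                  "\\[Writer:\\s*(.*?)\\s*\\]\\s*",
                  "\\[Rating:\\s*(.*?)\\s*\\]\\s*",
                  "(.*)$")

## sub() returns its input unchanged when nothing matches, so check first
stopifnot(all(grepl(pattern, lines, perl = TRUE)))

## Replace each whole line by the contents of capture group n
grab <- function(n) sub(pattern, paste0("\\", n), lines, perl = TRUE)

DF <- data.frame(ID = grab(1), Writer = grab(2),
                 Rating = grab(3), Text = grab(4),
                 stringsAsFactors = FALSE)
```

If any line deviates from the layout, the stopifnot check fails rather
than letting an unparsed line slip into the data frame.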

Honestly, if your data isn't all that well behaved, I'd probably do
this in another language like Python to whip it into a "cleaner"
tab-separated file that can easily be read into R. I tend to like
Python's regex matching behavior a bit better ...
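
On the R side, reading such a cleaned file might look like the sketch
below. It assumes a tab-separated file with a header row; the file
here is a temporary one written only for illustration, and the column
names are made up. colClasses = "character" keeps an ID like "001"
from being turned into the number 1.

```r
## Stand-in for the cleaned file another tool would produce (hypothetical)
tsv <- tempfile(fileext = ".tsv")
writeLines(c("ID\tWriter\tRating\tText",
             "001\tSteven Moffat\t8.9\tDoctor Who",
             "002\tJoss Whedon\t8.8\tBuffy",
             "003\tJ. Michael Straczynski\t7.4\tBabylon [5]"),
           tsv)

## Read every column as text; quote = "" treats quote characters literally
DF <- read.delim(tsv, colClasses = "character", quote = "")

## Convert only the columns that really are numeric
DF$Rating <- as.numeric(DF$Rating)
```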

-steve
-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact