[R] reading formatted txt file into a data frame
Steve Lianoglou
mailinglist.honeypot at gmail.com
Thu May 6 18:14:42 CEST 2010
Hi Tony,
On Thu, May 6, 2010 at 9:58 AM, Tony B <tony.breyal at googlemail.com> wrote:
> Dear all
>
> Let's say I have a plain text file as follows:
>
>> cat(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
> + "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
> + "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]"),
> + sep = "\n", file = "tmp.txt")
>
> I would somehow like to read in this file to R and convert it into a
> data frame like this:
>
>> DF <- data.frame(ID = c("001", "002", "003"),
> + Writer = c("Steven Moffat", "Joss Whedon", "J. Michael Straczynski"),
> + Rating = c("8.9", "8.8", "7.4"),
> + Text = c("Doctor Who", "Buffy", "Babylon [5]"),
> + stringsAsFactors = FALSE)
>
>
> My initial thoughts were to use readLines on the text file and maybe
> do some regular expressions and also use strsplit(..); but having
> confused myself after several attempts I was wondering if there is a
> way, perhaps using maybe read.table instead? My end goal is to
> hopefully convert DF into an XML structure.
I can't think of an easy way to do it with a simple read.table call.
As you suggested, I'd whip this into shape by loading the file into a
character vector with readLines, then using strsplit and regular
expressions.
If your data is so well behaved, why not try splitting your lines by
"]", then do some mincing.
For instance:
## Simulate a readLines on your file
lines <- c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
           "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
           "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]")
## Create an empty data.frame
df <- data.frame(id=character(length(lines)),
                 writer=character(length(lines)),
                 rating=numeric(length(lines)),
                 text=character(length(lines)),
                 stringsAsFactors=FALSE)
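pieces <- strsplit(lines, "]", fixed=TRUE)
## Note: splitting on "]" also eats the closing bracket of "Babylon [5]",
## so the fourth piece of the last line comes out as " Babylon [5"
## Store into their separate pieces for more processing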
ids <- sapply(pieces, '[[', 1)
writers <- sapply(pieces, '[[', 2)
ratings <- sapply(pieces, '[[', 3)
texts <- sapply(pieces, '[[', 4)
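## You can use regexes again, or strsplit judiciously
clean.ids <- sapply(strsplit(ids, ' '), '[', 2)
## The writers keep their surrounding spaces after this split;
## strip them with something like gsub("^ +| +$", "", clean.writers)
clean.writers <- sapply(strsplit(writers, ':', fixed=TRUE), '[', 2)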
...
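
Here is a sketch of one way to finish the job with a single regex
instead of strsplit. It assumes every line follows the
[ID: ...] [Writer: ...] [Rating: ...] layout exactly; a line that
doesn't match would break the sapply step. The pattern and the names
pattern, m, and fields are illustrative only. Anchoring on the three
leading bracketed fields leaves the brackets in "Babylon [5]" alone:

```r
## Same three lines as in the example data
lines <- c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
           "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
           "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]")

## One capture group per field; [^]]* means "anything up to the next ]"
pattern <- "^\\[ID:([^]]*)\\] *\\[Writer:([^]]*)\\] *\\[Rating:([^]]*)\\](.*)$"

## regexec() + regmatches() give, per line, the full match followed by the groups
m <- regmatches(lines, regexec(pattern, lines))

## Drop the full match, trim surrounding whitespace, and stack the
## groups into a character matrix with one row per line
fields <- t(sapply(m, function(x) gsub("^\\s+|\\s+$", "", x[-1])))

## Everything stays character, so "001" keeps its leading zeros
DF <- data.frame(ID = fields[, 1], Writer = fields[, 2],
                 Rating = fields[, 3], Text = fields[, 4],
                 stringsAsFactors = FALSE)
```

If you want Rating as a number, as.numeric(DF$Rating) converts it
afterwards.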
Honestly, if your data isn't all that well behaved, I'd probably do
this in another language like Python to whip it into a "cleaner"
tab-separated file that can easily be read into R. I tend to like
Python's matching behavior with regexes a bit better ...
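
For the R side of that route, a minimal sketch: it assumes the external
script has already written a tab-separated file with a header row. The
temp file and its contents below are made-up stand-ins for that output,
not something from a real script:

```r
## Hypothetical file; a temp file keeps the sketch self-contained
tsv <- tempfile(fileext = ".tsv")

## Stand-in for what an external cleaning script might write
writeLines(c("ID\tWriter\tRating\tText",
             "001\tSteven Moffat\t8.9\tDoctor Who",
             "002\tJoss Whedon\t8.8\tBuffy",
             "003\tJ. Michael Straczynski\t7.4\tBabylon [5]"),
           con = tsv)

## colClasses = "character" keeps "001" from being read as the number 1;
## quote = "" stops stray quote characters in titles from confusing the parser
DF <- read.delim(tsv, colClasses = "character", quote = "")
```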
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact