[R] reading formatted txt file into a data frame
Tony B
tony.breyal at googlemail.com
Fri May 7 14:19:03 CEST 2010
Thank you all for your help, this has solved my problem. My main
problem with using gsubfn was that I was getting confused by the
square brackets in
[^]]+[^] ]
but I now have a much better understanding of what this means.
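In case it helps anyone else reading the archive, here is a small sketch of how I now read that pattern. It uses only sub() from base R rather than strapply, the sample line is copied from the original question, and the names loose and tight are just made up for illustration:

```r
x <- "[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who"

# [^]]  matches any single character except "]"
#       (a "]" placed directly after "^" is taken literally)
# [^] ] matches any single character except "]" or a space

# [^]]+ on its own also picks up the space before the closing bracket
loose <- sub(".*Writer: ([^]]+).*", "\\1", x)

# adding [^] ] forces the captured text to end on a non-space character
tight <- sub(".*Writer: ([^]]+[^] ]).*", "\\1", x)

print(nchar(loose))
print(nchar(tight))
```

So the second bracket expression is only there to trim the trailing space from the captured name.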
Cheers!
Tony Breyal
On 6 May, 19:38, Gabor Grothendieck <ggrothendi... at gmail.com> wrote:
> This is very similar to the solution in Jim's post
> except the regular expressions can be made
> slightly simpler due to the use of strapply and a
> few of the regular expressions have been made a
> bit different even apart from that. It's not
> always clear what the general case is based on an example
> so the regular expressions may need to be tweaked
> once the full data is available but this does work
> on the sample shown.
>
> Here:
>
> \\d+ means one or more digits
> [^]]+[^] ] means one or more non-] characters followed by a
> final character which is neither ] nor space
> \\S+ means one or more non-space characters
> \\S+ . (.*) means one or more non-space characters followed by
> space followed by any character followed by space followed by any
> sequence of characters
>
> In each case the portion of the regular expression
> in parentheses is captured and returned by
> strapply.
>
> library(gsubfn)
>
> # input is input data as in Jim's post
> data.frame(ID = strapply(input, "ID: (\\d+)", c, simplify = TRUE),
> Writer = strapply(input, "Writer: ([^]]+[^] ])", c, simplify = TRUE),
> Rating = strapply(input, "Rating: (\\S+)", c, simplify = TRUE),
> Text = strapply(input, "Rating: \\S+ . (.*)", c, simplify = TRUE),
> stringsAsFactors = FALSE)
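
For comparison, a sketch of the same extraction using only sub() from base R. This is not Gabor's code: the patterns are borrowed from his post and wrapped in ".*" so that sub() keeps just the captured part, and input is typed in directly (using the lines from the original question) rather than read from tmp.txt:

```r
input <- c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
           "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
           "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]")

# Each sub() call replaces the whole line with the parenthesised group "\\1"
DF <- data.frame(ID     = sub(".*ID: (\\d+).*", "\\1", input),
                 Writer = sub(".*Writer: ([^]]+[^] ]).*", "\\1", input),
                 Rating = sub(".*Rating: (\\S+).*", "\\1", input),
                 Text   = sub(".*Rating: \\S+ . (.*)", "\\1", input),
                 stringsAsFactors = FALSE)
print(DF)
```

Because the Text pattern anchors on "Rating:" rather than on the last "]", the trailing "[5]" in the third line should survive.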
>
> On Thu, May 6, 2010 at 12:24 PM, jim holtman <jholt... at gmail.com> wrote:
> > Try this:
>
> >> cat(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
> > + "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
> > + "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon"),
> > + sep = "\n", file = "tmp.txt")
>
> >> # read in the data and parse it assuming it has the same structure
> >> input <- readLines('tmp.txt')
> >> # parse it item by item
> >> x.id <- sub(".*\\[ID: ([[:digit:]]+).*", "\\1", input)
> >> x.writer <- sub(".*\\[Writer:([^]]+).*", '\\1', input)
> >> x.rating <- sub(".*\\[Rating: ([0-9.]+).*", '\\1', input)
> >> x.prog <- sub(".*\\](.*)", '\\1', input)
> >> #create dataframe
> >> data.frame(id=x.id, writer=x.writer, rating=x.rating, prog=x.prog)
> >    id                 writer rating       prog
> > 1 001          Steven Moffat    8.9 Doctor Who
> > 2 002            Joss Whedon    8.8      Buffy
> > 3 003 J. Michael Straczynski    7.4    Babylon
>
> > On Thu, May 6, 2010 at 9:58 AM, Tony B <tony.bre... at googlemail.com> wrote:
>
> >> Dear all
>
> >> Let's say I have a plain text file as follows:
>
> >> > cat(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
> >> + "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
> >> + "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]"),
> >> + sep = "\n", file = "tmp.txt")
>
> >> I would somehow like to read in this file to R and convert it into a
> >> data frame like this:
>
> >> > DF <- data.frame(ID = c("001", "002", "003"),
> >> + Writer = c("Steven Moffat", "Joss Whedon", "J. Michael Straczynski"),
> >> + Rating = c("8.9", "8.8", "7.4"),
> >> + Text = c("Doctor Who", "Buffy", "Babylon [5]"),
> >> + stringsAsFactors = FALSE)
>
> >> My initial thoughts were to use readLines on the text file, then
> >> apply some regular expressions and strsplit(..); but having
> >> confused myself after several attempts, I was wondering if there is a
> >> simpler way, perhaps using read.table instead? My end goal is to
> >> convert DF into an XML structure.
>
> >> Thank you kindly in advance for your time,
> >> Tony Breyal
>
> >> # Windows Vista
> >> > sessionInfo()
> >> R version 2.11.0 (2010-04-22)
> >> i386-pc-mingw32
>
> >> locale:
> >> [1] LC_COLLATE=English_United Kingdom.1252
> >>     LC_CTYPE=English_United Kingdom.1252
> >>     LC_MONETARY=English_United Kingdom.1252
> >>     LC_NUMERIC=C
> >>     LC_TIME=English_United Kingdom.1252
>
> >> attached base packages:
> >> [1] stats graphics grDevices utils datasets methods base
>
> >> other attached packages:
> >> [1] XML_2.8-1
>
> >> loaded via a namespace (and not attached):
> >> [1] tools_2.11.0
>
> >> ______________________________________________
> >> R-h... at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
>
> > --
> > Jim Holtman
> > Cincinnati, OH
> > +1 513 646 9390
>
> > What is the problem that you are trying to solve?
>