[R] extracting data from a list of unformatted text files

Thu Nov 20 14:35:10 CET 2008

Here is a way to process one file.  You will still have to add the loop
over the files, error checking, the piecing together of results from
multiple files, and a way to determine the end of the data:

> x <- "I give below a sample of the kind of the information in the text file :
+ ########
+ #(a lot of preceding text)
+ 2008-10-01      06:30:12                2 of 3
+ page
+
+ #(some lines of text - varies from file to file)
+ sekvens    890
+ # lines of text
+ sNo     start            stop            direction        value
+ 1        70                85                up                60.2
+ 3        60                90                down            71.5
+ #########
+
+ In each of the files that I choose, I want to first go to the appropriate page number. This is the first line in the above text and the page number is 2 (from 2 of 3). The date and time preceding the page number vary from file to file, but the next line always has the word, page.
+ After that, I am interested in the number following the word, sekvens. Also, the table underneath."
> input <- readLines(textConnection(x))
> closeAllConnections()
> # find 'page'
> pageNo <- grep("^page", input)
> # back up one line and look for "2 of"
> page2 <- grep("2 of ", input[pageNo - 1])
> # compute the start of the data and delete preceding data
> startData <- pageNo[page2]
> input <- tail(input, -startData)
> # find 'sekvens'
> sek.indx <- grep("^sekvens", input)
> # extract number after
> sek.value <- sub(".*?(\\d+).*", "\\1", input[sek.indx], perl=TRUE)
> # find start of table
> sNo.indx <- grep("sNo", input)
> # read the data (you did not say how to determine the end, so I will read the three lines)
> values <- read.table(textConnection(input[sNo.indx + (0:2)]), header=TRUE)
> closeAllConnections()
> sek.value
[1] "890"
> values
  sNo start stop direction value
1   1    70   85        up  60.2
2   3    60   90      down  71.5
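
One possible way to determine the end of the table, offered as a rough
sketch and not a definitive answer: keep the lines after the "sNo" header
for as long as they begin with a number. The function name
read_table_block and the rule that every data row starts with a numeric
sNo are my assumptions; the question does not say how the table ends.

```r
# Read a table that starts at the header line 'header.idx' and runs until
# the first following line that does not begin with a number.
# Assumption (not stated in the question): every data row starts with sNo.
read_table_block <- function(lines, header.idx) {
    body <- lines[-seq_len(header.idx)]        # everything after the header
    is.row <- grepl("^\\s*[0-9]+\\s", body)    # TRUE for lines that look like rows
    stop.at <- which(!is.row)[1]               # first non-row line, NA if none
    n.rows <- if (is.na(stop.at)) length(body) else stop.at - 1
    read.table(text = lines[header.idx + 0:n.rows], header = TRUE,
               stringsAsFactors = FALSE)
}

# Small example built from the sample in the question
input <- c("sekvens    890",
           "# lines of text",
           "sNo     start            stop            direction        value",
           "1        70                85                up                60.2",
           "3        60                90                down            71.5",
           "#########",
           "trailing text")
values <- read_table_block(input, grep("^sNo", input))
```

If the tables end in some other way (a blank line, a fixed marker), only
the pattern passed to grepl() should need to change.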

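For the loop, the error checking, and piecing the files together,
something along these lines might serve as a starting point. The names
parse_one and collect_files are hypothetical. The file-name pattern
assumes that the files of interest are named with digits followed by
.txt, and the table is read as the header plus two rows exactly as in the
sample, so that part will need adapting to the real data.

```r
# The steps from the session above, wrapped up for the lines of one file.
# Assumption: the table is the header plus two rows, as in the sample.
parse_one <- function(lines, page = 2) {
    pageNo <- grep("^page", lines)
    pageNo <- pageNo[pageNo > 1]               # need a line before 'page'
    hit <- grep(paste0("(^|\\s)", page, " of "), lines[pageNo - 1])
    if (length(hit) != 1) stop("page ", page, " not found exactly once")
    lines <- tail(lines, -pageNo[hit])         # drop everything up to 'page'
    sek <- sub(".*?(\\d+).*", "\\1", lines[grep("^sekvens", lines)[1]],
               perl = TRUE)
    sNo.indx <- grep("^sNo", lines)[1]
    if (is.na(sNo.indx)) stop("no table header found")
    tab <- read.table(text = lines[sNo.indx + (0:2)], header = TRUE,
                      stringsAsFactors = FALSE)
    data.frame(sekvens = as.integer(sek), tab, stringsAsFactors = FALSE)
}

# Loop over the matching files; a file that fails is reported and skipped.
collect_files <- function(dir, pattern = "^[0-9]+\\.txt$", page = 2) {
    files <- list.files(dir, pattern = pattern, full.names = TRUE)
    pieces <- lapply(files, function(f) {
        tryCatch({
            fileno <- sub("\\.txt$", "", basename(f))
            data.frame(fileno = fileno, parse_one(readLines(f), page),
                       stringsAsFactors = FALSE)
        }, error = function(e) {
            message("skipping ", basename(f), ": ", conditionMessage(e))
            NULL
        })
    })
    do.call(rbind, Filter(Negate(is.null), pieces))
}

# Try it on two temporary files made from the sample in the question
sample.lines <- c("#(a lot of preceding text)",
                  "2008-10-01      06:30:12                2 of 3",
                  "page",
                  "",
                  "sekvens    890",
                  "# lines of text",
                  "sNo     start            stop            direction        value",
                  "1        70                85                up                60.2",
                  "3        60                90                down            71.5",
                  "#########")
demo.dir <- file.path(tempdir(), "sekvens_demo")
dir.create(demo.dir, showWarnings = FALSE)
writeLines(sample.lines, file.path(demo.dir, "82534.txt"))
writeLines(sub("890", "891", sample.lines), file.path(demo.dir, "82555.txt"))
writeLines("not a data file", file.path(demo.dir, "notes.txt"))  # ignored by the pattern
result <- collect_files(demo.dir)
```

The result has one row per table row, with fileno and sekvens repeated
for each file, which is the layout asked for in the question.
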
On Thu, Nov 20, 2008 at 5:18 AM, ravi <rv15i at yahoo.se> wrote:
> Hi,
> I want to extract information from a number of text files in a folder. The files are named as : 82534.txt, 82555.txt, 8282787.txt etc.
>
> I give below a sample of the kind of the information in the text file :
> ########
> #(a lot of preceding text)
> 2008-10-01      06:30:12                2 of 3
> page
>
> #(some lines of text - varies from file to file)
> sekvens    890
> # lines of text
> sNo     start            stop            direction        value
> 1        70                85                up                60.2
> 3        60                90                down            71.5
> #########
>
> In each of the files that I choose, I want to first go to the appropriate page number. This is the first line in the above text and the page number is 2 (from 2 of 3). The date and time preceding the page number vary from file to file, but the next line always has the word, page.
> After that, I am interested in the number following the word, sekvens. Also, the table underneath.
>
> Finally, I want to collect all the data in a data frame with the following structure :
>
> fileno    sekvens    sNo    start    stop    direction    value
> 82534    890            1        70       85    up            60.2
> 82534    890            3        60        90    down        71.5
> 82555     ..               ..        ..        ..        ..            ..
>
> There are a number of topics involved here where I have almost no familiarity. First, the use of regular expressions to specify the files that I want from a folder. Next, how do I locate a particular section (or page) in the text file from the description that I am interested in? Should these files be read in their entirety first, or is it possible to go directly to the section with the relevant text? Next, how do I extract the data in the form that I want?
>
> I have identified the following commands that would be useful for me here : list.files(), readLines(), strsplit().
> I would appreciate some help in getting started here. I would certainly benefit from a few hints. I would also appreciate it if I could get some links to references with examples showing how similar problems are tackled.
> Thanking you,
> Ravi

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?