[R] extracting data from a list of unformatted text files

Thu Nov 20 11:18:29 CET 2008

Hi,
I want to extract information from a number of text files in a folder. The files are named 82534.txt, 82555.txt, 8282787.txt, etc.

Below is a sample of the kind of information in each text file:
########
#(a lot of preceding text)
2008-10-01      06:30:12                2 of 3
page

#(some lines of text - varies from file to file)
sekvens    890
# lines of text
sNo     start            stop            direction        value
1        70                85                up                60.2
3        60                90                down            71.5
#########

In each of the files that I choose, I first want to go to the appropriate page. The page is identified by the line ending in "2 of 3" in the sample above, so the page number there is 2. The date and time preceding the page number vary from file to file, but the next line always contains the word "page".
After that, I am interested in the number following the word "sekvens", and in the table underneath it.

Finally, I want to collect all the data in a data frame with the following structure:

fileno    sekvens    sNo    start    stop    direction    value
82534     890        1      70       85      up           60.2
82534     890        3      60       90      down         71.5
82555     ..         ..     ..       ..      ..           ..

Several topics are involved here with which I have almost no familiarity. First, the use of regular expressions to specify the files that I want from a folder. Next, how do I locate a particular section (or page) in the text file from the description above? Should these files be read in their entirety first, or is it possible to go directly to the section with the relevant text? Finally, how do I extract the data in the form that I want?
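
To make the first question concrete, here is a minimal sketch of what I imagine the file selection might look like. It assumes that list.files() accepts a regular expression through its pattern argument and that the wanted files are exactly those whose names consist of digits followed by ".txt". The folder and the extra file notes.txt are made up for illustration.

```r
# Hypothetical folder: a temporary directory is used so the sketch runs anywhere.
folder <- file.path(tempdir(), "reports")
dir.create(folder, showWarnings = FALSE)
invisible(file.create(file.path(
  folder, c("82534.txt", "82555.txt", "8282787.txt", "notes.txt")
)))

# Keep only names made up entirely of digits followed by ".txt".
files <- list.files(folder, pattern = "^[0-9]+\\.txt$", full.names = TRUE)

# The file number is the file name without its extension.
fileno <- sub("\\.txt$", "", basename(files))
```

Here notes.txt should be left out because its name contains no leading digits.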

I have identified the following functions that look useful here: list.files(), readLines() and strsplit().
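
A rough sketch of how readLines() and strsplit() might fit together for a single file is given below. It is only a guess at an approach and makes several assumptions: the whole file is read first, a page starts at the "N of M" line directly above a line containing only "page", the table header starts with "sNo" followed by "start", and every table row has exactly five whitespace-separated fields. The function name parse_file and the sample contents are made up for illustration.

```r
# Hypothetical helper: extract sekvens and the table from one page of one file.
parse_file <- function(path, page = 2) {
  lines <- readLines(path)

  # A page header is the line just above a line that contains only "page".
  page_word <- which(trimws(lines) == "page")
  header_idx <- page_word[page_word > 1] - 1

  # Pull N out of "... N of M" on each header line.
  page_no <- suppressWarnings(as.integer(
    sub("^.*\\s([0-9]+)\\s+of\\s+[0-9]+\\s*$", "\\1", lines[header_idx])
  ))
  hit <- which(page_no == page)[1]
  if (is.na(hit)) stop("page not found in ", path)

  # The page runs from its header to the line before the next header.
  first <- header_idx[hit]
  later <- header_idx[header_idx > first]
  last <- if (length(later) > 0) later[1] - 1 else length(lines)
  block <- lines[first:last]

  # Number following the word "sekvens".
  sek_line <- grep("^\\s*sekvens\\s+[0-9]+", block, value = TRUE)[1]
  sekvens <- as.integer(sub("^\\s*sekvens\\s+([0-9]+).*$", "\\1", sek_line))

  # Table rows: lines after the header with a leading number and five fields.
  hdr <- grep("^\\s*sNo\\s+start", block)[1]
  if (is.na(hdr)) stop("table header not found in ", path)
  rows <- block[-seq_len(hdr)]
  rows <- rows[grepl("^\\s*[0-9]+(\\s+\\S+){4}\\s*$", rows)]
  mat <- do.call(rbind, strsplit(trimws(rows), "\\s+"))

  data.frame(
    fileno = sub("\\.txt$", "", basename(path)),
    sekvens = sekvens,
    sNo = as.integer(mat[, 1]),
    start = as.numeric(mat[, 2]),
    stop = as.numeric(mat[, 3]),
    direction = mat[, 4],
    value = as.numeric(mat[, 5]),
    stringsAsFactors = FALSE
  )
}

# Made-up file contents modelled on the sample above.
sample_text <- c(
  "some preceding text",
  "2008-10-01      06:30:12                1 of 3",
  "page",
  "sekvens    111",
  "2008-10-01      06:30:12                2 of 3",
  "page",
  "",
  "some lines of text",
  "sekvens    890",
  "more lines of text",
  "sNo     start            stop            direction        value",
  "1        70                85                up                60.2",
  "3        60                90                down            71.5",
  "2008-10-01      06:30:12                3 of 3",
  "page"
)
path <- file.path(tempdir(), "82534.txt")
writeLines(sample_text, path)
result <- parse_file(path, page = 2)
```

For the whole folder, I suppose something like do.call(rbind, lapply(files, parse_file)) could stack the per-file results, where files is the vector returned by list.files().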
I would appreciate some help in getting started, and would certainly benefit from a few hints. I would also appreciate links to references with examples showing how similar problems are tackled.
Thanking you,
Ravi