[R] extracting data from a list of unformatted text files
ravi
rv15i at yahoo.se
Thu Nov 20 11:18:29 CET 2008
Hi,
I want to extract information from a number of text files in a folder. The files are named 82534.txt, 82555.txt, 8282787.txt, etc.
Below is a sample of the kind of information found in each text file:
########
#(a lot of preceding text)
2008-10-01 06:30:12 2 of 3
page
#(some lines of text - varies from file to file)
sekvens 890
# lines of text
sNo start stop direction value
1 70 85 up 60.2
3 60 90 down 71.5
#########
In each of the files that I choose, I first want to go to the appropriate page. The page is identified by the date line at the top of the sample above; there the page number is 2 (from "2 of 3"). The date and time preceding the page number vary from file to file, but the next line always contains the word "page".
After that, I am interested in the number following the word "sekvens", and in the table underneath it.
Finally, I want to collect all the data in a data frame with the following structure:
fileno  sekvens  sNo  start  stop  direction  value
82534   890      1    70     85    up         60.2
82534   890      3    60     90    down       71.5
82555   ..       ..   ..     ..    ..         ..
Several topics are involved here with which I have almost no familiarity. First, how do I use regular expressions to select the files that I want from a folder? Next, how do I locate the particular section (or page) of a text file that I am interested in, given the description above? Should the files be read in their entirety first, or is it possible to go directly to the section with the relevant text? Finally, how do I extract the data in the form that I want?
I have identified the following functions that would be useful for me here: list.files(), readLines(), strsplit().
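As a starting point, these three functions might be combined roughly as follows. This is only a minimal sketch under several assumptions that would need checking against the real files: the page line looks exactly like "date time n of m", the table header begins with "sNo", every table row has five whitespace-separated fields, and the table ends at the first line that does not fit that shape. The names parse_lines and read_folder are hypothetical helpers, not existing R functions. The sketch reads each file in full with readLines() and then searches the lines, which is the simplest approach when the files are small.

```r
# Hypothetical helper: pull sekvens and the table out of one file's lines.
parse_lines <- function(lines, fileno, page = 2) {
  lines <- trimws(lines)
  # Assumed page header: "<date> <time> <n> of <m>", next line reads "page".
  page_re <- "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} (\\d+) of \\d+$"
  hdr <- grep(page_re, lines, perl = TRUE)
  hdr <- hdr[hdr < length(lines)]
  hdr <- hdr[lines[hdr + 1] == "page"]
  page_no <- as.integer(sub(page_re, "\\1", lines[hdr], perl = TRUE))
  start <- hdr[page_no == page]
  if (length(start) == 0) return(NULL)
  start <- start[1]
  # The page runs until the next page header, or to the end of the file.
  later <- hdr[hdr > start]
  end <- if (length(later) > 0) later[1] - 1 else length(lines)
  block <- lines[start:end]

  sek <- grep("^sekvens\\s+\\d+", block, perl = TRUE)
  tab <- grep("^sNo\\s", block, perl = TRUE)
  if (length(sek) == 0 || length(tab) == 0) return(NULL)
  sekvens <- as.integer(sub("^sekvens\\s+(\\d+).*$", "\\1",
                            block[sek[1]], perl = TRUE))

  # Candidate table rows: everything after the "sNo ..." header line.
  rows <- block[seq_len(length(block) - tab[1]) + tab[1]]
  fields <- strsplit(rows, "\\s+")
  ok <- vapply(fields,
               function(f) length(f) == 5 && grepl("^\\d+$", f[1]),
               logical(1))
  # Keep only the leading run of well-formed rows.
  n <- if (all(ok)) length(ok) else which(!ok)[1] - 1
  if (n == 0) return(NULL)
  m <- do.call(rbind, fields[seq_len(n)])
  data.frame(fileno = fileno, sekvens = sekvens,
             sNo = as.integer(m[, 1]), start = as.integer(m[, 2]),
             stop = as.integer(m[, 3]), direction = m[, 4],
             value = as.numeric(m[, 5]), stringsAsFactors = FALSE)
}

# Hypothetical driver: process every all-digit .txt file in a folder.
read_folder <- function(dir = ".", page = 2) {
  files <- list.files(dir, pattern = "^[0-9]+\\.txt$", full.names = TRUE)
  parts <- lapply(files, function(f) {
    fileno <- sub("\\.txt$", "", basename(f))  # kept as character
    parse_lines(readLines(f, warn = FALSE), fileno, page)
  })
  do.call(rbind, parts)  # NULL results (no matching page) are dropped
}
```

Calling read_folder("path/to/folder"), with the path replaced by the real folder, should then give one data frame with the columns fileno, sekvens, sNo, start, stop, direction and value. Files in which the requested page, the sekvens line or the table cannot be found are silently skipped in this sketch.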
I would appreciate some help in getting started, and would certainly benefit from a few hints. I would also be grateful for links to references with examples showing how similar problems are tackled.
Thanking you,
Ravi