[R] extracting data from a list of unformatted text files
ravi
rv15i at yahoo.se
Thu Nov 20 15:47:01 CET 2008
Jim,
Thank you so much. There is a lot for me here to dig into, learn and understand. But you have made my task so much easier by giving me sufficient material to get started. Once again, thanks a lot.
/Ravi
----- Original Message ----
From: jim holtman <jholtman at gmail.com>
To: ravi <rv15i at yahoo.se>
Cc: r-help at r-project.org
Sent: Thursday, 20 November, 2008 14:35:10
Subject: Re: [R] extracting data from a list of unformatted text files
Here is a way to process the file. You will have to add the loop,
error checking, piecing multiple files together, and determination of
the end of the data:
> x <- "I give below a sample of the kind of the information in the text file :
+ ########
+ #(a lot of preceding text)
+ 2008-10-01 06:30:12 2 of 3
+ page
+
+ #(some lines of text - varies from file to file)
+ sekvens 890
+ # lines of text
+ sNo start stop direction value
+ 1 70 85 up 60.2
+ 3 60 90 down 71.5
+ #########
+
+ In each of the files that I choose, I want to first go to the appropriate page number. This is the first line in the above text and the page number is 2 (from 2 of 3). The date and time preceding the page number vary from file to file, but the next line always has the word, page.
+ After that, I am interested in the number following the word, sekvens. Also, the table underneath."
> con <- textConnection(x)
> input <- readLines(con)
> close(con)  # close only this connection rather than every open one
> # find 'page'
> pageNo <- grep("^page", input)
> # back up one line and look for "2 of"
> page2 <- grep("2 of ", input[pageNo - 1])
> # compute the start of the data and delete the preceding lines
> startData <- pageNo[page2]
> input <- tail(input, -startData)
> # find 'sekvens'
> sek.indx <- grep("^sekvens", input)
> # extract number after
> sek.value <- sub(".*?(\\d+).*", "\\1", input[sek.indx], perl=TRUE)
> # find start of table
> sNo.indx <- grep("sNo", input)
> # read the data (you did not say how to determine the end, so I will read the three lines)
> con <- textConnection(input[sNo.indx + (0:2)])
> values <- read.table(con, header=TRUE)
> close(con)
> sek.value
[1] "890"
> values
  sNo start stop direction value
1   1    70   85        up  60.2
2   3    60   90      down  71.5
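
The steps above read a fixed three lines for the table. One possible way to find the end of the table and add basic error checking is sketched below. It is only a sketch: the function name extract_page() and its arguments are made up for illustration, and it assumes that every data row begins with a digit and that the table ends at the first line that does not.

```r
# Sketch only: extract_page() and its arguments are illustrative names,
# not part of the original reply.
extract_page <- function(lines, page = 2) {
  # find the "page" marker lines; the page number is on the line above each
  page_idx <- grep("^page", lines)
  page_idx <- page_idx[page_idx > 1]
  wanted <- sprintf(" %d of [0-9]+[[:space:]]*$", page)
  hit <- page_idx[grepl(wanted, lines[page_idx - 1])]
  if (length(hit) != 1) stop("expected exactly one match for page ", page)

  # drop everything up to and including the "page" line
  rest <- tail(lines, -hit)

  # first "sekvens" line after the page marker, and the number that follows
  sek_idx <- grep("^sekvens", rest)
  if (length(sek_idx) == 0) stop("no 'sekvens' line after page ", page)
  sek_idx <- sek_idx[1]
  sekvens <- as.integer(sub("^sekvens[[:space:]]+([0-9]+).*", "\\1", rest[sek_idx]))

  # first table header after the sekvens line
  hdr <- grep("^[[:space:]]*sNo[[:space:]]", rest)
  hdr <- hdr[hdr > sek_idx]
  if (length(hdr) == 0) stop("no table header after 'sekvens'")
  hdr <- hdr[1]

  # ASSUMPTION: data rows start with a digit; the table ends at the
  # first following line that does not
  body <- rest[-seq_len(hdr)]
  is_row <- grepl("^[[:space:]]*[0-9]", body)
  n_rows <- if (all(is_row)) length(body) else which(!is_row)[1] - 1
  if (n_rows == 0) stop("table header found but no data rows")

  tab <- read.table(text = rest[hdr + 0:n_rows], header = TRUE,
                    stringsAsFactors = FALSE)
  cbind(sekvens = sekvens, tab)
}

sample_lines <- c(
  "(a lot of preceding text)",
  "2008-10-01 06:30:12 2 of 3",
  "page",
  "",
  "(some lines of text)",
  "sekvens 890",
  "lines of text",
  "sNo start stop direction value",
  "1 70 85 up 60.2",
  "3 60 90 down 71.5",
  "#########"
)
res <- extract_page(sample_lines, page = 2)
print(res)
```

If the real files mark the end of the table differently (a blank line, a fixed row count, a trailer line), only the is_row test would need to change.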
On Thu, Nov 20, 2008 at 5:18 AM, ravi <rv15i at yahoo.se> wrote:
> Hi,
> I want to extract information from a number of text files in a folder. The files are named as : 82534.txt, 82555.txt, 8282787.txt etc.
>
> I give below a sample of the kind of the information in the text file :
> ########
> #(a lot of preceding text)
> 2008-10-01 06:30:12 2 of 3
> page
>
> #(some lines of text - varies from file to file)
> sekvens 890
> # lines of text
> sNo start stop direction value
> 1 70 85 up 60.2
> 3 60 90 down 71.5
> #########
>
> In each of the files that I choose, I want to first go to the appropriate page number. This is the first line in the above text and the page number is 2 (from 2 of 3). The date and time preceding the page number vary from file to file, but the next line always has the word, page.
> After that, I am interested in the number following the word, sekvens. Also, the table underneath.
>
> Finally, I want to collect all the data in a data frame with the following structure :
>
> fileno sekvens sNo start stop direction value
> 82534 890 1 70 85 up 60.2
> 82534 890 3 60 90 down 71.5
> 82555 .. .. .. .. .. ..
>
> There are a number of topics involved here where I have almost no familiarity. First, the use of regular expressions to specify the files that I want from a folder. Next, how do I locate a particular section (or page) in the text file from the description that I am interested in? Should these files be read in their entirety first, or is it possible to go directly to the section with the relevant text? Next, how do I extract the data in the form that I want?
>
> I have identified the following commands that would be useful for me here : list.files(), readLines(), strsplit().
> I would appreciate some help in getting started here. I would certainly benefit from a few hints. I would also appreciate it if I could get some links to references with examples showing how similar problems are tackled.
> Thanking you,
> Ravi
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
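
On the question of picking files by regular expression and collecting everything into one data frame, a rough sketch follows. The names collect_files(), parse_lines and parse_table_only() are hypothetical, and the demo files contain nothing but a small table so that the example stays short. In practice parse_lines would be whatever per-file extraction (page, sekvens, table) is used, and it would also supply the sekvens column.

```r
# Sketch only: collect_files() and parse_lines are illustrative names.
collect_files <- function(dir, parse_lines) {
  # keep only names made of digits followed by ".txt", e.g. 82534.txt
  files <- list.files(dir, pattern = "^[0-9]+\\.txt$", full.names = TRUE)

  pieces <- lapply(files, function(f) {
    # on failure, warn and skip this file instead of aborting the whole run
    tryCatch({
      tab <- parse_lines(readLines(f))
      # file number = file name without the ".txt" extension (kept as text)
      data.frame(fileno = sub("\\.txt$", "", basename(f)), tab,
                 stringsAsFactors = FALSE)
    }, error = function(e) {
      warning("skipping ", basename(f), ": ", conditionMessage(e))
      NULL
    })
  })

  # drop skipped files and stack the rest row-wise
  pieces <- Filter(Negate(is.null), pieces)
  do.call(rbind, pieces)
}

# --- small demo with made-up files in a temporary folder ---
demo_dir <- file.path(tempdir(), "sekvens_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("sNo start stop direction value",
             "1 70 85 up 60.2",
             "3 60 90 down 71.5"),
           file.path(demo_dir, "82534.txt"))
writeLines(c("sNo start stop direction value",
             "2 10 20 up 5.5"),
           file.path(demo_dir, "82555.txt"))
writeLines("not a data file", file.path(demo_dir, "notes.txt"))  # ignored by the pattern

# stand-in parser: the demo files contain only the table
parse_table_only <- function(lines) {
  read.table(text = lines, header = TRUE, stringsAsFactors = FALSE)
}

all_data <- collect_files(demo_dir, parse_table_only)
print(all_data)
```

The file number is kept as a character string so that any leading zeros in a file name survive; as.integer() could be applied afterwards if a numeric column is preferred.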
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?