[R] help reading a variably formatted text file
Michael Na Li
lina at u.washington.edu
Tue Nov 19 21:41:34 CET 2002
On Tue, 19 Nov 2002, Corey Moffet stated:
> Dear R-Help,
>
> I have a generated file that looks like the following:
> ....
> Is this a reasonable thing to do in R? Are there some functions that will
> make this task less difficult? Is there a function that allows you to read
> a small amount of information, parse it, test it, and then begin reading
> again where it left off?
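Yes: if you open the file as a connection, readLines() and scan() continue
from wherever the previous read stopped, so you can read a few lines, parse
them, and then keep reading.  A minimal sketch (the file name and the line
counts are just placeholders):

con <- file ("hillslope.dat", open = "r")
header <- readLines (con, n = 10)    ## read the first 10 lines
## ... inspect 'header', decide what to read next ...
block <- readLines (con, n = 31)     ## continues where the last read stopped
close (con)

For your file, though, I found it simpler to read the whole thing at once and
index into it.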
This function seems to work, on your sample file at least:
read.hill <- function (file)
{
    lines <- scan (file, what = "", sep = "\n", quiet = TRUE)
    ## Get the line starting with ' char'
    chars <- grep ("^ char", lines)
    ## Get the number of columns
    ncols <- get.numbers (lines[chars])
    ## Get the column labels
    labels <- lines[rep (chars, ncols) +
                    as.vector (sapply (ncols, seq, from = 1))]
    ##
    days.col <- grep ("Days", labels)
    runoff.col <- grep ("Runoff", labels)
    ## Get the numbers
    toSkip <- grep ("Daily values", lines) + 1
    toRead <- grep ("Minimum/Maximum", lines) - 2 - toSkip
    temp <- unlist (strsplit (lines[(toSkip+1):(toSkip+toRead)],
                              split = " +"))
    ## There are some "" in the first column
    temp <- matrix (temp, ncol = length (labels) + 1, byrow = TRUE)
    data.frame (days = as.numeric (temp[, days.col + 1]),
                runoff = as.numeric (temp[, runoff.col + 1]))
}
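Once the marker lines are found, read.table() could probably pull the numeric
block out directly, using them only to compute 'skip' and 'nrows'.  An
untested sketch (read.hill2 is just a made-up name), assuming the same
'Daily values' and 'Minimum/Maximum' markers:

read.hill2 <- function (file)
{
    lines <- readLines (file)
    toSkip <- grep ("Daily values", lines) + 1
    toRead <- grep ("Minimum/Maximum", lines) - 2 - toSkip
    ## skip the header part, then read exactly 'toRead' data rows
    read.table (file, skip = toSkip, nrows = toRead)
}

You would still have to attach the column labels yourself.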
get.numbers () is a function I wrote to extract the numbers from a character
vector, optionally keeping only the elements that match a given pattern.
get.numbers <- function (ss, pattern, ignore.case = FALSE) {
    if (!missing (pattern)) {
        ss <- grep (pattern, x = ss, ignore.case = ignore.case,
                    extended = TRUE, value = TRUE)
    }
    if (length (ss) == 0) {
        return (NULL)
    }
    ## split at non-numeric, non-dot characters and two or more dots
    ## FIXME: this is not the optimal split
    token <- strsplit (ss, split = "([^-+.0-9]|--+|\\+\\++|\\.\\.+| \t)")
    ## remove any trailing '.'
    token <- lapply (token, function (x) sub ("\\.$", "", x))
    ## remove empty strings and convert to numeric
    token <- lapply (token, function (x) {
        as.numeric (x[sapply (x, function (y) y != "")])
    })
    if (is.null (names (ss))) {
        names (token) <- ss
    } else {
        names (token) <- names (ss)
    }
    token
}
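For example, with some made-up input (not from your file) it should behave
roughly like this:

x <- c ("slope = 0.25", "width 10 to 20 m")
get.numbers (x)
## should give a list named by the input strings, roughly:
##   $`slope = 0.25`      0.25
##   $`width 10 to 20 m`  10 20
get.numbers (x, pattern = "width")
## with a pattern, only the matching elements are kept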
As a test:
> read.hill ("hillslope.dat")
   days runoff
1     1      0
2     2      0
3     3      0
4     4      0
5     5      0
6     6      0
7     7      0
8     8      0
9     9      0
10   10      0
11   11      0
12   12      0
13   13      0
14   14      0
15   15      0
16   16      0
17   17      0
18   18      0
19   19      0
20   20      0
21   21      0
22   22      0
23   23      0
24   24      0
25   25      0
26   26      0
27   27      0
28   28      0
29   29      0
30   30      0
31   31      0
As Jason pointed out, Perl might be more suitable for this job.  However, I do
like using R to parse all kinds of weird files: I find R scripts much easier
to maintain than Perl scripts, and it is often more convenient to read a file
directly into R.

It would be nice to have more powerful regular expression support in R, such
as a way to return the substrings matched by "()" groups.
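The closest workaround I know of is sub() with a backreference, which can pull
out one group at a time, e.g. (made-up input):

x <- c ("Runoff = 1.25 mm", "Days 31")
## keep only what the "()" group matched, here the first number in each string
sub ("^[^0-9]*([-+.0-9]+).*$", "\\1", x)
## should give "1.25" "31"

but that gets clumsy as soon as you want several groups back.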
Michael
--
----------------------------------------------------------------------------
Michael Na Li
Email: lina at u.washington.edu
Department of Biostatistics, Box 357232
University of Washington, Seattle, WA 98195
---------------------------------------------------------------------------