[R] regexpr and parsing question

Wed Jan 31 00:21:44 CET 2007

On Tue, 2007-01-30 at 17:23 -0500, Kimpel, Mark William wrote:
> The main problem I am trying to solve is this:
> 
> I am importing a tab delimited file whose first line contains only one
> column, which is a descriptor of the form "col_1 col_2 col_3", i.e. the
> colnames are not tab delineated but are separated by whitespace. I would
> like to parse this first line and make such that it becomes the colnames
> of the rest of the file, which I am reading into R using read.delim().
> The file is so huge that I must do this in R.
> 
> My first question is this: What is the best way to accomplish what I
> want to do?

Mark,

The first thing that comes to mind is a two pass approach on the file:

First pass: (using example file with your first line)

# Get the first line into a vector to set the colnames for the DF
# during the second pass
ColNames <- unlist(read.table("test.txt", nrows = 1, as.is = TRUE))

> str(ColNames)
 Named chr [1:3] "col_1" "col_2" "col_3"
 - attr(*, "names")= chr [1:3] "V1" "V2" "V3"

Second pass:

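# Now read the rest of the file, skipping the first line. header = FALSE
# keeps read.delim() from consuming the first data row as a header.
DF <- read.delim("test.txt", skip = 1, header = FALSE, col.names = ColNames)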

I believe that should get you the full data set and set the colnames
based upon the first line. This should pretty much obviate the need for
everything below here.
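
For a self-contained check of the two pass idea, here is a minimal sketch.
The temporary file and its two data rows are made up for illustration, and
scan() is swapped in for read.table() on the first pass only because it
returns a plain character vector directly:

```r
# Build a small example file: a whitespace-separated descriptor line,
# followed by tab-delimited data rows (contents are hypothetical).
tmp <- tempfile(fileext = ".txt")
writeLines(c("col_1 col_2 col_3",
             "1\ta\t2.5",
             "2\tb\t3.5"), tmp)

# First pass: read only the first line and split it on whitespace
ColNames <- scan(tmp, what = "", nlines = 1, quiet = TRUE)

# Second pass: skip the descriptor line. header = FALSE is set so the
# first data row is kept as data rather than used as a header.
DF <- read.delim(tmp, skip = 1, header = FALSE, col.names = ColNames)

str(DF)
unlink(tmp)
```

Both rows should come back as data, with the names taken from the first
line of the file.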

> My other questions revolve around some failed attempts on my part to
> solve the problem on my own using regular expressions. I thought that
> perhaps I could change the first line to "c("col_1", "col_2", "col_3")
> using gsub. I was having trouble figuring out how R uses the backslash
> character because I know that sometimes the backslash one would use in
> Perl needs to be a double backslash in R.

You would not want to change the first line as you have it above, as it
would not be parsed properly using read.table() family functions.

> Here is a sample of what I tried and what I got:
> 
> a<-"col_1 col_2 col_3"
> 
> > gsub("\\s", " " , a) 
> 
> [1] "col_1 col_2 col_3"
> 
> > gsub("\\s", "\\s" , a) 
> 
> [1] "col_1scol_2scol_3"
> 
> As you can see, it looks like R is taking a regular expression for
> "pattern", but not taking it for "replacement". Why is this?

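gsub() treats "pattern" as a regular expression, but "replacement" is
just a string. By default, the only special sequences in it are the
backreferences "\\1" to "\\9"; a backslash before any other character is
dropped, which is why "\\s" comes back as a plain "s". See ?grep and note
the various arguments to the functions there and how they impact R's
behavior here.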
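
A short sketch of how the replacement string behaves, using the same
string as in your example. The "<" and ">" markers are only there to make
the backreference visible:

```r
a <- "col_1 col_2 col_3"

# "\\s" is a regex class in the pattern, but in the replacement the
# backslash is dropped and a literal "s" is inserted
r1 <- gsub("\\s", "\\s", a)
print(r1)          # [1] "col_1scol_2scol_3"

# Backreferences are what the replacement does understand:
# "\\1" refers to the first parenthesized group of the pattern
r2 <- gsub("([^ ]+)", "<\\1>", a)
print(r2)          # [1] "<col_1> <col_2> <col_3>"

# To insert a literal backslash, escape it in the replacement as well
r3 <- gsub(" ", "\\\\s", a)
cat(r3, "\n")      # col_1\scol_2\scol_3
```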

Also, note that there is a difference (to further complicate your
life...) between the characters that R displays by default using print()
and how they are displayed using cat(). See below.

> a
[1] "col_1 col_2 col_3"

> gsub(" ", ", " , a)
[1] "col_1, col_2, col_3"

Or, to get you to your vector statement above, note the result here:

> paste("c(\"", gsub(" ", "\", \"" , a), "\")", sep = "")
[1] "c(\"col_1\", \"col_2\", \"col_3\")"

Now see how it displays when the escaped double quote chars are
interpreted properly using cat():

> cat(paste("c(\"", gsub(" ", "\", \"" , a), "\")", sep = ""), "\n")
c("col_1", "col_2", "col_3") 

> Assuming that I did want to solve my original problem with gsub and then
> turn the string into an R object, how would I get gsub to return
> "c("col_1", "col_2", "col_3") using my original string?

Again, note the two pass solution above.  It's easier, unless you would
want to consider using awk/sed from a CLI, which I generally avoid at
all costs...
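
If what you are after is the character vector itself rather than the text
of a c(...) call, a minimal sketch of two routes follows. The object names
are made up for illustration; strsplit() and eval(parse()) are standard
base R functions:

```r
a <- "col_1 col_2 col_3"

# Route 1: split the string on runs of whitespace
cols <- strsplit(a, "[[:space:]]+")[[1]]

# Route 2: build the text of a c(...) call with gsub()/paste(),
# then parse and evaluate that text to get the object
code <- paste("c(\"", gsub(" ", "\", \"", a), "\")", sep = "")
cols2 <- eval(parse(text = code))

print(cols)
identical(cols, cols2)
```

The first route avoids constructing and evaluating code, so it is usually
the simpler one to reason about.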

> Finally, is there a way to declare a string as a regular expression so
> that R sees it the same way other languages, such as Perl do, i.e. make
> the backslash be interpreted the same way? For someone who is just
> learning regular expressions as I am, it is very frustrating to read
> about them in references and then have to translate what I've learned
> into R syntax. I was thinking that instead of enclosing the string in
> "", one could use THIS.IS.A.REGULAR.EXPRESSION(), similar to the way we
> use I() in formulae.

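Part of the challenge is noting the different behaviors of regex within
R and how that behavior is affected by the aforementioned arguments.
The double backslash arises because R processes escapes in the string
literal first, so "\\s" in your code is already the two characters \s by
the time the regex engine sees it. The other part is noting how output
is displayed within R relative to the interpretation of escaped
characters, as seen above.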
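
To see the two layers separately, here is a small sketch. The variable
name pat is just for illustration:

```r
# What you type in R code: a doubled backslash inside the quotes
pat <- "\\s"

# What the string actually holds: two characters, a backslash and an s
n <- nchar(pat)
print(n)                 # [1] 2

# print() shows the string re-escaped; cat() shows the raw characters,
# which is what the regex engine receives
print(pat)               # [1] "\\s"
cat(pat, "\n")           # \s

# So the pattern behaves like \s does in a Perl reference
hit <- grepl(pat, "col_1 col_2")
print(hit)               # [1] TRUE
```

In other words, what a regex reference writes as \s is typed as "\\s" in
R source, and cat() is a quick way to check what the engine will see.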

> These are a bunch of questions, but obviously I have a lot to learn!
> 
> Thanks,
> 
> Mark

HTH,

Marc Schwartz