[R] Reading a file w/ two delimiters
Gabor Grothendieck
ggrothendieck at gmail.com
Fri Nov 18 16:58:54 CET 2011
On Fri, Nov 18, 2011 at 10:26 AM, David Winsemius
<dwinsemius at comcast.net> wrote:
>
> On Nov 18, 2011, at 9:28 AM, jim holtman wrote:
>
>> It is pretty straightforward in R:
>>
>>> x <-
>>> readLines(textConnection("sadf|asdf|asdf\tqwer|qwer|qwer\tzxcv|zxcv|zxfcgv"))
>>> closeAllConnections()
>>> # convert tabs to newlines
>>> x <- gsub("\t", "\n", x)
>
> Did the rules get liberalized for escaping patterns? Or have I been
> unnecessarily expending backslashes all these years? I thought that one
> needed 3 backslashes. This code does work and I am wondering if/when I
> "didn't get the memo". (I do see that there is a line early in the ?regex
> page that suggests I have been deluded all along.)
>
> "The current implementation interprets \a as BEL, \e as ESC, \f as FF, \n as
> LF, \r as CR and \t as TAB."
>
>> x <-
>> readLines(textConnection("sadf|asdf|asdf\tqwer|qwer|qwer\tzxcv|zxcv|zxfcgv"))
>> closeAllConnections()
>> # convert tabs to newlines
>> x2 <- gsub("\\\t", "\n", x)
>> x2
> [1] "sadf|asdf|asdf\nqwer|qwer|qwer\nzxcv|zxcv|zxfcgv"
>
> So I guess my question is (now) why the triple-slash technique even works?
>
There are two levels of parsing. First, R's parser converts the string
literal to a character string, and in that parse "\\\t" becomes a backslash
character followed by a tab character (2 characters). Second, the
regular expression parser interprets those two characters as a tab.
For example, consider these:
> gsub("\\\t", "x", "\\\t,\t") # 1
[1] "\\x,x"
> gsub("\\\t", "x", "\\\t,\t", fixed = TRUE) # 2
[1] "x,\t"
The first arg in 1 is processed into backslash tab by R and then the
regular expression parser processes that into just tab; however, the
third argument in 1 is processed by R to backslash tab comma tab and
is not further processed since it is not regarded as a regular
expression. Thus both tabs are replaced and the backslash is left alone.
In contrast, the first arg in 2 is processed into backslash tab by R as
before, but now it is not regarded as a regular expression, so the second
level of interpretation that occurred in 1 is not performed. Rather,
only occurrences of backslash tab get replaced instead of occurrences
of tab.
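The two levels can also be inspected directly. The following is only a
sketch: the variable names (pat, s, r_regex, r_fixed, same) are made up for
illustration, and nchar() and utf8ToInt() are used just to show which
characters survive R's own parse before gsub() ever sees them.

```r
# Level 1: R's parser turns the typed literal into a character string.
pat <- "\\\t"            # typed as: backslash backslash backslash t
nchar(pat)               # 2 characters remain
utf8ToInt(pat)           # 92 9, i.e. a backslash followed by a tab

s <- "\\\t,\t"           # backslash, tab, comma, tab
nchar(s)                 # 4

# Level 2 (regex mode): the regex engine reads backslash-tab as a tab,
# so every tab in s is replaced and the literal backslash is kept.
r_regex <- gsub(pat, "x", s)

# With fixed = TRUE there is no level 2: only the literal two-character
# sequence backslash-tab is matched, so the trailing tab survives.
r_fixed <- gsub(pat, "x", s, fixed = TRUE)
utf8ToInt(r_fixed)       # 120 44 9, i.e. "x", ",", tab

# In regex mode the triple-backslash pattern behaves like the plain "\t"
# pattern from earlier in the thread.
same <- identical(r_regex, gsub("\t", "x", s))
```

Looking at the code points rather than the printed strings avoids the extra
confusion that print() itself re-escapes backslashes and tabs on output.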
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com