[R] Is there a better way to parse strings than this?

Fri Apr 15 02:28:34 CEST 2011

Thanks for the suggestions; they were all exactly what I was looking for.
(I knew there had to be a more elegant way than my brute-force method.)

One question though.

I was playing around with strsplit() but couldn't get it to work. I realised
my problem was that I was using "." as the split string.

I was trying strsplit(string,"\.\.\.") as per the suggestion in Venables
and Ripley's book to "(use '\.' to match '.')", which is in the Regular
expressions section.

I noticed that in the suggestions sent to me people used:
strsplit(test,"\\.\\.\\.")

Could anyone please explain why I should have used "\\.\\.\\." rather than
"\.\.\."?

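A short sketch of what is probably going on, not an authoritative answer:
R's parser handles backslash escapes in a string literal before the
regular expression engine ever sees the pattern. So "\\." in source code
is the two-character string \. (the form the book describes), while
current versions of R reject "\." as an unrecognized escape. The vector x
and the names parts_regex, parts_fixed and parts_class are made up for
illustration.

```r
x <- c("A5.Brands.bought...Dulux", "A5.Brands.bought...Haymes")

# "\\." typed in R source is stored as two characters: backslash, dot
nchar("\\.")
cat("\\.\\.\\.\n")   # shows the pattern the regex engine receives

# Parsing the literal "\." should fail; try() keeps the script going
res <- try(parse(text = '"\\."'), silent = TRUE)
inherits(res, "try-error")

# Regex split on three literal dots
parts_regex <- strsplit(x, "\\.\\.\\.")

# Same split, with the pattern taken as a plain string (no regex)
parts_fixed <- strsplit(x, "...", fixed = TRUE)

# A character class avoids backslashes altogether
parts_class <- strsplit(x, "[.]{3}")

identical(parts_regex, parts_fixed)
identical(parts_class, parts_fixed)
```

Without fixed = TRUE, an unescaped "." matches any character, which would
explain why splitting on "." or "..." gave unexpected results. In R 4.0.0
and later, a raw string such as r"(\.\.\.)" should be another way to write
the same pattern.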
Chris Howden
Founding Partner
Tricky Solutions
Tricky Solutions 4 Tricky Problems
Evidence Based Strategic Development, IP Commercialisation and Innovation,
Data Analysis, Modelling and Training
(mobile) 0410 689 945
(fax / office) (+618) 8952 7878
chris at trickysolutions.com.au

-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
Sent: Wednesday, 13 April 2011 10:55 PM
To: Chris Howden
Cc: r-help at r-project.org
Subject: Re: [R] Is there a better way to parse strings than this?

On Wed, Apr 13, 2011 at 12:07 AM, Chris Howden
<chris at trickysolutions.com.au> wrote:
> Hi Everyone,
>
>
> I needed to parse some strings recently.
>
> The code I've wound up using seems rather clunky, and I was wondering if
> anyone had any suggestions on a better way?
>
> Basically I do the following:
>
> 1) Use substr() to do the parsing
> 2) Use regexpr() to find the location of the string I want to parse on, I
> then pass this onto substr()
> 3) Use nchar() as the stop input to substr() where necessary
>
>
>
> I've got a simple example of the parsing code I used below. It takes
> questionnaire variable names that include the question and the brand it
> was answered for and then parses it so the variable name and the brand are
> in separate columns. I then use this to restructure the data from
> unstacked to stacked, but that's another story.
>
>> # this is the data set
>> test
> [1] "A5.Brands.bought...Dulux"
> [2] "A5.Brands.bought...Haymes"
> [3] "A5.Brands.bought...Solver"
> [4] "A5.Brands.bought...Taubmans.or.Bristol"
> [5] "A5.Brands.bought...Wattyl"
> [6] "A5.Brands.bought...Other"
>
>> # Where do I want to parse?
>> break1 <-  regexpr('...',test, fixed=TRUE)
>> break1
> [1] 17 17 17 17 17 17
> attr(,"match.length")
> [1] 3 3 3 3 3 3
>
>> # Put Variable name in a variable
>> str1 <- substr(test,1,break1-1)
>> str1
> [1] "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought"
> [5] "A5.Brands.bought" "A5.Brands.bought"
>
>> # Put Brand name in a variable
>> str2 <- substr(test,break1+3, nchar(test))
>> str2
> [1] "Dulux"               "Haymes"              "Solver"
> [4] "Taubmans.or.Bristol" "Wattyl"              "Other"
>
>

Try this:

> x <- c("A5.Brands.bought...Dulux", "A5.Brands.bought...Haymes",
+ "A5.Brands.bought...Solver")
>
> do.call(rbind, strsplit(x, "...", fixed = TRUE))
     [,1]               [,2]
[1,] "A5.Brands.bought" "Dulux"
[2,] "A5.Brands.bought" "Haymes"
[3,] "A5.Brands.bought" "Solver"
>
> # or
> xa <- sub("...", "\1", x, fixed = TRUE)
> read.table(textConnection(xa), sep = "\1", as.is = TRUE)
                V1     V2
1 A5.Brands.bought  Dulux
2 A5.Brands.bought Haymes
3 A5.Brands.bought Solver
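
A note on the second approach, as a sketch under my own reading of it: the
"\1" there appears to be a single control character (an octal escape for
character code 1), used as a separator that is unlikely to occur in the
data. It is not a regex back-reference, which would be written "\\1". The
example below uses read.table(text = ...) in place of textConnection(),
and the names test and out are illustrative only.

```r
test <- c("A5.Brands.bought...Dulux",
          "A5.Brands.bought...Taubmans.or.Bristol")

# "\1" is one character with code 1, not a backslash followed by "1"
nchar("\1")
utf8ToInt("\1")

# Replace only the first "..." with that separator; fixed = TRUE means
# the pattern is a plain string, so the dots are literal
xa <- sub("...", "\1", test, fixed = TRUE)

# Read the two fields back as character columns
out <- read.table(text = xa, sep = "\1", as.is = TRUE)
out
```

Because sub() replaces only the first match, the single dots inside a
brand name such as "Taubmans.or.Bristol" are left untouched.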

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com