[R] parsing strings between [ ] in columns

Gabor Grothendieck ggrothendieck at gmail.com
Thu Feb 18 12:14:23 CET 2010


Here is a solution using strapply in the gsubfn package.

First we define a proto object p containing a single method (i.e.
function) called fun.  fun takes the contents of one [...] construct,
splits it into the numeric vector v using strsplit, and assigns names
to its elements.  strapply automatically maintains a variable, count,
in the proto object; it holds the number of the current match and is
used to determine which letter to use.
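To make the description of fun concrete, here is a minimal base R
sketch of the same split-and-name step on a single piece.  The helper
name name_group is hypothetical, and count is passed in explicitly
here rather than being supplied by strapply:

```r
# Hypothetical standalone version of the naming step; `count` is an
# explicit argument here instead of being maintained by strapply.
name_group <- function(x, count) {
  # split "1, 0, 0" on commas; as.numeric ignores the leading spaces
  v <- as.numeric(strsplit(x, ",")[[1]])
  # build names such as A1, A2, A3 for the first group
  names(v) <- paste(LETTERS[count], seq_along(v), sep = "")
  v
}

first  <- name_group("1, 0, 0", 1)   # names A1, A2, A3
second <- name_group("0, 1", 2)      # names B1, B2
print(c(first, second))
```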

Next, use strapply to apply fun in p to each substring matching the
regexp "\\[([01, ]*)\\]".  This regexp matches [ followed by a string
of characters made up of 0, 1, comma and space, followed by ], and
strapply applies p$fun to each such occurrence.  (Modify the regexp
appropriately if the true problem has different characteristics.)
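To see which substrings this regexp picks out, the pattern can be
tried in base R without gsubfn.  This is only an illustrative check
using regmatches and gregexpr; the variable names are made up for the
example:

```r
x   <- "[[1, 0, 0], [0, 1]]"
pat <- "\\[([01, ]*)\\]"

# All substrings of x that match the pattern.  The outermost [ and ]
# are skipped because the character class does not allow a nested [.
m <- regmatches(x, gregexpr(pat, x))[[1]]
print(m)

# The text inside the brackets, i.e. the part captured by ( ... )
inner <- gsub("\\[|\\]", "", m)
print(inner)
```

Only the inner groups match, so each one can be handled separately.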

Finally, simplify = rbind causes the resulting vectors to be
rbind'ed together into a matrix.  (If the different rows of myDF do
not have the same structure then omit the simplify = rbind argument
of strapply to get a list instead.)


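library(gsubfn)  # also loads proto

p <- proto(fun = function(this, x) {
	# x is the text inside one [...]; split on commas and convert to numbers
	v <- as.numeric(strsplit(x, ",")[[1]])
	# count is the match number within the current string (set by strapply)
	names(v) <- paste(LETTERS[count], seq_along(v), sep = "")
	v
})
strapply(as.character(myDF[[1]]), "\\[([01, ]*)\\]", p, simplify = rbind)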


Here is what the output looks like:

> strapply(as.character(myDF[[1]]), "\\[([01, ]*)\\]", p, simplify = rbind)
     A1 A2 A3 B1 B2
[1,]  1  0  0  0  1
[2,]  1  1  0  0  1
[3,]  1  0  0  1  1
[4,]  0  0  1  0  1
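
As a cross-check, the same matrix can be built with base R only.
This is a sketch of an alternative, not the gsubfn approach itself;
the helper name parse_row is hypothetical and myDF is recreated from
the example data in the question:

```r
myDF <- data.frame(x = c("[[1, 0, 0], [0, 1]]", "[[1, 1, 0], [0, 1]]",
                         "[[1, 0, 0], [1, 1]]", "[[0, 0, 1], [0, 1]]"))
pat <- "\\[([01, ]*)\\]"

# Hypothetical helper: turn one string into a named numeric vector
parse_row <- function(s) {
  groups <- regmatches(s, gregexpr(pat, s))[[1]]   # "[1, 0, 0]" "[0, 1]"
  groups <- gsub("\\[|\\]", "", groups)            # drop the brackets
  vals <- lapply(strsplit(groups, ","), as.numeric)
  # A1, A2, ... for the first group, B1, B2, ... for the second
  nms <- unlist(lapply(seq_along(vals), function(i)
    paste(LETTERS[i], seq_along(vals[[i]]), sep = "")))
  setNames(unlist(vals), nms)
}

# Works only if every row has the same structure
res <- do.call(rbind, lapply(as.character(myDF$x), parse_row))
print(res)
```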

See http://gsubfn.googlecode.com and the gsubfn vignette for more info.
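
One caveat for the real data, which is said to have about 30 groups:
LETTERS has only 26 entries, so LETTERS[count] is NA from the 27th
group on and the names would come out as NA1, NA2, and so on.  A
possible workaround, sketched here with a hypothetical helper
group_labels, is to continue with two-letter labels (AA, AB, ...) and
index that vector by count instead of LETTERS:

```r
# Hypothetical helper: n group labels A..Z, then AA, AB, ..., ZZ
group_labels <- function(n) {
  # t() makes the second letter vary fastest: AA, AB, AC, ...
  two <- as.vector(t(outer(LETTERS, LETTERS, paste0)))
  c(LETTERS, two)[seq_len(n)]
}

labs <- group_labels(30)
print(labs[25:30])
```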


On Thu, Feb 18, 2010 at 3:29 AM, milton ruser <milton.ruser at gmail.com> wrote:
> Dear all,
>
> I have a data.frame with a column like the x shown below
> myDF<-data.frame(cbind(x=c("[[1, 0, 0], [0, 1]]",
>   "[[1, 1, 0], [0, 1]]","[[1, 0, 0], [1, 1]]",
>   "[[0, 0, 1], [0, 1]]")))
>> myDF
>                    x
> 1 [[1, 0, 0], [0, 1]]
> 2 [[1, 1, 0], [0, 1]]
> 3 [[1, 0, 0], [1, 1]]
> 4 [[0, 0, 1], [0, 1]]
>
> As you can see, my x column is composed of
> strings between [[ ]], using commas to separate
> some "fields".
>
> I need to identify the number of
> groups inside the main [ ] and label each
> group with a different sequential string.
> In the example above I would like to have:
>
>  A         B
> 1 [1, 0, 0] [0, 1]
> 2 [1, 1, 0] [0, 1]
> 3 [1, 0, 0] [1, 1]
> 4 [0, 0, 1] [0, 1]
> Although here I have only two groups, my
> real dataset will have many more (~30).
> After identifying the groups I would like
> to identify the subgroups:
>  A1 A2 A3  B1 B2
> 1 1  0  0   0  1
> 2 1  1  0   0  1
> 3 1  0  0   1  1
> 4 0  0  1   0  1
>
> Any hints are welcome.
>
> milton ribeiro