[R] parsing strings between [ ] in columns
Gabor Grothendieck
ggrothendieck at gmail.com
Thu Feb 18 12:14:23 CET 2010
Here is a solution using strapply in the gsubfn package.
First we define a proto object p containing a single method, i.e.
function, called fun. fun takes the contents of one [...] construct,
splits it into the numeric vector v using strsplit and assigns names
to v. strapply maintains a built-in variable, count, automatically in
the proto object; fun uses it to determine which letter to use.
Then use strapply to apply fun in p to each substring matching this regexp:
"\\[([01, ]*)\\]". This regexp matches [ followed by a string of
characters made up of 0, 1, comma and space, followed by ], and strapply
applies p$fun to each such occurrence. (Modify the regexp appropriately if
the true problem has different characteristics.)
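To see what that regexp picks out before involving strapply, here is a small base R sketch. It is only a check of the pattern, not the gsubfn solution itself; the variable names x, pat, m and inner are made up for illustration.

```r
# One value of the x column from the original question.
x <- "[[1, 0, 0], [0, 1]]"

# Same pattern as in the strapply call: "[", then any run of
# 0, 1, comma and space, then "]".
pat <- "\\[([01, ]*)\\]"

# gregexpr finds every match; regmatches extracts the matched text.
m <- regmatches(x, gregexpr(pat, x))[[1]]
print(m)      # "[1, 0, 0]" "[0, 1]"

# Dropping the enclosing brackets leaves the text of the capture group.
inner <- gsub("^\\[|\\]$", "", m)
print(inner)  # "1, 0, 0" "0, 1"
```

The outer [ is skipped because the character right after it is another [, which is not in the class [01, ], so no ] can follow immediately.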
Finally, simplify = rbind will cause the resulting vectors to be
rbind'ed together. (If the different rows of myDF do not have the
same structure then omit the simplify = rbind argument of strapply to
get out a list.)
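library(gsubfn)  # provides strapply; proto() comes from the proto package it depends on

# fun receives the captured text of one [...] group, e.g. "1, 0, 0"
p <- proto(fun = function(this, x) {
  v <- as.numeric(strsplit(x, ",")[[1]])
  # count is maintained by strapply in p and selects the letter for this group
  names(v) <- paste(LETTERS[count], seq_along(v), sep = "")
  v
})

strapply(as.character(myDF[[1]]), "\\[([01, ]*)\\]", p, simplify = rbind)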
Here is what the output looks like:
> strapply(as.character(myDF[[1]]), "\\[([01, ]*)\\]", p, simplify = rbind)
     A1 A2 A3 B1 B2
[1,]  1  0  0  0  1
[2,]  1  1  0  0  1
[3,]  1  0  0  1  1
[4,]  0  0  1  0  1
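For comparison, roughly the same matrix can be built in base R without gsubfn. This is a rough sketch under my own assumptions, not the strapply approach above: parse_row is a hypothetical helper written for this example, and it assumes every row has the same group structure.

```r
# Sample data from the original question.
myDF <- data.frame(x = c("[[1, 0, 0], [0, 1]]", "[[1, 1, 0], [0, 1]]",
                         "[[1, 0, 0], [1, 1]]", "[[0, 0, 1], [0, 1]]"),
                   stringsAsFactors = FALSE)

# parse_row (hypothetical helper): turn one string into a named numeric vector.
parse_row <- function(s) {
  # Extract each inner [...] group, e.g. "[1, 0, 0]" and "[0, 1]".
  groups <- regmatches(s, gregexpr("\\[[01, ]*\\]", s))[[1]]
  # Strip the brackets and split each group on commas.
  vals <- lapply(groups, function(g) {
    as.numeric(strsplit(gsub("\\[|\\]", "", g), ",")[[1]])
  })
  # Name the elements A1, A2, ..., B1, B2, ... by group and position.
  nms <- unlist(lapply(seq_along(vals), function(i) {
    paste(LETTERS[i], seq_along(vals[[i]]), sep = "")
  }))
  setNames(unlist(vals), nms)
}

# One row per input string, bound into a matrix.
res <- do.call(rbind, lapply(as.character(myDF$x), parse_row))
print(res)
```

Note that LETTERS has only 26 entries, so with about 30 groups either version would need a different labelling scheme for the groups past Z.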
See http://gsubfn.googlecode.com and the gsubfn vignette for more info.
On Thu, Feb 18, 2010 at 3:29 AM, milton ruser <milton.ruser at gmail.com> wrote:
> Dear all,
>
> I have a data.frame with a column like the x shown below
> myDF<-data.frame(cbind(x=c("[[1, 0, 0], [0, 1]]",
> "[[1, 1, 0], [0, 1]]","[[1, 0, 0], [1, 1]]",
> "[[0, 0, 1], [0, 1]]")))
>> myDF
> x
> 1 [[1, 0, 0], [0, 1]]
> 2 [[1, 1, 0], [0, 1]]
> 3 [[1, 0, 0], [1, 1]]
> 4 [[0, 0, 1], [0, 1]]
>
> As you can see, my x column is composed of
> strings enclosed in [[ ]], with commas separating
> the "fields".
>
> I need to identify the number of
> groups inside the main [ ] and label each
> group with a different sequential label.
> For the example above I would like to have:
>
>           A      B
> 1 [1, 0, 0] [0, 1]
> 2 [1, 1, 0] [0, 1]
> 3 [1, 0, 0] [1, 1]
> 4 [0, 0, 1] [0, 1]
> Although here I have only two groups, my
> real dataset will have many more (~30).
> After identifying the groups I would like
> to identify the subgroups:
>   A1 A2 A3 B1 B2
> 1  1  0  0  0  1
> 2  1  1  0  0  1
> 3  1  0  0  1  1
> 4  0  0  1  0  1
>
> Any hints are welcome.
>
> milton ribeiro
>