[R] subsetting character vector into groups of numerics

Tue Oct 29 00:56:24 CET 2002

Patrick Connolly <p.connolly at hortresearch.co.nz> writes:

> I'm sure there's a simple way to do this, but I can only think of
> complicated ones.
> 
> 
> I have a number of character vectors that look something like this:
> 
> "12 78 23 9 76 43 2 15 41 81 92 5(92 12) (81 78 5 76 9 41) (23 2 15 43)"
> 
> I wish to get it into a list of numerical vectors like this:
> 
> $Group
> [1] 12 78 23 9 76 43 2 15 41 81 92 5
> 
> $Subgroup1
> [1] 92 12
> 
> $Subgroup2
> [1] 81 78 5 76 9 41
> 
> $Subgroup3
> [1] 23 2 15 43
> 
> I can't rely on the closing parenthesis as the last character in the
> vector, though the subgroup could be clearly defined without it.
> Numbers are obvious to the eye, but are not always separated from one
> another consistently.  Part of the reason for this exercise is to
> check that the Group is made up of the Subgroups with no elements
> missing, so getting Group is not simply a matter of concatenating the
> subgroups.
> 
> 
> Ideas appreciated.

Hmm... You seem to be telling us what the format is not. If you want
us to come up with something for the machine to do, it's not too
useful that things are "obvious to the eye"! 

If the format is consistently like the above with subgroups in (),
then you could start with using some of the deeper magic of gsub() to
turn the format into something which would be easier to split into
individual vectors, e.g.

> gsub("\\(([^)]*)\\)", "/\\1", x)
[1] "12 78 23 9 76 43 2 15 41 81 92 5/92 12 /81 78 5 76 9 41 /23 2 15 43"

[What was that? Well, "(" is a special grouping operator in regular
expressions; it isn't part of the RE as such, but things inside (..)
can be referred to with backreferences like \1, which of course needs
to be entered as "\\1". \( is an actual left parenthesis, again
written with the doubled backslash. [^)]* is a sequence consisting of
any characters except right parentheses (a parenthesis is not a
grouping operator when it sits within square brackets). So we're finding the
bits of text delimited by ( and ) and replacing them with a / and the
content of the parentheses. Got it? Don't worry if you don't, I didn't
get it right till the 12th try either! The important thing is knowing
that this kind of stuff is possible if you stare at it long enough.]
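
To see the pieces of that pattern at work, here is a small sketch. It
assumes x holds the example string from the question; the names m and y
are only for illustration, and regmatches() is a helper from later
versions of R than the one this was written for.

```r
# x is the example string from the original question
x <- "12 78 23 9 76 43 2 15 41 81 92 5(92 12) (81 78 5 76 9 41) (23 2 15 43)"

# The pattern: \( literal "(", then a group ( ... ) capturing [^)]*
# (anything that is not ")"), then \) literal ")"
pat <- "\\(([^)]*)\\)"

# What the whole pattern matches: each parenthesised chunk, parens included
m <- regmatches(x, gregexpr(pat, x))[[1]]
print(m)
# "(92 12)"  "(81 78 5 76 9 41)"  "(23 2 15 43)"

# What the backreference \1 keeps: only the text inside the parentheses,
# here prefixed with "/" so it can be split on later
y <- gsub(pat, "/\\1", x)
print(y)
```

Note that the spaces between the closing and opening parentheses are
outside the match, so they survive as trailing blanks in front of each "/".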

Now that it is in an easier format we can use strsplit to get
individual parts:

> s <- strsplit(gsub("\\(([^)]*)\\)", "/\\1", x),"/")
> s
[[1]]
[1] "12 78 23 9 76 43 2 15 41 81 92 5" "92 12 "                          
[3] "81 78 5 76 9 41 "                 "23 2 15 43"                      

and once we have those we might use scan() on each string to get the
numbers. This requires the use of a text connection, like this

> lapply(s[[1]], function(x)scan(textConnection(x)))
Read 12 items
Read 2 items
Read 6 items
Read 4 items
[[1]]
 [1] 12 78 23  9 76 43  2 15 41 81 92  5

[[2]]
[1] 92 12

[[3]]
[1] 81 78  5 76  9 41

[[4]]
[1] 23  2 15 43
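
The one-liner above leaves its text connections open. A slightly more
careful sketch of the same scan() step follows; the helper name
parse_numbers is made up for illustration, and s here is simply the four
strings from the strsplit() result typed in by hand.

```r
# The four pieces produced by strsplit() above, entered directly
s <- c("12 78 23 9 76 43 2 15 41 81 92 5", "92 12 ",
       "81 78 5 76 9 41 ", "23 2 15 43")

# Hypothetical helper: read the numbers from one string and make sure
# the text connection is closed again, even if scan() fails
parse_numbers <- function(piece) {
  con <- textConnection(piece)
  on.exit(close(con))
  scan(con, quiet = TRUE)   # quiet = TRUE drops the "Read n items" lines
}

nums <- lapply(s, parse_numbers)
print(nums)
```

The trailing blanks in the second and third pieces do no harm, since
scan() treats any run of white space as a separator.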

...

Your turn!
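
One possible way to take that turn, as a sketch only: name the list
elements as in the question and check the Group against the Subgroups.
It assumes the format is exactly as in the example (subgroups in
parentheses, no "/" in the data). The variable names are made up, and
scan(text = ...) is a shortcut from later versions of R that stands in
for the text connection.

```r
x <- "12 78 23 9 76 43 2 15 41 81 92 5(92 12) (81 78 5 76 9 41) (23 2 15 43)"

# Same steps as above: mark subgroups with "/", split, read the numbers
pieces <- strsplit(gsub("\\(([^)]*)\\)", "/\\1", x), "/")[[1]]
groups <- lapply(pieces, function(p) scan(text = p, quiet = TRUE))

# First piece is the Group, the rest are Subgroup1, Subgroup2, ...
names(groups) <- c("Group",
                   paste0("Subgroup", seq_len(length(groups) - 1)))

# Compare the Group with everything found in the Subgroups
in_subgroups   <- unlist(groups[-1], use.names = FALSE)
missing_in_sub <- setdiff(groups$Group, in_subgroups)   # in Group only
extra_in_sub   <- setdiff(in_subgroups, groups$Group)   # in Subgroups only
repeated       <- in_subgroups[duplicated(in_subgroups)]

complete <- length(missing_in_sub) == 0 &&
            length(extra_in_sub) == 0 &&
            length(repeated) == 0
print(groups)
print(complete)   # TRUE for the example string
```

If the real data are less regular than the example (a missing final
parenthesis, numbers run together), the pattern would need adjusting
before this check means anything.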

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907