[R] Pattern match

Wed Apr 20 22:49:45 CEST 2011

Hi:

This is a bit of a roundabout approach; I'm sure that folks with regex
expertise will trump this in a heartbeat. I modified the last piece of
the string a bit to accommodate the approach below. Depending on where
the strings have line breaks, you may have some odd '\n' characters
inserted.

# Step 1: read the input as a single character string
u <- "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond=(255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);SpeciesScientific=(Achromobacter
cycloclastes);SpeciesCommon=(Bacteria);Reactive=(Ce+)"

# Step 2: Split the input string on the ';' delimiter and then use lapply()
# to split variable names from values.
# This results in a nested list for ulist2.
ulist <- strsplit(u, ';')
ulist2 <- lapply(ulist, function(s) strsplit(s, '='))

# Step 3: Break out the results into a matrix whose first column is
# the variable name and whose second column is the value (with parens
# included). This avoids dealing with nested lists.
v <- matrix(unlist(ulist2), ncol = 2, byrow = TRUE)

# Step 4: Strip off the parens. Note that this removes every paren,
# including inner ones, so 'Cu(II)' becomes 'CuII'.
w <- apply(v, 2, function(s) gsub('([\\(\\)])', '', s))
colnames(w) <- c('Name', 'Value')
w
      Name                 Value
 [1,] "SpeciesCommon"      "Human"
 [2,] "SpeciesScientific"  "Homo sapiens"
 [3,] "ReactiveCentres"    "N,C,C,C,+H,O,C,C,C,C,O,H"
 [4,] "BondInvolved"       "C-H"
 [5,] "EzCatDBID"          "S00343"
 [6,] "BondFormed"         "O-H,O-H"
 [7,] "Bond"               "255B"
 [8,] "Cofactors"          "CuII,CU,501,A,CuII,CU,502,A"
 [9,] "CatalyticSwissProt" "P25006"
[10,] "SpeciesScientific"  "Achromobacter\ncycloclastes"
[11,] "SpeciesCommon"      "Bacteria"
[12,] "Reactive"           "Ce+"
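
If the inner parens matter (as in 'Cu(II)'), one possible variation on Step 4 is to anchor the pattern so that only the opening paren at the start and the closing paren at the end of each value are removed. This is only a sketch; the vector `vals` is a made-up stand-in for the second column of v.

```r
# Hypothetical example values, mimicking the second column of v
vals <- c("(Human)", "(Cu(II),CU,501,A,Cu(II),CU,502,A)", "(Ce+)")

# Remove one '(' at the very start and one ')' at the very end of each
# value; parens in the middle of a value are left alone
stripped <- sub("\\)$", "", sub("^\\(", "", vals))
stripped
```

With this variation the cofactor value keeps its 'Cu(II)' entries intact.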

# Step 5: Subset out the values of the SpeciesScientific variables
subset(as.data.frame(w), Name == 'SpeciesScientific', select = 'Value')
                         Value
2                 Homo sapiens
10 Achromobacter\ncycloclastes
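
The embedded '\n' comes from the line break inside the input string. A minimal sketch of one way to tidy it up, assuming a single space is an acceptable replacement for any run of whitespace (`vals` is again a made-up stand-in for the Value column):

```r
# Hypothetical values as they appear in the Value column above
vals <- c("Homo sapiens", "Achromobacter\ncycloclastes")

# Collapse each run of whitespace (including '\n') into a single space
cleaned <- gsub("[[:space:]]+", " ", vals)
cleaned
```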

One possible 'advantage' of this approach is that if you have a number
of string records of this type, you can create nested lists for each
string and then manipulate the lists to get what you need. Hopefully
you can use some of these ideas for other purposes as well.
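
As a rough illustration of the multiple-record idea, Steps 2 to 4 could be wrapped in a small function and applied to each record with lapply(). The function name parse_record() and the two records below are invented for the example, and the sketch assumes that no value contains a ';' or an '='.

```r
# Two made-up records in the same 'Name=(Value);...' layout
recs <- c("SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens)",
          "SpeciesCommon=(Bacteria);SpeciesScientific=(Achromobacter cycloclastes)")

# parse_record() is a hypothetical helper wrapping Steps 2-4 for one record
parse_record <- function(rec) {
  # Split on ';', then split each 'Name=(Value)' piece on '='
  parts <- strsplit(strsplit(rec, ";", fixed = TRUE)[[1]], "=", fixed = TRUE)
  m <- matrix(unlist(parts), ncol = 2, byrow = TRUE)
  # Strip all parens, as in Step 4; m[] keeps the matrix shape
  m[] <- gsub("[()]", "", m)
  colnames(m) <- c("Name", "Value")
  m
}

# One Name/Value matrix per record
parsed <- lapply(recs, parse_record)

# Pull the SpeciesScientific value out of each record
sci <- sapply(parsed, function(m) m[m[, "Name"] == "SpeciesScientific", "Value"])
sci
```

Here each record is assumed to contain exactly one SpeciesScientific entry; with several per record, sapply() would return a list rather than a character vector.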

Dennis

On Wed, Apr 20, 2011 at 10:17 AM, Neeti <nikkihathi at gmail.com> wrote:
> Hi ALL,
>
> I have a very simple question regarding pattern matching. Could anyone tell me
> how I can use R to retrieve a string pattern from a text file? For example,
> my file contains the following information:
>
> SpeciesCommon=(Human);SpeciesScientific=(Homo
> sapiens);ReactiveCentres=(N,C,C,C,+
> H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
> 255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
> eciesScientific=(Achromobacter
> cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+
>
> and I want to extract the “SpeciesScientific = (?)” information from this file.
> The problem is in the 3rd line, where the word SpeciesScientific is split by a +.
>
> Could anyone help me please?
> Thank you