[R] Pattern match

Fri Apr 22 12:42:15 CEST 2011

Thank you for your message. Please see the attached file for a template/test
dataset taken from my file.

On Thu, Apr 21, 2011 at 1:30 PM, David Winsemius <dwinsemius at comcast.net> wrote:

>
> On Apr 21, 2011, at 5:27 AM, neetika nath wrote:
>
>> Thank you, Dennis.
>>
>> Yes, the problem is the input file. I have an .rdf file in the same format
>> I posted earlier. When I open the file in Notepad++, the lines are broken
>> with CR+LF characters. Is there any way to retrieve the SpeciesScientific
>> information without changing the input file?
>>
>
> You might consider attaching the original file named with an extension of
> `.txt`, since your verbal description does not match your included example.
> What I see after the various servers have passed this around and inserted
> line-ends is the string `SpeciesScientific` in the first line, rather than
> in the third.
>
> --
> David
>
> --
>
>>
>> Thank you
>>
>> On Wed, Apr 20, 2011 at 9:49 PM, Dennis Murphy <djmuser at gmail.com> wrote:
>>
>>> Hi:
>>>
>>> This is a bit of a roundabout approach; I'm sure that folks with regex
>>> expertise will trump this in a heartbeat. I modified the last piece of
>>> the string a bit to accommodate the approach below. Depending on where
>>> the strings have line breaks, you may have some odd '\n' characters
>>> inserted.
>>>
>>> # Step 1: read the input as a single character string
>>> u <- "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond=(255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);SpeciesScientific=(Achromobacter
>>> cycloclastes);SpeciesCommon=(Bacteria);Reactive=(Ce+)"
>>>
>>> # Step 2: Split input lines by the ';' delimiter and then use lapply()
>>> # to split variable names from values.
>>> # This results in a nested list for ulist2.
>>> ulist <- strsplit(u, ';')
>>> ulist2 <- lapply(ulist, function(s) strsplit(s, '='))
>>>
>>> # Step 3: Break out the results into a matrix whose first column is
>>> # the variable name and whose second column is the value (with parens
>>> # included). This avoids dealing with nested lists.
>>> v <- matrix(unlist(ulist2), ncol = 2, byrow = TRUE)
>>>
>>> # Step 4: Strip off the parens
>>> w <- apply(v, 2, function(s) gsub('([\\(\\)])', '', s))
>>> colnames(w) <- c('Name', 'Value')
>>> w
>>>    Name                 Value
>>> [1,] "SpeciesCommon"      "Human"
>>> [2,] "SpeciesScientific"  "Homo sapiens"
>>> [3,] "ReactiveCentres"    "N,C,C,C,+H,O,C,C,C,C,O,H"
>>> [4,] "BondInvolved"       "C-H"
>>> [5,] "EzCatDBID"          "S00343"
>>> [6,] "BondFormed"         "O-H,O-H"
>>> [7,] "Bond"               "255B"
>>> [8,] "Cofactors"          "CuII,CU,501,A,CuII,CU,502,A"
>>> [9,] "CatalyticSwissProt" "P25006"
>>> [10,] "SpeciesScientific"  "Achromobacter\ncycloclastes"
>>> [11,] "SpeciesCommon"      "Bacteria"
>>> [12,] "Reactive"           "Ce+"
>>>
>>> # Step 5: Subset out the values of the SpeciesScientific variables
>>> subset(as.data.frame(w), Name == 'SpeciesScientific', select = 'Value')
>>>                       Value
>>> 2                 Homo sapiens
>>> 10 Achromobacter\ncycloclastes
>>>
>>>
>>> One possible 'advantage' of this approach is that if you have a number
>>> of string records of this type, you can create nested lists for each
>>> string and then manipulate the lists to get what you need. Hopefully
>>> you can use some of these ideas for other purposes as well.
>>>
>>> Dennis
>>>
>>>
>>>
>>> On Wed, Apr 20, 2011 at 10:17 AM, Neeti <nikkihathi at gmail.com> wrote:
>>>
>>>> Hi ALL,
>>>>
>>>> I have a very simple question about pattern matching. Could anyone tell
>>>> me how I can use R to retrieve a string pattern from a text file? For
>>>> example, my file contains the following information:
>>>>
>>>> SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
>>>> H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
>>>> 255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
>>>> eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+
>>>>
>>>> and I want to extract the “SpeciesScientific=(?)” information from this
>>>> file. The problem is in the 3rd line, where the word SpeciesScientific is
>>>> split by a +.
>>>>
>>>> Could anyone help me please?
>>>> Thank you
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://r.789695.n4.nabble.com/Pattern-match-tp3463625p3463625.html
>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>
>
> David Winsemius, MD
> West Hartford, CT
>
>
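
Dennis's reply mentions that a regular expression might do the job more directly. The following is only a minimal sketch of that idea, not anything posted in the thread. It assumes the whole record is already in one character string (the sample `u` below is a shortened, made-up variant of the quoted string) and that a species name never contains a closing parenthesis.

```r
# Shortened sample modelled on the quoted string; the embedded '\n' mimics a
# line break that landed inside a value.
u <- paste0(
  "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);",
  "Cofactors=(Cu(II),CU,501,A);",
  "SpeciesScientific=(Achromobacter\ncycloclastes);SpeciesCommon=(Bacteria)"
)

# Find every 'SpeciesScientific=(' followed by everything up to the next ')'.
# The negated class [^)] also matches a newline, so a value broken across
# lines is still captured. Assumption: the value itself contains no ')'.
hits <- regmatches(u, gregexpr("SpeciesScientific=\\([^)]*\\)", u))[[1]]

# Strip the field name with its opening paren and the trailing paren.
species <- gsub("^SpeciesScientific=\\(|\\)$", "", hits)

# Collapse any run of whitespace (including '\n') to a single space.
species <- gsub("[[:space:]]+", " ", species)

print(species)
```

Unlike stripping every parenthesis from every field, this leaves values such as `Cu(II)` untouched, because only the one field of interest is processed.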
-------------- next part --------------
--
$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION
lyticCATH=(3.40.50.360);BondOrderChanged=(C-N,1,C=N,2,C=C,2,C-C,1,C-C,1,C=C,2,C-+
C,1,C=C,2,C=C,2,C-C,1,C-C,1,C=C,2,C=O,2,C-O,1,C=O,2,C-O,1);CatalyticResidues=(Gl+
y149A,Tyr155A,His161A);Cofactors=(FAD,FAD,601,none);CatalyticSwissProt=(P15559);+
SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+

--
$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION
$DATUM CatalyticCATH=(2.60.40.420);CatalyticResidues=(Asp98A,His135A,Cys136A,His+
255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+
ntres=(N,O,H,Cu);BondFormed=(O-H);BondCleaved=(O-N);PreviousEC=(1.7.99.3,1.9.3.2+
);Return=(Yes);CreatedBy=(GLH,GJB,DEA);DLU=(24102008);MID=(M0004);KEGG=(R00785).

--
$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION
$DATUM OverallComment=(The reference states that this mechanism was elucidated a+
t low pH. This enzyme specifically removes basic or hydrophobic amino acid resid+
ues from the C-terminus of the peptide substrate.);CatalyticCATH=(3.40.50.1820);+
CatalyticResidues=(Gly53A,Ser146A,Tyr147A,Asp338B,His397B);CatalyticSwissProt=(P+
08819);SpeciesCommon=(Wheat);SpeciesScientific=(Triticum aestivum);ReactiveCentr+
es=(N,H,O,C);EzCatDBID=(S00374);BondFormed=(N-H,C-O);BondCleaved=(C-N,O-H);Retur+
n=(Yes);DLU=(24102008);MID=(M0005);CreatedBy=(GLH,GJB,DEA).
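
In the attached sample, a trailing `+` appears to mark a line that continues on the next one, which is why `SpeciesScientific` can be split in two. One possible way to cope without editing the input file is to glue the continued lines back together in R before matching. This is a rough sketch under that assumption; `rdf_lines` is a hypothetical stand-in for the result of `readLines()` on the real file, and the sample lines are shortened from the attachment.

```r
# Stand-in for readLines("<your file>"): two records, each continued over
# several physical lines with a trailing '+'.
rdf_lines <- c(
  "$DATUM CatalyticCATH=(2.60.40.420);CatalyticSwissProt=(P25006);Sp+",
  "eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+",
  "ntres=(N,O,H,Cu);BondFormed=(O-H).",
  "$DATUM CatalyticSwissProt=(P08819);SpeciesCommon=(Wheat);SpeciesScientific=(Tri+",
  "ticum aestivum);ReactiveCentres=(N,H,O,C)."
)

# TRUE for every physical line that is continued on the next one.
ends_with_plus <- grepl("[+]$", rdf_lines)

# A new logical line starts at line 1 and after every line NOT ending in '+'.
starts_new <- c(TRUE, !head(ends_with_plus, -1))
group <- cumsum(starts_new)

# Drop the continuation marker and paste each group's pieces together.
pieces <- sub("[+]$", "", rdf_lines)
joined <- unname(vapply(split(pieces, group), paste, character(1), collapse = ""))

# Pull the first SpeciesScientific=(...) field out of each logical line
# (lines without the field are simply dropped by regmatches()).
m <- regmatches(joined, regexpr("SpeciesScientific=\\([^)]*\\)", joined))
species <- gsub("^SpeciesScientific=\\(|\\)$", "", m)

print(species)
```

The sketch treats any `+` at the very end of a physical line as a continuation marker, so a value that legitimately ends a line with `+` would be joined to the following line; whether that can happen depends on the file format.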