[R] Parsing a Simple Chemical Formula

Mon Dec 27 03:42:38 CET 2010

I think the OP had a very limited need, but there is something
more sophisticated that may be of wider interest: "SMILES",
which attempts to capture some structural information about a molecule
in a text string. Reducing pictures to tractable text is an important step
in many analysis efforts, and I was curious what others may be able to say about
R support for things like this.

A quick Google search turned up this:

http://cran.r-project.org/web/packages/rpubchem/rpubchem.pdf

but I wasn't sure whether there are more packages for manipulating
different ball-and-stick collections (the atom and bond descriptions
could just as easily represent any other collection of nodes
and connections).

You can get some idea what this does by typing your favorite chemical
name here,

http://pubchem.ncbi.nlm.nih.gov/

and the entries give something called "Canonical SMILES" structures.
For example,

http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=8030&loc=ec_rcs

IUPAC Name: thiophene
Canonical SMILES: C1=CSC=C1
InChI: InChI=1S/C4H4S/c1-2-4-5-3-1/h1-4H
InChIKey: YTPLMLYBLZKORZ-UHFFFAOYSA-N
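
As a small illustration of how tractable these strings are: in the InChI
string above, the layers are separated by "/" and the molecular formula is
the second one, so base R can pull it out with strsplit. This is only a
sketch; the variable names are mine, and it assumes an InChI string that
actually carries a formula layer.

```r
# InChI string copied from the PubChem entry for thiophene quoted above
inchi <- "InChI=1S/C4H4S/c1-2-4-5-3-1/h1-4H"

# Split on the literal "/" separator; the first layer is the version
# prefix and the second is the molecular formula
layers <- strsplit(inchi, "/", fixed = TRUE)[[1]]
formula <- layers[2]
formula
```

The resulting string is in the same Hill-order style as the formulas
discussed in the thread below, so it could be fed to the same kind of parser.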

> From: hanson at depauw.edu
> To: ggrothendieck at gmail.com
> Date: Sun, 26 Dec 2010 20:01:45 -0500
> CC: r-help at stat.math.ethz.ch
> Subject: Re: [R] Parsing a Simple Chemical Formula
>
> Well let me just say thanks and WOW! Four great ideas, each worthy of
> study and I'll learn several things from each. Interestingly, these
> solutions seem more general and more compact than the solutions I
> found on the 'net using python and perl. More evidence for the power
> of R! A big thanks to each of you! Bryan
>
> On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
>
> > On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson wrote:
> >> Hello R Folks...
> >>
> >> I've been looking around the 'net and I see many complex solutions in
> >> various languages to this question, but I have a pretty simple need
> >> (and I'm not much good at regex). I want to use a chemical formula as a
> >> function argument. The formula would be in "Hill order" which is to list
> >> C, then H, then all other elements in alphabetical order. My example will
> >> have only a limited number of elements, few enough that one can search
> >> directly for each element. So some examples would be C5H12, or C5H12O or
> >> C5H11BrO (note that for oxygen and bromine, O or Br, there is no following
> >> number meaning a 1 is implied).
> >>
> >> Let's say
> >>
> >>> form <- "C5H11BrO"
> >>
> >> I'd like to get the count of each element, so in this case I need to
> >> extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the
> >> molecular weight by multiplying). Sounds pretty simple, but my experiments
> >> with grep and strsplit don't immediately clue me into an obvious solution.
> >> As I said, I don't need a general solution to the problem of calculating
> >> molecular weight from an arbitrary formula, that seems quite challenging,
> >> just a way to convert "form" into a list or data frame which I can then do
> >> the math on.
> >>
> >> Here's hoping this is a simple issue for more experienced R users!
> >> TIA,
> >
> > This can be done by strapply in gsubfn. It matches the regular
> > expression to the target string passing the back references (the
> > parenthesized portions of the regular expression) through a specified
> > function as successive arguments.
> >
> > Thus the first arg is form, your input string. The second arg is the
> > regular expression which matches an upper case letter optionally
> > followed by lower case letters and all that is optionally followed by
> > digits. The third arg is a function shown in a formula
> > representation. strapply passes the back references (i.e. the portions
> > within parentheses) to the function as the two arguments; when no
> > digits follow an element, the second argument is the empty string and
> > the function substitutes a count of 1. Next, simplify is another
> > function in formula notation which turns the result into a matrix and
> > then a data frame. Finally we make the second column of the data frame
> > numeric.
> >
> > library(gsubfn)
> >
> > DF <- strapply(form,
> >     "([A-Z][a-z]*)(\\d*)",
> >     ~ c(..1, if (nchar(..2)) ..2 else 1),
> >     simplify = ~ as.data.frame(t(matrix(..1, 2)),
> >                                stringsAsFactors = FALSE))
> > DF[[2]] <- as.numeric(DF[[2]])
> >
> > DF looks like this:
> >
> >> DF
> >   V1 V2
> > 1  C  5
> > 2  H 11
> > 3 Br  1
> > 4  O  1
> >
> >
> >
> > --
> > Statistics & Software Consulting
> > GKX Group, GKX Associates Inc.
> > tel: 1-877-GKX-GROUP
> > email: ggrothendieck at gmail.com
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
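
For anyone without gsubfn installed, the same split can be sketched in base R
with gregexpr and regmatches, followed by the molecular weight step that Bryan
mentioned. This is not Gabor's method, just an alternative under some
assumptions: the names tokens, element, n, aw and mw are mine, and the atomic
weights are rounded values I typed in for these four elements only.

```r
form <- "C5H11BrO"

# Each token is an upper case letter, optional lower case letters and
# optional digits (the same shape of pattern as in the strapply call)
tokens <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]

# Element symbol: the token with any trailing digits removed
element <- sub("[0-9]*$", "", tokens)

# Count: the token with the leading letters removed; empty means 1
n <- sub("^[A-Za-z]+", "", tokens)
n[n == ""] <- "1"
n <- as.numeric(n)

DF <- data.frame(element = element, n = n, stringsAsFactors = FALSE)

# ASSUMPTION: approximate atomic weights, hard coded for illustration
aw <- c(C = 12.011, H = 1.008, Br = 79.904, O = 15.999)

# Molecular weight: sum over elements of (atomic weight * count)
mw <- sum(aw[DF$element] * DF$n)
mw
```

Like the original request, this only handles flat formulas such as C5H11BrO;
parentheses, charges or hydrates would need a real parser.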