[R] Parsing a Simple Chemical Formula

Mon Dec 27 13:54:49 CET 2010

----------------------------------------
> Date: Sun, 26 Dec 2010 20:24:23 -0800
> From: spencer.graves at structuremonitoring.com
> To: hanson at depauw.edu
> CC: r-help at stat.math.ethz.ch
> Subject: Re: [R] Parsing a Simple Chemical Formula
>
> Mike Marchywka's post mentioned a CRAN package, "rpubchem",
> missed by my search for "chemical formula". A further search for
> "chemical" and "chemistry" still missed it. "compound" found it.
> Adding "compounds" and combining them with "union" produced a list of
> 564 links in 219 packages; 7 of the help pages were for "rpubchem".
> The package with the most matches is "seacarb" (seawater carbonate
> chemistry with R: 21 matches), followed by "CHNOSZ", previously
> mentioned (19 matches). "rpubchem" is the 22nd package on this list (5
> matches, with a max score of 32, less than the max score of 2 other
> packages with 5 matches).

This is why I always like to have ASCII help text that I can throw into
a flat file and search myself with bash scripts, outside of whatever program
I'm trying to figure out. The underlying problem is looking for things
whose names I don't know. The guy who wanted to optimize his double
loop, but apparently didn't know the names of the equivalent matrix
operations, had a more elaborate but similar problem.

Generally, for chem or med tools, NCBI is a good place to hunt for vocabulary
and facilities. I have various scripts for sorting vocabularies but still
need more vocabularies (for example, I'd like to go to IUPAC and download
a list of systematic and trivial chemical names). In any case, knowing the
vocabulary can be a big step toward finding things in help or "real" literature.
catch ya on the flip flop good buddy LOL.
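
P.S. On the integer isotope masses Bryan mentions in the quoted text below:
here is a rough base-R sketch of the parsing step plus a nominal-mass sum,
in case it is useful. It is not Gabor's gsubfn solution, just the same
regular-expression idea done with regmatches(). The names parse_formula,
nominal_mass and mass_number are made up for illustration, and the small
table of mass numbers (most abundant isotope of each element) is only an
assumption about what such a lookup might contain.

```r
# Parse a simple formula such as "C5H11BrO" into element symbols and counts,
# using base R only (no gsubfn).
parse_formula <- function(form) {
  # Each token: one upper-case letter, optional lower-case letters, optional digits
  tokens <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]
  element <- sub("[0-9]+$", "", tokens)     # strip trailing digits
  digits <- sub("^[A-Za-z]+", "", tokens)   # strip the element symbol
  digits[digits == ""] <- "1"               # a missing count implies 1
  data.frame(element = element, count = as.numeric(digits),
             stringsAsFactors = FALSE)
}

# Illustrative (hypothetical) lookup: mass numbers of the most abundant
# isotopes. Extend or replace as needed.
mass_number <- c(C = 12, H = 1, O = 16, Br = 79)

nominal_mass <- function(form, masses = mass_number) {
  DF <- parse_formula(form)
  unknown <- setdiff(DF$element, names(masses))
  if (length(unknown) > 0)
    stop("no mass number for: ", paste(unknown, collapse = ", "))
  sum(masses[DF$element] * DF$count)
}

nominal_mass("C5H11BrO")
# [1] 166
```

Swapping in a different named vector (say, with Br = 81) would give the
mass for another isotope, which is about all a mass-spec lookup needs for
formulas this simple.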

>
>
> Spencer
>
>
> On 12/26/2010 7:36 PM, Bryan Hanson wrote:
> > Hi David & others...
> >
> > I did find the function you recommended, plus, it's even easier (but a
> > little hidden in the doc): >element(form, "mass"). But, this uses the
> > atomic masses from the periodic table, which are weighted averages of
> > the isotopes of each element. What I'm doing actually involves mass
> > spectrometry, so I need the isotope masses, which are integers (think
> > 12C, 13C, 14C, but the periodic table says 12.011 reflecting the
> > relative abundances). I used Gabor's solution and got my little
> > function humming. Plus, I have several things to read through from
> > the various recommendations.
> >
> > Thanks again, Bryan
> >
> > On Dec 26, 2010, at 10:21 PM, David Winsemius wrote:
> >
> >>
> >> On Dec 26, 2010, at 8:28 PM, Bryan Hanson wrote:
> >>
> >>> Thanks Spencer, I'll definitely have a look at this package and its
> >>> vignettes. I believe I have looked at it before, but didn't catch
> >>> it on this particular search. Bryan
> >>
> >> Using the thermo list that the makeup function accesses to get its
> >> valid atomic symbols, one can arrive at the answer you posited
> >> would be too difficult in your first posting, the molecular weight
> >> from the formula:
> >>
> >> > str(thermo$element)
> >> 'data.frame': 130 obs. of 6 variables:
> >> $ element: chr "Z" "O" "H" "He" ...
> >> $ state : chr "aq" "gas" "gas" "gas" ...
> >> $ source : chr "CWM89" "CWM89" "CWM89" "CWM89" ...
> >> $ mass : num 0 16 1.01 4 20.18 ...
> >> $ s : num -15.6 49 31.2 30.2 35 ...
> >> $ n : int 1 2 2 1 1 1 1 1 2 2 ...
> >>
> >> # Look up each element's atomic mass by exact symbol match
> >> makuform <- makeup(form)
> >> patts <- paste("^", rownames(makuform), "$", sep="")
> >> makuform$amass <- sapply(patts, function(x) {
> >>   thermo$element[grep(x, thermo$element[[1]])[1], "mass"]
> >> })
> >> sum(makuform$amass * makuform$count)
> >> # [1] 167.0457
> >>
> >>>
> >>> On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:
> >>>
> >>>> p.s. help(pac=CHNOSZ) reveals that this package has 3 vignettes.
> >>>> I have not looked at these vignettes, but most vignettes provide
> >>>> excellent introductions (though rarely with complete coverage) of
> >>>> important capabilities of the package. (The 'sos' package includes
> >>>> a vignette, which exposes more capabilities than the example below.)
> >>>>
> >>>>
> >>>> ######################
> >>>> Have you considered the 'CHNOSZ' package?
> >>>>
> >>>>
> >>>> > makeup("C5H11BrO")
> >>>>    count
> >>>> C      5
> >>>> H     11
> >>>> Br     1
> >>>> O      1
> >>>>
> >>>>
> >>>> I found this using the 'sos' package as follows:
> >>>>
> >>>>
> >>>> library(sos)
> >>>> cf <- ???'chemical formula'
> >>>> found 21 matches; retrieving 2 pages
> >>>> cf
> >>>>
> >>>>
> >>>> The print method for "cf" opened the results in a web browser,
> >>>> which showed that the "CHNOSZ" package had 14 of these 21 matches,
> >>>> and the other 7 were in 7 different packages. Moreover, the
> >>>> "CHNOSZ" package is devoted to "Chemical Thermodynamics and
> >>>> Activity Diagrams" and provides many more capabilities that might
> >>>> interest you.
> >>>>
> >>>>
> >>>> Hope this helps.
> >>>> Spencer
> >>>>
> >>>>
> >>>> On 12/26/2010 5:01 PM, Bryan Hanson wrote:
> >>>>> Well let me just say thanks and WOW! Four great ideas, each worthy of
> >>>>> study and I'll learn several things from each. Interestingly, these
> >>>>> solutions seem more general and more compact than the solutions I
> >>>>> found on the 'net using python and perl. More evidence for the power
> >>>>> of R! A big thanks to each of you! Bryan
> >>>>>
> >>>>> On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
> >>>>>
> >>>>>> On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson wrote:
> >>>>>>> Hello R Folks...
> >>>>>>>
> >>>>>>> I've been looking around the 'net and I see many complex solutions
> >>>>>>> in various languages to this question, but I have a pretty simple
> >>>>>>> need (and I'm not much good at regex). I want to use a chemical
> >>>>>>> formula as a function argument. The formula would be in "Hill
> >>>>>>> order" which is to list C, then H, then all other elements in
> >>>>>>> alphabetical order. My example will have only a limited number of
> >>>>>>> elements, few enough that one can search directly for each element.
> >>>>>>> So some examples would be C5H12, or C5H12O or C5H11BrO (note that
> >>>>>>> for oxygen and bromine, O or Br, there is no following number
> >>>>>>> meaning a 1 is implied).
> >>>>>>>
> >>>>>>> Let's say
> >>>>>>>
> >>>>>>> > form <- "C5H11BrO"
> >>>>>>>
> >>>>>>> I'd like to get the count of each element, so in this case I need
> >>>>>>> to extract C and 5, H and 11, Br and 1, O and 1 (I want to
> >>>>>>> calculate the molecular weight by multiplying). Sounds pretty
> >>>>>>> simple, but my experiments with grep and strsplit don't immediately
> >>>>>>> clue me into an obvious solution. As I said, I don't need a general
> >>>>>>> solution to the problem of calculating molecular weight from an
> >>>>>>> arbitrary formula, that seems quite challenging, just a way to
> >>>>>>> convert "form" into a list or data frame which I can then do the
> >>>>>>> math on.
> >>>>>>>
> >>>>>>> Here's hoping this is a simple issue for more experienced R users!
> >>>>>>> TIA,
> >>>>>>
> >>>>>> This can be done by strapply in gsubfn. It matches the regular
> >>>>>> expression to the target string, passing the back references (the
> >>>>>> parenthesized portions of the regular expression) through a
> >>>>>> specified function as successive arguments.
> >>>>>>
> >>>>>> Thus the first arg is form, your input string. The second arg is
> >>>>>> the regular expression, which matches an upper case letter
> >>>>>> optionally followed by lower case letters, and all that is
> >>>>>> optionally followed by digits. The third arg is a function shown in
> >>>>>> a formula representation. strapply passes the back references (i.e.
> >>>>>> the portions within parentheses) to the function as the two
> >>>>>> arguments. Next, simplify is another function in formula notation
> >>>>>> which turns the result into a matrix and then a data frame. Finally
> >>>>>> we make the second column of the data frame numeric.
> >>>>>>
> >>>>>> library(gsubfn)
> >>>>>>
> >>>>>> DF <- strapply(form,
> >>>>>>     "([A-Z][a-z]*)(\\d*)",
> >>>>>>     ~ c(..1, if (nchar(..2)) ..2 else 1),
> >>>>>>     simplify = ~ as.data.frame(t(matrix(..1, 2)),
> >>>>>>         stringsAsFactors = FALSE))
> >>>>>> DF[[2]] <- as.numeric(DF[[2]])
> >>>>>>
> >>>>>> DF looks like this:
> >>>>>>
> >>>>>> > DF
> >>>>>>   V1 V2
> >>>>>> 1  C  5
> >>>>>> 2  H 11
> >>>>>> 3 Br  1
> >>>>>> 4  O  1
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Statistics & Software Consulting
> >>>>>> GKX Group, GKX Associates Inc.
> >>>>>> tel: 1-877-GKX-GROUP
> >>>>>> email: ggrothendieck at gmail.com
> >>>>>
> >>>>> ______________________________________________
> >>>>> R-help at r-project.org mailing list
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>> PLEASE do read the posting guide
> >>>>> http://www.R-project.org/posting-guide.html
> >>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>>
> >>>>>
> >>>
> >>
> >> David Winsemius, MD
> >> West Hartford, CT
> >>
>