[R] Parsing a Simple Chemical Formula

Mon Dec 27 04:36:49 CET 2010

Hi David & others...

I did find the function you recommended, plus, it's even easier (but a  
little hidden in the doc): >element(form, "mass").  But, this uses the  
atomic masses from the periodic table, which are weighted averages of  
the isotopes of each element.  What I'm doing actually involves mass  
spectrometry, so I need the isotope masses, which are integers (think  
12C, 13C, 14C, but the periodic table says 12.011 reflecting the  
relative abundances).  I used Gabor's solution and got my little  
function humming.  Plus, I have several things to read through from  
the various recommendations.
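
For anyone finding this thread later, here is a minimal base-R sketch of the isotope-based calculation described above. It is not Gabor's strapply code or anything from CHNOSZ. The names nominal_mass, parse_formula and nominal_formula_mass are made up for illustration, and the small lookup table of integer mass numbers (lightest common isotope of each element) is an assumption that would need to be extended or swapped for the isotopes of interest.

```r
# Nominal (integer) mass numbers for one isotope of each element.
# Illustrative assumption: lightest common isotope; not from any package.
nominal_mass <- c(C = 12, H = 1, N = 14, O = 16, Cl = 35, Br = 79)

# Split a simple formula such as "C5H11BrO" into elements and counts.
# A missing count is taken to mean 1.
parse_formula <- function(form) {
  tokens  <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]
  element <- sub("[0-9]+$", "", tokens)      # drop trailing digits
  count   <- sub("^[A-Za-z]+", "", tokens)   # drop leading letters
  count[count == ""] <- "1"
  data.frame(element = element, count = as.numeric(count),
             stringsAsFactors = FALSE)
}

# Sum count * mass over the elements, failing loudly on unknown symbols.
nominal_formula_mass <- function(form, masses = nominal_mass) {
  parsed  <- parse_formula(form)
  unknown <- setdiff(parsed$element, names(masses))
  if (length(unknown) > 0) {
    stop("no mass listed for: ", paste(unknown, collapse = ", "))
  }
  sum(masses[parsed$element] * parsed$count)
}

nominal_formula_mass("C5H11BrO")
# [1] 166
```

Passing a different masses vector (for example with Br = 81) gives the other isotopologue without touching the parsing step.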

Thanks again, Bryan

On Dec 26, 2010, at 10:21 PM, David Winsemius wrote:

>
> On Dec 26, 2010, at 8:28 PM, Bryan Hanson wrote:
>
>> Thanks Spencer, I'll definitely have a look at this package and
>> its vignettes.  I believe I have looked at it before, but didn't
>> catch it on this particular search.  Bryan
>
> Using the thermo list that the makeup function accesses to get its
> valid atomic symbols, one can arrive at the answer you posited
> would be too difficult in your first posting, the molecular weight
> from the formula:
>
> > str(thermo$element)
> 'data.frame':	130 obs. of  6 variables:
> $ element: chr  "Z" "O" "H" "He" ...
> $ state  : chr  "aq" "gas" "gas" "gas" ...
> $ source : chr  "CWM89" "CWM89" "CWM89" "CWM89" ...
> $ mass   : num  0 16 1.01 4 20.18 ...
> $ s      : num  -15.6 49 31.2 30.2 35 ...
> $ n      : int  1 2 2 1 1 1 1 1 2 2 ...
>
> patts <- paste("^", rownames(makeup(form)), "$", sep="")
> makuform <- makeup(form)
> makuform$amass <- sapply(patts, function(x) {
>   thermo$element[grep(x, thermo$element[[1]])[1], "mass"]
> })
> sum(makuform$amass * makuform$count)
> # [1] 167.0457
>
>>
>> On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:
>>
>>> p.s.  help(pac=CHNOSZ) reveals that this package has 3 vignettes.   
>>> I have not looked at these vignettes, but most vignettes provide  
>>> excellent introductions (though rarely with complete coverage) of  
>>> important capabilities of the package.  (The 'sos' package  
>>> includes a vignette, which exposes more capabilities than the  
>>> example below.)
>>>
>>>
>>> ######################
>>>    Have you considered the 'CHNOSZ' package?
>>>
>>>
>>>> makeup("C5H11BrO" )
>>>    count
>>> C      5
>>> H     11
>>> Br     1
>>> O      1
>>>
>>>
>>>    I found this using the 'sos' package as follows:
>>>
>>>
>>> library(sos)
>>> cf <- ???'chemical formula'
>>> found 21 matches;  retrieving 2 pages
>>> cf
>>>
>>>
>>>    The print method for "cf" opened the results in a web browser,
>>> which showed that the "CHNOSZ" package had 14 of these 21 matches,
>>> and the other 7 were in 7 different packages.  Moreover, the
>>> "CHNOSZ" package is devoted to "Chemical Thermodynamics and
>>> Activity Diagrams" and provides many more capabilities that might
>>> interest you.
>>>
>>>
>>>    Hope this helps.
>>>    Spencer
>>>
>>>
>>> On 12/26/2010 5:01 PM, Bryan Hanson wrote:
>>>> Well let me just say thanks and WOW!  Four great ideas, each worthy of
>>>> study and I'll learn several things from each.  Interestingly, these
>>>> solutions seem more general and more compact than the solutions I
>>>> found on the 'net using python and perl.  More evidence for the power
>>>> of R!  A big thanks to each of you!  Bryan
>>>>
>>>> On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
>>>>
>>>>> On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
>>>>>> Hello R Folks...
>>>>>>
>>>>>> I've been looking around the 'net and I see many complex solutions in
>>>>>> various languages to this question, but I have a pretty simple need
>>>>>> (and I'm not much good at regex).  I want to use a chemical formula as
>>>>>> a function argument.  The formula would be in "Hill order" which is to
>>>>>> list C, then H, then all other elements in alphabetical order.  My
>>>>>> example will have only a limited number of elements, few enough that
>>>>>> one can search directly for each element.  So some examples would be
>>>>>> C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or
>>>>>> Br, there is no following number meaning a 1 is implied).
>>>>>>
>>>>>> Let's say
>>>>>>
>>>>>>> form <- "C5H11BrO"
>>>>>>
>>>>>> I'd like to get the count of each element, so in this case I need to
>>>>>> extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the
>>>>>> molecular weight by multiplying).  Sounds pretty simple, but my
>>>>>> experiments with grep and strsplit don't immediately clue me into an
>>>>>> obvious solution.  As I said, I don't need a general solution to the
>>>>>> problem of calculating molecular weight from an arbitrary formula, that
>>>>>> seems quite challenging, just a way to convert "form" into a list or
>>>>>> data frame which I can then do the math on.
>>>>>>
>>>>>> Here's hoping this is a simple issue for more experienced R users!
>>>>>> TIA,
>>>>>
>>>>> This can be done by strapply in gsubfn.  It matches the regular
>>>>> expression to the target string, passing the back references (the
>>>>> parenthesized portions of the regular expression) through a specified
>>>>> function as successive arguments.
>>>>>
>>>>> Thus the first arg is form, your input string.  The second arg is the
>>>>> regular expression which matches an upper case letter optionally
>>>>> followed by lower case letters and all that is optionally followed by
>>>>> digits.  The third arg is a function shown in a formula
>>>>> representation.  strapply passes the back references (i.e. the portions
>>>>> within parentheses) to the function as the two arguments.  Next,
>>>>> simplify is another function in formula notation which turns the
>>>>> result into a matrix and then a data frame.  Finally we make the
>>>>> second column of the data frame numeric.
>>>>>
>>>>> library(gsubfn)
>>>>>
>>>>> DF <- strapply(form,
>>>>>   "([A-Z][a-z]*)(\\d*)",
>>>>>   ~ c(..1, if (nchar(..2)) ..2 else 1),
>>>>>   simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
>>>>> DF[[2]] <- as.numeric(DF[[2]])
>>>>>
>>>>> DF looks like this:
>>>>>
>>>>>> DF
>>>>>   V1 V2
>>>>> 1  C  5
>>>>> 2  H 11
>>>>> 3 Br  1
>>>>> 4  O  1
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Statistics & Software Consulting
>>>>> GKX Group, GKX Associates Inc.
>>>>> tel: 1-877-GKX-GROUP
>>>>> email: ggrothendieck at gmail.com
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>
>>>
>>> -- 
>>> Spencer Graves, PE, PhD
>>> President and Chief Operating Officer
>>> Structure Inspection and Monitoring, Inc.
>>> 751 Emerson Ct.
>>> San José, CA 95126
>>> ph:  408-655-4567
>>>
>>
>
> David Winsemius, MD
> West Hartford, CT
>