[R] Parsing a Simple Chemical Formula

Spencer Graves spencer.graves at structuremonitoring.com
Mon Dec 27 02:12:39 CET 2010

       Have you considered the 'CHNOSZ' package?

 > makeup("C5H11BrO" )
C      5
H     11
Br     1
O      1
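
       Once you have the counts, the molecular weight is a weighted sum. 
Here is a minimal base-R sketch.  It assumes the counts sit in a named 
numeric vector (the name "counts" is hypothetical, and this is not 
necessarily the structure that makeup() returns).  The small 
"atomic_weight" lookup is also my own and covers only the elements in 
this example:

```r
# Element counts as a named numeric vector, mirroring the output shown above
# (hypothetical object; adapt to whatever structure makeup() gives you).
counts <- c(C = 5, H = 11, Br = 1, O = 1)

# Standard atomic weights in g/mol, limited to the elements in this example.
atomic_weight <- c(C = 12.011, H = 1.008, Br = 79.904, O = 15.999)

# Look up each element's weight by name, multiply by its count, and sum.
mol_weight <- sum(counts * atomic_weight[names(counts)])
mol_weight
```

       Indexing "atomic_weight" by names(counts) keeps the two vectors 
aligned even if the elements are listed in a different order.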

       I found this using the 'sos' package as follows:

cf <- ???'chemical formula'
found 21 matches;  retrieving 2 pages

       The print method for "cf" opened the results in a web browser, 
which showed that the "CHNOSZ" package had 14 of these 21 matches, and 
the other 7 were in 7 different packages.  Moreover, the "CHNOSZ" 
package is devoted to "Chemical Thermodynamics and Activity Diagrams" 
and provides many more capabilities that might interest you.

       Hope this helps.

On 12/26/2010 5:01 PM, Bryan Hanson wrote:
> Well let me just say thanks and WOW!  Four great ideas, each worthy of 
> study and I'll learn several things from each.  Interestingly, these 
> solutions seem more general and more compact than the solutions I 
> found on the 'net using python and perl.  More evidence for the power 
> of R!  A big thanks to each of you!  Bryan
> On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
>> On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
>>> Hello R Folks...
>>> I've been looking around the 'net and I see many complex solutions in
>>> various languages to this question, but I have a pretty simple need (and
>>> I'm not much good at regex).  I want to use a chemical formula as a
>>> function argument.  The formula would be in "Hill order", which is to
>>> list C, then H, then all other elements in alphabetical order.  My
>>> example will have only a limited number of elements, few enough that one
>>> can search directly for each element.  So some examples would be C5H12,
>>> or C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there
>>> is no following number, meaning a 1 is implied).
>>> Let's say
>>> > form <- "C5H11BrO"
>>> I'd like to get the count of each element, so in this case I need to
>>> extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the
>>> molecular weight by multiplying).  Sounds pretty simple, but my
>>> experiments with grep and strsplit don't immediately clue me in to an
>>> obvious solution.  As I said, I don't need a general solution to the
>>> problem of calculating molecular weight from an arbitrary formula, which
>>> seems quite challenging, just a way to convert "form" into a list or
>>> data frame which I can then do the math on.
>>> Here's hoping this is a simple issue for more experienced R users!
>>> TIA,
>> This can be done by strapply in gsubfn.  It matches the regular
>> expression to the target string passing the back references (the
>> parenthesized portions of the regular expression) through a specified
>> function as successive arguments.
>> Thus the first arg is form, your input string.  The second arg is the
>> regular expression which matches an upper case letter optionally
>> followed by lower case letters and all that is optionally followed by
>> digits.  The third arg is a function shown in a formula
>> representation. strapply passes the back references (i.e. the portions
>> within parentheses) to the function as the two arguments.  The
>> simplify argument is another function in formula notation, which turns
>> the result into a matrix and then a data frame.  Finally we make the
>> second column of the data frame numeric.
>> library(gsubfn)
>> DF <- strapply(form,
>>   "([A-Z][a-z]*)(\\d*)",
>>   ~ c(..1, if (nchar(..2)) ..2 else 1),
>>   simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
>> DF[[2]] <- as.numeric(DF[[2]])
>> DF looks like this:
>>> DF
>>  V1 V2
>> 1  C  5
>> 2  H 11
>> 3 Br  1
>> 4  O  1
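
       For comparison, the same split can be sketched in base R without 
gsubfn.  This is only an illustration under the assumptions stated in 
the thread: each element is one upper case letter optionally followed by 
lower case letters, and a missing number means 1.  The names "tokens", 
"element", "count" and "DF2" are my own:

```r
form <- "C5H11BrO"

# Pull out each token: one capital letter, optional lower case letters,
# optional trailing digits.
tokens <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]

# Separate the element symbol from its count.
element <- sub("[0-9]+$", "", tokens)
count   <- sub("^[A-Za-z]+", "", tokens)
count[count == ""] <- "1"   # no number after the symbol implies a count of 1

DF2 <- data.frame(element = element, count = as.numeric(count),
                  stringsAsFactors = FALSE)
DF2
```

       Like the strapply version, this does not handle parentheses or 
other nested groups in a formula.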
>> -- 
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Spencer Graves, PE, PhD
President and Chief Operating Officer
Structure Inspection and Monitoring, Inc.
751 Emerson Ct.
San José, CA 95126
ph:  408-655-4567
