[R] Parsing a Simple Chemical Formula

Mon Dec 27 02:28:17 CET 2010

Thanks Spencer, I'll definitely have a look at this package and its
vignettes.  I believe I have looked at it before, but didn't catch it
on this particular search.  Bryan

On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:

> p.s.  help(pac=CHNOSZ) reveals that this package has 3 vignettes.  I  
> have not looked at these vignettes, but most vignettes provide  
> excellent introductions (though rarely with complete coverage) of  
> important capabilities of the package.  (The 'sos' package includes  
> a vignette, which exposes more capabilities than the example below.)
>
>
> ######################
>      Have you considered the 'CHNOSZ' package?
>
>
>> makeup("C5H11BrO" )
>   count
> C      5
> H     11
> Br     1
> O      1
>
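A count table like the makeup() output above is all that is needed for the molecular weight mentioned in the original question. The following is only a minimal base-R sketch: the counts are typed in by hand rather than taken from CHNOSZ, and the atomic masses are rounded standard values that I am assuming here, not something supplied by the package.

```r
# Element counts shaped like the makeup() output above (entered by hand here).
counts <- c(C = 5, H = 11, Br = 1, O = 1)

# Approximate standard atomic masses in g/mol (assumed, rounded values).
atomic_mass <- c(C = 12.011, H = 1.008, O = 15.999, Br = 79.904)

# Look up each element's mass by name, multiply by its count, and sum.
mol_weight <- sum(counts * atomic_mass[names(counts)])
mol_weight
```

Indexing atomic_mass by names(counts) keeps the two vectors aligned even though they list the elements in a different order.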
>
>      I found this using the 'sos' package as follows:
>
>
> library(sos)
> cf <- ???'chemical formula'
> found 21 matches;  retrieving 2 pages
> cf
>
>
>      The print method for "cf" opened the results in a web browser,
> which showed that the "CHNOSZ" package had 14 of these 21 matches,
> and the other 7 were in 7 different packages.  Moreover, the
> "CHNOSZ" package is devoted to "Chemical Thermodynamics and Activity
> Diagrams" and provides many more capabilities that might interest you.
>
>
>      Hope this helps.
>      Spencer
>
>
> On 12/26/2010 5:01 PM, Bryan Hanson wrote:
>> Well let me just say thanks and WOW!  Four great ideas, each worthy of
>> study and I'll learn several things from each.  Interestingly, these
>> solutions seem more general and more compact than the solutions I
>> found on the 'net using python and perl.  More evidence for the power
>> of R!  A big thanks to each of you!  Bryan
>>
>> On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
>>
>>> On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
>>>> Hello R Folks...
>>>>
>>>> I've been looking around the 'net and I see many complex solutions in
>>>> various languages to this question, but I have a pretty simple need (and
>>>> I'm not much good at regex).  I want to use a chemical formula as a
>>>> function argument.  The formula would be in "Hill order" which is to list
>>>> C, then H, then all other elements in alphabetical order.  My example
>>>> will have only a limited number of elements, few enough that one can
>>>> search directly for each element.  So some examples would be C5H12, or
>>>> C5H12O or C5H11BrO (note that for oxygen and bromine, O or Br, there is
>>>> no following number meaning a 1 is implied).
>>>>
>>>> Let's say
>>>>
>>>>> form <- "C5H11BrO"
>>>>
>>>> I'd like to get the count of each element, so in this case I need to
>>>> extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the
>>>> molecular weight by multiplying).  Sounds pretty simple, but my
>>>> experiments with grep and strsplit don't immediately clue me in to an
>>>> obvious solution.  As I said, I don't need a general solution to the
>>>> problem of calculating molecular weight from an arbitrary formula, which
>>>> seems quite challenging, just a way to convert "form" into a list or
>>>> data frame which I can then do the math on.
>>>>
>>>> Here's hoping this is a simple issue for more experienced R users!
>>>> TIA,
>>>
>>> This can be done with strapply in the gsubfn package.  It matches the
>>> regular expression against the target string and passes the back
>>> references (the parenthesized portions of the regular expression) to a
>>> specified function as successive arguments.
>>>
>>> Thus the first arg is form, your input string.  The second arg is the
>>> regular expression, which matches an upper case letter optionally
>>> followed by lower case letters, all of which is optionally followed by
>>> digits.  The third arg is a function written in formula notation;
>>> strapply passes the two back references to it as its two arguments, and
>>> it substitutes 1 when the digits are missing.  The simplify arg is
>>> another function in formula notation, which turns the result into a
>>> matrix and then a data frame.  Finally we make the second column of the
>>> data frame numeric.
>>>
>>> library(gsubfn)
>>>
>>> DF <- strapply(form,
>>>  "([A-Z][a-z]*)(\\d*)",
>>>  ~ c(..1, if (nchar(..2)) ..2 else 1),
>>>  simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
>>> DF[[2]] <- as.numeric(DF[[2]])
>>>
>>> DF looks like this:
>>>
>>>> DF
>>> V1 V2
>>> 1  C  5
>>> 2  H 11
>>> 3 Br  1
>>> 4  O  1
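For comparison, the same regular-expression idea can be sketched in base R without gsubfn. This is only an illustration under the same assumptions as the strapply solution above (a flat formula with no parentheses or charges); parse_formula and DF2 are hypothetical names introduced here, not part of any package.

```r
# Base-R sketch of the same idea: one regex match per element token.
parse_formula <- function(form) {
  # Each token: an upper-case letter, optional lower-case letters, optional digits.
  tokens <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]
  element <- sub("[0-9]+$", "", tokens)    # drop the trailing digits
  count <- sub("^[A-Za-z]+", "", tokens)   # drop the leading letters
  count[count == ""] <- "1"                # a missing count implies 1
  data.frame(element = element, count = as.numeric(count),
             stringsAsFactors = FALSE)
}

DF2 <- parse_formula("C5H11BrO")
DF2
```

The result has the same four rows as DF above, with named columns element and count instead of V1 and V2.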
>>>
>>>
>>>
>>> --
>>> Statistics & Software Consulting
>>> GKX Group, GKX Associates Inc.
>>> tel: 1-877-GKX-GROUP
>>> email: ggrothendieck at gmail.com
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
>
> -- 
> Spencer Graves, PE, PhD
> President and Chief Operating Officer
> Structure Inspection and Monitoring, Inc.
> 751 Emerson Ct.
> San José, CA 95126
> ph:  408-655-4567
>