[R] Parsing a Simple Chemical Formula

David Winsemius dwinsemius at comcast.net
Mon Dec 27 04:21:40 CET 2010


On Dec 26, 2010, at 8:28 PM, Bryan Hanson wrote:

> Thanks Spencer, I'll definitely have a look at this package and its  
> vignettes.  I believe I have looked at it before, but didn't catch  
> it on this particular search.  Bryan

Using the thermo list, which the makeup function consults for its  
valid atomic symbols, one can arrive at the answer you posited would  
be too difficult in your first posting: the molecular weight from the  
formula.

 > str(thermo$element)
'data.frame':	130 obs. of  6 variables:
  $ element: chr  "Z" "O" "H" "He" ...
  $ state  : chr  "aq" "gas" "gas" "gas" ...
  $ source : chr  "CWM89" "CWM89" "CWM89" "CWM89" ...
  $ mass   : num  0 16 1.01 4 20.18 ...
  $ s      : num  -15.6 49 31.2 30.2 35 ...
  $ n      : int  1 2 2 1 1 1 1 1 2 2 ...

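library(CHNOSZ)
form <- "C5H11BrO"   # the example formula from the original post
makuform <- makeup(form)
# Anchor each symbol so that e.g. "C" cannot also match "Ca" or "Cl"
patts <- paste("^", rownames(makuform), "$", sep="")
makuform$amass <- sapply(patts, function(x)
    thermo$element[grep(x, thermo$element[[1]])[1], "mass"])
sum(makuform$amass * makuform$count)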
# [1] 167.0457
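
For anyone who would rather not depend on a package, the same two steps (split the formula into symbols and counts, then look up a mass for each symbol) can be sketched in base R. This is only an illustration under stated assumptions: the names atomic_mass, parse_formula and molecular_weight are made up for this example, the four masses are typed in by hand rather than taken from thermo$element, and the parser handles only flat formulas with no parentheses, charges or hydrates.

```r
# Hypothetical lookup table of atomic masses (g/mol), entered by hand.
# Only the elements needed for the example are included.
atomic_mass <- c(C = 12.011, H = 1.00794, Br = 79.904, O = 15.9994)

# Split a flat formula such as "C5H11BrO" into symbols and counts.
# Assumes no parentheses, charges, or hydrates.
parse_formula <- function(form) {
  # A token is an upper-case letter, optional lower-case letters,
  # and optional trailing digits.
  tokens <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]
  symbol <- sub("[0-9]+$", "", tokens)     # strip the trailing digits
  digits <- sub("^[A-Za-z]+", "", tokens)  # strip the leading letters
  digits[digits == ""] <- "1"              # a missing count means 1
  data.frame(element = symbol, count = as.numeric(digits),
             stringsAsFactors = FALSE)
}

# Sum mass * count, using exact matching on the symbol names.
molecular_weight <- function(form, masses = atomic_mass) {
  parsed <- parse_formula(form)
  idx <- match(parsed$element, names(masses))
  if (any(is.na(idx))) {
    stop("unknown element: ",
         paste(parsed$element[is.na(idx)], collapse = ", "))
  }
  sum(masses[idx] * parsed$count)
}

molecular_weight("C5H11BrO")
```

With these hand-entered masses the result should land close to the 167.0457 reported above. Using match() on the names gives exact matching directly, so the "^" and "$" anchors needed with grep are not required here.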

>
> On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:
>
>> p.s.  help(pac=CHNOSZ) reveals that this package has 3 vignettes.   
>> I have not looked at these vignettes, but most vignettes provide  
>> excellent introductions (though rarely with complete coverage) of  
>> important capabilities of the package.  (The 'sos' package includes  
>> a vignette, which exposes more capabilities than the example below.)
>>
>>
>> ######################
>>     Have you considered the 'CHNOSZ' package?
>>
>>
>>> makeup("C5H11BrO" )
>>  count
>> C      5
>> H     11
>> Br     1
>> O      1
>>
>>
>>     I found this using the 'sos' package as follows:
>>
>>
>> library(sos)
>> cf <- ???'chemical formula'
>> found 21 matches;  retrieving 2 pages
>> cf
>>
>>
>>     The print method for "cf" opened the results in a web browser,  
>> which showed that the "CHNOSZ" package had 14 of these 21 matches,  
>> and the other 7 were in 7 different packages.  Moreover, the  
>> "CHNOSZ" package is devoted to "Chemical Thermodynamics and  
>> Activity Diagrams" and provides many more capabilities that might  
>> interest you.
>>
>>
>>     Hope this helps.
>>     Spencer
>>
>>
>> On 12/26/2010 5:01 PM, Bryan Hanson wrote:
>>> Well let me just say thanks and WOW!  Four great ideas, each worthy of
>>> study and I'll learn several things from each.  Interestingly, these
>>> solutions seem more general and more compact than the solutions I
>>> found on the 'net using python and perl.  More evidence for the power
>>> of R!  A big thanks to each of you!  Bryan
>>>
>>> On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
>>>
>>>> On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
>>>>> Hello R Folks...
>>>>>
>>>>> I've been looking around the 'net and I see many complex solutions in
>>>>> various languages to this question, but I have a pretty simple need
>>>>> (and I'm not much good at regex).  I want to use a chemical formula as
>>>>> a function argument.  The formula would be in "Hill order", which is to
>>>>> list C, then H, then all other elements in alphabetical order.  My
>>>>> example will have only a limited number of elements, few enough that
>>>>> one can search directly for each element.  So some examples would be
>>>>> C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or
>>>>> Br, there is no following number, meaning a 1 is implied).
>>>>>
>>>>> Let's say
>>>>>
>>>>>> form <- "C5H11BrO"
>>>>>
>>>>> I'd like to get the count of each element, so in this case I need to
>>>>> extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the
>>>>> molecular weight by multiplying).  Sounds pretty simple, but my
>>>>> experiments with grep and strsplit don't immediately clue me into an
>>>>> obvious solution.  As I said, I don't need a general solution to the
>>>>> problem of calculating molecular weight from an arbitrary formula, that
>>>>> seems quite challenging, just a way to convert "form" into a list or
>>>>> data frame which I can then do the math on.
>>>>>
>>>>> Here's hoping this is a simple issue for more experienced R users!
>>>>> TIA,
>>>>
>>>> This can be done by strapply in gsubfn.  It matches the regular
>>>> expression to the target string, passing the back references (the
>>>> parenthesized portions of the regular expression) through a specified
>>>> function as successive arguments.
>>>>
>>>> Thus the first arg is form, your input string.  The second arg is the
>>>> regular expression, which matches an upper case letter optionally
>>>> followed by lower case letters, and all that is optionally followed by
>>>> digits.  The third arg is a function shown in a formula
>>>> representation.  strapply passes the back references (i.e. the portions
>>>> within parentheses) to the function as the two arguments.  Then
>>>> simplify is another function in formula notation which turns the
>>>> result into a matrix and then a data frame.  Finally we make the
>>>> second column of the data frame numeric.
>>>>
>>>> library(gsubfn)
>>>>
>>>> DF <- strapply(form,
>>>>     "([A-Z][a-z]*)(\\d*)",
>>>>     ~ c(..1, if (nchar(..2)) ..2 else 1),
>>>>     simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
>>>> DF[[2]] <- as.numeric(DF[[2]])
>>>>
>>>> DF looks like this:
>>>>
>>>>> DF
>>>> V1 V2
>>>> 1  C  5
>>>> 2  H 11
>>>> 3 Br  1
>>>> 4  O  1
>>>>
>>>>
>>>>
>>>> --
>>>> Statistics & Software Consulting
>>>> GKX Group, GKX Associates Inc.
>>>> tel: 1-877-GKX-GROUP
>>>> email: ggrothendieck at gmail.com
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>>
>> -- 
>> Spencer Graves, PE, PhD
>> President and Chief Operating Officer
>> Structure Inspection and Monitoring, Inc.
>> 751 Emerson Ct.
>> San José, CA 95126
>> ph:  408-655-4567
>>

David Winsemius, MD
West Hartford, CT


