[R] Parsing a Simple Chemical Formula

Spencer Graves spencer.graves at structuremonitoring.com
Mon Dec 27 05:24:23 CET 2010


       Mike Marchywka's post mentioned a CRAN package, "rpubchem", 
missed by my search for "chemical formula".  A further search for 
"chemical" and "chemistry" still missed it.  "compound" found it.  
Adding "compounds" and combining them with "union" produced a list of 
564 links in 219 packages;  7 of the help pages were for "rpubchem".  
The package with the most matches is "seacarb" (seawater carbonate 
chemistry with R:  21 matches), followed by "CHNOSZ", previously 
mentioned (19 matches).  "rpubchem" is the 22nd package on this list (5 
matches, with a max score of 32, less than the max score of 2 other 
packages with 5 matches).


       Spencer


On 12/26/2010 7:36 PM, Bryan Hanson wrote:
> Hi David & others...
>
> I did find the function you recommended, plus, it's even easier (but a 
> little hidden in the doc): >element(form, "mass").  But, this uses the 
> atomic masses from the periodic table, which are weighted averages of 
> the isotopes of each element.  What I'm doing actually involves mass 
> spectrometry, so I need the isotope masses, which are integers (think 
> 12C, 13C, 14C, but the periodic table says 12.011 reflecting the 
> relative abundances).  I used Gabor's solution and got my little 
> function humming.  Plus, I have several things to read through from 
> the various recommendations.
>
> Thanks again, Bryan
>
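The integer-mass calculation described above can be sketched in base R
as follows.  This is only a minimal sketch, not code from the thread:
the function name nominal_mass and the small isotope table are
assumptions made for illustration.  The table holds the mass number of
one chosen isotope per element (12C, 1H, 16O and 79Br by default).

```r
# Hypothetical helper: nominal (integer) mass of a simple formula.
# 'isotopes' maps element symbols to the mass number of the chosen isotope;
# the default table is an illustrative assumption covering only C, H, O, Br.
nominal_mass <- function(form, isotopes = c(C = 12, H = 1, O = 16, Br = 79)) {
  # One token per element: upper-case letter, optional lower-case letters,
  # optional digits (e.g. "C5", "H11", "Br", "O")
  tokens  <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]
  symbols <- sub("[0-9]+$", "", tokens)
  digits  <- sub("^[A-Za-z]+", "", tokens)
  digits[digits == ""] <- "1"            # a missing count means 1

  # Fail loudly rather than return NA for an element not in the table
  unknown <- setdiff(symbols, names(isotopes))
  if (length(unknown) > 0) {
    stop("no isotope mass for: ", paste(unknown, collapse = ", "))
  }
  sum(isotopes[symbols] * as.numeric(digits))
}

nominal_mass("C5H11BrO")                                     # 166, using 79Br
nominal_mass("C5H11BrO", c(C = 12, H = 1, O = 16, Br = 81))  # 168, using 81Br
```

Bromine occurs as both 79Br and 81Br, so passing a different table gives
the mass of the companion peak.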
> On Dec 26, 2010, at 10:21 PM, David Winsemius wrote:
>
>>
>> On Dec 26, 2010, at 8:28 PM, Bryan Hanson wrote:
>>
>>> Thanks Spencer, I'll definitely have a look at this package and its 
>>> vignettes.  I believe I have looked at it before, but didn't catch 
>>> it on this particular search.  Bryan
>>
>> Using the thermo list that the makeup function accesses to get its 
>> valid atomic symbols, one can arrive at the answer you posited 
>> would be too difficult in your first posting, the atomic weight from 
>> the formulae:
>>
>> > str(thermo$element)
>> 'data.frame':    130 obs. of  6 variables:
>> $ element: chr  "Z" "O" "H" "He" ...
>> $ state  : chr  "aq" "gas" "gas" "gas" ...
>> $ source : chr  "CWM89" "CWM89" "CWM89" "CWM89" ...
>> $ mass   : num  0 16 1.01 4 20.18 ...
>> $ s      : num  -15.6 49 31.2 30.2 35 ...
>> $ n      : int  1 2 2 1 1 1 1 1 2 2 ...
>>
>> makuform <- makeup(form)
>> # anchor each symbol so that "C" matches only "C", not "Ca" or "Cl"
>> patts <- paste("^", rownames(makuform), "$", sep="")
>> # look up the mass of the first matching row in thermo$element
>> makuform$amass <- sapply(patts, function(x)
>>     thermo$element[grep(x, thermo$element[[1]])[1], "mass"])
>> sum(makuform$amass * makuform$count)
>> # [1] 167.0457
>>
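The same sum can be sketched without CHNOSZ.  This is a minimal sketch
under stated assumptions, not the method used above: the names
atomic_weight, element_counts and molecular_weight are made up for
illustration, and the weight table is hand-entered from the standard
atomic weights of the four elements in the example rather than read
from thermo$element.

```r
# Illustrative, hand-entered table of standard atomic weights (assumption:
# only the elements needed for the example are listed).
atomic_weight <- c(C = 12.011, H = 1.008, O = 15.999, Br = 79.904)

# Hypothetical helper: named vector of counts, e.g. c(C = 5, H = 11, ...)
element_counts <- function(form) {
  tokens  <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]
  symbols <- sub("[0-9]+$", "", tokens)
  digits  <- sub("^[A-Za-z]+", "", tokens)
  digits[!nzchar(digits)] <- "1"         # no digits means a count of 1
  setNames(as.numeric(digits), symbols)
}

# Hypothetical helper: sum of weight * count over the elements
molecular_weight <- function(form, weights = atomic_weight) {
  counts  <- element_counts(form)
  unknown <- setdiff(names(counts), names(weights))
  if (length(unknown) > 0) {
    stop("no atomic weight for: ", paste(unknown, collapse = ", "))
  }
  # Indexing by name is an exact match, so no "^...$" patterns are needed
  sum(weights[names(counts)] * counts)
}

molecular_weight("C5H11BrO")   # close to the 167.0457 reported above
```

Looking weights up by name avoids the grep step, at the cost of
maintaining the table by hand.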
>>>
>>> On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:
>>>
>>>> p.s.  help(pac=CHNOSZ) reveals that this package has 3 vignettes.  
>>>> I have not looked at these vignettes, but most vignettes provide 
>>>> excellent introductions (though rarely with complete coverage) of 
>>>> important capabilities of the package.  (The 'sos' package includes 
>>>> a vignette, which exposes more capabilities than the example below.)
>>>>
>>>>
>>>> ######################
>>>>    Have you considered the 'CHNOSZ' package?
>>>>
>>>>
>>>>> makeup("C5H11BrO" )
>>>>    count
>>>> C      5
>>>> H     11
>>>> Br     1
>>>> O      1
>>>>
>>>>
>>>>    I found this using the 'sos' package as follows:
>>>>
>>>>
>>>> library(sos)
>>>> cf <- ???'chemical formula'
>>>> found 21 matches;  retrieving 2 pages
>>>> cf
>>>>
>>>>
>>>>    The print method for "cf" opened the results in a web browser, 
>>>> which showed that the "CHNOSZ" package had 14 of these 21 matches, 
>>>> and the other 7 were in 7 different packages.  Moreover, the 
>>>> "CHNOSZ" package is devoted to "Chemical Thermodynamics and 
>>>> Activity Diagrams" and provides many more capabilities that might 
>>>> interest you.
>>>>
>>>>
>>>>    Hope this helps.
>>>>    Spencer
>>>>
>>>>
>>>> On 12/26/2010 5:01 PM, Bryan Hanson wrote:
>>>>> Well let me just say thanks and WOW!  Four great ideas, each worthy
>>>>> of study and I'll learn several things from each.  Interestingly,
>>>>> these solutions seem more general and more compact than the solutions
>>>>> I found on the 'net using python and perl.  More evidence for the
>>>>> power of R!  A big thanks to each of you!  Bryan
>>>>>
>>>>> On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
>>>>>
>>>>>> On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
>>>>>>> Hello R Folks...
>>>>>>>
>>>>>>> I've been looking around the 'net and I see many complex solutions
>>>>>>> in various languages to this question, but I have a pretty simple
>>>>>>> need (and I'm not much good at regex).  I want to use a chemical
>>>>>>> formula as a function argument.  The formula would be in "Hill
>>>>>>> order" which is to list C, then H, then all other elements in
>>>>>>> alphabetical order.  My example will have only a limited number of
>>>>>>> elements, few enough that one can search directly for each element.
>>>>>>> So some examples would be C5H12, or C5H12O or C5H11BrO (note that
>>>>>>> for oxygen and bromine, O or Br, there is no following number
>>>>>>> meaning a 1 is implied).
>>>>>>>
>>>>>>> Let's say
>>>>>>>
>>>>>>>> form <- "C5H11BrO"
>>>>>>>
>>>>>>> I'd like to get the count of each element, so in this case I need
>>>>>>> to extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate
>>>>>>> the molecular weight by multiplying).  Sounds pretty simple, but my
>>>>>>> experiments with grep and strsplit don't immediately clue me into an
>>>>>>> obvious solution.  As I said, I don't need a general solution to the
>>>>>>> problem of calculating molecular weight from an arbitrary formula,
>>>>>>> that seems quite challenging, just a way to convert "form" into a
>>>>>>> list or data frame which I can then do the math on.
>>>>>>>
>>>>>>> Here's hoping this is a simple issue for more experienced R users!
>>>>>>> TIA,
>>>>>>
>>>>>> This can be done by strapply in gsubfn.  It matches the regular
>>>>>> expression to the target string, passing the back references (the
>>>>>> parenthesized portions of the regular expression) through a
>>>>>> specified function as successive arguments.
>>>>>>
>>>>>> Thus the first arg is form, your input string.  The second arg is
>>>>>> the regular expression which matches an upper case letter optionally
>>>>>> followed by lower case letters and all that is optionally followed
>>>>>> by digits.  The third arg is a function shown in a formula
>>>>>> representation.  strapply passes the back references (i.e. the
>>>>>> portions within parentheses) to the function as the two arguments.
>>>>>> Next, simplify is another function in formula notation which turns
>>>>>> the result into a matrix and then a data frame.  Finally we make the
>>>>>> second column of the data frame numeric.
>>>>>>
>>>>>> library(gsubfn)
>>>>>>
>>>>>> DF <- strapply(form,
>>>>>>     "([A-Z][a-z]*)(\\d*)",
>>>>>>     ~ c(..1, if (nchar(..2)) ..2 else 1),
>>>>>>     simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
>>>>>> DF[[2]] <- as.numeric(DF[[2]])
>>>>>>
>>>>>> DF looks like this:
>>>>>>
>>>>>>> DF
>>>>>>   V1 V2
>>>>>> 1  C  5
>>>>>> 2  H 11
>>>>>> 3 Br  1
>>>>>> 4  O  1
>>>>>>
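For comparison, the same split can be sketched in base R without
gsubfn.  This is a minimal sketch, not part of the answer above: the
function name parse_formula and the column names are assumptions chosen
for illustration, and it handles only simple formulas with no
parentheses or charges.

```r
# Hypothetical base R helper: split a simple formula into symbols and counts.
parse_formula <- function(form) {
  # Same idea as the regular expression above: an upper-case letter,
  # optional lower-case letters, then optional digits
  tokens <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]

  # Strip trailing digits to get the symbol, leading letters to get the count
  element <- sub("[0-9]+$", "", tokens)
  digits  <- sub("^[A-Za-z]+", "", tokens)
  digits[!nzchar(digits)] <- "1"         # an absent count is an implied 1

  data.frame(element = element, count = as.numeric(digits),
             stringsAsFactors = FALSE)
}

parse_formula("C5H11BrO")
```

The result has one row per element in the order written, with a
character column for the symbol and a numeric column for the count.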
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Statistics & Software Consulting
>>>>>> GKX Group, GKX Associates Inc.
>>>>>> tel: 1-877-GKX-GROUP
>>>>>> email: ggrothendieck at gmail.com
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>


