[R] Parsing a Simple Chemical Formula
Spencer Graves
spencer.graves at structuremonitoring.com
Mon Dec 27 05:24:23 CET 2010
Mike Marchywka's post mentioned a CRAN package, "rpubchem",
missed by my search for "chemical formula". A further search for
"chemical" and "chemistry" still missed it. "compound" found it.
Adding "compounds" and combining them with "union" produced a list of
564 links in 219 packages; 7 of the help pages were for "rpubchem".
The package with the most matches is "seacarb" (seawater carbonate
chemistry with R: 21 matches), followed by "CHNOSZ", previously
mentioned (19 matches). "rpubchem" is the 22nd package on this list (5
matches, with a max score of 32, less than the max score of 2 other
packages with 5 matches).
Spencer
On 12/26/2010 7:36 PM, Bryan Hanson wrote:
> Hi David & others...
>
> I did find the function you recommended, plus, it's even easier (but a
> little hidden in the doc): >element(form, "mass"). But, this uses the
> atomic masses from the periodic table, which are weighted averages of
> the isotopes of each element. What I'm doing actually involves mass
> spectrometry, so I need the isotope masses, which are integers (think
> 12C, 13C, 14C, but the periodic table says 12.011 reflecting the
> relative abundances). I used Gabor's solution and got my little
> function humming. Plus, I have several things to read through from
> the various recommendations.
>
> Thanks again, Bryan
>
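A minimal base-R sketch of the integer isotope-mass calculation described above. The lookup table `isotope_mass`, the `isotopes` assignment and the function `nominal_mass` are hypothetical names introduced only for illustration; they are not part of CHNOSZ or gsubfn, and the element counts are written out by hand rather than parsed.

```r
# Hypothetical lookup table: integer (nominal) masses of selected isotopes.
# Only the isotopes needed for this example are listed; extend as required.
isotope_mass <- c("12C" = 12, "13C" = 13, "1H" = 1, "2H" = 2,
                  "16O" = 16, "79Br" = 79, "81Br" = 81)

# Element counts for C5H11BrO, written out by hand.
counts <- c(C = 5, H = 11, Br = 1, O = 1)

# Which isotope to use for each element (an assumption for this example).
isotopes <- c(C = "12C", H = "1H", Br = "79Br", O = "16O")

nominal_mass <- function(counts, isotopes, table = isotope_mass) {
  labels <- isotopes[names(counts)]          # isotope label per element
  if (anyNA(labels) || anyNA(table[labels]))
    stop("missing isotope assignment or mass")
  sum(table[labels] * counts)                # mass times count, summed
}

nominal_mass(counts, isotopes)
# [1] 166
```

Swapping "79Br" for "81Br" in `isotopes` gives 168, the other member of the bromine isotope pair.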
> On Dec 26, 2010, at 10:21 PM, David Winsemius wrote:
>
>>
>> On Dec 26, 2010, at 8:28 PM, Bryan Hanson wrote:
>>
>>> Thanks Spencer, I'll definitely have a look at this package and its
>>> vignettes. I believe I have looked at it before, but didn't catch
>>> it on this particular search. Bryan
>>
>> Using the thermo list that the makeup function accesses to get its
>> valid atomic symbols, one can arrive at the answer you posited
>> would be too difficult in your first posting, the molecular weight from
>> the formula:
>>
>> > str(thermo$element)
>> 'data.frame': 130 obs. of 6 variables:
>> $ element: chr "Z" "O" "H" "He" ...
>> $ state : chr "aq" "gas" "gas" "gas" ...
>> $ source : chr "CWM89" "CWM89" "CWM89" "CWM89" ...
>> $ mass : num 0 16 1.01 4 20.18 ...
>> $ s : num -15.6 49 31.2 30.2 35 ...
>> $ n : int 1 2 2 1 1 1 1 1 2 2 ...
>>
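>> makuform <- makeup(form)
>> # Anchor each symbol so that, e.g., "C" does not also match "Ca" or "Cl"
>> patts <- paste("^", rownames(makuform), "$", sep="")
>> # Look up the mass of the first matching row for each element
>> makuform$amass <- sapply(patts, function(x)
>>     thermo$element[grep(x, thermo$element[[1]])[1], "mass"])
>> sum(makuform$amass * makuform$count)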
>> # [1] 167.0457
>>
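The same lookup can be sketched without regular expressions, since match() compares whole strings. This is only an illustration under stated assumptions: the `atomic_mass` table below is a hand-typed stand-in, not the CHNOSZ `thermo$element` data, and its values are average atomic masses chosen to be consistent with the 167.0457 result quoted above.

```r
# Assumed average atomic masses (illustrative values, not read from CHNOSZ).
atomic_mass <- data.frame(
  element = c("C", "Ca", "Cl", "H", "Br", "O"),
  mass    = c(12.011, 40.078, 35.453, 1.00794, 79.904, 15.9994),
  stringsAsFactors = FALSE
)

# Element counts for C5H11BrO, written out by hand.
counts <- c(C = 5, H = 11, Br = 1, O = 1)

# match() compares whole strings, so "C" cannot pick up "Ca" or "Cl".
idx <- match(names(counts), atomic_mass$element)
stopifnot(!anyNA(idx))   # fail early on an unknown element symbol

mol_weight <- sum(atomic_mass$mass[idx] * counts)
round(mol_weight, 4)
# [1] 167.0457
```

With exact matching there is no need to build "^...$" patterns, and an unknown symbol is caught by the NA check instead of silently yielding NA.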
>>>
>>> On Dec 26, 2010, at 8:16 PM, Spencer Graves wrote:
>>>
>>>> p.s. help(pac=CHNOSZ) reveals that this package has 3 vignettes.
>>>> I have not looked at these vignettes, but most vignettes provide
>>>> excellent introductions (though rarely with complete coverage) of
>>>> important capabilities of the package. (The 'sos' package includes
>>>> a vignette, which exposes more capabilities than the example below.)
>>>>
>>>>
>>>> ######################
>>>> Have you considered the 'CHNOSZ' package?
>>>>
>>>>
>>>>> makeup("C5H11BrO")
>>>>    count
>>>> C      5
>>>> H     11
>>>> Br     1
>>>> O      1
>>>>
>>>>
>>>> I found this using the 'sos' package as follows:
>>>>
>>>>
>>>> library(sos)
>>>> cf <- ???'chemical formula'
>>>> found 21 matches; retrieving 2 pages
>>>> cf
>>>>
>>>>
>>>> The print method for "cf" opened the results in a web browser,
>>>> which showed that the "CHNOSZ" package had 14 of these 21 matches,
>>>> and the other 7 were in 7 different packages. Moreover, the
>>>> "CHNOSZ" package is devoted to "Chemical Thermodynamics and
>>>> Activity Diagrams" and provides many more capabilities that might
>>>> interest you.
>>>>
>>>>
>>>> Hope this helps.
>>>> Spencer
>>>>
>>>>
>>>> On 12/26/2010 5:01 PM, Bryan Hanson wrote:
>>>>> Well let me just say thanks and WOW! Four great ideas, each worthy of
>>>>> study and I'll learn several things from each. Interestingly, these
>>>>> solutions seem more general and more compact than the solutions I
>>>>> found on the 'net using python and perl. More evidence for the power
>>>>> of R! A big thanks to each of you! Bryan
>>>>>
>>>>> On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
>>>>>
>>>>>> On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu>
>>>>>> wrote:
>>>>>>> Hello R Folks...
>>>>>>>
>>>>>>> I've been looking around the 'net and I see many complex solutions in
>>>>>>> various languages to this question, but I have a pretty simple need
>>>>>>> (and I'm not much good at regex). I want to use a chemical formula as
>>>>>>> a function argument. The formula would be in "Hill order" which is to
>>>>>>> list C, then H, then all other elements in alphabetical order. My
>>>>>>> example will have only a limited number of elements, few enough that
>>>>>>> one can search directly for each element. So some examples would be
>>>>>>> C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or
>>>>>>> Br, there is no following number meaning a 1 is implied).
>>>>>>>
>>>>>>> Let's say
>>>>>>>
>>>>>>>> form <- "C5H11BrO"
>>>>>>>
>>>>>>> I'd like to get the count of each element, so in this case I need to
>>>>>>> extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the
>>>>>>> molecular weight by multiplying). Sounds pretty simple, but my
>>>>>>> experiments with grep and strsplit don't immediately clue me into an
>>>>>>> obvious solution. As I said, I don't need a general solution to the
>>>>>>> problem of calculating molecular weight from an arbitrary formula,
>>>>>>> that seems quite challenging, just a way to convert "form" into a
>>>>>>> list or data frame which I can then do the math on.
>>>>>>>
>>>>>>> Here's hoping this is a simple issue for more experienced R users!
>>>>>>> TIA,
>>>>>>
>>>>>> This can be done by strapply in gsubfn. It matches the regular
>>>>>> expression to the target string, passing the back references (the
>>>>>> parenthesized portions of the regular expression) through a
>>>>>> specified function as successive arguments.
>>>>>>
>>>>>> Thus the first arg is form, your input string. The second arg is the
>>>>>> regular expression, which matches an upper case letter optionally
>>>>>> followed by lower case letters, all of which is optionally followed
>>>>>> by digits. The third arg is a function shown in a formula
>>>>>> representation; it receives the two back references as its two
>>>>>> arguments. The simplify argument is another function in formula
>>>>>> notation, which turns the result into a matrix and then a data
>>>>>> frame. Finally we make the second column of the data frame numeric.
>>>>>>
>>>>>> library(gsubfn)
>>>>>>
>>>>>> DF <- strapply(form,
>>>>>>     "([A-Z][a-z]*)(\\d*)",
>>>>>>     ~ c(..1, if (nchar(..2)) ..2 else 1),
>>>>>>     simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
>>>>>> DF[[2]] <- as.numeric(DF[[2]])
>>>>>>
>>>>>> DF looks like this:
>>>>>>
>>>>>>> DF
>>>>>>   V1 V2
>>>>>> 1  C  5
>>>>>> 2  H 11
>>>>>> 3 Br  1
>>>>>> 4  O  1
>>>>>>
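For comparison, the same element/count extraction can be sketched in base R alone with gregexpr(), regexec() and regmatches(). This is an illustrative alternative, not the gsubfn approach shown above; `parse_formula` is a hypothetical helper name, and like the regular expression above it assumes a flat formula with no parentheses, charges or hydrates.

```r
parse_formula <- function(form) {
  # An upper case letter, optional lower case letters, optional digits.
  pattern <- "([A-Z][a-z]*)([0-9]*)"

  # Step 1: split the formula into tokens such as "C5", "H11", "Br", "O".
  tokens <- regmatches(form, gregexpr(pattern, form))[[1]]

  # Step 2: for each token, recover the two parenthesized groups.
  parts   <- regmatches(tokens, regexec(pattern, tokens))
  element <- vapply(parts, `[`, character(1), 2)
  digits  <- vapply(parts, `[`, character(1), 3)

  # Step 3: a missing count means an implied 1.
  count <- ifelse(nzchar(digits), as.numeric(digits), 1)

  data.frame(element = element, count = count, stringsAsFactors = FALSE)
}

parse_formula("C5H11BrO")
```

The result is a data frame with one row per element (C, H, Br, O) and a numeric count column (5, 11, 1, 1), which can then be multiplied by a vector of masses.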
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Statistics & Software Consulting
>>>>>> GKX Group, GKX Associates Inc.
>>>>>> tel: 1-877-GKX-GROUP
>>>>>> email: ggrothendieck at gmail.com
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>