[R] Parsing a Simple Chemical Formula
Spencer Graves
spencer.graves at structuremonitoring.com
Mon Dec 27 02:16:30 CET 2010
p.s. help(package = "CHNOSZ") reveals that this package has 3 vignettes. I
have not looked at these vignettes, but most vignettes provide excellent
introductions (though rarely with complete coverage) to important
capabilities of a package. (The 'sos' package also includes a vignette,
which covers more capabilities than the example below.)
######################
Have you considered the 'CHNOSZ' package?
> library(CHNOSZ)
> makeup("C5H11BrO")
   count
C      5
H     11
Br     1
O      1
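The original question (quoted below) wanted these counts in order to compute a molecular weight by multiplying. A minimal sketch of that step in base R follows. The counts are typed in by hand rather than taken from makeup(), the names `counts`, `atomic_weight` and `mol_weight` are only illustrative, and the atomic weights are standard values rounded to three decimals.

```r
# Element counts for C5H11BrO, entered by hand for illustration.
counts <- c(C = 5, H = 11, Br = 1, O = 1)

# Standard atomic weights in g/mol, rounded to three decimals.
# Only the elements needed for this example are listed.
atomic_weight <- c(C = 12.011, H = 1.008, Br = 79.904, O = 15.999)

# Look up the weight of each element by name, multiply by its count, and sum.
mol_weight <- sum(counts * atomic_weight[names(counts)])
print(mol_weight)
```

Indexing `atomic_weight` by `names(counts)` keeps the two vectors aligned even if the elements are listed in a different order.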
I found CHNOSZ using the 'sos' package as follows:
library(sos)
cf <- ???'chemical formula'
found 21 matches; retrieving 2 pages
cf
The print method for "cf" opened the results in a web browser,
which showed that the "CHNOSZ" package had 14 of these 21 matches, and
the other 7 were in 7 different packages. Moreover, the "CHNOSZ"
package is devoted to "Chemical Thermodynamics and Activity Diagrams"
and provides many more capabilities that might interest you.
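For formulas as simple as the ones in the original question, base R alone may be enough. The sketch below uses the same idea as the regular expression described in Gabor's reply quoted below (an upper case letter, optional lower case letters, optional digits), but with regmatches() and gregexpr() instead of gsubfn. The function name `parse_formula` is hypothetical and not part of any package, and the sketch assumes there are no parentheses, charges, or hydrates in the formula.

```r
# Hypothetical helper (not part of CHNOSZ or gsubfn): count the elements
# in a simple formula such as "C5H11BrO".
# Assumes no parentheses, charges, or hydrate notation.
parse_formula <- function(form) {
  # Each token is one upper case letter, optional lower case letters,
  # and optional digits, e.g. "C5", "H11", "Br", "O".
  tokens <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]

  element <- sub("[0-9]+$", "", tokens)    # drop the trailing digits
  digits  <- sub("^[A-Za-z]+", "", tokens) # drop the leading letters

  # A missing number means a count of 1 is implied.
  digits[!nzchar(digits)] <- "1"

  data.frame(element = element,
             count = as.numeric(digits),
             stringsAsFactors = FALSE)
}

res <- parse_formula("C5H11BrO")
print(res)
```

For anything beyond such simple cases, a dedicated package such as CHNOSZ is likely the safer choice.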
Hope this helps.
Spencer
On 12/26/2010 5:01 PM, Bryan Hanson wrote:
> Well let me just say thanks and WOW! Four great ideas, each worthy of
> study and I'll learn several things from each. Interestingly, these
> solutions seem more general and more compact than the solutions I
> found on the 'net using python and perl. More evidence for the power
> of R! A big thanks to each of you! Bryan
>
> On Dec 26, 2010, at 7:26 PM, Gabor Grothendieck wrote:
>
>> On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
>>> Hello R Folks...
>>>
>>> I've been looking around the 'net and I see many complex solutions in
>>> various languages to this question, but I have a pretty simple need
>>> (and I'm not much good at regex). I want to use a chemical formula as
>>> a function argument. The formula would be in "Hill order" which is to
>>> list C, then H, then all other elements in alphabetical order. My
>>> example will have only a limited number of elements, few enough that
>>> one can search directly for each element. So some examples would be
>>> C5H12, or C5H12O or C5H11BrO (note that for oxygen and bromine, O or
>>> Br, there is no following number meaning a 1 is implied).
>>>
>>> Let's say
>>>
>>>> form <- "C5H11BrO"
>>>
>>> I'd like to get the count of each element, so in this case I need to
>>> extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate the
>>> molecular weight by multiplying). Sounds pretty simple, but my
>>> experiments with grep and strsplit don't immediately clue me into an
>>> obvious solution. As I said, I don't need a general solution to the
>>> problem of calculating molecular weight from an arbitrary formula,
>>> that seems quite challenging, just a way to convert "form" into a list
>>> or data frame which I can then do the math on.
>>>
>>> Here's hoping this is a simple issue for more experienced R users!
>>> TIA,
>>
>> This can be done by strapply in gsubfn. It matches the regular
>> expression to the target string passing the back references (the
>> parenthesized portions of the regular expression) through a specified
>> function as successive arguments.
>>
>> Thus the first arg is form, your input string. The second arg is the
>> regular expression which matches an upper case letter optionally
>> followed by lower case letters and all that is optionally followed by
>> digits. The third arg is a function shown in a formula
>> representation. strapply passes the back references (i.e. the portions
>> within parentheses) to the function as the two arguments. The
>> simplify argument is another function in formula notation which turns
>> the result into a matrix and then a data frame. Finally we make the
>> second column of the data frame numeric.
>>
>> library(gsubfn)
>>
>> DF <- strapply(form,
>>   "([A-Z][a-z]*)(\\d*)",
>>   ~ c(..1, if (nchar(..2)) ..2 else 1),
>>   simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
>> DF[[2]] <- as.numeric(DF[[2]])
>>
>> DF looks like this:
>>
>> > DF
>>   V1 V2
>> 1  C  5
>> 2  H 11
>> 3 Br  1
>> 4  O  1
>>
>>
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>
--
Spencer Graves, PE, PhD
President and Chief Operating Officer
Structure Inspection and Monitoring, Inc.
751 Emerson Ct.
San José, CA 95126
ph: 408-655-4567