[R] Parsing a Simple Chemical Formula

Gabor Grothendieck ggrothendieck at gmail.com
Mon Dec 27 05:09:16 CET 2010


On Sun, Dec 26, 2010 at 7:26 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
>> Hello R Folks...
>>
>> I've been looking around the 'net and I see many complex solutions in
>> various languages to this question, but I have a pretty simple need (and I'm
>> not much good at regex).  I want to use a chemical formula as a function
>> argument.  The formula would be in "Hill order" which is to list C, then H,
>> then all other elements in alphabetical order.  My example will have only a
>> limited number of elements, few enough that one can search directly for each
>> element.  So some examples would be C5H12, or C5H12O or C5H11BrO (note that
>> for oxygen and bromine, O or Br, there is no following number meaning a 1 is
>> implied).
>>
>> Let's say
>>
>>> form <- "C5H11BrO"
>>
>> I'd like to get the count of each element, so in this case I need to extract
>> C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular
>> weight by multiplying).  Sounds pretty simple, but my experiments with grep
>> and strsplit don't immediately clue me into an obvious solution.  As I said,
>> I don't need a general solution to the problem of calculating molecular
>> weight from an arbitrary formula, that seems quite challenging, just a way
>> to convert "form" into a list or data frame which I can then do the math on.
>>
>> Here's hoping this is a simple issue for more experienced R users!  TIA,
>
> This can be done by strapply in gsubfn.  It matches the regular
> expression to the target string passing the back references (the
> parenthesized portions of the regular expression) through a specified
> function as successive arguments.
>
> Thus the first arg is form, your input string.  The second arg is the
> regular expression, which matches an upper case letter optionally
> followed by lower case letters, all of which is optionally followed by
> digits.  The third arg is a function written in formula notation;
> strapply passes the back references (i.e. the portions within
> parentheses) to it as the two arguments ..1 and ..2.  The simplify
> arg is another function in formula notation, which reshapes the
> result into a two-column matrix and then a data frame.  Finally we
> convert the second column of the data frame to numeric.
>
> library(gsubfn)
>
> DF <- strapply(form,
>   "([A-Z][a-z]*)(\\d*)",
>   ~ c(..1, if (nchar(..2)) ..2 else 1),
>   simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
> DF[[2]] <- as.numeric(DF[[2]])
>
> DF looks like this:
>
> > DF
>   V1 V2
> 1  C  5
> 2  H 11
> 3 Br  1
> 4  O  1
>
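
For anyone without gsubfn installed, the same matching can be sketched
in base R.  This is only a rough equivalent, not what strapply does
internally: it assumes a current version of R (for regmatches), and
parse_formula is a made-up helper name, not part of any package.

```r
# Hypothetical helper (not from gsubfn): split a simple formula such as
# "C5H11BrO" into element symbols and counts using base R only.
parse_formula <- function(form) {
  # Each token is an element symbol optionally followed by digits,
  # e.g. "C5", "H11", "Br", "O".
  tokens <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]
  # Drop trailing digits to get the symbol; drop leading letters to get the count.
  element <- sub("[0-9]+$", "", tokens)
  count <- sub("^[A-Za-z]+", "", tokens)
  # A missing count means an implied 1.
  count[count == ""] <- "1"
  data.frame(element = element, count = as.numeric(count),
             stringsAsFactors = FALSE)
}

parse_formula("C5H11BrO")
```

For "C5H11BrO" this gives four rows: C, H, Br and O, with counts 5, 11,
1 and 1.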

Here is a variation on the strapply call that is slightly simpler. The
function in the third argument now uses paste instead of c, so it
outputs strings like "C 5".  With output in this form, read.table can
read it directly, creating a data frame whose second column is already
numeric.

> strapply(form,
+   "([A-Z][a-z]*)(\\d*)",
+   ~ paste(..1, if (nchar(..2)) ..2 else 1),
+   simplify = ~ read.table(textConnection(..1)))
  V1 V2
1  C  5
2  H 11
3 Br  1
4  O  1
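
Once the counts are in a data frame, the molecular weight is just a
lookup and a multiplication.  A minimal sketch follows; the data frame
is typed in by hand so the example stands alone, and atomic_weight is a
hypothetical lookup table holding approximate standard atomic weights
for only the four elements in this example.

```r
# The parsed result shown above, entered by hand so this sketch stands alone.
DF <- data.frame(V1 = c("C", "H", "Br", "O"), V2 = c(5, 11, 1, 1),
                 stringsAsFactors = FALSE)

# Hypothetical lookup table: approximate standard atomic weights in g/mol.
atomic_weight <- c(C = 12.011, H = 1.008, Br = 79.904, O = 15.999)

# Fail early if the formula contains an element missing from the table.
stopifnot(all(DF$V1 %in% names(atomic_weight)))

# Look up each element's weight by name, multiply by its count, and sum.
mol_weight <- sum(atomic_weight[DF$V1] * DF$V2)
mol_weight
```

With these weights C5H11BrO comes out at about 167.05 g/mol.  Extending
the table to more elements only requires adding entries to
atomic_weight.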


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


