[R] Parsing a Simple Chemical Formula
Gabor Grothendieck
ggrothendieck at gmail.com
Mon Dec 27 05:09:16 CET 2010
On Sun, Dec 26, 2010 at 7:26 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> On Sun, Dec 26, 2010 at 6:29 PM, Bryan Hanson <hanson at depauw.edu> wrote:
>> Hello R Folks...
>>
>> I've been looking around the 'net and I see many complex solutions in
>> various languages to this question, but I have a pretty simple need (and I'm
>> not much good at regex). I want to use a chemical formula as a function
>> argument. The formula would be in "Hill order" which is to list C, then H,
>> then all other elements in alphabetical order. My example will have only a
>> limited number of elements, few enough that one can search directly for each
>> element. So some examples would be C5H12, or C5H12O or C5H11BrO (note that
>> for oxygen and bromine, O or Br, there is no following number meaning a 1 is
>> implied).
>>
>> Let's say
>>
>>> form <- "C5H11BrO"
>>
>> I'd like to get the count of each element, so in this case I need to extract
>> C and 5, H and 11, Br and 1, O and 1 (I want to calculate the molecular
>> weight by multiplying). Sounds pretty simple, but my experiments with grep
>> and strsplit don't immediately clue me into an obvious solution. As I said,
>> I don't need a general solution to the problem of calculating molecular
>> weight from an arbitrary formula, that seems quite challenging, just a way
>> to convert "form" into a list or data frame which I can then do the math on.
>>
>> Here's hoping this is a simple issue for more experienced R users! TIA,
>
> This can be done by strapply in gsubfn. It matches the regular
> expression to the target string passing the back references (the
> parenthesized portions of the regular expression) through a specified
> function as successive arguments.
>
> Thus the first arg is form, your input string. The second arg is the
> regular expression, which matches an upper case letter optionally
> followed by lower case letters, all of which is optionally followed by
> digits. The third arg is a function written in formula notation.
> strapply passes the two back references (i.e. the portions within
> parentheses) to the function as its two arguments, ..1 and ..2; if the
> digits are missing, the function supplies a count of 1. The simplify
> arg is another function in formula notation, which turns the result
> into a matrix and then a data frame. Finally we make the second
> column of the data frame numeric.
>
> library(gsubfn)
>
> DF <- strapply(form,
> "([A-Z][a-z]*)(\\d*)",
> ~ c(..1, if (nchar(..2)) ..2 else 1),
> simplify = ~ as.data.frame(t(matrix(..1, 2)), stringsAsFactors = FALSE))
> DF[[2]] <- as.numeric(DF[[2]])
>
> DF looks like this:
>
>> DF
>   V1 V2
> 1  C  5
> 2  H 11
> 3 Br  1
> 4  O  1
>
Here is a variation that is slightly simpler. The function in the
third argument has been changed from c to paste so that it outputs
strings like "C 5". With this form of output we can use read.table to
read it directly, creating a data frame. read.table also reads the
counts as numbers, so no separate conversion step is needed.

> strapply(form,
+ "([A-Z][a-z]*)(\\d*)",
+ ~ paste(..1, if (nchar(..2)) ..2 else 1),
+ simplify = ~ read.table(textConnection(..1)))
  V1 V2
1  C  5
2  H 11
3 Br  1
4  O  1
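
For comparison, here is a minimal base R sketch of the same idea, for
cases where gsubfn is not available. It applies the same pattern with
gregexpr and regmatches and then does the multiplication Bryan
mentioned. The names parse_formula and atomic_weight are hypothetical,
introduced only for illustration, and the atomic weights are rounded
standard values for just the four elements in the example.

```r
# Hypothetical helper: split a Hill-order formula into elements and counts.
parse_formula <- function(form) {
  # Each token is an upper case letter, optional lower case letters,
  # and optional digits, e.g. "C5", "H11", "Br", "O".
  tokens <- regmatches(form, gregexpr("[A-Z][a-z]*[0-9]*", form))[[1]]
  element <- sub("[0-9]+$", "", tokens)     # drop trailing digits
  count <- sub("^[A-Za-z]+", "", tokens)    # drop leading letters
  count[count == ""] <- "1"                 # a missing count implies 1
  data.frame(element = element, count = as.numeric(count),
             stringsAsFactors = FALSE)
}

# Assumed lookup table: rounded standard atomic weights, only for the
# elements used in this example.
atomic_weight <- c(C = 12.011, H = 1.008, Br = 79.904, O = 15.999)

DF <- parse_formula("C5H11BrO")

# Molecular weight: per-element weight times count, summed.
mw <- sum(atomic_weight[DF$element] * DF$count)
print(DF)
print(mw)
```

An element missing from atomic_weight would give NA for mw, so the
lookup table has to cover every element that can appear in the input.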
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com