[R] Parsing a Simple Chemical Formula

Mon Dec 27 01:41:06 CET 2010

On Dec 26, 2010, at 6:29 PM, Bryan Hanson wrote:

> Hello R Folks...
>
> I've been looking around the 'net and I see many complex solutions  
> in various languages to this question, but I have a pretty simple  
> need (and I'm not much good at regex).  I want to use a chemical  
> formula as a function argument.  The formula would be in "Hill  
> order" which is to list C, then H, then all other elements in  
> alphabetical order.  My example will have only a limited number of  
> elements, few enough that one can search directly for each element.   
> So some examples would be C5H12, or C5H12O or C5H11BrO (note that  
> for oxygen and bromine, O or Br, there is no following number  
> meaning a 1 is implied).
>
> Let's say
>
> > form <- "C5H11BrO"

Well here's how I see it:

The "form" can be split with a regular expression: a capital letter
followed by zero or one lower-case letter, followed by any number of
digits (possibly none).

greg <- gregexpr("[A-Z]{1}[a-z]?[0-9]*", form)

Append a number equal to one more than the length of the string, for
reasons that will become clear:

ugreg <- c(unlist(greg), nchar(form)+1)
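
To see what that extra value is for, it can help to look at the match
positions directly. A small sketch, assuming the same "form" as in the
question (the names "starts" and "lens" are only for illustration):

```r
form <- "C5H11BrO"
greg <- gregexpr("[A-Z]{1}[a-z]?[0-9]*", form)

# gregexpr() returns a list with one element per input string. Each
# element holds the start positions of the matches, and the match
# lengths are stored in its "match.length" attribute.
starts <- as.vector(greg[[1]])
lens   <- attr(greg[[1]], "match.length")

# Appending nchar(form) + 1 gives every piece, including the last one,
# a "next start" to stop just short of.
ugreg <- c(starts, nchar(form) + 1)

print(starts)  # 1 3 6 8
print(lens)    # 2 3 2 1
print(ugreg)   # 1 3 6 8 9
```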

Then use substr to serially pick from a split point to one less than
the next split point (or, in the case of the last element, to the end
of the string):

 > sapply(1:(length(ugreg)-1),
          function(z) substr(form, ugreg[z], ugreg[z+1]-1) )
[1] "C5"  "H11" "Br"  "O"
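
A shorter route to the same pieces, assuming an R version that provides
regmatches() (it is not part of the approach shown above), might look
like this:

```r
form <- "C5H11BrO"

# regmatches() extracts the substrings that gregexpr() located, so no
# manual start/stop bookkeeping is needed.
pieces <- regmatches(form, gregexpr("[A-Z][a-z]?[0-9]*", form))[[1]]
print(pieces)  # "C5" "H11" "Br" "O"
```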

Then you can split these "triples" (cap, lower, n) and, if n is absent,
assume 1.

 > sub("(\\d*)$", "",    # blank out the digits
       sapply(1:(length(ugreg)-1),
              function(z) substr(form, ugreg[z], ugreg[z+1]-1) ) )
[1] "C"  "H"  "Br" "O"

 > sub("^$", "1", sub("([A-Za-z]*)", "",    # subst "1" for empty strings
                      sapply(1:(length(ugreg)-1),
                             function(z) substr(form, ugreg[z], ugreg[z+1]-1) ) ) )
[1] "5"  "11" "1"  "1"
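
Putting the steps together, here is a minimal sketch of a helper that
returns a data frame and then computes a molecular weight from it. The
function name parse_formula and the "masses" lookup vector are
hypothetical, and the masses are rounded standard atomic weights for
just the four elements in the example:

```r
# Hypothetical helper: turn a simple formula such as "C5H11BrO" into a
# data frame of element symbols and counts (no brackets, no charges).
parse_formula <- function(form) {
  m      <- gregexpr("[A-Z][a-z]?[0-9]*", form)[[1]]
  starts <- as.vector(m)
  ends   <- starts + attr(m, "match.length") - 1
  pieces <- substring(form, starts, ends)
  counts <- sub("^[A-Z][a-z]?", "", pieces)  # strip the element symbol
  counts[counts == ""] <- "1"                # a missing count means 1
  data.frame(element = sub("[0-9]+$", "", pieces),
             count   = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Assumed lookup table of rounded atomic weights (illustration only).
masses <- c(C = 12.011, H = 1.008, O = 15.999, Br = 79.904)

tab <- parse_formula("C5H11BrO")
mw  <- sum(tab$count * masses[tab$element])
print(tab)
print(mw)
```

An element missing from the lookup vector would give NA for the weight,
so the table needs an entry for every symbol the formulas can contain.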

If you limited the number of elements searched for, it might improve  
the error trapping, I suppose.
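
One possible reading of that suggestion, as a sketch only: build the
pattern from a short list of permitted symbols and reject anything that
does not consist entirely of those symbols with optional counts. The
names "allowed" and is_valid_formula are hypothetical, and the element
list is an assumption:

```r
# Assumed list of permitted element symbols (illustration only).
allowed <- c("Br", "C", "H", "O")

# Hypothetical check: TRUE only if the whole string is a sequence of
# permitted symbols, each optionally followed by digits.
is_valid_formula <- function(form, allowed) {
  pattern <- paste0("^((", paste(allowed, collapse = "|"), ")[0-9]*)+$")
  grepl(pattern, form)
}

print(is_valid_formula("C5H11BrO", allowed))  # TRUE
print(is_valid_formula("C5H11Xx2", allowed))  # FALSE
```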

-- 
David.

>
> I'd like to get the count of each element, so in this case I need to  
> extract C and 5, H and 11, Br and 1, O and 1 (I want to calculate  
> the molecular weight by multiplying).  Sounds pretty simple, but my  
> experiments with grep and strsplit don't immediately clue me into an  
> obvious solution.  As I said, I don't need a general solution to the  
> problem of calculating molecular weight from an arbitrary formula,  
> that seems quite challenging, just a way to convert "form" into a  
> list or data frame which I can then do the math on.
>
> Here's hoping this is a simple issue for more experienced R users!   
> TIA,  Bryan
> ***********
> Bryan Hanson
> Professor of Chemistry & Biochemistry

David Winsemius, MD
West Hartford, CT