[R] Differentiate numbers from reference for rows

Gabor Grothendieck ggrothendieck at gmail.com
Sat Oct 30 16:20:38 CEST 2010


On Sat, Oct 30, 2010 at 9:43 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Oct 30, 2010, at 8:42 AM, Gabor Grothendieck wrote:
>
>> On Fri, Oct 29, 2010 at 6:54 PM, M.Ribeiro <mresendeufv at yahoo.com.br>
>> wrote:
>>>
>>> So, I am having a tricky reference file to extract information from.
>>>
>>> The format of the file is
>>>
>>> x   1 + 4 * 3 + 5 + 6 + 11 * 0.5
>>>
>>> The elements that are not being multiplied (1, 5 and 6) and the
>>> elements before the multiplication sign (4 and 11) are actually
>>> references to the rows of a matrix from which I need to extract
>>> the elements.
>>>
>>> The numbers after the multiplication sign are regular numbers
>>> Ex:
>>>
>>>> x<-matrix(20:35)
>>>
>>> I would like to read rows 1, 4, 5, 6 and 11 and sum them. However, the
>>> elements in rows 4 and 11 are multiplied by 3 and 0.5, respectively.
>>>
>>> So it would be
>>> 20 + 23 * 3 + 24 + 25 + 30 * 0.5.
>>>
>>> And I have this format in different files so I can't do all by hand.
>>> Can anybody help me with a script that can differentiate this?
>>
>>
>> I assume that every number except for the second number in the pattern
>> number * number is to be replaced by the element in that row of x.  Try
>> this.  We define a regular expression which matches the first number
>> ([0-9]+) of each potential pair, optionally (?) followed by spaces ( *),
>> a star (\\*), more spaces ( *) and digits ([0-9.]+).  gsubfn passes the
>> first and second backreferences (the matches to the parenthesized
>> portions of the regular expression) to f and inserts the output of f
>> where the matches had been.
>>
>> library(gsubfn)
>> f <- function(a, b) paste(x[as.numeric(a)], b)
>> s2 <- gsubfn("([0-9]+)( *\\* *[0-9.]+)?", f, s)
>>
>> If the objective is then to perform the calculation that each string
>> represents, try this:
>> sapply(s2, function(x) eval(parse(text = x)))
>>
>> For example,
>>
>>> s <- c("1 + 4 * 3 + 5 + 6 + 11 * 0.5", "1 + 4 * 3 + 5 + 6 + 11 * 0.5")
>>> x <- matrix(20:35)
>>> f <- function(a, b) paste(x[as.numeric(a)], b)
>>> s2 <- gsubfn("([0-9]+)( *\\* *[0-9.]+)?", f, s)
>>> s2
>>
>> [1] "20  + 23  * 3 + 24  + 25  + 30  * 0.5"
>> [2] "20  + 23  * 3 + 24  + 25  + 30  * 0.5"
>>>
>>> sapply(s2, function(x) eval(parse(text = x)))
>>
>> 20  + 23  * 3 + 24  + 25  + 30  * 0.5 20  + 23  * 3 + 24  + 25  + 30  * 0.5
>>                                   153                                   153
>>
>> For more see the gsubfn home page at http://gsubfn.googlecode.com
>
>
> I am scratching my head regarding the gsubfn workings. It appears that as
> gsubfn moves across the input strings, it will either match just
> "[0-9]+" or it will match "[0-9]+ *\\* *[0-9.]+".

In the regular expression

   "([0-9]+)( *\\* *[0-9.]+)?"

it matches the first (...) and then the (...)? part.  The ? means 0 or 1
occurrences, so the second part either matches the " * number" text or,
if that is not possible, matches the empty string.
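
To see the same thing without gsubfn, here is a minimal base-R sketch
(the names pat, m_pair and m_single are made up for this illustration).
regexec() reports the whole match followed by one entry per
parenthesized group, so it shows what each backreference holds when the
optional part does and does not take part in the match:

```r
# Same regular expression as above; pat is just an illustrative name.
pat <- "([0-9]+)( *\\* *[0-9.]+)?"

# A term with a multiplier: both groups take part in the match.
m_pair <- regmatches("4 * 3", regexec(pat, "4 * 3"))[[1]]

# A bare row reference: the optional second group matches nothing.
m_single <- regmatches("5", regexec(pat, "5"))[[1]]

print(m_pair)    # whole match, then group 1, then group 2
print(m_single)  # group 2 comes back as an empty string
```

In the second case the optional group contributes an empty string
rather than a missing value, which is why a two-argument function can
always be called.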

>
> In either case the match will do a lookup in x[] for the first match using
> the "a" index, and if there is a match for the second position assigned to
> "*b" then that x[a] will be followed by "*b"  and is therefore destined to
> be multiplied by "b". I cannot quite figure out how the NULL value gets
> not-matched to the second back-reference and then doesn't screw up the f()
> function by only providing one argument to a two argument function. Maybe
> it's due to this? (So can you comment on how optional back-references return
> values?)

(...)? says to match 0 or 1 occurrences of the ... expression.  If
(...) does not match then (...)? will still succeed by matching the
empty string, so b is "" rather than NULL.  The function is always
called with two arguments.  Try this:

> s <- "1 + 4 * 3 + 5 + 6 + 11 * 0.5"
> g <- function(a, b) sprintf("<a='%s'><b='%s'>", a, b)
> gsubfn("([0-9]+)( *\\* *[0-9.]+)?", g, s)
[1] "<a='1'><b=''> + <a='4'><b=' * 3'> + <a='5'><b=''> + <a='6'><b=''> + <a='11'><b=' * 0.5'>"
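
For anyone who wants to follow the mechanics step by step, the whole
replacement can also be sketched in base R alone.  This is only an
illustration of the idea, not how gsubfn is implemented; replace_term,
terms, s2 and result are hypothetical names chosen for the sketch.  It
finds every match, splits each one into the row reference and the
optional multiplier, looks the row up in x, and writes the pieces back:

```r
x <- matrix(20:35)
s <- "1 + 4 * 3 + 5 + 6 + 11 * 0.5"
pat <- "([0-9]+)( *\\* *[0-9.]+)?"

# Locate every match of the whole pattern and extract the matched text.
m <- gregexpr(pat, s)
terms <- regmatches(s, m)[[1]]

# Split one matched term into a (row reference) and b (optional
# " * number" suffix, possibly ""), then substitute x[a] for a.
replace_term <- function(term) {
  parts <- regmatches(term, regexec(pat, term))[[1]]
  paste0(x[as.numeric(parts[2])], parts[3])
}

# Write the rewritten terms back into a copy of the string.
s2 <- s
regmatches(s2, m) <- list(vapply(terms, replace_term, character(1),
                                 USE.NAMES = FALSE))
print(s2)

# Evaluate the resulting arithmetic expression.
result <- eval(parse(text = s2))
print(result)
```

Because this sketch uses paste0 rather than paste, the rewritten string
has single spaces around the operators; the value of the expression is
the same as in the gsubfn version.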



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


