[R] Help with text separation

Petr PIKAL petr.pikal at precheza.cz
Mon Nov 14 15:49:13 CET 2011


Hi

r-help-bounces at r-project.org wrote on 14.11.2011 14:54:05:

> Thank you Sarah,
> 
> Your reply was very helpful. I have the added difficulty that I am not only
> dealing with single A-Z characters, but quite often have the following
> situation:
> 
> form<-c('~Sentence+LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')
> 
> and again, I need to remove the '+CTA*help' part of the character string.
> However, in another instance I may have
> 
> form<-c('~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')
> 
> 
> In this case I would need to remove 'Sentence*LEGAL+' from form.
> 
> 
> Can this be accomplished in the same manner?

Hm. I am not at all an expert in regular expressions, but I recently 
learned some ways (thanks, Uwe):

sub("^(~)\\+(.+)\\+$", "\\1\\2", gsub("[[:alnum:]]+\\*[[:alnum:]]+", "", form))
[1] "~Intro+Intro/Intro1++benefit+benefit/benefit1+product+action+mean"

This removes every xxxxxx*yyyyy term from your form (the second one above), 
together with the leading and trailing +.

I wonder if any automatic process can remove only one from several 
xxxxxx*yyyyy substrings.
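
One possible way to remove only one of several xxxxxx*yyyyy substrings is to locate all matches with gregexpr() and cut out just the n-th one. The following is a minimal sketch, not something from this thread: the function name remove_nth_term and its argument n are made up for illustration, and it assumes both operands of '*' are plain alphanumeric names.

```r
# Hypothetical helper (not from the thread): remove only the n-th x*y term,
# together with the "+" in front of it when there is one.
remove_nth_term <- function(form, n) {
  pattern <- "\\+?[[:alnum:]]+\\*[[:alnum:]]+"
  m <- gregexpr(pattern, form)[[1]]
  # No match at all, or n out of range: return the input unchanged
  if (m[1] == -1 || n < 1 || n > length(m)) return(form)
  start <- m[n]
  len <- attr(m, "match.length")[n]
  # Glue together the text before and after the n-th match
  out <- paste0(substr(form, 1, start - 1),
                substr(form, start + len, nchar(form)))
  # If the removed term came directly after "~", a stray "+" is left behind
  sub("^~\\+", "~", out)
}

form <- "~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help"
remove_nth_term(form, 1)  # drops only 'Sentence*LEGAL+'
remove_nth_term(form, 3)  # drops only '+CTA*help'
```

Which n to pick still has to be decided by the user; the sketch only shows that a single occurrence can be targeted.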

Regards
Petr

PS: It is still not perfect, as one extra + remains in the middle.
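
The leftover + in the middle can probably be avoided by removing each xxxxxx*yyyyy term together with the + in front of it, and treating a term right after '~' separately. A minimal sketch under that assumption follows; the name drop_interactions is made up for illustration, and it assumes both operands of '*' are plain alphanumeric names.

```r
# Hypothetical helper (not from the thread): drop every x*y term without
# leaving a doubled "+" behind.
drop_interactions <- function(form) {
  term <- "[[:alnum:]]+\\*[[:alnum:]]+"
  # Remove each x*y term together with the "+" in front of it
  out <- gsub(paste0("\\+", term), "", form)
  # A term right after "~" has no "+" in front, so remove its trailing "+" instead
  sub(paste0("^~", term, "\\+?"), "~", out)
}

form <- "~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help"
drop_interactions(form)
```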

> 
> Many thanks, once again, for your help
> 
> Mike Griffiths
> 
> 
> 
> On Mon, Nov 14, 2011 at 12:09 PM, Sarah Goslee <sarah.goslee at gmail.com> wrote:
> 
> > Hi,
> >
> > On Mon, Nov 14, 2011 at 4:20 AM, Michael Griffiths
> > <griffiths at upstreamsystems.com> wrote:
> > > Good morning R list,
> > >
> > > My apologies if this has already been answered elsewhere, but I have not found
> > > the answer that I am looking for.
> > >
> > > I have a character string, i.e.
> > >
> > >
> > > form<-c('~ A + B + C + C / D + E + E / F + G + H + I + J + K + L * M')
> > >
> > > Now, my aim is to find the position of all those instances of '*' and to
> > > remove said '*'. However, I would also like to remove the preceding
> > > variable name before the '*', the math operator preceding this, and also
> > > the variable name after the '*'. So, here I would like to remove '+L*M'
> >
> > You just want to get rid of them? gsub() it is.
> >
> > I've changed your formula a little bit to better demonstrate what's going
> > on:
> > > form<-c('~ A + B * C + C / D + E + E / F * G + H + I + J + K + L * M')
> > > gsub(" \\+ [A-Z] \\* [A-Z]", "", form)
> > [1] "~ A + C / D + E + E / F * G + H + I + J + K"
> >
> > That regular expression will take out a
> > space
> > +
> > space
> > any capital letter
> > space
> > *
> > space
> > any capital letter.
> >
> > It will take out all occurrences of that sequence, but won't take out
> > occurrences of * not in that sequence.
> >
> > If you don't want the spaces, you don't need them. Just take them out
> > of the regular expression as well.
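
For illustration, a variant of the pattern above that makes the spaces optional, so the same call handles both the spaced and the unspaced form. This is only a sketch: the name strip_terms is made up here, and like the original pattern it assumes single capital-letter variable names.

```r
# Hypothetical helper (not from the thread): same idea as the gsub() call
# above, with every space replaced by "[[:space:]]*" (zero or more).
strip_terms <- function(form) {
  gsub("[[:space:]]*\\+[[:space:]]*[A-Z][[:space:]]*\\*[[:space:]]*[A-Z]", "", form)
}

strip_terms("~A+B*C+C/D+E+E/F*G+H+I+J+K+L*M")
strip_terms("~ A + B * C + C / D + E + E / F * G + H + I + J + K + L * M")
```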
> >
> > Not that strsplit() was remotely the right tool here, but you can
> > split into characters without a separator:
> > > form <- 'abcd'
> > > strsplit(form, '')
> > [[1]]
> > [1] "a" "b" "c" "d"
> >
> > Sarah
> >
> > > So, far I have come up with the following code:
> > >
> > > parts<-strsplit(form,' ')
> > > index<-which(unlist(parts)=="*")
> > > for (i in 1:length(index)){
> > >    parts[[1]][index[i]]<-list(NULL)
> > >    parts[[1]][index[i]+1]<-list(NULL)
> > >    parts[[1]][index[i]-1]<-list(NULL)
> > >    parts[[1]][index[i]-2]<-list(NULL)
> > > }
> > > new.form<-unlist(parts)
> > >
> > > form<-new.form[0]
> > > for (i in 1: length(new.form)){
> > >    form<-paste(form,new.form[i], sep="")
> > > }
> > >
> > > However, as you can see, I have had to use strsplit in what I consider a
> > > rather clumsy manner, as the character string (form) has to be in a certain
> > > format. All variables and maths operators require a space between them in
> > > order for strsplit to work in the manner I require.
> > >
> > > I would very much like to accomplish what the above code already does, but
> > > without requiring the initial character string to contain the
> > > aforementioned spaces.
> > >
> > > If the list can offer help, I would be most appreciative.
> > >
> > > Yours
> > >
> > > Mike Griffiths
> > >
> > >
> > >
> > --
> > Sarah Goslee
> > http://www.functionaldiversity.org
> >
> 
> 
> 


