[R] Help with text separation
Petr PIKAL
petr.pikal at precheza.cz
Mon Nov 14 15:49:13 CET 2011
Hi
r-help-bounces at r-project.org wrote on 14.11.2011 14:54:05:
> Thank you Sarah,
>
> Your reply was very helpful. I have the added difficulty that I am not only
> dealing with single A-Z characters, but quite often have the following
> situation:
>
> form<-c('~Sentence+LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')
>
> and again, I need to remove the '+CTA*help' part of the character string.
> However, in another instance I may have
>
> form<-c('~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help')
>
>
> In this case I would need to remove 'Sentence*LEGAL+' from form.
>
>
> Can this be accomplished in the same manner?
Hm. I am not at all an expert in regular expressions, but I recently
learned some ways (thanks, Uwe). Applied to your second form:
sub("^(~)\\+(.+)\\+$", "\\1\\2", gsub("[[:alnum:]]+\\*[[:alnum:]]+", "", form))
[1] "~Intro+Intro/Intro1++benefit+benefit/benefit1+product+action+mean"
This removes every xxxxxx*yyyyy term from your form, together with the
leading and trailing +.
I wonder whether any automatic process can remove only one of several
xxxxxx*yyyyy substrings.
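One possible way to remove just one chosen term is to avoid regular expressions for that step and split the formula on the literal + instead. The sketch below is only an illustration: drop_term is a hypothetical helper that does not appear in this thread, and it assumes the string starts with ~, contains no spaces, and uses + only as a term separator. It drops every term that is exactly equal to the one requested, so a term that occurs twice would be removed twice.

```r
# Hypothetical helper (not from the thread): remove one named term from a
# formula string, along with the '+' that joined it to its neighbours.
drop_term <- function(form, term) {
  body  <- sub("^~", "", form)                     # strip the leading tilde
  terms <- strsplit(body, "+", fixed = TRUE)[[1]]  # split on the literal '+'
  keep  <- terms[terms != term]                    # drop exact matches only
  paste0("~", paste(keep, collapse = "+"))         # rebuild the string
}

form1 <- "~Sentence+LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help"
form2 <- "~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help"

res1 <- drop_term(form1, "CTA*help")        # only the trailing term goes
res2 <- drop_term(form2, "Sentence*LEGAL")  # only the leading term goes
```

Because the term is compared as a plain string, the * needs no escaping, and the other xxxxxx*yyyyy terms (for example Intro*LEGAL) are left alone.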
Regards
Petr
PS: it is still not perfect, as one extra + remains in the middle.
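A minimal sketch of one way to tidy up the leftover + signs after the terms have been removed: collapse every run of + into a single +, then trim a + directly after the ~ and a + at the end. The variable names stripped and tidy are made up for this illustration, and the [[:alnum:]] pattern is assumed to be good enough, which would not hold for variable names containing dots or underscores.

```r
form <- "~Sentence*LEGAL+Intro+Intro/Intro1+Intro*LEGAL+benefit+benefit/benefit1+product+action+mean+CTA*help"

# Step 1: remove every xxxxxx*yyyyy term; this leaves stray '+' signs behind.
stripped <- gsub("[[:alnum:]]+\\*[[:alnum:]]+", "", form)

# Step 2: tidy the separators.
tidy <- gsub("\\++", "+", stripped)  # collapse runs of '+' into one
tidy <- sub("^~\\+", "~", tidy)      # drop a '+' right after the '~'
tidy <- sub("\\+$", "", tidy)        # drop a trailing '+'

print(tidy)
```

Doing the cleanup in separate steps also covers forms where only the first or only the last term was removed, which the single sub() call above does not, since its pattern needs both a leading and a trailing +.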
>
> Many thanks, once again, for your help
>
> Mike Griffiths
>
>
>
> On Mon, Nov 14, 2011 at 12:09 PM, Sarah Goslee <sarah.goslee at gmail.com> wrote:
>
> > Hi,
> >
> > On Mon, Nov 14, 2011 at 4:20 AM, Michael Griffiths
> > <griffiths at upstreamsystems.com> wrote:
> > > Good morning R list,
> > >
> > > My apologies if this has *already* been answered elsewhere, but I have not
> > > found the answer that I am looking for.
> > >
> > > I have a character string, i.e.
> > >
> > >
> > > form<-c('~ A + B + C + C / D + E + E / F + G + H + I + J + K + L * M')
> > >
> > > Now, my aim is to find the position of all those instances of '*' and to
> > > remove said '*'. However, I would also like to remove the variable name
> > > before the '*', the math operator preceding this, and also the variable
> > > name after the '*'. So, here I would like to remove '+L*M'
> >
> > You just want to get rid of them? gsub() it is.
> >
> > I've changed your formula a little bit to better demonstrate what's going
> > on:
> > > form<-c('~ A + B * C + C / D + E + E / F * G + H + I + J + K + L * M')
> > > gsub(" \\+ [A-Z] \\* [A-Z]", "", form)
> > [1] "~ A + C / D + E + E / F * G + H + I + J + K"
> >
> > That regular expression will take out a
> > space
> > +
> > space
> > any capital letter
> > space
> > *
> > space
> > any capital letter.
> >
> > It will take out all occurrences of that sequence, but won't take out
> > occurrences of * not in that sequence.
> >
> > If you don't want the spaces, you don't need them. Just take them out
> > of the regular expression as well.
> >
> > Not that strsplit() was remotely the right tool here, but you can
> > split into characters without a separator:
> > > form <- 'abcd'
> > > strsplit(form, '')
> > [[1]]
> > [1] "a" "b" "c" "d"
> >
> > Sarah
> >
> > > So, far I have come up with the following code:
> > >
> > > parts<-strsplit(form,' ')
> > > index<-which(unlist(parts)=="*")
> > > for (i in 1:length(index)){
> > > parts[[1]][index[i]]<-list(NULL)
> > > parts[[1]][index[i]+1]<-list(NULL)
> > > parts[[1]][index[i]-1]<-list(NULL)
> > > parts[[1]][index[i]-2]<-list(NULL)
> > > }
> > > new.form<-unlist(parts)
> > >
> > > form<-new.form[0]
> > > for (i in 1: length(new.form)){
> > > form<-paste(form,new.form[i], sep="")
> > > }
> > >
> > > However, as you can see, I have had to use strsplit in what I consider a
> > > rather clumsy manner, as the character string (form) has to be in a certain
> > > format. All variables and maths operators require a space between them in
> > > order for strsplit to work in the manner I require.
> > >
> > > I would very much like to accomplish what the above code already does, but
> > > without the initial character string needing the aforementioned spaces.
> > >
> > > If the list can offer help, I would be most appreciative.
> > >
> > > Yours
> > >
> > > Mike Griffiths
> > >
> > >
> > >
> > --
> > Sarah Goslee
> > http://www.functionaldiversity.org
> >
>
>
>
> --
>
> *Michael Griffiths, Ph.D
> *Statistician
>
> *Upstream Systems*
>
> 8th Floor
> Portland House
> Bressenden Place
> SW1E 5BH
>
> Tel +44 (0) 20 7869 5147
> Fax +44 207 290 1321
> Mob +44 789 4944 145
>
> www.upstreamsystems.com
>
> *griffiths at upstreamsystems.com <einstein at upstreamsystems.com>*
>
> <http://www.upstreamsystems.com/>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.