[R] Discovering patterns in textual strings
Bert Gunter
bgunter.4567 using gmail.com
Sat May 5 22:59:02 CEST 2018
Jeff:
The previous solution I sent you was hugely inefficient and frankly kind of
stupid. Here is a much better and simpler solution.
> z <- c("abc",
"abc_def",
"abc.def",
"abc def",
"abcd_ef",
"abcd",
"e","f")
## Create vector of patterns of same length as z, many of which are repeated
> pats <- sub("^(.+)[. _].*","\\1",z)
## Now you can use tapply() to get the indices, if desired
## Note that the patterns label the groups
> tapply(seq_along(z),pats,I)
$abc
[1] 1 2 3 4
$abcd
[1] 5 6
$e
[1] 7
$f
[1] 8
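One caveat, offered as a sketch rather than a definitive fix: the ".+" in the pattern above is greedy, so a string with several separators keeps everything up to the last one. With the sample strings from Jeff's original post, "Abc_1234_kjhksh_276" would be labelled "Abc_1234_kjhksh" rather than "Abc". If the key is assumed to end at the first separator (an assumption; the specifications do not settle it), deleting everything from the first separator onward is one option. The name first_prefix is purely illustrative.

```r
## Sample strings taken from the original post in this thread
z <- c("Abc_1234_kjhksh_276", "Abc", "Abc_1234_lakdofyo_324",
       "Bce_876_skdhk_*&^%*&", "Bce", "Bce_454")

## Pattern from the message above: greedy, so it cuts at the LAST separator
greedy <- sub("^(.+)[. _].*", "\\1", z)

## Illustrative alternative (assumes the key ends at the FIRST separator):
## remove the first separator and everything after it
first_prefix <- sub("[. _].*$", "", z)

## Group the indices of z by label; the labels name the groups
groups <- split(seq_along(z), first_prefix)
print(groups)
```

split() gives a named list of indices, grouped in the same way as the tapply() call. Note that under this variant a string that begins with a separator gets an empty label.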
No need to reply.
Cheers,
Bert
Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Sat, May 5, 2018 at 12:14 AM, Bert Gunter <bgunter.4567 using gmail.com> wrote:
> "Does that help?"
>
> No. I am not your private consultant. You need to reply to the list, which
> I have cc'ed here, not just me.
>
> I am still somewhat confused by your specifications, but others may not
> be. Part of my confusion stems from your failure to provide a reproducible
> example (see e.g. the posting guide linked below). For example, I cannot
> tell from your text whether the Abc and Bce strings contain one or more
> spaces at the end. I shall assume they may but need not.
>
> Anyway, here is a reproducible example and solution that assumes that the
> substrings/patterns of interest to you occur at the beginning of the
> strings and may or may not be followed by one of ".", "_" or " " (space) and
> then possibly further text which should be ignored. Assuming that you are
> familiar with regular expressions, maybe this will help to get you started
> even if I have misunderstood your specifications. If you aren't familiar
> with regexes, the stringr package may provide a gentler interface
> than using R's raw regex functionality. Or maybe someone else can suggest a
> better approach (which is another reason why you should reply to the list,
> not just me).
>
> z <- c("abc",
> "abc_def",
> "abc.def",
> "abc def",
> "abcd_ef",
> "abcd",
> "e","f")
>
> pats <- unique(sub("^(.+)[. _]+.*", "\\1", z))
> ## gives:
> > pats
> [1] "abc" "abcd" "e" "f"
>
>
> This gives you the four separate patterns that you could then use to group
> your records, perhaps by:
>
> > lapply(pats,function(x)grep(paste0("^", x,"([_. ]|$)"), z))
> [[1]]
> [1] 1 2 3 4
>
> [[2]]
> [1] 5 6
>
> [[3]]
> [1] 7
>
> [[4]]
> [1] 8
>
> That is, indices 1-4 in z are the first group; 5 and 6 are the second; etc.
>
>
>
> Cheers,
> Bert
>
>
> On Fri, May 4, 2018 at 9:00 PM, Jeff Reichman <reichmanj using sbcglobal.net>
> wrote:
>
>> Bert
>>
>> Thank you for the link. Figured there might be something
>>
>> Regarding your questions
>>
>> This is from a large data set of 53 billion records. The column in
>> question is AdNames (Real Time Bidding data).
>>
>> #1. Generally yes, but not always
>>
>> #2 Separators could be underscores (_) or dots (.) as in 1.2.3_ABC ......
>>
>> #3 Yes. So "Abc 123" could be a matching string.
>>
>> This would not be considered a match ...
>> abc_something
>> this.is_a long stringwithabcinthemiddle
>>
>> The sequence(s) are always at the beginning (or so it appears). Out
>> of the 54 billion records I am able to pull (SparkR sql) 948,679 unique
>> strings. It is from these unique strings that I (if possible) want to
>> identify the "key" strings.
>>
>> 1. Abc_1232.niok7j9hd
>> 2. Abc
>> 3. Abc.2#348hfk2.njilo
>> 4. Abc.2
>> 5. Abc.7
>> 6. BAdfr_kajdhf98#kjsdh
>> 7. BAdrf_gofer
>> 948679 ....
>>
>>
>> So I may have a thousand individual strings, all of which have Abc as a
>> common string, or Badrf. So I am looking to pull "Abc," "BAdrf", etc., so
>> that I can go back and restructure the data to show that any record with
>> Abc_1232.niok7j9hd is part of the Abc "Group," or Family ???
>>
>> Does that help?
>>
>> Jeff
>>
>> -----Original Message-----
>> From: Bert Gunter <bgunter.4567 using gmail.com>
>> Sent: Friday, May 4, 2018 5:41 PM
>> To: reichmanj using sbcglobal.net
>> Cc: R-help <R-help using r-project.org>
>> Subject: Re: [R] Discovering patterns in textual strings
>>
>> The answer is, of course, using regular expressions and/or libraries
>> therefor. However, I do not think you have defined your problem
>> sufficiently. Some questions I have:
>>
>> 1. Do possible patterns to be matched always appear at the beginning of
>> your strings?
>>
>> 2. Always together between specified separators ("_" in your example);
>> or one of several specified separators; or otherwise?
>>
>> 3. Do spaces or other nonprinting characters occur in your strings?
>>
>> e.g. would
>>
>> abc_something
>> this.is_a long stringwithabcinthemiddle
>>
>> be considered matching?
>> There are undoubtedly other possibilities that I've missed.
>>
>>
>>
>> You may also find it useful to check this "task view" out for
>> possibilities:
>> https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
>>
>> Cheers,
>> Bert
>>
>>
>>
>>
>> On Fri, May 4, 2018 at 3:25 PM, Jeff Reichman <reichmanj using sbcglobal.net>
>> wrote:
>> > R Help Forum
>> >
>> >
>> >
>> > Is there an R library (or a way) with which I can extract unique character
>> > strings, or repeating patterns, in textual strings? Say, for example, I
>> > have the following records:
>> >
>> >
>> >
>> > Abc_1234_kjhksh_276
>> >
>> > Abc
>> >
>> > Abc_1234_lakdofyo_324
>> >
>> > Bce_876_skdhk_*&^%*&
>> >
>> > Bce
>> >
>> > Bce_454
>> >
>> >
>> >
>> > And I would like to see the following results
>> >
>> > Abc
>> >
>> > Abc_1234
>> >
>> > Bce
>> >
>> >
>> >
>> >
>> >
>> > Jeff Reichman
>> >
>> >
>> >
>> > ______________________________________________
>> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>>
>