[R] regex - optional part isn't considered in replacement with gsub

Bert Gunter bgunter.4567 at gmail.com
Mon Aug 28 01:01:28 CEST 2017


Omar:

I don't think this can work. For example, the letter-number patterns 4),
5), and 6) would all be matched by pattern 6).
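
To make the overlap concrete, here is a small sketch. The regex is my own
transcription of pattern 6) as described in the list ("1 letter, 3 numbers"),
so treat it as an assumption rather than the exact expression anyone is using:

```r
# Pattern 6) from the list: 1 letter followed by 3 digits (assumed transcription).
pattern6 <- "[A-Za-z][0-9]{3}"

# The example SKUs given for patterns 4), 5) and 6).
skus <- c("LH5000", "B8500", "E310")

# Without delimiters, pattern 6) is found inside all three strings.
grepl(pattern6, skus)
# [1] TRUE TRUE TRUE

# What it actually pulls out of each one: only E310 is a complete SKU.
regmatches(skus, regexpr(pattern6, skus))
# [1] "H500" "B850" "E310"
```

So the shorter pattern silently truncates the longer SKUs unless something
marks where a SKU starts and ends.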

As Jeff indicated, you need to provide the delimiters -- what
characters come before and after the SKU patterns -- to be able to
recognize them. In a quick look at the text file you attached, the
delimiters appeared to be either "-" or " " (blank) and perhaps <end
of character string>. If that is correct or if you can tell us how to
make it correct, then it's straightforward to proceed. Otherwise, I am
unable to help. Maybe someone else can.
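
If the delimiters really are "-", a blank, or the start/end of the string,
something along these lines might be a starting point. This is only a sketch
under that assumption: extract_sku() is a hypothetical helper name, the seven
sub-patterns are my transcription of the list in the quoted message, and the
test strings are made up for illustration.

```r
# Hypothetical helper: return the first SKU-like token in each string,
# ASSUMING SKUs are delimited by "-", a blank, or the start/end of the string.
extract_sku <- function(x) {
  x <- as.character(x)  # also handles factor columns
  # The seven patterns from the list, combined by alternation.
  core <- paste(
    "[0-9]{2}[A-Za-z][0-9][A-Za-z]",             # 1. 75Q8C
    "[A-Za-z]{4}[0-9]{2}[A-Za-z][0-9][A-Za-z]",  # 2. OLED65E7P
    "[A-Za-z]{2}[0-9]{2}[A-Za-z]{2}",            # 3. MT48AF
    "[A-Za-z]{2}[0-9]{4}",                       # 4. LH5000
    "[A-Za-z][0-9]{4}",                          # 5. B8500
    "[A-Za-z][0-9]{3}",                          # 6. E310
    "[A-Za-z][0-9]{3}[A-Za-z]{2}",               # 7. X541UJ
    sep = "|"
  )
  # Lookarounds check for a delimiter (or string boundary) on both sides
  # without making it part of the match.
  pattern <- paste0("(?<![^ -])(?:", core, ")(?![^ -])")
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m > 0
  out <- rep("no SKU", length(x))
  out[hit] <- regmatches(x, m)
  out
}

# Made-up product descriptions, for illustration only.
test <- c("LED TV 43'' LH5000", "LAPTOP X541UJ-GO",
          "SMART TV 75Q8C", "SOFA 3 CUERPOS")
extract_sku(test)
# [1] "LH5000" "X541UJ" "75Q8C"  "no SKU"
```

Because the delimiter is required on both sides, pattern 6) can no longer match
inside "LH5000" or "X541UJ". Note also that the seven listed patterns do not
cover "49MU6300" or "LE32S5970" from the first message, so the list would need
extending before this could be used on the real data.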

Cheers,
Bert

On Sun, Aug 27, 2017 at 11:47 AM, Omar André Gonzáles Díaz
<oma.gonzales at gmail.com> wrote:
> Hi Jeff, Bert, thank you for your input.
>
> I'm attaching a sample of the data, feel free to explore it.
>
> As I said, I need to extract the SKUs of the products (a key that
> identifies every product). Not every product (row) has a SKU; in that
> case "no SKU" should be the output.
>
> I've identified these patterns so far:
>
> 1.- 75Q8C: 2 numbers, 1 letter, 1 number, 1 letter.
> 2.- OLED65E7P: 4 letters, 2 numbers, 1 letter, 1 number, 1 letter.
> 3.- MT48AF: 2 letters, 2 numbers, 2 letters.
> 4.- LH5000: 2 letters, 4 numbers.
> 5.- B8500: 1 letter, 4 numbers.
> 6.- E310: 1 letter, 3 numbers.
> 7.- X541UJ: 1 letter, 3 numbers, 2 letters.
>
>
> I think those cover the majority of SKUs, so I would appreciate some
> guidance on how to extract all those different patterns.
>
> Related, but not the question asked: the idea is that after extracting
> the SKUs, some of them should be repeated across the different e-commerce
> sites. Those SKUs would let us compare the products and their prices.
>
>
> Thank you in advance.
>
> 2017-08-27 12:10 GMT-05:00 Bert Gunter <bgunter.4567 at gmail.com>:
>> You may have to provide us more detail on **exactly** the sorts of
>> patterns you wish to "capture" -- including exactly what you mean by
>> "capture" (what value do you wish to return?) -- as the "obvious"
>> answer is probably not sufficient:
>>
>> ## using your example -- thank you
>>
>>> gsub(".*(49MU6300|LE32S5970).*","\\1",ecommerce[[2]])
>> [1] "49MU6300"  "LE32S5970"
>>
>>
>> Cheers,
>> Bert
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>>
>> On Sun, Aug 27, 2017 at 9:18 AM, Omar André Gonzáles Díaz
>> <oma.gonzales at gmail.com> wrote:
>>> Hello, I need some help with regex.
>>>
>>> I have these two sentences. I need to extract both "49MU6300" and "LE32S5970"
>>> and put them in a new column "SKU".
>>>
>>> A) SMART TV UHD 49'' CURVO 49MU6300
>>> B) SMART TV HD 32'' LE32S5970
>>>
>>> DataFrame for testing:
>>>
>>> ecommerce <- data.frame(a = c(1,2),
>>>                         producto = c("SMART TV UHD 49'' CURVO 49MU6300",
>>>                                      "SMART TV HD 32'' LE32S5970"))
>>>
>>>
>>> I'm using gsub like this:
>>>
>>> 1.- This would capture A as intended but only "32S5970" from B (missing
>>> "LE").
>>>
>>> ecommerce$sku <- gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
>>> ecommerce$producto)
>>>
>>>
>>> 2.- This would capture "LE32S5970" but not "49MU6300".
>>>
>>> ecommerce$sku <-
>>> gsub("(.*)([a-zA-Z]{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
>>> ecommerce$producto)
>>>
>>>
>>> 3.- If I make the first 2 letters optional with:
>>>
>>> ecommerce$sku <-
>>> gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
>>> ecommerce$producto)
>>>
>>>
>>> "49MU6300" is captured, but again only "32S5970" from B (missing "LE").
>>>
>>>
>>> What should I do? How would you approach it?
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
