[R-sig-Epi] Fwd: Identify medicines names

Wed Apr 7 10:13:41 CEST 2021

A simple solution is to use a text-analysis package such as quanteda:

require(quanteda)

# patterns is assumed to be the character vector of drug names
drug_dictionary <- as.dictionary(data.frame(word = toupper(patterns),
                                            sentiment = patterns))
corpus(df$name) %>%
  tokens() %>%
  tokens_compound(drug_dictionary) %>%
  dfm() %>%
  dfm_lookup(drug_dictionary) %>%
  quanteda::convert(to = "data.frame")
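
For the combined-name case described in the quoted message below (PIPERACILINA and TAZOBACTAM both present, but not joined by a hyphen), a base R sketch is one possible alternative. It assumes that a hyphen in a pattern separates components that must all appear somewhere in the product name, and that plain substring matching is acceptable. The helper has_all_parts and the column antibiotic are hypothetical names used only for illustration, and the data are a subset of the sample in this thread.

```r
# Antibiotic names from the thread (subset), upper-cased.
patterns <- toupper(c("Oxacilina", "Penicilina", "Piperacilina-tazobactam",
                      "Tazobactam", "Piperacilina", "Polimixina B"))

# A few product names from the sample data frame.
df <- data.frame(name = c(
  "CLORETO DE POTASSIO DRAGEA 600MG",
  "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
  "PIPERACILINA SODICA 4G + TAZOBACTAM SODICA 0,5G PO LIOFILO INJETAVEL"),
  stringsAsFactors = FALSE)

# Hypothetical helper: TRUE for each element of x that contains every
# hyphen-separated component of `pattern` as a literal substring.
has_all_parts <- function(pattern, x) {
  parts <- strsplit(pattern, "-", fixed = TRUE)[[1]]
  hits <- lapply(parts, function(p) grepl(p, x, fixed = TRUE))
  Reduce(`&`, hits)
}

# Try patterns with the most components first, so that a combination
# takes precedence over its individual parts.
n_parts <- lengths(strsplit(patterns, "-", fixed = TRUE))
ordered <- patterns[order(-n_parts)]

# New variable: the first pattern (in that order) found in each name.
df$antibiotic <- NA_character_
upper_name <- toupper(df$name)
for (p in ordered) {
  hit <- is.na(df$antibiotic) & has_all_parts(p, upper_name)
  df$antibiotic[hit] <- p
}

df
```

With this ordering the last row is labelled PIPERACILINA-TAZOBACTAM rather than producing one row per component, and rows with no antibiotic stay NA. Because the match is a plain substring test, a pattern would also match inside a longer word, so word boundaries may be needed on real data.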

On Tue, Apr 6, 2021 at 4:42 PM Felipe Barletta
<felipe.e.barletta using gmail.com> wrote:
>
> Hi Gianpaolo,
>
> It works now, thank you!
>
> But it is not exactly what I need.
> Let me explain better.
>
> Your solution is good for identifying which medicines are antibiotics, and
> my own solution handles that too:
>
> ######################################################
> matches  <- unlist(sapply(patterns, function(p) grep(p, df$name,
>                                                      value = FALSE,
>                                                      ignore.case = TRUE)
>                           )
>                    )
> anti <- df[matches,]
> ########################################################
>
>
> But beyond identifying which medicines are antibiotics, I need to:
> - Create a new variable that holds the matching name from the patterns
>   object (when the medicine is an antibiotic).
> I did this with the code below, using fuzzyjoin::regex_left_join():
>
> #########################################################
> # List of medicines to look for (object called patterns).
> patterns <-  c("Oritavancina", "Oxacilina", "Pefloxacino", "Penicilina",
>               "Pexiganan",  "Piperacilina-tazobactam","Tazobactam",
>               "Pirazinamida", "Plazomicina", "Polimixina B",
>               "Posilozid","Piperacilina")
> patterns <- toupper(patterns)
>
> # Sample Data frame where I need to find the names from the list above.
> df <- data.frame(name =
>                      c("CLORETO DE POTASSIO DRAGEA 600MG",
>                        "CLORETO DE SODIO 0,9% SERINGA PREENCHIDA 5ML",
>                        "CLORETO DE SODIO SOLUCAO INJETAVEL 0,9% 10ML",
>                        "CODEINA FOSFATO SOLUCAO ORAL 3MGML 10ML ISCMPA @",
>                        "CODEINA FOSFATO SOLUCAO ORAL 3MGML 5ML ISCMPA @",
>                        "DipiRONA SOLUCAO INJETAVEL 500MGML 2ML",
>                        "DipiRONA SOLUCAO INJETAVEL 500MGML 2ML",
>                        "FUROSEMIDA SOLUCAO INJETAVEL 10MGML 2ML",
>                        "HIDROCORTISONA SUCCINATO SODICO PO LIOFILO INJETAVEL 100MG",
>                        "ONDANSETRONA CLORIDRATO SOLUCAO INJETAVEL 2MGML 4ML",
>                        "ONDANSETRONA CLORIDRATO SOLUCAO INJETAVEL 2MGML 4ML",
>                        "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
>                        "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
>                        "PIPERACILINA SODICA 4G + TAZOBACTAM SODICA 0,5G PO LIOFILO INJETAVEL"))
>
>
> library(dplyr)  # provides %>% and mutate()
> df <- df %>% mutate(name = toupper(name))
> patterns <- data.frame(name = patterns)
> results <- fuzzyjoin::regex_left_join(df,
>                                       patterns,
>                            by = "name")
> results
> #########################################################
> Notice in the results object that when the medicine name has two components
> ("PIPERACILINA SODICA 4G + TAZOBACTAM SODICA 0,5G PO LIOFILO INJETAVEL"),
> the solution doesn't find "PIPERACILINA-TAZOBACTAM".
> Instead, the code creates two rows, one for PIPERACILINA and another for
> TAZOBACTAM.
>
> I hope this explanation is clearer.
>
> On Tue, Apr 6, 2021 at 03:55, Gianpaolo Romeo
> <gianpaolo.romeo using gmail.com> wrote:
>
> > Sorry, I wrote the code on a smartphone without access to R. Try this:
> >
> > require(dplyr)
> >
> > patterns <- c("Oritavancina", "Oxacilina", "Pefloxacino", "Penicilina",
> >               "Pexiganan", "Piperacilina", "Piperacilina-tazobactam",
> >               "Pirazinamida", "Plazomicina", "Polimixina B",
> >               "Posilozid")
> >
> > patterns.new <- paste(patterns, collapse = "|")
> >
> >
> > df <- data.frame(name =
> >                    c("CLORETO DE POTASSIO DRAGEA 600MG",
> >                      "CLORETO DE SODIO 0,9% SERINGA PREENCHIDA 5ML",
> >                      "CLORETO DE SODIO SOLUCAO INJETAVEL 0,9% 10ML",
> >                      "CODEINA FOSFATO SOLUCAO ORAL 3MGML 10ML ISCMPA @",
> >                      "CODEINA FOSFATO SOLUCAO ORAL 3MGML 5ML ISCMPA @",
> >                      "DipiRONA SOLUCAO INJETAVEL 500MGML 2ML",
> >                      "DipiRONA SOLUCAO INJETAVEL 500MGML 2ML",
> >                      "FUROSEMIDA SOLUCAO INJETAVEL 10MGML 2ML",
> >                      "HIDROCORTISONA SUCCINATO SODICO PO LIOFILO INJETAVEL 100MG",
> >                      "ONDANSETRONA CLORIDRATO SOLUCAO INJETAVEL 2MGML 4ML",
> >                      "ONDANSETRONA CLORIDRATO SOLUCAO INJETAVEL 2MGML 4ML",
> >                      "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
> >                      "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
> >                      "PIPERACILINA SODICA 4G + TAZOBACTAM SODICA 0,5G PO LIOFILO INJETAVEL"))
> >
> >
> > results <- df %>% filter(grepl(pattern = patterns.new, x = name,
> > ignore.case = TRUE))
> >
> > On Tue, Apr 6, 2021 at 02:06, Felipe Barletta
> > <felipe.e.barletta using gmail.com> wrote:
> >
> >> Thanks a lot, Gianpaolo, but your suggestion didn't work.
> >>
> >> On Mon, Apr 5, 2021 at 4:50 PM, Gianpaolo Romeo
> >> <gianpaolo.romeo using gmail.com> wrote:
> >>
> >>> I suggest using the dplyr package:
> >>>
> >>>
> >>>
> >>> df %>% mutate(name = toupper(name)) %>%
> >>> filter(grepl(pattern = patterns, name))
> >>>
> >>>
> >>> If you want to match every string that starts with a specific word:
> >>>
> >>> patterns <- paste0("^", patterns)
> >>>
> >>>
> >>> On Mon, Apr 5, 2021 at 20:25, Felipe Barletta
> >>> <felipe.e.barletta using gmail.com> wrote:
> >>>
> >>>> Hi friends,
> >>>>
> >>>> I need to identify medicine names in a data set.
> >>>> I have a list of antibiotic names and need to find those names in a
> >>>> sample.
> >>>>
> >>>> When the medicine name is a single word, my solution works; see:
> >>>>
> >>>> # List of medicines to look for (object called patterns).
> >>>> patterns <- c("Oritavancina", "Oxacilina", "Pefloxacino", "Penicilina",
> >>>>               "Pexiganan", "Piperacilina", "Piperacilina-tazobactam",
> >>>>               "Pirazinamida", "Plazomicina", "Polimixina B",
> >>>>               "Posilozid")
> >>>>
> >>>>
> >>>> # Sample Data frame where I need to find the names from the list above.
> >>>> df <- data.frame(name =
> >>>>                      c("CLORETO DE POTASSIO DRAGEA 600MG",
> >>>>                        "CLORETO DE SODIO 0,9% SERINGA PREENCHIDA 5ML",
> >>>>                        "CLORETO DE SODIO SOLUCAO INJETAVEL 0,9% 10ML",
> >>>>                        "CODEINA FOSFATO SOLUCAO ORAL 3MGML 10ML ISCMPA @",
> >>>>                        "CODEINA FOSFATO SOLUCAO ORAL 3MGML 5ML ISCMPA @",
> >>>>                        "DipiRONA SOLUCAO INJETAVEL 500MGML 2ML",
> >>>>                        "DipiRONA SOLUCAO INJETAVEL 500MGML 2ML",
> >>>>                        "FUROSEMIDA SOLUCAO INJETAVEL 10MGML 2ML",
> >>>>                        "HIDROCORTISONA SUCCINATO SODICO PO LIOFILO INJETAVEL 100MG",
> >>>>                        "ONDANSETRONA CLORIDRATO SOLUCAO INJETAVEL 2MGML 4ML",
> >>>>                        "ONDANSETRONA CLORIDRATO SOLUCAO INJETAVEL 2MGML 4ML",
> >>>>                        "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
> >>>>                        "Penicilina G BENZATINA PO LIOFILO INJETAVEL 1200000UI",
> >>>>                        "PIPERACILINA SODICA 4G + TAZOBACTAM SODICA 0,5G PO LIOFILO INJETAVEL"))
> >>>>
> >>>>
> >>>>
> >>>>
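> >>>> # regex_left_join() joins two data frames, so wrap patterns in one;
> >>>> # ignore_case = TRUE because df$name is not consistently upper case.
> >>>> results <- fuzzyjoin::regex_left_join(df,
> >>>>                                       data.frame(name = patterns),
> >>>>                                       by = "name",
> >>>>                                       ignore_case = TRUE)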
> >>>>
> >>>> head(results)
> >>>>
> >>>> # Another way: identify the matching rows with grep().
> >>>> matches  <- unlist(sapply(patterns, function(p) grep(p, df$name,
> >>>>                                                      value = FALSE,
> >>>>                                                      ignore.case = TRUE)
> >>>>                           )
> >>>>                    )
> >>>>
> >>>> anti <- df[matches,]
> >>>>
> >>>> However, it does not work when the name is a compound (for example,
> >>>> Piperacillin-tazobactam).
> >>>>
> >>>> Can anyone help me with this issue?
> >>>>
> >>>>         [[alternative HTML version deleted]]
> >>>>
> >>>> _______________________________________________
> >>>> R-sig-Epi using r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-sig-epi
> >>>>
> >>>