[R] Identifying words from a list and code as 0 or 1 and words NOT on the list code as 1

Rui Barradas
Fri Jun 11 20:03:49 CEST 2021


Hello,

From what I understood of the problem, this might be what you want.


library(dplyr)
library(stringr)

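# Wrap each core word in word boundaries so that, e.g., "a" does not
# match inside "small", then combine them into a single alternation.
coreWordsPat <- paste0("\\b", coreWords, "\\b")
coreWordsPat <- paste(coreWordsPat, collapse = "|")

left_join(
   # Core: 1 if the utterance contains at least one core word
   df %>%
     mutate(Core = +str_detect(Utterance, coreWordsPat)) %>%
     select(ID, Utterance, Core),
   # Fringe: 1 if anything is left after removing all core words
   df %>%
     mutate(Fringe = str_remove_all(Utterance, coreWordsPat),
            Fringe = +(nchar(trimws(Fringe)) > 0)) %>%
     select(ID, Fringe),
   by = "ID"
)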
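
To see why the word boundaries matter, here is a minimal sketch on a made-up subset of the core words and utterances. The names `utterances`, `naivePat`, `boundedPat`, `core` and `fringe` are invented for this illustration and are not part of the original data. Without boundaries, "small" is flagged as core simply because it contains the letter "a", which is itself a core word.

```r
library(stringr)

# Hypothetical mini example, not the full word list or data set
coreWords  <- c("a", "yes", "is", "what", "that", "on", "all done")
utterances <- c("a baby", "small", "yes", "what is that on his head")

# Plain alternation: matches core words anywhere, even inside other words
naivePat <- paste(coreWords, collapse = "|")
naive <- str_detect(utterances, naivePat)   # "small" matches via "a"

# Alternation with word boundaries: only whole words (or phrases) match
boundedPat <- paste0("\\b", coreWords, "\\b", collapse = "|")

# Core: at least one whole core word is present
core <- +str_detect(utterances, boundedPat)

# Fringe: something other than whitespace remains once core words are removed
fringe <- +(nchar(trimws(str_remove_all(utterances, boundedPat))) > 0)

data.frame(utterances, core, fringe)
```

With the bounded pattern, "a baby" should come out as Core 1 and Fringe 1, and "small" as Core 0 and Fringe 1. Two caveats, as far as I can tell: the match is case sensitive, so "I" will not match "i", and an apostrophe counts as a boundary, so "that's" would match the core word "that" and leave "'s" behind as fringe.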


Hope this helps,

Rui Barradas

At 18:02 on 11/06/21, Debbie Hahs-Vaughn wrote:
> I am working with utterances, statements spoken by children.  From each utterance, if one or more words in the statement match a predefined list of multiple 'core' words (probably 300 words), then I want to input '1' into 'Core' (and if none, then input '0' into 'Core').
> 
> If there are one or more words in the statement that are NOT core words, then I want to input '1' into 'Fringe' (and if there are only core words and nothing extra, then input '0' into 'Fringe').  I will not have a list of Fringe words.
> 
> Basically, right now I have a child ID and only the utterances.  Here is a snippet of my data.
> 
> ID      Utterance
> 1       a baby
> 2       small
> 3       yes
> 4       where's his bed
> 5       there's his bed
> 6       where's his pillow
> 7       what is that on his head
> 8       hey he has his arm stuck here
> 9       there there's it
> 10      now you're gonna go night-night
> 11      and that's the thing you can turn on
> 12      yeah where's the music box
> 13      what is this
> 14      small
> 15      there you go baby
> 
> 
> The following code runs but isn't doing exactly what I need--which is:  1) the ability to detect words from the list and define as core; 2) the ability to search the utterance and if there are any words in the utterance that are NOT core, to identify those as '1' as I will not have a list of fringe words.
> 
> ```
> 
> library(dplyr)
> library(stringr)
> library(tidyr)
> 
> coreWords <-c("I", "no", "yes", "my", "the", "want", "is", "it", "that", "a", "go", "mine", "you", "what", "on", "in", "here", "more", "out", "off", "some", "help", "all done", "finished")
> 
> str_detect(df,)
> 
> dfplus <- df %>%
>    mutate(id = row_number()) %>%
>    separate_rows(Utterance, sep = ' ') %>%
>    mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
>           Fringe = + !Core) %>%
>    group_by(id) %>%
>    mutate(Core = + (sum(Core) > 0),
>           Fringe = + (sum(Fringe) > 0)) %>%
>    slice(1) %>%
>    select(-Utterance) %>%
>    left_join(df) %>%
>    ungroup() %>%
>    select(Utterance, Core, Fringe, ID)
> 
> ```
> 
> The dput() code is:
> 
> ```
> 
> structure(list(Utterance = c("a baby", "small", "yes", "where's his bed",
> "there's his bed", "where's his pillow", "what is that on his head",
> "hey he has his arm stuck here", "there there's it", "now you're gonna go night-night",
> "and that's the thing you can turn on", "yeah where's the music box",
> "what is this", "small", "there you go baby ", "what is this for ",
> "a ", "and the go goodnight here ", "and what is this ", " what's that sound ",
> "what does she say ", "what she say", "should I turn the on so Laura doesn't cry ",
> "what is this ", "what is that ", "where's clothes ", " where's the baby's bedroom ",
> "that might be in dad's bed+room ", "yes ", "there you go baby ",
> "you're welcome "), Core = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L), Fringe = c(0L, 0L, 0L, 1L, 1L, 1L,
> 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), ID = 1:31), row.names = c(NA,
> -31L), class = c("tbl_df", "tbl", "data.frame"))
> 
> ```
> 
> The first 10 rows of output look like this:
> 
>     Utterance                         Core  Fringe  ID
> 1   a baby                            1     0       1
> 2   small                             1     0       2
> 3   yes                               1     0       3
> 4   where's his bed                   1     1       4
> 5   there's his bed                   1     1       5
> 6   where's his pillow                1     1       6
> 7   what is that on his head          1     0       7
> 8   hey he has his arm stuck here     1     1       8
> 9   there there's it                  1     0       9
> 10  now you're gonna go night-night   1     1       10
> 
> For example, in line 1 of the output, 'a' is a core word so '1' for core is correct.  However, 'baby' should be picked up as fringe so there should be '1', not '0', for fringe. Lines 7 and 9 also have words that should be identified as fringe but are not.
> 
> Additionally, it seems like if the utterance has parts of a core word in it, it's being counted. For example, 'small' is identified as a core word even though it's not (but 'all done' is a core word). 'Where's his bed' is identified as core and fringe, although none of the words are core.
> 
> Any suggestions on what is happening and how to correct it are greatly appreciated.
> 


