Type: Package
Title: Morpheme Tokenization
Version: 1.2.3
Description: Tokenize text into morphemes. The morphemepiece algorithm uses a lookup table to determine the morpheme breakdown of words, and falls back on a modified wordpiece tokenization algorithm for words not found in the lookup table.
URL: https://github.com/macmillancontentscience/morphemepiece
BugReports: https://github.com/macmillancontentscience/morphemepiece/issues
License: Apache License (≥ 2)
Encoding: UTF-8
RoxygenNote: 7.1.2
Imports: dlr (≥ 1.0.0), fastmatch, magrittr, memoise (≥ 2.0.0), morphemepiece.data, piecemaker (≥ 1.0.0), purrr (≥ 0.3.4), readr, rlang, stringr (≥ 1.4.0)
Suggests: dplyr, fs, ggplot2, here, knitr, remotes, rmarkdown, testthat (≥ 3.0.0), utils
VignetteBuilder: knitr
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2022-04-16 13:57:47 UTC; jonathan.bratt
Author: Jonathan Bratt [aut, cre], Jon Harmon [aut], Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph]
Maintainer: Jonathan Bratt <jonathan.bratt@macmillan.com>
Repository: CRAN
Date/Publication: 2022-04-16 14:12:29 UTC

morphemepiece: Morpheme Tokenization

Description

Tokenize words into morphemes (the smallest unit of meaning).


Determine Vocabulary Casedness

Description

Determine whether or not a wordpiece vocabulary is case-sensitive.

Usage

.infer_case_from_vocab(vocab)

Arguments

vocab

The vocabulary as a character vector.

Details

If none of the tokens in the vocabulary start with a capital letter, the vocabulary is assumed to be uncased. Note that tokens like "[CLS]" contain uppercase letters, but don't start with uppercase letters.

Value

TRUE if the vocabulary is cased, FALSE if uncased.
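
For illustration, the heuristic can be sketched in a few lines of base R (a standalone sketch, not the package's internal code):

# A vocabulary is treated as cased if any token begins with an
# uppercase letter. "[CLS]" contains uppercase letters but does not
# start with one, so it does not make the vocabulary cased.
vocab <- c("[CLS]", "[SEP]", "the", "##ing")
any(grepl("^[A-Z]", vocab))
#> [1] FALSE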


Tokenize an Input Word-by-word

Description

Tokenize an Input Word-by-word

Usage

.mp_tokenize_single_string(words, vocab, lookup, unk_token, max_chars)

Arguments

words

Character; a vector of words (generated by space-tokenizing a single input).

vocab

A morphemepiece vocabulary.

lookup

A morphemepiece lookup table.

unk_token

Token to represent unknown words.

max_chars

Maximum length of word recognized.

Value

A named integer vector of tokenized words.
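
Conceptually, the function maps a per-word tokenizer over the vector and flattens the results. A rough standalone sketch, where tokenize_one_word is a hypothetical stand-in for the per-word lookup tokenizer:

# Each per-word call returns a named integer vector; concatenating
# the results yields one named integer vector for the whole input.
tokenize_words <- function(words, tokenize_one_word) {
  do.call(c, lapply(words, tokenize_one_word))
}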


Tokenize a Word

Description

Tokenize a single "word" (no whitespace). The word can technically contain punctuation, but typically punctuation has been split off by this point.

Usage

.mp_tokenize_word(
  word,
  vocab_split,
  dir = 1,
  allow_compounds = TRUE,
  unk_token = "[UNK]",
  max_chars = 100
)

Arguments

word

Word to tokenize.

vocab_split

List of character vectors containing vocabulary words. Should have components named "prefixes", "words", "suffixes".

dir

Integer; if 1 (the default), look for tokens starting at the beginning of the word. Otherwise, start at the end.

allow_compounds

Logical; whether to allow multiple whole words in the breakdown.

unk_token

Token to represent unknown words.

max_chars

Maximum length of word recognized.

Details

This is an adaptation of wordpiece:::.tokenize_word. The main difference is that it is designed to work with a morphemepiece vocabulary, which can include prefixes (denoted like "pre##"). As in wordpiece, the algorithm uses a repeated greedy search for the largest piece from the vocabulary found within the word, but here the search can start from either the beginning or the end of the word (controlled by the dir parameter). The input vocabulary must be split into prefixes, suffixes, and "words".

Value

Input word as a list of tokens.
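
A minimal sketch of the greedy longest-match idea described above, matching from the start of the word against a flat word list (the real function additionally handles prefixes, suffixes, compounds, and the dir parameter):

greedy_tokenize <- function(word, words, unk_token = "[UNK]") {
  pieces <- character(0)
  remaining <- word
  while (nchar(remaining) > 0) {
    # Try the longest remaining prefix first, shrinking on failure.
    len <- nchar(remaining)
    while (len > 0 && !(substr(remaining, 1, len) %in% words)) {
      len <- len - 1
    }
    if (len == 0) {
      return(unk_token)  # no vocabulary piece fits; word is unknown
    }
    pieces <- c(pieces, substr(remaining, 1, len))
    remaining <- substr(remaining, len + 1, nchar(remaining))
  }
  pieces
}

greedy_tokenize("unhappy", c("un", "happy", "hap"))
#> [1] "un"    "happy"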


Tokenize a Word Bidirectionally

Description

Apply .mp_tokenize_word from both directions and pick the result with fewer pieces.

Usage

.mp_tokenize_word_bidir(
  word,
  vocab_split,
  unk_token,
  max_chars,
  allow_compounds = TRUE
)

Arguments

word

Character scalar; word to tokenize.

vocab_split

List of character vectors containing vocabulary words. Should have components named "prefixes", "words", "suffixes".

unk_token

Token to represent unknown words.

max_chars

Maximum length of word recognized.

allow_compounds

Logical; whether to allow multiple whole words in the breakdown. Default is TRUE. This option is not exposed to end users; it is kept here for documentation and development purposes.

Value

Input word as a list of tokens.
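
The selection rule itself is short; a hedged sketch, where tokenize_dir stands in for a directional tokenizer like .mp_tokenize_word:

tokenize_bidir <- function(word, tokenize_dir) {
  forward  <- tokenize_dir(word, dir = 1)   # match from the start
  backward <- tokenize_dir(word, dir = -1)  # match from the end
  # Prefer the breakdown with fewer pieces; in this sketch,
  # ties go to the forward result.
  if (length(backward) < length(forward)) backward else forward
}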


Tokenize a Word Including Lookup

Description

Look the word up in the lookup table; if it is not found there, fall back to the modified wordpiece tokenization algorithm.

Usage

.mp_tokenize_word_lookup(word, vocab, lookup, unk_token, max_chars)

Arguments

word

Character scalar; word to tokenize.

vocab

A morphemepiece vocabulary.

lookup

A morphemepiece lookup table.

unk_token

Token to represent unknown words.

max_chars

Maximum length of word recognized.

Value

Input word, broken into tokens.
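
The lookup-first pattern can be sketched as follows (a standalone sketch; tokenize_fallback stands in for the greedy fallback algorithm):

# `lookup` is a named object: names are whole words, values are
# their precomputed morpheme breakdowns.
tokenize_with_lookup <- function(word, lookup, tokenize_fallback) {
  if (word %in% names(lookup)) {
    lookup[[word]]          # precomputed breakdown
  } else {
    tokenize_fallback(word) # fall back to the greedy algorithm
  }
}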


Constructor for Class morphemepiece_vocabulary

Description

Constructor for Class morphemepiece_vocabulary

Usage

.new_morphemepiece_vocabulary(vocab, vocab_split, is_cased)

Arguments

vocab

Character vector; the "actual" vocabulary.

vocab_split

List of character vectors; the split vocabulary.

is_cased

Logical; whether the vocabulary is cased.

Value

The vocabulary with is_cased attached as an attribute, and the class morphemepiece_vocabulary applied. The split vocabulary is also attached as an attribute.
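
A constructor of this shape typically attaches everything with structure(); a hedged sketch, not necessarily the package's exact internals:

new_morphemepiece_vocabulary <- function(vocab, vocab_split, is_cased) {
  structure(
    vocab,
    vocab_split = vocab_split,
    is_cased = is_cased,
    class = c("morphemepiece_vocabulary", "character")
  )
}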


Process a Morphemepiece Vocabulary for Tokenization

Description

Process a Morphemepiece Vocabulary for Tokenization

Usage

.process_mp_vocab(v)

## Default S3 method:
.process_mp_vocab(v)

## S3 method for class 'morphemepiece_vocabulary'
.process_mp_vocab(v)

## S3 method for class 'integer'
.process_mp_vocab(v)

## S3 method for class 'character'
.process_mp_vocab(v)

Arguments

v

An object of class morphemepiece_vocabulary (methods are also provided for plain character and integer vectors).

Value

A character vector of tokens for tokenization.


Validator for Objects of Class morphemepiece_vocabulary

Description

Validator for Objects of Class morphemepiece_vocabulary

Usage

.validate_morphemepiece_vocabulary(vocab)

Arguments

vocab

A morphemepiece_vocabulary object to validate.

Value

The vocab object if it passes the checks; otherwise, aborts with an informative message.


Load a morphemepiece lookup file

Description

Usually you will want to use the included lookup that can be accessed via morphemepiece_lookup(). This function can be used to load a different lookup from a file.

Usage

load_lookup(lookup_file)

Arguments

lookup_file

Path to the lookup file. The file is assumed to be a text file, with one word per line. The lookup value, if different from the word, follows the word on the same line, after a space.

Value

The lookup as a named list; the names are the words in the lookup.
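
Given that format, a minimal standalone parser might look like this (load_lookup() itself uses readr and may differ in details):

parse_lookup <- function(lookup_file) {
  lines <- readLines(lookup_file)
  # The word is everything before the first space; the value (if any)
  # is everything after it. Words with no value map to themselves.
  word  <- sub(" .*$", "", lines)
  value <- ifelse(grepl(" ", lines), sub("^[^ ]+ ", "", lines), word)
  names(value) <- word
  value
}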


Load a lookup file, or retrieve from cache

Description

Usually you will want to use the included lookup that can be accessed via morphemepiece_lookup(). This function can be used to load (and cache) a different lookup from a file.

Usage

load_or_retrieve_lookup(lookup_file)

Arguments

lookup_file

Path to the lookup file. The file is assumed to be a text file, with one word per line. The lookup value, if different from the word, follows the word on the same line, after a space.

Value

The lookup table as a named character vector.
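
Caching a loader like this is commonly done with memoise (an Import of this package); a hedged sketch, not necessarily how morphemepiece wires its caching internally:

library(memoise)
# Wrap the loader so repeated loads of the same file hit a disk
# cache. The cache location here is illustrative.
cached_load_lookup <- memoise::memoise(
  morphemepiece::load_lookup,
  cache = cachem::cache_disk(tempdir())
)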


Load a vocabulary file, or retrieve from cache

Description

Usually you will want to use the included vocabulary that can be accessed via morphemepiece_vocab(). This function can be used to load (and cache) a different vocabulary from a file.

Usage

load_or_retrieve_vocab(vocab_file)

Arguments

vocab_file

Path to the vocabulary file. The file is assumed to be a text file, with one token per line; the line number (starting at zero) corresponds to the index of that token in the vocabulary.

Value

The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.

Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.


Load a vocabulary file

Description

Usually you will want to use the included vocabulary that can be accessed via morphemepiece_vocab(). This function can be used to load a different vocabulary from a file.

Usage

load_vocab(vocab_file)

Arguments

vocab_file

Path to the vocabulary file. The file is assumed to be a text file, with one token per line; the line number (starting at zero) corresponds to the index of that token in the vocabulary.

Value

The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.

Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.
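
The zero-based indexing convention can be illustrated with a short standalone sketch (load_vocab() itself also infers casedness and attaches attributes):

read_vocab_indices <- function(vocab_file) {
  tokens <- readLines(vocab_file)
  # R vectors are 1-based, so the token on line i has vocabulary
  # index i - 1.
  setNames(seq_along(tokens) - 1L, tokens)
}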


Retrieve Directory for Morphemepiece Cache

Description

The morphemepiece cache directory is a platform- and user-specific path where morphemepiece saves caches (such as a downloaded lookup). You can override the default location by setting the MORPHEMEPIECE_CACHE_DIR environment variable or, for the current session, by calling set_morphemepiece_cache_dir().

Usage

morphemepiece_cache_dir()

Value

A character vector with the normalized path to the cache.


Tokenize Sequence with Morpheme Pieces

Description

Given a single sequence of text and a morphemepiece vocabulary, tokenizes the text.

Usage

morphemepiece_tokenize(
  text,
  vocab = morphemepiece_vocab(),
  lookup = morphemepiece_lookup(),
  unk_token = "[UNK]",
  max_chars = 100
)

Arguments

text

Character scalar; text to tokenize.

vocab

A morphemepiece vocabulary.

lookup

A morphemepiece lookup table.

unk_token

Token to represent unknown words.

max_chars

Maximum length of word recognized.

Value

A character vector of tokenized text (later, this should be a named integer vector, as in the wordpiece package).
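
A typical call uses the packaged vocabulary and lookup from morphemepiece.data; the exact pieces returned depend on the installed vocabulary, so no output is shown here:

morphemepiece_tokenize(
  "Morphemepiece tokenizes words into meaningful pieces.",
  vocab = morphemepiece_vocab(),
  lookup = morphemepiece_lookup()
)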


Format a Token List as a Vocabulary

Description

We use a character vector with class morphemepiece_vocabulary to provide information about tokens used in morphemepiece_tokenize. This function takes a character vector of tokens and puts it into that format.

Usage

prepare_vocab(token_list)

Arguments

token_list

A character vector of tokens.

Value

The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.

Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.

Examples

my_vocab <- prepare_vocab(c("some", "example", "tokens"))
class(my_vocab)
attr(my_vocab, "is_cased")

Objects exported from other packages

Description

These objects are imported from other packages. Follow the links below to see their documentation.

fastmatch

%fin%

magrittr

%>%

morphemepiece.data

morphemepiece_lookup, morphemepiece_vocab

rlang

%||%, .data


Set a Cache Directory for Morphemepiece

Description

Use this function to override the cache path used by morphemepiece for the current session. Set the MORPHEMEPIECE_CACHE_DIR environment variable for a more permanent change.

Usage

set_morphemepiece_cache_dir(cache_dir = NULL)

Arguments

cache_dir

Character scalar; a path to a cache directory.

Value

A normalized path to a cache directory. The directory is created if the user has write access and the directory does not exist.
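
For example (the path shown is illustrative):

# Override for the current session:
set_morphemepiece_cache_dir(file.path(tempdir(), "morphemepiece"))

# For a more permanent change, set the environment variable instead
# (e.g., in your .Renviron file):
# MORPHEMEPIECE_CACHE_DIR=/path/to/cache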