abbreviate {base} | R Documentation |
Abbreviate strings to at least minlength
characters,
such that they remain unique (if they were),
unless strict = TRUE
.
abbreviate(names.arg, minlength = 4, use.classes = TRUE,
dot = FALSE, strict = FALSE,
method = c("left.kept", "both.sides"), named = TRUE)
names.arg |
a character vector of names to be abbreviated, or an
object to be coerced to a character vector by |
minlength |
the minimum length of the abbreviations. |
use.classes |
logical: should lowercase characters be removed first? |
dot |
logical: should a dot ( |
strict |
logical: should |
method |
a character string specifying the method used with default
|
named |
logical: should |
The default algorithm (method = "left.kept"
) used is similar
to that of S. For a single string it works as follows.
First spaces at the ends of the string are stripped.
Then (if necessary) any other spaces are stripped.
Next, lower case vowels are removed followed by lower case consonants.
Finally if the abbreviation is still longer than minlength
upper case letters and symbols are stripped.
Characters are always stripped from the end of the strings first. If
an element of names.arg
contains more than one word (words are
separated by spaces) then at least one letter from each word will be
retained.
Missing (NA
) values are unaltered.
If use.classes
is FALSE
then the only distinction is to
be between letters and space.
A character vector containing abbreviations for the character strings
in its first argument. Duplicates in the original names.arg
will be given identical abbreviations. If any non-duplicated elements
have the same minlength
abbreviations then, if method =
"both.sides"
the basic internal abbreviate()
algorithm is
applied to the characterwise reversed strings; if there are
still duplicated abbreviations and if strict = FALSE
as by
default, minlength
is incremented by one and new abbreviations
are found for those elements only. This process is repeated until all
unique elements of names.arg
have unique abbreviations.
If names
is true, the character version of names.arg
is
attached to the returned value as a names
attribute: no
other attributes are retained.
If a input element contains non-ASCII characters, the corresponding
value will be in UTF-8 and marked as such (see Encoding
).
If use.classes
is true (the default), this is really only
suitable for English, and prior to R 3.3.0 did not work correctly
with non-ASCII characters in multibyte locales. It will warn if used
with non-ASCII characters (and required to reduce the length). It is
unlikely to work well with inputs not in the Unicode Basic Multilingual
Plane nor on (rare) platforms where wide characters are not encoded in
Unicode.
As from R 3.3.0 the concept of ‘vowel’ is extended from English vowels by including characters which are accented versions of lower-case English vowels (including ‘o with stroke’). Of course, there are languages (even Western European languages such as Welsh) with other vowels.
x <- c("abcd", "efgh", "abce")
abbreviate(x, 2)
abbreviate(x, 2, strict = TRUE) # >> 1st and 3rd are == "ab"
(st.abb <- abbreviate(state.name, 2))
stopifnot(identical(unname(st.abb),
abbreviate(state.name, 2, named=FALSE)))
table(nchar(st.abb)) # out of 50, 3 need 4 letters :
as <- abbreviate(state.name, 3, strict = TRUE)
as[which(as == "Mss")]
## and without distinguishing vowels:
st.abb2 <- abbreviate(state.name, 2, FALSE)
cbind(st.abb, st.abb2)[st.abb2 != st.abb, ]
## method = "both.sides" helps: no 4-letters, and only 4 3-letters:
st.ab2 <- abbreviate(state.name, 2, method = "both")
table(nchar(st.ab2))
## Compare the two methods:
cbind(st.abb, st.ab2)