---
title: "ggwordcloud: a word cloud geom for ggplot2"
author: "E. Le Pennec"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{ggwordcloud: a word cloud geom for ggplot2}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.dev = "grDevices::png",
  dpi = 96L,
  dev.args = list(),
  fig.ext = "png",
  fig.width = 700 / 96,
  fig.height = NULL,
  fig.retina = 2L,
  fig.asp = 1 / 1.618,
  fig.align = "center"
)
```

`ggwordcloud` provides a word cloud text geom for `ggplot2`. The
placement algorithm, implemented in C++, is a hybrid between the one of
`wordcloud` and the one of `wordcloud2.js`. The cloud can grow according
to a shape and stay within a mask. The size aesthetic is used either to
control the font size or the printed area of the words. `ggwordcloud`
also supports arbitrary text rotation. The faceting scheme of `ggplot2`
can also be used. Two functions meant to be the equivalent of
`wordcloud` and `wordcloud2` are provided. Last but not least, you can
use `gridtext` markdown/html syntax in the labels.

This vignette is meant as a quick tour of its features.

# Package installation

The package can be installed from CRAN by

```{r, eval=FALSE}
install.packages("ggwordcloud")
```

or the development version from the GitHub repository

```{r, eval=FALSE}
devtools::install_github("lepennec/ggwordcloud")
```

Please check the latest development version before submitting an issue.

# The love / thank you words dataset

Throughout this vignette, we will use a *lovely* dataset: a collection
of the word love in several languages combined with the number of native
speakers of those languages as well as the total number of speakers. The
data have been extracted from *Wikipedia* and are exposed in two data
frames with the following columns:

-   `iso_639_3`: the ISO 639-3 language code
-   `name`: the name of the language
-   `word`: the word love in those languages
-   `native_speakers`: the number of native speakers (in millions) of
    those languages
-   `speakers`: the corresponding total number of speakers (in millions)

The first one, `love_words`, contains 147 different languages while the
second, `love_words_small`, contains the 34 languages having more than
50 million speakers. Another dataset with *thank you* in several
languages is also available: `thankyou_words` (133 languages) and
`thankyou_words_small` (34 languages).

```{r}
library(ggwordcloud)
data("love_words_small")
data("love_words")
```

# Word cloud

The `geom_text_wordcloud` geom constructs a word cloud from a list of
words given by the `label` aesthetic:

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word)) +
  geom_text_wordcloud() +
  theme_minimal()
```

Note that we have used `theme_minimal()` to display the words and
nothing else. The word cloud is, by default, centered and the words are
placed along a spiral so that they do not overlap.

Because there is some randomness in the placement algorithm, the same
command can yield a different result when using a different random seed:

```{r}
set.seed(43)
ggplot(love_words_small, aes(label = word)) +
  geom_text_wordcloud() +
  theme_minimal()
```
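
The role of the seed can be checked without drawing anything. The sketch
below uses base R's `runif()` as a stand-in for the random draws of the
placement algorithm (an assumption made for illustration only; `draw` is
a hypothetical helper):

```r
# Stand-in for the random draws of the placement algorithm (illustration only).
draw <- function(seed) {
  set.seed(seed)
  runif(3)
}

same_seed <- identical(draw(42), draw(42)) # same seed, same draws
other_seed <- identical(draw(42), draw(43)) # different seed, different draws
print(c(same_seed = same_seed, other_seed = other_seed))
```

This is why every plotting chunk of this vignette starts with
`set.seed()`.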

# Word cloud and text size

So far all the words have had the same size because we did not specify a
size aesthetic. If we use the total number of speakers, we obtain:

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers)) +
  geom_text_wordcloud() +
  theme_minimal()
```

The words are scaled according to the value of the size aesthetic, the
number of speakers here. There are several classical choices for the
scaling: the font size could be chosen proportional to the value or to
the square root of the value so that the area of a given character is
respectively proportional to the square of the value or the value
itself. By default, `ggplot2` uses the square root scaling but does not
map a value of $0$ to $0$.
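
The two classical scalings can be compared numerically. The sketch below
is plain base R arithmetic: the values and the `max_font` constant are
made up for illustration, and the area of a character is assumed to be
proportional to the squared font size.

```r
# Illustrative values (not taken from the dataset) and an arbitrary maximum font size.
value <- c(100, 400, 900)
max_font <- 30

# Font size proportional to the value: character area grows like value^2.
font_linear <- max_font * value / max(value)

# Font size proportional to sqrt(value): character area grows like value.
font_sqrt <- max_font * sqrt(value / max(value))

# The area of a character is taken proportional to the squared font size.
comparison <- data.frame(
  value,
  font_linear,
  area_linear = font_linear^2,
  font_sqrt,
  area_sqrt = font_sqrt^2
)
print(comparison)
```

With the square root scaling, the `area_sqrt` column is proportional to
`value`, while `area_linear` grows like its square.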

In order to obtain a true proportionality (and a better font size
control), one can use the `scale_size_area()` scale:

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 30) +
  theme_minimal()
```

It turns out that both `wordcloud` and `wordcloud2` default to a linear
scaling between the value and the font size. This can be obtained with
the `scale_radius()` scale:

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers)) +
  geom_text_wordcloud() +
  scale_radius(range = c(0, 30), limits = c(0, NA)) +
  theme_minimal()
```

# Word cloud and text area

As explained before, by default, it is the font size that is
proportional to the square root of the value of the size aesthetic. This
is a natural choice for a shape, as the area of the shape will be
proportional to the raw size aesthetic, but not necessarily for texts
with different lengths. In `ggwordcloud`, there is an option,
`area_corr`, to scale the font of each label so that the text area is a
function of the raw size aesthetic when used in combination with
`scale_size_area`:

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers)) +
  geom_text_wordcloud(area_corr = TRUE) +
  scale_size_area(max_size = 50) +
  theme_minimal()
```

One can equivalently use the `geom_text_wordcloud_area` geom:

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers)) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 50) +
  theme_minimal()
```

By default, the area is proportional to the raw size aesthetic. To
better match the human area perception, one can use the `power_trans`
transformation with a factor of $1/.7$:

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers)) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 50, trans = power_trans(1/.7)) +
  theme_minimal()
```
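
One way to read the factor $1/.7$ is through a power law of perception.
The sketch below assumes that the perceived area grows like the
displayed area raised to the power $.7$; this exponent is an assumption
used for illustration, and the values are made up.

```r
# Assumption for illustration: perceived area = displayed area ^ 0.7.
perception_exponent <- 0.7
value <- c(1, 10, 100, 1000)

# Displayed area after a power transformation with exponent 1 / 0.7.
displayed_area <- value^(1 / perception_exponent)

# Under the assumption above, the perceived area is again proportional to the value.
perceived_area <- displayed_area^perception_exponent
print(data.frame(value, displayed_area, perceived_area))
```

Under this assumption, the transformation exactly compensates for the
compression applied by the eye.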

# Word cloud with too many words

The non-overlapping algorithm may fail to place some words due to a lack
of space. By default, those words are displayed at the center of the
word cloud and come with a warning.

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers)) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 80) +
  theme_minimal()
```

It is up to the user to avoid this issue by either removing some words
or changing the size scale. One can also choose to remove those words
using the `rm_outside` option:

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers)) +
  geom_text_wordcloud_area(rm_outside = TRUE) +
  scale_size_area(max_size = 80) +
  theme_minimal()
```
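
Removing words before plotting is plain data manipulation. A minimal
sketch in base R, using a made-up data frame `toy_words` rather than the
package dataset, keeps only the `n_keep` largest entries:

```r
# Made-up data for illustration; not the package dataset.
toy_words <- data.frame(
  word = c("love", "amour", "Liebe", "amor", "ai"),
  speakers = c(1132, 280, 132, 534, 128)
)

# Keep the n_keep words with the largest size value.
n_keep <- 3
kept <- head(toy_words[order(toy_words$speakers, decreasing = TRUE), ], n_keep)
print(kept)
```

The filtered data frame can then be passed to `ggplot()` in place of the
full one.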

# Word cloud and rotation

The words can be rotated by setting the `angle` aesthetic. For instance,
one can use a rotation of 90 degrees for a random subset of 40 % of the
words:

```{r}
library(dplyr, quietly = TRUE)
love_words_small <- love_words_small %>%
  mutate(angle = 90 * sample(c(0, 1), n(), replace = TRUE, prob = c(60, 40)))
```

```{r}
set.seed(42)
ggplot(love_words_small, aes(
  label = word, size = speakers,
  angle = angle
)) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 40) +
  theme_minimal()
```

`ggwordcloud` is not restricted to rotations of 90 degrees:

```{r}
love_words_small <- love_words_small %>%
  mutate(angle = 45 * sample(-2:2, n(), replace = TRUE, prob = c(1, 1, 4, 1, 1)))
```

```{r}
set.seed(42)
ggplot(love_words_small, aes(
  label = word, size = speakers,
  angle = angle
)) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 40) +
  theme_minimal()
```
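
The `prob` argument of `sample()` sets the relative weight of each
angle, and the weights need not sum to one. A quick base R check, with an
arbitrary number of 1000 draws, shows the resulting distribution of
angles:

```r
set.seed(42)
# Candidate angles are -90, -45, 0, 45 and 90 degrees; 0 has four times the weight.
angles <- 45 * sample(-2:2, 1000, replace = TRUE, prob = c(1, 1, 4, 1, 1))

# Empirical proportion of each angle among the draws.
proportions <- prop.table(table(angles))
print(round(proportions, 2))
```

With these weights, half of the words are expected to stay horizontal and
one eighth to take each of the four other angles.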

# Word cloud and eccentricity

The `ggwordcloud` algorithm moves the text around a spiral until it
finds a free space for it. This spiral has by default a vertical
eccentricity of .65, so that the spiral is 1/.65 times wider than it is
tall.

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers)) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 40) +
  theme_minimal()
```

This can be changed using the `eccentricity` parameter:

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers)) +
  geom_text_wordcloud_area(eccentricity = 1) +
  scale_size_area(max_size = 40) +
  theme_minimal()
```

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers)) +
  geom_text_wordcloud_area(eccentricity = .35) +
  scale_size_area(max_size = 40) +
  theme_minimal()
```
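
As a rough picture of what the parameter does, the sketch below builds an
Archimedean spiral whose vertical coordinate is multiplied by the
eccentricity. This is an illustration in base R under that assumption,
not the C++ code of the package; `spiral_points` and its arguments are
hypothetical.

```r
# Hypothetical helper: points of a spiral flattened vertically by `eccentricity`.
spiral_points <- function(eccentricity, n = 500, turns = 5) {
  theta <- seq(0, 2 * pi * turns, length.out = n)
  radius <- theta / max(theta)
  data.frame(
    x = radius * cos(theta),
    y = eccentricity * radius * sin(theta)
  )
}

round_spiral <- spiral_points(1)
flat_spiral <- spiral_points(0.65)

# Same horizontal extent, vertical extent multiplied by the eccentricity.
width_ratio <- diff(range(flat_spiral$x)) / diff(range(round_spiral$x))
height_ratio <- diff(range(flat_spiral$y)) / diff(range(round_spiral$y))
print(c(width_ratio = width_ratio, height_ratio = height_ratio))
```

Calling `plot(flat_spiral, type = "l", asp = 1)` draws the flattened
curve.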

# Word cloud and shape

The base shape of `ggwordcloud` is a circle: the words are placed by
following a *circle* spiral. This base shape `circle` can be changed to
others (`cardioid`, `diamond`, `square`, `triangle-forward`,
`triangle-upright`, `pentagon` or `star`) using the `shape` option.

```{r}
for (shape in c(
  "circle", "cardioid", "diamond",
  "square", "triangle-forward", "triangle-upright",
  "pentagon", "star"
)) {
  set.seed(42)
  print(ggplot(love_words_small, aes(label = word, size = speakers)) +
    geom_text_wordcloud_area(shape = shape) +
    scale_size_area(max_size = 40) +
    theme_minimal() + ggtitle(shape))
}
```

# Word cloud and color

A color can be assigned to each word using the color aesthetic. For
instance, one can assign a random factor to each word:

```{r}
set.seed(42)
ggplot(
  love_words_small,
  aes(
    label = word, size = speakers,
    color = factor(sample.int(10, nrow(love_words_small), replace = TRUE)),
    angle = angle
  )
) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 40) +
  theme_minimal()
```

One can also map the color to a value, for instance the number of
speakers, and choose the colormap with a `scale_color_*` scale:

```{r}
set.seed(42)
ggplot(
  love_words_small,
  aes(
    label = word, size = speakers,
    color = speakers, angle = angle
  )
) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 40) +
  theme_minimal() +
  scale_color_gradient(low = "darkred", high = "red")
```

# Word cloud and mask

`ggwordcloud` allows one to specify a mask within which the words should
be placed. More precisely, the non-*transparent* pixels in an image array
(or the black pixels if there is no transparency) will be used as a mask:

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers)) +
  geom_text_wordcloud_area(
    mask = png::readPNG(system.file("extdata/hearth.png",
      package = "ggwordcloud", mustWork = TRUE
    )),
    rm_outside = TRUE
  ) +
  scale_size_area(max_size = 42) +
  theme_minimal()
```
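
The rule stated above can be written down for an array such as the one
returned by `png::readPNG()`, that is, values in $[0, 1]$ stored either
as a matrix (grayscale) or as a height x width x channels array. The
helper `mask_from_array` and its $.5$ threshold are assumptions made for
illustration; the package may proceed differently internally.

```r
# Hypothetical helper: TRUE where a word may be placed.
mask_from_array <- function(img, threshold = 0.5) {
  if (length(dim(img)) == 2) {
    # Grayscale matrix without transparency: dark pixels form the mask.
    return(img < threshold)
  }
  n_channels <- dim(img)[3]
  if (n_channels %in% c(2, 4)) {
    # The last channel is alpha: non-transparent pixels form the mask.
    return(img[, , n_channels] > threshold)
  }
  # RGB without transparency: pixels dark in every channel form the mask.
  apply(img < threshold, c(1, 2), all)
}

# Toy 2 x 2 RGBA image in which only the top-left pixel is opaque.
toy <- array(0, dim = c(2, 2, 4))
toy[1, 1, 4] <- 1
toy_mask <- mask_from_array(toy)
print(toy_mask)
```

In the toy image, only the top-left pixel belongs to the mask.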

# Word cloud with almost everything

We are now ready to make a *lovely* word cloud:

```{r}
love_words <- love_words %>%
  mutate(angle = 45 * sample(-2:2, n(), replace = TRUE, prob = c(1, 1, 4, 1, 1)))
```

```{r warning=FALSE}
set.seed(42)
ggplot(
  love_words,
  aes(
    label = word, size = speakers,
    color = speakers, angle = angle
  )
) +
  geom_text_wordcloud_area(
    mask = png::readPNG(system.file("extdata/hearth.png",
      package = "ggwordcloud", mustWork = TRUE
    )),
    rm_outside = TRUE
  ) +
  scale_size_area(max_size = 40) +
  theme_minimal() +
  scale_color_gradient(low = "darkred", high = "red")
```

# Modified label content and markdown/html syntax

With the `label_content` aesthetic, one can specify a different label
content than the one used to compute the size. Note that this is
equivalent to replacing `label` when not using the text area option.

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers,
                             label_content = sprintf("%s (%g)", word, speakers))) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 30) +
  theme_minimal()
```

We can combine this with the markdown/html syntax of `gridtext` to
obtain a nicer result:

```{r}
set.seed(42)
ggplot(love_words_small, aes(label = word, size = speakers,
                             label_content = sprintf("%s<span style='font-size:7.5pt'>(%g)</span>", word, speakers))) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 40) +
  theme_minimal()
```
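
To see which strings are passed to `label_content` in the two examples
above, one can evaluate the `sprintf()` calls on their own. The words and
counts below are made up for illustration:

```r
# Made-up words and speaker counts for illustration.
word <- c("love", "amour")
speakers <- c(1132, 76.8)

# Plain text label: the word followed by the count in parentheses.
plain_label <- sprintf("%s (%g)", word, speakers)

# Same content with the count wrapped in a span using a smaller font size.
html_label <- sprintf("%s<span style='font-size:7.5pt'>(%g)</span>", word, speakers)

print(plain_label)
print(html_label)
```

The `%g` format drops trailing zeros, so `1132` is printed without a
decimal part.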

# Advanced features

`geom_text_wordcloud` is compatible with the facet system of `ggplot2`.
For instance, one can easily display two word clouds for the speakers
and the native speakers with the same scales:

```{r}
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
library(tidyr, quietly = TRUE)
love_words_small_l <- love_words_small %>%
  gather(key = "type", value = "speakers", -name, -word, -angle, -iso_639_3) %>%
  arrange(desc(speakers))
```

```{r}
set.seed(42)
ggplot(
  love_words_small_l,
  aes(label = word, size = speakers)
) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 30) +
  theme_minimal() +
  facet_wrap(~type)
```

One can also specify an original position for each label that will be
used as the starting point of the spiral algorithm for this label:

```{r}
set.seed(42)
ggplot(
  love_words_small_l,
  aes(
    label = word, size = speakers,
    x = type, color = type
  )
) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 30) +
  scale_x_discrete(breaks = NULL) +
  theme_minimal()
```

Finally, there is an `angle_group` option that can be used to restrict
the words to appear only in an angular sector depending on their
`angle_group`. For instance, we will visualize the changes of
proportions of each language between the speakers and the native
speakers by displaying the words above the horizontal line if the
proportion is greater than in the other category and below otherwise.

```{r}
love_words_small_l <- love_words_small_l %>%
  group_by(type) %>%
  mutate(prop = speakers / sum(speakers)) %>%
  group_by(name, word) %>%
  mutate(propdelta = (prop - mean(prop)) / sqrt(mean(prop)))
```

```{r}
set.seed(42)
ggplot(
  love_words_small_l,
  aes(
    label = word, size = abs(propdelta),
    color = propdelta < 0, angle_group = propdelta < 0
  )
) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 30) +
  theme_minimal() +
  facet_wrap(~type)
```
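
The `propdelta` computation used above can be followed on a tiny example.
The sketch below redoes it in base R with `ave()` on made-up counts for
two words; the data frame `toy` is hypothetical.

```r
# Made-up counts (in millions) for two words; not the package dataset.
toy <- data.frame(
  word = rep(c("love", "amour"), times = 2),
  type = rep(c("speakers", "native_speakers"), each = 2),
  speakers = c(900, 100, 300, 100)
)

# Share of each word within its category.
toy$prop <- toy$speakers / ave(toy$speakers, toy$type, FUN = sum)

# Deviation of each share from the word's mean share, scaled by its square root.
mean_prop <- ave(toy$prop, toy$word, FUN = mean)
toy$propdelta <- (toy$prop - mean_prop) / sqrt(mean_prop)
print(toy)
```

With two categories, the two `propdelta` values of a word have opposite
signs, so each word falls in one `angle_group` in one facet and in the
other group in the other facet.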

# `ggwordcloud` as an approximate replacement for `wordcloud` and `wordcloud2`

`ggwordcloud` and `ggwordcloud2` are two approximate replacements for
`wordcloud` and `wordcloud2`, respectively. They provide a syntax similar
to that of the original functions and yield similar word clouds, but not
all the options of the original functions are implemented. Note that both
use a font size proportional to the raw size aesthetic rather than its
square root.

```{r}
set.seed(42)
ggwordcloud(love_words_small$word, love_words_small$speakers)
```

```{r}
set.seed(42)
ggwordcloud2(love_words_small[, c("word", "speakers")], size = 2.5)
```