--- title: "Quick Start Guide" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Quick Start Guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} knitr::opts_chunk$set(collapse = FALSE, comment = "##") ``` # Installing the package Since **quanteda** is available on [CRAN](https://CRAN.R-project.org/package=quanteda), you can install by using your GUI's R package installer, or execute: ```{r, eval = FALSE} install.packages("quanteda") ``` See an instructions at https://github.com/quanteda/quanteda to install the (development) GitHub version. ## Additional recommended packages: The following packages contain modularised functions that were formerly part of **quanteda**, and we recommend that you always install them along with **quanteda**: * [**quanteda.textmodels**](https://github.com/quanteda/quanteda.textmodels): Functions for scaling and classifying textual data. * [**quanteda.textstats**](https://github.com/quanteda/quanteda.textstats): Statistics for textual data. * [**quanteda.textplots**](https://github.com/quanteda/quanteda.textplots): Statistics for textual data. The following packages work well with or extend **quanteda** and we recommend that you also install them: * [**readtext**](https://github.com/quanteda/readtext): An easy way to read text data into R, from almost any input format. * [**spacyr**](https://github.com/quanteda/spacyr): NLP using the [spaCy](https://spacy.io) library, including part-of-speech tagging, entity recognition, and dependency parsing. * [**quanteda.corpora**](https://github.com/quanteda/quanteda.corpora): Additional textual data for use with **quanteda**. ```{r eval = FALSE} remotes::install_github("quanteda/quanteda.corpora") ``` * [**quanteda.dictionaries**](https://github.com/kbenoit/quanteda.dictionaries): Various dictionaries for use with **quanteda**, including the function `liwcalike()`, an R implementation of the [Linguistic Inquiry and Word Count](https://www.liwc.app) approach to text analysis. ```{r eval = FALSE} remotes::install_github("kbenoit/quanteda.dictionaries") ``` # Creating a Corpus You load the package to access to functions and data in the package. ```{r, message = FALSE} library("quanteda") ``` ```{r include=FALSE} quanteda_options(threads = 1) ``` ## Currently available corpus sources **quanteda** has a simple and powerful companion package for loading texts: [**readtext**](https://github.com/quanteda/readtext). The main function in this package, `readtext()`, takes a file or fileset from disk or a URL, and returns a type of data.frame that can be used directly with the `corpus()` constructor function, to create a **quanteda** corpus object. `readtext()` works on: * text (`.txt`) files; * comma-separated-value (`.csv`) files; * XML formatted data; * data from the Facebook API, in JSON format; * data from the Twitter API, in JSON format; and * generic JSON data. The corpus constructor command `corpus()` works directly on: * a vector of character objects, for instance that you have already loaded into the workspace using other tools; * a `VCorpus` corpus object from the **tm** package. * a data.frame containing a text column and any other document-level metadata. ### Building a corpus from a character vector The simplest case is to create a corpus from a vector of texts already in memory in R. This gives the advanced R user complete flexibility with his or her choice of text inputs, as there are almost endless ways to get a vector of texts into R. 
### Building a corpus from a character vector

The simplest case is to create a corpus from a vector of texts already in memory in R.  This gives the advanced R user complete flexibility in the choice of text inputs, as there are almost endless ways to get a vector of texts into R.

If we already have the texts in this form, we can call the corpus constructor function directly.  We can demonstrate this on the built-in character object of the texts about immigration policy extracted from the 2010 election manifestos of the UK political parties (called `data_char_ukimmig2010`).

```{r}
corp_uk <- corpus(data_char_ukimmig2010)  # build a new corpus from the texts
summary(corp_uk)
```

If we wanted, we could add some document-level variables -- what quanteda calls *docvars* -- to this corpus.  We can do this using R's `names()` function to get the names of the character vector `data_char_ukimmig2010`, and assign these to a document variable (*docvar*).

```{r}
docvars(corp_uk, "Party") <- names(data_char_ukimmig2010)
docvars(corp_uk, "Year") <- 2010
summary(corp_uk)
```

### Loading in files using the readtext package

```{r, eval=FALSE}
require(readtext)

# Twitter json
dat_json <- readtext("social_media/zombies/tweets.json")
corp_twitter <- corpus(dat_json)
summary(corp_twitter, 5)

# generic json - needs a text_field specifier
dat_sotu <- readtext("corpora/sotu/sotu.json", text_field = "text")
summary(corpus(dat_sotu), 5)

# text file
dat_txtone <- readtext("corpora/project_gutenberg/pg2701.txt")
summary(corpus(dat_txtone), 5)

# multiple text files
dat_txtmultiple1 <- readtext("corpora/inaugural/*.txt")
summary(corpus(dat_txtmultiple1), 5)

# multiple text files with docvars from filenames
dat_txtmultiple2 <- readtext("corpora/inaugural/*.txt",
                             docvarsfrom = "filenames",
                             dvsep = "-",
                             docvarnames = c("Year", "President"))
summary(corpus(dat_txtmultiple2), 5)

# XML data
dat_xml <- readtext("xmlData/plant_catalog.xml", text_field = "COMMON")
summary(corpus(dat_xml), 5)

# csv file
write.csv(data.frame(inaug_speech = as.character(data_corpus_inaugural),
                     docvars(data_corpus_inaugural)),
          file = "/tmp/inaug_texts.csv", row.names = FALSE)
dat_csv <- readtext("/tmp/inaug_texts.csv", text_field = "inaug_speech")
summary(corpus(dat_csv), 5)
```

## Working with a quanteda corpus

### Corpus principles

A corpus is designed to be a "library" of original documents that have been converted to plain, UTF-8 encoded text, and stored along with meta-data at the corpus level and at the document level.  We have a special name for document-level meta-data: *docvars*.  These are variables or features that describe attributes of each document.

A corpus is designed to be a more or less static container of texts with respect to processing and analysis.  This means that the texts in a corpus are not designed to be changed internally through (for example) cleaning or pre-processing steps, such as stemming or removing punctuation.  Rather, texts can be extracted from the corpus as part of processing, and assigned to new objects, but the idea is that the corpus will remain an original reference copy, so that other analyses -- for instance those in which stems and punctuation are required, such as computing a reading ease index -- can be performed on the same corpus.

A corpus is a special form of character vector, meaning that most functions that work with a character input will also work on a corpus.  But a corpus object (like other **quanteda** core objects) has its own convenient print method.

```{r}
print(data_corpus_inaugural)
```

To coerce a corpus to a plain character type, stripping its special attributes, use `as.character()`.

```{r}
as.character(data_corpus_inaugural)[2]
```

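A couple of simple accessor functions are also handy at this point -- `ndoc()` returns the number of documents, and `docnames()` returns the document names.  A quick sketch using the built-in inaugural corpus:

```{r}
ndoc(data_corpus_inaugural)            # number of documents in the corpus
head(docnames(data_corpus_inaugural))  # names of the documents
```
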
To summarize the texts from a corpus, we can call a `summary()` method defined for a corpus.

```{r}
summary(data_corpus_inaugural, n = 5)
```

We can save the output from the summary command as a data frame, and plot some basic descriptive statistics with this information:

```{r, fig.width = 8}
tokeninfo <- summary(data_corpus_inaugural)
tokeninfo$Year <- docvars(data_corpus_inaugural, "Year")
with(tokeninfo, plot(Year, Tokens, type = "b", pch = 19, cex = .7))
```

```{r}
# longest inaugural address: William Henry Harrison
tokeninfo[which.max(tokeninfo$Tokens), ]
```

## Tools for handling corpus objects

### Adding two corpus objects together

The `+` operator provides a simple method for concatenating two corpus objects.  If they contain different sets of document-level variables, these will be stitched together in a fashion that guarantees that no information is lost.  Corpus-level meta-data is also concatenated.

```{r}
corp1 <- head(data_corpus_inaugural, 2)
corp2 <- tail(data_corpus_inaugural, 2)
corp3 <- corp1 + corp2
summary(corp3)
```

### Subsetting corpus objects

The `corpus_subset()` function extracts a new corpus based on logical conditions applied to *docvars*:

```{r}
summary(corpus_subset(data_corpus_inaugural, Year > 1990))
summary(corpus_subset(data_corpus_inaugural, President == "Adams"))
```

## Exploring corpus texts

The `kwic()` function (keywords-in-context) performs a search for a word and allows us to view the contexts in which it occurs:

```{r}
data_tokens_inaugural <- tokens(data_corpus_inaugural)
kwic(data_tokens_inaugural, pattern = "terror")
```

Patterns in **quanteda** can take the form of "glob" patterns (the default), regular expressions, or fixed (literal) matches, set through the `valuetype` argument.

```{r}
kwic(data_tokens_inaugural, pattern = "terror", valuetype = "regex")
```

```{r}
kwic(data_tokens_inaugural, pattern = "communist*")
```

Using `phrase()` we can also look up multi-word expressions.

```{r}
# show context of the first six occurrences of "United States"
kwic(data_tokens_inaugural, pattern = phrase("United States")) |> head()
```

In the corpus summaries shown earlier, `Year` and `President` are variables associated with each document.  We can access such variables with the `docvars()` function.

```{r}
# inspect the document-level variables
head(docvars(data_corpus_inaugural))
```

More corpora are available from the [quanteda.corpora](https://github.com/quanteda/quanteda.corpora) package.

# Tokenizing texts

To simply tokenize a text, quanteda provides a powerful command called `tokens()`.  This produces an intermediate object, consisting of a list of tokens in the form of character vectors, where each element of the list corresponds to an input document.

`tokens()` is deliberately conservative, meaning that it does not remove anything from the text unless told to do so.

```{r}
txt <- c(text1 = "This is $10 in 999 different ways,\n up and down; left and right!",
         text2 = "@koheiw7 working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokens(txt)
tokens(txt, remove_numbers = TRUE,  remove_punct = TRUE)
tokens(txt, remove_numbers = FALSE, remove_punct = TRUE)
tokens(txt, remove_numbers = TRUE,  remove_punct = FALSE)
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE)
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE, remove_separators = FALSE)
```

We also have the option to tokenize characters:

```{r}
tokens("Great website: http://textasdata.com?page=123.", what = "character")
tokens("Great website: http://textasdata.com?page=123.", what = "character",
       remove_separators = FALSE)
```

and sentences:

```{r}
# sentence level
tokens(c("Kurt Vonnegut said; only assholes use semi-colons.",
         "Today is Thursday in Canberra: It is yesterday in London.",
         "En el caso de que no puedas ir con ellos, ¿quieres ir con nosotros?"),
       what = "sentence")
```

### "Pre-processing" tokens

A common step at the tokenisation stage is to apply certain transformations to the text, such as removing punctuation, numbers, or symbols, removing "stopwords", or removing URLs.  Other options might involve rules for how to treat hyphenated words (splitting versus preserving them), or whether and how to treat special characters such as those found in social media (hashtags starting with "#" or usernames starting with "@").

The approach taken in **quanteda** to what is commonly known as "pre-processing" the texts involves three core principles.

1.  No transformations should be applied directly to the corpus object (except for cleaning out non-textual elements or correcting errors).  The corpus should remain a complete representation of the original documents.

2.  Tokenisation through the default **quanteda** tokeniser involves only two forms of token manipulation: removals and splits.  The tokeniser is very conservative, removing only separators by default.

    a.  _Removals_ take the form of `remove_*` arguments -- e.g., `remove_punct` for removing punctuation characters -- that remove classes of characters or tokens, such as URLs.  With the sole exception of `remove_separators = TRUE`, all of these removals are `FALSE` by default.

    b.  _Splits_ take the form of two arguments, `split_hyphens` and `split_tags`, both `FALSE` by default.  "Tags" here refers to social media hashtags and usernames, and the argument controls whether these are broken up or preserved as single tokens.

3.  Other transformations, including (word-based) pattern removals, case transformations, n-gram formation, chunking, etc., all take place through additional functions, such as `tokens_tolower()` to lower-case tokens or `tokens_remove()` to remove patterns such as stopwords.

Pre-defined stopwords are available for numerous languages, accessed through the `stopwords()` function (which re-exports the [stopwords](https://github.com/quanteda/stopwords) package function `stopwords()`):

```{r}
head(stopwords("en"), 20)
head(stopwords("ru"), 10)
head(stopwords("ar", source = "misc"), 10)
```

### Splitting and compounding tokens

With `tokens_compound()`, we can concatenate multi-word expressions and keep them as a single feature in subsequent analyses:

```{r}
tokens("New York City is located in the United States.") |>
  tokens_compound(pattern = phrase(c("New York City", "United States")))
```

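Because each of these functions takes and returns a tokens object, they chain naturally with the pipe.  The following is only a sketch -- the input sentence is invented for illustration, and which steps are appropriate depends on the analysis -- showing punctuation removal, compounding of a multi-word expression, lower-casing, and stopword removal:

```{r}
# a small, illustrative pre-processing pipeline
tokens("The United States welcomes you, and so do we!", remove_punct = TRUE) |>
  tokens_compound(pattern = phrase("United States")) |>  # keep "United States" as one token
  tokens_tolower() |>                                    # lower-case all tokens
  tokens_remove(stopwords("en"))                         # drop English stopwords
```

Note that the order matters: compounding before lower-casing and stopword removal ensures the multi-word expression is preserved intact.
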
```{r} tokens("one~two~three") |> tokens_split(separator = "~") ``` # Constructing a document-feature matrix In order to perform statistical analysis such as document scaling, we must extract a matrix associating values for certain features with each document. In quanteda, we use the `dfm()` function to produce such a matrix. "dfm" is short for *document-feature matrix*, and always refers to documents in rows and "features" as columns. We fix this dimensional orientation because it is standard in data analysis to have a unit of analysis as a row, and features or variables pertaining to each unit as columns. We call them "features" rather than terms, because features are more general than terms: they can be defined as raw terms, stemmed terms, the parts of speech of terms, terms after stopwords have been removed, or a dictionary class to which a term belongs. Features can be entirely general, such as ngrams or syntactic dependencies, and we leave this open-ended. The `dfm()` function is slightly less conservative than the `tokens()` function, applying one transformation by default: converting the texts to lower case, via the default `tolower = TRUE`. ```{r} corp_inaug_post1990 <- corpus_subset(data_corpus_inaugural, Year > 1990) # make a dfm dfmat_inaug_post1990 <- corp_inaug_post1990 |> tokens() |> dfm() print(dfmat_inaug_post1990) ``` ### Analysing the document-feature matrix The dfm can be inspected in the Environment pane in RStudio, or by calling R's `View()` function. Calling `textplot_wordcloud()` on a dfm will display a wordcloud. ```{r fig.width = 8, fig.height = 8} dfmat_uk <- tokens(data_char_ukimmig2010, remove_punct = TRUE) |> tokens_remove(stopwords("en")) |> dfm() dfmat_uk ``` To access a list of the most frequently occurring features, we can use `topfeatures()`: ```{r} # 20 most frequent words topfeatures(dfmat_uk, 20) ``` ### Grouping documents by document variable Often, we are interested in analysing how texts differ according to substantive factors which may be encoded in the document variables, rather than simply by the boundaries of the document files. We can group documents that share the same value for a document variable using `dfm_group()`. When the variables for selection in functions such as `dfm_group()` are document variables attached to the dfm, then they can be referred to directly. ```{r} dfmat_pres <- tail(data_corpus_inaugural, 20) |> tokens(remove_punct = TRUE) |> tokens_remove(stopwords("en")) |> dfm() |> dfm_group(groups = Party) ``` We can sort the features of this dfm, by default in terms of greatest frequency first, and inspect it: ```{r} dfm_sort(dfmat_pres) ``` ## Dictionary functions For some applications we have prior knowledge of sets of words that are indicative of traits we would like to measure from the text. For example, a general list of positive words might indicate positive sentiment in a movie review, or we might have a dictionary of political terms which are associated with a particular ideological stance. In these cases, it is sometimes useful to treat these groups of words as equivalent for the purposes of analysis, and sum their counts into classes. For example, let's look at how words associated with terrorism and words associated with the economy vary by President in the inaugural speeches corpus. 
## Dictionary functions

For some applications we have prior knowledge of sets of words that are indicative of traits we would like to measure from the text.  For example, a general list of positive words might indicate positive sentiment in a movie review, or we might have a dictionary of political terms which are associated with a particular ideological stance.  In these cases, it is sometimes useful to treat these groups of words as equivalent for the purposes of analysis, and sum their counts into classes.

For example, let's look at how words associated with terrorism and words associated with the economy vary by President in the inaugural speeches corpus.  From the original corpus, we select the inaugural speeches since Clinton:

```{r}
corp_inaug_post1991 <- corpus_subset(data_corpus_inaugural, Year > 1991)
```

Now we define a demonstration dictionary:

```{r}
dict <- dictionary(list(terror = c("terrorism", "terrorists", "threat"),
                        economy = c("jobs", "business", "grow", "work")))
```

We can use the dictionary when making the dfm:

```{r}
dfmat_inaug_post1991_dict <- tokens(corp_inaug_post1991) |>
  tokens_lookup(dictionary = dict) |>
  dfm()
dfmat_inaug_post1991_dict
```

The constructor function `dictionary()` also works with several common "foreign" dictionary formats: the LIWC format and Provalis Research's Wordstat format.  For instance, we can load the LIWC dictionary and apply it to the Presidential inaugural speech corpus:

```{r, eval = FALSE}
dictliwc <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
dfmat_inaug_subset <- tokens(data_corpus_inaugural[52:58]) |>
  dfm() |>
  dfm_lookup(dictionary = dictliwc)
dfmat_inaug_subset[, 1:10]
```

```
## Document-feature matrix of: 7 documents, 10 features (1.43% sparse) and 4 docvars.
##               features
## docs           Pronoun  I  We Self You Other Negate Assent Article Preps
##   1993-Clinton     179 15 124  139  12    12     25      2     115   211
##   1997-Clinton     188  8 134  142   0    27     27      4     194   310
##   2001-Bush        176 15 111  126   8    16     40      2     104   208
##   2005-Bush        171 10  92  102  25    20     25      5     174   307
##   2009-Obama       243  5 156  161  17    34     40      2     185   294
##   2013-Obama       219  7 168  175   5    21     42      1     148   265
## [ reached max_ndoc ... 1 more document ]
```

# Further examples

See the Further Examples article at https://quanteda.io.