statcanR

Thierry Warin

library(statcanR)

Overview

statcanR provides the R user with a consistent process to identify and collect data from Statistics Canada’s data portal. It allows users to search for and access all of Statistics Canada’s open economic data (formerly known as CANSIM tables), which Statistics Canada’s new Web Data Service now identifies by product IDs (PID): https://www150.statcan.gc.ca/n1/en/type/data.

Quick start

First, install statcanR:

devtools::install_github("warint/statcanR")

Next, load statcanR to make sure everything is installed correctly.

library(statcanR)

Practical usage

This section presents an example of how to use the statcanR package and its functions: statcan_search() and statcan_download_data().

The following example illustrates how to use these functions. It consists of collecting data on federal expenditures on science and technology, by socio-economic objectives.

To identify a relevant table, call the statcan_search() function with a keyword or a vector of keywords, together with the language to search in (English or French). The example below lists the data tables we could be interested in:

# Identify with statcan_search() function
statcan_search(c("federal","expenditures"),"eng")

Notice that each matching table is listed with its unique table number. Let’s focus on the first of the two tables that appear, which contains data on Federal expenditures on science and technology, by socio-economic objectives. Once its table number is identified (‘27-10-0014-01’), the statcan_download_data() function collects the data as follows:

# Get data with the statcan_download_data() function
mydata <- statcan_download_data("27-10-0014-01", "eng")

Should you want the same data in French, replace the last argument of the function call with “fra”.

Code architecture

This section describes the code architecture of the statcanR package.

The statcan_search() function has 2 arguments:

  1. keywords
  2. lang

The ‘keywords’ argument is used to find out which open economic data tables Statistics Canada makes available. It accepts either a single word (e.g., statcan_search("expenditures", "eng")) or a vector of several keywords (e.g., statcan_search(c("expenditures", "federal", "objectives"), "eng")).

The second argument, ‘lang’, refers to the language. As Canada is a bilingual country, Statistics Canada publishes all of its economic data in both languages. Users can therefore see which data tables are available in English or in French by setting the lang argument to either “eng” or “fra”.
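To make the role of the keywords concrete, here is a minimal sketch of a keyword search over a toy catalog. It is not the package's implementation: search_catalog() is a hypothetical helper, the catalog holds one table taken from this document plus one obviously fake row, and the rule that every keyword must match is an assumption.

```r
# Toy catalog: the first row comes from this document, the second is a
# made-up placeholder. Descriptions and release dates are left empty
# because they are not given here.
catalog <- data.frame(
  title = c(
    "Federal expenditures on science and technology, by socio-economic objectives",
    "Placeholder table (hypothetical)"
  ),
  id = c("27-10-0014-01", "00-00-0000-00"),
  description = c("", ""),
  release_date = as.Date(c(NA, NA)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows whose title or description contains
# every keyword (case-insensitive, literal matching).
search_catalog <- function(catalog, keywords) {
  text <- tolower(paste(catalog$title, catalog$description))
  keep <- rep(TRUE, nrow(catalog))
  for (k in tolower(keywords)) {
    keep <- keep & grepl(k, text, fixed = TRUE)
  }
  catalog[keep, , drop = FALSE]
}

result <- search_catalog(catalog, c("federal", "expenditures"))
print(result[, c("id", "title")])
```

Only the first row matches both keywords, so only its table number would be passed on to the download step.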

For every data table that contains these words in either its title or its description, the resulting output includes four pieces of information: the name of the data table, its unique table number, its description, and its release date.

The statcan_download_data() function has 2 arguments:

  1. tableNumber
  2. lang

The ‘tableNumber’ argument simply refers to the table number of the Statistics Canada data table a user wants to collect, such as ‘27-10-0014-01’ for the Federal expenditures on science and technology, by socio-economic objectives, as an example.

Just as in the statcan_search() function, ‘lang’ refers to the language. Users can choose to collect the data table in English or in French by setting the lang argument to “eng” or “fra”.

The code architecture of the statcan_download_data() function is as follows. The first step is to clean the table number in order to align it with the official typology of Statistics Canada’s Web Data Service. The second step is to create a temporary folder where all the data wrangling operations are performed. The third step is to check and select the correct language. The fourth step is to define the URL where the data table is stored and then to download the .zip container. The fifth step is to unzip the downloaded .zip file to extract the data and metadata from the .csv files. The final step is to load the data into a data frame called ‘data’ and to add the official table name in a new ‘INDICATOR’ column.
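As a rough illustration of the first four steps (without performing the download itself), the sketch below cleans a table number, picks the language, builds the URL and prepares a temporary folder. The helper names clean_table_number() and build_zip_url() are hypothetical and not part of statcanR. The French URL pattern is the one quoted in this document; the English pattern is assumed to mirror it.

```r
# Hypothetical helper: drop the last two characters (the excerpt suffix),
# then remove every hyphen.
clean_table_number <- function(tableNumber) {
  gsub("-", "", substr(tableNumber, 1, nchar(tableNumber) - 2))
}

# Hypothetical helper: build the .zip URL for the requested language.
build_zip_url <- function(tableNumber, lang = c("eng", "fra")) {
  lang <- match.arg(lang)  # stops with an error for any other value
  # Assumption: the English path uses "en" and "-eng" where the French
  # path uses "fr" and "-fra".
  site <- if (lang == "eng") "en" else "fr"
  paste0("https://www150.statcan.gc.ca/n1/", site, "/tbl/csv/",
         tableNumber, "-", lang, ".zip")
}

pid <- clean_table_number("27-10-0014-01")
url <- build_zip_url(pid, "fra")
print(pid)  # "27100014"
print(url)  # "https://www150.statcan.gc.ca/n1/fr/tbl/csv/27100014-fra.zip"

# Temporary working folder and target path for the download and unzip steps.
work_dir <- tempfile("statcan_")
dir.create(work_dir)
zip_path <- file.path(work_dir, paste0(pid, "-fra.zip"))
```

Using match.arg() is one way to reject unsupported language codes early; the package itself may handle this differently.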

To be more precise, the main lines of code are described below:

statcan_download_data()

  • tableNumber <- gsub("-", "", substr(tableNumber, 1, nchar(tableNumber)-2)): The first step is to clean the table number provided by the user so that the whole data table behind the indicator of interest can be collected. Each indicator is an excerpt of a larger table, and the last 2 digits after the last ‘-’ identify that excerpt. Statistics Canada’s Web Data Service stores the .csv files under table numbers written without any ‘-’. The function therefore removes the last 2 digits and every ‘-’ from the user’s input, so that ‘27-10-0014-01’ becomes ‘27100014’.

  • if(lang == "eng") or if(lang == "fra"): The second step is an ‘if statement’ that selects the branch for the requested language.

  • urlFra <- paste0("https://www150.statcan.gc.ca/n1/fr/tbl/csv/", tableNumber, "-fra.zip"): The third step is to build the URL from which the corresponding .zip file is downloaded from the Statistics Canada Web Data Service.

  • download(urlFra, download_dir, mode = "wb"): The fourth step downloads the .zip file into a temporary folder.

  • unzip(zipfile = download_dir, exdir = unzip_dir): The fifth step unzips the .zip file. This gives access to two .csv files: the overall data table and the metadata table.

  • data.table::fread(csv_file): The sixth step loads the data table into a single data frame. The fread() function from the data.table package is used for its higher performance.

  • data$INDICATOR <- as.character(0) and data$INDICATOR <- as.character(read.csv(paste0(path,"/temp/", tableNumber, "_MetaData.csv"))[1,1]): The seventh step of the statcan_download_data() function adds the name of the table, read from the first cell of the metadata table, in the INDICATOR column.

  • unlink(tempdir()): The eighth step deletes the temporary folder used to download, unzip, extract and load the data.

  • return(data): Finally, the last step of the statcan_download_data() function returns the data frame to the user’s environment.
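The loading, INDICATOR and clean-up steps can be tried offline with the sketch below. It is an illustration under assumptions, not the package's code: the two .csv files are written locally with placeholder values instead of being downloaded, the data file name (table number plus ".csv") is assumed, and base read.csv() stands in for data.table::fread() so that the example has no dependencies.

```r
# Toy stand-ins for the two unzipped files; all values are placeholders.
work_dir <- tempfile("statcan_sketch_")
dir.create(work_dir)
table_id <- "27100014"

data_file <- file.path(work_dir, paste0(table_id, ".csv"))          # assumed name
meta_file <- file.path(work_dir, paste0(table_id, "_MetaData.csv"))

write.csv(
  data.frame(REF_DATE = c(2000, 2001), VALUE = c(1, 2)),
  data_file, row.names = FALSE
)
write.csv(
  data.frame(
    `Cube Title` = "Federal expenditures on science and technology, by socio-economic objectives",
    check.names = FALSE
  ),
  meta_file, row.names = FALSE
)

# Load the data table into a data frame.
data <- read.csv(data_file, stringsAsFactors = FALSE)

# The table name is the first cell of the metadata file; it is recycled
# over every row of the data frame.
meta <- read.csv(meta_file, stringsAsFactors = FALSE)
data$INDICATOR <- as.character(meta[1, 1])

# In this sketch, recursive = TRUE is needed for unlink() to delete a folder.
unlink(work_dir, recursive = TRUE)

print(data)
```

Every row of the resulting data frame carries the table name in its INDICATOR column, which is the shape of the object handed back to the user.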