--- title: "Import, Export, and Convert Data Files" date: "`r Sys.Date()`" output: html_document: fig_caption: false toc: true toc_float: collapsed: false smooth_scroll: false toc_depth: 3 vignette: > %\VignetteIndexEntry{Introduction to 'rio'} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Import, Export, and Convert Data Files The idea behind **rio** is to simplify the process of importing data into R and exporting data from R. This process is, probably unnecessarily, extremely complex for beginning R users. Indeed, R supplies [an entire manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) describing the process of data import/export. And, despite all of that text, most of the packages described are (to varying degrees) out-of-date. Faster, simpler, packages with fewer dependencies have been created for many of the file types described in that document. **rio** aims to unify data I/O (importing and exporting) into two simple functions: `import()` and `export()` so that beginners (and experienced R users) never have to think twice (or even once) about the best way to read and write R data. The core advantage of **rio** is that it makes assumptions that the user is probably willing to make. Specifically, **rio** uses the file extension of a file name to determine what kind of file it is. This is the same logic used by Windows OS, for example, in determining what application is associated with a given file type. By taking away the need to manually match a file type (which a beginner may not recognize) to a particular import or export function, **rio** allows almost all common data formats to be read with the same function. By making import and export easy, it's an obvious next step to also use R as a simple data conversion utility. Transferring data files between various proprietary formats is always a pain and often expensive. The `convert` function therefore combines `import` and `export` to easily convert between file formats (thus providing a FOSS replacement for programs like Stat/Transfer or Sledgehammer. ## Supported file formats **rio** supports a variety of different file formats for import and export. To keep the package slim, all non-essential formats are supported via "Suggests" packages, which are not installed (or loaded) by default. To ensure rio is fully functional, install these packages the first time you use **rio** via: ```R install_formats() ``` The full list of supported formats is below: ```{r, include = FALSE} suppressPackageStartupMessages(library(data.table)) ``` ```{r featuretable, echo = FALSE} rf <- data.table(rio:::rio_formats)[!input %in% c(",", ";", "|", "\\t") & type %in% c("import", "suggest", "archive"),] short_rf <- rf[, paste(input, collapse = " / "), by = format_name] type_rf <- unique(rf[,c("format_name", "type", "import_function", "export_function", "note")]) feature_table <- short_rf[type_rf, on = .(format_name)] colnames(feature_table)[2] <- "signature" setorder(feature_table, "type", "format_name") feature_table$import_function <- stringi::stri_extract_first(feature_table$import_function, regex = "[a-zA-Z0-9\\.]+") feature_table$import_function[is.na(feature_table$import_function)] <- "" feature_table$export_function <- stringi::stri_extract_first(feature_table$export_function, regex = "[a-zA-Z0-9\\.]+") feature_table$export_function[is.na(feature_table$export_function)] <- "" feature_table$type <- ifelse(feature_table$type == "suggest", "Suggest", "Default") feature_table <- feature_table[,c("format_name", "signature", "import_function", "export_function", "type", "note")] colnames(feature_table) <- c("Name", "Extensions / \"format\"", "Import Package", "Export Package", "Type", "Note") knitr::kable(feature_table) ``` Additionally, any format that is not supported by **rio** but that has a known R implementation will produce an informative error message pointing to a package and import or export function. Unrecognized formats will yield a simple "Unrecognized file format" error. ## Data Import **rio** allows you to import files in almost any format using one, typically single-argument, function. `import()` infers the file format from the file's extension and calls the appropriate data import function for you, returning a simple data.frame. This works for any for the formats listed above. ```{r, echo=FALSE, results='hide'} library("rio") export(mtcars, "mtcars.csv") export(mtcars, "mtcars.dta") export(mtcars, "mtcars_noext", format = "csv") ``` ```{r} library("rio") x <- import("mtcars.csv") y <- import("mtcars.dta") # confirm identical all.equal(x, y, check.attributes = FALSE) ``` If for some reason a file does not have an extension, or has a file extension that does not match its actual type, you can manually specify a file format to override the format inference step. For example, we can read in a CSV file that does not have a file extension by specifying `csv`: ```{r} head(import("mtcars_noext", format = "csv")) ``` ```{r, echo=FALSE, results='hide'} unlink("mtcars.csv") unlink("mtcars.dta") unlink("mtcars_noext") ``` ### Importing Data Lists Sometimes you may have multiple data files that you want to import. `import()` only ever returns a single data frame, but `import_list()` can be used to import a vector of file names into R. This works even if the files are different formats: ```r str(import_list(dir()), 1) ``` Similarly, some single-file formats (e.g. Excel Workbooks, Zip directories, HTML files, etc.) can contain multiple data sets. Because `import()` is type safe, always returning a data frame, importing from these formats requires specifying a `which` argument to `import()` to dictate which data set (worksheet, file, table, etc.) to import (the default being `which = 1`). But `import_list()` can be used to import all (or only a specified subset, again via `which`) of data objects from these types of files. ## Data Export The export capabilities of **rio** are somewhat more limited than the import capabilities, given the availability of different functions in various R packages and because import functions are often written to make use of data from other applications and it never seems to be a development priority to have functions to export to the formats used by other applications. That said, **rio** currently supports the following formats: ```{r} library("rio") export(mtcars, "mtcars.csv") export(mtcars, "mtcars.dta") ``` It is also easy to use `export()` as part of an R pipeline (from magrittr or dplyr). For example, the following code uses `export()` to save the results of a simple data transformation: ```{r} library("magrittr") mtcars %>% subset(hp > 100) %>% aggregate(. ~ cyl + am, data = ., FUN = mean) %>% export(file = "mtcars2.dta") ``` Some file formats (e.g., Excel workbooks, Rdata files) can support multiple data objects in a single file. `export()` natively supports output of multiple objects to these types of files: ```{r} # export to sheets of an Excel workbook export(list(mtcars = mtcars, iris = iris), "multi.xlsx") ``` It is also possible to use the new (as of v0.6.0) function `export_list()` to write a list of data frames to multiple files using either a vector of file names or a file pattern: ```{r} export_list(list(mtcars = mtcars, iris = iris), "%s.tsv") ``` ## File Conversion The `convert()` function links `import()` and `export()` by constructing a dataframe from the imported file and immediately writing it back to disk. `convert()` invisibly returns the file name of the exported file, so that it can be used to programmatically access the new file. Because `convert()` is just a thin wrapper for `import()` and `export()`, it is very easy to use. For example, we can convert ```{r} # create file to convert export(mtcars, "mtcars.dta") # convert Stata to SPSS convert("mtcars.dta", "mtcars.sav") ``` `convert()` also accepts lists of arguments for controlling import (`in_opts`) and export (`out_opts`). This can be useful for passing additional arguments to import or export methods. This could be useful, for example, for reading in a fixed-width format file and converting it to a comma-separated values file: ```{r} # create an ambiguous file fwf <- tempfile(fileext = ".fwf") cat(file = fwf, "123456", "987654", sep = "\n") # see two ways to read in the file identical(import(fwf, widths = c(1, 2, 3)), import(fwf, widths = c(1, -2, 3))) # convert to CSV convert(fwf, "fwf.csv", in_opts = list(widths = c(1, 2, 3))) import("fwf.csv") # check conversion ``` ```{r, echo=FALSE, results='hide'} unlink("mtcars.dta") unlink("mtcars.sav") unlink("fwf.csv") unlink(fwf) ``` With metadata-rich file formats (e.g., Stata, SPSS, SAS), it can also be useful to pass imported data through `characterize()` or `factorize()` when converting to an open, text-delimited format: `characterize()` converts a single variable or all variables in a data frame that have "labels" attributes into character vectors based on the mapping of values to value labels (e.g., `export(characterize(import("file.dta")), "file.csv")`). An alternative approach is exporting to CSVY format, which records metadata in a YAML-formatted header at the beginning of a CSV file. It is also possible to use **rio** on the command-line by calling `Rscript` with the `-e` (expression) argument. For example, to convert a file from Stata (.dta) to comma-separated values (.csv), simply do the following: ``` Rscript -e "rio::convert('mtcars.dta', 'mtcars.csv')" ``` ```{r, echo=FALSE, results='hide'} unlink("mtcars.csv") unlink("mtcars.dta") unlink("multi.xlsx") unlink("mtcars2.dta") unlink("mtcars.tsv") unlink("iris.tsv") ```