--- title: "Parsing Functions in edgarWebR" output: rmarkdown::html_vignette date: 2017-10-12 author: "Micah J Waldstein " vignette: > %\VignetteIndexEntry{Parsing Functions in edgarWebR} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} projects: ['edgarWebR'] categories: ['RStats'] type: post draft: false --- ```{r setup, echo = FALSE, message = FALSE} knitr::opts_chunk$set(collapse = T, comment = "#>") library(edgarWebR) set.seed(0451) # Cache http requests library(httptest) start_vignette("parsing") ``` New to edgarWebR 0.2.0 are functions for parsing SEC documents. While there are good R packages for XBRL processing, there is a gap in extracting information from other document types available via the site. edgarWebR currently provides functions for 2 of those - * `parse_submission()` - Processes a raw SGML filing into component documents. These are the 'Complete submission text file' on filing pages. Similar to zip files, they contain all the files included in particular submission. * `parse_filing()` - Processes a narrative filing (e.g. 10-K, 10-Q) into paragraphs annotated with part and item numbers. In a submission with many files, this is the main form. This vignette will show how to use both functions to find the risks reported by in a company's recent filing. ## Find a Submission Using edgarWebR functions, we'll first look up a recent filing. ```{r companyInfo} ticker <- "STX" filings <- company_filings(ticker, type = "10-Q", count = 40) # Specifying the type provides all forms that start with 10-, so we need to # manually filter. 
filings <- filings[filings$type == "10-Q", ]

# We're only interested in a particular filing
filing <- filings[filings$filing_date == "2017-10-27", ]

filing$md_href <- paste0("[Link](", filing$href, ")")
knitr::kable(filing[, c("type", "filing_date", "accession_number", "size", "md_href")],
             col.names = c("Type", "Filing Date", "Accession No.", "Size", "Link"),
             digits = 2,
             format.args = list(big.mark = ","))
```

## Get the Complete Submission File

We'll next get the list of files and find the link to the complete submission.

```{r document}
docs <- filing_documents(filing$href)
doc <- docs[docs$description == 'Complete submission text file', ]

doc$md_href <- paste0("[Link](", doc$href, ")")
knitr::kable(doc[, c("seq", "description", "document", "size", "md_href")],
             col.names = c("Sequence", "Description", "Document", "Size", "Link"),
             digits = 2,
             format.args = list(big.mark = ","))
```

Normally, we would use `filing_documents()` to get to the 10-Q directly, but as an example we'll use the complete submission file to demonstrate the `parse_submission()` function. You would want to use the complete submission file if you need access to the full list of files - e.g. in this case there are 80 files in the submission, but only 10 are available on the website and therefore available to `filing_documents()` - or if you are downloading all of the documents and care about efficiency.

## Parse the Complete Submission File

Now that we have the link to the complete submission file, we can parse it into components.

```{r parse_submission}
parsed_docs <- parse_submission(doc$href)
knitr::kable(head(parsed_docs[, c("SEQUENCE", "TYPE", "DESCRIPTION", "FILENAME")]),
             col.names = c("Sequence", "Type", "Description", "Document"),
             digits = 2,
             format.args = list(big.mark = ","))
```

And just as an example, here's the end of the full list - note, for instance, the Excel file that isn't on the SEC site.
```{r parse_submission_tail}
knitr::kable(tail(parsed_docs[, c("SEQUENCE", "TYPE", "DESCRIPTION", "FILENAME")]),
             col.names = c("Sequence", "Type", "Description", "Document"),
             digits = 2,
             format.args = list(big.mark = ","))
```

The 10-Q filing document is Seq. 1, with the full text of the document in the TEXT column.

```{r show_text}
# NOTE: the filing document is not always #1, so it is a good idea to also
# check the TYPE and DESCRIPTION columns
filing_doc <- parsed_docs[parsed_docs$TYPE == '10-Q' &
                          parsed_docs$DESCRIPTION == '10-Q', 'TEXT']
substr(filing_doc, 1, 80)
```

We can see that it contains the raw document. For document types that are not plain text, e.g. the XBRL zip file, the content is uuencoded and would need further processing.

## Parse the Filing Document

Fortunately, edgarWebR functions that take URLs will also accept a string containing the document itself, so rather than passing the URL of the online document, we can just pass in the full string we already have.

```{r parseFiling}
doc <- parse_filing(filing_doc, include.raw = TRUE)
unique(doc$part.name)
unique(doc$item.name)
head(doc[grepl("market risk", doc$item.name, ignore.case = TRUE), "text"], 3)
risks <- doc[grepl("market risk", doc$item.name, ignore.case = TRUE), "raw"]
```

Now the document is ready for whatever further processing we want. As a quick example, we'll pull out all the italicized risks.

```{r parseRisks}
risks <- risks[grep("<i>", risks)]
risks <- gsub("^.*<i>|</i>.*$", "", risks)
risks <- gsub("\n", " ", risks)
risks
```

This is a fairly simplistic example, but it should serve as a good tutorial on processing filings.
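As a sketch of one possible next step (plain base R, not an edgarWebR function), we could look at how the filing's annotated paragraphs are distributed across items, using the `doc` data frame produced by `parse_filing()` above:

```{r itemSummary}
# A rough sketch of follow-on analysis: tally the annotated paragraphs under
# each item of the filing, largest sections first. Uses only base R on the
# part.name/item.name annotations added by parse_filing().
item.counts <- sort(table(doc$item.name), decreasing = TRUE)
item.counts
```

The same pattern works with `doc$part.name`, or with `table(doc$part.name, doc$item.name)` to see items nested within parts.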
## How to Download

edgarWebR is available from CRAN, so it can be installed simply via

```{r eval = FALSE}
install.packages("edgarWebR")
```

If you want the latest and greatest, you can get a copy of the development version from GitHub using devtools:

```{r eval = FALSE}
# install.packages("devtools")
devtools::install_github("mwaldstein/edgarWebR")
```

```{r, include = FALSE}
# Cleanup
end_vignette()
```