Function reference

get_issues_from_archive: Scraping issues’ URLs from the OJS issues archive

get_issues_from_archive() takes a vector of OJS URLs and scrapes the issue URLs from the issue archive.

You don’t need to provide the actual URL of the issue archive. get_issues_from_archive() parses the URL you provide to compose it. Then, it looks for links containing “/issue/view” in the href. Links are post-processed to comply with OJS routing conventions before returning.

journal <- 'https://revistapsicologia.uchile.cl/index.php/RDP/'

issues <- ojsr::get_issues_from_archive(input_url = journal)

The result is a long-format data frame (1 input_url may result in several rows, one for each output_url) containing:

input_url - the URL you provided
output_url - the issue URL that was scraped
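
The URL handling described above can be approximated offline in base R. This is a minimal sketch, not the package's implementation: the helper names compose_archive_url() and keep_issue_links() are hypothetical, the hrefs are made up, and the issue/archive route is assumed from the archive URL form that appears elsewhere in this reference.

```r
# Hypothetical helper: compose the issue archive URL from a journal URL.
# Assumes the URL looks like <host>/index.php/<journal>/...
compose_archive_url <- function(journal_url) {
  # Keep everything up to and including index.php/<journal>
  base <- sub("^(.*/index\\.php/[^/]+).*$", "\\1", journal_url)
  paste0(base, "/issue/archive")
}

# Hypothetical helper: keep only hrefs that point to an issue view
keep_issue_links <- function(hrefs) {
  unique(hrefs[grepl("/issue/view", hrefs, fixed = TRUE)])
}

journal <- "https://revistapsicologia.uchile.cl/index.php/RDP/"
archive_url <- compose_archive_url(journal)

# Made-up hrefs, standing in for links found on an archive page
hrefs <- c(
  "https://example.org/index.php/demo/issue/view/1",
  "https://example.org/index.php/demo/about",
  "https://example.org/index.php/demo/issue/view/1"
)
issue_links <- keep_issue_links(hrefs)
```

The duplicate link is dropped because the same issue is often linked more than once on a page (cover image and title, for instance).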

get_articles_from_issue: Scraping articles URLs from the ToC of OJS issues

get_articles_from_issue() takes a vector of OJS (issue) URLs and scrapes the links to articles from each issue’s table of contents.

You don’t need to provide the actual URL of the issue’s ToC, but you must provide URLs that include the issue ID (article URLs do not include this information!). get_articles_from_issue() parses the URL you provide to compose the ToC URL. Then, it looks for links containing “/article/view” in the href. Links are post-processed to comply with OJS routing conventions before returning.

issue <- 'https://revistapsicologia.uchile.cl/index.php/RDP/issue/view/6031/'

articles <- ojsr::get_articles_from_issue(input_url = issue)

The result is a long-format dataframe (1 input_url may result in several rows, one for each output_url), containing:

input_url - the URL you provided
output_url - the article URL that was scraped
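
Because the input must carry an issue ID, it can help to screen a vector of URLs before passing it on. The following is a rough sketch under the assumption that issue IDs are numeric and follow “/issue/view/”; has_issue_id() is a hypothetical helper, not part of ojsr.

```r
# Hypothetical helper: TRUE when a URL carries an issue ID (/issue/view/<digits>)
has_issue_id <- function(urls) {
  grepl("/issue/view/[0-9]+", urls)
}

candidates <- c(
  "https://revistapsicologia.uchile.cl/index.php/RDP/issue/view/6031/",
  "https://revistapsicologia.uchile.cl/index.php/RDP/article/view/75178"
)

# Keep only the URLs that could be used as issue-level input
usable <- candidates[has_issue_id(candidates)]
```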

get_articles_from_search: Scraping OJS search results for a given criteria to retrieve articles’ URL

get_articles_from_search() takes a vector of OJS URLs and a string for search criteria to compose search result URLs, then it scrapes them to retrieve the articles’ URLs.

You don’t need to provide the actual URL of the search result pages. get_articles_from_search() parses the URL you provide to compose the URL of the search result page(s). If the results are paginated, the links to the additional pages are also followed. Then, it looks for links containing “/article/view” in the href. Links are post-processed to comply with OJS routing conventions before returning.

journal <- 'https://revistapsicologia.uchile.cl/index.php/RDP/'

criteria <- "psicologia+social"

articles_search <- ojsr::get_articles_from_search(input_url = journal, search_criteria = criteria)

The result is a long-format dataframe (1 input_url may result in several rows, one for each output_url), containing:

input_url - the URL you provided
output_url - the article URL
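
The example criteria above joins its words with “+”. A small helper can build such a string from separate words. This is only a sketch: build_criteria() is a hypothetical name, and whether accented or other special characters need extra encoding is not covered by this reference.

```r
# Hypothetical helper: join search words with "+", as in "psicologia+social"
build_criteria <- function(words) {
  words <- trimws(words)          # drop stray whitespace around each word
  words <- words[nzchar(words)]   # drop empty entries
  paste(words, collapse = "+")
}

criteria <- build_criteria(c("psicologia", " social "))
```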

get_galleys_from_article: Scraping galleys URLs from OJS articles

Galleys are the final presentation version of an article’s content. Most of the time, these include the full content in PDF and other reading formats. Less often, they are supplementary files (tables, datasets) in different formats.

get_galleys_from_article() takes a vector of OJS URLs and scrapes all the galley URLs from the article view.

You may provide any article-level URL (article abstract view, inline view, PDF direct download, etc.). get_galleys_from_article() parses the URL you provide to compose the article view URL. Then, it looks for links containing “/article/view” in the href. Links are post-processed to comply with OJS routing conventions before returning (i.e., having a galley ID).

article <- 'https://dspace.palermo.edu/ojs/index.php/psicodebate/article/view/516/311' # inline reader

galleys <- ojsr::get_galleys_from_article(input_url = article)

The result is a long-format dataframe (1 input_url may result in several rows, one for each output_url), containing:

input_url - the URL you provided
output_url - the galley URL that was scraped
format - the format of the galley (e.g., PDF, XML)
download_url - the conventional URL to force galley download. You may pass these to a download function of your own (e.g., https://stackoverflow.com/questions/39246739/download-multiple-files-using-download-file-function).
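
A download function of your own could look roughly like the sketch below, built on utils::download.file(). The helper names galley_destfiles() and download_galleys(), the file naming scheme, and the demo rows are all assumptions made for illustration; only the format and download_url columns come from the result described above.

```r
# Hypothetical helper: build one local file path per galley row,
# using the lowercased format (e.g., "PDF") as the file extension
galley_destfiles <- function(galleys, dir = tempdir()) {
  if (nrow(galleys) == 0) return(character(0))
  ext <- tolower(galleys$format)
  file.path(dir, paste0("galley_", seq_len(nrow(galleys)), ".", ext))
}

# Hypothetical helper: download each galley, recording failures instead of stopping
download_galleys <- function(galleys, dir = tempdir()) {
  dest <- galley_destfiles(galleys, dir)
  ok <- logical(nrow(galleys))
  for (i in seq_len(nrow(galleys))) {
    ok[i] <- tryCatch({
      utils::download.file(galleys$download_url[i], dest[i],
                           mode = "wb", quiet = TRUE)
      TRUE
    }, error = function(e) FALSE)
  }
  data.frame(download_url = galleys$download_url, destfile = dest,
             downloaded = ok, stringsAsFactors = FALSE)
}

# Made-up rows in the documented column layout (no request is made here)
demo_galleys <- data.frame(
  format = c("PDF", "XML"),
  download_url = c("https://example.org/a.pdf", "https://example.org/a.xml"),
  stringsAsFactors = FALSE
)
dest <- galley_destfiles(demo_galleys, dir = "downloads")
```

With a real result, download_galleys(galleys) would attempt one download per row; mode = "wb" keeps binary files such as PDFs intact on Windows.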

get_html_meta_from_article: Scraping metadata from the OJS articles HTML

get_html_meta_from_article() takes a vector of OJS URLs and scrapes all metadata written in HTML from the article view (e.g., https://publicaciones.sociales.uba.ar/index.php/psicologiasocial/article/view/593).

You may provide any article-level URL (article abstract view, inline view, PDF direct download, etc.). get_html_meta_from_article() parses the URL you provide to compose the URL of the article view. Then, it looks for <meta> tags in the <head> section of the HTML. Important! This may not only retrieve bibliographic metadata; any other “meta” property declared in the HTML will be obtained as well (e.g., descriptions for sharing on social networks).

article <- 'https://revistapsicologia.uchile.cl/index.php/RDP/article/view/75178'

metadata <- ojsr::get_html_meta_from_article(input_url = article)

The result is a long-format dataframe (1 input_url may result in several rows, one for each metadata property found), containing:

input_url - the URL you provided
meta_data_name - name of the property/metadata (e.g., “DC.Date.created” for the Date of creation)
meta_data_content - the actual metatag value
meta_data_scheme - the standard in which the content is annotated
meta_data_xmllang - the language in which the metadata was entered
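
Since non-bibliographic properties may be mixed in, a common next step is to filter the long-format result by property name. The sketch below uses made-up rows in the column layout listed above; the “DC.” prefix follows the “DC.Date.created” example, and the other names and values are purely illustrative.

```r
# Made-up rows in the documented column layout
demo_meta <- data.frame(
  input_url = "https://example.org/index.php/demo/article/view/1",
  meta_data_name = c("DC.Date.created", "DC.Title", "og:description"),
  meta_data_content = c("2020-01-01", "A made-up title", "A made-up description"),
  meta_data_scheme = c("ISO8601", NA, NA),
  meta_data_xmllang = c(NA, "en", NA),
  stringsAsFactors = FALSE
)

# Keep only properties whose name starts with "DC."
dc_only <- demo_meta[startsWith(demo_meta$meta_data_name, "DC."), ]

# Look up a single property by name
created <- dc_only$meta_data_content[dc_only$meta_data_name == "DC.Date.created"]
```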

get_oai_meta_from_article: Retrieving OAI records for OJS articles

An alternative to scraping metadata from the HTML of the article pages is to retrieve their OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) records.

get_oai_meta_from_article() will try to access the OAI records within the OJS for each article URL you provide.

article <- 'https://dspace.palermo.edu/ojs/index.php/psicodebate/article/view/516/311' # xml galley

metadata_oai <- ojsr::get_oai_meta_from_article(input_url = article)

The result is a long-format dataframe (1 input_url may result in several rows, one for each metadata property found), containing:

input_url - the URL you provided
meta_data_name - name of the property/metadata (e.g., “DC.Date.created” for the Date of creation)
meta_data_content - the actual metatag value
meta_data_scheme - it always returns NA (included just for easier binding with get_html_meta_from_article() results)
meta_data_xmllang - it always returns NA (included just for easier binding with get_html_meta_from_article() results)
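
Because both metadata functions return the same five columns, their results can be stacked with rbind(). The sketch below shows this with made-up rows; the property names, values, and the extra source column are illustrative assumptions rather than real output.

```r
# Made-up row shaped like a get_html_meta_from_article() result
html_meta <- data.frame(
  input_url = "https://example.org/index.php/demo/article/view/1",
  meta_data_name = "DC.Title",
  meta_data_content = "A made-up title",
  meta_data_scheme = NA_character_,
  meta_data_xmllang = "en",
  stringsAsFactors = FALSE
)

# Made-up row shaped like a get_oai_meta_from_article() result
oai_meta <- data.frame(
  input_url = "https://example.org/index.php/demo/article/view/1",
  meta_data_name = "title",
  meta_data_content = "A made-up title",
  meta_data_scheme = NA_character_,
  meta_data_xmllang = NA_character_,
  stringsAsFactors = FALSE
)

# Tag each row with its origin before stacking, so it stays traceable
html_meta$source <- "html"
oai_meta$source <- "oai"
all_meta <- rbind(html_meta, oai_meta)
```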

Note: This function is at a very preliminary stage. If you are interested in working with OAI records, you may want to check Scott Chamberlain’s oai package for R (https://CRAN.R-project.org/package=oai). If you only have the OJS home URL and would like to retrieve all the articles’ OAI records in one shot, one option is to parse it with ojsr::parse_oai_url() and pass the result to oai::list_identifiers().

parse_base_url: Parsing URLs against OJS routing conventions to retrieve the base URL

parse_base_url() takes a vector of OJS URLs and retrieves their base URL, according to OJS routing conventions.

mix_links <- c(
   'https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive',
   'https://revistapsicologia.uchile.cl/index.php/RDP/article/view/75178'
)

base_url <- ojsr::parse_base_url(input_url = mix_links)

The result is a vector of the same length as your input.
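
To illustrate what “base URL” means under these routing conventions, here is a rough base R approximation. It assumes the base URL ends at index.php/<journal>; sketch_base_url() is a hypothetical helper, and the value actually returned by parse_base_url() may differ in details such as a trailing slash.

```r
# Hypothetical helper: keep everything up to and including index.php/<journal>
sketch_base_url <- function(urls) {
  sub("^(.*/index\\.php/[^/]+).*$", "\\1", urls)
}

mix_links <- c(
  "https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive",
  "https://revistapsicologia.uchile.cl/index.php/RDP/article/view/75178"
)

# sub() is vectorized, so the output has one element per input URL
base_guess <- sketch_base_url(mix_links)
```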

parse_oai_url: Parsing URLs against OJS routing conventions to retrieve the OAI protocol URL

parse_oai_url() takes a vector of OJS URLs and retrieves their OAI entry URL, according to OJS routing conventions.

mix_links <- c(
   'https://dspace.palermo.edu/ojs/index.php/psicodebate/issue/archive',
   'https://revistapsicologia.uchile.cl/index.php/RDP/article/view/75178'
)

oai_url <- ojsr::parse_oai_url(input_url = mix_links)

The result is a vector of the same length as your input.
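
An OAI entry URL is typically used by appending a protocol verb to it. The sketch below composes such a request URL in base R. It assumes the OAI entry point sits at <base URL>/oai, which is an assumption about OJS routing and not confirmed by this reference; Identify is a standard OAI-PMH verb, and sketch_oai_request() is a hypothetical helper.

```r
# Hypothetical helper: compose an OAI-PMH request URL from a journal base URL.
# Assumes the OAI entry point is <base_url>/oai.
sketch_oai_request <- function(base_url, verb = "Identify") {
  # Strip trailing slashes so the path is not doubled
  paste0(sub("/+$", "", base_url), "/oai?verb=", verb)
}

req <- sketch_oai_request("https://revistapsicologia.uchile.cl/index.php/RDP/")
```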

ojsr-vignette

Gaston Becerra