Welcome to the kibior package introduction vignette!





1 General notions

As one of the hot topics in science, making our datasets findable, accessible, interoperable and reusable (the FAIR principles) brings openness and versioning, and unlocks reproducibility. To support that, great projects such as the biomaRt R package enable fast consumption and easy handling of massive validated data through a small R interface.

Even though major entities such as Ensembl or NCBI make massive amounts of data available, they do not provide a way to store data elsewhere, delegating data handling to research teams. During data analysis, this can be an issue since researchers often need to send intermediate subsets of analyzed data to collaborators. Moreover, it is now common for a new database or dataset to be released together with a web platform and an API, allowing easier exploration and querying.

Multiplying the number of life-science research teams worldwide by the ever-growing number of databases and datasets published on widely varying sub-fields results in an even greater number of ways to query heterogeneous life-science data.

Here, we present an easy way to manipulate and share datasets through decentralization. kibior provides a search engine and distributed database system for sharing data easily, through the use of Elasticsearch (ES) and Elasticsearch-based architectures such as Kibio.

It is a way to handle large datasets and unlock the possibility to:

  • pull/download datasets from a local or remote instance of Elasticsearch,
  • filter, query and search in large amounts of data,
  • push/store datasets to a local or remote instance of Elasticsearch,
  • share datasets with collaborators around the world,
  • perform joins between R in-memory and ES-based datasets,
  • import and export datasets from and to files,
  • validate safe-state datasets during pipeline execution,
  • comply with FAIR-sharing requirements by allowing REST requests on data and metadata through the Elasticsearch API.

1.1 Goal of this vignette

The following sections explain basic and advanced technical usage of kibior. A second vignette applies these features to biological applications.

1.2 Vocabulary

We will use both Elasticsearch and R vocabulary, which have similar notions:

R                             | Elasticsearch
data(set), tibble, df, etc.   | index
columns, variables            | fields
lines, observations           | documents

kibior uses tibbles as main data representation.

1.3 Public instances

The public Kibio instance is available at kibio.compbio.ulaval.ca port 80. You can simply connect to it via the get_kibio_instance() method of kibior.

1.4 Demonstration datasets

Before moving on to the second, separate vignette showing examples on biological datasets, we strongly advise the reader to start with the basic and advanced usage sections. In these sections, our examples use datasets taken from other well-known packages: dplyr::starwars, dplyr::storms, datasets::iris and ggplot2::diamonds.





2 Deploying an Elasticsearch instance

Before starting, you should know that this step will start an Elasticsearch service and store all data on your machine.

So, you should weigh the quantity of data you will handle in your code against the remaining space on your computer.

2.1 Installation with Docker and docker-compose

To use this feature, you will need Docker and docker-compose installed on your system.

To install Docker, simply follow the steps detailed on its website.

If you are on a Linux / Unix-based system, you should also check the post-installation steps, mainly the “Manage Docker as a non-root user” step.

To install docker-compose, simply follow the installation steps in its official documentation.

2.2 Run your own Elasticsearch instance

We want something easy to use, so we rely on docker-compose. You can also use plain Docker by passing all parameters inline, but it is verbose.

The files described below can be found in the inst/docker_conf folder of the kibior package.

2.2.2 DNS configuration file

Copy-paste these lines in a new resolv.conf file if you need to connect to ES named services on the web.

2.2.3 Docker-compose configuration file

Copy-paste these lines inside a single-es.yml file.

version: '2.4'
services:

##  --------------------------
##  If you need rstudio
##  --------------------------

  # rstudio4:
  #   container_name: rstudio4
  #   image: rocker/rstudio:4.0.3
  #   environment:
  #   - PASSWORD=myrstudio
  #   - USERID=1000
  #   #
  #   volumes:
  #   - type: bind
  #     source: <path_for_RStudio_data_folder_on_your_computer>
  #     target: /work/rstudio/data    # we create a folder inside the container
  #     read_only: false
  #   #
  #   ports:
  #   - 8787:8787
  #   networks:
  #   - kibiornet
  #   # cpu and ram constraints
  #   cpu_count: 1
  #   cpu_percent: 75
  #   cpus: 0.75
  #   memswap_limit: 0
  #   mem_reservation: 256m
  #   mem_limit: 6g

##  --------------------------
##  If you need a bash cli + R cli
##  See https://hub.docker.com/u/rocker for more versions 
##  with preinstalled material (e.g. tidyverse)
##  --------------------------

  # r4:
  #   container_name: r4
  #   image: roncar/kibior-env:4.0.3        # pre-configured R version 4.0.3 with Kibior installed
  #   stdin_open: true
  #   tty: true
  #   entrypoint: "/bin/bash"
  #   #
  #   volumes:
  #   - type: bind
  #     source: <path_for_R_data_folder_on_your_computer>
  #     target: /work/r/data    # we create a folder inside the container
  #     read_only: false
  #   - type: bind
  #     source: ./resolv.conf
  #     target: /etc/resolv.conf
  #     read_only: false
  #   #
  #   networks:
  #   - kibiornet
  #   # cpu and ram constraints
  #   cpu_count: 1
  #   cpu_percent: 75
  #   cpus: 0.75
  #   memswap_limit: 0
  #   mem_reservation: 256m
  #   mem_limit: 6g

##  --------------------------
##  Elasticsearch container
##  --------------------------

  elasticsearch:
    # this configuration will run a service called "elasticsearch"
    container_name: elasticsearch
    # the elasticsearch image used will be version 7
    # but you can use another version, such as 6.8.6
    image: docker.elastic.co/elasticsearch/elasticsearch:7.10.2
    # defines env var
    # last line tells us java will use 512MB
    # if you need more, change it for 2GB, for instance
    # "ES_JAVA_OPTS=-Xms2g -Xmx2g"
    environment:
    - discovery.type=single-node
    - bootstrap.memory_lock=true
    - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    # strict limit to 1GB of RAM
    mem_limit: 1g
    memswap_limit: 0
    # lock memory
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    # bind files and folders of your system with those inside of the container 
    volumes:
    # ES data folder
    - type: bind
      source: <path_for_es_data_folder_on_your_computer>
      target: /usr/share/elasticsearch/data
      read_only: false
    # ES configurations
    - type: bind
      source: ./elasticsearch.yml
      target: /usr/share/elasticsearch/config/elasticsearch.yml
      read_only: true
    # export port to access Elasticsearch service from outside docker
    ports: 
    - 9200:9200
    # networks managed by docker 
    networks:
    - kibiornet

# network declaration
networks:
  kibiornet:

2.3 R session

2.3.1 I have R already installed on my computer

If you have R installed on your computer, simply use it with a kibior instance pointing at localhost:9200. Since it is the default configuration, you will only need this to work:

2.3.2 I do not already have R installed on my computer

If you do not have R installed on your computer, you can:

  1. Install it, or
  2. Use Docker and docker-compose.

The following sections guide you to use the R cli or the RStudio container. Both have kibior and its dependencies installed, but you can choose to use a clean R environment instead (i.e. rocker containers).

2.3.2.1 R command-line interface (R cli)

Steps:

  1. Uncomment the R cli section (i.e. section “If you need a bash cli + R cli”) in the single-es.yml file.
  2. Put a volume path if you need to work on specific files.
  3. Use the same command to launch the service.
  4. Use the R command-line interface inside the container.
# run services (daemonized)
docker-compose -f single-es.yml up -d
elasticsearch is up-to-date
Creating r4 ... done

#  see the current docker processes
docker ps
CONTAINER ID   IMAGE                                                  COMMAND                  CREATED          STATUS          PORTS                              NAMES
0f1afd07f58a   roncar/kibior-env:4.0.3                                "/bin/bash"              4 minutes ago    Up 4 minutes                                       r4
40814036d980   docker.elastic.co/elasticsearch/elasticsearch:7.10.2   "/tini -- /usr/local…"   4 minutes ago    Up 4 minutes    0.0.0.0:9200->9200/tcp, 9300/tcp   elasticsearch

# open an interactive bash inside the R container (see previous command container ID)
docker exec -it 0f1afd07f58a bash

# inside the R container, query the ES container (with its container name)
root@0f1afd07f58a:/$ curl -X GET "http://elasticsearch:9200"
{
  "name" : "20f2383b909a",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "InZqVTNiTK6idAWrEweWDg",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

# inside the R container, run R cli
root@0f1afd07f58a:/$ R --vanilla

R version 4.0.3 (2020-10-10) -- "Bunny-Wunnies Freak Out"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(kibior)
# Here you can directly load kibior as it is pre-installed inside the container.

This container comes with R version 4.0.3, with the kibior package and its dependencies pre-installed. If you need a clean container with only R, you can use the rocker/r-ver:4.0.3 image instead.

2.3.2.2 RStudio

Steps:

  1. Uncomment the RStudio section (i.e. section “If you need rstudio”) in the single-es.yml file.
  2. Put a volume path if you need to work on specific files.
  3. Use the same command to launch the service.
  4. Use RStudio interface inside your web browser with login/password you configure in the RStudio section.

Connect with your web browser to localhost:8787 with the login/password that were configured in the single-es.yml file.

This container comes with RStudio (R version 4.0.3), with the kibior package and its dependencies pre-installed. If you need a clean container with only RStudio, you can use the rocker/rstudio:4.0.3 image instead.

3 Vignettes menu

This vignette is organized as a simple tutorial with examples you can follow to learn the basics of how kibior works:

  • Basic usage shows the main methods with simple examples of how to use them.
  • Advanced usage details the kibior object and method specificities, such as attributes and querying syntax.

The last part is the second vignette, illustrating a more biologically-oriented use case with kibior.





4 Basic usage

Here, we will see the main methods (push(), pull(), list(), columns(), keys(), has(), match(), export(), import(), move(), copy()) and public attributes (verbosity) of the kibior class. kibior uses elastic (Chamberlain 2020) to perform base functions.

4.1 Verbosity attributes

By default, kibior comes with three public attributes: $verbose, $quiet_progress and $quiet_results, all initialized to FALSE.

  • $verbose toggles the printing of additional information, which can be useful to see all process steps.
  • $quiet_progress toggles the printing of progress bars. This can be useful for scripts.
  • $quiet_results toggles the printing of method results. You may want to silence results when you do not need interactive feedback.

To quickly show them, simply print the instance you are using:

Use kc$<attribute-name> <- TRUE/FALSE to toggle verbosity mode on these three attributes.

A new kibior instance defaults to interactive behavior: progress bars and immediate printing of results, but no additional information.

See Attribute access in Advanced usage section for all attribute descriptions.

4.2 $push(): Store a dataset to Elasticsearch

To store data using kc connection:

If not already taken, the given index name will be created automatically before receiving data. If already taken, an error is raised.

Important points:

  1. $push() automatically sends data to the Elasticsearch server, which needs unique IDs. You can define your own IDs with the id_col parameter, which requires the name of a column whose elements are unique.
  2. If id_col is not defined, kibior adds a kid column holding a counter and uses it as unique IDs (default).
  3. $push() expects well-formatted data, mainly a data.frame or a derivative structure such as a tibble.

See Push modes in Advanced usage section for more information.
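
As a minimal sketch of the points above, the helper below pushes the same data frame twice: once letting kibior add its default ID counter, once with a user-chosen ID column. It assumes that kc is an already connected kibior instance and that the data and the index name are the first two arguments of $push(), as in the update example of the Push modes section. The helper push_both_ways() and the index name suffixes are hypothetical, for illustration only.

```r
# Hypothetical helper: `kc` is assumed to be a connected kibior instance.
push_both_ways <- function(kc, data, index_name, id_col) {
  # $push() expects a data.frame or a derivative such as a tibble
  stopifnot(is.data.frame(data))
  # a user-defined ID column must exist and hold unique values
  stopifnot(id_col %in% names(data), !anyDuplicated(data[[id_col]]))

  # 1. default IDs: kibior adds its own counter column ("kid" by default)
  kc$push(data, paste0(index_name, "_default_ids"))
  # 2. user-defined IDs taken from `id_col`
  kc$push(data, paste0(index_name, "_custom_ids"), id_col = id_col)
  invisible(kc)
}
```

With a running Elasticsearch instance, a call such as push_both_ways(kc, dplyr::starwars, "starwars", id_col = "name") would create two indices. Both names must still be free, since an already taken index name raises an error.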

4.4 $list(): List all Elasticsearch indices

4.6 $count(): Count the number of elements

Like $search() and $pull(), this method accepts a query parameter to count the number of hits in your dataset matching a query. See Querying in Advanced usage section for more information.

4.7 $keys(): List all unique keys of an Elasticsearch index column

You should not use this on columns that represent a continuous range, such as temperatures or coordinates. It aggregates all possible values, which can take a large amount of time if your dataset is big enough.

4.9 $match(): Select matching Elasticsearch indices

$match() and $has() differ on some points:

  • $has() returns TRUE or FALSE for any string passed.
  • $has() does not accept patterns and only looks if the given strings are in $list().
  • $match() only returns something if some indices match the given strings.
  • $match() accepts patterns and unpacks all possible indices matching given strings.

4.10 $export(): Extract Elasticsearch index content to a file

The $export() method creates a file and exports an in-memory dataset or an Elasticsearch index to it.

library(magrittr) # provides the %>% pipe used below

#> create temp file paths to receive the data
storms_memory_tmp <- tempfile(fileext=".csv")
storms_elastic_tmp <- tempfile(fileext=".csv")

#> export an in-memory dataset to a file
dplyr::storms %>% kc$export(data = ., filepath = storms_memory_tmp)

## [1] "/tmp/RtmpVAwsWi/file243436451ae3.csv"

kc$import(storms_memory_tmp) %>% tibble::as_tibble()

## # A tibble: 10,010 x 13
##    name   year month   day  hour   lat  long status category  wind pressure
##    <chr> <int> <int> <int> <int> <dbl> <dbl> <chr>     <int> <int>    <int>
##  1 Amy    1975     6    27     0  27.5 -79   tropi…       -1    25     1013
##  2 Amy    1975     6    27     6  28.5 -79   tropi…       -1    25     1013
##  3 Amy    1975     6    27    12  29.5 -79   tropi…       -1    25     1013
##  4 Amy    1975     6    27    18  30.5 -79   tropi…       -1    25     1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropi…       -1    25     1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropi…       -1    25     1012
##  7 Amy    1975     6    28    12  33.3 -78   tropi…       -1    25     1011
##  8 Amy    1975     6    28    18  34   -77   tropi…       -1    30     1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropi…        0    35     1004
## 10 Amy    1975     6    29     6  34   -74.8 tropi…        0    40     1002
## # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
## #   hu_diameter <dbl>

#> export an Elasticsearch index to a file
"storms" %>% kc$export(data = ., filepath = storms_elastic_tmp)

## [1] "/tmp/RtmpVAwsWi/file24343220815.csv"

kc$import(storms_elastic_tmp) %>% tibble::as_tibble()

## # A tibble: 10,010 x 14
##    name   year month   day  hour   lat  long status category  wind pressure
##    <chr> <int> <int> <int> <int> <dbl> <dbl> <chr>     <int> <int>    <int>
##  1 Ike    2008     9     7    18  21   -74   hurri…        3   105      946
##  2 Ike    2008     9     8     0  21.1 -75.2 hurri…        4   115      945
##  3 Ike    2008     9     8     2  21.1 -75.7 hurri…        4   115      945
##  4 Ike    2008     9     8     6  21.1 -76.5 hurri…        3   100      950
##  5 Ike    2008     9     8    12  21.1 -77.8 hurri…        2    85      960
##  6 Ike    2008     9     8    18  21.2 -79.1 hurri…        1    75      964
##  7 Ike    2008     9     9     0  21.5 -80.3 hurri…        1    70      965
##  8 Ike    2008     9     9     6  22   -81.4 hurri…        1    70      965
##  9 Ike    2008     9     9    12  22.4 -82.4 hurri…        1    70      965
## 10 Ike    2008     9     9    14  22.6 -82.9 hurri…        1    70      965
## # … with 10,000 more rows, and 3 more variables: ts_diameter <dbl>,
## #   hu_diameter <dbl>, kid <int>

This method can also zip the exported file automatically if you add the zip extension to the file name.

Note: kibior uses rio (Chan et al. 2018), which can export many more formats. See the rio documentation and the rio::install_formats() function.

4.11 $import(): Get a file content to a new Elasticsearch index

The $import() method can load a dataset from a file into an in-memory variable, a new Elasticsearch index, or both.

Like $export(), it can also read directly from zipped files.

Note: kibior uses rio (Chan et al. 2018), which can import many more formats. See the rio documentation and the rio::install_formats() function.

The $import() method can natively manage sequence, alignment and feature formats (e.g. fasta, bam, gtf, gff, bed, etc.) since it also wraps Bioconductor methods such as rtracklayer::import() (Lawrence, Gentleman, and Carey 2019), Biostrings::read*StringSet() (Pagès et al. 2020) and Rsamtools::scanBam() (Morgan et al. 2020).

The generic $import() method tries to open the right format according to the file extension. If the format cannot be guessed, you can call the dedicated methods directly: $import_sequences(), $import_alignments(), $import_features(), $import_tabular() and $import_json().

4.12 $move(): Rename an index

The $move() method renames an index. The $copy() method is equivalent to $move(copy = TRUE).

4.13 $copy(): Copy an index

The $copy() method copies an index to another name. It is a wrapper around $move(copy = TRUE).

4.15 $search(): Search everything

Elasticsearch is here… “You Know, for Search”. As a search engine, searching is its main feature.

Using the $search() method, you can search part or all of the data indexed by Elasticsearch. If the query parameter holds no restriction, all data will be searched: every index, every column, every keyword.

By default, $search() has head mode active, which will return a small subset (default is 5) of the actual complete result to allow quick inspection of data. With $verbose <- TRUE, it will be printed in the result as “Head mode: on”. To change the head size, modify the $head_search_size attribute.

To get the full result, use $search(head = FALSE) or, more simply, $pull().

See Querying in Advanced usage section for more information.
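
The head mode behavior described above can be sketched as follows. The helper assumes that kc is an already connected kibior instance; inspect_then_fetch() and its preview_size argument are hypothetical names, while head, query and $head_search_size are the ones described in this section.

```r
# Hypothetical helper: `kc` is assumed to be a connected kibior instance.
inspect_then_fetch <- function(kc, index_name, query, preview_size = 10) {
  # enlarge the preview returned while head mode is active (default is 5);
  # on a real instance this setting stays changed after the call
  kc$head_search_size <- preview_size
  # head mode is on by default: only a small subset comes back
  preview <- kc$search(index_name, query = query)
  # head = FALSE returns the complete result
  full <- kc$search(index_name, query = query, head = FALSE)
  list(preview = preview, full = full)
}
```

Looking at the preview first is cheap. Fetching the full result only once the query looks right avoids transferring large results by mistake.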

4.16 $stats(): base statistics of columns

Alongside data handling methods, kibior provides descriptive statistics methods. You already know $count(); here are some others.

The $stats() method is a shortcut to ask for: count, min, max, avg, sum, sum_of_squares, variance, std_deviation, std_deviation_upper (bound), std_deviation_lower (bound).

Some important warnings here:

  1. Counts are approximate
  2. Standard Deviation and Bounds require normality

In addition to $count() and $stats(), many other methods exist to perform descriptive analysis: avg, mean, min, max, sum, q1, q2, median, q3 and summary.
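
To make the values returned by $stats() concrete, here is a sketch in plain R (no Elasticsearch involved) of what each of them means for one numeric column. It assumes the usual Elasticsearch conventions for this aggregation: a population variance (dividing by n) and bounds placed two standard deviations around the mean. local_stats() is a hypothetical helper, not part of kibior.

```r
# Hypothetical helper reproducing the listed statistics locally
local_stats <- function(x, sigma = 2) {
  x <- x[!is.na(x)]
  n <- length(x)
  avg <- sum(x) / n
  variance <- sum((x - avg)^2) / n   # population variance, unlike stats::var()
  std_deviation <- sqrt(variance)
  c(count = n, min = min(x), max = max(x), avg = avg, sum = sum(x),
    sum_of_squares = sum(x^2), variance = variance,
    std_deviation = std_deviation,
    std_deviation_upper = avg + sigma * std_deviation,
    std_deviation_lower = avg - sigma * std_deviation)
}

s <- local_stats(datasets::iris$Sepal.Length)
s[c("count", "avg", "std_deviation")]
```

The upper and lower bounds only delimit a meaningful share of the data (about 95% with two standard deviations) when the column is roughly normally distributed, which is the reason for the second warning above.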

4.17 $describe_index() and $describe_columns(): get the description of index and columns

You can ask for description of datasets with these methods.

Important: this feature requires the user who pushed the data to manually add the metadata with $add_description().





5 Advanced usage

5.2 Attributes access

kibior instances are objects: their attributes can be read and, for some of them, updated.

Attribute name                | Read-only | Default        | Description
$host                         |           | "localhost"    | the Elasticsearch host
$port                         |           | 9200           | the Elasticsearch port
$user                         | x         | NULL           | the Elasticsearch user
$pwd                          | x         | NULL           | the Elasticsearch password
$connection                   | x         | NULL           | the Elasticsearch connection object
$head_search_size             |           | 5              | the head size default value
$cluster_name                 | x         | When connected | the cluster name if and only if already connected
$cluster_status               | x         | When connected | the cluster status if and only if already connected
$nb_documents                 | x         | When connected | the current cluster total number of documents if already connected
$version                      | x         | When connected | the Elasticsearch version if and only if already connected
$elastic_wait                 |           | 2              | the Elasticsearch wait time for update commands if already connected (in seconds)
$valid_joins                  | x         | A vector       | the valid joins available in kibior
$valid_count_types            | x         | A vector       | the valid count types available (mainly observations = rows, variables = columns)
$valid_elastic_metadata_types | x         | A vector       | the valid Elasticsearch metadata types available
$valid_push_modes             | x         | A vector       | the valid push modes available
$shard_number                 |           | 1              | the number of allocated primary shards when creating an Elasticsearch index
$shard_replicas_number        |           | 1              | the number of allocated replicas in an Elasticsearch index
$default_id_col               |           | "kid"          | the ID column name used when sending data to Elasticsearch if not provided by user
$verbose                      |           | FALSE          | the verbose mode
$quiet_progress               |           | FALSE          | the progress bar printing mode
$quiet_results                |           | FALSE          | the method results printing mode

Some attributes cannot be modified.

5.3 Organizing data for searches

Working alone directly on a massive cluster of servers is an unlikely situation. Moreover, handling large datasets on your own computer or storing all data in your local Elasticsearch repository is generally a bad idea. We tend to handle only what we can afford to, and to organize pipelines and software accordingly.

There are multiple strategies to organize data, and our main objective here is to use servers for what they have been built for: doing the CPU- and memory-greedy jobs. In comparison, our personal computers or laptops will not carry heavy processing loads. Adding kibior to this equation helps further, as it is backed by a database and search engine.

As a rule of thumb, subsetting and querying is a good strategy, e.g. splitting on categorical variables.
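
A minimal sketch of that splitting step, in plain R: each level of a categorical column becomes one subset, named like the index it would be pushed to. It uses datasets::iris instead of the storms dataset so that it runs without extra packages. split_for_push() is a hypothetical helper, and the commented push loop assumes a connected kibior instance kc.

```r
# Split a dataset on a categorical column and name each subset like an
# Elasticsearch index (index names must be lowercase).
split_for_push <- function(data, column, prefix) {
  subsets <- split(data, data[[column]], drop = TRUE)
  names(subsets) <- paste0(prefix, "_", tolower(names(subsets)))
  subsets
}

subsets <- split_for_push(datasets::iris, "Species", prefix = "iris")
names(subsets)

# With a connected kibior instance `kc` (assumed), each subset becomes an index:
# for (index_name in names(subsets)) kc$push(subsets[[index_name]], index_name)
```

Sharing one prefix across the index names is what makes a wildcard pattern on index names usable afterwards. Note that this sketch only lowercases the names; other characters that are invalid in index names are not handled.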

What we can then do is search all indices whose names start with the prefix “storms_”:

#> Within them, we search some minimum winds and pressure
#> results come already filtered by storm names
kc$search("storms_*", 
        query = "wind:>25 && pressure:>30", 
        columns = c("name", "year", "month", "lat", "long", "status"), 
        head = FALSE)

## $storms_al011993
## # A tibble: 4 x 6
##   month  year name       lat  long status             
##   <int> <int> <chr>    <dbl> <dbl> <chr>              
## 1     6  1993 AL011993  25.4 -77.5 tropical depression
## 2     6  1993 AL011993  26.1 -75.8 tropical depression
## 3     6  1993 AL011993  26.7 -74   tropical depression
## 4     6  1993 AL011993  27.8 -71.8 tropical depression
## 
## $storms_al021992
## # A tibble: 4 x 6
##   month  year name       lat  long status             
##   <int> <int> <chr>    <dbl> <dbl> <chr>              
## 1     6  1992 AL021992  25.7 -85.5 tropical depression
## 2     6  1992 AL021992  27   -84.5 tropical depression
## 3     6  1992 AL021992  27.6 -84   tropical depression
## 4     6  1992 AL021992  28.5 -82.9 tropical depression
## 
## $storms_al022000
## # A tibble: 10 x 6
##    month  year name       lat  long status             
##    <int> <int> <chr>    <dbl> <dbl> <chr>              
##  1     6  2000 AL022000   9.6 -21   tropical depression
##  2     6  2000 AL022000   9.9 -22.6 tropical depression
##  3     6  2000 AL022000  10.2 -24.5 tropical depression
##  4     6  2000 AL022000  10.1 -26.2 tropical depression
##  5     6  2000 AL022000   9.9 -27.8 tropical depression
##  6     6  2000 AL022000   9.9 -29.3 tropical depression
##  7     6  2000 AL022000  10.1 -30.1 tropical depression
##  8     6  2000 AL022000  10.1 -32.6 tropical depression
##  9     6  2000 AL022000  10   -34.2 tropical depression
## 10     6  2000 AL022000   9.8 -36.2 tropical depression
## 
## $storms_al021994
## # A tibble: 2 x 6
##   month  year name       lat  long status             
##   <int> <int> <chr>    <dbl> <dbl> <chr>              
## 1     7  1994 AL021994  33   -79.1 tropical depression
## 2     7  1994 AL021994  33.2 -79.2 tropical depression
## 
## $storms_al021999
## # A tibble: 3 x 6
##   month  year name       lat  long status             
##   <int> <int> <chr>    <dbl> <dbl> <chr>              
## 1     7  1999 AL021999  20.2 -95   tropical depression
## 2     7  1999 AL021999  20.6 -96.3 tropical depression
## 3     7  1999 AL021999  20.5 -97   tropical depression
## 
## $storms_al012000
## list()

As shown before, we did not push all data but only some subsets of interest. By selecting and pushing what we need, datasets can be searched and shared immediately after.

If you work in sync with multiple remote collaborators on the same Elasticsearch cluster, that can be a great strategy. For instance, one of your collaborators can add a new dataset that will not change the request, but will enrich the result.

We can apply the same request and find new results.

5.4 Querying

One of the main features of kibior is searching inside vast amounts of data thanks to Elasticsearch. You can use the search feature with the eponymous $search() method, but also with $pull(), through the query parameter.

5.4.1 Querying notation

To query specific data, the query parameter of methods such as $count() or $search() requires one string following the Elasticsearch Query String Syntax.

To sum them up, you can search for:

  • terms,
  • or phrases, with double-quotes.

To complement, you can apply multiple operators:

  • boolean operators:
    • AND (or “&&”, double-ampersand),
    • OR (or “||”, double-pipe),
    • NOT (or “!”, exclamation point),
    • + (plus) the term MUST be present,
    • - (minus) the term MUST NOT be present.
  • grouping: organize boolean operators, ex: “(quick OR brown) AND fox”.

  • field selecting: target a specific column, ex: eye_color:orange.
    • Phrases can be searched.
    • Boolean operators can be used.
  • range notation: using [min TO max] for inclusive or {min TO max} for exclusive.
    • Can be used as a simple search expression when one side is unbounded:
      • n:>=10 is equivalent to n:[10 TO *].
      • n:<=10 is equivalent to n:[* TO 10].
      • n:>10 is equivalent to n:{10 TO *}.
      • n:<10 is equivalent to n:{* TO 10}.
    • Inclusive threshold, ex: n:[1 TO 5].
    • Exclusive threshold, ex: n:{1 TO 5}.
    • Mixing inclusive and exclusive, ex: n:[1 TO 5}.
  • fuzziness and proximity: using “~” at the end of a term to use approximate search.
    • Default fuzzy factor is 2, meaning “quikc~” and “quikc~2” are identical.
    • It can be applied to phrases, ex: “"fox quick"~5”.
  • boosting: using “^” to weight some expressions over others.
    • Value:
      • 0 to 1: decrease boosting.
      • Greater than 1: increase boosting.
    • Boost type:
      • terms, ex: quick^2 fox, quick is boosted.
      • phrases, ex: "foo bar"^2.
      • groups, ex: (foo bar)^4.

Now we can easily build a more complex search query:
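
As an illustration, the notation listed above can be assembled from small string helpers. The functions es_range(), es_all() and es_any() below are hypothetical and only build strings; nothing is sent to Elasticsearch. The column names and values are the ones of the storms dataset used in this vignette.

```r
# Hypothetical helpers that only build query strings
es_range <- function(field, min = "*", max = "*",
                     include_min = TRUE, include_max = TRUE) {
  open  <- if (include_min) "[" else "{"
  close <- if (include_max) "]" else "}"
  paste0(field, ":", open, min, " TO ", max, close)
}
# group several expressions with AND / OR
es_all <- function(...) paste0("(", paste(c(...), collapse = " AND "), ")")
es_any <- function(...) paste0("(", paste(c(...), collapse = " OR "), ")")

q <- es_all(
  es_any("status:hurricane", "status:\"tropical storm\""),  # term OR phrase
  es_range("wind", min = 50),                               # same as wind:>=50
  es_range("year", 1990, 2000, include_max = FALSE)         # 1990 <= year < 2000
)
q
```

The resulting string can then be given to the query parameter of $search(), $pull() or $count() on a connected instance, for example against a "storms" index.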

5.4.2 $search() behavior

Though Elasticsearch is very powerful as a document-oriented database, it is above all a full-text search engine.

With wildcard and targeting a single index:

Column selection:

As you can see on the last request, some columns did not match, thus were not returned.

Now a more complex search, directly done by pulling data:

This was executed on a small dataset of 54k observations and 10 variables. We will see it on a bigger one in the biological example vignette.

5.4.3 text and keyword querying

Lastly, we need to see the difference between a keyword and a text field.

Elasticsearch can index text values as two different types: text and keyword. The difference between those two is that:

  • text columns such as “name” or “skin_color” are broken up into words during indexing, allowing searches on one or more words,
  • keyword columns (always added when pushing data with kibior) keep the full text as one string.

kibior indexes all text values as text AND keyword, so we can use whole-text search (with .keyword tag) AND word-specific (without .keyword tag).

Doing a search for a word starting with a specific prefix in pure R is a bit more annoying:
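
For comparison, a pure R version of such a prefix search could look like the sketch below. It uses a small made-up vector standing in for a name column, and has_word_prefix() is a hypothetical helper. Word boundaries and letter case have to be handled explicitly in the regular expression, whereas the indexing of text fields takes care of both in Elasticsearch.

```r
# Made-up values standing in for a "name" column
name <- c("Luke Skywalker", "Darth Vader", "Anakin Skywalker",
          "Leia Organa", "Shmi Skywalker")

# Keep entries containing a word that starts with the given prefix.
# `prefix` is assumed to contain no regular expression metacharacters.
has_word_prefix <- function(x, prefix) {
  # \\b anchors the prefix at the start of a word; ignore.case mimics
  # the lowercasing applied to text fields at indexing time
  grepl(paste0("\\b", prefix), x, ignore.case = TRUE, perl = TRUE)
}

name[has_word_prefix(name, "sky")]
```

A prefix that only appears inside a word, such as "walker" here, matches nothing, which is also how a prefix search on words behaves.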

5.4.4 Reserved Elasticsearch characters

Elasticsearch has some reserved characters: + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /

You should remove them from your data before pushing it to Elasticsearch. If that is not possible, or if you retrieve data from someone else that contains reserved characters, you should try to query with a keyword field.
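
One possible way to remove these characters before pushing is sketched below in plain R. strip_reserved() is a hypothetical helper, not part of kibior, and the small data frame is made up. The sketch replaces every reserved character with a space in all character columns, which is a blunt choice: hyphens inside identifiers disappear too, so adapt the character class to your data.

```r
# Hypothetical helper: replace reserved characters by a space
strip_reserved <- function(x) {
  # single & and | are replaced too, which also covers && and ||
  cleaned <- gsub("[+\\-=&|><!(){}\\[\\]^\"~*?:\\\\/]", " ", x, perl = TRUE)
  # collapse the runs of spaces left behind, then trim both ends
  trimws(gsub(" +", " ", cleaned))
}

# Made-up example data
df <- data.frame(id = 1:2,
                 label = c("TP53 (human): p53/tumor", "a+b = c?"),
                 stringsAsFactors = FALSE)

# apply the cleaning to character columns only
is_text <- vapply(df, is.character, logical(1))
df[is_text] <- lapply(df[is_text], strip_reserved)
df$label
```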

5.5 $push() details

5.5.1 Define a unique IDs column

When pushing data with default parameters, kibior will define unique IDs for each record (each line of a table) and add them as metadata. You can retrieve them by using $pull(keep_metadata = TRUE).

Metadata columns are mainly prefixed by an underscore. The actual record is embedded into the _source field. Since data have been pushed without specifying an ID column, the _id field that defines Elasticsearch unique IDs reflects the one automatically added by kibior in the data (kid by default). To change the default ID column added by kibior, change the $default_id_col attribute value.

Letting kibior handle ID attribution guarantees uniqueness, but the resulting IDs might not be the most meaningful or practical for updates.

To change that behavior, you can define your own ID field when calling $push() data by using the id_col parameter.

Caution here: the columns parameter does not apply to metadata.
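
Before choosing your own ID field, it helps to check which columns actually qualify, since id_col requires unique elements. The sketch below does that check in plain R on a made-up data frame. candidate_id_cols() is a hypothetical helper, and the commented calls assume a connected kibior instance kc and a free index name.

```r
# Hypothetical helper: names of the columns usable as unique IDs
candidate_id_cols <- function(data) {
  is_unique <- vapply(
    data,
    function(col) !anyNA(col) && !anyDuplicated(col),  # no NA, no repeated value
    logical(1)
  )
  names(data)[is_unique]
}

# Made-up example data
samples <- data.frame(sample = c("s1", "s2", "s3"),
                      batch  = c("a", "a", "b"),
                      score  = c(0.1, 0.5, 0.1),
                      stringsAsFactors = FALSE)
candidate_id_cols(samples)

# With a connected kibior instance `kc` (assumed):
# kc$push(samples, "samples", id_col = "sample")
# kc$pull("samples", keep_metadata = TRUE)   # _id should now reflect `sample`
```

A meaningful ID column such as a sample or gene identifier makes later updates easier, because the subset to update can be built without pulling the kibior-generated IDs first.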

5.5.2 Push modes

When pushing data, if the index you are using in $push() already exists, an error will be thrown. This is due to the mode = "check" parameter, which checks whether an index with the given name already exists. This is the default option, but it can be changed to "recreate" or "update":

  • "recreate" erases the index and writes to a fresh one with the same name. Be cautious with this option, as you will erase previously written data from that index name.
  • "update" pushes and updates indexed data with corresponding IDs. For this option, you must know which field is the unique ID and send updated documents that carry it. You do not need to send all the data again, only the subset that changed: sending everything again is error-prone and can take a lot of time if your dataset is big.
#> we will change the height of orange-eyed inhabitants of "Naboo"
#> homeworld to 300 and update that subset to the main one.
s <- kc$pull("starwars", query = "eye_color:orange && homeworld:naboo")$starwars
s

## # A tibble: 3 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <int> <chr> <chr>  <chr>     <chr>   <lis> <chr>    <chr>    
## 1 Jar …    196    66 none       orange     orange            52 male  mascu… Naboo     Gungan  <chr… ""       ""       
## 2 Roos…    224    82 none       grey       orange            NA male  mascu… Naboo     Gungan  <chr… ""       ""       
## 3 Rugo…    206    NA none       green      orange            NA male  mascu… Naboo     Gungan  <chr… ""       ""       
## # … with 1 more variable: kid <int>

#> change the height of those selected to 300
s$height <- 300
s

## # A tibble: 3 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <dbl> <int> <chr>      <chr>      <chr>          <int> <chr> <chr>  <chr>     <chr>   <lis> <chr>    <chr>    
## 1 Jar …    300    66 none       orange     orange            52 male  mascu… Naboo     Gungan  <chr… ""       ""       
## 2 Roos…    300    82 none       grey       orange            NA male  mascu… Naboo     Gungan  <chr… ""       ""       
## 3 Rugo…    300    NA none       green      orange            NA male  mascu… Naboo     Gungan  <chr… ""       ""       
## # … with 1 more variable: kid <int>

#> and update the main dataset. Since it is a subset of that dataset, 
#> IDs are the same, which is default "kid" column.
ns <- kc$push(s, "starwars", mode = "update", id_col = "kid")
#> see the result
ns <- kc$pull("starwars", 
              query = "eye_color:orange && homeworld:naboo")$starwars
ns

## # A tibble: 3 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <int> <chr> <chr>  <chr>     <chr>   <lis> <chr>    <chr>    
## 1 Jar …    300    66 none       orange     orange            52 male  mascu… Naboo     Gungan  <chr… ""       ""       
## 2 Roos…    300    82 none       grey       orange            NA male  mascu… Naboo     Gungan  <chr… ""       ""       
## 3 Rugo…    300    NA none       green      orange            NA male  mascu… Naboo     Gungan  <chr… ""       ""       
## # … with 1 more variable: kid <int>

5.6 Comparison with dplyr functions

The dplyr package offers simple and effective functions called filter and select to quickly reduce the scope of interest. In the same fashion, kibior uses Elasticsearch query string syntax, which is very similar to the dplyr syntax (see Querying section). Elasticsearch multiplies the search possibilities by allowing similar usage on multiple indices, or datasets, on multiple remote servers.

Moreover, using $count(), $search() or $pull(), one can use their analogous features:

  • dplyr::select() with columns parameter,
  • and dplyr::filter() with query parameter.

Using both of them results in much more powerful search capabilities and much more readable code.

The following sections give some examples of analogous requests.

5.6.1 Similarities

Select some columns:
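
A minimal sketch of this case. The dplyr part runs on the in-memory dataset; the kibior part assumes the connected kc instance and the "starwars" index used earlier in this vignette, and is guarded so it is skipped otherwise.

```r
library(dplyr)

# dplyr: keep only two columns of the in-memory dataset
local_res <- dplyr::starwars %>% dplyr::select(name, height)

# kibior: same selection through the columns parameter
# (assumption: `kc` is a connected instance holding a "starwars" index)
if (exists("kc")) {
  remote_res <- kc$pull("starwars", columns = c("name", "height"))$starwars
}
```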

Filter on strict thresholds:
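
One possible pairing, assuming the kc instance and "starwars" index from the previous sections. The query string uses the Elasticsearch range shorthand field:>value.

```r
library(dplyr)

# dplyr: heights strictly greater than 180 (NA heights are dropped)
local_res <- dplyr::starwars %>% dplyr::filter(height > 180)

# Elasticsearch query string counterpart
q <- "height:>180"

# assumption: `kc` is a connected kibior instance with a "starwars" index
if (exists("kc")) {
  remote_res <- kc$pull("starwars", query = q)$starwars
}
```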

Filter on soft thresholds:
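
The inclusive variant only differs by the operator. As before, kc and the "starwars" index are taken from earlier sections and the remote call is guarded.

```r
library(dplyr)

# dplyr: heights greater than or equal to 180
local_res <- dplyr::starwars %>% dplyr::filter(height >= 180)

# Elasticsearch query string counterpart
q <- "height:>=180"

# assumption: `kc` is a connected kibior instance with a "starwars" index
if (exists("kc")) {
  remote_res <- kc$pull("starwars", query = q)$starwars
}
```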

Filter on ranges:
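
A sketch of a closed range. In Elasticsearch query string syntax, square brackets make both bounds inclusive, which matches dplyr::between(). The bounds 180 and 200 are arbitrary illustration values.

```r
library(dplyr)

# dplyr: 180 <= height <= 200
local_res <- dplyr::starwars %>%
  dplyr::filter(dplyr::between(height, 180, 200))

# Elasticsearch query string counterpart (inclusive bounds)
q <- "height:[180 TO 200]"

# assumption: `kc` is a connected kibior instance with a "starwars" index
if (exists("kc")) {
  remote_res <- kc$pull("starwars", query = q)$starwars
}
```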

Filter on exact string match for one field:
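
An illustrative pairing on the homeworld column. The lowercase term in the query string mirrors the query used in the push example of this vignette; kc and the "starwars" index are assumed to exist.

```r
library(dplyr)

# dplyr: exact, case-sensitive comparison
local_res <- dplyr::starwars %>% dplyr::filter(homeworld == "Naboo")

# Elasticsearch query string counterpart
q <- "homeworld:naboo"

# assumption: `kc` is a connected kibior instance with a "starwars" index
if (exists("kc")) {
  remote_res <- kc$pull("starwars", query = q)$starwars
}
```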

Filter on exact string match with multiple choices on one field:
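
With several accepted values, %in% on the dplyr side corresponds to a parenthesized OR group in the query string. The two homeworlds are arbitrary choices, and the remote call again assumes kc and the "starwars" index.

```r
library(dplyr)

# dplyr: any of the listed values
local_res <- dplyr::starwars %>%
  dplyr::filter(homeworld %in% c("Naboo", "Tatooine"))

# Elasticsearch query string counterpart: grouped alternatives on one field
q <- "homeworld:(naboo || tatooine)"

# assumption: `kc` is a connected kibior instance with a "starwars" index
if (exists("kc")) {
  remote_res <- kc$pull("starwars", query = q)$starwars
}
```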

Filter on partial string matching:
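
A rough equivalent using wildcards. Base grepl() stands in for the pattern match on the R side; the two approaches are analogous rather than strictly identical, since Elasticsearch matches on indexed terms. kc and the "starwars" index are assumed.

```r
library(dplyr)

# dplyr: names containing "sky", ignoring case
local_res <- dplyr::starwars %>%
  dplyr::filter(grepl("sky", name, ignore.case = TRUE))

# Elasticsearch query string counterpart: wildcards around the fragment
q <- "name:*sky*"

# assumption: `kc` is a connected kibior instance with a "starwars" index
if (exists("kc")) {
  remote_res <- kc$pull("starwars", query = q)$starwars
}
```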

Filter over a composition of multiple filters (multiple columns):
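
This sketch reuses the query from the push example (orange-eyed inhabitants of Naboo), combining two conditions with && in the query string and & in dplyr. The remote call is guarded and assumes kc and the "starwars" index.

```r
library(dplyr)

# dplyr: both conditions must hold
local_res <- dplyr::starwars %>%
  dplyr::filter(eye_color == "orange" & homeworld == "Naboo")

# Elasticsearch query string counterpart
q <- "eye_color:orange && homeworld:naboo"

# assumption: `kc` is a connected kibior instance with a "starwars" index
if (exists("kc")) {
  remote_res <- kc$pull("starwars", query = q)$starwars
}
```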

5.6.2 Differences

Even though the two syntaxes have much in common, Elasticsearch is a powerful search engine, so requests over billions of records are less expensive with it. Elasticsearch is also accessible through its API, so many people can access it at the same time. This means you can work synchronously with a collaborator who pushes data that you use immediately afterwards. Moreover, using wildcards, we can search multiple indices at once.

What Elasticsearch makes very easy is searching everywhere: in every index, in every column, and in every word. Lastly, full-text search is a major strength. See Text and Keyword querying for more details.

5.7 Change tibble column type

kibior will return base types in tibble structures (integer, character, logical, and list) for representing data. If you want to change some columns, use readr::type_convert() after retrieving the dataset.
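
A small local sketch of that conversion step. The tibble below is a hypothetical stand-in for a pulled dataset whose columns came back as character.

```r
library(readr)
library(tibble)

# Hypothetical stand-in for a pulled dataset with character columns
pulled <- tibble::tibble(
  name   = c("Luke", "Leia"),
  height = c("172", "150"),
  alive  = c("TRUE", "FALSE")
)

# Re-guess the type of each character column from its content
converted <- suppressMessages(readr::type_convert(pulled))

str(converted)
```

Note that readr::type_convert() only re-converts character columns; columns that already have another type are left as they are.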

5.8 Compare two instances

If you manage multiple instances, you can easily compare their host:port pairs with the == and != operators.

5.9 Attach one instance to global environment

If you use only one instance of kibior, you might want to attach it to the global environment. This removes the need to prefix each method call with the instance (in our examples: kc$...).

Though it can be practical in local development with a single instance, we strongly discourage that practice if you intend to share your code. It can induce wrong behaviors during execution in environments with different configurations or multiple instances.

5.10 Joins

kibior integrates the dplyr package joins: full, left, right, inner, anti, and semi joins.

By using kibior joins, you can apply these joins to both in-memory datasets and Elasticsearch-based indices. kibior supports the query parameter when joining to speed up data retrieval, but cannot join on list columns.

In join results, kibior uses the suffixes left and right on data columns.
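
To illustrate the suffixing idea with a purely local dplyr join: when a non-key column exists on both sides, each copy gets a suffix. The exact suffix strings "_left" and "_right" are assumptions chosen for this sketch, not necessarily the ones kibior emits.

```r
library(dplyr)

# Two in-memory tables that share the non-key column `height`
left  <- dplyr::starwars %>% dplyr::select(name, height, homeworld)
right <- dplyr::starwars %>% dplyr::select(name, height, species)

# `height` is present on both sides, so it is suffixed in the result
joined <- dplyr::inner_join(left, right, by = "name",
                            suffix = c("_left", "_right"))

names(joined)
```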

5.11 Moving and copying data from another instance

Apart from moving and copying indices within the same cluster of Elasticsearch instances, the $move() and $copy() methods can do the same with remote instances. The remote Elasticsearch endpoint has to be declared inside your elasticsearch.yml configuration file.

Adding one line to the elasticsearch.yml configuration file to whitelist a server lets Elasticsearch servers talk to each other. They can then transfer data between them in a much faster and more secure way.

A full description can be found in the Elasticsearch documentation.

After that, kibior will be able to use the from_instance parameter of $move() and $copy().

This method allows massive data copying in a much faster way since all data are structured the same.





6 Known limits

As with any implementation, there are some limits:

  • Elasticsearch cannot store uppercase field names, thus all column names are forced to lowercase when submitted by default.

  • Elasticsearch interprets dots in strings as nested values (e.g. “aaa.bbb” is understood as field “aaa” containing a field “bbb”), which is error prone with the R language since variables can be named with dots. To avoid errors when pushing data to Elasticsearch, dots in column names are replaced by underscores.

  • Elasticsearch has a configurable default limit of 1000 columns, so pushing a dataset with more than 1000 variables will generate an error. Two solutions: try to transpose the dataset, or define a higher limit in the Elasticsearch configuration.

  • Elasticsearch handles each document (each line of a table) with a unique ID: a specific "_id" metadata field. What can be confusing here is that metadata are not on the same level as data in Elasticsearch. To make updates easier by accurately targeting document IDs, kibior adds a new unique field (kid by default) when pushing data to Elasticsearch and defines it as the unique "_id" field. If you know that one of your columns is unique and can be used as an ID column, you can use the id_col parameter of the $push() method to define this column as the main ID.

  • The columns parameter does not handle metadata columns.

  • Elasticsearch is really good at textual and keyword search, but for that the text must contain common delimiters so it can be split into words. Passing a single, billions-long, uninterrupted biomolecular sequence is not a good fit for Elasticsearch and may result in an indexing failure.

  • $move() and $copy() for remote instances are very sensitive to authentication and security configurations. Some tasks will not be possible because of each organization's security measures. Check with your system administrator.

  • Joins are not executed server-side (on ES), which means the Elasticsearch data must be downloaded before executing the actual join. Querying and selecting columns with the join parameters left_columns, right_columns, left_query and right_query is therefore important to lower the data transfer payload and speed up execution.

  • Elasticsearch limits returned results to 10,000 elements per bulk. If you try to set bulk_size > 10000, kibior will downsize it to match the maximum allowed.

  • The query parameter is a powerful string-based mechanism, but each call sends the whole query to an Elasticsearch instance in a single request. If the query is generated from a list of elements, such as c("id1", "id2", "id3", ...) %>% paste0(collapse = " || ") %>% kc$search("*", query = .), the resulting string can be too long to be passed to Elasticsearch properly. One way to work around this issue is to split the element vector into subsets and make multiple calls. This will be fully automated in future versions.

  • kibior applies some modifications to datasets before sending them to Elasticsearch: it turns all dataset names to lowercase, replaces dots in dataset names with underscores, adds the kid column, etc. All these transformations can affect the behavior of $*_join() methods.

  • The $keys() method limits by default the number of unique keys found to 1000, since it aggregates a potentially unlimited number of keys, which can happen when calling it on integer or floating-point values. If you want more, change the max_size method parameter.
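
The workaround mentioned above for very long generated queries (splitting the element vector and making multiple calls) could look like the following sketch. The ID values and the chunk size of 1000 are arbitrary assumptions, and the kibior call is guarded so it only runs when a connected kc instance exists.

```r
# Hypothetical list of IDs to look up
ids <- paste0("id", 1:2500)

# Split into chunks of at most `chunk_size` elements (assumed value)
chunk_size <- 1000
chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))

# Build one query string per chunk
queries <- vapply(chunks, paste0, character(1), collapse = " || ")

# assumption: `kc` is a connected kibior instance
if (exists("kc")) {
  results <- lapply(queries, function(q) kc$search("*", query = q))
}
```

Each element of queries is then short enough to be sent on its own, at the cost of one request per chunk.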

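To anticipate the column name changes listed above (lowercasing and dots replaced by underscores), names can be previewed locally before pushing. The helper normalize_names() is hypothetical and only approximates those two rules; it is not part of kibior.

```r
# Data frame whose column names would be altered on push
df <- data.frame(Sample.ID = 1:2, Gene.Name = c("BRCA1", "TP53"),
                 check.names = FALSE)

# Hypothetical helper: dots become underscores, then everything is lowercased
normalize_names <- function(x) tolower(gsub(".", "_", x, fixed = TRUE))

names(df) <- normalize_names(names(df))
names(df)
```

Renaming columns this way before a join keeps in-memory and Elasticsearch-side column names aligned.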




7 Tested with

kibior has been tested with these configurations:

Software       Version
Elasticsearch  6.8, 7.5, 7.8, 7.9, 7.10
R              3.6.1, 4.0.2, 4.0.3
RStudio        1.2.5001 (build 93, 7b3fe265); 1.4.1103 (build "Wax Begonia", 458706c3)

This vignette has been built using the following session:

Session info

```r
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kibior_0.1.1   magrittr_2.0.1 readr_1.4.0    stringr_1.4.0  dplyr_1.0.3   
## [6] ggplot2_3.3.3  knitr_1.30    
## 
## loaded via a namespace (and not attached):
##  [1] zip_2.1.1         Rcpp_1.0.6        cellranger_1.1.0  pillar_1.4.7     
##  [5] compiler_4.0.3    forcats_0.5.0     elastic_1.1.0     tools_4.0.3      
##  [9] digest_0.6.27     jsonlite_1.7.2    evaluate_0.14     lifecycle_0.2.0  
## [13] tibble_3.0.5      gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.10     
## [17] openxlsx_4.2.3    crul_1.0.0        curl_4.3          yaml_2.2.1       
## [21] haven_2.3.1       xfun_0.20         rio_0.5.16        withr_2.4.1      
## [25] generics_0.1.0    vctrs_0.3.6       hms_1.0.0         grid_4.0.3       
## [29] tidyselect_1.1.0  glue_1.4.2        httpcode_0.3.0    data.table_1.13.6
## [33] R6_2.5.0          readxl_1.3.1      foreign_0.8-80    rmarkdown_2.6    
## [37] tidyr_1.1.2       purrr_0.3.4       scales_1.1.1      ellipsis_0.3.1   
## [41] htmltools_0.5.1.1 colorspace_2.0-0  stringi_1.5.3     munsell_0.5.0    
## [45] crayon_1.3.4
```






References

Chamberlain, Scott. 2020. “Elastic: General Purpose Interface to ‘Elasticsearch’.” Bioinformatics. https://CRAN.R-project.org/package=elastic.

Chan, Chung-hong, Geoffrey CH Chan, Thomas J. Leeper, and Jason Becker. 2018. “Rio: A Swiss-Army Knife for Data File I/O.” https://CRAN.R-project.org/package=rio.

Lawrence, Michael, Robert Gentleman, and Vincent Carey. 2019. “Rtracklayer: An R Package for Interfacing with Genome Browsers.” Bioinformatics 25: 1841–2. https://doi.org/10.1093/bioinformatics/btp328.

Morgan, Martin, Hervé Pagès, Valerie Obenchain, and Nathaniel Hayden. 2020. “Rsamtools: Binary Alignment (BAM), FASTA, Variant Call (BCF), and Tabix File Import.” https://doi.org/10.18129/B9.bioc.Rsamtools.

Pagès, H., P. Aboyoun, R. Gentleman, and S. DebRoy. 2020. “Biostrings: Efficient Manipulation of Biological Strings.” https://doi.org/10.18129/B9.bioc.Biostrings.