Dealing with the GESLA dataset in R

Fernando Mayer, Niamh Cahill

The geslaR package was designed to fetch the GESLA dataset into R, as a full dataset or as a subset of the data. There a few ways that this can be achieved:

In the sections below, we will discuss each of these possibilities.

Loading the package

When you first load the geslaR package into an R session, two more packages will automatically be loaded: arrow and dplyr.

library(geslaR)
#> Loading required package: arrow
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

These two packages are strictly necessary for dealing with a huge dataset like GESLA, to alleviate the need to load it in memory all at once. The full GESLA dataset has more than 1.7 billion lines, and is nearly impossible to load it in R (or any other RAM memory language, like e.g. Python) all at once. To avoid this problem, there are two possibilities:

  1. Load (smaller) subsets of the full dataset
  2. Use an “in-memory” analytics framework

In the best case scenario, these two approaches can be used together for maximum efficiency, and the geslaR package was designed to achieve this.

Regardless the way the GESLA data is loaded into R, the best way to deal with it is using the Apache Arrow framework. For a more detailed explanation and examples, please see the vignette("intro-to-arrow").

Downloading the full GESLA dataset

If you want to explore the full GESLA dataset, you can download it with the download_gesla() function as follows

download_gesla()

This will create a directory called gesla_dataset in the current working directory (as defined by getwd()) and download the full dataset locally. This download may take some time (expect around 5 to 10 minutes), as it depends on internet connection. Note that this full dataset will need at least 7GB of (hard drive) storage, so make sure this is feasible. However, once downloaded, you will have access to the full dataset, and you will only need to do this once.

You will notice that the full dataset is composed of 5119 Apache Parquet files, ending in .parquet

## Number of downloaded files
length(list.files("gesla_dataset"))
#> [1] 5119
## Check the first files
head(list.files("gesla_dataset"))
#> [1] "a121-a12-nld-cmems.parquet"      "a2-a2-bel-cmems.parquet"        
#> [3] "aalesund-aal-nor-cmems.parquet"  "aarhus-aar-dnk-cmems.parquet"   
#> [5] "aasiaat-aas-grl-gloss.parquet"   "abashiri-347a-jpn-uhslc.parquet"

These files are the same as those originally distributed in the GESLA dataset, so that each one refers to a site from where the data comes from. To load this full dataset in R, use the arrow::open_dataset() function, specifying the location of the .parquet files. Although there are many files, this function recognises them as a single dataset, because they all have the same structure (or “Schema”).

## Open dataset
da <- open_dataset("gesla_dataset")
## Check the object
da
#> FileSystemDataset with 5119 Parquet files
#> date_time: timestamp[us]
#> year: int64
#> month: int64
#> day: int64
#> hour: int64
#> country: string
#> site_name: string
#> lat: double
#> lon: double
#> sea_level: double
#> qc_flag: int64
#> use_flag: int64
#> file_name: string
#> 
#> See $metadata for additional Schema metadata
## Verify class
class(da)
#> [1] "FileSystemDataset" "Dataset"           "ArrowObject"      
#> [4] "R6"

Since this is an ArrowObject object, it will actually not load the full dataset in memory (as it would if it was a standard R object, such as a tibble or data.frame). Note, however, that some basic information, such as dim() and names() can be retrieved simply with

dim(da)
#> [1] 1172435674         13
names(da)
#>  [1] "date_time" "year"      "month"     "day"       "hour"      "country"  
#>  [7] "site_name" "lat"       "lon"       "sea_level" "qc_flag"   "use_flag" 
#> [13] "file_name"

Any other manipulation of the dataset must be made using dplyr verbs. For example, to count the number of observations by country, one could use

da |>
    count(country) |>
    collect()
#> # A tibble: 113 × 2
#>    country         n
#>    <chr>       <int>
#>  1 BEL       1263467
#>  2 JPN      74580447
#>  3 USA     243838504
#>  4 DNK      36726648
#>  5 SLV        812230
#>  6 GBR      76038844
#>  7 YEM         63362
#>  8 CAN      67019312
#>  9 AUS      72441932
#> 10 EGY        220872
#> # ℹ 103 more rows

Querying subsets directly into R

The query_gesla() function can be used to load subsets of the full GESLA dataset into R. This function will connect to the GESLA dataset hosted on an Amazon AWS S3 bucket, and will automatically filter the required piece of data. Note that the query may vary, as it will depend on the size of the required subset and on internet connection speed.

The selection is made by specifying one or more countries, years and site names. The only mandatory argument is country, where it should be a character vector of three-letter ISO 3166-1 alpha-3 codes. If the argument year is empty, then all available years for those countries will be selected. The same occurs for the site_name argument.

For example, to select all the available data for Ireland (IRL), simply use

da <- query_gesla(country = "IRL")
#> ℹ This process can take some time, as it depends on the size of the final dataset, and on
#> internet connection.
#> 
ℹ Connecting to the data server...

✔ Connecting to the data server... [3.5s]
#> 
ℹ Filtering data. This can take some time...

✔ Filtering data. This can take some time... [2m 43.4s]
#> 
ℹ Query finished.

✔ Query finished. [13ms]
da
#> Table
#> 16321218 rows x 13 columns
#> $date_time <timestamp[us]>
#> $year <int64>
#> $month <int64>
#> $day <int64>
#> $hour <int64>
#> $country <string>
#> $site_name <string>
#> $lat <double>
#> $lon <double>
#> $sea_level <double>
#> $qc_flag <int64>
#> $use_flag <int64>
#> $file_name <string>
#> 
#> See $metadata for additional Schema metadata
dim(da)
#> [1] 16321218       13
names(da)
#>  [1] "date_time" "year"      "month"     "day"       "hour"      "country"  
#>  [7] "site_name" "lat"       "lon"       "sea_level" "qc_flag"   "use_flag" 
#> [13] "file_name"
class(da)
#> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"

This is also an ArrowObject object, and works the same way as any other Arrow object type, i.e., it can be manipulated with dplyr verbs. By the end of this section we will show some examples.

Many other queries can be made, for example,

## Select one specific year
da <- query_gesla(country = "IRL", year = 2015)

## Multiple years
da <- query_gesla(country = "IRL", year = c(2015, 2017))
da <- query_gesla(country = "IRL", year = 2010:2017)
da <- query_gesla(country = "IRL", year = c(2010, 2012, 2015))

## Multiple countries
da <- query_gesla(country = c("IRL", "ATA"), year = 2015)
da <- query_gesla(country = c("IRL", "ATA"), year = 2010:2017)

As one last example, we may be interested in data only for the site Dublin_Port in Ireland, and for the years 2015 and 2017. This query would be done as follows

da <- query_gesla(country = "IRL", year = c(2015, 2017),
    site_name = "Dublin_Port")
#> ℹ This process can take some time, as it depends on the size of the final dataset, and on
#> internet connection.
#> 
ℹ Connecting to the data server...

✔ Connecting to the data server... [3.4s]
#> 
ℹ Filtering data. This can take some time...

✔ Filtering data. This can take some time... [2m 34s]
#> 
ℹ Query finished.

✔ Query finished. [21ms]

Now we could explore this subset using dplyr verbs as usual

## Verifying number of observations per year
da |>
    count(year) |>
    collect()
#> # A tibble: 2 × 2
#>    year      n
#>   <int>  <int>
#> 1  2015  93790
#> 2  2017 104835

## Calculating summary statistics per year
da |>
    group_by(year) |>
    summarise(
        min = min(sea_level),
        mean = mean(sea_level),
        max = max(sea_level)) |>
    collect()
#> # A tibble: 2 × 4
#>    year   min    mean   max
#>   <int> <dbl>   <dbl> <dbl>
#> 1  2015 -2.62 -0.0104  2.38
#> 2  2017 -2.43  0.0762  2.53

Note the use of the dplyr::collect() function at the end of each call. This is necessary to convert the final result (which is processed by Arrow) to a common R object (usually a tibble or similar).

As stated before, the queries made by query_gesla() may take some time to process. To avoid repeating this query in every new session, you can save this object in a file locally, so it will be available instantly the next time you may need it. To achieve this, you can use the write_gesla() function

write_gesla(da)

In this case, this call will create a file named gesla-data.parquet, and it will be ready for reading at any time, without the need for querying again. At any time, just use

da <- read_gesla("gesla-data.parquet")

to make it available again.

Please, do not try to save this object as a standard R file, such as .rds or .rda, as using saveRDS() or save() for example. In this case the resulting saved object will not be the same as the original object, as an ArrowObject object is not a standard R object (it actually is just a pointer to an Arrow style object).

The query_gesla() function also has two more arguments. The use_flag = 1 means that only the data suggested by the GESLA team as being useful for analysis must be used. Please see the GESLA format page for more information. The other argument, as_data_frame = FALSE specifies if the resulting object should be an ArrowObject (the default) or a data.frame. We highly recommend to keep this default option, as the resulting data.frame can be larger-than-memory, resulting in a memory overflow or even an R session crash. Keeping the object as an ArrowObject will always guarantee that the resulting query will be made available. In any case, an ArrowObject can always be converted to a data.frame, if that is desired, without the need to repeat the query with as_data_frame = TRUE. For example, the object da created above (which is an ArrowObject) may be converted to a data.frame with

class(da)
#> [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"
db <- da |>
    collect()
class(db)
#> [1] "tbl_df"     "tbl"        "data.frame"

Using the GESLA Shiny app

The GESLA Shiny app (geslaR-app) was developed as a user-friendly interface to explore the whole GESLA dataset. The app may be called simply as

run_gesla_app()

This will create a directory called gesla-app on the current working directory, containing the necessary files for the app itself, and a sub-directory called gesla_dataset containing all the GESLA dataset files (as .parquet files, as this is done with the download_gesla() function as explained above). Note that the first time this function is called, it will create these directories and download the full GESLA dataset. However, once this is done, the next time this function is called on this working directory, only the app will open in the default browser (no more downloads will be necessary).

This web-interface allows the user to apply some basic filters, by choosing countries, years and sites. By doing this, the app will filter the data and show information about the selected subset. The first tab contains a preview of the selected subset. The two subsequent tabs provide some plots with the number of observations by year, and summary plots for the sea level data itself. The last tab will show a map with the selected locations, and by clicking on the locations, a pop-up will show information about that specific station.

The user has the option to download the selected subset in CSV or Parquet file formats. We highly recommend to export it as a Parquet file, as the file size will be much smaller, and it can be simply loaded into R with the read_gesla() function, creating an ArrowObject object.