[Bioc-devel] Question relating to extending a class and inclusion of data

Rainer Johannes johannes.rainer using eurac.edu
Wed May 22 08:18:24 CEST 2024


Dear Vilhelm,

notame seems to be an interesting package that fills some gaps in the current untargeted metabolomics workflow. I would strongly suggest supporting the SummarizedExperiment class (in the future). I would also suggest keeping the object as generic as possible, without dedicated additional slots (group_col, time_col, subject_col), since this information is available in the `colData` of the SummarizedExperiment anyway. A generic object would simplify integration with other Bioconductor packages.
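
A minimal sketch of that generic approach, assuming a toy SummarizedExperiment. The column names (group, time, subject), the assay name "abundance" and the helper mean_by_group() are made up for illustration and are not part of notame:

```r
library(SummarizedExperiment)

# Toy data: 3 features x 4 samples; all names here are hypothetical.
abund <- matrix(c(10, 20, 30, 12, 22, 32, 50, 60, 70, 54, 64, 74),
                nrow = 3,
                dimnames = list(paste0("feature", 1:3), paste0("sample", 1:4)))
se <- SummarizedExperiment(
    assays = list(abundance = abund),
    colData = DataFrame(group = c("A", "A", "B", "B"),
                        time = c(1, 2, 1, 2),
                        subject = c("s1", "s1", "s2", "s2"),
                        row.names = colnames(abund)))

# The grouping column is passed explicitly instead of being stored in a slot.
mean_by_group <- function(x, group_col = "group", assay_name = "abundance") {
    stopifnot(group_col %in% colnames(colData(x)))
    groups <- split(seq_len(ncol(x)), colData(x)[[group_col]])
    # One column of per-feature means for each level of the grouping column.
    sapply(groups, function(idx) {
        rowMeans(assay(x, assay_name)[, idx, drop = FALSE])
    })
}

res <- mean_by_group(se, group_col = "group")
```

The trade-off is that every call has to name its columns (or rely on a default argument), but the object stays a plain SummarizedExperiment that any other package can consume.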

We are still working heavily on the xcms package and recently updated it to use more modern classes too (see https://jorainer.github.io/xcmsTutorials/index.html for an up-to-date tutorial of the new xcms preprocessing). The final result can currently be extracted as a SummarizedExperiment. In our workflows we use this object to perform data normalization etc. (adding e.g. the normalized abundance matrix as an additional assay to the SummarizedExperiment). This works extremely well, but using the notame package directly in these workflows, in addition or as an alternative, would be great.
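
As a rough illustration of the "additional assay" idea, assuming a toy SummarizedExperiment: the assay names "raw" and "normalized" are arbitrary, and median scaling is only a stand-in here, not necessarily the normalization used in our workflows:

```r
library(SummarizedExperiment)

# Toy abundance matrix: 3 features x 2 samples (made-up values).
raw <- matrix(c(100, 200, 400, 50, 100, 200), nrow = 3,
              dimnames = list(paste0("feature", 1:3), c("sample1", "sample2")))
se <- SummarizedExperiment(assays = list(raw = raw))

# Median scaling: divide each sample (column) by its median abundance.
sample_medians <- apply(assay(se, "raw"), 2, median)
assay(se, "normalized") <- sweep(assay(se, "raw"), 2, sample_medians, "/")

# The object now carries both the raw and the normalized matrix.
assayNames(se)
```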

Also, we have subsequent workflows for annotation (in the MetaboAnnotation package) that work on both SummarizedExperiment objects and the XcmsExperiment class (which extends the MsExperiment object from the MsExperiment package).

I would be very much interested to discuss this further, maybe in the #metabolomics channel of the Bioconductor Slack - would be great to better integrate the various packages for metabolomics data analysis.

cheers, jo

Johannes Rainer, PhD

Eurac Research
Institute for Biomedicine
Via A.-Volta 21, I-39100 Bolzano, Italy

email: johannes.rainer using eurac.edu
github: jorainer
mastodon: jorainer using fosstodon.org

Hervé Pagès wrote:


Hi,

On 5/21/24 01:58, Vilhelm Suksi wrote:
> Hi!
>
> Excuse the long email, but there are a number of things to be clarified in preparation for submitting the notame package which I have been developing to meet Bioconductor guidelines. As of now it passes almost all of the automatic checks, with the exception of formatting and some functions that are over 50 lines long.
>
> Background 1:
> The notame package already has a significant following, and was published in 2020 with an associated protocol article published in the "Metabolomics Data Processing and Data Analysis—Current Best Practices" special issue of the Metabolites journal (https://www.mdpi.com/2218-1989/10/4/135). The original package relies on the MetaboSet container class, which extends ExpressionSet with three slots, namely group_col, time_col and subject_col. These slots are used to store the names of the corresponding sample data columns, and are used as default arguments to most functions. This makes for a more streamlined experience. However, the submission guidelines state that existing classes should be preferred, such as SummarizedExperiment. We will be implementing support for SummarizedExperiment over the summer. We have included a MetaboSet - SummarizedExperiment converter for interoperability.
>
> Q1: Can an initial Bioconductor submission rely on the MetaboSet container class? Support for MetaboSet should be included anyway for existing users until it is phased out.
Since you already have a user base, you will need a roadmap for the
transition from MetaboSet to MetaboExperiment. Bioconductor has a
6-month release cycle that facilitates this. More on this below.
> Q2: Is it ok to extend the SummarizedExperiment class to utilize the three aforementioned slots? It could be called MetaboExperiment. Or should the functions be modified such that said columns are specified explicitly, using SummarizedExperiment?

It's better to define your own SummarizedExperiment extension with the
three additional slots. This way you will have a container
(MetaboExperiment) that is semantically equivalent (or close) to
MetaboSet. This means that: (1) in principle you won't need to modify
the interface of your existing functions, and (2) you'll be able to
provide coercion methods to go back and forth between the
MetaboExperiment and MetaboSet representations (see ?setAs). Overall
this should make the transition from MetaboSet to MetaboExperiment
easier/smoother.
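
A minimal sketch of such an extension, under a few assumptions: the class name and slot names follow this thread, while the validity rule (each slot is NA or names an existing colData column), the constructor signature and the group_col() getter are illustrative choices, not an existing notame API. Coercion methods (?setAs) are left out.

```r
library(SummarizedExperiment)

# Hypothetical class definition; slot names mirror the MetaboSet slots
# described in the thread.
setClass("MetaboExperiment",
    contains = "SummarizedExperiment",
    slots = c(group_col = "character",
              time_col = "character",
              subject_col = "character"))

# Assumed rule: each slot is NA or names a single existing colData column.
setValidity("MetaboExperiment", function(object) {
    msg <- character()
    for (slot_name in c("group_col", "time_col", "subject_col")) {
        value <- slot(object, slot_name)
        if (length(value) != 1L) {
            msg <- c(msg, paste(slot_name, "must have length 1"))
        } else if (!is.na(value) && !value %in% colnames(colData(object))) {
            msg <- c(msg, paste0(slot_name, " '", value,
                                 "' is not a colData column"))
        }
    }
    if (length(msg)) msg else TRUE
})

# Constructor: build a plain SummarizedExperiment, then add the slots.
MetaboExperiment <- function(..., group_col = NA_character_,
                             time_col = NA_character_,
                             subject_col = NA_character_) {
    se <- SummarizedExperiment(...)
    new("MetaboExperiment", se, group_col = group_col,
        time_col = time_col, subject_col = subject_col)
}

# One getter shown; the other two slots would follow the same pattern.
setGeneric("group_col", function(x) standardGeneric("group_col"))
setMethod("group_col", "MetaboExperiment", function(x) x@group_col)

me <- MetaboExperiment(
    assays = list(abundance = matrix(1:6, nrow = 3)),
    colData = DataFrame(group = c("A", "B")),
    group_col = "group")
group_col(me)
```

Because the object is still a SummarizedExperiment, accessors such as assay() and colData() keep working on it unchanged.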

This transition would roughly look something like this:

1. Submit the MetaboSet-based version of the package for inclusion in
BioC 3.20.

2. After the 3.20 release (next Fall), make the following changes in the
devel branch of the package:

- Implement the MetaboExperiment class + accessors (getters/setters) +
constructor function(s) + show() method.

- Implement the coercion methods to go from MetaboSet to
MetaboExperiment and vice-versa.

- Modify the implementation of all the functions that deal with
MetaboSet objects to deal with MetaboExperiment objects. This will be
the primary representation that they handle. If they receive a
MetaboSet, they will immediately replace it with a MetaboExperiment
using as(..., "MetaboExperiment").

- Modify all the documentation, unit tests, and serialized objects
accordingly.

3. Now you are ready to deprecate the MetaboSet class. I recommend that
you also do this in the devel branch before the 3.21 release. There are
no well-established guidelines for deprecating an S4 class. I recommend
that you use .Deprecated() to display a deprecation message in its
show() method, constructor function(s), getters/setters, and coercion
method from MetaboExperiment to MetaboSet.

4. After the 3.21 release (Spring 2025), make the MetaboSet class
defunct by replacing all the .Deprecated() calls with .Defunct() calls.
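
Steps 3 and 4 could look roughly like this for the show() method. This is only a sketch: OldContainer is a stand-in class defined here so that the example runs without other packages (the real MetaboSet extends ExpressionSet), and the message wording is made up.

```r
library(methods)

# Stand-in for the old container class; hypothetical, for illustration only.
setClass("OldContainer", slots = c(group_col = "character"))

# Step 3 (deprecation cycle): warn, but keep working.
setMethod("show", "OldContainer", function(object) {
    .Deprecated(msg = paste(
        "The OldContainer class is deprecated;",
        "please use the new container instead."))
    cat("OldContainer with group_col:", object@group_col, "\n")
})

old <- new("OldContainer", group_col = "group")
# TRUE if show() signalled a (deprecation) warning.
warned <- tryCatch({ show(old); FALSE }, warning = function(w) TRUE)

# Step 4 (defunct cycle): the same entry point now stops with an error.
setMethod("show", "OldContainer", function(object) {
    .Defunct(msg = paste(
        "The OldContainer class is defunct;",
        "please use the new container instead."))
})

# TRUE if show() stopped with an error.
failed <- tryCatch({ show(old); FALSE }, error = function(e) TRUE)
```

The same pair of calls would go into the constructor function(s), getters/setters and the coercion method.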

> Background 2:
> The notame package caters to the analysis of untargeted LC-MS data from metabolic profiling experiments, encompassing data pretreatment (quality control, normalization, imputation and other steps leading up to feature selection) and feature selection (univariate analysis and supervised learning). Raw data preprocessing is not supported. Instead, the package offers utilities for flexibly reading peak tables from Excel files produced by various point-and-click software such as MS-DIAL. As such, data in Excel format needs to be included, but is not available in any Bioconductor package, although such Excel data could be procured from existing data in Bioconductor. However, existing untargeted LC-MS data in Bioconductor cannot be used, as is, to demonstrate the full functionality of the notame package. The feature data needs to cover several analytical modes. The sample data needs to include study group, time point, subject ID and several batches. Blank samples would be good as well. Packages I have checked for data with the above specifications include FaahKO, MetaMSdata, msdata, msqc1, mtbls2, pmp, PtH2O2lipids, and ropls. As of now, the example data is not realistic in that it is scrambled and I have not yet been informed of the origin and modification of the data.
>
> Q3: If I get access to information about the origin and modification of the now used data, can I further modify it to satisfy the needs of the package for an initial Bioconductor release? Or does it need to be realistic? Consider this the explicit pre-approval inquiry for including data in the notame package.
I'm not sure I fully understand the question (or its connection with
Excel) but yes, you can include unrealistic data in the package, as long
as it allows you to properly illustrate the basic usage of your
functions in the man pages and/or vignette(s). It can also be useful to
have small (and unrealistic) data for the unit tests. The important
thing here is that the data must be small.
> Q4: Do you think a separate ExperimentData package satisfying the specifications laid out in Background 2 is warranted? This could be included in a future version with SummarizedExperiment/MetaboExperiment support.
It depends on the size of the data. For a software package, we limit the
size of the source tarball to 5 MB. So if you're going to exceed that
limit then the datasets need to go in an experiment data package.
>
> Q5: The instructions state that the data needs to be documented (https://contributions.bioconductor.org/docs.html#doc-inst-script). Is the availability of the original data strictly necessary? I notice many packages don't include documentation on how the data was procured.

The availability of the original data is not strictly necessary but the
data still needs to be documented i.e. what's its nature, where it's
coming from, how it was imported/transformed, etc...

Best,

H.

>
> Thanks,
> Vilhelm Suksi
> Turku Data Science Group
> vksuks using utu.fi
>
> _______________________________________________
> Bioc-devel using r-project.org  mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
Hervé Pagès

Bioconductor Core Team
hpages.on.github using gmail.com
