[Bioc-devel] Question relating to extending a class and inclusion of data

Tue May 21 10:58:24 CEST 2024

Hi!

Excuse the long email, but there are a number of things I would like to clarify before submitting the notame package, which I have been developing to meet Bioconductor guidelines. It currently passes almost all of the automatic checks; the exceptions are formatting and some functions that are over 50 lines long.

Background 1:
The notame package already has a significant following, and was published in 2020 with an associated protocol article published in the "Metabolomics Data Processing and Data Analysis—Current Best Practices" special issue of the Metabolites journal (https://www.mdpi.com/2218-1989/10/4/135). The original package relies on the MetaboSet container class, which extends ExpressionSet with three slots, namely group_col, time_col and subject_col. These slots are used to store the names of the corresponding sample data columns, and are used as default arguments to most functions. This makes for a more streamlined experience. However, the submission guidelines state that existing classes should be preferred, such as SummarizedExperiment. We will be implementing support for SummarizedExperiment over the summer. We have included a MetaboSet - SummarizedExperiment converter for interoperability. 

Q1: Can an initial Bioconductor submission rely on the MetaboSet container class? In any case, support for MetaboSet should be retained for existing users until it is phased out.

Q2: Is it OK to extend the SummarizedExperiment class with the three aforementioned slots? It could be called MetaboExperiment. Or should the functions instead take a plain SummarizedExperiment, with these columns specified explicitly as arguments?

Background 2:
The notame package caters to data analysis for untargeted LC-MS metabolic profiling experiments, encompassing data pretreatment (quality control, normalization, imputation and other steps leading up to feature selection) and feature selection (univariate analysis and supervised learning). Raw data preprocessing is not supported. Instead, the package offers utilities for flexibly reading peak tables from Excel files produced by various point-and-click software such as MS-DIAL. As such, example data in Excel format needs to be included. Such data is not available in any Bioconductor package, although it could be derived from existing data in Bioconductor. However, the existing untargeted LC-MS data in Bioconductor cannot be used as is to demonstrate the full functionality of the notame package: the feature data needs to cover several analytical modes, and the sample data needs to include study group, time point, subject ID and several batches. Blank samples would be good as well. Packages I have checked for data with the above specifications include FaahKO, MetaMSdata, msdata, msqc1, mtbls2, pmp, PtH2O2lipids, and ropls. As of now, the example data is not realistic in that it is scrambled, and I have not yet been informed of the origin and modification of the data.

Q3: If I get access to information about the origin and modification of the data currently used, can I modify it further to satisfy the needs of the package for an initial Bioconductor release? Or does it need to be realistic? Consider this the explicit pre-approval inquiry for including data in the notame package.

Q4: Do you think a separate ExperimentData package satisfying the specifications laid out in Background 2 is warranted? This could be included in a future version with SummarizedExperiment/MetaboExperiment support.

Q5: The instructions state that the data needs to be documented (https://contributions.bioconductor.org/docs.html#doc-inst-script). Is the availability of the original data strictly necessary? I notice many packages don't include documentation on how the data was procured.

Thanks,
Vilhelm Suksi
Turku Data Science Group
vksuks using utu.fi