This vignette demonstrates the integration of chem16S with phyloseq to calculate chemical metrics of community reference proteomes.

Load required packages

library(chem16S)
library(phyloseq)
library(ggplot2)
theme_set(theme_bw())

This vignette was compiled on 2025-01-16 with chem16S version 1.2.0 and phyloseq version 1.48.0.

Mouse microbiome data

The mothur MiSeq SOP has an example dataset with 16S rRNA gene sequences for mouse gut communities, which was extracted from the complete dataset reported by Schloss et al. (2012). This example dataset includes data collected at 0–9 (early) and 141-150 (late) days post-weaning from one mouse. This dataset is used in the DADA2 Pipeline Tutorial to demonstrate construction of amplicon sequence variants (ASV), taxonomic classification of the ASVs, and analysis using the phyloseq package. Here, we load the phyloseq-class object that was created on 2023-06-14 using dada2 (version 1.28.0) by following the steps in the DADA2 pipeline tutorial (version 1.16). Note that taxonomy was assigned to genus level using a newer version of the Silva training set (silva_nr99_v138.1_train_set.fa.gz) than that indicated in the DADA2 tutorial on this date (silva_nr_v132_train_set.fa.gz).

data(mouse.silva)
mouse.silva
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 232 taxa and 19 samples ]
## sample_data() Sample Data:       [ 19 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 232 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 232 reference sequences ]

To start to visualize this data, let’s recreate the bar plot from the DADA2 tutorial:

top20.silva <- names(sort(taxa_sums(mouse.silva), decreasing = TRUE))[1:20]
mouse.silva.top20 <- transform_sample_counts(mouse.silva, function(OTU) OTU/sum(OTU))
mouse.silva.top20 <- prune_taxa(top20.silva, mouse.silva.top20)
plot_bar(mouse.silva.top20, x = "Day", fill = "Family") + facet_wrap(~When, scales = "free_x")

Lowest-level taxa

The chem16S package contains the amino acid compositions of archaeal, bacterial, and viral taxa at levels from genus to phylum constructed from the RefSeq database; see the vignette Chemical metrics of reference proteomes for more information. There are denoted as reference proteomes for taxa. Several functions are available to identify the lowest-level taxon for each classified read (or in this case, each ASV), combine the reference proteomes for all taxa to create community reference proteomes, and calculate chemical metrics from them.

First, let’s make a table that lists the lowest-level taxon for each ASV and their abundances. Because chem16S was first developed to analyze the output from the RDP Classifier, this data frame has a format that is similar to the file created by using the classify -h command of the RDP Classifier followed by the merge-count command. That is, the data frame has columns named taxid, rank, lineage, and name, followed by columns for all samples. Because of the long text in the taxid column (see below), let’s look at the other columns first.

tc.silva <- ps_taxacounts(mouse.silva)
head(tc.silva[, -1])
##                                                                                           lineage
## ASV1                               Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Muribaculaceae
## ASV5                   Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides
## ASV8                      Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Rikenellaceae;Alistipes
## ASV11                  Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus
## ASV13              Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Ligilactobacillus
## ASV15 Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Lachnospiraceae NK4A136 group
##                                name   rank F3D0 F3D141 F3D142 F3D143 F3D144
## ASV1                 Muribaculaceae family 3370   2927   1734   1603   2535
## ASV5                    Bacteroides  genus  154    189    180    130    104
## ASV8                      Alistipes  genus  184    321     89     83     41
## ASV11                 Lactobacillus  genus   17    168     42     78    269
## ASV13             Ligilactobacillus  genus   52     12    103     43     16
## ASV15 Lachnospiraceae NK4A136 group  genus  895    245     76     76    116
##       F3D145 F3D146 F3D147 F3D148 F3D149 F3D150 F3D1 F3D2 F3D3 F3D5 F3D6 F3D7
## ASV1    4346   2205   9454   6365   6123   2482 1469 8711 3216 1719 3583 2667
## ASV5     307    179    453    443    417    169  140  338  402  151  476  470
## ASV8     125     71     75    507    515    120  190 1211  381  207  261  213
## ASV11    317    178    448    411    478     61  102   55  171   61   45   16
## ASV13     22      4    147     18     88     64  129  330   94   48  422  117
## ASV15     94    230    460    409    455    218  948 1739  178  564  580  192
##       F3D8 F3D9
## ASV1  1647 2313
## ASV5   582  596
## ASV8   286  438
## ASV11   19   27
## ASV13  145  182
## ASV15  512  772

ASVs with identical lowest-level taxonomic assignments are merged, so this data frame has fewer rows than the number of ASVs. The short names of the ASVs are stored in the taxid column, and the names of multiple ASVs with the same lowest-level taxonomic classification are separated by a semicolon.

head(tc.silva$taxid)
## [1] "ASV1;ASV2;ASV3;ASV4;ASV6;ASV7;ASV9;ASV10;ASV12;ASV14;ASV17;ASV18;ASV20;ASV100;ASV143"                                              
## [2] "ASV5;ASV80;ASV163"                                                                                                                 
## [3] "ASV8"                                                                                                                              
## [4] "ASV11"                                                                                                                             
## [5] "ASV13"                                                                                                                             
## [6] "ASV15;ASV23;ASV24;ASV30;ASV31;ASV33;ASV36;ASV37;ASV38;ASV56;ASV57;ASV69;ASV73;ASV84;ASV91;ASV94;ASV111;ASV129;ASV134;ASV162;ASV180"

As explained in the DADA2 tutorial, the ASVs themselves (that is, the DNA sequences, not their short names) are available in the phyloseq-class object, but they are not used here. Now let’s list the number of unique lowest-level classifications at each taxonomic level:

table(tc.silva$rank)
## 
## family  genus  order 
##      9     52      4

These taxa will be used below to construct community reference proteomes.

Taxonomic mapping

A challenge of using reference proteomes is that they come from one source (GTDB and RefSeq are available in chem16S), but taxonomic assignments may be made using a different training set (such as Silva or RDP). While identical names and ranks can be used to automatically match taxonomic assignments to reference proteomes, inherent differences in available taxonomies pose difficulties. For example, RDP uses the name Escherichia/Shigella, for which the closest correspondence in RefSeq is Escherichia. This and other manual mappings were used by Dick and Tan (2023) for mapping between RDP and RefSeq.

Let’s analyze a phyloseq-class object that was made using the Silva training set but use the manual mapping that was designed to map from RDP to RefSeq.

The manual mapping from RDP to RefSeq was not designed for taxonomic assignments made using Silva. This is just an example to show the limitations of the technique.

map.silva <- map_taxa(tc.silva, refdb = "RefSeq_206")
## [1] "map_taxa: using these manual mapping(s) to NCBI RefSeq:"

family_Ruminococcaceae –> family_Oscillospiraceae (0.1%)

## [1] "map_taxa: can't map groups genus_Lachnospiraceae NK4A136 group (7.3%), order_RF39 (1.35%), 22 others (5%)"
## [1] "map_taxa: mapping rate to RefSeq_206 taxonomy is 86.4%"

The messages show that one group was manually mapped, and the total mapping rate (including both automatic name matching and manual mapping) is 86.4%.

Not let’s analyze a phyloseq-class object that was made using the RDP training set. This results in a considerably higher mapping rate to the NCBI taxonomy.

data(mouse.RDP)
tc.RDP <- ps_taxacounts(mouse.RDP)
map.RDP <- map_taxa(tc.RDP, refdb = "RefSeq_206")
## [1] "map_taxa: using these manual mapping(s) to NCBI RefSeq:"

order_Clostridiales –> order_Eubacteriales (1.4%)

family_Ruminococcaceae –> family_Oscillospiraceae (0.4%)

## [1] "map_taxa: can't map groups genus_Clostridium_XlVa (0.47%), genus_Ihubacter (0.29%), 7 others (0.54%)"
## [1] "map_taxa: mapping rate to RefSeq_206 taxonomy is 98.7%"

Now the total mapping rate is 98.7%.

Manual mapping does not overcome inconsistencies between different taxonomies but is simply a starting point for getting the most out of the available taxonomic classifications and reference proteomes. Using the Genome Taxonomy Database (GTDB) for both taxonomic classification and reference proteomes obviates the need for manual mapping and is described further below.

From taxonomy to amino acid composition

The ps_metrics() function in chem16S wraps the previous two functions (ps_taxacounts() and map_taxa()) as well as another function (get_metrics()). In order, these functions get the lowest-level taxon for each OTU or ASV, map taxonomic names to RefSeq, and combine taxonomic abundances with reference proteomes for taxa to calculate the amino acid composition of community reference proteomes, which are then used to calculate chemical metrics. To peek into this process, let’s begin by listing the amino acid compositions of some community reference proteomes.

AA.RDP <- ps_metrics(mouse.RDP, refdb = "RefSeq_206", quiet = TRUE, return_AA = TRUE)
head(AA.RDP)
##      Run chains       Ala      Cys       Asp       Glu      Phe       Gly
## 1   F3D0   6311 168957.04 28145.99 127964.63 144258.44 90485.66 153197.94
## 2 F3D141   4844 131696.18 21018.59  99639.82 109007.77 70725.38 118533.00
## 3 F3D142   2504  68690.57 10656.83  52038.20  55416.74 36915.27  61378.43
## 4 F3D143   2506  68110.94 10812.55  51760.34  56159.19 36655.56  61374.30
## 5 F3D144   3475  94794.51 14396.10  72002.10  75436.86 50753.16  84495.19
## 6 F3D145   5777 158713.35 24481.12 120504.38 126494.34 85184.00 141730.49
##        His       Ile       Lys       Leu      Met      Asn      Pro      Gln
## 1 38825.56 150203.83 129193.60 190479.38 61593.45 97793.83 81072.95 64847.15
## 2 30605.73 115752.86  98376.59 147717.39 47186.31 76797.75 64185.98 50711.12
## 3 16112.96  59866.06  50578.38  77032.13 24249.03 40450.34 33999.57 26543.86
## 4 15866.29  59980.25  51043.97  76421.37 24496.76 39949.30 33278.73 26250.61
## 5 22360.65  83019.84  69553.84 106001.86 33597.48 55812.08 46764.89 36638.24
## 6 37246.79 138367.70 115624.12 176975.14 56166.29 93440.60 78714.49 60420.63
##         Arg       Ser       Thr       Val      Trp      Tyr
## 1 108700.05 135531.51 118757.78 144695.16 23899.49 88542.69
## 2  86150.54 106069.75  92865.09 112114.17 19198.37 68566.76
## 3  45216.96  55896.96  48672.76  58110.19 10238.79 35484.42
## 4  44330.80  55167.48  48100.75  58075.59  9968.14 35548.78
## 5  61656.32  77077.97  66902.65  80283.65 13955.21 48662.21
## 6 104476.27 129395.65 111906.70 133940.97 23651.22 81864.74

In this data frame, columns named Run and chains have the sample names and abundance-weighted numbers of reference proteomes – that is, the total abundance of ASVs that have taxonomic classifications mapped to the NCBI taxonomy – that are used to compute the amino acid composition. We can use canprot::calc_metrics() to calculate the average protein length of each community reference proteome, then make a histogram:

lengths <- canprot::calc_metrics(AA.RDP, "length")[, 1]
hist(lengths)
imin <- which.min(lengths)
text(lengths[imin], 1.5, AA.RDP$Run[imin], adj = 1)

Compared to the other samples, that from Day 143 has the lowest average protein length of the community reference proteome.

From taxonomy to chemical metrics

We’re now ready to calculcate chemical metrics from the amino acid compositions.

metrics.RDP <- ps_metrics(mouse.RDP, refdb = "RefSeq_206", quiet = TRUE)
head(metrics.RDP)
##                Zc        nO2       nH2O
## F3D0   -0.1567602 -0.7047975 -0.7698409
## F3D141 -0.1549554 -0.7025843 -0.7738190
## F3D142 -0.1536513 -0.7007578 -0.7759440
## F3D143 -0.1549676 -0.7025566 -0.7740865
## F3D144 -0.1543925 -0.7014054 -0.7745432
## F3D145 -0.1532613 -0.7000987 -0.7767563

This data frame has columns for ZC (carbon oxidation state), nO2 (stoichiometric oxidation state), and nH2O (stoichiometric hydration state). ZC is computed from the elemental formula, while nO2 and nH2O are coefficients on O2 and H2O in theoretical formation reactions of the proteins, per residue, from a particular set of thermodynamic components, also known as basis species. The rationale for the choice of basis species used here (glutamine, glutamic acid, cysteine, O2, and H2O, abbreviated as QEC) was given by Dick et al. (2020). Because they are both measures of oxidation state, in general there is a strong positive correlation between ZC and nO2. For this dataset, there is in addition a strong negative correlation between ZC and nH2O, but other datasets do not show such a correlation.

cor(metrics.RDP$Zc, metrics.RDP$nO2)
## [1] 0.9937252
cor(metrics.RDP$Zc, metrics.RDP$nH2O)
## [1] -0.9266887

Let’s plot each of these chemical metrics against the sampling day.

plot_ps_metrics(mouse.RDP, x = "Day", color = "When", shape = "When", refdb = "RefSeq_206", quiet = TRUE) +
  geom_point(size = 3)

Let’s plot two chemical metrics against each other.

plot_ps_metrics2(mouse.RDP, color = "When", shape = "When", refdb = "RefSeq_206", quiet = TRUE) +
  geom_point(size = 3)

The community reference proteomes for Late samples tend to be more oxidized and less hydrated (i.e., they have higher ZC and lower nH2O) than Early samples, but there is some overlap between the groups.

Using GTDB for reference proteomes

As noted above, mapping taxon names from Silva or RDP to RefSeq is not perfect. To overcome this limitation, the GTDB can be used for both taxonomic classification of 16S rRNA gene sequences and for reference proteomes of taxa. Here we load a phyloseq-class object prepared following the DADA2 Pipeline Tutorial modified to classify 16S rRNA gene sequences using GTDB r220 (Alishum, 2024). Then, we plot chemical metrics calculated for reference proteomes also derived from the GTDB, but a later version. (The reason for this is that DADA2-formatted 16S rRNA sequences from GTDB r220 are not yet available in the Zenodo repository maintained by Alishum (2024).)

data(mouse.GTDB_220)
plot_ps_metrics2(mouse.GTDB_220, refdb = "GTDB_220", color = "When", shape = "When") + geom_point(size = 3)
## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"

Compared to using the RDP for classification of 16S rRNA sequences and RefSeq for reference proteomes, using the GTDB for both yields a clearer separation of Early and Late samples in chemical space. There is one Early sample with anomalously high ZC and low nH2O; let’s identify it:

metrics.GTDB <- ps_metrics(mouse.GTDB_220)
## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"
is.early <- sample_data(mouse.GTDB_220)$When == "Early"
iout <- which.min(metrics.GTDB[is.early, ]$nH2O)
(day <- sample_data(mouse.GTDB_220)[is.early, ]$Day[iout])
## [1] 7

This is the sample from day 7. Let’s look again at the taxonomic classifications, this time at the phylum level in GTDB:

top20.GTDB <- names(sort(taxa_sums(mouse.GTDB_220), decreasing = TRUE))[1:20]
mouse.GTDB_220.top20 <- transform_sample_counts(mouse.GTDB_220, function(OTU) OTU/sum(OTU))
mouse.GTDB_220.top20 <- prune_taxa(top20.GTDB, mouse.GTDB_220.top20)
plot_bar(mouse.GTDB_220.top20, x = "Day", fill = "Phylum") + facet_wrap(~When, scales = "free_x")

This plot suggests that a relatively high proportion of Bacteroidota makes the Day 7 sample more similar to samples in the Late group. Bacteroidetes (an older name for Bacteroidota) have some of the lowest nH2O among major bacterial phyla (see Dick and Tan, 2023), which could partially explain the low nH2O observed for the Day 7 sample.

Take-home messages

  1. Taxonomic mapping from RDP to RefSeq is incomplete. If you use this, it is advisable to report the percentage of mapped taxonomic classifications as well as the major unmapped groups, which can be listed with the quiet = FALSE argument to map_taxa(), ps_metrics(), plot_ps_metrics(), and plot_ps_metrics2().
  2. Consistent taxonomy can be achieved by using the GTDB for both reference proteomes and taxonomic classification of 16S rRNA gene sequences. For the dataset analyzed here, this leads to a clearer separation between sample groups in chemical space and helps to identify the taxonomic basis for possible outliers.
  3. Gut communities in later stages of mouse growth are characterized by relatively oxidized and dehydrated reference proteomes.

Beyond conventional diversity metrics used for microbiome analysis, visualizing the chemical metrics of community reference proteomes can provide new insight into genomic adaptation. See the next vignette (Plotting two chemical metrics) for more examples.

References

Alishum A. 2024. DADA2 Formatted 16S rRNA Gene Sequences for Both Bacteria & Archaea (GTDB Release 220). Zenodo. doi: 10.5281/zenodo.13984843

Dick JM, Tan J. 2023. Chemical links between redox conditions and estimated community proteomes from 16S rRNA and reference protein sequences. Microbial Ecology 85(4): 1338–1355. doi: 10.1007/s00248-022-01988-9

Dick JM, Yu M, Tan J. 2020. Uncovering chemical signatures of salinity gradients through compositional analysis of protein sequences. Biogeosciences 17(23): 6145–6162. doi: 10.5194/bg-17-6145-2020

Schloss PD, Schubert AM, Zackular JP, Iverson KD, Young VB, Petrosino JF. 2012. Stabilization of the murine gut microbiome following weaning. Gut Microbes 3(4): 383–393. doi: 10.4161/gmic.21008