Plotting two chemical metrics

Load required packages
GlobalPatterns
Humboldt Sulfuretum (Marine Sediment)
Qinghai-Tibet Plateau (Hot Springs)
Take-home messages
References

Chemical metrics quantify the elemental variation between communities by using reference proteomes derived from sequenced genomes, multiplied by taxonomic abundances inferred from high-throughput 16R rRNA gene sequences. Because chemical metrics are defined independently of the data, they can be compared across studies. Furthermore, they are basic tools for exploring hypotheses about molecular adaptation to environments.

In this vignette we’ll analyze phyloseq’s GlobalPatterns dataset (based on data from Caporaso et al., 2011) to visualize chemical variation of community reference proteomes across environments. Then, we’ll explore specific hypotheses about the effects of redox conditions and salinity on genomic adaptation by analyzing datasets for microbial communities in marine sediment (Fonseca et al., 2022) and geothermal waters (Zhang et al., 2023).

Load required packages

require(chem16S)
require(phyloseq)
require(ggplot2)
theme_set(theme_bw())

# For composing plots and making a common legend (plot_layout())
require(patchwork)

# For annotating plots with regression coefficients (stat_poly_line())
require(ggpmisc)

This vignette was compiled on 2023-07-17 with chem16S version 1.0.0 and phyloseq version 1.44.0.

GlobalPatterns

We will use the GlobalPatterns dataset ‘as-is’, without the preprocessing described in phyloseq’s Ordination Plots tutorial. There, less-abundant OTUs and phyla were removed in order to show high-level trends and shorten computing time. One step we do take from that tutorial is the addition of a categorical variable that identifies whether the samples are human-associated:

data(GlobalPatterns)
Human = get_variable(GlobalPatterns, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue")
sample_data(GlobalPatterns)$Human <- factor(Human)

Taxonomic assignments were made using the RDP Classifier without any additional training set specified (Caporaso et al., 2011), so we use the refdb = "RefSeq" argument to apply the manual mappings between RDP and RefSeq described by Dick and Tan (2023).

p2 <- plot_ps_metrics2(GlobalPatterns, color = "SampleType", shape = "Human", refdb = "RefSeq")
## [1] "map_taxa: using these manual mapping(s) to NCBI RefSeq:"

order_Rhizobiales –> order_Hyphomicrobiales (0.2%)

order_Clostridiales –> order_Eubacteriales (0.6%)

family_Ruminococcaceae –> family_Oscillospiraceae (3.1%)

## [1] "map_taxa: can't map groups order_Stramenopiles (12.94%), family_ACK-M1 (3.27%), 374 others (11.75%)"
## [1] "map_taxa: mapping rate to RefSeq taxonomy is 71.9%"

p2 + geom_polygon(aes(fill = SampleType), alpha = 0.5) + geom_point(size = 3) +
  guides(colour = guide_legend(override.aes = list(shape = c(17, 19, 19, 17, 19, 19, 17, 19, 17))))

At the extremes of carbon oxidation state (Z_C), soil communities are the most oxidized and skin and tongue communities are the most reduced. At the extremes of stoichiometric hydration state (n_H₂O), skin communities are the most hydrated and some fecal communities are the least hydrated. In more detail, there are environmental microbiomes that show similar ranges of chemical metrics (e.g., Freshwater (creek) and Sediment (estuary)) and others that are different. Freshwater – described as “lake” by Caporaso et al. (2011) – has lower Z_C than Freshwater (creek), and some ocean samples have lower n_H₂O than either freshwater group. These patterns could suggest an influence of greater oxygenation in flowing water compared to lakes (this is a distinction between lotic and lentic systems), and dehydration in communities adapted to life in salty water compared to freshwater.

Humboldt Sulfuretum (Marine Sediment)

In the remainder of this vignette, we will use the GTDB reference database (which is the default in chem16S) because the taxonomic classifications for the datasets analyzed below were made using the GTDB training set provided by Alishum (2022).

Fonseca et al. (2022) reported 16S rRNA gene sequences for sediment samples from the oxygen minimum zone of the Pacific Ocean along the coast of Chile, known as the Humboldt Sulfuretum. The sample data include dissolved oxygen, redox potential in sediment and overlying water, and organic matter (OM) content. This is a useful dataset for exploring the hypothesis that Z_C of proteins is shaped by redox conditions.

Here we read the phyloseq-class object created by using DADA2 (Callahan et al., 2016) to identify amplicon sequence variants (ASVs) in this dataset and to classify them using the GTDB training set. A sample taken from 50 m depth at the Valparaiso location on 2012-05-12 is available in the Sequence Read Archive (SRA) but was not included in the analysis described by Fonseca et al. (2022). The taxonomic composition of this sample is highly different from the all the others (see the ordination plots in the extdata directory where the ps_FEN+22.rds file is located), so we exclude it to avoid anomalous results.

psfile <- system.file("extdata/DADA2/FEN+22/ps_FEN+22.rds", package = "chem16S")
ps <- readRDS(psfile)
ps <- prune_samples(sample_names(ps) != "SRR1346095", ps)
ps

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 2095 taxa and 13 samples ]
## sample_data() Sample Data:       [ 13 samples by 14 sample variables ]
## tax_table()   Taxonomy Table:    [ 2095 taxa by 6 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 2095 reference sequences ]

Then, we plot Z_C and n_H₂O for the community reference proteomes using different colors for sample groups.

plot_ps_metrics2(ps, color = "Location") +
  geom_polygon(aes(fill = Location), alpha = 0.5) + geom_point(size = 3)

## [1] "map_taxa: mapping rate to GTDB taxonomy is 100.0%"

Among the areas with more than one sample, the community reference proteomes for Iquique are more reduced (i.e., have lower Z_C) than those for Concepcion and Valparaiso. Two of the communities at Valparaiso are also characterized by higher n_H₂O, suggesting the influence of a hydrating factor.

Let’s take a step toward more quantitative tests of these hypotheses about genomic adaptation to environments. The color scales in the next two plots reflect sediment redox potential and concentration of organic matter. The rationale for choosing these environmental measurements is described below.

p2 <- plot_ps_metrics2(ps, color = "Sediment_redox") +
  geom_point(size = 4) + labs(color = "Sediment redox (Eh)")

## [1] "map_taxa: mapping rate to GTDB taxonomy is 100.0%"

p3 <- plot_ps_metrics2(ps, color = "Organic_matter") +
  geom_point(size = 4) + labs(color = "Organic matter (%)")

## [1] "map_taxa: mapping rate to GTDB taxonomy is 100.0%"

p2 / p3

Thermodynamic considerations predict a positive correlation between redox potential and Z_C (Dick and Meng, 2023). It has also been predicted that salinity has a dehydrating effect that favors proteins with n_H₂O (Dick et al., 2020). However, these samples have no documented salinity gradient. Previous observations that protein expression in cells is shifted toward lower n_H₂O under hyperglycemic (high-glucose) conditions (Dick et al., 2020) suggest another hypothesis: a higher content of organic matter may be a proxy for dehydrating conditions.

We can use correlations between two environmental variables (redox potential or OM) and two chemical metrics for communities (Z_C or n_H₂O) in order to test these hypotheses. To make the plots, let’s construct a single data frame containing the sample data and chemical metrics.

sample.data.and.chemical.metrics.for.communities <- cbind(sample_data(ps), ps_metrics(ps))

## [1] "map_taxa: mapping rate to GTDB taxonomy is 100.0%"

Now let’s write a function to create a scatter plot for two variables and add a regression line. We use this function to make a plot for each combination of environmental variable and chemical metric.

# Defuse (enquo) and Inject (!!) from https://www.tidyverse.org/blog/2018/07/ggplot2-tidy-evaluation/
# Regression line and equation from https://stackoverflow.com/questions/7549694/add-regression-line-equation-and-r2-on-graph
scatter_plot <- function(data = sample.data.and.chemical.metrics.for.communities, x, y, xlab, ylab) {
  x <- enquo(x)
  y <- enquo(y)
  ggplot(data, aes(x = !!x, y = !!y, color = .data[["Location"]])) +
    geom_point() + xlab(xlab) + ylab(ylab) +
    # Override aes to plot one regression line for samples from all locations
    stat_poly_line(aes(x = !!x, y = !!y), inherit.aes = FALSE) +
    stat_poly_eq(aes(x = !!x, y = !!y), inherit.aes = FALSE, label.x = "center")
}

sp1 <- scatter_plot(x = Sediment_redox, y = Zc, xlab = "Sediment redox (mV)", ylab = chemlab("Zc"))
sp2 <- scatter_plot(x = Sediment_redox, y = nH2O, xlab = "Sediment redox (mV)", ylab = chemlab("nH2O"))
sp3 <- scatter_plot(x = Organic_matter, y = Zc, xlab = "Organic matter (%)", ylab = chemlab("Zc"))
sp4 <- scatter_plot(x = Organic_matter, y = nH2O, xlab = "Organic matter (%)", ylab = chemlab("nH2O"))
sp1 + sp2 + sp3 + sp4 + plot_layout(guides = "collect")

We find that carbon oxidation state is positively correlated with redox potential, and stoichiometric hydration state is negatively correlated with organic matter content. Taken alone, each of these correlations supports our initial hypotheses. However, in part because of strong covariation of the environmental variables, Z_C is also negatively correlated with OM content, and n_H₂O is positively correlated with redox potential.

The covariation of environmental variables makes it difficult to identify primary factors that drive the observed differences between communities. However, the chemical nature of these variables provides additional clues. The covariation of environmental variables (higher OM content with lower redox potential) makes sense if greater availability of organic compounds drives respiration and ensuing depletion of oxygen. This interaction among environmental variables yields a mechanistic hypothesis for the positive association between Z_C and n_H₂O (see first plot above), which could not be explained by our initial hypotheses about the effects of single variables.

Qinghai-Tibet Plateau (Hot Springs)

Zhang et al. (2023) reported 16S rRNA gene sequences for mildly alkaline hot spring reservoirs in the Qinghai-Tibet Plateau. I made the following predictions before downloading the sequence data from SRA:

I expected to find a positive correlation between Z_C of community reference proteomes and oxidation-reduction potential (ORP).
Because of input of reducing fluids, I expected to find lower Z_C than communities in marine sediments of the Humboldt Sulfuretum.
Assuming that these samples have relatively low salinity, I expected to find higher n_H₂O than marine sediment communities.

The following commands load the data and plot two environmental variables (ORP and temperature (T)) against two chemical metrics.

psfile2 <- system.file("extdata/DADA2/ZFZ+23/ps_ZFZ+23.rds", package = "chem16S")
ps2 <- readRDS(psfile2)
data.and.metrics <- cbind(sample_data(ps2), ps_metrics(ps2))

## [1] "map_taxa: mapping rate to GTDB taxonomy is 100.0%"

ps2

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 9466 taxa and 7 samples ]
## sample_data() Sample Data:       [ 7 samples by 24 sample variables ]
## tax_table()   Taxonomy Table:    [ 9466 taxa by 6 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 9466 reference sequences ]

scatter_plot_2 <- function(data = data.and.metrics, x, y, xlab, ylab) {
  x <- enquo(x)
  y <- enquo(y)
  ggplot(data, aes(x = !!x, y = !!y)) +
    geom_point() + xlab(xlab) + ylab(ylab) +
    stat_poly_line() +
    stat_poly_eq(label.x = "center")
}
sp1 <- scatter_plot_2(x = ORP, y = Zc, xlab = "ORP (mV)", ylab = chemlab("Zc"))
sp2 <- scatter_plot_2(x = ORP, y = nH2O, xlab = "ORP (mV)", ylab = chemlab("nH2O"))
sp3 <- scatter_plot_2(x = T, y = Zc, xlab = "T (°C)", ylab = chemlab("Zc"))
sp4 <- scatter_plot_2(x = T, y = nH2O, xlab = "T (°C)", ylab = chemlab("nH2O"))
sp1 + sp2 + sp3 + sp4

The correlation between ORP and Z_C isn’t as strong as I had expected. Moreover, neither of the chemical metrics is strongly associated with temperature. Therefore, this dataset seems to be an exception to the notion that particular chemical metrics of community reference proteomes are shaped by the environment at a local scale.

But let’s not forget about the global-scale predictions! How do communities in hot springs compare to those in ocean sediments? In order to make a plot, we can merge both datasets into a new phyloseq-class object. The sequence processing pipeline assigned the same taxon names to both datasets (ASV1, ASV2, etc.). Therefore, let’s append a letter to one set of names so that distinct taxa are not mistakenly combined.

taxa_names(ps2) <- paste0(taxa_names(ps2), "b")
ps_merged <- merge_phyloseq(ps, ps2)
ps_merged

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 11561 taxa and 20 samples ]
## sample_data() Sample Data:       [ 20 samples by 34 sample variables ]
## tax_table()   Taxonomy Table:    [ 11561 taxa by 6 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 11561 reference sequences ]

Let’s add a column to the sample data to indicate the type of environment for each dataset, then plot n_H₂O against Z_C.

sample_data(ps_merged)$Environment <-
  ifelse(is.na(sample_data(ps_merged)$Depth), "Hot spring", "Marine sediment")
plot_ps_metrics2(ps_merged, color = "Environment", shape = "Environment") + geom_point(size = 3)

## [1] "map_taxa: mapping rate to GTDB taxonomy is 100.0%"

In most cases, the communities from hot springs in the Qinghai-Tibet Plateau have lower Z_C and higher n_H₂O compared to those in marine sediments of the Humboldt Sulfuretum. This outcome is consistent with predictions about genomic adaptation to relatively more reducing and less saline conditions of the hot springs.

It is also notable that the positive association between Z_C and n_H₂O observed for the sediment communities does not extend to the comparison between two datasets. This distinction suggests that the factors that influence elemental composition of community reference proteomes are different at local and global scales.

Take-home messages

Chemical metrics are defined independently of the data and can be compared across datasets.
In some cases, covariation between Z_C and n_H₂O suggests interactions between environmental variables. For instance, if lower redox potential is associated with greater organic matter content in sediments, this could drive a positive association between Z_C and n_H₂O.
Variation of n_H₂O may reflect the influence of diverse factors including salinity, organic matter content, and others. An open question is: What explains the very low n_H₂O for communities in some fecal samples?
Multiple datasets can be used to examine global-scale influences of redox potential and salinity on genomic adaptation at the community level. More comprehensive tests of the prediction of a positive correlation between Z_C and redox potential at local and global scales have been reported by Dick and Meng (2023).

References

Alishum A. 2022. DADA2 formatted 16S rRNA gene sequences for both bacteria & archaea. Zenodo. doi: 10.5281/zenodo.6655692

Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. 2016. DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods 13(7): 581–583. doi: 10.1038/nmeth.3869

Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R. 2011. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proceedings of the National Academy of Sciences 108(Supplement 1): 4516–4522. doi: 10.1073/pnas.1000080107

Dick JM, Meng D. 2023. Community- and genome-based evidence for a shaping influence of redox potential on bacterial protein evolution. mSystems. doi: 10.1128/msystems.00014-23

Dick JM, Tan J. 2023. Chemical links between redox conditions and estimated community proteomes from 16S rRNA and reference protein sequences. Microbial Ecology 85(4): 1338–1355. doi: 10.1007/s00248-022-01988-9

Dick JM, Yu M, Tan J. 2020. Uncovering chemical signatures of salinity gradients through compositional analysis of protein sequences. Biogeosciences 17(23): 6145–6162. doi: 10.5194/bg-17-6145-2020

Fonseca A, Espinoza C, Nielsen LP, Marshall IPG, Gallardo VA. 2022. Bacterial community of sediments under the Eastern Boundary Current System shows high microdiversity and a latitudinal spatial pattern. Frontiers in Microbiology 13: 1016418. doi: 10.3389/fmicb.2022.1016418

Zhang H-S, Feng Q-D, Zhang D-Y, Zhu G-L, Yang L. 2023. Bacterial community structure in geothermal springs on the northern edge of Qinghai-Tibet plateau. Frontiers in Microbiology 13: 994179. doi: 10.3389/fmicb.2022.994179