Trait biases in microbial reference genomes

Albright, Sage; Louca, Stilianos

doi:10.1038/s41597-023-01994-7

Analysis
Open access
Published: 09 February 2023

Scientific Data volume 10, Article number: 84 (2023) Cite this article

Abstract

Common culturing techniques and priorities bias our discovery towards specific traits that may not be representative of microbial diversity in nature. So far, these biases have not been systematically examined. To address this gap, here we use 116,884 publicly available metagenome-assembled genomes (MAGs, completeness ≥80%) from 203 surveys worldwide as a culture-independent sample of bacterial and archaeal diversity, and compare these MAGs to the popular RefSeq genome database, which heavily relies on cultures. We compare the distribution of 12,454 KEGG gene orthologs (used as trait proxies) in the MAGs and RefSeq genomes, while controlling for environment type (ocean, soil, lake, bioreactor, human, and other animals). Using statistical modeling, we then determine the conditional probabilities that a species is represented in RefSeq depending on its genetic repertoire. We find that RefSeq is significantly biased for or against the majority of examined genes. Our systematic estimates of gene prevalences across bacteria and archaea in nature and of gene-specific biases in reference genomes constitute a resource for addressing these issues in the future.

Introduction

Culturing remains the gold standard for studying bacterial and archaeal (henceforth “prokaryotic” for brevity) physiology, metabolism and pathogenicity¹, and for obtaining high-quality genome sequences, such as those in the NCBI RefSeq reference sequence database². The thousands of prokaryotic genomes now available in RefSeq, in turn, enable large-scale analyses that yield insight into the processes shaping microbial genome structure and evolution^{3,4,5,6,7,8,9}, and also form the starting pool for curated gene ontologies or databases such as eggNOG¹⁰ and rrnDB¹¹. Reference genome databases and culture-based phenotype databases also enable predictions of the likely gene contents and traits of other less studied prokaryotic clades seen in environmental samples based on phylogenetic relationships, e.g., using tools such as PICRUSt¹², Tax4Fun¹³ and FAPROTAX¹⁴. However, to date the vast majority of extant prokaryotic diversity remains uncultured and lacks a whole genome sequence, partly due to the difficulties associated with determining the proper growth conditions for each species. Conventional culturing techniques, the fact that pure cultures must grow in the absence of syntrophic partners, and typical research priorities are thought to bias our discovery towards traits that may not be representative of the broader prokaryotic diversity in nature^{1,15}. These biases can distort our vision of prokaryotic diversity, limit our capacity to discover new useful biochemical functions¹⁵, introduce biases in comparative phylogenetic and other evolutionary analyses¹⁶, and most likely bias phylogeny-based predictions of gene content and traits in uncultured organisms¹⁶. For example, trait biases in RefSeq are expected to cause corresponding prediction biases in PICRUSt and Tax4Fun.
Systematically quantifying the true distribution of traits across prokaryotic clades and determining trait biases in reference genome databases (and by extension, in prokaryotic cultures) is required for assessing the extent of these important issues and addressing them in the future. To date such an analysis across a broad range of clades and environments is lacking, one reason being that it was until recently impossible to efficiently recover a large culture-independent set of microbial genomes from natural environments.

Recent advances in genome-resolved metagenomics now enable the recovery of nearly-complete prokaryotic genomes from complex natural microbial communities without the need for culturing^17,18,19,20. Here we use 116,884 previously published prokaryotic metagenome-assembled genomes (MAGs) from around the world to obtain a culture-independent sample of extant prokaryotic diversity in nature. We use this collection of MAGs to estimate the true prevalences of thousands of different genes (considered here as proxies of traits) across prokaryotic clades. We compare these gene prevalences to those in the widely used NCBI RefSeq prokaryotic reference genome database², and quantify gene-dependent biases of clades represented in RefSeq. The RefSeq genome database was chosen for comparison as it is one of the oldest, most comprehensive and most widely used reference genome databases, and because it is largely based on cultured organisms, although we acknowledge that some cultured organisms have not yet had their genome sequenced and accessioned in RefSeq. We use statistical models to examine to what extent specific genes influence the probability of a random MAG being represented in RefSeq to at least 95% average nucleotide identity (ANI), which is a common modern measure for delineating prokaryotic species^{6,21,22,23,24}. To account for obvious environmental preferences in both culturing and metagenomic sequencing efforts, we perform our analyses separately for different environment types, including the human microbiome, the microbiomes of non-human animals (henceforth “animals” for brevity), bioreactors, the ocean, soil, and lakes.

Results

A diverse collection of MAGs

Our collection of 116,884 prokaryotic MAGs was obtained from 203 distinct studies, covering the human microbiome (20 studies), other animals (30), bioreactors (including wastewater treatment plants, 35), the ocean (including estuaries and coastal lagoons, 72), soils (23) and lakes (28) (overview in Supplemental Table S2), and covers over 150 different phyla (Supplemental Fig. S2). All MAGs were estimated to be at least 80% complete and exhibit no more than 5% contamination, based on a set of universal single-copy marker genes (details in Methods section, overview in Supplemental Fig. S1). To avoid redundancies in species representation, we clustered MAGs into species genome bins (SGBs) based on an average nucleotide identity (ANI) threshold of 95%^{6,21,22,23,24}. The probability that a randomly chosen prokaryotic species is represented in RefSeq (henceforth “coverage”) was determined based on the fraction of MAG-SGB representatives that could be matched to at least one RefSeq genome at ANI ≥95%.
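The coverage defined above reduces to a simple proportion. The following is a minimal sketch, assuming that the best ANI of each MAG-SGB representative against any RefSeq genome has already been computed; the function name, the dictionary layout and the toy values are hypothetical and not taken from the study.

```python
def refseq_coverage(best_ani_by_sgb, threshold=0.95):
    """Fraction of SGB representatives matched to a RefSeq genome at ANI >= threshold.

    best_ani_by_sgb maps a (hypothetical) SGB identifier to the highest ANI found
    against any RefSeq genome, or None when no comparable genome was found.
    """
    if not best_ani_by_sgb:
        raise ValueError("no SGB representatives supplied")
    matched = sum(
        1 for ani in best_ani_by_sgb.values()
        if ani is not None and ani >= threshold
    )
    return matched / len(best_ani_by_sgb)


# Toy input (illustrative values only, not data from the study).
best_ani = {"SGB_1": 0.991, "SGB_2": 0.87, "SGB_3": None, "SGB_4": 0.95}
coverage = refseq_coverage(best_ani)
```

In this toy input two of the four representatives reach the threshold, so the estimated coverage is one half.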

Overall, we found that only a small fraction of MAG-SGBs could be matched to a RefSeq genome, although strong differences existed between environments. Notably, by far the highest coverage was found for human-associated prokaryotes (33%) and the lowest coverages were found for lakes (2.2%) and soil (4.9%) (overview in Supplemental Table S2). This is consistent with a previous study that showed that the cultured fraction of prokaryotes associated with the human gut is substantially above the average across environments²⁵, and confirms the general expectation that only a small fraction of non-human-associated prokaryotic clades has been cultured. Our coverage estimates are also comparable to the global coverage estimated previously by Zhang et al.²⁶ based on 16S SSU rRNA amplicon sequences (~2.1%).

Gene prevalence estimates

To estimate how the prevalence of various genes differs between MAGs and prokaryotic RefSeq genomes, we searched for KEGG gene orthologs (KOs) in MAGs as well as in RefSeq genomes using Hidden Markov Models (HMMs) from the KOfam database²⁷ (annotation summaries in Supplemental Table S1). We chose to focus on KEGG because (a) it is widely used in microbial ecology, (b) it provides ready-to-use HMMs for more accurate gene annotation than BLAST-based searches, and (c) its functional focus facilitates the interpretation of the gene-specific biases examined here. In order to avoid eukaryote-specific genes, only genes found in at least one MAG or prokaryotic RefSeq genome were considered (12,454 genes). To eliminate redundancies in species representation in RefSeq, we clustered RefSeq genomes into species bins based on their provided species-level taxon-IDs (STIBs), and only considered a single representative per STIB. Note that we focus on the true prevalence of each gene in the original populations represented by the MAG-SGBs or RefSeq STIBs (henceforth denoted α), and not merely the detection rate of that gene in the MAG-SGBs or RefSeq STIBs; indeed, mere detection rates are generally lower than true prevalences due to the incompleteness of many MAGs and some RefSeq genomes. To account for the incompleteness of each MAG and RefSeq genome in our gene prevalence estimates, we used an appropriate probabilistic model that we fitted via maximum-likelihood (details in Methods).
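To illustrate why correcting for incompleteness raises the estimate above the raw detection rate, the sketch below fits a deliberately simple model by maximum likelihood. The model used in the study is not reproduced in this section, so the form here is an assumption: a gene that is truly present is detected in genome i with probability equal to that genome's completeness c_i, giving a detection probability of α·c_i. The function names and toy data are hypothetical.

```python
import math


def log_likelihood(alpha, detected, completeness):
    """Log-likelihood of prevalence alpha under the assumed model P(detect) = alpha * c_i."""
    total = 0.0
    for hit, c in zip(detected, completeness):
        # Clamp away from 0 and 1 so that the logarithms stay finite.
        p = min(max(alpha * c, 1e-12), 1.0 - 1e-12)
        total += math.log(p) if hit else math.log(1.0 - p)
    return total


def estimate_prevalence(detected, completeness, tolerance=1e-8):
    """Maximize the (concave) log-likelihood over alpha in [0, 1] by ternary search."""
    low, high = 0.0, 1.0
    while high - low > tolerance:
        left = low + (high - low) / 3.0
        right = high - (high - low) / 3.0
        if log_likelihood(left, detected, completeness) < log_likelihood(right, detected, completeness):
            low = left
        else:
            high = right
    return 0.5 * (low + high)


# Toy data: the gene is detected in 4 of 10 genomes, each assumed 80% complete.
detected = [True] * 4 + [False] * 6
completeness = [0.8] * 10
detection_rate = sum(detected) / len(detected)
alpha_hat = estimate_prevalence(detected, completeness)
```

With equal completeness across genomes, the estimate is the detection rate divided by the completeness, so the fitted prevalence exceeds the raw detection rate, as the text describes.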

Overall, we found that gene detection rates were strongly skewed towards the lower end regardless of environment, with the majority of genes detected in fewer than 5% of MAG-SGBs and RefSeq STIBs (histograms in Supplemental Fig. S3 and Supplemental Fig. S4). As expected, estimated gene prevalences in MAG-SGBs (i.e., accounting for MAG incompleteness) were generally greater than mere gene detection rates (Supplemental Fig. S6), although gene prevalences still exhibited a strong skew towards lower values regardless of environment (Supplemental Fig. S5). MAG-SGB-based gene prevalences exhibited substantial and clearly significant positive correlations between environments, i.e., a gene that was widespread in species from one environment also tended to be widespread in species from other environments (Pearson r ≥ 0.840 and P < 0.001 for all environment pairs, Supplemental Fig. S7). That said, the degree to which gene prevalences correlated between environments varied considerably. For example, the strongest correlations were found between lakes and bioreactors (r = 0.991), between humans and other animals (r = 0.990) and between lakes and ocean (r = 0.985), while the weakest correlations were found between soil and humans (r = 0.840) and between soil and other animals (r = 0.846). These results suggest that the selective forces determining the prevalences of various genes across species tend to be somewhat similar across environments, and tend to be particularly similar between the human microbiome and other animal microbiomes, and between lakes, bioreactors and the ocean.

Estimated gene prevalences in MAG-SGBs correlated positively with those in RefSeq STIBs, with Pearson correlation coefficients (r) ranging between 0.89 and 0.95 depending on the environment (P < 0.001 in all cases, Fig. 1). While these correlations may seem high, we point out that they are generally similar to the correlations observed between environments (discussed in the previous paragraph); in other words, differences between MAG-SGBs and RefSeq STIBs (when controlling for environment) are comparable to differences between environments. This compromises the ecological conclusions that one may draw based on gene prevalences in reference databases such as RefSeq. In fact, the vast majority of genes displayed significantly different prevalences between MAG-SGBs and RefSeq STIBs, with a general tendency for gene prevalences to be higher among RefSeq STIBs than among MAG-SGBs. Specifically, the median ratio of RefSeq STIB-based prevalences to MAG-SGB-based prevalences (“median prevalence ratio”, or MPR) was substantially above 1 in all environments, ranging from 1.9 for bioreactors and soil up to 2.9 for humans and 2.8 for non-human animals. This tendency of genes to be more prevalent in RefSeq STIBs than MAG-SGBs could be caused by three distinct but not mutually exclusive mechanisms. First, there may be a general bias in RefSeq towards larger genomes. Indeed, prokaryotes with larger genomes tend to be more metabolically versatile and thus likely less dependent on syntrophic partners, which facilitates their culturing^{28,29,30}. Consistent with this interpretation, we found that genome sizes among RefSeq STIBs tended to be larger on average than genome sizes among MAG-SGBs in all considered environments, even after correcting for MAG incompleteness (Fig. 2). This result confirms and extends a previous finding that uncultured human gut bacteria tend to have significantly smaller genomes when compared to cultured ones³¹.
To further examine to what extent size biases in RefSeq can explain the generally higher gene prevalences therein, we adjusted the gene prevalence estimates in RefSeq STIBs for the differences in the size distribution between MAG-SGBs and RefSeq STIBs. The adjustment was analogous to that commonly used in stratified demographic surveys with disproportionate sampling of strata, where stratum averages are weighted by each stratum’s proportion in the overall population in order to obtain an unbiased population average³² (see Methods for details). We found that this adjustment indeed reduced the discrepancy in gene prevalences between MAG-SGBs and RefSeq STIBs for all environments, with Pearson correlations increasing slightly in all cases and MPRs decreasing substantially towards a value of 1 (Supplemental Fig. S8). Nevertheless, in all environments the MPR remained greater than 1, with the greatest MPR (1.7) found for animal-associated and lake prokaryotes and the smallest MPR (1.3) found for human-associated and ocean prokaryotes. This suggests that size biases partly, but not fully, explain the generally greater gene prevalences estimated for RefSeq STIBs compared to MAG-SGBs.
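The stratified weighting described above can be sketched as follows. This is an illustration of generic post-stratification under stated assumptions, not the exact procedure from the Methods: genomes are binned by size using hypothetical, ascending bin edges, the per-bin gene frequency is computed in the reference set, and the bins are then weighted by their proportions in the target (MAG-SGB) set. Bins that are empty in the reference set are skipped and the weights renormalized, which is a choice made here for simplicity.

```python
from bisect import bisect_right
from collections import Counter


def size_adjusted_prevalence(ref_sizes, ref_has_gene, target_sizes, bin_edges):
    """Reweight per-stratum gene frequencies in a reference set to a target size distribution.

    bin_edges must be sorted in ascending order; a genome of size s falls into the
    stratum given by the number of edges that are <= s.
    """
    def stratum(size):
        return bisect_right(bin_edges, size)

    ref_counts = Counter(stratum(s) for s in ref_sizes)
    ref_hits = Counter(stratum(s) for s, has in zip(ref_sizes, ref_has_gene) if has)
    target_counts = Counter(stratum(s) for s in target_sizes)

    # Only strata observed in both sets can contribute (assumption of this sketch).
    shared = [b for b in target_counts if ref_counts[b] > 0]
    if not shared:
        raise ValueError("reference and target sets share no size stratum")
    total = sum(target_counts[b] for b in shared)
    return sum(
        (target_counts[b] / total) * (ref_hits[b] / ref_counts[b]) for b in shared
    )


# Toy example: the reference set is dominated by large genomes that carry the gene.
ref_sizes = [2_000_000, 5_000_000, 5_000_000, 6_000_000]
ref_has_gene = [False, True, True, True]
target_sizes = [2_000_000, 2_000_000, 2_000_000, 5_000_000]
raw = sum(ref_has_gene) / len(ref_has_gene)
adjusted = size_adjusted_prevalence(ref_sizes, ref_has_gene, target_sizes, [3_000_000])
```

In this toy case the unadjusted reference prevalence is 0.75, while reweighting to the smaller-genome target distribution gives 0.25, showing how a size bias alone can inflate apparent prevalence.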

Second, it is in principle possible that our gene detection approach was less effective in MAGs than RefSeq genomes, for example due to the generally lower quality of MAGs, rather than there being true gene prevalence differences between the two datasets. Further, difficulties still exist in fully reconstructing genomes from metagenomes, notably regarding the inclusion of plasmids and genomic islands³³. These issues could in principle also cause an under-representation of genes in MAGs, compared to genomes from cultures. To examine this possibility, we repeated our gene prevalence estimates for the subset of MAG-SGBs that could be matched to a RefSeq genome, and for the subset of RefSeq genomes matched by a MAG-SGB. By restricting our gene prevalence estimates to these subsets of MAG-SGBs and RefSeq genomes, we eliminated any major differences in species representation between the two datasets, thus focusing on potential differences in gene inclusion/detection efficacy. For these restricted datasets we found a much closer agreement between gene prevalences in MAG-SGBs and RefSeq genomes, with almost none of the genes exhibiting a statistically significant difference in prevalence and with MPRs very close to 1 for all environments (0.97 ≤ MPR ≤ 1.03, Supplemental Fig. S9). We thus conclude that difficulties in including plasmids and genomic islands in MAGs, as well as differences in gene detection efficacy between MAGs and RefSeq genomes, are negligible and, in particular, are not a major source of the gene prevalence differences seen between the full datasets.

Third, RefSeq STIBs may be truly biased towards clades that exhibit the genes examined, to an extent beyond that caused by mere genome size biases. Indeed, KEGG is a highly curated, experimentally informed and functionally focused gene database, and it is possible that genes represented in KEGG tend to be of particular industrial, medical or environmental interest; fewer than half of protein-coding genes predicted in MAGs could be assigned to a KEGG ortholog (overview in Supplemental Table S1). These same interests presumably also guide the majority of culturing and whole genome sequencing efforts. Similarities between gene characterization biases and culturing/genome sequencing biases will inevitably lead to a tendency for RefSeq STIBs to be rich in genes catalogued in KEGG, even when controlling for genome sizes; we henceforth refer to this mechanism as “intentional” biases. Further, mainstream culturing approaches undoubtedly cause additional unintended trait biases, by favoring fast growers or generalists and disfavoring organisms that depend on syntrophic partners to survive. Genes responsible for (or at least correlating with) traits favored by typical culturing approaches will tend to be more prevalent in cultured species than among prokaryotes in general, and will thus be more likely to be characterized and included in databases such as KEGG. In other words, KEGG could be “unintentionally” biased towards genes that correlate with traits that facilitate culturing. That said, we mention that some genes exhibited lower prevalences among RefSeq STIBs than among MAG-SGBs, suggesting that common culturing approaches may also bias against some genes.

To tease apart intentional from unintentional gene prevalence biases in KEGG, we repeated our analyses with an annotation-independent gene database, the evolutionary gene genealogy of non-supervised orthologous groups (eggNOGs³⁴). If biases in KEGG are mostly intentional (as defined above), then one would expect eggNOGs to display a much weaker over-prevalence in RefSeq STIBs than KEGG orthologs do (after adjusting for genome size distributions). In contrast, if biases in KEGG are mostly unintentional (as defined above), then one would expect eggNOGs to display a similar over-prevalence in RefSeq STIBs as KEGG orthologs do. We found that, when adjusting for genome size distributions, eggNOG MPRs were substantially smaller than KEGG MPRs in 3 out of 6 environments (human, other animals and soil), dropping as low as 0.97 for human-associated species and 1.1 for other animals (Supplemental Fig. S11). This suggests that in these environments the biases in KEGG are to a great extent intentional. In contrast, for bioreactors, ocean and lakes, eggNOG MPRs were greater than KEGG MPRs, suggesting that in these environments the biases in KEGG are largely unintentional and driven by current culturing abilities. That said, further research is needed to better understand the roles of intentional and unintentional biases in the composition of KEGG and other gene databases.

To further illustrate the discovered gene prevalence biases from a functional perspective, we examined genes involved in metabolic functions of particular industrial interest, such as lignin, mannan, xylan and cellulose degradation^{35,36,37}, genes involved in functions of particular environmental interest, such as dissimilatory nitrogen and sulfur metabolisms and methanogenesis, as well as genes conferring antibiotic resistance (Fig. 3). For almost all of these functions and regardless of environment, the mean gene prevalences in RefSeq STIBs were considerably higher than in MAG-SGBs. One notable exception was methanogenesis, which was substantially underrepresented in RefSeq STIBs for all environments except bioreactors (where the difference was only minor). This latter observation is consistent with the fact that methanogens have been historically difficult to culture^{38,39,40}, and suggests that methanogenesis is a much more common trait in prokaryotes in natural environments than one would expect based on reference genomes.

Gene-dependent coverage biases

To more precisely quantify the biases for or against various genes in RefSeq, we estimated the conditional probability that a prokaryotic species is covered by (i.e., represented in) RefSeq given that it either has or does not have a given gene. We henceforth denote these two conditional probabilities by q₁ and q₀, respectively. If a gene is neither biased for nor against, we expect q₀ = q₁, while a bias for or against organisms exhibiting the gene would imply q₁ > q₀ or q₁ < q₀, respectively. We estimated q₁ and q₀ separately for each gene by fitting a probabilistic model that accounts for MAG incompleteness, gene presence/absence across MAG-SGBs and matches between MAG-SGBs and RefSeq genomes at ≥95% ANI. To facilitate comparisons between genes and between environments, we considered a composite variable that we termed “coverage bias”, denoted β and computed using the estimated q₀, q₁ (see Eq. 4 in Methods for details). The coverage bias is always between −1 and 1, with negative values implying a bias against a gene, positive values implying a bias towards a gene and zero implying no bias (q₀ = q₁). A useful property of β is that it only depends on the ratio q₁/q₀ but not on the overall coverage of MAG-SGBs in RefSeq nor on a gene’s prevalence (α). We mention that β could not be reliably estimated for all genes, since some genes were too rare (α ≈ 0) or too prevalent (α ≈ 1) for estimating the conditional probabilities q₁ or q₀, respectively.
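The properties listed above constrain the form of β quite tightly. Eq. 4 itself is not reproduced in this section, so the sketch below assumes the normalized difference β = (q₁ − q₀)/(q₁ + q₀), which is one form satisfying every stated property: it lies between −1 and 1, equals zero when q₀ = q₁, equals 1 when q₀ = 0, and depends only on the ratio q₁/q₀. The function name is hypothetical, and the formula should be checked against the Methods before being relied upon.

```python
def coverage_bias(q1, q0):
    """Assumed coverage bias: (q1 - q0) / (q1 + q0), for probabilities q1 and q0.

    q1 is the probability of RefSeq representation given that the gene is present,
    q0 the same probability given that the gene is absent.
    """
    for q in (q1, q0):
        if not 0.0 <= q <= 1.0:
            raise ValueError("conditional probabilities must lie in [0, 1]")
    if q1 + q0 == 0.0:
        raise ValueError("bias is undefined when both probabilities are zero")
    return (q1 - q0) / (q1 + q0)


no_bias = coverage_bias(0.2, 0.2)        # gene presence is irrelevant to coverage
fully_for = coverage_bias(0.3, 0.0)      # species lacking the gene are never covered
fully_against = coverage_bias(0.0, 0.3)  # species with the gene are never covered
```

Because the assumed expression can be rewritten as (r − 1)/(r + 1) with r = q₁/q₀, rescaling both probabilities by the same factor leaves β unchanged, which is what makes it comparable across environments with very different overall coverage.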

We found that in all environments a substantial fraction (50–82%) of considered genes exhibited a statistically significant coverage bias (β≠0, P < 0.05), with the clear majority of significant coverage biases being positive (Fig. 4 and Supplemental Table S4). Accordingly, the median coverage bias across all genes was positive for all environments (0.27–0.53). This pattern is consistent with our previous observation that gene prevalences tend to be inflated in RefSeq STIBs. In fact, for almost all environments (except human) the distribution of coverage biases exhibited a clear spike at a value of β = 1 (i.e., q₀ = 0). For example, among soil-associated prokaryotes 133 out of 4335 considered genes (e.g., carA, gpsA, rimP, gidB) exhibited a coverage bias of 1 (details in File KOfam_gene_prevalences_and_biases.tsv.gz on Figshare⁴¹). Hence, the absence of these genes strongly reduces the probability of a species being represented in RefSeq. That said, we point out that we also found many genes with a significant negative coverage bias (e.g., 14% of genes for soil), which means that species exhibiting these genes are underrepresented in RefSeq. When measured in terms of the median absolute coverage bias (median |β|), we observed the strongest coverage biases for the ocean and soil (median |β| = 0.62) and the weakest coverage biases for humans (0.38) and bioreactors (0.40). This suggests that existing culturing and genome sequencing pipelines impose weaker trait biases for human- and bioreactor-associated taxa than for ocean- and soil-associated taxa. This observation is perhaps not surprising, given the generally stronger medical and industrial relevance of the first two groups, which probably increases the motivation to culture a broader spectrum of organisms. 
We also found that coverage biases varied strongly between environments, with the smallest consistency (in terms of the Pearson correlation) seen between humans and soil (r = 0.41) and between humans and the ocean (r = 0.44, Supplemental Fig. S15).

Strong differences were also observed between different gene categories, as defined by the KEGG KO hierarchy (Fig. 5 and Supplemental Fig. S15). Genes associated with membrane transport, cell motility and nucleotide metabolism generally exhibited the highest median |β|, although the precise order depended on the environment considered (also note that only a subset of particularly interesting gene categories was included in this comparison). For ocean and soil, which exhibited the highest median |β|, the two considered gene categories with highest median |β| were cell motility and membrane transport (median |β| between 0.68 and 0.77). This suggests that traits related to cell motility and membrane transport are particularly strong predictors of culturing success in these two environments.

Discussion

We have analyzed the distribution of thousands of genes across a large culture-independent collection of prokaryotic genomes, and have quantified gene-specific biases in RefSeq reference genomes. Given the general close association between prokaryotic culturing and the existence of a reference genome in RefSeq, as well as the close association between gene content and functional traits in prokaryotes, we expect that our conclusions largely translate to trait biases in prokaryotic cultures. We mention that MAG datasets may exhibit their own biases, for example towards more abundant organisms, or against organisms with multiple hard-to-assemble regions such as 16S SSU rRNA genes, and these biases could in principle influence some of the patterns reported here. However, there is little reason to believe at this point that these biases are driven by traits in a manner that is consistent across locations; in other words, we expect that these biases will tend to average out across locations, and thus not be a substantial driver of the trait biases in RefSeq relative to MAGs.

Genome-resolved metagenomics is greatly accelerating the discovery of novel microbial diversity^19,20,42,43. Notwithstanding these breakthroughs, it should be remembered that culturing and associated whole genome sequencing remain essential tools for understanding the physiology and ecology of prokaryotes, and for decoding their genotype-phenotype mapping^1,44. Our results suggest that the current culturing and associated whole genome sequencing efforts are heavily biased towards or against a variety of traits, particularly among ocean- and soil-associated prokaryotes. These biases lead to a distorted distribution of annotated genes/traits in reference databases, which in turn compromises the ecological conclusions one may draw from these distributions. These biases will also inevitably influence gene content and trait predictions for novel clades, for example in 16S rRNA amplicon sequencing studies, when these predictions are performed based on reference genome sets (as is common practice, e.g.^13,45). In fact, the estimated coverage biases of many genes vary strongly between environments (Supplemental Fig. S15), suggesting that site-specific trait databases and statistical corrections may be needed for accurate trait estimation in novel clades. In addition, coverage biases may also delay the discovery of new industrially useful clades, such as novel methanogens for biofuel production, which we showed were severely underrepresented in RefSeq for all natural environments examined. The gene prevalences and coverage biases estimated in this study (provided as file KOfam_gene_prevalences_and_biases.tsv.gz on Figshare⁴¹) could help alleviate these issues in the future. For example, the conditional coverages (q₀ and q₁) estimated here can be used to correct for biases in phylogenetic trait prediction algorithms¹⁶, and help steer future culturing efforts towards under-explored traits.

Methods

MAGs

MAGs from 203 distinct studies were either downloaded from NCBI GenBank or from other locations provided by published studies^{46–212}. Only studies in which all MAGs were obtained from a similar environment or in which the environment was specified individually for each MAG were considered. For efficiency in MAG collection, we focused on studies with at least 30 MAGs. Studies focusing on a single clade (e.g., Thaumarchaeota) or trait (e.g., only methanogens) were omitted. An overview of included studies, including accession numbers and publication references, is provided in file project_metadata.tsv on Figshare⁴¹. With the exception of one study (“Genomes from Earth’s Microbiomes” or GEM¹³⁷), all other studies focused on specific environments. The GEM study itself comprised MAGs recovered from a multitude of environments across the world. Ten GEM MAGs were omitted because of missing taxonomic information. MAG qualities were determined based on the presence or absence of multiple universal single-copy genes using checkM2 v0.1.3²¹³. MAGs with completeness below 80% or contamination above 5% were omitted; thus 116,884 MAGs were kept for our analyses. An overview of completeness and contamination levels of kept MAGs is shown in Supplemental Fig. S1.
We mention that these quality criteria do not exactly match commonly suggested conventions for “high” or “medium” quality MAGs²¹⁴. Instead, the chosen thresholds are based on a reasonable balance between dropping too many MAGs and ensuring sufficient quality in the remaining MAG set. For example, a completeness of at least 90% suggested for “high quality” MAGs by²¹⁴ would be too stringent for some environments such as soil, while a completeness of at least 50% suggested for “medium quality” MAGs would be too low for reliably estimating gene prevalences.

The full size of the genome represented by a MAG (i.e., correcting for MAG incompleteness) was estimated by dividing the size of the MAG (in base pairs) by the completeness determined using checkM2. The taxonomic identities of MAGs were determined using the GTDB-Tk v1.4.1 workflow classify_wf²¹⁵, except for the GEM MAGs for which taxonomic identities were already provided by the original study. All software mentioned above was used with default options unless mentioned otherwise. An overview of represented prokaryotic phyla is shown in Supplemental Fig. S2.
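The quality filter and the size correction described above amount to two small computations, sketched here. The thresholds (completeness of at least 80%, contamination of at most 5%) and the division by completeness come from the text; the record layout, field names and toy values are hypothetical, and completeness and contamination are assumed to be expressed as fractions rather than percentages.

```python
def passes_quality(completeness, contamination,
                   min_completeness=0.80, max_contamination=0.05):
    """Keep a MAG only if it is complete enough and not too contaminated."""
    return completeness >= min_completeness and contamination <= max_contamination


def estimated_genome_size(mag_size_bp, completeness):
    """Estimate the full genome size by dividing the MAG size by its completeness."""
    if not 0.0 < completeness <= 1.0:
        raise ValueError("completeness must be a fraction in (0, 1]")
    return mag_size_bp / completeness


# Toy records (illustrative values only).
mags = [
    {"name": "MAG_a", "size_bp": 2_400_000, "completeness": 0.80, "contamination": 0.02},
    {"name": "MAG_b", "size_bp": 3_100_000, "completeness": 0.95, "contamination": 0.08},
    {"name": "MAG_c", "size_bp": 1_500_000, "completeness": 0.62, "contamination": 0.01},
]
kept = [m for m in mags if passes_quality(m["completeness"], m["contamination"])]
sizes = {m["name"]: estimated_genome_size(m["size_bp"], m["completeness"]) for m in kept}
```

Only the first toy record survives the filter: the second exceeds the contamination limit and the third falls below the completeness limit.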

MAGs were associated with various (not necessarily mutually exclusive) environments of interest based on the description in the publication associated with each study (if available), or based on the project description on GenBank, or — in the case of GEM genomes — based on the metadata table provided by the GEM study¹³⁷. MAGs classified as human- and other animal-associated were explicitly excluded from the other environmental categories. Environments associated with each MAG are listed in MAG_metadata_QF.tsv.gz on Figshare⁴¹.

To avoid species-level redundancies within the MAG dataset, we clustered MAGs into species genome bins (SGBs) at an ANI cutoff of 95% using a similar approach as described by²¹⁶, separately for each environment. Specifically, ANIs between all MAGs from a given environment were calculated using mash v2.3²¹⁷, with sketch size 5000 and otherwise default options. The average nucleotide divergence (AND) between any two MAGs was defined as 1-ANI. Bifurcating trees were constructed based on pairwise ANDs and using the hierarchical clustering algorithm implemented in the R package fastcluster v1.2.3 (function hclust with average linkage)²¹⁸. For computational efficiency, prior to clustering, MAGs were split into smaller disjoint subsets of moderately to closely related MAGs, based on an AND cutoff threshold of 15%. Hierarchical clustering trees were rooted via the midpoint method²¹⁹, using the R package castor v1.7.2²²⁰. Note that each tip in a tree corresponded to a MAG. Next, tips in the hierarchical clustering trees were grouped into SGBs based on a maximum pairwise distance of 5% AND, using the function collapse_tree_at_resolution in the R package castor²²⁰. From each SGB, a single representative MAG was kept, chosen to be the MAG with the highest completeness. A total of 29,531 SGBs were thus obtained. An overview of MAGs and SGBs from each environment is given in Supplemental Table S2.
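The grouping and representative selection can be illustrated with a small pure-Python sketch. It is a simplification and not the pipeline described above: instead of fastcluster trees collapsed with castor, it greedily merges clusters by average linkage and stops once the closest pair of clusters is more than 5% AND apart, which will not always give the same bins as a maximum-pairwise-distance criterion. The function name and toy ANI values are hypothetical, and the quadratic search is only suitable for small inputs.

```python
from itertools import combinations


def cluster_by_average_linkage(names, divergence, max_divergence=0.05):
    """Greedy average-linkage agglomeration with a divergence cutoff.

    divergence maps frozenset({a, b}) to the AND (1 - ANI) between genomes a and b.
    """
    clusters = [[name] for name in names]

    def average_distance(first, second):
        total = sum(divergence[frozenset((a, b))] for a in first for b in second)
        return total / (len(first) * len(second))

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j is guaranteed by combinations).
        distance, i, j = min(
            (average_distance(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if distance > max_divergence:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy pairwise ANI values (illustrative only) converted to AND = 1 - ANI.
ani = {
    ("A", "B"): 0.98, ("A", "C"): 0.80, ("A", "D"): 0.79,
    ("B", "C"): 0.81, ("B", "D"): 0.80, ("C", "D"): 0.96,
}
divergence = {frozenset(pair): 1.0 - value for pair, value in ani.items()}
completeness = {"A": 0.85, "B": 0.97, "C": 0.91, "D": 0.88}

sgbs = cluster_by_average_linkage(["A", "B", "C", "D"], divergence)
# From each bin keep the most complete MAG, as described in the text.
representatives = [max(sgb, key=completeness.get) for sgb in sgbs]
```

In this toy input, A and B form one bin and C and D another, with B and C kept as representatives because they have the highest completeness in their respective bins.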

RefSeq genomes

Genomes were downloaded from the NCBI RefSeq database on October 7, 2021. We downloaded all genomes whose genome_rep was “Full”, whose gap_fraction was below 0.1, and whose assembly_level was one of “Complete Genome”, “Contig”, “Scaffold” or “Chromosome”. RefSeq genomes were grouped into various (not necessarily mutually exclusive) environments based on the associated biosample’s metadata “geo_loc”, “biosample_organism_name”, “metagenome_source”, “env_local_scale”, “isolation_source” and “isolation_site”, as follows. Genomes whose aforementioned metadata contained any of the words “soil”, “rhizosphere”, “rhizoplane”, “root nodule” or “permafrost” were classified as soil-associated. Genomes whose aforementioned metadata contained any of the words “ocean”, “marine” or “estuary” were classified as ocean-associated. Genomes whose aforementioned metadata contained any of the words or phrases “lake”, “lakewater”, “freshwater sediment”, “freshwater mat” or “pond” were classified as lake-associated. Genomes whose aforementioned metadata contained any of the words or phrases “bioreactor”, “wastewater treatment”, “digester”, “digestor”, “reactor” or “activated sludge” were classified as bioreactor-associated. Genomes either identified as human-associated using FAPROTAX v1.2.4¹⁴ or whose biosample “host” metadata was “homo sapiens” were classified as human-associated. Genomes either identified as animal-associated using FAPROTAX v1.2.4¹⁴ or whose biosample “host” metadata was identified as a metazoan (based on metazoan Latin names in the Open Tree of Life v13.4²²¹ and a custom list of animal common names), and not already classified as human-associated, were classified as non-human-animal-associated (in this study simply “animal-associated” for brevity). Note that FAPROTAX provides a convenient means to identify human- and animal-associated species, and is used here solely to increase the accuracy of the environmental classifications of genomes.
Genomes classified as human- and other animal-associated were subsequently excluded from the other environmental categories. A total of 184,131 genomes could be associated with at least one of the above environments. The number of genomes associated with each environment is given in Supplemental Table S2. The environments associated with each RefSeq genome are listed in file RefSeq_genome_metadata_QF.tsv.gz on Figshare⁴¹.
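The keyword rules above can be sketched in Python as follows. This is a minimal illustration, not the code used in this study: the function name `classify_environments` is hypothetical, we assume whole-word (or whole-phrase) matching on lower-cased metadata, and the host-based classification (FAPROTAX and the biosample “host” field) is represented only by an optional, externally supplied `host_category` argument.

```python
import re

# Keywords and phrases as listed in the text, grouped by environment.
ENVIRONMENT_KEYWORDS = {
    "soil": ["soil", "rhizosphere", "rhizoplane", "root nodule", "permafrost"],
    "ocean": ["ocean", "marine", "estuary"],
    "lake": ["lake", "lakewater", "freshwater sediment", "freshwater mat", "pond"],
    "bioreactor": ["bioreactor", "wastewater treatment", "digester", "digestor",
                   "reactor", "activated sludge"],
}


def classify_environments(metadata_values, host_category=None):
    """Return the set of environments suggested by a genome's metadata strings.

    Assumptions: matching is case-insensitive and on whole words/phrases;
    `host_category` ("human" or "animal") is decided elsewhere and, if given,
    excludes the genome from all other environments.
    """
    if host_category is not None:
        return {host_category}
    text = " ".join(str(value).lower() for value in metadata_values if value)
    environments = set()
    for environment, keywords in ENVIRONMENT_KEYWORDS.items():
        for keyword in keywords:
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                environments.add(environment)
                break
    return environments


# A genome can be associated with more than one environment.
example = classify_environments(["rhizosphere soil next to a pond", None])
```

Because the categories are not mutually exclusive, the function returns a set; only host-associated genomes are restricted to a single category.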

To avoid species-level redundancies in some of our subsequent analyses, we clustered RefSeq genomes into species-level bins (STIBs) based on their provided species-level taxon identity (species_taxid field). When choosing STIB representatives we prioritized genomes based on their contig-N50 quality metric. An overview of RefSeq genomes and STIBs from each environment is given in Supplemental Table S2.

It is possible that some RefSeq genomes are contaminated, with genomes reported as isolates actually originating from co-cultures, due to the difficulty of growing bacteria (notably Cyanobacteria) axenically²²²,²²³. Such contamination, if widespread, could in principle introduce errors in our gene prevalence estimates for RefSeq, although Cyanobacteria constitute only a very small fraction of the RefSeq genomes and we are unaware of any evidence suggesting that such issues are common across RefSeq. Future studies of this kind could avoid these issues (as well as the assembly/binning issues discussed below) by using single-cell amplified genomes²²⁴.

Gene detection

Protein-coding genes were predicted for each MAG using prodigal v2.6.3, and were subsequently annotated (matched to KEGG orthologs) using the KOfam Hidden Markov Model database²⁷ (release 2020-04-02, comprising 21,461 genes) and hmmsearch v3.3.2²²⁵. A similar approach was used to detect and annotate genes in prokaryotic RefSeq genomes. To reduce computation time, we only annotated a random subset of 137,726 genomes; however, all STIB representatives were explicitly included in this subset. Genes not found in any MAG or any functionally annotated RefSeq genome were omitted from subsequent analyses, as most of these genes are likely eukaryote-specific. Thus, a total of 12,454 genes were kept. The number of MAGs, MAG-SGBs, RefSeq genomes and RefSeq STIBs in which each gene was found is listed in file KOfam_gene_prevalences_and_biases.tsv.gz on Figshare⁴¹. Histograms of the number of MAG-SGBs and RefSeq STIBs in which each gene was detected are shown in Supplemental Figs. S3, S4, respectively. The average number of predicted protein-coding genes per MAG and per genome, as well as the fraction of such genes that could be functionally annotated using KOfam, are listed in Supplemental Table S1.

To examine the role of potential biases in the KEGG database we also matched predicted genes to the eggNOG 5.0 database of orthologous groups³⁴, as follows. Protein sequences of all eggNOG orthologs were downloaded from the eggNOG website at http://eggnog5.embl.de/download/eggnog_5.0 (file e5.proteomes.faa.gz). Amino acid sequences predicted in MAGs or RefSeq genomes were then matched to the downloaded protein sequence database using diamond v2.0.15.153²²⁶, and only hits with an e-value below 10⁻¹⁰ were kept. Matched sequences were converted to eggNOG ortholog IDs using a lookup table downloaded from the eggNOG website (file all_members.tsv.gz). For computational tractability, only 50,000 randomly selected eggNOG orthologs were considered for subsequent analysis. Note that the focus of this article is on KEGG orthologs (KOs), hence unless specified otherwise “gene” refers to a KO rather than an eggNOG ortholog.

Estimating gene prevalences in MAG-SGBs

To estimate the prevalence of a given gene (KEGG ortholog or eggNOG) in populations represented by our MAG-SGB set, i.e., the probability α that the population represented by a randomly chosen MAG-SGB exhibits the gene, we proceeded as follows. Throughout the analysis described below, we only considered the single representative MAG of each SGB. We assumed that the probability of a gene being detected in a randomly chosen MAG (using hmmsearch as described earlier) is given by the product α·C, where α is the true prevalence of the gene across species and C is the completeness of the MAG. We also assumed that the false positive and false negative detection rates of genes in MAGs are negligible. Hence, if M₀ is the set of MAGs in which the gene was not detected, and M₁ the set of MAGs in which the gene was detected, the total likelihood of our dataset (for a given α) is given by the product of probabilities:

$$L=\prod _{m\in {M}_{0}}\left(1-\alpha {C}_{m}\right)\cdot \prod _{m\in {M}_{1}}\alpha {C}_{m},$$

(1)

where C_m is the completeness of the m-th MAG. The maximum-likelihood estimate of α, denoted $\widehat{\alpha }$, can be found by demanding that the derivative ∂L/∂α is zero, which is equivalent to the following equation:

$$\left|{M}_{1}\right|=\sum _{m\in {M}_{0}}\frac{\widehat{\alpha }{C}_{m}}{1-\widehat{\alpha }{C}_{m}},$$

(2)

where |M₁| is the cardinality of M₁. Note that if all MAGs were complete (C_m = 1 for all m), the solution to Eq. (2) would simply be $\widehat{\alpha }=\left|{M}_{1}\right|/\left(\left|{M}_{0}\right|+\left|{M}_{1}\right|\right)$; in reality, however, most MAGs were incomplete, making an analytical solution difficult. Equation (2) was thus solved numerically for $\widehat{\alpha }$ in python, using the bisection method. Confidence intervals were obtained using parametric bootstrapping, i.e., based on gene presences/absences generated randomly across the MAGs according to the above statistical model and the fitted α. We mention that, as with most bioinformatics analyses in this paper, checkM2-based completeness estimates may not be fully accurate. Errors in MAG completeness estimates would introduce errors in the gene prevalence estimates, although these errors are suspected to be small based on typical checkM2 errors (mean average error ~3%)²¹³. Further, we mention that the MAGs and genomes analyzed were originally generated using a variety of alternative sequencing platforms, assembly and binning tools. This methodological variation could in principle impact contig assembly lengths, which ORFs are included/split on those contigs, the frequency of chimeric contigs, and which contigs get included in each bin, which would by extension impact the recovery of ORFs and the estimation of gene prevalences and coverage biases. That said, our analysis of gene prevalences restricted to MAG-SGBs matched by RefSeq genomes (see main article and Supplemental Fig. S9) indicated that ORF recovery and gene detection efficiency did not noticeably differ between MAG-SGBs and their matched RefSeq genomes.
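To make the numerical step concrete, the following Python sketch solves Eq. (2) by bisection. It is a minimal illustration rather than the code used in this study: the function name `estimate_prevalence` is hypothetical, and the handling of boundary cases (a gene never detected, or a likelihood that keeps increasing up to α = 1) is our own assumption.

```python
def estimate_prevalence(completeness_undetected, n_detected, tol=1e-12):
    """Maximum-likelihood prevalence from Eq. (2), solved by bisection.

    completeness_undetected: completeness C_m of each MAG in M0 (gene not detected).
    n_detected: |M1|, the number of MAGs in which the gene was detected.
    """
    if n_detected == 0:
        return 0.0  # assumption: gene never detected, so the estimate is 0
    if not completeness_undetected:
        return 1.0  # assumption: gene detected in every MAG

    def residual(alpha):
        # Right-hand side of Eq. (2) minus |M1|; increases monotonically with alpha.
        total = sum(alpha * c / (1.0 - alpha * c) for c in completeness_undetected)
        return total - n_detected

    # If the residual is still non-positive at alpha = 1, the likelihood is
    # maximized at the boundary (assumption: report 1 in that case).
    if max(completeness_undetected) < 1.0 and residual(1.0) <= 0.0:
        return 1.0

    low, high = 0.0, 1.0
    while high - low > tol:
        mid = 0.5 * (low + high)
        if residual(mid) < 0.0:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


# With complete MAGs the estimate reduces to |M1| / (|M0| + |M1|).
alpha_complete = estimate_prevalence([1.0] * 70, 30)
# With 80% complete MAGs, detecting the gene in 40 of 100 MAGs implies alpha = 0.5.
alpha_incomplete = estimate_prevalence([0.8] * 60, 40)
```

The second example shows the effect of the completeness correction: the naive detection frequency is 0.4, whereas the estimate accounting for 80% completeness is 0.5.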

Estimating gene prevalences in RefSeq STIBs

Prevalences of genes (KEGG orthologs or eggNOGs) across RefSeq STIBs were estimated using a similar approach as for MAG-SGBs, the main difference being that the completeness of a RefSeq genome was computed based on the associated gap fraction listed in RefSeq (that said, the gap fraction was negligible for the majority of RefSeq genomes considered). To further examine how the gene prevalence estimates across RefSeq STIBs would change in the absence of any genome size biases (relative to the MAG dataset), i.e., correcting for the distribution of genome sizes, we proceeded as follows. We binned MAG-SGBs based on their estimated full genome size (i.e., correcting for MAG incompleteness) into 0.5 Mbp size intervals or “strata” (i.e., 0–0.5 Mbp, 0.5–1 Mbp, 1–1.5 Mbp, …). Similarly, we binned RefSeq STIBs based on their full genome size (attribute total_length) into the same size intervals, and estimated gene prevalences separately for each size interval, i.e., treating each set of STIBs in a size interval as a separate dataset. For each gene, we then computed the weighted average prevalence across all size intervals, weighting each size interval by the number of MAG-SGBs in that interval. This weighting adjustment is commonly performed in demographic surveys in which different strata are sampled at different proportions³². For comparisons of gene prevalence estimates between MAG-SGBs and RefSeq STIBs see Fig. 1, and Supplemental Figs. S8–S11.
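The size-stratified adjustment can be illustrated with the following Python sketch, which assumes that per-stratum prevalences across RefSeq STIBs have already been estimated. The function names are hypothetical, and the treatment of strata that contain MAG-SGBs but no RefSeq STIBs (they are skipped and the weights renormalized) is an assumption made here for illustration, since the text does not specify it.

```python
from collections import Counter


def stratum_index(genome_size_bp, width_bp=500_000):
    """Index of the 0.5 Mbp size interval containing a genome (0 for 0-0.5 Mbp, ...)."""
    return int(genome_size_bp // width_bp)


def size_adjusted_prevalence(mag_sizes_bp, stratum_prevalences, width_bp=500_000):
    """Weighted average of per-stratum RefSeq prevalences.

    mag_sizes_bp: estimated full genome sizes of the MAG-SGB representatives.
    stratum_prevalences: dict mapping stratum index -> prevalence estimated
        among the RefSeq STIBs falling in that stratum.
    Each stratum is weighted by the number of MAG-SGBs it contains.
    """
    weights = Counter(stratum_index(size, width_bp) for size in mag_sizes_bp)
    weighted_sum = 0.0
    weight_total = 0
    for stratum, n_mags in weights.items():
        if stratum not in stratum_prevalences:
            continue  # assumption: strata without any RefSeq STIBs are skipped
        weighted_sum += n_mags * stratum_prevalences[stratum]
        weight_total += n_mags
    if weight_total == 0:
        raise ValueError("no size interval is shared between MAG-SGBs and STIBs")
    return weighted_sum / weight_total


# Two MAG-SGBs in 0.5-1 Mbp, three in 2-2.5 Mbp, one in 5-5.5 Mbp (no STIBs there).
toy_sizes = [0.7e6, 0.9e6, 2.2e6, 2.4e6, 2.3e6, 5.1e6]
toy_prevalences = {1: 0.1, 4: 0.5}
adjusted = size_adjusted_prevalence(toy_sizes, toy_prevalences)
```

In this toy case the adjusted prevalence is (2 × 0.1 + 3 × 0.5) / 5 = 0.34.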

Estimating coverage biases

For the following analysis, MAGs have been deduplicated at the SGB level, i.e., we considered only one representative MAG per SGB. We say that a MAG “matches” a RefSeq genome if its ANI to that genome (as calculated using mash) was at least 95%. By “coverage” we mean the probability that a randomly chosen MAG matches a RefSeq genome. To investigate the coverages of MAGs, separately for each environment and depending on the presence or absence of specific genes (KEGG orthologs or eggNOGs), we proceeded as follows. For any given environment, let q denote the overall coverage, i.e., the probability that a randomly chosen MAG matches a RefSeq genome. For any given environment and gene, let q₀ and q₁ be the conditional probabilities that a randomly chosen MAG matches a RefSeq genome given that the population represented by the MAG lacked or had the gene, respectively. Note that a gene may be missing from an incomplete MAG even if the represented population had the gene. The two conditional probabilities q₀ and q₁ are a priori unknown, and correspond to the coverage of MAGs in the absence or presence, respectively, of the gene in the represented populations. For example, if q₀ > q₁ then this means that RefSeq is biased towards organisms lacking the gene, whereas if q₀ < q₁ RefSeq would be biased towards organisms having the gene. Note that by mathematical necessity either q₀ ≤ q ≤ q₁ or q₀ ≥ q ≥ q₁, i.e., q₀ and q₁ cannot be both above or both below the overall coverage q. Our first objective was to estimate the q₀ and q₁ based on our MAG dataset. 
For any given environment and gene, let ${M}_{0}^{0}$ be the set of MAGs in which the gene was not detected and which did not match any RefSeq genome, let ${M}_{0}^{1}$ be the set of MAGs in which the gene was not detected and which did match a RefSeq genome, let ${M}_{1}^{0}$ be the set of MAGs in which the gene was detected but which did not match any RefSeq genome, and let ${M}_{1}^{1}$ be the set of MAGs in which the gene was detected and which did match a RefSeq genome. As before, C_m denotes the completeness of the m-th MAG, and α denotes the probability that the population represented by a randomly selected MAG had the gene. Hence, for example, the probability of detecting a gene in a randomly chosen MAG together with that MAG matching a RefSeq genome is given by the product αC_mq₁. The probability of not detecting the gene in a randomly chosen MAG and that MAG not matching any RefSeq genome is given by the sum of the probabilities of two mutually exclusive events: either the represented population did not have the gene and its MAG did not match any RefSeq genome (probability (1−α)(1−q₀)), or the population did have the gene but the gene was missing from the MAG (due to incompleteness) and the MAG did not match any RefSeq genome (probability α(1−C_m)(1−q₁)). Similar arguments can be made for all other possible scenarios as well, eventually leading to the following expression for the likelihood of our data:

$$\begin{array}{l}L=\prod _{m\in {M}_{0}^{0}}\left[\left(1-\alpha \right)\left(1-{q}_{0}\right)+\alpha \left(1-{C}_{m}\right)\left(1-{q}_{1}\right)\right]\times \prod _{m\in {M}_{0}^{1}}\left[\left(1-\alpha \right){q}_{0}+\alpha \left(1-{C}_{m}\right){q}_{1}\right]\\ \times \prod _{m\in {M}_{1}^{0}}\alpha {C}_{m}\left(1-{q}_{1}\right)\times \prod _{m\in {M}_{1}^{1}}\alpha {C}_{m}{q}_{1}.\end{array}$$

(3)

The maximum likelihood estimates ${\widehat{q}}_{0}$ and ${\widehat{q}}_{1}$ were obtained by numerically maximizing the log-likelihood ln(L) in python and using the previously obtained maximum-likelihood estimate $\widehat{\alpha }$. To avoid inaccurate estimates of the conditional coverages q₀, q₁, we only considered genes detected in at least 100 MAGs and missing from at least 100 MAGs. Further, optimization of the likelihood failed for a small fraction of genes. Thus, the specific set of genes considered differed somewhat between environments (overview in Supplemental Table S4, details in file KOfam_gene_prevalences_and_biases.tsv.gz on Figshare⁴¹). For an overview of estimated conditional coverages see Supplemental Fig. S13.
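As an illustration of this fitting step, the Python sketch below maximizes the log-likelihood corresponding to Eq. (3) over q₀ and q₁ for a fixed α. It is a minimal sketch under stated assumptions and not the code used in this study: the function names are hypothetical, and the choice of optimizer (L-BFGS-B from SciPy with box constraints), the starting point and the rescaling of the log-likelihood by the number of MAGs are our own.

```python
import numpy as np
from scipy.optimize import minimize


def mean_neg_log_likelihood(q, alpha, completeness, detected, matched):
    """Negative log of Eq. (3), divided by the number of MAGs."""
    q0, q1 = q
    c = completeness
    # Probabilities of the four observable outcomes for each MAG.
    p_undetected_unmatched = (1 - alpha) * (1 - q0) + alpha * (1 - c) * (1 - q1)
    p_undetected_matched = (1 - alpha) * q0 + alpha * (1 - c) * q1
    p_detected_unmatched = alpha * c * (1 - q1)
    p_detected_matched = alpha * c * q1
    p = np.where(
        detected,
        np.where(matched, p_detected_matched, p_detected_unmatched),
        np.where(matched, p_undetected_matched, p_undetected_unmatched),
    )
    return -np.mean(np.log(np.clip(p, 1e-300, None)))


def fit_conditional_coverages(alpha, completeness, detected, matched):
    """Estimate (q0, q1) given a previously fitted prevalence alpha."""
    completeness = np.asarray(completeness, dtype=float)
    detected = np.asarray(detected, dtype=bool)
    matched = np.asarray(matched, dtype=bool)
    # Assumption: start both coverages at the overall fraction of matched MAGs.
    start = float(np.clip(matched.mean(), 0.01, 0.99))
    result = minimize(
        mean_neg_log_likelihood,
        x0=[start, start],
        args=(alpha, completeness, detected, matched),
        method="L-BFGS-B",
        bounds=[(1e-9, 1 - 1e-9)] * 2,
    )
    return float(result.x[0]), float(result.x[1])
```

Each factor in Eq. (3) is linear in q₀ and q₁, so the log-likelihood is concave in these two parameters for fixed α, which makes a local optimizer a reasonable choice for this sketch.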

To quantify the strength of bias for or against a gene in a way that facilitates comparison between environments, we defined the “coverage bias” as follows:

$$\begin{array}{l}\beta ={\rm{sign}}\left({q}_{1}-{q}_{0}\right)\cdot \left[1-\frac{\min \left({q}_{0},{q}_{1}\right)}{\max \left({q}_{0},{q}_{1}\right)}\right].\end{array}$$

(4)

The coverage bias β can also equivalently be written as follows:

$$\begin{array}{l}\beta =\left\{\begin{array}{ll}1-\frac{{q}_{0}}{{q}_{1}} & :{q}_{1}\ge {q}_{0}\\ \frac{{q}_{1}}{{q}_{0}}-1 & :{q}_{1}\le {q}_{0}\end{array}\right..\end{array}$$

(5)

Observe that β is always between −1 and 1, and that a positive (or negative) value implies a bias for (or against) the specific gene. A value of β = 0 implies that q₀ = q₁ = q and hence an absence of any bias related to the gene. A value of β = 1 implies that q₀ = 0, which means that the absence of the gene in an organism makes it improbable that the organism is represented in RefSeq (at ANI ≥95%). At the other extreme, a value of β = −1 implies that q₁ = 0, which means that the presence of the gene in an organism makes it improbable that the organism is represented in RefSeq. A useful property of β is that it depends only on the ratio q₁/q₀ and not on the overall coverage q or on the gene’s prevalence α, thus making it suitable for exploring the effects of the environment on gene-specific coverage biases. For example, if MAGs from populations having the gene are 3 times more likely to match a RefSeq genome than MAGs from populations lacking the gene (i.e., q₁ = 3q₀), then $\beta =1-\frac{1}{3}=2/3$ regardless of whether that environment in and of itself is strongly biased for or against in RefSeq, and regardless of the gene’s prevalence in that environment. The two-sided statistical significance of β was determined using parametric bootstrapping under the null model of zero bias (q₀ = q₁), i.e., by randomly re-generating gene presences/absences in the MAGs according to the fitted α and accounting for MAG completeness while ignoring MAG coverages. We mention that the above estimates are not adjusted for the differences in the genome size distributions in RefSeq-STIBs versus MAG-SGBs. For example, a q₁ greater than q₀ may be partly due to the fact that the presence of a given gene in a species will tend to correlate positively with the species’ genome size (all else being equal), which in turn will correlate positively with the inclusion of the species in RefSeq, thus increasing q₁ relative to q₀. For a summary of coverage biases see Fig. 4, Supplemental Fig. S16 and Supplemental Table S4.
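For concreteness, the coverage bias of Eq. (4) can be computed as in the following Python sketch. The function name `coverage_bias` is hypothetical, and returning 0 when q₀ = q₁ = 0 (where the ratio is undefined) is an assumption made here for convenience.

```python
def coverage_bias(q0, q1):
    """Coverage bias beta from Eq. (4), given conditional coverages q0 and q1."""
    if q0 == q1:
        return 0.0  # no bias; also covers the undefined case q0 = q1 = 0 (assumption)
    if q1 > q0:
        return 1.0 - q0 / q1  # bias for the gene, between 0 and 1
    return q1 / q0 - 1.0      # bias against the gene, between -1 and 0


# Threefold higher coverage for populations carrying the gene gives beta = 2/3,
# whatever the absolute coverage.
beta_low_coverage = coverage_bias(0.01, 0.03)
beta_high_coverage = coverage_bias(0.2, 0.6)
```

Both calls in the example return the same value, illustrating that β depends only on the ratio q₁/q₀.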

Kernel density estimates of genome size distributions

Gaussian kernel density estimates of the distribution of MAG full genome sizes (i.e., accounting for MAG incompleteness) or RefSeq genome sizes (attribute total_length) were computed using the KernelDensity function in the python package scikit-learn v1.0.2²²⁷. The optimal KDE bandwidth was determined separately for each environment, and separately for MAGs and RefSeq genomes, via 5-fold cross-validation using the function GridSearchCV in scikit-learn. The pool of bandwidths considered ranged from 0.001 up to 10 times the total data range. Optimized KDE bandwidths are listed in Supplemental Table S3.
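The bandwidth selection described above can be sketched in Python as follows, using the scikit-learn classes named in the text. This is an illustration under assumptions, not the code used in this study: the function name `fit_genome_size_kde` is hypothetical, the candidate bandwidths are interpreted as multiples (0.001 to 10) of the data range, and their number and logarithmic spacing are our own choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity


def fit_genome_size_kde(genome_sizes, n_candidates=30):
    """Fit a Gaussian KDE whose bandwidth is chosen by 5-fold cross-validation."""
    x = np.asarray(genome_sizes, dtype=float).reshape(-1, 1)
    data_range = float(x.max() - x.min())
    # Assumption: log-spaced candidates between 0.001 and 10 times the data range.
    bandwidths = np.geomspace(0.001 * data_range, 10 * data_range, n_candidates)
    search = GridSearchCV(
        KernelDensity(kernel="gaussian"),
        {"bandwidth": bandwidths},
        cv=5,  # scored by the held-out log-likelihood of KernelDensity
    )
    search.fit(x)
    return search.best_estimator_, float(search.best_params_["bandwidth"])


# Toy genome sizes in Mbp; real inputs would be the MAG or RefSeq genome sizes.
rng = np.random.default_rng(1)
toy_sizes_mbp = rng.normal(loc=4.0, scale=1.0, size=200)
kde, bandwidth = fit_genome_size_kde(toy_sizes_mbp)
log_density_at_mode = kde.score_samples([[4.0]])[0]
```

Note that `score_samples` returns log-densities, so the density itself is obtained by exponentiating the result.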

Data availability

All data have been previously published and are publicly available, as described in the Methods. Supplemental files relating to this analysis are available at Figshare⁴¹: MAG sources are given in file project_metadata.tsv, accession numbers for MAGs (where available) are given in file MAG_metadata_QF.tsv.gz, accession numbers for RefSeq genomes are given in file RefSeq_genome_metadata_QF.tsv.gz, analysis results for each gene (KEGG ortholog) are given in file KOfam_gene_prevalences_and_biases.tsv.gz, a table of all KOs found per MAG is given as file KOfams_vs_MAGs.tsv.gz and a table of all KOs found per RefSeq genome is given as file KOfams_vs_RefSeq_genomes.tsv.gz.

Code availability

All software used in this paper has been described in the Methods and is freely available online. A copy of our workflow (bash, python, R code files) is also available at Figshare⁴¹.

References

Overmann, J., Abt, B. & Sikorski, J. Present and future of culturing bacteria. Annual Review of Microbiology 71, 711–730 (2017).
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research 44, D733 (2016).
Bobay, L. M. & Ochman, H. Biological species are universal across life’s domains. Genome Biology and Evolution 9, 491–501 (2017).
Magnabosco, C., Moore, K., Wolfe, J. & Fournier, G. Dating phototrophic microbial lineages with reticulate gene histories. Geobiology 16, 179–189 (2018).
Louca, S. et al. Function and functional redundancy in microbial systems. Nature Ecology & Evolution 2, 936–943 (2018).
Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90 K prokaryotic genomes reveals clear species boundaries. Nature Communications 9, 5114 (2018).
Zhu, Q. et al. Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains bacteria and archaea. Nature Communications 10, 5477 (2019).
Royalty, T.M. & Steen, A.D. Quantitatively partitioning microbial genomic traits among taxonomic ranks across the microbial tree of life. mSphere 4 (2019).
Murray, C. S., Gao, Y. & Wu, M. Re-evaluating the evidence for a universal genetic boundary among microbial species. Nature Communications 12, 4059 (2021).
Powell, S. et al. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Research 42, D231–D239 (2014).
Stoddard, S. F., Smith, B. J., Hein, R., Roller, B. R. & Schmidt, T. M. rrnDB: improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development. Nucleic Acids Research 43, D593–D598 (2014).
Douglas, G. M. et al. PICRUSt2 for prediction of metagenome functions. Nature Biotechnology 38, 685–688 (2020).
Wemheuer, F. et al. Tax4Fun2: prediction of habitat-specific functional profiles and functional redundancy based on 16S rRNA gene sequences. Environmental Microbiome 15, 1–12 (2020).
Louca, S., Parfrey, L. W. & Doebeli, M. Decoupling function and taxonomy in the global ocean microbiome. Science 353, 1272–1277 (2016).
Wu, D. et al. A phylogeny-driven genomic encyclopaedia of bacteria and archaea. Nature 462, 1056–1060 (2009).
Louca, S. & Pennell, M. W. A general and efficient algorithm for the likelihood of diversification and discrete-trait evolutionary models. Systematic Biology 69, 545–556 (2020).
Tyson, G. W. et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428, 37–43 (2004).
Sharon, I. & Banfield, J. F. Genomes from metagenomics. Science 342, 1057–1058 (2013).
Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nature Microbiology 2, 1533–1542 (2017).
Chen, L. X., Anantharaman, K., Shaiber, A., Eren, A. M. & Banfield, J. F. Accurate and complete genomes from metagenomes. Genome Research 30, 315–333 (2020).
Konstantinidis, K. T. & Tiedje, J. M. Genomic insights that advance the species definition for prokaryotes. Proceedings of the National Academy of Sciences 102, 2567–2572 (2005).
Kim, M., Oh, H. S., Park, S. C. & Chun, J. Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. International Journal of Systematic and Evolutionary Microbiology 64, 346–351 (2014).
Shapiro, B.J. What microbial population genomics has taught us about speciation. In Polz, M.F. & Rajora, O.P. (eds.) Population Genomics: Microorganisms, 31–47 (Springer International Publishing, Cham, Switzerland, 2019).
Olm, M. R. et al. Consistent metagenome-derived metrics verify and delineate bacterial species boundaries. mSystems 5, e00731–19 (2020).
Lagkouvardos, I., Overmann, J. & Clavel, T. Cultured microbes represent a substantial fraction of the human and mouse gut microbiota. Gut Microbes 8, 493–503 (2017).
Zhang, Z., Wang, J., Wang, J., Wang, J. & Li, Y. Estimate of the sequenced proportion of the global prokaryotic genome. Microbiome 8, 1–9 (2020).
Aramaki, T. et al. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics 36, 2251–2252 (2019).
Mira, A., Ochman, H. & Moran, N. A. Deletional bias and the evolution of bacterial genomes. Trends in Genetics 17, 589–596 (2001).
Morris, J. J., Lenski, R. E. & Zinser, E. R. The Black Queen Hypothesis: evolution of dependencies through adaptive gene loss. MBio 3, e00036–12 (2012).
Giovannoni, S. J., Cameron Thrash, J. & Temperton, B. Implications of streamlining theory for microbial ecology. ISME Journal 8, 1553–1565 (2014).
Nayfach, S., Shi, Z. J., Seshadri, R., Pollard, K. S. & Kyrpides, N. C. New insights from uncultivated genomes of the global human gut microbiome. Nature 568, 505–510 (2019).
Gary, P.R. Adjusting for nonresponse in surveys. In Smart, J.C. (ed.) Higher Education: Handbook of Theory and Research, chap. 8, 411–449 (Springer, Dordrecht, Netherlands, 2007).
Maguire, F. et al. Metagenome-assembled genome binning methods with short reads disproportionately fail for plasmids and genomic islands. Microbial Genomics 6, mgen000436 (2020).
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research 47, D309–D314 (2019).
Abdel-Hamid, A.M., Solbiati, J.O., Cann, I.K.O., Sariaslani, S. & Gadd, G.M. Insights into lignin degradation and its potential industrial applications, vol. 82, chap. 1, 1–28 (Academic Press, 2013).
El-Bondkly, A.M. Sequence analysis of industrially important genes from Trichoderma. In Biotechnology and biology of Trichoderma, chap. 28, 377–392 (Elsevier, 2014).
Dawood, A. & Ma, K. Applications of microbial β-mannanases. Frontiers in Bioengineering and Biotechnology 8 (2020).
Khelaifia, S., Raoult, D. & Drancourt, M. A versatile medium for cultivating methanogenic archaea. PLOS ONE 8, e61563 (2013).
Khelaifia, S. et al. Aerobic culture of methanogenic archaea without an external source of hydrogen. European Journal of Clinical Microbiology & Infectious Diseases 35, 985–991 (2016).
Michał, B. et al. PhyMet2: a database and toolkit for phylogenetic and metabolic analyses of methanogens. Environmental Microbiology Reports 10, 378–382 (2018).
Albright, S. & Louca, S. Trait biases in microbial reference genomes. figshare, https://doi.org/10.6084/m9.figshare.c.6055004.v1 (2022).
Castelle, C. J. & Banfield, J. F. Major new microbial groups expand diversity and alter our understanding of the tree of life. Cell 172, 1181–1197 (2018).
Murray, A. E. et al. Roadmap for naming uncultivated archaea and bacteria. Nature Microbiology 5, 987–994 (2020).
Palleroni, N. J. Prokaryotic diversity and the importance of culturing. Antonie van Leeuwenhoek 72, 3–19 (1997).
Langille, M. G. et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nature Biotechnology 31, 814–821 (2013).
Tran, P. Q. et al. Depth-discrete metagenomics reveals the roles of microbes in biogeochemical cycling in the tropical freshwater Lake Tanganyika. The ISME Journal 15, 1971–1986 (2021).
Kroeger, M. E. et al. New biological insights into how deforestation in Amazonia affects soil microbial communities using metagenomics and metagenome-assembled genomes. Frontiers in Microbiology 9, 1635 (2018).
Nathani, N. M. et al. 309 metagenome assembled microbial genomes from deep sediment samples in the Gulfs of Kathiawar Peninsula. Scientific Data 8, 194 (2021).
Irazoqui, J. M., Eberhardt, M. F., Adjad, M. M., Amadio, A. F. & Collado, M. C. Identification of key microorganisms in facultative stabilization ponds from dairy industries, using metagenomics. PeerJ 10, e12772 (2022).
Hwang, Y. et al. Leave no stone unturned: individually adapted xerotolerant Thaumarchaeota sheltered below the boulders of the Atacama Desert hyperarid core. Microbiome 9, 234 (2021).
Tully, B., Wheat, C. G., Glazer, B. T. & Huber, J. A dynamic microbial community with high functional redundancy inhabits the cold, oxic subseafloor aquifer. ISME Journal 12, 1–16 (2018).
Vanwonterghem, I., Jensen, P. D., Rabaey, K. & Tyson, G. W. Genome-centric resolution of microbial diversity, metabolism and interactions in anaerobic digestion. Environmental Microbiology 18, 3144–3158 (2016).
Glasl, B. et al. Comparative genome-centric analysis reveals seasonal variation in the function of coral reef microbiomes. The ISME Journal 14, 1435–1450 (2020).
Robbins, S. J. et al. A genomic view of the reef-building coral Porites lutea and its microbial symbionts. Nature Microbiology 4, 2090–2100 (2019).
Engelberts, J. P. et al. Characterization of a sponge microbiome using an integrative genome-centric approach. The ISME Journal 14, 1100–1110 (2020).
Bowerman, K. L. et al. Disease-associated gut microbiome and metabolome changes in patients with chronic obstructive pulmonary disease. Nature Communications 11, 5886 (2020).
Chen, Y. J. et al. Hydrodynamic disturbance controls microbial community assembly and biogeochemical processes in coastal sediments. The ISME Journal 16, 750–763 (2022).
Hugerth, L. W. et al. Metagenome-assembled genomes uncover a global brackish microbiome. Genome Biology 16, 279 (2015).
Alneberg, J. et al. Ecosystem-wide metagenomic binning enables prediction of ecological niches from genomes. Communications Biology 3, 119 (2020).
Di Cesare, A. et al. Genomic comparison and spatial distribution of different Synechococcus phylotypes in the Black Sea. Frontiers in Microbiology 11, 1979 (2020).
van Vliet, D. M. et al. The bacterial sulfur cycle in expanding dysoxic and euxinic marine waters. Environmental Microbiology 23, 2834–2857 (2021).
Dalcin Martins, P. et al. Enrichment of novel Verrucomicrobia, Bacteroidetes, and Krumholzibacteria in an oxygen-limited methane- and iron-fed bioreactor inoculated with Bothnian Sea sediments. MicrobiologyOpen 10, e1175 (2021).
Stewart, R. D. et al. Compendium of 4,941 rumen metagenome-assembled genomes for rumen microbiome biology and enzyme discovery. Nature Biotechnology 37, 953–961 (2019).
Segura-Wang, M., Grabner, N., Koestelbauer, A., Klose, V. & Ghanbari, M. Genome-resolved metagenomics of the chicken gut microbiome. Frontiers in Microbiology 12, 726923 (2021).
Ruuskanen, M. O. et al. Microbial genomes retrieved from High Arctic lake sediments encode for adaptation to cold and oligotrophic environments. Limnology and Oceanography 65, S233–S247 (2020).
Haas, S., Desai, D. K., LaRoche, J., Pawlowicz, R. & Wallace, D. W. R. Geomicrobiology of the carbon, nitrogen and sulphur cycles in Powell Lake: a permanently stratified water column containing ancient seawater. Environmental Microbiology 21, 3927–3952 (2019).
Spasov, E. et al. High functional diversity among Nitrospira populations that dominate rotating biological contactor microbial communities in a municipal wastewater treatment plant. The ISME Journal 14, 1857–1872 (2020).
Vigneron, A. et al. Genomic evidence for sulfur intermediates as new biogeochemical hubs in a model aquatic microbial ecosystem. Microbiome 9, 46 (2021).
Galambos, D., Anderson, R. E., Reveillaud, J. & Huber, J. A. Genome-resolved metagenomics and metatranscriptomics reveal niche differentiation in functionally redundant microbial communities at deep-sea hydrothermal vents. Environmental Microbiology 21, 4395–4410 (2019).
Stewart, R. D. et al. Assembly of 913 microbial genomes from metagenomic sequencing of the cow rumen. Nature Communications 9, 870 (2018).
Xing, P. et al. Stratification of microbiomes during the holomictic period of Lake Fuxian, an alpine monomictic lake. Limnology and Oceanography 65, S134–S148 (2020).
Zhang, S., Hu, Z. & Wang, H. Metagenomic analysis exhibited the co-metabolism of polycyclic aromatic hydrocarbons by bacterial community from estuarine sediment. Environment International 129, 308–319 (2019).
Lin, Y., Wang, L., Xu, K., Li, K. & Ren, H. Revealing taxon-specific heavy metal-resistance mechanisms in denitrifying phosphorus removal sludge using genome-centric metaproteomics. Microbiome 9, 67 (2021).
Liu, L. et al. High-quality bacterial genomes of a partial-nitritation/anammox system by an iterative hybrid assembly method. Microbiome 8, 155 (2020).
Kantor, R. S. et al. Bioreactor microbial ecosystems for thiocyanate and cyanide degradation unravelled with genome-resolved metagenomics. Environmental Microbiology 17, 4929–4941 (2015).
Zhou, Z. et al. Gammaproteobacteria mediating utilization of methyl-, sulfur- and petroleum organic compounds in deep ocean hydrothermal plumes. The ISME Journal 14, 3136–3148 (2020).
Reysenbach, A. L. et al. Complex subsurface hydrothermal fluid mixing at a submarine arc volcano supports distinct and highly diverse microbial communities. Proceedings of the National Academy of Sciences 117, 32627–32638 (2020).
Hou, J. et al. Microbial succession during the transition from active to inactive stages of deep-sea hydrothermal vent sulfide chimneys. Microbiome 8, 102 (2020).
Campanaro, S. et al. Metagenomic analysis and functional characterization of the biogas microbiome using high throughput shotgun sequencing and a novel binning strategy. Biotechnology for Biofuels 9, 26 (2016).
Singleton, C. M. et al. Connecting structure to function with the recovery of over 1000 high-quality metagenome-assembled genomes from activated sludge using long-read sequencing. Nature Communications 12, 2009 (2021).
Diamond, S. et al. Mediterranean grassland soil C–N compound turnover is dependent on rainfall and depth, and is mediated by genomically divergent microorganisms. Nature Microbiology 4, 1356–1367 (2019).
Rasigraf, O. et al. Microbial community composition and functional potential in Bothnian Sea sediments is linked to Fe and S dynamics and the quality of organic matter. Limnology and Oceanography 65, S113–S133 (2020).
Rissanen, A. J. et al. Vertical stratification patterns of methanotrophs and their genetic controllers in water columns of oxygen-stratified boreal lakes. FEMS Microbiology Ecology 97, fiaa252 (2021).
Campanaro, S. et al. New insights from the biogas microbiome by comprehensive genome-resolved metagenomics of nearly 1600 species originating from multiple anaerobic digesters. Biotechnology for Biofuels 13, 25 (2020).
Almeida, A. et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nature Biotechnology 39, 105–114 (2021).
Zhou, Z. et al. Genome- and community-level interaction insights into carbon utilization and element cycling functions of Hydrothermarchaeota in hydrothermal sediment. mSystems 5 (2020).
Pachiadaki, M. G. et al. Charting the complexity of the marine microbiome through single-cell genomics. Cell 179, 1623–1635.e11 (2019).
Martijn, J., Vosseberg, J., Guy, L., Offre, P. & Ettema, T. J. G. Deep mitochondrial origin outside the sampled alphaproteobacteria. Nature 557, 101–105 (2018).
Greenlon, A. et al. Global-level population genomics reveals differential effects of geography and phylogeny on horizontal gene transfer in soil bacteria. Proceedings of the National Academy of Sciences 116, 15200–15209 (2019).
Hervé, V. et al. Phylogenomic analysis of 589 metagenome-assembled genomes encompassing all major prokaryotic lineages from the gut of higher termites. PeerJ 8, e8614 (2020).
von Appen, W. J. The expedition PS114 of the research vessel POLARSTERN to the Fram Strait in 2018. Tech. Rep., Alfred Wegener Institute for Polar and Marine Research (2018).
Dombrowski, N., Seitz, K. W., Teske, A. P. & Baker, B. J. Genomic insights into potential interdependencies in microbial hydrocarbon and nutrient cycling in hydrothermal sediments. Microbiome 5, 106 (2017).
Yu, J. et al. DNA-stable isotope probing shotgun metagenomics reveals the resilience of active microbial communities to biochar amendment in oxisol soil. Frontiers in Microbiology 11, 587972 (2020).
Forster, S. C. et al. A human gut bacterial genome and culture collection for improved metagenomic analyses. Nature Biotechnology 37, 186–192 (2019).
Gharechahi, J. et al. Metagenomic analysis reveals a dynamic microbiome with diversified adaptive functions to utilize high lignocellulosic forages in the cattle rumen. The ISME Journal 15, 1108–1120 (2021).
Meier, D. V., Imminger, S., Gillor, O. & Woebken, D. Distribution of mixotrophy and desiccation survival mechanisms across microbial genomes in an arid biological soil crust community. mSystems 6, e00786–20 (2021).
Haro-Moreno, J. M. et al. Dysbiosis in marine aquaculture revealed through microbiome analysis: reverse ecology for environmental sustainability. FEMS Microbiology Ecology 96, fiaa218 (2020).
Haro-Moreno, J. M. et al. Fine metagenomic profile of the Mediterranean stratified and mixed water columns revealed by assembly and recruitment. Microbiome 6, 128 (2018).
Dong, X. et al. Metabolic potential of uncultured bacteria and archaea associated with petroleum seepage in deep-sea sediments. Nature Communications 10, 1816 (2019).
Poghosyan, L. et al. Metagenomic profiling of ammonia- and methane-oxidizing microorganisms in two sequential rapid sand filters. Water Research 185, 116288 (2020).
Dalcin Martins, P., Frank, J., Mitchell, H., Markillie, L. M. & Wilkins, M. J. Wetland sediments host diverse microbial taxa capable of cycling alcohols. Applied and Environmental Microbiology 85, e00189–19 (2019).
Aromokeye, D. A. et al. Crystalline iron oxides stimulate methanogenic benzoate degradation in marine sediment-derived enrichment cultures. The ISME Journal 15, 965–980 (2021).
Borchert, E. et al. Deciphering a marine bone-degrading microbiome reveals a complex community effort. mSystems 6, e01218–20 (2021).
Osvatic, J. T. et al. Global biogeography of chemosynthetic symbionts reveals both localized and globally distributed symbiont groups. Proceedings of the National Academy of Sciences 118, e2104378118 (2021).
Boeuf, D. et al. Biological composition and microbial dynamics of sinking particulate organic matter at abyssal depths in the oligotrophic open ocean. Proceedings of the National Academy of Sciences 116, 11824–11832 (2019).
Woodcroft, B. J. et al. Genome-centric view of carbon processing in thawing permafrost. Nature 560, 49–54 (2018).
Alqahtani, M. F. et al. Enrichment of Marinobacter sp. and halophilic homoacetogens at the biocathode of microbial electrosynthesis system inoculated with Red Sea brine pool. Frontiers in Microbiology 10, 2563 (2019).
Haroon, M. F., Thompson, L. R., Parks, D. H., Hugenholtz, P. & Stingl, U. A catalogue of 136 microbial draft genomes from Red Sea metagenomes. Scientific Data 3, 160050 (2016).
Vavourakis, C. D. et al. A metagenomics roadmap to the uncultured genome diversity in hypersaline soda lake sediments. Microbiome 6, 1–18 (2018).
Cabello-Yeves, P. J. et al. Microbiome of the deep Lake Baikal, a unique oxic bathypelagic habitat. Limnology and Oceanography 65, 1471–1488 (2020).
Vavourakis, C. D. et al. Metagenomes and metatranscriptomes shed new light on the microbial-mediated sulfur cycle in a Siberian soda lake. BMC Biology 17, 69 (2019).
Waterworth, S. C., Isemonger, E. W., Rees, E. R., Dorrington, R. A. & Kwan, J. C. Conserved bacterial genomes from two geographically isolated peritidal stromatolite formations shed light on potential functional guilds. Environmental Microbiology Reports 13, 126–137 (2021).
Huddy, R. J. et al. Thiocyanate and organic carbon inputs drive convergent selection for specific autotrophic Afipia and Thiobacillus strains within complex microbiomes. Frontiers in Microbiology 12, 643368 (2021).
Emerson, J. B. et al. Diverse sediment microbiota shape methane emission temperature sensitivity in Arctic lakes. Nature Communications 12, 5815 (2021).
Chiri, E. et al. Termite gas emissions select for hydrogenotrophic microbial communities in termite mounds. Proceedings of the National Academy of Sciences 118, e2102625118 (2021).
Gong, G., Zhou, S., Luo, R., Gesang, Z. & Suolang, S. Metagenomic insights into the diversity of carbohydrate-degrading enzymes in the yak fecal microbial community. BMC Microbiology 20, 302 (2020).
Zhou, S. et al. Characterization of metagenome-assembled genomes and carbohydrate-degrading genes in the gut microbiota of Tibetan pig. Frontiers in Microbiology 11, 595066 (2020).
Tully, B. J., Graham, E. D. & Heidelberg, J. F. The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans. Scientific Data 5, 170203 (2018).
Lavrinienko, A. et al. Two hundred and fifty-four metagenome-assembled bacterial genomes from the bank vole gut microbiota. Scientific Data 7, 312 (2020).
Peng, X. et al. Genomic and functional analyses of fungal and bacterial consortia that enable lignocellulose breakdown in goat gut microbiomes. Nature Microbiology 6, 499–511 (2021).
Dudek, N. K. et al. Novel microbial diversity and functional potential in the marine mammal oral microbiome. Current Biology 27, 3752–3762.e6 (2017).
Pinto, A. J. et al. Metagenomic evidence for the presence of comammox Nitrospira-like bacteria in a drinking water system. mSphere 1, e00054–15 (2015).
Zaremba-Niedzwiedzka, K. et al. Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature 541, 353–358 (2017).
Nobu, M. K. et al. Catabolism and interactions of uncultured organisms shaped by eco-thermodynamics in methanogenic bioprocesses. Microbiome 8, 111 (2020).
Butterfield, C. N. et al. Proteogenomic analyses indicate bacterial methylotrophy and archaeal heterotrophy are prevalent below the grass root zone. PeerJ 4, e2687 (2016).
Castelle, C. J. et al. Protein family content uncovers lineage relationships and bacterial pathway maintenance mechanisms in DPANN Archaea. Frontiers in Microbiology 12, 660052 (2021).
Alteio, L. V. et al. Complementary metagenomic approaches improve reconstruction of microbial diversity in a forest soil. mSystems 5, e00768–19 (2020).
Shaiber, A. et al. Functional and genetic markers of niche partitioning among enigmatic members of the human oral microbiome. Genome Biology 21, 292 (2020).
Jungbluth, S. P., Amend, J. P. & Rappé, M. S. Metagenome sequencing and 98 microbial genomes from Juan de Fuca Ridge flank subsurface fluids. Scientific Data 4, 170037 (2017).
Sheik, C. S. et al. Dolichospermum blooms in Lake Superior: DNA-based approach provides insight to the past, present and future of blooms. Journal of Great Lakes Research 48, 1191–1205 (2022).
Barnum, T. P. et al. Genome-resolved metagenomics identifies genetic mobility, metabolic interactions, and unexpected diversity in perchlorate-reducing communities. The ISME Journal 12, 1568–1581 (2018).
Damashek, J. et al. Coastal ocean metagenomes and curated metagenome-assembled genomes from Marsh Landing, Sapelo Island (Georgia, USA). Microbiology Resource Announcements 8, e00934–19 (2019).
Breister, A. M. et al. Soil microbiomes mediate degradation of vinyl ester-based polymer composites. Communications Materials 1, 101 (2020).
Fu, H., Uchimiya, M., Gore, J. & Moran, M. A. Ecological drivers of bacterial community assembly in synthetic phycospheres. Proceedings of the National Academy of Sciences 117, 3656–3662 (2020).
Nobu, M. K. et al. Thermodynamically diverse syntrophic aromatic compound catabolism. Environmental Microbiology 19, 4576–4586 (2017).
Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662 (2019).
Nayfach, S. et al. A genomic catalog of Earth’s microbiomes. Nature Biotechnology 39, 499–509 (2021).
Li, Z. et al. Deep sea sediments associated with cold seeps are a subsurface reservoir of viral diversity. The ISME Journal 15, 2366–2378 (2021).
Bay, S. K. et al. Trace gas oxidizers are widespread and active members of soil microbial communities. Nature Microbiology 6, 246–256 (2021).
Seyler, L. M., Trembath-Reichert, E., Tully, B. J. & Huber, J. A. Time-series transcriptomics from cold, oxic subseafloor crustal fluids reveals a motile, mixotrophic microbial community. The ISME Journal 15, 1192–1206 (2021).
Herold, M. et al. Integration of time-series meta-omics data reveals how microbial ecosystems respond to disturbance. Nature Communications 11, 5281 (2020).
Dong, X. et al. Thermogenic hydrocarbon biodegradation by diverse depth-stratified microbial populations at a Scotian Basin cold seep. Nature Communications 11, 5825 (2020).
Thompson, L. R. et al. Metagenomic covariation along densely sampled environmental gradients in the Red Sea. The ISME Journal 11, 138–151 (2017).
Schneider, D., Zühlke, D., Poehlein, A., Riedel, K. & Daniel, R. Metagenome-assembled genome sequences from different wastewater treatment stages in Germany. Microbiology Resource Announcements 10, e00504–21 (2021).
Langwig, M. V. et al. Large-scale protein level comparison of Deltaproteobacteria reveals cohesive metabolic groups. The ISME Journal 16, 307–320 (2022).
Rezaei Somee, M. et al. Distinct microbial community along the chronic oil pollution continuum of the Persian Gulf converge with oil spill accidents. Scientific Reports 11, 11316 (2021).
Gilroy, R. et al. Metagenomic investigation of the equine faecal microbiome reveals extensive taxonomic diversity. PeerJ 10, e13084 (2022).
Bhattarai, B., Bhattacharjee, A. S., Coutinho, F. H. & Goel, R. K. Viruses and their interactions with bacteria and archaea of hypersaline Great Salt Lake. Frontiers in Microbiology 12, 701414 (2021).
Liu, L. et al. Microbial diversity and adaptive strategies in the Mars-like Qaidam Basin, North Tibetan Plateau, China. Environmental Microbiology Reports (2022).
Lin, H. et al. Mercury methylation by metabolically versatile and cosmopolitan marine bacteria. The ISME Journal 15, 1810–1825 (2021).
Martínez-Pérez, C. et al. Lifting the lid: nitrifying archaea sustain diverse microbial communities below the Ross Ice Shelf. SSRN (2020).
Zhang, L. et al. Metagenomic insights into the effect of thermal hydrolysis pre-treatment on microbial community of an anaerobic digestion system. Science of The Total Environment 791, 148096 (2021).
Starr, E. P. et al. Stable-isotope-informed, genome-resolved metagenomics uncovers potential cross-kingdom interactions in rhizosphere soil. mSphere 6, e00085–21 (2021).
Crossfield, M. et al. Archaeal and bacterial metagenome-assembled genome sequences derived from pig feces. Microbiology Resource Announcements 11, e01142–21 (2022).
Wang, Y., Zhao, R., Liu, L., Li, B. & Zhang, T. Selective enrichment of comammox from activated sludge using antibiotics. Water Research 197, 117087 (2021).
Gilroy, R. et al. Extensive microbial diversity within the chicken gut microbiome revealed by metagenomics and culture. PeerJ 9, e10941 (2021).
Chen, Y. H. et al. Salvaging high-quality genomes of microbial species from a meromictic lake using a hybrid sequencing approach. Communications Biology 4, 996 (2021).
Beach, N. K., Myers, K. S., Donohue, T. J. & Noguera, D. R. Metagenomes from 25 low-abundance microbes in a partial nitritation anammox microbiome. Microbiology Resource Announcements 11, e00212–22 (2022).
Solanki, V. et al. Glycoside hydrolase from the GH76 family indicates that marine Salegentibacter sp. Hel_I_6 consumes alpha-mannan from fungi. The ISME Journal 16, 1818–1830 (2022).
Hiraoka, S. et al. Diverse DNA modification in marine prokaryotic and viral communities. Nucleic Acids Research 50, 1531–1550 (2022).
Haryono, M. A. S. et al. Recovery of high quality metagenome-assembled genomes from full-scale activated sludge microbial communities in a tropical climate using longitudinal metagenome sampling. Frontiers in Microbiology 13 (2022).
Rodríguez-Ramos, J. A. et al. Microbial genome-resolved metaproteomic analyses frame intertwined carbon and nitrogen cycles in river hyporheic sediments. Research Square (2021).
Kim, M., Cho, H. & Lee, W. Y. Distinct gut microbiotas between southern elephant seals and Weddell seals of Antarctica. Journal of Microbiology 58, 1018–1026 (2020).
Voorhies, A. A. et al. Cyanobacterial life at low O2: community genomics and function reveal metabolic versatility and extremely low diversity in a Great Lakes sinkhole mat. Geobiology 10, 250–267 (2012).
McDaniel, E. A. et al. TbasCO: trait-based comparative ‘omics identifies ecosystem-level and niche-differentiating adaptations of an engineered microbiome. ISME Communications 2, 111 (2022).
Wang, W. et al. Contrasting bacterial and archaeal distributions reflecting different geochemical processes in a sediment core from the Pearl River Estuary. AMB Express 10, 16 (2020).
Mandakovic, D. et al. Genome-scale metabolic models of Microbacterium species isolated from a high altitude desert environment. Scientific Reports 10, 5560 (2020).
Wang, Y. et al. Seasonal prevalence of ammonia-oxidizing archaea in a full-scale municipal wastewater treatment plant treating saline wastewater revealed by a 6-year time-series analysis. Environmental Science & Technology 55, 2662–2673 (2021).
Bulzu, P. A. et al. Casting light on Asgardarchaeota metabolism in a sunlit microoxic niche. Nature Microbiology 4, 1129–1137 (2019).
Jordaan, K. et al. Hydrogen-oxidizing bacteria are abundant in desert soils and strongly stimulated by hydration. mSystems 5, e01131–20 (2020).
Rust, M. et al. A multiproducer microbiome generates chemical diversity in the marine sponge Mycale hentscheli. Proceedings of the National Academy of Sciences 117, 9508–9518 (2020).
Podowski, J. C., Paver, S. F., Newton, R. J. & Coleman, M. L. Genome streamlining, proteorhodopsin, and organic nitrogen metabolism in freshwater nitrifiers. mBio 13, e02379–21 (2022).
Coutinho, F. H. et al. New viral biogeochemical roles revealed through metagenomic analysis of Lake Baikal. Microbiome 8, 163 (2020).
Philippi, M. et al. Purple sulfur bacteria fix N2 via molybdenum-nitrogenase in a low molybdenum Proterozoic ocean analogue. Nature Communications 12, 4774 (2021).
Sipes, K. et al. Eight metagenome-assembled genomes provide evidence for microbial adaptation in 20,000- to 1,000,000-year-old Siberian permafrost. Applied and Environmental Microbiology 87, e00972–21 (2021).
Kanik, M. et al. Unexpected abundance and diversity of phototrophs in mats from morphologically variable microbialites in Great Salt Lake, Utah. Applied and Environmental Microbiology 86, e00165–20 (2020).
Patin, N. V. et al. Gulf of Mexico blue hole harbors high levels of novel microbial lineages. The ISME Journal 15, 2206–2232 (2021).
Wang, J., Tang, X., Mo, Z. & Mao, Y. Metagenome-assembled genomes from Pyropia haitanensis microbiome provide insights into the potential metabolic functions to the seaweed. Frontiers in Microbiology 13, 857901 (2022).
Burgsdorf, I. et al. Lineage-specific energy and carbon metabolism of sponge symbionts and contributions to the host carbon pool. The ISME Journal 16, 1163–1175 (2022).
Suarez, C. et al. Disturbance-based management of ecosystem services and disservices in partial nitritation-anammox biofilms. npj Biofilms and Microbiomes 8, 47 (2022).
Kumar, D. et al. Textile industry wastewaters from Jetpur, Gujarat, India, are dominated by Shewanellaceae, Bacteroidaceae, and Pseudomonadaceae harboring genes encoding catalytic enzymes for textile dye degradation. Frontiers in Environmental Science 9, 720707 (2021).
Seitz, V. A. et al. Variation in root exudate composition influences soil microbiome membership and function. Applied and Environmental Microbiology 88, e00226–22 (2022).
Lindner, B. G. et al. Toward shotgun metagenomic approaches for microbial source tracking sewage spills based on laboratory mesocosms. Water Research 210, 117993 (2022).
Yancey, C. E. et al. Metagenomic and metatranscriptomic insights into population diversity of Microcystis blooms: spatial and temporal dynamics of mcy genotypes, including a partial operon that can be abundant and expressed. Applied and Environmental Microbiology 88, e02464–21 (2022).
Liu, L. et al. Charting the complexity of the activated sludge microbiome through a hybrid sequencing strategy. Microbiome 9, 205 (2021).
Speth, D. R. et al. Microbial communities of Auka hydrothermal sediments shed light on vent biogeography and the evolutionary history of thermophily. The ISME Journal 16, 1750–1764 (2022).
Blyton, M. D. J., Soo, R. M., Hugenholtz, P. & Moore, B. D. Maternal inheritance of the koala gut microbiome and its compositional and functional maturation during juvenile development. Environmental Microbiology 24, 475–493 (2022).
Nuccio, E. E. et al. Niche differentiation is spatially and temporally regulated in the rhizosphere. The ISME Journal 14, 999–1014 (2020).
Jaffe, A. L. et al. Long-term incubation of lake water enables genomic sampling of consortia involving planctomycetes and candidate phyla radiation bacteria. mSystems 7, e00223–22 (2022).
Cabral, L. et al. Gut microbiome of the largest living rodent harbors unprecedented enzymatic systems to degrade plant polysaccharides. Nature Communications 13, 629 (2022).
Blyton, M. D. J., Soo, R. M., Hugenholtz, P. & Moore, B. D. Characterization of the juvenile koala gut microbiome across wild populations. Environmental Microbiology 24, 4209–4219 (2022).
Xu, B. et al. A holistic genome dataset of bacteria, archaea and viruses of the Pearl River estuary. Scientific Data 9, 49 (2022).
Royo-Llonch, M. et al. Compendium of 530 metagenome-assembled bacterial and archaeal genomes from the polar Arctic Ocean. Nature Microbiology 6, 1561–1574 (2021).
Sun, J., Prabhu, A., Aroney, S. T. N. & Rinke, C. Insights into plastic biodegradation: community composition and functional capabilities of the superworm (Zophobas morio) microbiome in styrofoam feeding trials. Microbial Genomics 8, 000842 (2022).
Kim, M. et al. Higher pathogen load in children from Mozambique vs. USA revealed by comparative fecal microbiome profiling. ISME Communications 2, 74 (2022).
Kelly, J. B., Carlson, D. E., Low, J. S. & Thacker, R. W. Novel trends of genome evolution in highly complex tropical sponge microbiomes. Microbiome 10, 164 (2022).
Bray, M. S. et al. Phylogenetic and structural diversity of aromatically dense pili from environmental metagenomes. Environmental Microbiology Reports 12, 49–57 (2020).
Cabello-Yeves, P. J. et al. α-cyanobacteria possessing form IA RuBisCO globally dominate aquatic habitats. The ISME Journal 16, 2421–2432 (2022).
Berben, T. et al. The Polar Fox Lagoon in Siberia harbours a community of Bathyarchaeota possessing the potential for peptide fermentation and acetogenesis. Antonie van Leeuwenhoek 115, 1229–1244 (2022).
Tamburini, F. B. et al. Short- and long-read metagenomics of urban and rural South African gut microbiomes reveal a transitional composition and undescribed taxa. Nature Communications 13, 926 (2022).
Kantor, R. S., Miller, S. E. & Nelson, K. L. The water microbiome through a pilot scale advanced treatment facility for direct potable reuse. Frontiers in Microbiology 10, 993 (2019).
Muratore, D. et al. Complex marine microbial communities partition metabolism of scarce resources over the diel cycle. Nature Ecology & Evolution 6, 218–229 (2022).
Zhou, Y. L., Mara, P., Cui, G. J., Edgcomb, V. P. & Wang, Y. Microbiomes in the Challenger Deep slope and bottom-axis sediments. Nature Communications 13, 1515 (2022).
Zhang, H. et al. Metagenome sequencing and 768 microbial genomes from cold seep in South China Sea. Scientific Data 9, 480 (2022).
Zhuang, J. L., Zhou, Y. Y., Liu, Y. D. & Li, W. Flocs are the main source of nitrous oxide in a high-rate anammox granular sludge reactor: insights from metagenomics and fed-batch experiments. Water Research 186, 116321 (2020).
Shiffman, M. E. et al. Gene and genome-centric analyses of koala and wombat fecal microbiomes point to metabolic specialization for eucalyptus digestion. PeerJ 5, e4075 (2017).
Murphy, S. M. C., Bautista, M. A., Cramm, M. A. & Hubert, C. R. J. Diesel and crude oil biodegradation by cold-adapted microbial communities in the Labrador Sea. Applied and Environmental Microbiology 87, e00800–21 (2021).
Suarez, C. et al. Metagenomic evidence of a novel family of anammox bacteria in a subsea environment. Environmental Microbiology 24, 2348–2360 (2022).
Dharamshi, J. E. et al. Genomic diversity and biosynthetic capabilities of sponge-associated chlamydiae. The ISME Journal (2022).
Plaza Oñate, F., Roume, H. & Almeida, M. Recovery of metagenome-assembled genomes from a human fecal sample with Pacific Biosciences high-fidelity sequencing. Microbiology Resource Announcements 11, e00250–22 (2022).
Bloom, S. M. et al. Cysteine dependence of Lactobacillus iners is a potential therapeutic target for vaginal microbiota modulation. Nature Microbiology 7, 434–450 (2022).
Aylward, F. O. et al. Diel cycling and long-term persistence of viruses in the ocean’s euphotic zone. Proceedings of the National Academy of Sciences 114, 11446–11451 (2017).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research 25, 1043–1055 (2015).
Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nature Biotechnology 35, 725 (2017).
Chaumeil, P. A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2020).
Louca, S. The rates of global bacterial and archaeal dispersal. ISME Journal 16, 159–167 (2021).
Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using minhash. Genome Biology 17, 132 (2016).
Müllner, D. fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software 53, 1–18 (2013).
Kinene, T., Wainaina, J., Maina, S., Boykin, L. M. & Kliman, R. M. Methods for rooting trees, vol. 3, 489–493 (Academic Press, Oxford, 2016).
Louca, S. & Doebeli, M. Efficient comparative phylogenetics on large trees. Bioinformatics 34, 1053–1055 (2018).
Rees, J. A. & Cranston, K. Automated assembly of a reference taxonomy for phylogenetic data synthesis. Biodiversity Data Journal 5, e12581 (2017).
Heck, K. et al. Evaluating methods for purifying cyanobacterial cultures by qPCR and high-throughput Illumina sequencing. Journal of Microbiological Methods 129, 55–60 (2016).
Cornet, L. et al. Consensus assessment of the contamination level of publicly available cyanobacterial genomes. PLOS ONE 13, e0200323 (2018).
Alneberg, J. et al. Genomes from uncultivated prokaryotes: a comparison of metagenome-assembled and single-amplified genomes. Microbiome 6, 173 (2018).
Eddy, S. R. Accelerated profile HMM searches. PLoS Computational Biology 7, e1002195 (2011).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 59–60 (2014).
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).

Acknowledgements

S.L. was supported by a startup grant from the University of Oregon and by an Alfred P. Sloan Research Fellowship.

Author information

Authors and Affiliations

Department of Biology, University of Oregon, Eugene, USA
Sage Albright & Stilianos Louca
Institute of Ecology and Evolution, University of Oregon, Eugene, USA
Stilianos Louca

Authors

Sage Albright
Stilianos Louca

Contributions

Both authors contributed to the data compilation, analyses and manuscript writing. Correspondence and requests for materials should be addressed to S.L.

Corresponding author

Correspondence to Stilianos Louca.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Albright, S., Louca, S. Trait biases in microbial reference genomes. Sci Data 10, 84 (2023). https://doi.org/10.1038/s41597-023-01994-7

Received: 23 June 2022
Accepted: 31 January 2023
Published: 09 February 2023
DOI: https://doi.org/10.1038/s41597-023-01994-7
