Global metadata distribution of ruminant microbiome samples
A dataset comprising metadata for 47,628 samples was obtained from ten farmed ruminant species (Fig. 1). Cattle (including Bos taurus and Bos taurus indicus) represented 71.2% of the samples, followed by sheep at 18.9%. Other species were goat (3.9%), yak (2.7%), and buffalo (2.1%). The remaining samples (~1.2%) were from four camelid species and from bison. Samples from live animals dominated over those from in vitro experiments (93.5% vs. 6.5%, respectively). Present estimates put the worldwide farmed ruminant population at about 4.2 × 10⁹ head, including yak [31] and bison [32] populations that are not counted in FAOSTAT [29]. Cattle (36.45%), sheep (30.17%) and goats (26.95%) account for the largest populations, followed by buffalo (4.86%), camelids (0.92% Old World and 0.21% New World), yak (0.42%) and bison (0.01%). Comparing the proportion of samples with the head numbers of each ruminant species, in order to identify gaps in the global research effort for some ruminant populations relative to others, is open to criticism. Factors such as economic and regional importance should be considered for a finer interpretation, and the choice between head numbers and livestock units [33] also changes the results. Nevertheless, samples from sheep, goats and buffalo seem clearly underrepresented. This is even more evident considering that these three species are particularly abundant and economically important in African and Asian countries [34], which have a low overall contribution of samples (see below).
Figure 1.
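The comparison between sample shares and head counts described above can be illustrated with a simple ratio. The sketch below is only an illustration: the index (share of samples divided by share of the global population) and the function name are assumptions introduced here, not the metric used in this study, and only the percentages quoted in the text are used.

```python
# Shares (%) quoted in the text: (share of samples, share of global head count).
SHARES = {
    "cattle": (71.2, 36.45),
    "sheep": (18.9, 30.17),
    "goat": (3.9, 26.95),
    "yak": (2.7, 0.42),
    "buffalo": (2.1, 4.86),
}


def representation_ratio(sample_pct: float, population_pct: float) -> float:
    """Hypothetical index: values below 1 suggest fewer samples than
    expected from head count alone, values above 1 suggest more."""
    if population_pct <= 0:
        raise ValueError("population share must be positive")
    return sample_pct / population_pct


ratios = {
    species: representation_ratio(sample_pct, population_pct)
    for species, (sample_pct, population_pct) in SHARES.items()
}

# Print species from least to most represented relative to head count.
for species, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{species:8s} {ratio:5.2f}")
```

With these figures, goat, buffalo and sheep fall below 1, in line with the underrepresentation noted above. Replacing head counts with livestock units would change the denominators and therefore the ratios.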
Geographic location was a frequent metadata attribute that allowed us to identify the country of origin of the sample. We identified a total of 52 countries with China, the USA and Canada contributing more than half of the samples. Other countries contributing more than 1% of the samples were nine European countries, New Zealand, Australia, Israel, Brazil, and Japan. The remaining 31 countries contributed 5.6% of the samples (Supplementary Table 1).
For a better understanding of the microbiome metadata representation, and given that cattle and sheep represent about 90% of the total samples, we analyzed the data separately for each of these two species. We then used the geographic location attribute, along with information on cattle populations in countries worldwide, to evaluate the representativeness of sampling efforts on a global scale. To obtain a clear picture, we filtered the dataset by removing in vitro samples and countries that had a low number of samples (< 10). The latest available estimate of the worldwide cattle population is 1.53 × 10⁹ head [29]. One animal out of four in the world is found in only two countries, Brazil and India. Other countries with large cattle populations are the USA (6.1%), Ethiopia (4.3%), China (3.9%), Argentina (3.5%), Pakistan (3.4%), Mexico (2.4%), Chad (2.2%) and Sudan (2.1%). However, the samples mainly originated from the USA (25.4%), Canada (13.2%), China (12.1%), Austria (6.5%), the UK (5.9%), and Israel (5.1%) (Supplementary Fig. 1). Countries with a low to moderate cattle population, for example, Israel, Austria, Denmark, Finland, Sweden, Canada, Japan, and the UK, were overrepresented. In contrast, out of the 25 countries with the largest cattle populations, 21 were underrepresented (Fig. 2A). Furthermore, out of the 190 countries reported with cattle populations, 144 have zero samples reported in this database.
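The filtering step described above (removing in vitro samples and countries with fewer than 10 samples) can be sketched as follows. This is a minimal illustration, not the code used in this study: the record keys `country` and `in_vitro` are hypothetical, and computing shares over the retained samples only is an assumption.

```python
from collections import Counter


def country_sample_shares(records, min_samples=10):
    """Return each country's percentage of in vivo samples.

    `records` is an iterable of dicts with the hypothetical keys
    'country' (str) and 'in_vitro' (bool). Countries with fewer than
    `min_samples` in vivo samples are dropped before shares are computed.
    """
    counts = Counter(r["country"] for r in records if not r["in_vitro"])
    kept = {country: n for country, n in counts.items() if n >= min_samples}
    total = sum(kept.values())
    return {country: 100.0 * n / total for country, n in kept.items()}


# Toy records for illustration only (not data from this study).
toy_records = (
    [{"country": "A", "in_vitro": False}] * 30
    + [{"country": "A", "in_vitro": True}] * 20
    + [{"country": "B", "in_vitro": False}] * 10
    + [{"country": "C", "in_vitro": False}] * 5
)
toy_shares = country_sample_shares(toy_records)
```

In this toy example, country C is dropped because it has only five in vivo samples, and the in vitro samples from country A do not count toward its share.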
As for cattle, the geographic location attribute and information on the worldwide sheep population were analyzed. Our results showed that, although the sheep populations of the USA, Canada, New Zealand, and Ireland together did not exceed 3% of the total, 55% of the sheep microbiome samples originated from these four countries. Consequently, these countries were overrepresented (Fig. 2B). China has the largest sheep population in the world (13.7%) and accounts for 32.3% of the total samples; hence, China, along with the UK and France, was considered well represented. In contrast, seven of the ten countries with the largest sheep populations (all except China, India, and the UK) did not register any samples (Supplementary Fig. 2). Likewise, India, Brazil, South Africa, Spain, and Egypt were ranked as the most underrepresented countries. Remarkably, out of the 173 countries reported with sheep populations, 127 have zero samples reported in this database.
Figure 2.
Sample metadata information from the three most abundant ruminant species
Regarding the body site of origin, the vast majority of samples (~87%) came from the gastrointestinal tract (GIT), particularly from the rumen and feces, and these samples were present in all ten ruminant species. Other body sites and biological matrices represented about 13% of the samples. In decreasing order of sample count, these were the respiratory system, milk, fetal tissue, skin, and reproductive system categories (Table 2). Samples from body sites other than the gut and feces were mainly found in cattle and sheep. Minor categories each represented less than 1% of the total samples (listed in Supplementary Table 2).
Table 2
Sample metadata distribution by body site and ruminant species
Category/subcategory | Sample count | % of total samples | Ruminant species
Gut | 30452 | 63.9 | C, S, G, Y, Bu, BC, DC, A, LL and Bi |
Esophageal | 5 | | C |
Rumen | 26652 | | C, S, G, Y, Bu, BC, DC, A, LL and Bi |
Reticulum | 131 | | C, S, G, Y and Bu |
Omasum | 150 | | C, S, G, Y and Bu |
Abomasum | 252 | | C, S, G, Y, Bu and BC |
Duodenum | 374 | | C, S, G, Y, Bu and A |
Jejunum | 567 | | C, S, G, Y, Bu and A |
Ileum | 496 | | C, S, G, Y, Bu, BC and A |
Cecum | 405 | | C, S, G, Y, Bu and A |
Colon | 658 | | C, S, G, Y, and Bu |
Rectum | 525 | | C, S, G, Y, Bu and DC |
Anus | 73 | | S and G |
Gut* | 164 | | C, BC, and DC |
Feces | 10825 | 22.7 | C, S, G, Y, Bu, BC, DC, A, LL and Bi |
Respiratory system | 1759 | 3.7 | C, S, Y and DC |
Milk | 1389 | 2.9 | C, S and Bu |
Fetal tissue | 1001 | 2.1 | C and S |
Skin | 752 | 1.6 | C and S |
Reproductive system | 624 | 1.3 | C and S |
*Sample metadata tagged as gut.
C = Cattle, S = Sheep, G = Goat, Y = Yak, Bu = Buffalo, BC = Bactrian camel, DC = Dromedary camel, A = Alpaca, LL = Llama, and Bi = Bison.
Cattle represented 71% of all sample metadata, and body site was the attribute for which the information was most complete. However, the information was not straightforward to obtain, and it was only recovered after refining the search on the attribute "description" of the bioproject or by manually searching the associated publications. We found 13 categories for the body site attribute. The categories Gut and Feces were again dominant, representing about 8 out of 10 samples (Fig. 3A). Other relevant categories were: respiratory system, fetal tissue, milk, reproductive system, skin, liver, oral, mammary gland, blood, eye and musculoskeletal system (Supplementary Table 3). Breed is an important descriptor in any animal study, but it was not reported in the majority of sample metadata (57.3%). In spite of the limited availability of breed attribute data, Holstein was the dominant breed (70.0%), followed by Aberdeen Angus, Angus × Hereford crossbreed, Holstein × Jersey crossbreed, and Japanese Black (which refers mainly to the Wagyu breed) (Fig. 3B). As with breed, attributes that are fundamental for the reusability and reinterpretation of sequencing data, such as production system, age, and sex, were poorly completed. No information on these attributes was found in 40 to 58% of the samples. The available data should be interpreted with caution, but there is a predominance of sample metadata from dairy versus meat production systems (74% vs. 26%, respectively) (Fig. 3C), which is the opposite of the global cattle structure, 17% for dairy cattle and 83% for beef cattle [29, 35]. Furthermore, samples from adult animals outnumbered those from calves, although the two categories can be considered reasonably balanced (Fig. 3D). The female category (Fig. 3E) was more abundant than the male category, which is consistent with the sex ratio in commercial cattle herds.
Figure 3.
For sheep, metadata for a total of 9,003 samples were found. As in cattle, the gut and feces categories of the body site predominated (90.9%) over the other categories (Fig. 4A) (Supplementary Table 3). Likewise, for the breed attribute, there was a high percentage of missing data (56.3%). We found a total of 31 breeds, and the most abundant were the Lacaune (20.2%), Romney (14.5%), and Hu sheep (14.0%) breeds. Most breeds were poorly represented (< 1%) (Fig. 4B). Finally, for the attributes age and sex, although there was a high percentage of samples with missing data, lambs and adults were the most represented categories (Fig. 4C), and similar proportions were observed for males and females (Fig. 4D).
Figure 4.
Goat results showed only two body site categories, gut and feces (Supplementary Table 3). Although 29 breeds were identified, about 50% of the samples lacked this attribute (Supplementary Table 4). The predominant breeds were: Liuyang black, Boer, Black fattening, and Xiangdong black. Approximately half of the breeds that were reported in the metadata have a Chinese origin, as 90% of the samples originated from China (Supplementary Table 5). Seventeen other countries registered samples, but together they accounted for less than 10% of the samples. We found no or few samples from countries with large populations of goats (e.g., India, Nigeria, Pakistan, Bangladesh, and Ethiopia). Finally, although the kids and female categories predominated in the age and sex attributes, respectively, there was a high percentage of missing data (45.5 to 67.8%) (Supplementary Tables 6 and 7).
Sample metadata information from minor ruminant species
Outside of the major ruminant species, the share of total samples from other ruminants (yak, buffalo, camels, New World camelids, and bison) was in line with their share of the worldwide population (~6%). The gut and feces categories were the most prevalent in all seven of these ruminant species (Supplementary Table 8). Likewise, some respiratory system and milk samples were reported from yaks, camels, and buffaloes. Sample metadata originated mainly from the Asian continent (91%). China and India had the largest number of samples (Supplementary Table 9); China stood out for the number of yak samples (1,280 samples), and the two countries together contributed 916 buffalo samples. For the Dromedary camel, India, Egypt, Iran, and other countries contributed 151, 108, 44, and 11 samples, respectively. There were 79 samples from Bactrian camels originating from Russia, China, Italy, and Denmark. Likewise, for bison, 58 samples were reported from the USA, Canada, and Mexico. Notably, for New World camelids most samples were from outside the main geographic area of production and origin. There were 123 alpaca samples from the USA and New Zealand, and only eight llama samples, six from Argentina and two from France.
Database representation and FAIR principles
Our results, based on the number of scientific papers (Fig. 5A) and sample metadata evolution (Fig. 5B), suggest a growing interest in ruminant microbiome studies with the aim of understanding the function of the holobiont organism and its linkages with animal health, production efficiency, and environmental impact [11, 13]. Additionally, advances in high-throughput sequencing technologies, and reductions in their cost, have contributed to the increased data volume in the last decade [10]. The results indicate that the GIT is the most studied body site in farmed ruminants (Supplementary Fig. 3). This is explained by the importance of the GIT microbiota to the major challenges facing ruminant production, namely reducing greenhouse gas emissions, increasing feed efficiency, and preserving animal health [36–38]. In addition, the number of samples from the respiratory tract, milk, skin, reproductive tract, and fetal tissue has increased exponentially over the past decade, reflecting the increased interest in better understanding how resident microbiota are associated with health problems, such as mastitis [39], lameness [40] and respiratory disease [41].
Figure 5.
The quality and depth of the microbiome data from farmed ruminants are steadily improving, allowing us to explore their connection to essential biological processes relevant to production and health. Several projects and international initiatives [e.g., 22,23] are contributing data and expanding knowledge of the ruminant microbiome. However, the existing metadata and samples mainly originate from production systems prevalent in high-income countries, and there are still many regions with large ruminant populations where metadata are scarce or nonexistent, e.g., countries from South America and the Caribbean, Western Asia, Eastern Europe, and the African continent.
It is, therefore, urgent to rethink and encourage ruminant microbiome studies in underrepresented countries worldwide. It is imperative to obtain information from indigenous breeds and less represented ruminants reared under harsh environmental conditions from low- and middle-income countries where they contribute to food security [7]. These regions are where ruminant populations are increasing and where ruminants contribute the most to the economic and environmental sustainability (adaptation and mitigation to climate change) of local human populations. We also consider that the vast but underexplored genetic diversity of ruminant microbiomes could be mined for the discovery of new genes and potentially valuable new microbial products for the biotechnology industry [42]. Finally, a better understanding of pathogenic microbes and their interactions with other microbiomes in ruminants and their environment may contribute not only to the development of healthy and sustainable livestock, but also to improved public health following the “One Health” approach [43, 44].
A main result of this study was the poor quality of the available metadata. For instance, there was no global consensus for the taxonomic assignment of the sample metadata to a ruminant species, since much of the data were manually retrieved from generic taxonomies such as metagenome or gut metagenome, which include the vast majority of animal species. Likewise, we found samples of sheep and yak in the bovine metagenome and bovine gut metagenome taxonomies. All of this made it much more difficult to find and retrieve metadata. A further issue when refining the metadata information was the difficulty of distinguishing the nature of the samples. For instance, samples from in vitro studies were difficult to distinguish from in vivo samples because this was not explicitly defined in the metadata. Therefore, we classified samples as in vitro when they were associated with the terms reactor, culture, RUSITEC, or in vitro, and the remaining samples were considered in vivo. It is also important to note that in vitro anaerobic culture samples are taken from bottles or tubes whose inoculum often comes from three or four individual animals or their mixture [45]. For this reason, it was important to exclude them from the proportional representativeness analysis, as they do not truly represent a sample from an individual animal per se. Similarly, some samples likely come from longitudinal studies, but this could not be verified because this type of information was not found in the list of attributes. Given the growing interest in studying the long-term impact of dietary interventions and the gut microbiome in early life [46–48], the number of longitudinal samples is likely to increase, and it is important that the nature of the samples be clearly defined in the metadata. An additional key point regarding data quality was incomplete (basic, but essential) host information.
Although the literature associated with the bioprojects, and those bioprojects with more detailed attributes, allowed us to complete some basic host information, most of the samples did not have complete information on breed, sex, age, and production system, each of which was missing in more than 40% of samples. Therefore, our results related to host attributes, except for ruminant species, country, and body site, which did contain complete information, are partial and should be interpreted with this caveat in mind.
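The keyword rule described above for separating in vitro from in vivo samples can be sketched as follows. This is an illustrative approximation only: the four terms come from the text, but the function name, the use of a free-text description as input, and the word-boundary matching are assumptions, not the procedure actually applied in this study.

```python
import re

# Terms named in the text; plural forms and the separators allowed in
# "in vitro" (space, underscore, hyphen) are assumptions made here.
IN_VITRO_PATTERN = re.compile(
    r"\b(?:reactors?|cultures?|rusitec|in[\s_-]?vitro)\b",
    re.IGNORECASE,
)


def classify_sample(description: str) -> str:
    """Label a free-text sample description as 'in vitro' or 'in vivo'.

    Word boundaries are used so that, for example, 'agriculture' does not
    match 'culture'. Anything without a matching term defaults to
    'in vivo', mirroring the rule described in the text.
    """
    if IN_VITRO_PATTERN.search(description):
        return "in vitro"
    return "in vivo"


examples = [
    "Rumen fluid from RUSITEC fermenter, day 10",
    "in_vitro batch culture inoculated with rumen fluid",
    "Holstein cow rumen content, Department of Agriculture farm",
]
labels = [classify_sample(text) for text in examples]
```

A default of in vivo means that poorly described in vitro samples are silently misclassified, which is one reason an explicit metadata field for sample type would be preferable to text matching.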
The completeness and standardization of metadata using a common language (ontology) are essential not only to ensure the quality of the available data, but also to ensure transparency, reproducibility, and reusability of data for secondary studies (meta-analyses and reviews, among others) [49]. To address these issues, repositories require a checklist with the minimum information about any (x) sequence (MIxS) to be completed [50], and international initiatives are underway to improve the quality of metadata, e.g., the National Microbiome Data Collaborative (NMDC) [51], the Genomic Standards Consortium (GSC) [52], and the Agricultural Microbiome Data [53]. However, we did not observe major progress, even in more recent studies, toward incorporating these recommendations into metadata information from ruminant microbiome research. Although some issues related to metadata quality could be related to legal concerns (e.g., intellectual property protection), we believe that the major drawbacks are the lack of a common ontology that correctly describes the host organism and the insufficient emphasis placed on metadata as an indissociable element of the sequencing data, which is needed to follow FAIR (findable, accessible, interoperable, and reusable) principles [54]. Finding the appropriate ontology for submitting metadata on animal-associated microbiomes therefore remains a challenge for improving metadata quality. One possibility to facilitate the search for nonredundant ontology is to hierarchize the data structure for the ruminant microbiome, as was suggested for the plant-associated microbiome [49], and to adopt some categories of metadata (i.e., production system, productive and health traits, sampling method, processing and storage for host samples and sequenced materials) suggested in the checklist of the Agricultural Microbiome Data [53].
Host information on the (ruminant) species, breed, age, and sex is basic information that should be a minimum prerequisite for depositing microbiome sequencing data. Furthermore, adopting and using livestock-specific ontologies that define animals in their environment, such as the Animal Trait Ontology of Livestock (www.atol-ontology.com), and others related to productive and health traits such as the Animal QTLdb database (https://www.animalgenome.org/QTLdb), would provide much-needed information for data reuse. Given that the GIT microbiota is known to be modulated primarily by the type and quality of the diet [55], further information on the type of diet and its possible associations with productive and health traits in the global microbiome database would be valuable. The animal microbiome research community should improve its compliance with the open data and FAIR principles that are required by international and national funding agencies. Training focused on quality standards, FAIR principles, and ontology for microbiome data could help promote adoption.
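A check for the minimum host information proposed above (species, breed, age, and sex) could be automated at submission time. The sketch below is a hypothetical illustration: the field names are placeholders rather than official MIxS or repository attribute names, and the list of placeholder values treated as missing is an assumption.

```python
# Hypothetical field names; real repositories use their own attribute names.
REQUIRED_HOST_FIELDS = ("species", "breed", "age", "sex")

# Placeholder strings treated as missing (an assumption for this sketch).
MISSING_VALUES = {"", "missing", "not collected", "not applicable",
                  "not provided", "na", "n/a", "unknown"}


def missing_host_fields(sample: dict) -> list:
    """Return the required host fields that are absent or uninformative."""
    missing = []
    for field in REQUIRED_HOST_FIELDS:
        value = str(sample.get(field) or "").strip().lower()
        if value in MISSING_VALUES:
            missing.append(field)
    return missing


def missing_rate(samples: list, field: str) -> float:
    """Fraction of samples in which `field` is missing (0.0 if no samples)."""
    if not samples:
        return 0.0
    n_missing = sum(field in missing_host_fields(s) for s in samples)
    return n_missing / len(samples)


# Toy metadata records for illustration only.
toy_samples = [
    {"species": "Bos taurus", "breed": "Holstein", "age": "not collected"},
    {"species": "Ovis aries", "breed": None, "age": "6 months", "sex": "female"},
]
first_missing = missing_host_fields(toy_samples[0])
breed_missing_rate = missing_rate(toy_samples, "breed")
```

A submission tool could reject or flag any record for which `missing_host_fields` returns a non-empty list, and `missing_rate` gives the kind of per-attribute completeness figure reported in this work.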
A recent study compiled public animal metagenome data (which included pigs, horses, cattle, sheep, and wild animals) from the NCBI database [56]. These authors used a different approach to data searching and reported 3.6 times fewer cattle samples than we found in this work. This suggests that there are animal metagenome samples incorrectly deposited in generic taxonomies, stressing the need for the correct assignment of samples to the animal taxonomy. Nevertheless, we found some similarities, e.g., the samples mainly came from the GIT, and from countries such as the USA, China, Canada, the UK and Austria, although they included other animal species. It is also important to note that our results are limited to databases from the International Nucleotide Sequence Database Collaboration [57], which includes the EMBL-EBI European Nucleotide Archive [58], the GenBank database of the NCBI [59] from the USA, and the DNA Data Bank of Japan [60]. Therefore, it is likely that different global representation patterns exist in other databases, such as Metagenomics RAST (MG-RAST) [61], Genome Sequence Archive (GSA) [62], Global Catalogue of Metagenomics (gcMeta) [63], and Genomes Online Database (GOLD) [64], although these are orders of magnitude smaller than, and in some cases (e.g., GSA and GOLD) redundant with, the International Nucleotide Sequence Database Collaboration.