Mass Spectrometry-Guided Genome Mining as a Tool to Uncover Novel Natural Products

Renata Sigrist; Bruno S. Paulo; Célio F. F. Angolini; Luciana G. De Oliveira

doi:10.3791/60825

Chemistry

Published: March 12, 2020 doi: 10.3791/60825

Renata Sigrist¹, Bruno S. Paulo¹, Célio F. F. Angolini², Luciana G. De Oliveira¹

¹Department of Organic Chemistry, Institute of Chemistry, University of Campinas (UNICAMP), ²Center for Natural and Human Sciences, Federal University of ABC (UFABC)

Summary

A mass spectrometry-guided genome mining protocol is established and described here. It is based on genome sequence information and LC-MS/MS analysis and aims to facilitate identification of molecules from complex microbial and plant extracts.

Abstract

The chemical space covered by natural products is immense and largely unexplored. Therefore, convenient methodologies to perform wide-ranging evaluation of their functions in nature and potential human benefits (e.g., for drug discovery applications) are desired. This protocol describes the combination of genome mining (GM) and molecular networking (MN), two contemporary approaches that match gene cluster-encoded annotations in whole genome sequencing with chemical structure signatures from crude metabolic extracts. This is the first step towards the discovery of new natural entities. These concepts, when applied together, are defined here as MS-guided genome mining. In this method, the main components are first assigned (using MN), and structurally related new candidates are then associated with genome sequence annotations (using GM). Combining GM and MN is a profitable strategy to target new molecule backbones or harvest metabolic profiles in order to identify analogues of already known compounds.

Introduction

Investigations of secondary metabolism often consist of screening crude extracts for specific biological activities followed by purification, identification, and characterization of the constituents belonging to active fractions. This process has proved to be efficient, promoting the isolation of several chemical entities. However, it is now seen as unfeasible, mainly due to the high rates of rediscovery. Because the pharmaceutical industry developed without knowledge of the roles and functions of specialized metabolites, their identification was carried out under laboratory conditions that did not accurately represent nature¹. Today, there is a better understanding of natural signaling influences, secretion, and the presence of most targets at undetectably low concentrations. Additionally, understanding how these processes are regulated will help the academic community and pharmaceutical industry to take advantage of this knowledge. It will also benefit research involving the direct isolation of metabolites related to silent biosynthetic gene clusters (BGCs)².

In this context, advances in genomic sequencing have renewed interest in screening microorganism metabolites. This is because analyzing the genomic information of uncovered biosynthetic clusters can reveal genes encoding novel compounds not observed or produced under laboratory conditions. Many microbial whole genome projects or drafts are available today, and the number is growing every year, providing massive prospects for uncovering novel bioactive molecules through genome mining³,⁴.

The Atlas of Biosynthetic Gene Clusters, a component of the Integrated Microbial Genomes Platform of the Joint Genome Institute (JGI IMG-ABC), is currently the largest collection of automatically mined gene clusters². More recently, the Minimum Information for Biosynthetic Gene Clusters (MIBiG) Standardization Initiative has promoted the manual reannotation of BGCs, providing a highly curated reference dataset⁵. Many tools are now available for computational mining of genetic data and its connection to known secondary metabolites. Different strategies have also been developed to access new bioactive natural products (e.g., heterologous expression, target gene deletion, in vitro reconstitution, genomic sequence, isotope-guided screening [genomisotopic approach], manipulation of local and global regulators, resistance target-based mining, culture independent mining, and, more recently, MS-guided/code approaches²,⁶,⁷,⁸,⁹,¹⁰,¹¹,¹²,¹³,¹⁴,¹⁵).

Genome mining as a singular strategy requires considerable effort to annotate a single molecule or a small group of molecules; thus, gaps remain in how new compounds are prioritized for isolation and structure elucidation. In principle, these approaches target only one biosynthetic pathway per experiment, thereby resulting in a slow discovery rate. In this sense, using GM along with a molecular networking approach represents an important advance for natural product research¹⁴,¹⁵.

The versatility, accuracy, and high sensitivity of liquid chromatography-mass spectrometry (LC-MS) make it a good method for compound identification. Currently, several platforms offer algorithms and software suites for untargeted metabolomics¹⁶,¹⁷,¹⁸,¹⁹,²⁰. The core of these programs includes feature detection (peak picking)²¹ and peak alignment, which allows identical features to be matched across a batch of samples and searched for patterns. MS pattern-based algorithms²²,²³ compare characteristic fragmentation patterns and match MS² similarities, generating molecular families that share structural features. These features can then be highlighted and clustered, conferring the ability to rapidly discover known and unknown molecules from a complex biological extract by tandem MS²,²⁴,²⁵. Therefore, tandem MS is a versatile method to gain structural information on several chemotypes contained in a large amount of data simultaneously.
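
The peak alignment idea can be illustrated with a minimal sketch. It assumes that each feature is described by its sample name, m/z, retention time, and intensity, and that two features are the same compound when both m/z and retention time agree within a tolerance. The `Feature` class, the function name, and the tolerance values are hypothetical and are not taken from any of the cited software suites, which use more elaborate alignment models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    sample: str       # sample (extract) the feature was detected in
    mz: float         # mass-to-charge ratio
    rt: float         # retention time in minutes
    intensity: float  # peak intensity (arbitrary units)


def align_features(features, mz_tol=0.01, rt_tol=0.2):
    """Greedy alignment: a feature joins the first group whose seed feature
    lies within mz_tol (Da) and rt_tol (min); otherwise it starts a new group."""
    groups = []
    for feat in sorted(features, key=lambda f: (f.mz, f.rt)):
        for group in groups:
            seed = group[0]
            if abs(feat.mz - seed.mz) <= mz_tol and abs(feat.rt - seed.rt) <= rt_tol:
                group.append(feat)
                break
        else:
            groups.append([feat])
    return groups


# Illustrative values only: two samples sharing one feature.
features = [
    Feature("wild_type", 1111.638, 12.10, 5e5),
    Feature("heterologous", 1111.640, 12.15, 3e5),
    Feature("wild_type", 1097.622, 11.60, 1e5),
]
for group in align_features(features):
    print([(f.sample, f.mz) for f in group])
```

Features that end up in the same group can then be compared across samples, for example wild type versus heterologous host.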

The Global Natural Products Social Molecular Networking (GNPS)²⁶ algorithm uses normalized fragment ion intensities to construct multidimensional vectors, whose similarities are compared using a cosine function. The relationships between different parent ions are plotted as a diagram in which each fragmentation spectrum is visualized as a node (circle) and the relatedness between nodes is defined by an edge (line). The global visualization of molecules from a single source is defined as a molecular network. Structurally divergent molecules that fragment uniquely will form their own specific cluster or constellation, whereas related molecules cluster together. Clustering chemotypes allows the hypothetical connection of similar structural features to their biosynthetic origins.
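
The cosine comparison can be sketched as follows. This is a simplified illustration, not the GNPS implementation: it assumes spectra are lists of (m/z, intensity) pairs, normalizes each spectrum to unit length, pairs fragments that agree within a tolerance, and sums the products of the paired intensities. The function names and the 0.05 Da default are hypothetical, and the sketch does not handle fragments shifted by the precursor mass difference.

```python
import math


def normalize(peaks):
    """Scale intensities so that the spectrum has unit Euclidean norm."""
    norm = math.sqrt(sum(intensity * intensity for _, intensity in peaks))
    return [(mz, intensity / norm) for mz, intensity in peaks]


def cosine_score(spec_a, spec_b, frag_tol=0.05):
    """Return (cosine score, number of matched fragments).
    Fragments are paired greedily, highest intensity product first,
    and each fragment is used at most once."""
    a, b = normalize(spec_a), normalize(spec_b)
    candidates = sorted(
        (
            (int_a * int_b, idx_a, idx_b)
            for idx_a, (mz_a, int_a) in enumerate(a)
            for idx_b, (mz_b, int_b) in enumerate(b)
            if abs(mz_a - mz_b) <= frag_tol
        ),
        reverse=True,
    )
    used_a, used_b = set(), set()
    score, matched = 0.0, 0
    for product, idx_a, idx_b in candidates:
        if idx_a not in used_a and idx_b not in used_b:
            used_a.add(idx_a)
            used_b.add(idx_b)
            score += product
            matched += 1
    return score, matched


# Illustrative spectra: two shared fragments and one unique fragment each.
spectrum_1 = [(172.10, 40.0), (343.22, 100.0), (713.43, 60.0)]
spectrum_2 = [(172.10, 35.0), (343.23, 90.0), (727.45, 70.0)]
print(cosine_score(spectrum_1, spectrum_2))
```

Identical spectra score 1, and spectra with no fragments in common score 0; both the score and the number of matched fragments are used later when deciding whether two nodes are linked.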

Combining both chemotype-to-genotype and genotype-to-chemotype approaches is powerful when creating bioinformatics links between BGCs and their small molecule products²⁷. Therefore, MS-guided genome mining is a rapid strategy that consumes little material, and it helps bridge parent ions and biosynthetic pathways revealed by whole genome sequencing (WGS) of one or more strains under diverse metabolic and environmental conditions.

The workflow of this protocol (Figure 1) consists of feeding WGS data into a biosynthetic gene cluster annotation platform such as antiSMASH²⁸,²⁹,³⁰. This helps estimate the variety and classes of compounds encoded by the genome. A strategy to target a biosynthetic gene cluster encoding a chemical entity of interest must be adopted, and culture extracts from a wild type strain and/or heterologous strain containing the BGC can be analyzed to generate clustered ions based on similarities using GNPS²⁶,³¹. Consequently, it is possible to identify new molecules that associate with the targeted BGC and are unavailable in the database (mainly unknown analogues, sometimes produced in low titers). It is relevant to consider that users can contribute to these platforms and that the availability of bioinformatics and MS/MS data is increasing rapidly, driving the constant development and upgrade of effective computational tools and algorithms to guide efficient connections of complex extracts with molecules.

Figure 1: Overview of the entire workflow. Shown is an illustration of the bioinformatic, cloning, and molecular networking steps involved in the described MS-guided genome mining approach to identify new metabolites.

This protocol describes a rapid and efficient workflow to combine genome mining and molecular networking as a starting point for the natural product discovery pipeline. Although many applications are able to visualize the composition and relatedness of MS-detectable molecules in one network, several are adopted here to visualize structurally similar clustered molecules. Using this strategy, novel cyclodepsipeptide products observed in metabolic extracts of Streptomyces sp. CBMAI 2042 are successfully identified. Guided by genome mining, the whole biosynthetic gene cluster encoding valinomycins is recognized and cloned into the heterologous host strain Streptomyces coelicolor M1146. Finally, following an MS pattern-based molecular networking, the molecules detected by MS are correlated with the BGCs responsible for their biogenesis³².

Protocol

1. Genome mining for biosynthetic gene clusters

Perform whole genome sequencing (WGS) as the first step in selecting a biosynthetic gene cluster (BGC) for MS-guided genome mining. The whole genome draft of the strain of interest (bacteria) can be obtained by Illumina MiSeq technology using high quality genomic DNA with the following: shotgun TruSeq PCR-Free library prep and Nextera Mate Pair Library Preparation Kit³³.
NOTE: After sequencing, the Illumina shotgun library and Illumina mate pair library can be assembled using the Newbler v3.0 (Roche, 454) assembler program (found at <https://ngs.csr.uky.edu/Newbler>) and annotated using a pipeline based on FgeneSB (found at <http://www.softberry.com/berry.phtml?topic=fgenesb_annotator&group=help&subgroup=pipelines>), as described previously³³. Microbiology Resource Announcements (MRA) is a fully open access journal with articles publishing the availability of any microbiological resource deposited in an available repository (found at <https://mra.asm.org>). The candidate protein-coding genes are identified using the RAST server annotation³⁴, and the Whole Genome Shotgun (WGS) project is deposited in the DDBJ/ENA/GenBank (found at <https://www.ncbi.nlm.nih.gov/genbank/>) and Gold (found at <https://gold.jgi.doe.gov>) sequence databases.
To obtain in silico information about secondary metabolism gene cluster annotations from a completely sequenced genome, submit the sequence file (GenBank/EMBL or FASTA format) to the antiSMASH platform (found at <https://antismash.secondarymetabolites.org/>).
Select the gene cluster of interest from output data (Figure 2) based on the most similar known cluster.
NOTE: First, it is routine to explore the cluster gene by gene and conduct individual searches (blastp) to evaluate which functions are associated with the desired biosynthetic gene groups. This procedure can also help to determine which BGC is likely associated with the production of a desired compound, even if the reported similarity is low. An antiSMASH prediction considers all genes within a cluster when calculating the percentage coverage, which can result in a low overall similarity for the targeted BGC. However, a gene-by-gene analysis against the most similar known cluster provides more accurate information. Second, antiSMASH has two options to refine a search. 1) Detection strictness: how strictly a biosynthetic gene cluster must be defined to be considered a hit. The available settings are a) strict: detects exclusively well-defined clusters containing all required regions and is not susceptible to errors in the genetic information; b) relaxed: detects partial clusters missing one or more functional regions, in addition to everything detected by the strict setting; or c) loose: detects poorly defined clusters and clusters that likely match primary metabolites, which can lead to false positives or poorly defined BGCs. 2) Extra features: the type of information the platform must search for and show in the output. In general, these two options can save time after the prediction; however, the antiSMASH job itself takes longer to run.
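
The effect described in the note, where a well-conserved core still yields a low overall percentage, can be illustrated with a minimal sketch. It assumes that the similarity is the share of genes in the known reference cluster that have at least one significant hit in the query region, which is how the note describes the calculation. The function name and the gene names are hypothetical.

```python
def known_cluster_similarity(reference_genes, reference_genes_hit):
    """Percentage of reference-cluster genes matched by at least one query gene."""
    reference = set(reference_genes)
    if not reference:
        raise ValueError("reference cluster has no genes")
    matched = reference & set(reference_genes_hit)
    return 100.0 * len(matched) / len(reference)


# Hypothetical reference cluster: 2 core synthetase genes plus 8 accessory genes.
reference = ["core1", "core2"] + [f"accessory{i}" for i in range(1, 9)]

# Only the two core genes have a significant blastp hit in the query region.
print(known_cluster_similarity(reference, ["core1", "core2"]))  # 20.0
```

In this made-up case the two genes that matter most are both present, yet the cluster is reported at 20%, which is why the gene-by-gene inspection is recommended before discarding a low-similarity hit.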

Figure 2: Output from antiSMASH platform. Secondary metabolism in silico analysis from whole genome sequence annotation.

Based on DNA sequence information of the BGC, design primers (20–25 nt) flanking the gene cluster for ESAC (E. coli/Streptomyces Artificial Chromosome) library screening.
NOTE: Different methods³⁵,³⁶ can be used to capture the whole biosynthetic gene cluster from DNA. Here, the method used is construction of a representative ESAC library³⁷,³⁸ from Streptomyces sp. CBMAI 2042 containing clones with an average fragment size of ~95 kb.
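
A first pass at listing 20–25 nt primer candidates in a flanking region could look like the sketch below. It only enumerates windows of the stated length and filters them by GC fraction; the GC range, the function names, and the example sequence are assumptions made for illustration, and the sketch does not replace checks for melting temperature, secondary structure, or specificity in a dedicated primer design tool.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def candidate_primers(flank, min_len=20, max_len=25, gc_range=(0.5, 0.7)):
    """Return (start, sequence) for every window of min_len..max_len nt
    whose GC fraction falls inside gc_range (an assumed, adjustable range)."""
    candidates = []
    for length in range(min_len, max_len + 1):
        for start in range(len(flank) - length + 1):
            window = flank[start:start + length]
            if gc_range[0] <= gc_fraction(window) <= gc_range[1]:
                candidates.append((start, window))
    return candidates


# Made-up upstream flank; real flanks come from the BGC coordinates.
upstream_flank = "GCCGATCGTCACGGTGACCTTCGAGGACATCGCCAAGGTCTACG"
forward_candidates = candidate_primers(upstream_flank)
print(len(forward_candidates), forward_candidates[0])
```

Forward primers are read directly from the upstream flank, whereas reverse primers would be the `reverse_complement` of windows taken from the downstream flank.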

2. Heterologous expression of whole biosynthetic gene cluster from the ESAC library

Move the ESAC vector from E. coli DH10B to E. coli ET12567 by triparental conjugation³².
1. Inoculate E. coli ET12567 (CamR), TOPO10/pR9604 (CarbR), DH10B/ESAC4H (AprR) in 5 mL of Luria-Bertani (LB) medium containing chloramphenicol (25 µg/mL), carbenicillin (100 µg/mL), and apramycin (50 µg/mL).
2. Incubate the culture overnight at 37 °C and 250 rpm.
3. Inoculate 500 µL of the overnight culture in 10 mL of LB medium containing a half-concentration of antibiotics.
4. Incubate the culture at 37 °C and 250 rpm until reaching an A₆₀₀ of 0.4–0.6.
5. Harvest the cells by centrifugation at 2,200 x g for 5 min.
6. Wash the cells twice with 20 mL of LB medium.
7. Resuspend the cells in 500 µL of LB medium.
8. Mix 20 µL of each strain in a microcentrifuge tube and spot the mixture onto an LB agar plate lacking antibiotics.
9. Incubate the plates at 37 °C overnight.
10. Streak the grown cells onto a fresh LB agar plate containing antibiotics and incubate at 37 °C overnight.

3. Streptomyces/E. coli conjugation

To obtain the recombinant heterologous organism, perform conjugation³² between E. coli ET12567 containing the ESAC vector, helper plasmid pR9604, and Streptomyces coelicolor M1146 or another selected host strain³⁹.
Day 1: Inoculate isolated colonies of S. coelicolor M1146 in 25 mL of TSBY medium in a 250 mL Erlenmeyer flask fitted with a stainless steel spring, and incubate at 30 °C and 200 rpm for 48 h.
Day 2/3: Inoculate ET12567/ESAC/pR9604 in 5 mL of LB medium containing chloramphenicol (25 µg/mL), carbenicillin (100 µg/mL), and apramycin (50 µg/mL) overnight at 37 °C and 250 rpm.
Day 3/4: Inoculate 500 µL of the overnight culture in 10 mL of 2TY (in a 50 mL conical tube) containing half-working concentrations of antibiotics. Incubate at 37 °C and 250 rpm until reaching an A₆₀₀ of 0.4–0.6.
Centrifuge the cultures (ET12567/ESAC/pR9604 and M1146) at 2200 x g for 10 min.
Wash the pellets 2x in 20 mL of 2TY medium and resuspend in 500 µL of 2TY.
Aliquot 200 µL of the S. coelicolor M1146 suspension and dilute in 500 µL of 2TY (suspension A).
Aliquot 200 µL of suspension A and dilute in 500 µL of 2TY (suspension B).
Aliquot 200 µL of suspension B and dilute in 500 µL of 2TY (suspension C).
Aliquot 200 µL of the ET12567/ESAC/pR9604 suspension and mix with 200 µL of suspension C.
Plate 150 µL of the conjugation mixture on an SFM agar plate lacking antibiotics.
Incubate at 30 °C for 16 h.
Cover plates with 1 mL of antibiotic solution (according to plasmid resistance). After drying, incubate at 30 °C for 4–7 days.
NOTE: Here, a solution containing 1.0 mg/mL thiostrepton and 0.5 mg/mL nalidixic acid was prepared.
Streak putative exconjugants onto SFM agar plates containing thiostrepton (50 µg/mL) and nalidixic acid (25 µg/mL). Incubate at 30 °C.
Streak exconjugants onto an SFM agar containing only nalidixic acid.
Perform PCR analysis with isolated colonies to confirm that the entire gene cluster has been transferred to the S. coelicolor M1146 host.
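
The serial dilution of the S. coelicolor M1146 suspension (suspensions A, B, and C) can be worked through with a short calculation. This is a sketch that assumes the volumes are additive and that the concentration of the resuspended pellet is taken as 1; the function and variable names are hypothetical.

```python
from fractions import Fraction


def dilution_fraction(aliquot_ul, diluent_ul):
    """Fraction of the original concentration left after one dilution step."""
    return Fraction(aliquot_ul, aliquot_ul + diluent_ul)


# Each step: 200 µL of suspension into 500 µL of 2TY.
step = dilution_fraction(200, 500)

# Three consecutive steps give suspensions A, B, and C.
suspension_c = step ** 3

# 200 µL of suspension C is mixed with 200 µL of the E. coli suspension.
in_conjugation_mix = suspension_c * Fraction(200, 200 + 200)

print(step, suspension_c, float(suspension_c))
print(in_conjugation_mix, float(in_conjugation_mix))
```

Under these assumptions, suspension C holds 8/343 (about 2.3%) of the original Streptomyces concentration, and the conjugation mixture about 1.2%.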

4. Strain cultivation

To obtain the metabolic profile, inoculate 1/100 of the strain's pre-culture in appropriate fermentation media and under the appropriate culture conditions.
Centrifuge cultures at 2200 x g for 10 min.
Perform the extraction according to the class of the compound of interest⁴⁰.

5. Acquiring mass spectra and preparation for GNPS analysis

To acquire MS/MS data, program suitable HPLC and mass spectrometry methods using the control software. Both high- and low-resolution data-dependent acquisition (DDA) data can be analyzed.
NOTE: Generally, a 1 mg/mL solution of complex crude extract samples is ideal. Dilutions are needed for less complex extracts. Note that the resulting MS/MS network represents only the molecules detectable under the given mass spectrometric conditions.
Convert mass spectra to .mzXML format using MSConvert from ProteoWizard (found at <http://proteowizard.sourceforge.net/>). The input parameters for the conversion are illustrated in Figure 3. Data files from almost all instrument vendors are compatible.

Figure 3: Using MSConvert to convert MS files to the mzXML format. The correct parameters for GNPS analysis are displayed. The instructions are as follows: add all MS files in box 1 and add the Peak Picking filter in box 2; for this filter, use the Vendor algorithm; press Start, and the conversion process will follow.
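
The protocol performs the conversion in the MSConvert graphical interface. For batches of files, the same settings can in principle be scripted with the `msconvert` command-line program that ships with ProteoWizard. The sketch below only builds the command and does not run it. The options `--mzXML`, `--filter "peakPicking vendor msLevel=1-"`, and `-o` are believed to correspond to the settings in Figure 3 and should be checked against `msconvert --help` for the installed version; the file and folder names are hypothetical.

```python
import shlex
from pathlib import Path


def msconvert_command(raw_files, out_dir="mzxml"):
    """Build (but do not run) an msconvert call that writes mzXML files
    with vendor peak picking applied to all MS levels."""
    return [
        "msconvert",
        *[str(path) for path in raw_files],
        "--mzXML",
        "--filter", "peakPicking vendor msLevel=1-",
        "-o", str(out_dir),
    ]


# Hypothetical raw data folders from the LC-MS/MS runs.
command = msconvert_command([Path("extract_wild_type_01.d"),
                             Path("extract_heterologous_01.d")])
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where ProteoWizard is installed.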

Upload the converted LC-MS/MS files into the GNPS database. Two options are available: using a file transfer protocol (FTP) or directly in a browser through the online platform.
NOTE: Detailed information on how to install and transfer data to GNPS is available at <https://ccms-ucsd.github.io/GNPSDocumentation/fileupload/>.

6. GNPS analysis

After creating an account in GNPS (found at <https://gnps.ucsd.edu/>), log in to the created account and select Create Molecular Network. Add a job title.
Basic options: select the mzXML files to build the molecular network. They can be organized into up to six groups. Select the libraries for the dereplication routine (Figure 4).
NOTE: These groups do not interfere with molecular network construction. This information will be used only for the graphical representation.

Figure 4: Using the online GNPS platform to perform molecular network analysis. Selection of mzXML files is done by clicking in box 1. In the dialog box that opens, the files can be selected from a personal folder (box 2) or uploaded in the second tab using the drag-and-drop file uploader (less than 20 MB). The files can be grouped into up to six groups.

Select the precursor ion mass tolerance and fragment ion mass tolerance of 0.02 Da and 0.05 Da, respectively.
NOTE: GNPS has different levels of strictness available based on 1) how accurate the MS/MS data are and 2) how accurate the association must be. Basic options: in this section, it is possible to set Precursor Ion Mass Tolerance and Fragment Ion Mass Tolerance. These parameters determine how closely precursor ion and fragment ion masses must agree to be matched. The selected mass tolerances depend on the resolution and accuracy of the mass spectrometer that is used.
Advanced network options: select the parameters according to Figure 5. These parameters directly influence the size and shape of the network clusters. The parameters in the remaining tabs are intended for advanced users; thus, leave them at their default values.
NOTE: Advanced parameters can be read in GNPS documentation (found at <https://ccms-ucsd.github.io/GNPSDocumentation/>).
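
Because the precursor and fragment ion tolerances are entered in Da while instrument accuracy is often quoted in ppm, a small conversion helps when choosing values. The sketch below uses the 0.02 Da precursor tolerance from the step above; the function names and the example m/z values are hypothetical.

```python
def ppm_to_da(mz, ppm):
    """Absolute mass window (Da) corresponding to a ppm error at a given m/z."""
    return mz * ppm / 1e6


def da_to_ppm(mz, da):
    """Relative error (ppm) corresponding to an absolute window at a given m/z."""
    return da / mz * 1e6


def within_tolerance(mz_a, mz_b, tol_da):
    """True when two m/z values agree within the absolute tolerance."""
    return abs(mz_a - mz_b) <= tol_da


# A fixed 0.02 Da window is much wider, in ppm, for small ions than for large ones.
for mz in (200.0, 1000.0):
    print(mz, round(da_to_ppm(mz, 0.02), 1), "ppm")

print(within_tolerance(1111.640, 1111.655, 0.02))  # True
print(within_tolerance(1111.640, 1111.665, 0.02))  # False
```

At m/z 1000 a 0.02 Da window corresponds to 20 ppm, whereas at m/z 200 it corresponds to 100 ppm, which is one reason the tolerances should follow the resolution and accuracy of the instrument used.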

Figure 5: Using GNPS to perform molecular network analysis (advanced options). Min Pair Cos directly influences the size of clusters: high values combine only closely related compounds, and low values also combine distantly related compounds. Values that are too low should be avoided. Minimum Matched Fragment Ions is the number of fragments that two fragmentation spectra must share to be linked in the network. Together, both parameters guide the network format; lower values will cluster more distantly related compounds and vice versa. Using proper values will greatly help compound elucidation.
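
How the two thresholds shape the network can be sketched with a toy example. It assumes that pairwise cosine scores and matched fragment counts are already available, keeps an edge only when both thresholds are met, and reports the connected components as molecular families. The thresholds 0.7 and 6 are placeholders rather than the values shown in Figure 5, the node names and scores are made up, and GNPS applies further filters that are not reproduced here.

```python
from collections import defaultdict


def build_network(pair_scores, min_cos=0.7, min_matched=6):
    """pair_scores maps (node_a, node_b) to (cosine, matched_fragments).
    Returns the retained edges and the connected components (families)."""
    nodes = set()
    for node_a, node_b in pair_scores:
        nodes.update((node_a, node_b))

    edges = [
        pair for pair, (cosine, matched) in pair_scores.items()
        if cosine >= min_cos and matched >= min_matched
    ]

    neighbours = defaultdict(set)
    for node_a, node_b in edges:
        neighbours[node_a].add(node_b)
        neighbours[node_b].add(node_a)

    families, seen = [], set()
    for node in sorted(nodes):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        families.append(sorted(component))
    return edges, families


# Made-up pairwise scores between four parent ions.
scores = {
    ("m1111", "m1097"): (0.92, 10),
    ("m1097", "m1083"): (0.85, 8),
    ("m1111", "m650"): (0.40, 3),
}
print(build_network(scores)[1])               # one family of three, one singleton
print(build_network(scores, min_cos=0.9)[1])  # stricter: the family splits
```

Raising `min_cos` from 0.7 to 0.9 removes the weaker edge, so the three-member family splits, which mirrors the behavior described in the Figure 5 caption.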

Choose an e-mail address to receive an alert when the work is done, and submit the job.

7. Analysis of GNPS results

Log in to GNPS. Select Jobs > Published job > Done to open the job. A webpage will open as illustrated in Figure 6. All results obtained from molecular networking will be displayed.
Select View Spectral Families (In Browser Network Visualizer) to see all network clusters (red box, Figure 6).

Figure 6: Using GNPS to visualize molecular network results. All related compound clusters can be seen in view spectral families (red box). To visualize only library hits, "view all library hits" (blue box) should be selected. For better graphical representation of molecular network results, "Direct Cytoscape Preview" (yellow box) should be downloaded, and the latest version of Cytoscape should be used.

A list will be displayed with all generated molecular networking clusters. If a library search was selected to generate the findings, tentative molecule identifications will be displayed in AllIDs. Select Show to visualize them.
NOTE: The data analyses can be guided by other results (e.g., genome mining, biological assays, or library dereplication).
To analyze the molecular network cluster, select Visualize Network.
NOTE: Each cluster is composed of nodes (circles) and edges (lines), which represent molecules and molecular similarity, respectively. Dereplicated molecules will be highlighted as blue nodes in the online browser network visualizer.
In the node labels box, select parent mass (red box, Figure 7).
In the edge labels box, select Cosine or DeltaMZ to observe node similarity or mass difference between nodes, respectively (yellow box, Figure 7).
In the case of multigroup analyses, click Draw pies in the node coloring box to observe the frequency at which each node appears in each group (blue box, Figure 7).
NOTE: Other choices are possible, but those suggested above are optimal for annotating cluster nodes and unraveling their structures.

Figure 7: Using GNPS to visualize molecular cluster results. After opening the molecular clusters for better data visualization, the following should be chosen: "Parent mass" as node labels (red box); "DeltaMZ" as edge labels (yellow box); and "Draw pies" as node coloring (blue box). Navigate through the molecular cluster and try to annotate all nodes.
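
When DeltaMZ is shown on the edges, the mass differences between neighbouring nodes can be compared with the masses of common structural changes. The sketch below uses a short, hand-picked table of monoisotopic mass differences and allows small integer multiples; the table, the tolerance, and the function name are illustrative choices and are not part of GNPS.

```python
# Monoisotopic mass differences (Da) of a few common structural changes.
KNOWN_SHIFTS = {
    "CH2 (methylene)": 14.01565,
    "O (oxygen)": 15.99491,
    "H2O (water)": 18.01056,
    "C2H2O (acetyl)": 42.01057,
}


def annotate_delta(delta_mz, tol=0.01, max_multiple=3):
    """Return (name, multiple) for every known shift whose integer multiple
    matches the absolute mass difference within tol Da."""
    delta = abs(delta_mz)
    return [
        (name, multiple)
        for name, mass in KNOWN_SHIFTS.items()
        for multiple in range(1, max_multiple + 1)
        if abs(delta - multiple * mass) <= tol
    ]


print(annotate_delta(14.016))   # [('CH2 (methylene)', 1)]
print(annotate_delta(-28.031))  # [('CH2 (methylene)', 2)]
print(annotate_delta(42.011))   # [('C2H2O (acetyl)', 1)]
```

The last case shows why accurate mass matters: three methylene units (42.047 Da) and one acetyl unit (42.011 Da) share the same nominal mass but are separated at a 0.01 Da tolerance.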

To see all library hits, select View all library hits (blue box, Figure 6).
NOTE: The molecular network can also be downloaded through Direct Cytoscape Preview/Download (yellow box, Figure 6), and the file can be opened in the Cytoscape platform (found at <https://cytoscape.org/>) for more graphical options.
Manual confirmation of dereplicated compounds and structure elucidation of related compounds are needed. Open the fragmentation spectra directly in the GNPS platform or in the original raw files.

Representative Results

The protocol was successfully exemplified using a combination of genome mining, heterologous expression, and MS-guided/code approaches to access new specialized valinomycin analogues. The genome-to-molecule workflow for the target, valinomycin (VLM), is represented in Figure 8. The Streptomyces sp. CBMAI 2042 draft genome was analyzed in silico, and the VLM gene cluster was then identified and transferred to a heterologous host. Heterologous and wild type strains were cultivated in triplicate using proper fermentation conditions, and the cultures were partitioned with ethyl acetate and concentrated to generate the crude extract. From this extract, MS/MS data were acquired to generate a tandem MS metabolite profile for molecular networking. Figure 9 represents the clustered ions obtained from MS/MS data from Streptomyces sp. CBMAI 2042 crude extract, in which characteristic fragmentation patterns and corresponding MS similarities suggest the occurrence of a molecular family sharing structural features². Following known biosynthetic logic and bioinformatics insights, and supported by pattern-based MS/MS spectra, the structures of four originally reported cyclodepsipeptides were elucidated, and their origins were correlated with the same biosynthetic gene cluster responsible for VLM assembly³².

Molecular networking data (found at <https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=6f97aa4addfa4d20b505fdb4328b088c>) were processed in the GNPS platform and deposited in the MassIVE repository (MSV000083709). For dereplication, two strategies were selected to populate the network with previously described compounds: 1) Dereplicator (found at <https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=1a55e768d02649aaa09d78d0d4778ef3>) and 2) a peptide natural product identification tool called VarQuest (found at …). Our previous publication provides further details³².

Figure 8: Workflow from in silico genome sequence analysis to MS data acquisition. (A) A draft of the Streptomyces sp. CBMAI 2042 genome is obtained by Illumina MiSeq sequencing. (B) Valinomycin BGC identification and annotation. (C) After transferring the whole gene cluster to an appropriate host, the strain is cultivated. The ethyl acetate extract from the culture is analyzed by LC to obtain a profile of produced secondary metabolites. The chromatogram shows that valinomycin, montanastatin, and five analogues are produced by VLM BGC expression in a heterologous host.

Figure 9: Molecular networking results. (A) Molecular networking from Streptomyces sp. CBMAI 2042 extract. Ions corresponding to valinomycin, an already known compound with the corresponding BGC annotated in the Streptomyces sp. CBMAI 2042 genome, are clustered with ions related to analogues first described for the VLM BGC. (B) MS spectra and chemical structures for valinomycin and related analogues are shown.

Discussion

The strongest advantage of this protocol is its ability to rapidly dereplicate metabolic profiles and bridge genomic information with MS data in order to elucidate the structures of new molecules, especially structural analogues². Based on genomic information, different natural products chemotypes can be investigated, such as polyketides (PK), nonribosomal peptides (NRP), and glycosylated natural products (GNP), as well as cryptic BGCs. Metabolomic screening yields evidence of activated BGC profiles and chemical diversity produced by a specific strain under laboratory conditions. Thus, a BGC can be cloned to direct production of a new compound or unknown analogues related to an already known BGC, facilitated by similarities discovered by molecular networking. Therefore, this procedure helps to distinguish valuable compounds produced by natural sources and can be used as a guide for future isolation steps, which are common in natural product pipelines.

MS-guided genome mining was first described in the fields of peptidogenomics⁴¹ and glycogenomics⁴². To estimate the extent of peptide natural product chemical diversity, Dorrestein and colleagues developed an automated method using MS and genomics to visualize the connection between expressed natural products (chemotype) and their gene clusters (genotype). The concept of MS-guided genome mining was thus first described using peptide specialized metabolites. In glycogenomics, a method for the identification of microbial glycosylated natural products (GNP) using a GM approach and tandem MS was applied as a tool to rapidly connect GNP chemotypes (from microbial metabolomes) with their corresponding biosynthetic genotypes following sugar footprints.

The concept of peptidogenomics has been applied to reveal stenothricin gene clusters in Streptomyces roseosporus, providing the first insights into the broad utility of GNPS as a platform⁴³. Pattern-based genome mining and molecular networking were finally combined with the GNPS platform²⁶ to facilitate the dereplication of known compounds, the detection of new compounds and analogs, and structure elucidation across 35 Salinispora strains. This led to the isolation and characterization of retimycin A, a quinomycin-type depsipeptide⁴⁴. Since the introduction of GNPS, integrated metabolomics and genome mining approaches have become the most versatile avenue to connect molecular networks with biosynthetic capabilities⁴⁵,⁴⁶,⁴⁷,⁴⁸,⁴⁹,⁵⁰.

This protocol reinforces the feasibility of using genomic and metabolomic analyses to investigate the production of known and unknown chemically analogous compounds in a few steps while consuming low levels of materials. The model presented here is related to valinomycin analogue identification from crude extracts through molecular networking dereplication. The structure of analogues is deduced by MS/MS fragmentation and follows the biosynthetic logic of cloned VLM BGCs.

Different software is available for mining secondary metabolite biosynthetic gene clusters⁵¹ and for metabolite elucidation, but open source options have the advantage of being constantly updated and open to the scientific community. In this sense, antiSMASH and the GNPS platform are the most popular choices.

This general procedure can be modified for other extraction methodologies based on the natural source explored. More than one method of extraction can also be combined according to metabolite properties (e.g., polarity, hydrophobicity, the capability to form micelles), and even for metabolites with similar properties, different solvents or resins can give enhanced results. Usually, extracts are prepared from liquid medium cultivation, but there is a plethora of extraction methods available to obtain enriched extracts and screen any biological sample of interest.

When acquiring MS data, data dependent acquisition (DDA) analysis should be used. This is especially important when a large number of compounds are being evaluated in a single injection. While performing DDA, the maximum number of MS/MS spectra for each precursor ion and the maximum number of different precursor ions should be balanced. When using fast scan rate equipment, this can be achieved with higher scan rates (~6–10 MS/MS scans per cycle). However, in lower scan rate equipment, MN performance can only be increased with better chromatographic resolution. The most comprehensive data possible should be obtained to populate the molecular network. For MS data acquisition, a fixed collision energy is possible, but ramped energies tend to yield improved results. There are no optimal conditions that will work perfectly for all samples. Achieving sufficient MS analysis is crucial to the following steps. Thereafter, the molecular network clusters should be generated and dereplicated according to the procedure.
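
The balance between repeated MS/MS spectra of one precursor and coverage of different precursors can be illustrated with a toy model of one DDA cycle. The sketch assumes a simple top-N rule, in which the N most intense MS1 ions are fragmented, combined with an exclusion list of recently fragmented ions. Exclusion is an assumption added for illustration and is not described in the protocol; the function name and all values are hypothetical, and real instrument software uses vendor-specific logic.

```python
def select_precursors(ms1_peaks, top_n=6, excluded=(), excl_tol=0.02):
    """Pick the top_n most intense MS1 peaks (m/z, intensity) that are not
    within excl_tol Da of an m/z on the exclusion list."""
    eligible = [
        (mz, intensity)
        for mz, intensity in ms1_peaks
        if all(abs(mz - excluded_mz) > excl_tol for excluded_mz in excluded)
    ]
    eligible.sort(key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in eligible[:top_n]]


# Made-up MS1 survey scan: one dominant ion and three weaker co-eluting ions.
survey_scan = [(1111.64, 9e5), (1097.62, 4e5), (1083.61, 1e5), (650.30, 5e4)]

print(select_precursors(survey_scan, top_n=2))
print(select_precursors(survey_scan, top_n=2, excluded=[1111.64]))
```

Without exclusion the two most intense ions are selected in every cycle; once the dominant ion is excluded, a weaker co-eluting ion is fragmented instead, which is how low-abundance analogues can enter the network.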

A frequent problem is missing fragment intensities for some masses. Normally, this can be solved by using a higher collision energy during analysis. Occasionally, although this is very uncommon, no correlations are observed between the spectra and the GNPS library. In this case, ensure that the data open properly in the MS previsualization software, as errors can sometimes be introduced during the conversion to .mzXML files.

Regarding genome mining, gene cluster annotation platforms provide the most precise output from higher quality whole genome sequencing, for both single-strain and culture-independent mining. High quality sequencing will generate high quality bioinformatic insights for dereplication of biosynthetic pathways. However, although BGC prediction bioinformatics software has been developing rapidly, exact predictions of gene function and putative products are still difficult, especially when investigating novel biosynthetic pathways and features that cannot be predicted in silico. Also, although some biosynthetic machinery is strikingly conserved, the enzymology involved in hybrid systems, trans-AT modular PKSs, and NRPSs is recognized as an exception to the colinearity rule. In this sense, heterologous expression and refinements in bioinformatic output software can help elucidate unpredictable enzyme functions and unusual biochemistry⁵²,⁵³,⁵⁴. The enrichment of public databases will lead to more precise predictions and discovery of novel specialized metabolites, as the cost of WGS no longer represents a handicap for genome mining.

Finally, the strongest advantages of integrated metabolomic and genome mining approaches are related to their feasibility to perform genotype and chemotype dereplication via automated and high throughput analysis connecting genomic, transcriptomic, and metabolomic data to efficiently connect genes with molecules.

Disclosures

The authors have nothing to disclose.

Acknowledgments

The financial support for this study was provided by São Paulo Research Foundation - FAPESP (2019/10564-5, 2014/12727-5 and 2014/50249-8 to L.G.O; 2013/12598-8 and 2015/01013-4 to R.S.; and 2019/08853-9 to C.F.F.A). B.S.P, C.F.F.A., and L.G.O. received fellowships from the National Council for Scientific and Technological Development - CNPq (205729/2018-5, 162191/2015-4, and 313492/2017-4). L.G.O. is also grateful for the grant support provided by the program For Women in Science (2008, Brazilian Edition). All authors acknowledge CAPES (Coordination for the Improvement of Higher Education Personnel) for supporting the post-graduation programs in Brazil.

Materials

Name	Company	Catalog Number	Comments
Acetonitrile	Tedia	AA1120-048	HPLC grade
Agar	Oxoid	LP0011	NA
Apramycin	Sigma Aldrich	A2024	NA
Carbenicillin	Sigma Aldrich	C9231	NA
Centrifuge	Eppendorf	NA	5804
Chloramphenicol	Sigma Aldrich	C3175	NA
Column C18	Agilent Technologies	NA	ZORBAX RRHD Extend-C18, 80Å, 2.1 x 50 mm, 1.8 µm, 1200 bar pressure limit P/N 757700-902
Kanamycin	Sigma Aldrich	K1377	NA
Mannitol P.A.- A.C.S.	Synth	NA	NA
Microcentrifuge	Eppendorf	NA	5418
Nalidixic acid	Sigma Aldrich	N4382	NA
Phusion Flash High-Fidelity PCR Master Mix	ThermoFisher Scientific	F548S	NA
Q-TOF mass spectrometer	Agilent Technologies	NA	6550 iFunnel Q-TOF LC/MS
Sacarose P.A.- A.C.S.	Synth	NA	NA
Shaker/Incubator	Marconi	MA420	NA
Sodium Chloride	Synth	NA	P. A. - ACS
Soy extract	NA	NA	NA
Sucrose	Synth	NA	P. A. - ACS
Thermal Cycler	Eppendorf	NA	Mastercycler Nexus Gradient
Thiostrepton	Sigma Aldrich	T8902	NA
Tryptone	Oxoid	LP0042	NA
Tryptone Soy Broth	Oxoid	CM0129	NA
UPLC	Agilent Technologies	NA	1290 Infinity LC System
Yeast extract	Oxoid	LP0021	NA