Introduction

In recent years, there has been growing appreciation of the need to apply systems biology approaches that go beyond the genome level to the study of plant science (Cui et al. 2008; Long et al. 2008; Nelson et al. 2008). This is due to the realization that the assignment of gene function and the understanding of dynamic molecular phenotypes depend greatly on the ability to measure changes occurring beyond the level of gene expression. One method of achieving this is through the measurement and characterization of the proteins being expressed within a biological system (i.e., proteomics). Proteomics represents a rapidly developing technical discipline that encompasses a wide range of activities such as the analysis of changing protein abundance (Bachi and Bonaldi 2008), posttranslational modifications (de la Fuente van Bentem et al. 2008), and functional protein interaction networks (Collura and Boissy 2007). However, proteomics methods continue to be underutilized in the area of plant biology outside of their application in well-defined model systems like Arabidopsis thaliana and Oryza sativa (Chen and Harmon 2006; Pan et al. 2005). Because proteomics is still developing rapidly, the creation of new tools is what will unlock the value of these methods in other plant systems.

Proteomics research relies heavily on the use of tandem mass spectrometry, and an average dataset typically consists of tens to hundreds of thousands of individual mass spectra. By extension, proteomics research is critically dependent upon the availability of sequence databases for the rapid and unsupervised interpretation of these spectra to provide meaningful peptide sequence assignments and the associated protein identifications. Organisms for which the genome has not been sequenced have typically been at a disadvantage with respect to the practical application of proteomics methods. Studies of these organisms typically rely on searching against sequences from related species that share sequence identity with the organism under study. For species of spruce (Picea spp.) and other conifers, the most closely related genomes are all from evolutionarily distant angiosperm species (e.g., A. thaliana, rice, poplar, grapevine). However, it has been shown that distantly related sequences function poorly in the interpretation of proteomics data (Huang et al. 2006). The spruce proteome database (DB) described here was assembled from the sequence data produced during a large-scale expressed sequence tag (EST) and full-length complementary DNA (FLcDNA) sequencing project in spruce (Ralph et al. 2008) with representative sequences taken from Picea sitchensis (Sitka spruce), Picea glauca (white spruce), and Picea glauca × engelmannii (interior spruce). Spruce proteome DB is an expansion of the databases used in prior proteomics studies performed in these conifer species (Lippert et al. 2005, 2007, 2009) and consists of a set of related protein databases representing the three spruce species and hybrids studied in the Treenomix project (www.treenomix.ca). Spruce proteome DB is, to our knowledge, the most comprehensive and appropriate sequence resource for studying conifer and other gymnosperm proteomes.
Spruce proteome DB complements other database resources that provide general information on conifers (e.g., The Gymnosperm Database; http://www.conifers.org/index.html) and conifer genomics (e.g., TreeGenes, http://dendrome.ucdavis.edu/treegenes/).

Database construction

Spruce proteome DB was constructed according to the pipeline illustrated in Fig. 1. The final database was assembled using a set of 437,705 ESTs from Sitka spruce (168,424), white spruce (242,931), and interior spruce (26,350) in addition to 10,579 FLcDNA sequences from Sitka spruce (Pavy et al. 2005; Ralph et al. 2006, 2008). EST sequences from each species were clustered independently using the Parallel Contig Assembly Program (PCAP; Huang et al. 2003), resulting in a separate set of contigs and singletons for each of the three species. The FLcDNAs were used as is. All nucleotide sequences were compared first to the Arabidopsis protein database and then to all the plant sequences in NCBInr using BLASTx (Altschul et al. 1990). Matches were accepted with expect (E) values less than 1 × 10⁻⁵ and the annotations were associated with the relevant query sequences. Arabidopsis annotations were chosen preferentially over NCBInr annotations when both were obtained. Sequences that were not matched using BLAST were subsequently analyzed using GENEMARK-E (Besemer and Borodovsky 2005; Borodovsky et al. 2003). GENEMARK-E is an ab initio gene finder that attempts to identify potential genes in eukaryotic sequences that have no known homolog. In spruce, roughly 10% of the ESTs fall into this category. The fact that these sequences were obtained from cDNAs (i.e., transcribed genes) suggests that they may represent proteins that are unique to spruce and possibly other conifers or gymnosperms. GENEMARK-E was therefore used to uncover putative open reading frames (ORFs) within these sequences. Matches were labeled as Genemark XX YY, where XX was the position of the ORF start site and YY was the position of the stop site within the original EST sequence. These entries were trimmed to contain only the sequence of the putative ORF.
All remaining unmatched EST and contig sequences were translated in six frames, and each frame was then included in the final build of spruce proteome DB as a separate entry derived from that nucleotide sequence and annotated as “Hypothetical Protein based on EST sequence.” The structure of the sequence annotations is shown in Fig. 2. The description line for each sequence contains up to four pieces of information, depending on the outcome of the annotation process for that sequence. These consist of a short string representing the species of origin, a number unique to the contig or EST sequence, the annotation obtained from BLASTx or the GENEMARK-E algorithm, and the expect value of any BLAST match that was found.
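
The annotation precedence and the six-frame fallback described above can be illustrated with a short Python sketch. This is not the code used to build spruce proteome DB. The function names, the tuple layout of the hits, the labeling of frames as +1 to +3 and −1 to −3, and the use of "X" for codons containing ambiguous bases are all assumptions made for illustration. Only the 1 × 10⁻⁵ cutoff, the preference for Arabidopsis hits, the "Genemark XX YY" label, and the hypothetical protein annotation are taken from the text.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

E_VALUE_CUTOFF = 1e-5  # BLASTx acceptance threshold stated in the text


def translate(nucleotides):
    """Translate one reading frame; codons with ambiguous bases become X (assumption)."""
    codons = (nucleotides[i:i + 3] for i in range(0, len(nucleotides) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translate(sequence):
    """Return {frame: protein} for forward frames 1..3 and reverse frames -1..-3."""
    forward = sequence.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


def choose_annotation(arabidopsis_hit=None, ncbi_hit=None, genemark_orf=None):
    """Pick an annotation by precedence.

    Hits are hypothetical (description, e_value) tuples; genemark_orf is a
    hypothetical (start, stop) tuple of ORF positions in the EST.
    """
    for hit in (arabidopsis_hit, ncbi_hit):  # Arabidopsis is checked first
        if hit is not None and hit[1] < E_VALUE_CUTOFF:
            return hit[0]
    if genemark_orf is not None:
        return "Genemark {} {}".format(*genemark_orf)
    return "Hypothetical Protein based on EST sequence"
```

In this sketch, a sequence with no accepted BLASTx hit and no predicted ORF would contribute six entries, one for each value returned by `six_frame_translate`.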

Fig. 1 Flowchart illustrating the series of steps performed during the construction of spruce proteome DB. Sequence inputs are listed as are the output subset databases. The size of each database is indicated by the number of individual entries contained within each sequence set.

Fig. 2 The information contained in the definition line for each database entry describes the source of the entry. An example is shown here and explanatory notes are listed for each component of the definition line. The species from which the sequence was taken is indicated at the beginning of the line. This is followed by a set of symbols that indicate whether the sequence is a contig, EST, or full-length cDNA and the reading frame of the original nucleotide sequence that is represented in the entry. Finally, the result of any BLASTx-derived annotation or GENEMARK-E result is listed along with an expect value if appropriate.

In addition to the main databases described above, decoy versions of each database have been produced that contain head-to-tail reversed sequences. These decoy databases are provided separately and can be combined with spruce proteome DB to assess the level of false-positive protein identification that is obtained from any proteomics dataset following a database search. The implementation of this approach for the analysis of proteomics data has been previously described (Huttlin et al. 2007). In brief, matches to reverse sequences represent random incorrect matches and the scores that are obtained against these reverse sequences can be used to empirically determine an appropriate score cutoff when interpreting the result of a proteomics database search.
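
As a rough illustration of the decoy approach, the Python sketch below reverses each sequence in a FASTA file and estimates a false discovery rate from the matches returned by a search of the combined database. It is a minimal sketch under stated assumptions, not the procedure of Huttlin et al. (2007) or the code used to build the decoy databases. The `REV_` prefix, the function names, and the decoy/target ratio used as the estimator are assumptions; scores are taken to be log(expect) values, for which lower means more confident.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def make_decoys(entries, prefix="REV_"):
    """Reverse each sequence head to tail; the prefix is an arbitrary decoy tag."""
    return [(prefix + header, sequence[::-1]) for header, sequence in entries]


def estimate_fdr(matches, cutoff, prefix="REV_"):
    """Estimate FDR as decoy hits / target hits among matches at or below cutoff.

    matches is a list of (protein_id, log_expect) pairs.
    """
    accepted = [pid for pid, log_e in matches if log_e <= cutoff]
    decoys = sum(pid.startswith(prefix) for pid in accepted)
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0


def pick_cutoff(matches, max_fdr=0.01, prefix="REV_"):
    """Return the most permissive observed score whose estimated FDR is within max_fdr."""
    best = None
    for cutoff in sorted({log_e for _, log_e in matches}):
        if estimate_fdr(matches, cutoff, prefix) <= max_fdr:
            best = cutoff
    return best
```

The decoy entries from `make_decoys` would be appended to the target entries before the search, and `pick_cutoff` would then be applied to the resulting list of scored matches.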

Database implementation and access

Spruce proteome DB can be accessed and used for the interpretation of tandem mass spectrometry data through an instance of the Global Proteome Machine (GPM; Craig and Beavis 2004) at http://treenomix.ca/Home/ResearchActivities/FunctionalGenomics/ProteinProfiling/SpruceDB.aspx (username: tggreview; password: treenomix). Users can upload their peak-extracted data in any GPM-compatible format (e.g., .mgf, .mzXML). The complete spruce proteome DB can also be obtained by direct download for use with other proteomics analysis software platforms. The database is provided in FASTA format and should be compatible with all commercial and open-source platforms. At present, the database has been successfully tested with both Mascot (Perkins et al. 1999) and ProteinPilot (Applied Biosystems, Foster City, CA, USA) as alternative search engines.

Database performance

The utility of spruce proteome DB was assessed by comparing it against the set of plant protein sequences available in the NCBI database for the interpretation of peptide tandem mass spectrometry data. The data used were taken from a previously completed study performed in Norway spruce (Picea abies) and represent the liquid chromatography–tandem mass spectrometry analysis of a set of 22 cation exchange fractions from the separation of a protein digest. A more detailed description of the Norway spruce protein sample source and method of preparation can be found elsewhere (Lippert et al. 2009). Since Norway spruce is not well represented within the available EST resources, the dataset was searched against the Sitka spruce portion of spruce proteome DB, including both the EST and FLcDNA sequences. For comparison, the same data were analyzed using the same search criteria against two different publicly available sequence databases. One search was performed against the plant translated UniGene sequences from NCBI and the second was performed specifically against the translated Sitka spruce UniGene sequences contained within the larger NCBI database. Spruce proteome DB contains 47,208 unique Sitka spruce protein sequences as compared with 3,492,795 translated plant UniGene sequences from the NCBI database and 100,601 translated Sitka spruce UniGene sequences (Table 1). The searches were performed using default parameters for the type of mass spectrometer used for data collection, and a brief summary of the results has been tabulated. With spruce proteome DB, 492 proteins were identified with a log(expect) value less than or equal to −3.0 from the test dataset. In comparison, 393 proteins were identified using the much larger NCBI database and 407 proteins were identified when using only the Sitka spruce sequences from the NCBI database.
There was also a corresponding increase in the confidence with which these proteins were identified with spruce proteome DB, as indicated by the log(expect) values of the most confidently identified protein (−196.7 for spruce proteome DB vs. −128.4 using NCBI and −145.7 using the Sitka spruce sequences from NCBI). This result reflects an increase in both the specificity of individual peptide matches and the number of peptides identified per protein. There is a clear benefit to the use of spruce proteome DB for the interpretation of proteomics data, even for related species not directly represented in the database itself.
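
To put the reported counts in relative terms, the following Python sketch computes the gain in protein identifications and the difference in database size from the figures quoted above. The dictionary layout and function name are illustrative assumptions; the numbers themselves are those reported in the text.

```python
# Figures quoted in the text for the Norway spruce test dataset.
results = {
    "spruce proteome DB (Sitka spruce)": {"entries": 47_208, "proteins": 492},
    "NCBI plant UniGene": {"entries": 3_492_795, "proteins": 393},
    "NCBI Sitka spruce UniGene": {"entries": 100_601, "proteins": 407},
}


def relative_gain(new, old):
    """Percentage increase of new over old."""
    return 100.0 * (new - old) / old


reference = results["spruce proteome DB (Sitka spruce)"]
for name, row in results.items():
    if row is reference:
        continue
    gain = relative_gain(reference["proteins"], row["proteins"])
    size_ratio = row["entries"] / reference["entries"]
    print(f"{name}: {gain:.1f}% more proteins with spruce proteome DB, "
          f"searching a database {size_ratio:.1f}x smaller")
```

By this calculation, spruce proteome DB yields about 25% more identifications than the full plant UniGene set and about 21% more than the Sitka spruce UniGene set, while containing roughly 74-fold and 2-fold fewer entries, respectively.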

Table 1 Performance comparison of spruce proteome DB and NCBInr plant for Norway spruce protein identification from tandem mass spectrometry data

Conclusion

In its present form, spruce proteome DB provides a resource tailored to the analysis of proteomics data in species of spruce, one of the largest species groups among the conifers, which includes many of the economically and ecologically most important forest tree species of the Northern Hemisphere. This database may also provide benefit in the analysis of proteomics data from other conifers and gymnosperms. Future versions of this database will expand the depth of spruce proteome coverage as new sequencing efforts are undertaken and will also attempt to gather and process sequences from other conifer and gymnosperm species, expanding the size of the database, the diversity of species included, and its general utility for the analysis of gymnosperm proteomics.