Introduction

Cotton is not only an important economic crop worldwide but also a natural fiber and oil crop that can sustainably provide renewable resources and has a strong ability to survive in various environments. Upland cotton has strong environmental adaptability and high fiber production. It has been planted in large quantities worldwide, and its output accounts for approximately 95% of all planted cotton1. However, fiber quality, such as fiber length, uniformity index, micronaire value, breaking strength, and fiber elongation, is relatively ordinary. The elongation and development process of cotton fiber is a complex and orderly regulation process involving multiple genes and pathways. The yield and quality of cotton fiber are more sensitive to the external environment2. Because the quality of cotton fiber directly determines the quality of textiles, improving the yield and quality of cotton fiber has always been the focus of attention in the process of human cotton domestication. With the rapid development of science and technology and continuous scientific exploration, sequencing technology has gradually become a very widely used methodology in scientific research. With the development of mRNA sequencing technology, genome sequencing technology, resequencing technology and phenotype evaluation methods for cotton, important technologies and resources for studying the biological mechanisms of cotton fiber elongation have been developed3. Therefore. Breeders can use new biotechnology to develop new varieties of upland cotton with excellent fiber quality and high yield, which is of great benefit not only to upland cotton breeding but also to the global textile industry.

Recently, RNA deep sequencing technology has provided a platform for the analysis of differences in gene expression4. RNA-seq technology has been widely used in transcriptome studies of Arabidopsis thaliana5, Populus trichocarpa6, Glycine max7, Oryza sativa8, Medicago sativa9, Gossypium hirsutum10, Brassica napus11, Triticum aestivuml12, and Zea mays13, among others. The extensive application of RNA-seq technology has promoted the study of fiber elongation and development and provided strong technical support for the identification of genes with fiber-specific or fiber-dominant expression in cotton. In the study of cotton fiber, this technology is mainly applied for the investigation of differentially expressed genes (DEGs) between certain tissues of different cotton varieties14, between different tissues of the same cotton variety15, and during different developmental periods in the same cotton tissue16. By analyzing a transcriptome database, we can identify significant DEGs and then screen key genes that play important roles in different tissues or different developmental stages as candidate research objects for detailed functional analysis. For example, Padmalatha et al. sequenced the transcriptome of cotton fibers at different developmental stages under drought stress treatment and found that pectin modification and cytoskeletal protein-related genes play important roles in the initial differentiation stage of fiber primordial cells. These research results will help researchers develop drought-tolerant cotton cultivars without compromising fiber quality17. Using Gossypium hirsutum and Gossypium barbadense as materials, Li et al. sequenced the transcriptomes of fiber samples from different developmental stages (10 DPA, 15 DPA, 20 DPA, 25 DPA and 28 DPA) using Illumina HiSeq 2000 sequencing technology. A total of 1801 DEGs were identified, including 902 upregulated and 899 downregulated DEGs, which were mainly involved in the cell wall, cytoskeleton, transcription, and translation regulation. These findings lay a solid foundation for improving the fiber yield and quality18. Hu et al. performed transcriptome sequencing using 0 DPA and 5 DPA fibers from the Xu 142 cultivar and its mutant Xu 142fl. A total of 2641 new genes, 35,802 long noncoding RNAs (lncRNAs), and 2262 circular RNAs were identified. Three lncRNAs were selected as research objects in studies involving virus-induced gene silencing (VIGS) technology. It was found that these lncRNAs play an important role in the development of cotton fiber elongation19. Parekh et al. performed transcriptome sequencing of Gossypium herbaceum fibers from different stages and obtained 20,125 unigenes. They predicted some transcription factors that play an important role in the development of cotton fiber elongation15. Xu et al. revealed the evolution of the reactive oxygen species (ROS) network and the regulation of fiber development in cotton. It was found that the ROS network-mediated signaling pathway enhanced the regulatory mechanisms of fiber elongation and development in cotton20.

Many studies have used RNA-seq technology to discover genes that are specifically or predominantly expressed in association with cotton fiber development. For example, Wan et al. searched for genes that were specifically expressed or regulated fiber elongation by sequencing the transcriptome of Gossypium hirsutum Xu 142 and its fuzzless mutant Xu 142fl21. Man et al. used fibers from different developmental stages of Gossypium hirsutum and Gossypium barbadense as research objects and performed transcriptome sequencing analysis to screen for excellent genes determining fiber quality22. Li et al. searched for genes determining excellent fiber properties of cotton via transcriptome sequencing analysis of two inbred lines from a Gossypium hirsutum × Gossypium barbadense backcross23. However, when analyzing transcriptome sequencing data from Xu 142 and the Xu 142fl mutant, many DEGs were found, which might be due to random mutations causing marked differential expression of many genes that were not related to fiber development. This occurrence makes it difficult to screen genes for superior fiber traits. In addition, analysis of transcriptome sequencing data from different developmental stages of Gossypium hirsutum and Gossypium barbadense revealed that many genes with significant expression differences could be screened due to the very different genetic backgrounds of these species. However, these DEGs may result from different genetic backgrounds rather than being associated with good fiber quality and yield traits. Therefore, it is difficult to select excellent fiber quality genes by comparing the expression of fiber quality- and yield-related genes between Gossypium hirsutum and Gossypium barbadense.

In this study, transcriptome sequencing was performed on different tissues of Gossypium hirsutum Coker 312, and the DEGs were compared between fiber and nonfiber tissues to effectively narrow the selection range of candidate genes related to good quality traits in fiber development. Thus, information on genes that directly participate in fiber synthesis or regulate fiber development during fiber elongation and development could be obtained quickly. This study provides important genetic resources for breeding new cotton germplasms with excellent fiber quality.

Results

Detection of total RNA quality in different cotton tissues

Seven different tissues of upland cotton Coker 312 were sampled (Fig. 1). Samples of roots and leaves were collected at 15, 25 and 35 days after germination (Fig. 1a,b). Samples of anthers and stigmas were collected at − 3, − 2 and − 1 days before anthesis (Fig. 1c–e). Samples of fibers were collected at 7, 14 and 26 DPA (Fig. 1f–h). Detection of total RNA by agarose gel electrophoresis showed that the RNA from seven tissues exhibited two complete rRNA bands and that the 28S rRNA band was approximately twice as bright as the 18S rRNA band. The results indicated that the integrity of these total RNAs was relatively complete, and no degradation was observed (Fig. S1). The concentration and purity of the total RNA were measured with a Nanodrop 2000 spectrophotometer. The concentration of each sample was higher than 150 ng/μL, the OD260/280 was approximately 1.9, and the OD260/230 was greater than 2.0. Therefore, the RNA concentration and purity of these samples were relatively high, reaching the standard for A-level library construction (Table S1). The next step of database building and sequencing was then conducted.

Figure 1
figure 1

Seven different tissues of upland cotton Coker 312 were sampled. (a, b): Samples of roots and leaves were collected at 15, 25 and 35 days after germination; (ce): samples of anthers and stigmas were collected at − 3, − 2 and − 1 days before anthesis; (fh): samples of fibers were collected at 7, 14 and 26 DPA. The length of the red ruler line represents the actual measured length of 1 cm.

Transcriptome sequencing and comparison with the reference genome

On the Illumina HiSeq™ 2000 platform, high-throughput RNA sequencing generated 343 million raw reads from seven cotton tissues, with more than 46.59 million reads per tissue. A total of 332 million clean reads (96.79%) were obtained from the total raw reads after discarding adapters, low-quality reads and raw reads filtered according to an N greater than 5%. Two biological replicates were evaluated for each line, and the results indicated that the RNA-seq data were in good agreement (0.936 < R2 < 0.982). For all sequence data, the average Q20, Q30, and GC contents were 98.88%, 93.39% and 43.86%, respectively, and the error rates for all samples ranged from 0.01 to 0.02%. Overall, 83.33%-88.22% of the high-quality clean reads were mapped to the reference genome of Gossypium hirsutum TM-1 using TopHat29 (Table 1). The distribution of the numbers of genes expressed in each tissue sample (Table S2) and the sequencing coverage of the gene transcripts (Fig. S2) were analyzed. There were 3774–4481 genes expressed in high abundance. Among the mapped reads, 78.75–88.03% were distributed in exon regions, 1.54–7.81% were located in introns, and 10.09–17.26% were located in intergenic regions. Three types of mapping data were obtained: (1) multiple mapping (21–25.66%) and unique mapping reads (60.24–65.4%), (2) forward mapping (36.84–39.38%) and reverse mapping reads (36.75–39.22%), and (3) nonspliced reads (56.68–61.93%) and spliced reads (13.23–21.65%) (Table 1).

Table 1 Statistical table of transcriptome sequencing data.

A total of 76,772 genes were identified, and no fewer than 52,565 genes were expressed in the seven libraries. To test the correlations among the experimental samples, the expression of the genes in these seven libraries was analyzed using the Pearson correlation coefficient (PCC) (Fig. S3a). The results showed that 7 DPA fibers presented the greatest similarity to 14 DPA fibers (similarity as high as 0.732), followed by 26 DPA fibers (0.605), while the similarity to nonfiber tissues (root, leaf, anther and stigma) was lower. Therefore, gene expression was most similar in fibers from different elongation periods (7, 14 and 26 DPA); however, the gene expression between different tissues showed significant changes.

To further confirm the relationships among these seven different tissue libraries, principal component analysis (PCA) was performed on the expressed genes (Fig. S3b). The results showed that the expression of genes differed mainly among different tissues, except for the genes that were expressed in different periods in the fiber tissue. In the PCA, the gene expression pattern was different between different tissues, which was conducive to screening for cotton fiber superiority or specifically expressed genes and to some extent indirectly verified the reliability of our RNA-seq data.

Type and abundance of alternative transcript splicing

Many genes can produce multiple mRNA transcripts via alternative splicing, and different mRNAs can be translated into different proteins. Thus, a gene can produce multiple proteins through alternative splicing, thereby greatly increasing the diversity of proteins. The alternative splicing events of transcripts predicted by StringTie (v1.0.4, http://ccb.jhu.edu/software/stringtie/) using ASprofile software (http://ccb.jhu.edu/software/ ASprofile/ index.shtml) were classified and counted as shown in Fig. 2. Many alternative splicing events were present in the seven different tissue samples, and the results showed that the overall pattern of alternative splicing events was largely similar across all samples, with more than 79% being concentrated in the alternative 3′ last exon (TTS), alternative 5′ first exon (TSS), and alternative exon end (AE) categories. The results showed no difference between samples from different tissues of cotton, indicating that alternative splicing events proceeded steadily in different tissues or in different periods in the same tissue.

Figure 2
figure 2

Types and numbers of variable splicing statistics. AE Alternative exon ends, IR intron retention, MIR multi-IR, MSKIP multiexon SKIP, SKIP skipped exon, TSS alternative 5′ first exon, TTS alternative 3′ last exon, XAE approximate AE, XIR approximate IR, XMIR approximate MIR, XMSKIP approximate MSKIP, XSKIP approximate SKIP.

Screening of DEGs and overall transcriptome sequencing analysis

To screen DEGs in different tissues of cotton, we used the fragments per kilobase of transcript per million fragments (FPKM) method to measure the gene expression levels. According to the different FPKM values of the expressed genes in each tissue, the screening parameters for DEGs were set as follows: p value < 0.05 and |Log2(Fold Change)| ≥ 2. If the |Log2Fold Change)| value was larger, it meant that the expression difference of the gene between samples was higher, that is, high-abundance expression. According to the FPKM value of each gene in different tissues, cluster analysis and differential gene expression profiling were performed. From our data, a volcano plot and a histogram showing detailed information about the number of DEGs in each pairwise comparison were generated (Fig. 3). According to the sequencing results, volcano plots were drawn and screened according to the significant difference criteria (difference in gene expression > 2 and FDR ≤ 0.05), and statistically significant differences in gene expression were measured. The volcano plots showed that many genes were upregulated and downregulated in the pairwise comparisons (Fig. 3a–c). Histograms were generated to summarize the significant DEGs identified in the pairwise comparisons of all samples (Fig. 3d).

Figure 3
figure 3

Upregulation and downregulation of DEGs between different tissues. (ac) Volcano plot of DEGs (red dots represent upregulated DEGs, blue dots represent downregulated DEGs, gray dots represent genes that are not differentially expressed, the abscissa represents the fold change in gene expression in different samples, and the ordinate represents statistically significant differences in gene expression). (d) Comparison of the upregulation and downregulation of DEGs between each pair of samples in different tissues (red columns represent upregulated DEGs, blue columns represent downregulated DEGs, the abscissa represents the name of the sample comparison group, and the ordinate represents the number of DEGs).

Subsequently, the transcriptome data from the pairwise comparisons were compared with gene annotation databases; 889–5017 genes showed matches in the Gene Ontology (GO) database, and 765–4329 genes showed matches in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Table 2).

Table 2 Summary of annotated genes in each database for the pairwise comparisons.

Screening of DEGs from Venn diagrams between fiber and nonfiber tissues at different developmental stages

To identify genes that were specifically or predominantly expressed during the period of fiber development and elongation, transcriptome data were used to generate heatmaps of DEGs, and Venn diagrams showing detailed information about the numbers of DEGs in each pairwise comparison were produced (Fig. 4). Overall, 7 DPA and 14 DPA fibers showed high similarity in their gene expression profiles, but they presented less similarity to the 26 DPA fibers; however, the similarity between the fiber profiles was higher than that between the nonfiber tissues (Fig. 4a). These results were consistent with the results of the PCC. Comparative analysis of DEGs between fiber and nonfiber tissues revealed 1411 DEGs in 7 DPA fibers (Fig. 4b), 1405 DEGs in 14 DPA fibers (Fig. 4c) and 1219 DEGs in 26 DPA fibers (Fig. 4d). Among the DEGs, there were 1205, 1135 and 937 upregulated DEGs (Fig. S4a–c) and 124, 179 and 213 downregulated DEGs (Fig. S4d–f) in 7 DPA, 14 DPA and 26 DPA fibers, respectively. The dynamic changes in the DEGs identified from comparative transcriptome analysis between fiber and nonfiber samples might reveal the regulatory mechanisms of key genes in fiber elongation development and quality formation.

Figure 4
figure 4

DEGs identified in seven different tissues from comparisons between fiber and nonfiber tissues. (a) Heatmap of DEGs (p value < 0.05, |Log2(Fold Change)| ≥ 2 at one sampling point); (b): Venn diagram analysis of the DEGs in 7 DPA fiber and nonfiber tissues; (c): Venn diagram analysis of the DEGs in 14 DPA fiber and nonfiber tissues; (d): Venn diagram analysis of the DEGs in 26 DPA fiber and nonfiber tissues. A anther, L leaf, R root, S stigma, F_7 7 DPA fiber, F_14 14 DPA fiber, F_26 26 DPA fiber.

Functional annotations of DEGs in fiber and nonfiber transcriptome libraries

To identify genes related to fiber development and elongation, we analyzed DEGs from transcriptome data between fiber and nonfiber (root, leaf, anther and stigma) tissues. Compared with the transcriptome libraries of nonfiber tissues, 1411, 1405 and 1219 DEGs were identified in 7, 14 and 26 DPA fibers (Fig. 4b–d), respectively. Functional annotation analysis of the DEGs was performed according to GO terms and KEGG metabolic pathways.

In the GO term analysis, 576, 556 and 451 DEGs with functional annotations were enriched in 32, 31 and 31 GO groups in 7, 14 and 26 DPA fibers (Fig. 5a–c), respectively. During fiber elongation and development, among biological processes, the DEGs were significantly enriched in metabolic processes, cellular processes, single organism processes, and biological regulation processes. In the cell component category, the DEGs were mainly distributed in cells, membranes, and organelles. Among molecular functions, the DEGs were mainly enriched in functional groups, such as catalytic activity, binding, transport activity, and molecular function regulation. These results showed that the above functional groups could play an important role in fiber development and elongation.

Figure 5
figure 5

Functional annotation analysis of DEGs was conducted according to GO terms. (a) Enriched GO terms in 7 DPA fiber vs. nonfiber tissues; (b) enriched GO terms in 14 DPA fiber vs. nonfiber tissues; (c) enriched GO terms in 26 DPA fiber vs. nonfiber tissues.

In the KEGG metabolic pathway analysis, 455, 431 and 333 DEGs were significantly clustered into 30, 30 and 33 metabolic pathways in 7, 14 and 26 DPA fibers (Fig. 6a–c), respectively, which mainly included metabolic pathways such as those involved in signal transduction, carbohydrate metabolism, protein translation and processing, transportation and catabolism. These results showed that the above functional groups and metabolic pathways could play an important role in the stage of fiber development and elongation.

Figure 6
figure 6

Functional annotation analysis of DEGs was carried out through KEGG metabolic pathway analysis. (a) Enriched KEGG metabolic pathways in 7 DPA fiber vs. nonfiber tissues; (b) enriched KEGG metabolic pathways in 14 DPA fiber vs. nonfiber tissues; (c) enriched KEGG metabolic pathways in 26 DPA fiber vs. nonfiber tissues.

Analysis of related functional genes during fiber elongation and development

Fiber cell metabolism is very active during the fiber development and elongation stage, which is mainly manifested in the overexpression of three major groups of genes: (1) cytoskeleton-related genes; (2) cell wall structure and cellulose biosynthesis-related genes; and (3) energy and carbohydrate metabolism-related genes. A number of related functional genes that were differentially expressed during fiber elongation and development were selectively analyzed. The expression patterns of genes associated with fiber elongation development are shown in Fig. 7 and Table S3. Many upregulated genes were associated with cytoskeletal components, such as actin (GhACT), tubulin (GhTUBA4, GhTUBB2, GhTUBB6), kinesin (GhPSS1), clathrin (GhCLC2), and dynamin-related proteins GhDRP1), and these genes were predominantly or specifically expressed in cotton fiber tissues. The metabolic activities of various carbohydrate substances promote the rapid elongation of fibers, involving enzymes such as cellulose synthase (GhCESA5), sucrose synthase (GhSUS), glucosidase (GhGBA), phosphate uridylyltransferase (GhUGP1), polygalacturonase (GhADPG1) and other carbohydrate synthase or modification enzymes. Compared with nonfiber tissues, these genes were significantly upregulated at different fiber elongation stages. We also identified a number of genes involved in fatty acid metabolism and secondary metabolic processes, such as ultralong-chain enoyl coenzyme A reductase (GhECR2), fatty acyl coenzyme A reductase (GhFAR3), long-chain acyl coenzyme A synthase (GhLACS1), ultralong-chain-3-hydroxyacyl coenzyme A dehydratase (GhHACD3), alkane hydroxylase (GhMAH1), and phenylpropane compounds (GhACO1). In particular, most of the fatty acid metabolism-related genes were significantly upregulated in the early stage of fiber elongation (7 DPA), indicating that these genes play important roles in the early stage of cotton fiber elongation.

Figure 7
figure 7

Heatmap of related functional genes during fiber elongation and development (p value < 0.05, |Log2(Fold Change)| ≥ 2 at one sampling point).

Analysis of transcription factors during fiber elongation and development

Transcription factors can recognize and bind DNA sequences in a sequence-specific manner to regulate gene expression, forming a complex system that guides genome expression. Based on the criterion of |Log2(Fold Change)| ≥ 2 at one sampling point, a total of 1467 transcription factors from 46 transcription factor families were annotated between fiber and nonfiber tissues, including WRKY (136), AP2-EREBP (118), bZIP (97), bHLH (92), MYB (82), NAC (62), PIF (56), GRAS (36), AUX/IAA (28), TGA (22), and CCAAT (18) transcription factors (Fig. 8a). Among these transcription factors, 148 were significantly upregulated in fibers. Putative homologs of genes related to fiber development were identified in seven different cotton tissues (Fig. 8b–e). GhMYB114, GhMYB1, GhMYB3 and GhCPC-like belong to the MYB transcription factor family. Among these genes, GhMYB114, GhMYB1 and GhMYB3 presented similar expression patterns and were significantly upregulated in 14 DPA fibers, with lower expression in the other fiber periods, while their expression levels in nonfiber tissues were very low. GhCPC-like was significantly upregulated in 7 DPA fibers. PTI-6 belongs to the AP2/EREBP transcription factor family, and three PTI-6-homologous genes (GhERF34, GhERF38 and GhERF84) showed the same expression pattern and were upregulated in fiber tissue. However, their expression was significantly upregulated in the early stage of fiber elongation, and the expression level gradually decreased with the passage of developmental time. Two other AP2/EREBP transcription factors, RAP2 (GhTINT) and ERF (GhWIN1) also presented this expression pattern. Similarly, in the bHLH, bZIP and WRKY families, many transcription factors were found to be upregulated in fibers but showed gradually decreased expression with fiber development and were expressed at low levels or not at all in nonfiber tissues.

Figure 8
figure 8

Statistics of transcription factor expression in different tissues of cotton. (a) Quantity and classification of transcription factor families. A total of 1467 different transcription factors were annotated in 46 transcription factor families. The numbers represent the percentages of transcription factor genes. (be) Detailed illustration of the expression of transcription factors related to fiber elongation and development. The x-axis indicates the distribution of transcription factors in seven different tissues of cotton. The y-axis represents the FPKM value of each transcription factor in different tissues.

Analysis of dominant or specific expression genes during fiber elongation and development

A large number of genes were expressed in different developmental stages and participated in the regulation of fiber cell elongation and development. In this study, some functional genes specifically or preferentially expressed in the process of fiber elongation and development were analyzed. Based on the p value ≤ 0.05, the screening parameters between fibers and nonfibers were |Log2(Fold Change)| ≥ 2, and the screening parameters between fibers at different elongation and development stages were 0.5 < |Log2(Fold Change)| < 1.5. Corresponding parameters were set to screen fiber-dominant genes. The expression pattern of dominant expression genes in fiber elongation development is shown in Table S4 and Fig. 9. A total of 330, 128 and 278 highly expressed genes were screened in Fiber_7, Fiber_14 and Fiber_26, respectively; among them, 154 (46.67%), 43 (33.60%) and 96 (34.53%) had clear gene annotations, and the rest were genes with unknown function. Simultaneously, 206 genes with high-abundance expression in the whole elongation and development period of fiber were also screened, of which 114 (55.34%) had clear gene annotations. Therefore, there were a large number of fiber-dominant or fiber-specifically expressed genes in the process of fiber elongation and development, which has not been reported previously.

Figure 9
figure 9

Heatmap of highly abundant genes expressed in different periods of fiber elongation and development (based on a p value < 0.05, the screening parameters between fibers and nonfibers were |Log2(Fold Change)| > 2, and the screening parameters between fibers at different elongation and development stages were 0.5 < |Log2(Fold Change)| < 1.5.). (a) Heatmap of highly abundant genes expressed in 7 DPA fibers; (b) heatmap of highly abundant genes expressed in 14 DPA fibers; (c) heatmap of highly abundant genes expressed in 26 DPA fibers; (d) heatmap of highly abundant genes expressed in the whole period of fiber elongation and development.

Validation of RNA-seq data by qRT-PCR

To further validate the reliability of the RNA-seq results and the accuracy of the DEGs, twelve upregulated DEGs were randomly selected from each of the 7 DPA, 14 DPA and 26 DPA fiber transcriptome libraries, for a total of 36 genes. qRT-PCR was used to analyze the differential expression of genes between different tissues of cotton. Cotton Sad1 (NCBI Reference Sequence: NM_001327106.1) was used for relative gene expression normalization24. The qRT-PCR results showed that the 36 upregulated genes were all expressed specifically or predominantly in fibers (Fig. 10 and Fig. S5), among which 19 were specifically or preferentially expressed in 7 DPA fibers (e.g., CotAD_98043, CotAD_14327, CotAD_51137). Six upregulated genes were specifically or predominantly expressed in 14 DPA fibers, including CotAD_46959, CotAD_27919, CotAD_22244, CotAD_23413, CotAD_10228 and CotAD_36479. The remaining 11 upregulated genes were expressed specifically or preferentially in 26 DPA fibers. Comparative analysis of the qRT-PCR results and RNA-seq data revealed slight differences, but the expression trends of DEGs in different tissues were highly similar between the two groups of data. This result showed that the identification of genes that were specifically or predominantly expressed in fiber by comparing DEGs between different tissues resulted in improved accuracy. In conclusion, the transcriptome database of different cotton tissues constructed in this study presented high reliability.

Figure 10
figure 10

Validation of RNA-seq data by qRT-PCR. Root root, Leaf leaf, Anther anther, Stigma stigma, Fiber_7 7 DPA fiber, Fiber_14 14 DPA fiber, Fiber_26 26 DPA fiber; columns represent the Zresults of qRT-PCR, and zigzag lines represent the results of transcriptome sequencing.

Discussion

Transcriptome analysis of fiber and nonfiber tissues in cotton

Cotton fiber is an ideal model for studying cell elongation and cell wall construction in plants. Cotton fiber elongation is regulated in a complex, orderly manner involving multiple genes and pathways. Prominent progress in molecular biology research on cotton fiber development is the isolation of some cotton fiber-specific or abundantly expressed genes. With the development of new technology and bioinformatics, expression profile analysis with RNA-seq technology as the main method has played an important role in cotton fiber development research. In this study, we used RNA-seq technology to sequence the transcriptome of seven different cotton tissues (root, leaf, anther, stigma, 7 DPA fiber, 14 DPA fiber, and 26 DPA fiber) to screen for highly abundant genes during fiber elongation and development.

In this study, the analysis of DEGs was performed from the transcriptome data of 7 DPA fiber and nonfiber tissues. There were 1,205 upregulated genes with significant expression differences found in 7 DPA fibers; many of these genes had been previously reported in fibers, such as E625, Flb2A26, GhAGP427, GhFLA128, GhEXPA29, GhMYB230, GhACT131, Rac132, GhTUB133, and GhCesA34. GO enrichment analysis was performed on genes that were significantly upregulated in 7 DPA fibers. These gene products are mainly localized in the membrane, organelles, cell wall and other cell components and participate in molecular functions such as catalytic functions, binding, transport activity and molecular function regulation. The results were consistent with those reported by Qin35, Liu36 and Huang37, indicating that genes related to catalytic activity, lipid metabolism and the cell membrane may play an important role in the early elongation stage of fiber development. Subsequently, through analysis of KEGG metabolic pathways, it was found that the upregulated genes in fiber were mainly enriched in the categories of signal transduction, carbohydrate metabolism, protein translation and processing, transportation and catabolism, followed by energy metabolism, lipid metabolism, glycan biosynthesis and metabolism, cell growth and other metabolic pathways. This finding was consistent with the results reported by Fang et al., who found that carbohydrate metabolism, protein translation and processing, signal transduction and lipid metabolism played an important role in fiber elongation38. Similarly, DEGs were analyzed between fiber and nonfiber tissues at 14 DPA and 26 DPA, resulting in the identification of 1135 and 937 upregulated DEGs, respectively. The results of GO enrichment and KEGG functional analyses were similar to those of the cluster analysis of 7 DPA fibers, indicating that lipid metabolism, signal transduction, catalytic activity, the cell wall, and the cytoskeleton played important roles in the rapid and late elongation of cotton fibers.

Fiber development-related transcription factors regulate cotton fiber elongation and development

A cotton fiber consists of a single epidermal cell of the ovule that undergoes specialized initial differentiation, elongation, thickening and dehydration maturation to form a mature epidermal fiber39. A large number of genes are required for fiber differentiation and development, but it is thus far unclear how these genes control and regulate fiber development. Transcription factors play an important regulatory role in the growth and evolution of cotton fiber. In this study, a total of 1,467 transcription factors were identified from families such as MYB, bHLH, bZIP, and TCP. Among these transcription factors, 148 were significantly upregulated in fibers. GhMYB7, an R2R3-MYB transcription factor, was highly expressed during fiber elongation. A cross-sectional assay of basal stems revealed that the cell wall thickness of vessels and interfascicular fibers was higher in transgenic lines overexpressing GhMYB7 than in the wild type. This gene may be involved in regulating secondary cell wall biosynthesis in cotton fibers40. GhMYB25 and GhMYB25-like were mainly expressed at high levels during the initial differentiation and early elongation of the fiber, and their expression decreased with the onset of rapid elongation of the fiber. The silencing of GhMYB25 is associated with the production of short fibers in cotton41, while the suppression of GhMYB25-like produces fiber-less cotton42. GhMYB46 showed the highest expression in 20 DPA fibers. Overexpression of GhMYB46 leads to ectopic secondary cell wall (SCW) deposition in transgenic plants and could activate the promoter of the SCW cellulose synthase gene to regulate the elongation and development of cotton fiber43. GhMYB109 is specifically expressed in the differentiation and elongation stages of cotton fiber primordial cells, and antisense-mediated suppression of GhMYB109 leads to a substantial reduction of fiber length44. GhDEL65, a transcription factor of the bHLH protein family from upland cotton, is a homologous gene of Arabidopsis GLABRA3 (GL3), which is expressed at a significant level in the early stage of fiber elongation; ectopic expression partially rescues hair body development. Ectopic expression of GhDEL65 partially rescues the development of trichomes in the Arabidopsis gl3 mutant45. GhFSN1, one of the transcription factors of the NAC family, has been found to exhibit high levels of transcript accumulation in 15–28 DPA fibers, but the expression of this gene is undetectable or very weak in other tissues of cotton. This gene regulates the formation of the SCW in cotton fiber46. GhTCP14a, a TCP transcription factor, is expressed at a high level during the rapid elongation period (6–12 DPA) of cotton fibers and regulates the rapid elongation development of the cotton fiber47. In this study, a large number of transcription factors were expressed either preferentially or specifically in cotton fibers. However, further research is required to determine how transcription factors regulate the elongation and development of cotton fibers.

Fiber development-related functional genes regulate cotton fiber elongation and development

Researchers have successfully cloned and identified many cotton genes with fiber-dominant expression and studied their gene functions. These genes were all in the high-abundance expression database of fibers screened in this study (Table S3). For example, GhTUB1 is preferentially expressed in the elongation and development stages of fibers33. GhACT1 is mainly expressed in fiber cells, and its suppression disrupts the actin cytoskeleton and causes reduced fiber elongation, indicating that GhACT1 plays an important role in the period of fiber elongation but does not play a role in the initial period of fiber cell development31. GhPEL76 is a pectate lyase-like gene. Expression analysis (qRT-PCR) results showed that GhPEL76 is mainly expressed in cotton fibers, and its expression levels are significantly different in long- and short-fiber varieties. Virus-induced GhPEL76 silencing shortens fiber length, suggesting that GhPEL76 has a positive regulatory effect on fiber elongation48. GhLTPG1, a GPI-anchored lipid transporter gene, was cloned from Gossypium hirsutum and shown to be significantly expressed in the period of rapid fiber elongation. After heterologous expression in Arabidopsis, the number of leaf epidermal hairs was significantly increased. In contrast, silencing of the GhLTPG1 gene in upland cotton by using RNAi technology resulted in a significantly shortened fiber length, reduced polar lipid content and inhibited expression of genes related to fiber elongation49. GhFIM2 is an actin gene that is expressed predominantly in the overlapping portion of the fiber elongation and SCW synthesis phases. In cotton, overexpression of the GhFIM2 gene increases the expression level in the actin bundle in the fiber elongation period and accelerates the rate of fiber elongation so that the length of mature fibers is increased50. GhEXPA8 is an EXPANSIN family gene that is expressed at a high level during the rapid elongation (7–25 DPA) of fiber. Overexpression of GhEXPA8 can increase the fiber length and mark value to enhance fiber tenacity51. In this study, a large number of genes were screened for significantly high expression levels in the fiber elongation and development period, providing many new gene resources for genetic engineering for cotton fiber quality improvement.

Analysis of highly abundant expressed genes in fiber elongation development

Through analysis of fibers and nonfibers transcriptome data, a total of 1324 highly expressed genes in fiber were screened, of which 330, 128, and 278 genes were predominantly expressed in 7, 14 and 26 dpa fibers. A total of 407 genes had clear functional annotations, but only a few genes have been reported in research. After sorting out the data and analyses, we found a large number of unreported new genes that are expressed in high abundance in fibers (Table S4). For example, CotAD_24107 belongs to the proline-rich cell wall protein family genes, and its homologous gene PdPRP is preferred in immature poplar. Overexpression of PdPRP promotes secondary wall deposition and induces the expression of genes involved in microfibril angle and secondary wall biosynthesis52. In this study, we found that this gene was expressed in high abundance in cotton fibers, especially in the period of fiber secondary wall thickening (14 DPA). CotAD_27598 is a protodermal factor gene. In Arabidopsis, AtPDF2 is specifically expressed in bud epidermal cells and plays an indispensable role in bud differentiation53,54. This study showed that this gene was expressed in extremely high abundance during the differentiation stage of fibroblasts (7 DPA), and it was speculated that it played an important role in the initial stage of fibroblast development. The homologous gene MtKCS of CotAD_37982 was predominantly expressed in the epidermal cells of the bud apical meristem, leaf primordium, and floral organs of Medicago truncatula and was mainly involved in the biosynthesis of very-long-chain fatty acids. However, overexpression of very-long-chain fatty acids can enhance the biosynthesis of cytokinins and promote the process of cell differentiation55. In this study, it was found that this gene was specifically expressed in fibers and was expressed in high abundance during the differentiation period of fiber primordial cells (7 DPA). It was speculated that this gene had the function of promoting fibroblast differentiation. CotAD_37677 is a bidirectional sugar transporter gene. Its homologous gene OsSWEET3a functions as a glucose transporter and is predominantly expressed in the basic vascular bundles of rice seedlings56. This gene was mainly secondary to cotton fiber development. It was expressed in abundance during the wall thickening period (26 DPA), and we speculated that it transported glucose to fiber cells for cellulose biosynthesis. CotAD_60133 encoded a skewing-related protein. Its homologous gene AtSPR1 is related to directional cell expansion and functions by regulating cortical microtubule dynamics57,58, indicating that this gene may be regulated by microtubule dynamics to regulate the expansion and change of fiber cells. In conclusion, this study used transcriptome data analysis of fiber and nonfiber tissues to screen a large number of fiber dominant expression genes, which can be used as candidate genes for the improvement of cotton fiber quality.

To further verify the comprehensiveness and reliability of the transcriptome sequencing data from different cotton tissues obtained in this study, we selected 36 genes from the 7, 14 and 26 DPA genes that were significantly upregulated in the fiber to analyze the expression patterns among different cotton tissues by qRT-PCR. The significantly upregulated genes identified in fiber were all specifically or predominantly expressed in fiber. Among these genes, CotAD_46044 (E6)25, CotAD_46959 (GhPRP1)59, CotAD_27919 (GhEXPA)51, CotAD_14327 (GhGASL3)60, CotAD_63563 (GhGLP1)61, CotAD_05318 (GhXTH7)62, CotAD_01886 (GhPEL)63, CotAD_49061 (GhKCS)64, CotAD_20528 (GhFb)26, CotAD_69249 (GhFLA7) and CotAD_1612 (GhFLA12)37 have been previously reported to be specifically or preferentially expressed in the period of fiber elongation and secondary wall thickening and play an important role in cotton fiber elongation. The remaining 25 genes that were specifically or predominantly expressed in fiber had not been previously reported, so they could be used as candidate genes for further functional studies. By comparing the expression patterns of the selected fiber-upregulated genes between different tissues according to the qRT-PCR and RNA-seq data, we found that the gene expression patterns revealed by the two types of analysis were very similar. Therefore, the transcriptome libraries of different tissues constructed in this study were comprehensive and reliable and provided new genetic resources for the genetic engineering of cotton fiber quality improvement.

Materials and methods

Plant materials

Coker 312 was upland cotton cultivar and preserved in our laboratory, which was commonly used as test material in cotton research. In our study, using Coker 312 as material to identify the genes preferentially expressed in fiber development in Gossypium hirsutum. Our research contents complied with local and national regulations. Coker 312 was planted in a greenhouse. The roots and leaves were collected at 15, 25, and 35 days after germination, and the surface of the materials was cleaned with ddH2O. The root materials from the three periods were mixed together as the root tissue material and wrapped with aluminum foil, and the leaf material was subjected to the same procedure. The material was frozen in liquid nitrogen for 5 min and then transferred to a − 80 °C refrigerator for storage. Anthers were collected from cotton at − 3 ~ 0 DPA, stigmas were collected on the day of flowering, and fibers were collected at 7, 14, and 26 DPA (removing ovules). These samples were quickly placed in liquid nitrogen for freezing treatment for approximately 10 min and then transferred to a − 80 °C freezer for storage.

RNA extraction, cDNA library construction and RNA sequencing

Total RNA was extracted from each sample using the RNAprep Pure Plant Kit (Polysaccharides & Polyphenolics-rich) (TIANGEN, Beijing, China) following the manufacturer’s protocol. We collected 14 samples from two biological replicates of each sample. Electrophoresis in a 1% agarose gel was used to determine whether the total RNA presented genomic DNA contamination, degradation or impurity. The concentration, purity and RNA integrity of the total RNA were further determined using a Kaiao K5500 spectrophotometer (Kaiao, Beijing, China) and an Agilent 2100 Bioanalyzer (Agilent Technologies, CA, USA). After the total RNA samples were qualified, oligo (dT) magnetic beads were used to enrich the mRNA. Fragmentation buffer (Agilent, CA, USA) was added to the obtained mRNA to generate short fragments. Then, the first strand of cDNA was synthesized with six-base random primers, and the second strand of cDNA was synthesized by adding buffer, dNTPs, RNase H and DNA polymerase I (NEB, MA, USA). The QIAQuick PCR kit was used for purification, and elution was conducted with EB buffer (QIAGEN, Germany). The purified double-stranded cDNA was then treated by terminal repair and A base and sequencing adapter addition (Illumina, CA, USA). Subsequently, a fragment of approximately 150 bp was recovered by agarose gel electrophoresis, and PCR amplification was performed to complete the library preparation. Finally, the constructed cDNA libraries were sequenced on the Illumina HiSeq 2000 platform.

Bioinformatics analysis of RNA-seq data to identify DEGs

The initial results of transcriptome sequencing were in the form of original images, which could be employed for base recognition and transformation using CASAVA (v1.8) software to obtain the original sequence data65. The quality of the original sequence was evaluated by detecting the base sequencing error rate and G+C distribution. Higher-quality clean reads were obtained by collating and filtering to remove adaptor tags, reads with an N content greater than 5% and lower-quality ultrashort reads. In this study, the whole-genome data of Gossypium hirsutum TM-1 were used as a reference genome sequence (http://grand.cricaas.com.cn/home), and clean reads from seven different tissues were used for genome location analysis using TopHat2 software66. StringTie software was used to reassemble all clean reads for the prediction of new transcripts67. Gene expression level quantification was estimated via the FPKM (fragments per kilobase of transcript sequence per million fragments mapped) method using HTSeq software68. DEGseq software was used to identify the DEGs of different tissues according to a p value ≤ 0.05 and |Log2(Fold Change)| ≥ 269. Finally, the statistical results for the DEGs among tissues were obtained.

Functional classification of DEGs

The functional analysis of DEGs was conducted with GO and KEGG annotation. The GOseq R (v4.0.2) software package70 (https://www.r-project.org/)was used for the GO enrichment analysis of DEGs. We used KOBAS (v3.0) software (http://kobas.cbi.pku.edu.cn/kobas3/) to test the statistical abundance of DEGs in the KEGG database. The GO terms and KEGG pathways with corrected p values ≤ 0.05 were considered the thresholds to determine the significant enrichment of DEGs. The main functions and metabolic pathways of DEGs were preliminarily hypothesized.

Verification of RNA-seq results by qRT-PCR

To verify the accuracy of the obtained differential gene expression patterns between fiber and nonfiber tissues, we screened 36 genes that were significantly upregulated in fibers for verification analysis by qRT-PCR. These DEG-specific primers were designed with the Primer3 online tool (http://bioinfo.ut.ee/primer3-0.4.0/) (Table S5). According to the manufacturer's instructions, cDNA synthesis was performed from 1 μg of total RNA in a 20 μL reaction mixture using a PrimerScript RT kit (TAKARA, Dalian, China). The 20 μL reactions were performed using 10 μL of SYBR Premix Ex Taq II (TLi RanseH Plus) (TAKARA, Dalian, China), 0.8 μL of 10 mM forward and reverse primers each, 7.4 µL of ddH2O and 1 μL of cDNA template, after which amplification reactions were conducted. The cotton Sad1 gene was used as an internal reference gene. The qRT-PCR conditions were as follows: 95 °C for 30 s, 40 cycles of 95 °C for 5 s and 60 °C for 34 s. Three biological and technical replicates were performed for each sample to verify the results of the qRT-PCR test, and relative gene expression levels were quantified via the 2-ΔΔCt method.