1 Introduction

All cultures have their roots in agriculture. The agricultural sector employs approximately one billion individuals worldwide, about 28% of the employed population (Anon. 2018a). India, China, and the United States are the major cultivators globally, having the highest net cropped area (Anon. 2018b). Every aspect of crop cultivation must be considered to ensure a higher yield throughout the year. An estimated 10–16% of the global agricultural harvest is lost annually to plant diseases, at a cost of roughly USD 220 billion. The global problem of food contamination induced by plant diseases is an ongoing issue that plant pathologists cannot afford to ignore (Yadav et al. 2021). Currently, fungi account for around 83% of infectious plant diseases, phytoplasmas and viruses for 9%, and bacteria for more than 7% (Pavlovskaya et al. 2018). Agricultural diseases are a particularly critical issue in crop production, affecting every field.

A plant disease refers to any modification that disrupts the inherent physiological processes of the plant. Pathogens can significantly reduce both the quantity and quality of the harvest, affecting agricultural output. Several methods have been developed for diagnosing diseases to minimize serious harm. Molecular biology techniques, in particular, offer precise identification of pathogenic agents. However, these methods are expensive and resource-intensive to obtain and execute, and many farmers do not have access to them (Sahu et al. 2021).

Therefore, it is preferable to detect diseases precisely and promptly to prevent such losses. Plant diseases can be detected manually or via computer-based technologies. The symptoms most evident to the human eye are spots on the leaves. However, some diseases do not manifest on the leaves at all, while others appear only in the later stages, after the plants have already suffered significant damage.

Timely management of major crops necessitates careful oversight, including monitoring for diseases that could turn them into waste and addressing various issues promptly. Diseases can diminish plant productivity, and since each plant is susceptible to unique diseases, managing them with conventional disease control methods demands significant effort and skilled personnel. Farmers in most areas do not have the resources or knowledge to consult professionals. In many situations, relying solely on visual observation for diagnosis is insufficient: accuracy decreases when diagnosis rests on a single morphological characteristic. Computer vision (CV) and other methods should therefore be applied in parallel to improve diagnostic accuracy (Khakimov et al. 2022). Professional consultation is also costly and time-consuming, placing additional financial strain on the farmer (Ferentinos 2018).

In such cases, computerized systems offer a way to identify diseases quickly using sophisticated algorithms and analytical tools, particularly with the aid of powerful microscopes and other equipment. The main goal of this article is to give a comprehensive overview of current findings in the field of plant disease detection using computer vision and deep learning. The scientific databases PubMed, Scopus, and Web of Science have been used to examine research in the relevant fields in prior years. The implementation of digitalization in agriculture using deep learning and AI has progressed beyond the early conceptual stage. This study also highlights the challenges connected with implementing deep learning and computer vision techniques, along with their technical specifics. This article will support the adoption of deep learning and computer vision-based systems on farms and help elucidate how these technologies can be integrated into agricultural operations.

To continue the conversation around topics that have made important contributions to precision agriculture, this paper presents and analyzes recent, practical advances in deep learning and computer vision for detecting plant diseases. Understanding future progress in this discipline requires analyzing the work that has already been done. By the conclusion of this review, the reader should have a broad grasp of the applications of deep learning in agriculture. To gain a better understanding of the developments in this field, this article addresses the following research questions:

  • Why is plant disease detection significant in precision agriculture?

  • How far have deep learning and computer vision advanced in this field, and what challenges remain?

  • Which techniques and models show substantial development in plant disease detection?

  • What are the main aspects that need to be considered for effective implementation in precision agriculture?

2 Review methodology and literature selection

2.1 Literature selection criteria and process

The criteria for inclusion and exclusion were developed to form the foundation of a rigorous literature review process, ensuring that the selected studies were relevant, current, and of high quality (Mahmud et al. 2023). In this comprehensive review of deep learning and computer vision in plant disease detection for precision agriculture, these criteria were carefully crafted to encapsulate the breadth and depth of existing research while maintaining a focus on practical and experimental implementations.


2.1.1 Inclusion criteria


The inclusion criteria used to filter the literature are detailed below:

  • Relevance to topic: The primary criterion for inclusion was the relevance of the study to the topic of plant disease detection using deep learning and computer vision. Studies had to directly address one or more aspects of this topic, such as the development of DL models for disease identification, the application of computer vision techniques to analyze plant images, or the integration of these technologies in precision agriculture systems. Relevance was determined by examining the study’s objectives, methodology, and findings to ensure alignment with the research focus.

  • Practical implementation and experimental evidence: In order to provide a full overview of the current state of the art, studies offering practical implementations and experimental evidence of their findings were included. This includes papers describing the development, testing, and deployment of deep learning models and computer vision systems in real-world agricultural settings. Experimental evidence was crucial as it demonstrated the feasibility, effectiveness, and potential impact of the proposed methods. Studies including case studies, field trials, or extensive simulations were prioritized.

  • Methodological rigor: The methodological rigor of a study was another essential inclusion criterion. The robustness of the data collection methods, research design, and analysis techniques used in each study was assessed. Studies employing well-established methodologies, large and diverse datasets, and thorough validation procedures were favored. This criterion ensured that the included studies provided reliable and generalizable results.

  • Contribution to knowledge: Studies making significant contributions to the field were included. This includes papers introducing novel techniques, models, or frameworks, as well as those providing comprehensive reviews or meta-analyses of existing research. Studies identifying and addressing key challenges, gaps, and future directions in plant disease detection using deep learning and computer vision were also included.

  • Historical and foundational work: While the review primarily focused on recent advancements, foundational studies published before the last five years were also included. These seminal papers provided the basic knowledge and theoretical underpinnings essential for understanding current developments. Foundational works from as far back as 1943 and 2005 were also considered, particularly where they introduced key concepts, methods, or technologies that significantly influenced subsequent research.

  • Language and accessibility: To ensure the review was accessible to a wide audience, only studies published in English were included. This criterion helped maintain consistency and comprehensibility across the reviewed literature. Additionally, studies published in open-access journals or repositories were favored to facilitate easy access for researchers and practitioners.

2.1.2 Exclusion criteria

The exclusion criteria used to filter the literature are detailed below:

  • Irrelevance to topic: Studies not directly addressing plant disease detection using computer vision and deep learning were excluded. This included research focused on other aspects of agriculture or different applications of deep learning and computer vision not pertaining to plant disease detection. Irrelevance was determined by reviewing the study’s abstract, introduction, and conclusion sections.

  • Lack of practical implementation and experimental evidence: Studies purely theoretical or conceptual, lacking practical implementation or experimental evidence, were excluded. While theoretical insights were valuable, the focus of this review was on applied research demonstrating real-world applicability and effectiveness. Papers without empirical validation or practical case studies were filtered out.

  • Methodological weaknesses: Studies with significant methodological weaknesses were excluded. This included research with poorly designed experiments, small or biased datasets, inadequate validation techniques, or lack of reproducibility. Methodological weaknesses were identified through a critical appraisal of the study’s data collection, research design, and analysis methods.

  • Limited contribution to knowledge: Studies not making a substantial contribution to the field were excluded. This included papers reiterating well-known findings without offering new insights, failing to address significant challenges, or lacking originality in their approach. Additionally, studies not providing a comprehensive literature review or contextualizing their findings within the broader field were filtered out.

  • Non-English publications: To maintain consistency and accessibility, studies published in languages other than English were excluded. This criterion ensured that the reviewed literature could be accurately interpreted and analyzed by the broader scientific community.

2.2 Selection process

2.2.1 Keyword identification

The first step in the literature selection process was the identification of relevant keywords. These keywords were derived from the research questions and objectives of the review and included terms such as “plant disease detection”, “deep learning in agriculture”, “computer vision”, “precision agriculture” and related synonyms.

2.2.2 Search strings formation

Next, search strings were generated by combining the identified keywords and their synonyms. Examples of search strings used include “(plant disease detection AND deep learning)”, “(plant disease detection OR computer vision)”, and “(deep learning AND agriculture)”. These search strings were designed to capture a broad range of relevant studies.

2.2.3 Database search

Extensive searches were conducted in multiple academic databases to ensure comprehensive coverage of the literature. The databases included PubMed (https://pubmed.ncbi.nlm.nih.gov/), Semantic Scholar (https://www.semanticscholar.org/), Scopus (www.scopus.com), Google Scholar (https://scholar.google.com/), IEEE Xplore (https://ieeexplore.ieee.org/), ScienceDirect (https://www.sciencedirect.com/) and Web of Science (https://clarivate.com/products/web-of-science/). These databases were selected for their relevance to the fields of computer science, engineering, and agricultural research.

2.2.4 Initial screening

An initial screening of the search results was conducted based on the title and abstract of each study. Studies that appeared irrelevant or did not meet the inclusion criteria were excluded at this stage. The preliminary screening procedure assisted in reducing the total number of studies to a more manageable number for detailed evaluation.

2.2.5 Full-text review

The remaining studies underwent a full-text examination to evaluate their relevance, methodological rigor, and contribution to the field. During this stage, both the inclusion and exclusion criteria were applied in detail, evaluating each study’s objectives, methodology, findings, and overall quality.

2.2.6 Final selection

After the full-text review, 278 articles were chosen for final review and used in this work. This methodology ensures that a rigorous and systematic approach for selecting relevant literature was employed, thereby providing a comprehensive overview of recent developments and future directions in deep learning and computer vision for disease detection in precision agriculture.

3 Computer vision for image data acquisition

Computer vision (CV) aims to create a method that enables computers to “see” and comprehend the feature-based information in digital images and videos. It is a branch of science that enables computers to analyze, record, comprehend, and process visually traceable objects (Lawaniya 2020).  CV algorithms leverage computers to comprehend and recognize the patterns from input images or video, as shown in Fig. 1. The successful implementation of deep learning models, particularly convolutional neural networks (CNNs), has been demonstrated in various CV applications in recent years. These applications include traffic detection (Yang et al. 2019a, b), recognition of medical images (Sundararajan et al. 2019), text identification (Melnyk et al. 2019), facial recognition (Kumar and Singh 2020), and crop yield predictions (Damos 2015), etc.

Fig. 1
figure 1

Perception of human vision and computer vision for disease detection

Many deep learning-based methods have recently been used in agriculture to detect plant diseases and pests, and many domestic and international companies have built WeChat applets and object/disease identification apps. Automated diagnosis based on the observation of symptoms on plant leaves also makes disease detection simpler and less expensive. This section covers the use of machine vision to offer image-based robot guidance and autonomous process control (Kumar and Singh 2020; Tsaftaris et al. 2016; Yang et al. 2019a, b). In machine vision (MV), an imaging sensor is used to take plant photographs and assess whether they show pest and disease attacks (Lee et al. 2017). Plant disease and pest detection technologies based on machine vision have partially replaced visual identification (Martin et al. 2022).

Machine vision has been particularly beneficial in diagnosing the severity of diseases. Deep learning (DL), as an element of MV, can be employed to ascertain the severity of diseases affecting plants and animals. MV plays a significant role in farm operations since it supports robotics for plant disease diagnosis, pest control, yield estimation, and precision agriculture using visual information. MV systems use imaging sensors such as RGB, hyperspectral, and thermal cameras to capture images of croplands, and employ image processing techniques such as pattern recognition algorithms, particularly trained neural networks like CNNs. Such automation helps the farmer understand disease development in crops and apply effective control measures, enhancing crop management and yields. Despite the potential of just-in-time crop monitoring with MV, several constraints, including equipment price, environmental variability, and the difficulty of acquiring enough data, continue to hinder its practical use. MV also categorizes diseases and prevents infections from being discovered belatedly (Lee et al. 2017), and the technique can be used to examine and assess how much damage a disease has already done (Martin et al. 2022).

Therefore, it is essential to find technology that is reasonably priced and can monitor plants to detect and diagnose diseases, insect pests, and similar threats, so that farmers can take appropriate measures and safeguards once a problem is recognized. The rapid development of software and hardware has significantly advanced image processing in agriculture. Image processing techniques (IPT) are used to manipulate and analyze images from visible light cameras, infrared imaging devices, and other electromagnetic spectrum sensors. Much fascinating research has been done on hyperspectral techniques for agricultural pest and disease identification (Oerke et al. 2016; Yu et al. 2013, 2018; Martinez-Martinez et al. 2018; Azadbakht et al. 2019). However, hyperspectral equipment is expensive and difficult for average farmers and extension workers to operate (Jiang et al. 2019).

Image processing methods used in pest and disease recognition utilizing RGB images evolve daily. In most situations, the affected plant’s disease symptoms are visible (Atole and Park 2018; Zhang et al. 2018a, b, c). Image processing algorithms can be designed to operate accurately, quickly, and affordably to identify these diseases from common digital pictures (Ngugi et al. 2021).

  • Thresholding Technique: Thresholding is the simplest technique, converting gray-scale images to binary black-and-white images. Regions of interest, such as diseased sections of a leaf, are separated from the remaining pixels by setting an intensity threshold and keeping only pixels above it (see the sketch after this list).

  • Edge Detection: Edge detection algorithms locate the boundaries of infected areas in an image. Canny and Sobel, the two most popular edge detectors, identify regions where pixel intensity changes sharply, such as leaf spots, lesions, or discolouration.

  • K-means Clustering: K-means is a widely used unsupervised clustering technique that groups image pixels by colour similarity. It is useful for separating affected areas of a plant from healthy portions in RGB or hyperspectral imagery.

  • Region-based Segmentation: This approach subdivides an image into regions that share similar characteristics. The watershed algorithm, for instance, can separate overlapping disease spots on a leaf, improving the accuracy of disease diagnosis.

  • Convolutional Neural Networks (CNNs): CNNs are deep learning models that have transformed IPT in agriculture. They learn the shape, texture, and colour of leaves directly from image data, enabling accurate and rapid disease diagnosis. Their use in plant disease identification has become the norm.

  • Support Vector Machine (SVM): SVMs, often combined with histogram of oriented gradients (HOG) feature extraction, discriminate diseased from healthy tissue based on pixel patterns. They are well suited to binary classification tasks, such as separating disease-infested plants from their healthy counterparts.
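The first three techniques above can be combined in a few lines of code. The following is a minimal sketch, assuming OpenCV and a hypothetical leaf image file `leaf.jpg`; the cluster count and Canny thresholds are illustrative choices, not values from the cited studies.

```python
import cv2
import numpy as np

# Load a leaf image (hypothetical path) and convert it to gray scale.
img = cv2.imread("leaf.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Thresholding: Otsu's method picks the intensity threshold automatically,
# separating darker lesion pixels from brighter healthy tissue.
_, lesion_mask = cv2.threshold(gray, 0, 255,
                               cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Edge detection: Canny highlights sharp intensity changes such as
# lesion boundaries (threshold values are illustrative).
edges = cv2.Canny(gray, 50, 150)

# K-means clustering: group pixels by colour; k=3 assumes background,
# healthy tissue, and diseased tissue form three colour clusters.
pixels = img.reshape(-1, 3).astype(np.float32)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
_, labels, centers = cv2.kmeans(pixels, 3, None, criteria, 10,
                                cv2.KMEANS_RANDOM_CENTERS)
segmented = centers.astype(np.uint8)[labels.flatten()].reshape(img.shape)
```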

The following benefits are specifically provided by IPT adoption:

  • IPTs can accurately and efficiently identify agricultural diseases by utilizing photos of foliage, stems, fruits, and flowers.

  • The severity of the disease can be evaluated from the size of the discoloured or deformed patch relative to the size of the entire fruit, flower, or leaf (Tiwari and Tarum 2017; Rothe and Kshirsagar 2014; Padol and Sawant 2016; Bierman et al. 2019; Dey et al. 2016).

  • Monitoring disease development in plants is important for observing features such as the infection stage and spotting signs that are usually invisible to humans (Padol and Sawant 2016; Barbedo et al. 2016; Wang et al. 2017).

  • IPTs will also assist researchers in evaluating the traits of novel crop cultivars under laboratory evaluation for disease resistance (Bierman et al. 2019).

  • People who live in rural places may easily and affordably access knowledge through IPTs (Prakash et al. 2017; Anand et al. 2016).

  • Correct diagnosis results in more cost-effective usage of pesticides. In addition to lowering production costs, improved access to human experts who can be consulted remotely, rather than physically visiting each farm, will also preserve the environment and improve access to heavily regulated markets (Anand et al. 2016; Krithika and Grace Selvarani 2017).

4 Imaging techniques available for plant health status detection

Various imaging techniques, such as hyperspectral imaging, thermal imaging, and RGB imaging, have been utilized by several researchers to gather data for examining plant health status (Fig. 2). Fluorescence, thermal, hyperspectral, multispectral, visible, photo-acoustic, tomographic, thermographic, and MRT are useful imaging techniques (Singh et al. 2020). Additionally, 3D imaging techniques can also be utilized in combination with other techniques.

Fig. 2
figure 2

Numerous imaging techniques utilized to detect plant diseases

5 Imaging cameras and sensors for plant disease detection

Thermal and hyperspectral sensors are the most effective methods for detecting early-stage pathogen infections in crops. However, later-stage infection severity may also be detected by RGB, multispectral, and hyperspectral sensors (Maes and Steppe 2019).

Among different imaging sensors, digital RGB imaging sensors have been the most commonly used for plant disease detection. RGB cameras are frequently less expensive and easily available, and they can be employed to take clear still images (MacPherson et al. 2022). Cameras use RGB (Red, Green, and Blue) sensors to record red, green, and blue values per pixel. These cameras provide images that depict the intensity of the three colours, allowing for the assessment of biomass in crops (Gruner et al. 2019; Roth and Streit 2017; Viljanen et al. 2018). When estimating biomass, RGB cameras are deployed in conjunction with multispectral and near-infrared cameras to improve precision (Roth and Streit 2017). In modified RGB cameras, red filters are replaced with near-infrared filters (Berra et al. 2017; Nijland et al. 2014). Commercial RGB cameras are inexpensive yet lackluster in terms of spectral resolution (Nijland et al. 2014). Not all wavelengths within the 380–750 nm electromagnetic spectrum range of RGB cameras are appropriate for precise disease detection in crops (Bock et al. 2020). RGB colour information is often transformed to other colour spaces, such as hue-saturation-value (HSV), LAB (lightness plus two chroma components), and YCbCr (luma, blue-difference, and red-difference chroma components), which are particularly helpful in diagnosing plant diseases. The limited spectral resolution of RGB images makes it difficult to distinguish between different severity levels of disease (Zhang et al. 2018a, b, c). On the other hand, RGB cameras can acquire images with high spatial resolution, which provides better spatial features for plant disease monitoring and detection; this is one of their primary advantages over multispectral systems. Using RGB cameras effectively requires that the lighting and colour of the photos be consistent: consistently captured photos show less error in distinguishing healthy from diseased plants.
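As a brief illustration of the colour space transformations mentioned above, the snippet below converts a hypothetical RGB leaf capture into HSV, LAB, and YCbCr with OpenCV; the hue band used for the lesion mask is an illustrative assumption, not a value from the cited work.

```python
import cv2

img = cv2.imread("leaf.jpg")                   # hypothetical RGB capture
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)     # hue, saturation, value
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)     # lightness + two chroma axes
ycc = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)   # luma + chroma differences

# Hue is comparatively robust to illumination changes, so a fixed hue band
# (here, roughly yellow-brown lesions) can be isolated under varying light.
lesion_mask = cv2.inRange(hsv, (10, 60, 40), (30, 255, 255))
```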

The fundamental hyperspectral image classification procedure for identifying plant diseases is shown in Fig. 3.

Fig. 3
figure 3

Fundamental procedure in classifying plant diseases using hyperspectral imaging

Agricultural analytics performed with multispectral cameras provide the best results: these cameras can take photos with high spatial resolution and measure near-infrared reflectance (Nhamo et al. 2020). Vegetation indices are produced from multispectral and NIR cameras using near-infrared and other light bands (Adao et al. 2017; Geipel et al. 2016; Iqbal et al. 2018). Multispectral cameras use various spectral bands, primarily red, red-edge, green, blue, and near-infrared. Based on bandwidth, they can be divided into two groups: broadband and narrowband (Deng et al. 2018). Multispectral cameras are utilized for most aerial photos taken to monitor crop health concerns since they can supply indices such as NDVI, computed from NIR and other bands (Viljanen et al. 2018; Nhamo et al. 2020; Geipel et al. 2016; Zaman-Allah et al. 2015; Kalischuk et al. 2019). Various other studies have used multispectral cameras for disease detection (Di Gennaro et al. 2016; Zhang et al. 2018a, b, c; Albetis et al. 2018; Calderon et al. 2014; Dash et al. 2018; Khot et al. 2015; Nebiker et al. 2016).
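NDVI, the index most commonly cited above, is the normalized difference between near-infrared and red reflectance. A minimal sketch, assuming two co-registered band arrays from a multispectral capture:

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """NDVI = (NIR - Red) / (NIR + Red) for co-registered band images."""
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    return (nir - red) / (nir + red + 1e-6)  # epsilon avoids division by zero

# Healthy vegetation reflects strongly in NIR, so low-NDVI patches within
# a canopy can flag stressed or diseased regions for closer inspection.
```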

The primary distinction between hyperspectral and multispectral cameras is that each pixel of a hyperspectral image samples a near-continuous spectrum made up of many narrow bands, whereas a multispectral image is composed of a few discrete, broader bands (Lowe et al. 2017; Adao et al. 2017). Multispectral cameras can capture light reflected by biomolecules, and the bandwidth and location of the bands allow responses to be distinguished. These cameras are particularly good at detecting light associated with pigments and tissue, including chlorophyll (Cilia et al. 2014; Gevaert et al. 2015), mesophyll (Lowe et al. 2017), xanthophyll (Proctor and He 2015), and carotenoids (Cilia et al. 2014; Gevaert et al. 2015). The expense of the cameras (Adao et al. 2017; Deery et al. 2014) and the enormous amount of unusable data produced when they are not calibrated properly are the main drawbacks of employing a hyperspectral camera (Lowe et al. 2017; Saari et al. 2017). Hyperspectral cameras are mostly employed to overcome the limitations of multispectral cameras: identifying and differentiating target objects with small spectral differences requires hyperspectral cameras, which can resolve such fine spectral variation. Hyperspectral cameras have significantly advanced image processing compared to conventional cameras (Thomas et al. 2018), and they can identify plant stress along with potential etiological factors (pathogen/disease).
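To make the narrow-band idea concrete, the sketch below extracts individual narrow bands from a hyperspectral cube and forms a normalized-difference index; the cube is placeholder data and the chosen wavelengths are illustrative, not taken from the cited studies.

```python
import numpy as np

# Placeholder hyperspectral cube: height x width x spectral bands, plus the
# centre wavelength (nm) of each band, as a calibrated sensor would report.
cube = np.random.rand(256, 256, 224).astype(np.float32)
wavelengths = np.linspace(400, 1000, 224)

def band(cube: np.ndarray, wavelengths: np.ndarray, target_nm: float):
    """Return the band whose centre wavelength is closest to target_nm."""
    return cube[:, :, np.argmin(np.abs(wavelengths - target_nm))]

# Illustrative narrow-band index contrasting red-edge and red reflectance;
# disease-specific band pairs would be chosen from the relevant literature.
red_edge, red = band(cube, wavelengths, 740), band(cube, wavelengths, 670)
index = (red_edge - red) / (red_edge + red + 1e-6)
```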

Thermal cameras collect infrared radiation between 0.75 and 1000 μm and display the temperatures of objects as a thermal image (Costa et al. 2013). Thermal cameras are cheaper than other spectral cameras, and the conversion of RGB cameras to thermal imaging is feasible (Mahajan and Bundel 2016). Thermal cameras were initially used to examine drought stress in crops (Deery et al. 2014; Mahajan and Bundel 2016; Gago et al. 2015; Granum et al. 2015). Thermal pictures have limited resolution compared to photos from other cameras but record the temperature of the imaged objects (Costa et al. 2013). According to Calderon et al. (2013) and Smigaj et al. (2015), thermal sensors are also used to monitor crops and detect agricultural diseases. Yang and coworkers created a technique that uses thermal photography for the early diagnosis of diseases in tea (Yang et al. 2019a, b). Thermal sensors perform better than multispectral and hyperspectral ones for observing drought stress (Ludovisi et al. 2017; Zhou et al. 2020). The studies that have reported disease detection using several sensors are listed below in Table 1.

Table 1 Disease identification/detection/recognition/classification using various optical sensors

6 Traditional image processing techniques for plant disease detection

Machine vision approaches for plant disease and pest detection often use traditional image processing algorithms or manually engineered features with classifiers (Lee et al. 2017). Identifying plant diseases and pests in a complex natural environment presents several difficulties, including small differences between the lesion area and the background, poor contrast, wide variation in lesion scale and type, and significant noise in lesion images.

Colour, form, and texture are the three main characteristics of plant imagery, with shape being less useful for identifying plant diseases than colour and texture (Hlaing and Zaw 2018). Combining texture and colour characteristics, Hlaing and Zaw classified tomato plant diseases, extracting texture data (including information on shape, location, and size) with the scale-invariant feature transform (SIFT).

Traditional image-based disease detection and classification was done by digital image processing, which proceeds in several steps from image pre-processing to disease classification. Image pre-processing includes methods such as filtering and the contrast-limited adaptive histogram equalization (CLAHE) algorithm, while segmentation is done by thresholding, clustering, histogram-based methods, compression, region growing, etc. Disease features are then extracted using methods such as local binary patterns (LBP), speeded-up robust features (SURF), histogram of oriented gradients (HOG), the gray-level co-occurrence matrix (GLCM), and histogram features. Finally, disease classification is done using an SVM, naïve Bayes, decision trees, k-nearest neighbours (KNN), random forests, neural networks, fuzzy classifiers, etc. (Dhingra et al. 2018).
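A minimal sketch of the feature extraction and classification stages of such a pipeline, assuming a recent scikit-image and scikit-learn and two hypothetical labelled leaf images; a real pipeline would add the pre-processing and segmentation steps first:

```python
import cv2
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.svm import SVC

def glcm_features(gray: np.ndarray) -> np.ndarray:
    """Texture descriptors from a gray-level co-occurrence matrix (GLCM)."""
    glcm = graycomatrix(gray, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.array([graycoprops(glcm, p)[0, 0] for p in props])

# Hypothetical labelled leaf images: 0 = healthy, 1 = diseased.
paths, y = ["healthy_leaf.png", "diseased_leaf.png"], np.array([0, 1])
X = np.array([glcm_features(cv2.imread(p, cv2.IMREAD_GRAYSCALE))
              for p in paths])
clf = SVC(kernel="rbf").fit(X, y)  # SVM classifier on texture features
```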

Automatic methods for disease detection in soybean crops were developed in 2015 (Dandawate and Kokare 2015) and 2023 (Kumar et al. 2023). The RGB image was transformed into the HSV colour space, and segmentation was done using colour- and cluster-based techniques. The type of plant was inferred from its leaf shape using the SIFT approach. Using colour texture traits and discriminant analysis, Pydipati and colleagues (2006) detected citrus diseases. They also used the colour co-occurrence method (CCM) to test whether statistical classification methods and hue, saturation, and intensity (HSI) colour attributes could distinguish diseased leaves, attaining an accuracy greater than 0.95 (Pydipati et al. 2006). The following steps may be employed to discern the presence of these pathogens, which can infiltrate various plant parts such as stems, vegetables, fruits, and more:

  • Identifying and classifying the diseases.

  • Determining the affected region.

  • Retrieving the affected area’s feature set.

This approach constructs the imaging scheme around the plant diseases and pests of interest and chooses an appropriate light source and shooting angle, which ensures that images have uniform lighting. Purpose-designed imaging systems can reduce the complexity of constructing a standard algorithm but increase application costs. It is often unrealistic to expect standard algorithms to completely remove scene changes from recognition results when working in a natural environment (Dell’Aquila 2009).

Fairly consistent results were obtained across various feature extraction methods; in short, the standardization of the described methodologies has not yet been established. The automatic detection of plant diseases has been a long-running research topic, and researchers have reported highly satisfactory results using relatively small numbers of images for training and testing. This body of work shows that discriminant analysis, especially linear discriminant analysis, and backpropagation neural networks significantly outperform their competitors. However, with recently introduced optimized deep neural networks, the overall recognition performance of the models has significantly improved. Deep convolutional neural networks can yield superior results when utilized effectively, making them particularly beneficial for handling large datasets. Artificial neural networks (ANNs) are a popular deep learning method for image processing and categorization. ANNs are mathematical models whose units link to one another much as neurons and synapses do in the human brain (Ferentinos 2018). After being trained into a model on previously labelled data, the networks are applied to comparable new data. ANNs are computational systems strongly inspired by the operation of biological nervous systems. They consist primarily of many connected computational nodes that work in a distributed fashion to learn from the input and optimize the final output. The foundation of countless ANNs is that each neuron takes in input and carries out an operation; across the whole network, a scoring function (the weights) relates the raw input image matrices to class scores, the last layer holds the class-related loss function, and standard ANN training approaches apply.

Al-bayati and Üstündağ (2020) extracted only the portion of the leaf damaged by the disease. They additionally employed feature fusion, which aided feature reduction. It should be ensured that sufficient resources are available, because image-based detection demands many of them. The multilayer ANN serves as the basis for the underlying model; in a CNN, however, a convolutional layer executes kernel operations across various parts of the supplied image, and the resulting representation is largely unaffected by operations like rotation or translation. These features have been demonstrated to perform better than the conventional hand-crafted features previously utilized in detecting plant diseases. Previous studies on using hyperspectral pictures in plant disease diagnosis have shown that classification algorithms frequently use a correlation-based selection procedure, despite the hyperspectral classification of plant diseases being possible over the complete spectrum. Table 2 lists the research methods in hyperspectral image classification techniques for locating, identifying, and mapping plant diseases.

Table 2 Significant contributions to plant disease diagnosis, identification, and mapping using hyperspectral image classification techniques

7 Deep learning-driven computer vision models

Deep learning, a subset of machine learning, excels at processing unstructured data and generally outperforms standard machine learning on such tasks. Computer models can gradually learn properties from input at different processing stages (Mathew et al. 2021). Deep learning (DL) was first described in an article published by Hinton and Salakhutdinov (2006) in Science. Deep learning extracts data features using several hidden layers, each acting as a perceptron. Combining low-level features into abstract high-level features can significantly reduce the risk of getting stuck in local minima. Deep learning overcomes the limitation of traditional algorithms that rely on hand-engineered features, drawing increased interest from researchers. It is effectively used in recommendation systems, computer vision, pattern recognition, speech recognition, and natural language processing (NLP) (Liu et al. 2017).

In contrast to other image recognition techniques, deep learning-based image recognition does not require the extraction of hand-specified features; instead, it finds the right features through iterative learning (backpropagation), allowing it to acquire both global and contextual features from images while also being more robust and accurate at recognizing objects. CNNs, and DL in general, were developed for the analysis of multidimensional data such as photographs (Martinelli et al. 2015). Traditional manual image classification and identification methods can extract only the basic features of a picture and struggle to extract complex feature information (Fergus 2012). Deep learning can eliminate this barrier: unsupervised learning from the original image can reveal low-level, middle-level, and high-level semantic properties. Plant disease and pest image detection therefore holds great promise with deep learning. Recently developed deep neural network models include the stacked de-noising autoencoder (SDAE), deep belief network (DBN), deep Boltzmann machine (DBM), and deep convolutional neural network (CNN) (Bengio et al. 2013).

A computer model built with the DL approach to machine learning mimics the biological neural pathways of a human (McCulloch and Pitts 1943). In contrast to conventional neural networks, the artificial neural networks used in deep learning have many processing layers (Ferentinos 2018). The workflow entails several phases: data gathering, image categorization, and result interpretation. Artificial neural networks for image categorization come in a variety of forms, including CNNs, generative adversarial networks (GANs), and recurrent neural networks (RNNs). Among these, CNNs are the most often utilized for identifying and categorizing plant diseases.

Some researchers have also combined hyperspectral imaging (HSI) with DL models to observe plant disease signs more clearly. A comprehensive assessment of DL with the HSI approach has been carried out (Signoroni et al. 2019), including a thorough evaluation of several DL models, such as a 2D-CNN, LSTM/GRU, and a hybrid LSTM/GRU with a 2D-CNN, aimed at preventing overfitting and boosting accuracy.

7.1 Convolutional neural networks (CNNs) based deep learning approach

Convolutional neural networks, akin to traditional ANNs, are composed of neurons that learn to optimize themselves. The distinguishing feature of CNNs relative to conventional ANNs is that they are primarily employed for feature representation within images, which makes the network better suited to image-focused tasks and reduces the number of parameters to configure (O’Shea and Nash 2015). GoogLeNet (Brahimi et al. 2018), Inceptionv3 (Ahmad et al. 2020), VGG19 (Ahmad et al. 2020), EfficientNet, ResNet50 (Ahmad et al. 2020), DenseNet (Tian et al. 2019), Xception (Verma et al. 2019), and MobileNet (Bi et al. 2019) are pre-trained network models. They have been used for various computer vision applications, including image classification, image generation, anomaly detection, neural style transfer, image captioning, and more.

7.2 Plant disease identification using deep learning architectures

The application of DL models for disease recognition in crops is expanding quickly (Ferentinos 2018; Carranza-Rojas et al. 2017; Yang and Guo 2017). Using aerial images, CNNs are the fundamental deep-learning method for identifying plant diseases; they comprise powerful modelling approaches that recognize intricate patterns in vast volumes of data (Ferentinos 2018). Studies lacking sufficient data for neural networks can nevertheless augment their data. CNNs superseded the earlier ANNs, which were created for domains with recurring patterns, such as identifying images of unhealthy plants. Numerous CNN-based algorithms have been effectively employed to categorize plant diseases, making crop health monitoring simpler.

According to previous studies, CNNs’ percentage of accurate predictions was 1 to 4% greater than that of SVMs (Chen et al. 2014; Grinblat et al. 2016; Lee et al. 2015) and 6% greater than that of random forests (Kussul et al. 2017). Nevertheless, Song and colleagues (2016) reported a CNN model whose correct prediction rate was 18% lower than that of classical ML models. Deep learning models for crop categorization aid pest control, agricultural planning, yield prediction, and other tasks (Zhu et al. 2018). They have made farmers’ work easier: a farmer can capture a photo in the field and send it to a program that determines the disease. CNN models do not require feature engineering, a time-consuming procedure, because the key features are discovered during training; their layers automatically learn to identify features from images. Key equations and pseudocode are as follows:

  • Convolution operation:

$$(I*K)(i,j)=\sum_{m}\sum_{n} I(i+m,\,j+n)\,K(m,n)$$
(1)

Where I is the input image, K is the kernel, and (i, j) are the coordinates of the output.

  • ReLU activation:

$$\mathrm{ReLU}(x)=\max(0,x)$$
(2)
  • Pooling operation (Max pooling):

$$P(i,j)=\max_{m,n} I(i+m,\,j+n)$$
(3)

Where P is the pooled feature map.

  • Feature extraction:

$$F=\mathrm{PretrainedModel}(I)$$
(4)

Where F is the feature map extracted from the input image I.

  • Fully connected layer:

$$y=W\cdot F+b$$
(5)

Where W and b are the weights and biases of the fully connected layer, and y is the output.

In the convolution operation (Eq. 1), an image I is convolved with a small matrix known as the kernel K to produce a feature map; the indices i and j refer to coordinates in the output feature map. As the kernel slides over the image, the pixels within the kernel window are multiplied by the values at the corresponding positions of the kernel and the weighted pixels are summed. This enables the model to learn features such as edges, corners, and textures, which are small but very important patterns in an image. The convolution pinpoints these low-level patterns, and deeper layers of the network progressively combine them into higher-level patterns. The ReLU activation function (Eq. 2) is then applied to the layer output: it zeroes out negative activations in the feature map while leaving positive ones unchanged. This introduces the non-linearity needed to learn relationships that a purely linear model cannot capture, while keeping the architectural design of the model simple and significantly increasing its feature extraction capability.

The next step is a pooling operation, most often max pooling, described by Eq. (3). Max pooling scales down the feature map by taking small windows of the input feature map and keeping only the maximum value in each. Down-sampling of this kind retains the essential characteristics and discards unimportant ones, cutting processing costs and reducing the chance of overfitting; by keeping only the most important features, max pooling also improves how well the network performs on data different from what it was trained on. For feature extraction, Eq. (4) allows an accepted pretrained model to be substituted: models such as ResNet or VGG are commonly used as pretrained models for transfer learning, in which filters trained on one dataset are reused for new tasks. Having been exposed to copious amounts of data, these deep networks have learned features that are useful for the subsequent stages of the network. The last part of the model is the fully connected layer illustrated in Eq. (5); it combines the knowledge acquired by the model and serves the final stage, prediction, i.e., discrimination of plant diseases in the image.

  • Pseudocode for a simple CNN model for plant disease detection:

Algorithm 1
figure a

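The algorithm figure itself is not reproduced here. As a substitute, the following is a minimal PyTorch sketch of such a CNN, tying the layers back to Eqs. (1)–(5); the layer sizes, image resolution, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Stacks the operations of Eqs. (1)-(3) and (5): convolution + ReLU,
    max pooling, and a fully connected classification layer."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3), nn.ReLU(),  # Eqs. (1)-(2)
            nn.MaxPool2d(2),                             # Eq. (3)
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),                     # global pooling
        )
        self.classifier = nn.Linear(64, num_classes)     # Eq. (5)

    def forward(self, x):                        # x: (B, 3, H, W)
        f = self.features(x).flatten(1)          # feature map F, cf. Eq. (4)
        return self.classifier(f)                # class scores y

# Example: a batch of two 128x128 RGB leaf images, five disease classes.
logits = SimpleCNN(num_classes=5)(torch.randn(2, 3, 128, 128))
```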

Due to the specific characteristics of each disease location, Barbedo (2018a) examined individual lesions and patches rather than the entire leaf. This approach can detect multiple diseases on one leaf and expands the data by dividing the leaf image into many sub-images. In 2015, Lee et al. (2015) proposed a new perspective on leaf disease detection focused on identifying diseased regions. Experiments showed that training the deep learning model on general disease characteristics yielded a more generic detector, regardless of crop type or previously unobserved conditions.

Customized deep learning models have proven highly effective in plant disease detection because they can be tailored to specific datasets and problems. These models can be fine-tuned to deliver optimal performance for different plant diseases, varying image qualities, and diverse environmental conditions, providing a more accurate and efficient solution than generic models. A good example is the work of Mahmud et al. (2024) in their paper “Light-Weight Deep Learning Model for Accelerating the Classification of Mango-Leaf Disease,” which developed a streamlined version of the DenseNet architecture specifically for mango-leaf disease classification. The custom DenseNet model was designed to be lightweight, reducing computational complexity while maintaining high accuracy, which makes it ideal for resource-limited environments such as the mobile devices or edge computing systems often used in agriculture. The model was fine-tuned on a dataset of mango-leaf images with various diseases, allowing it to learn disease-specific features more effectively and thus improve its classification accuracy. Moreover, its lightweight nature ensures faster processing times, which is crucial for real-time field applications, making it a valuable tool for farmers and agricultural experts who need quick and reliable disease diagnostics. The model achieved outstanding classification results, with improved accuracy and reduced computational overhead compared to standard deep learning models, showing that tailored architectures can significantly enhance performance for specific agricultural challenges. The success of this custom DenseNet model highlights the advantages of specialized deep learning architectures for plant disease detection: custom models can be optimized for the unique characteristics of different plants and disease types, yielding diagnostic tools with higher accuracy, faster processing times, and adaptability to various environments, thereby advancing precision agriculture, healthier crops, and more sustainable farming practices. Table 3 summarizes current studies that use the DL framework directly for disease detection and categorization. The flowchart of DL model implementation for disease detection is shown in Fig. 4.

Table 3 Disease identification/detection/recognition/classification using deep learning algorithms
Fig. 4
figure 4

Plant disease detection flow diagram with DL implementation (Agarwal et al. 2020)

Deep learning computer vision models, such as CNNs, offer several significant benefits. One of the primary advantages is their ability to automatically extract relevant features from raw image data, eliminating the need for extensive manual feature engineering (Chakraborty et al. 2022; Chandel et al. 2022). This capability allows these models to identify complex patterns and structures within images, making them highly effective for tasks like image classification, object detection, and semantic segmentation. These models have demonstrated impressive accuracy and performance in various real-world applications, particularly when ample labeled data is available for training. Another key strength of deep learning computer vision models is their scalability. Their performance improves with the availability of more labeled data and better hardware resources, such as powerful GPUs, making them suitable for large-scale image analysis projects. Moreover, these models are versatile and have been successfully applied across different domains, including medical imaging, autonomous driving, and agricultural monitoring, showcasing their wide-ranging applicability (Upadhyay et al. 2024).

However, there are notable limitations to these models. One significant challenge is their dependency on large amounts of labeled data for training. Without sufficient annotated datasets, the performance of these models can be limited. In comparison, advanced models often incorporate unsupervised or semi-supervised learning techniques to mitigate this dependency. Additionally, deep-learning computer vision models require substantial computational power for training and deployment, which can be a barrier for organizations with limited resources. Interpretability is another major concern. These models often act as “black boxes,” making it difficult to understand the reasoning behind their decisions and the features they have learned. While advanced models sometimes incorporate techniques to enhance interpretability, traditional deep learning models lag in this aspect. Overfitting is also a common issue, especially when these models are trained on limited datasets. Although regularization methods and data augmentation can help address this, advanced models often use more sophisticated techniques to prevent overfitting effectively. Lastly, deep learning computer vision models can be vulnerable to adversarial attacks, where minor perturbations in input data can lead to incorrect predictions. Advanced models may offer better defenses against such attacks, highlighting an area where traditional deep learning models need improvement.

In summary, while deep learning computer vision models provide automated feature extraction, high accuracy, scalability, and versatility, they also face challenges related to data dependency, computational complexity, interpretability, overfitting, and robustness to adversarial attacks. Advanced models often address these limitations to some extent, serving as a benchmark and pointing out areas for enhancement in traditional deep-learning approaches.

8 Advanced computer vision-based deep learning models

Two-stage detection methods start by using region proposal techniques to create several sparse candidate boxes, after which a CNN-based detector performs bounding-box regression and categorization. The second family of algorithms, known as single-stage detectors, includes the single shot multi-box detector (SSD) (Liu et al. 2016) and the you only look once (YOLO) series (Redmon et al. 2016; Redmon and Farhadi 2017, 2018). They estimate bounding boxes and target class probabilities simultaneously from entire pictures. These CNN-based algorithms have excelled in major competitions where objects are recognized in real-world images, including PASCAL VOC (Pattern Analysis, Statistical Modelling and Computational Learning Visual Object Classes) (Everingham et al. 2010), ImageNet (Deng et al. 2009), and COCO (Common Objects in Context) (Lin et al. 2014; Li et al. 2020a, b).

Fig. 5
figure 5

Faster R-CNN architecture for Agricultural Greenhouses detection (Li et al. 2020a, b)

Faster R-CNN comprises two modules: a Fast R-CNN detector and a region proposal network (RPN) (Everingham et al. 2010; Li et al. 2020a, b; Girshick et al. 2014; Girshick 2015; Ren et al. 2017). The RPN is a fully convolutional proposal generation network. Each feature map location can generate nine anchors covering three scales and three aspect ratios. Depending on the presence of targets, these anchors are labelled positive or negative, and to avoid bias, positive and negative samples are chosen randomly in a 1:1 ratio to form a minibatch. During training, the anchors create candidate areas by being compared against the ground truth boxes of the objects. As a result, once convolutional feature maps of any size are fed into the RPN, it can produce a batch of descriptors indicating whether or not the region proposals contain candidate objects. The architecture shared by the Faster R-CNN detector and the RPN is shown in Fig. 5. Better object detections can be obtained by using a base network whose convolutional layers are deployed to extract features (Li et al. 2020a, b).
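As an illustration of the anchor scheme described above, the sketch below generates the nine (width, height) anchor shapes per feature-map location; the scales and aspect ratios follow common Faster R-CNN defaults, which are an assumption here rather than values from the cited works.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Nine anchor shapes per location: 3 scales x 3 aspect ratios (h/w),
    each pair preserving the area scale**2."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((w, h))
    return np.array(anchors)

print(make_anchors().round(1))  # nine (width, height) pairs in pixels
```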

Fig. 6
figure 6

Flowchart of YOLOv4 object identification method for disease (Roy et al. 2022)

YOLO v3 both inherits from and improves upon YOLO v1 and v2. In the YOLO v1 algorithm, all input images are rescaled to a specified size and split into an S×S grid. Each grid cell explicitly predicts a definite set of B bounding boxes with associated confidence scores, but can be associated with only one object. A set of probabilities for each object class is generated simultaneously by the fully connected layer. The same target, however, may end up with more than one box around it. To avoid redundant predictions, the detection boxes with the greatest confidence scores are chosen using non-maximum suppression (NMS) with an intersection-over-union (IoU) criterion. IoU measures the overlap between predicted and ground-truth bounding boxes; NMS keeps the detection box with the greatest confidence and discards the rest. This effectively recasts the object detection problem as an end-to-end regression task (Li et al. 2020a, b).
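A minimal sketch of the IoU and NMS computations described above, using [x1, y1, x2, y2] box coordinates; the 0.5 suppression threshold is a common but illustrative choice.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-confidence box, drop boxes overlapping it, repeat."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```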

Recent versions of YOLO, including YOLOv4, YOLOv5, YOLOv7, and the latest YOLOv9, have brought substantial improvements in terms of accuracy, speed, and efficiency. YOLOv4 introduced advanced features like cross stage partial (CSP) connections, spatial pyramid pooling (SPP), and the Mish activation function, which together enhance detection performance and training stability, as shown in Fig. 6. YOLOv5, although not an official continuation by the original creators, has become popular for its ease of use and implementation improvements. YOLOv7 took this further by optimizing both speed and accuracy, making it ideal for embedded and mobile applications. YOLOv9 continues this trend with even more advanced optimizations and innovations, further boosting performance across various metrics. It features improved backbone networks, better handling of small objects, and enhanced real-time detection capabilities, making it one of the most efficient and accurate models available for object detection.

The SSD method combines Faster R-CNN’s anchor box technique with YOLO v1’s regression concept. In object detection models, a base network consisting of the initial few layers is a popular design, as shown in Fig. 7; as in Faster R-CNN, the same VGG-16 network was utilized. After the base network, SSD employs a pyramid structure of feature maps at various resolutions. As the spatial resolution of the feature maps decreases, fine detail information is progressively lost while abstract semantic features become richer. Consequently, features at different depths can detect small and large objects simultaneously, which is crucial for handling varying object sizes (Li et al. 2020a, b).

Fig. 7
figure 7

Architecture of single shot detector (SSD) (Sivakumar et al. 2020)

The advancements in models like YOLO demonstrate significant progress in addressing some of these challenges, particularly in object detection, by enhancing both speed and accuracy. Despite these advancements, there are still some challenges. One major issue is the need for large amounts of labeled data to train these models effectively. Without enough annotated datasets, their performance can suffer. In contrast, newer models often use unsupervised or semi-supervised learning techniques to address this dependency. Additionally, deep learning models for computer vision require significant computational power for training and deployment, which can be a hurdle for organizations with limited resources.

Table 4 Comparison of multi-box SSD, you only look once v3 (YOLOv3), and Faster R-CNN (Li et al. 2020a, b)

8.1 Vision-transformers (ViTs)

The Transformer architecture has established itself as the industry norm for natural language processing tasks, but its application to computer vision is still limited. In vision, attention mechanisms have been used to replace some components of convolutional networks while maintaining the overall structure of the network. Several researchers have demonstrated that dependence on CNNs is not required and that a pure transformer applied directly to sequences of image patches can achieve excellent results on image classification tasks. Traditional CNNs process images by applying a series of convolutional layers to capture local features within the image. In contrast, ViTs divide an image into smaller patches and treat each patch like a word in a sentence. This method allows the model to examine the relationships between these patches, enabling a broader and more holistic analysis of the image rather than relying only on local pixel information. The transformer architecture stands out for its flexibility, making it suitable for a wide range of vision tasks beyond classification, including object detection and segmentation. Unlike CNNs, which are often fine-tuned for specific tasks, ViTs can be more easily adapted to different applications. One of the key strengths of ViTs is their ability to capture long-range dependencies, allowing them to grasp the overall context of an image, not just the local features. This holistic understanding is essential for tasks that require a comprehensive interpretation of the visual scene (Maurício et al. 2022). The architecture of ViT is mainly based on the original transformer, as shown in Fig. 8. ViT achieves good results when it is pre-trained on large quantities of data and transferred to various mid-sized or small image recognition benchmarks (CIFAR-100, ImageNet, VTAB, etc.) while requiring substantially fewer computational resources to train (Dosovitskiy et al. 2020).

After being proposed by Vaswani et al. (2017) for machine translation, transformers are now the most advanced method for many NLP tasks. If self-attention were applied naively to images, each pixel would have to attend to every other pixel; this quadratic cost in the number of pixels does not scale to realistic input sizes. Therefore, transformers have been applied to images in various approximate ways. Parmar et al. (2018) employed self-attention locally, rather than globally, for each query pixel. By using such local multi-head dot-product self-attention blocks, convolutions can be completely replaced (Zhao et al. 2020; Ramachandran et al. 2019; Hu et al. 2019). The operations in the layers of a ViT can be expressed by the following equations and pseudocode:

  • Patch embedding:

$$\:{x}_{p}=Flatten\left({x}_{i}\right)$$
(6)

Where, xi is the i-th image patch.

  • Linear projection:

$$\:{z}_{i}={x}_{p}{W}_{e}+{b}_{e}$$
(7)

Where We and be are the projection weights and biases.

  • Self-attention:

$$\:Attention\left(Q,K,V\right)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V$$
(8)

Where Q, K, and V are the query, key, and value matrices, and dk is the dimension of the keys.

  • Transformer encoder layer:

$$\:{z}_{l}^{{\prime\:}}=MSA\left(LN\left({z}_{l-1}\right)\right)+{z}_{l-1}$$
(9)
$$\:{z}_{l}=MLP\left(LN\left({z}_{l}^{{\prime\:}}\right)\right)+{z}_{l}^{{\prime\:}}$$
(10)

Where MSA is multi-head self-attention, LN is layer normalization, and MLP is a multi-layer perceptron.

  • Pseudocode of a Vision Transformer (ViT) for plant disease detection:

Algorithm 2
figure b

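Since Algorithm 2 is available only as a figure, the following is a minimal NumPy sketch of the ViT forward pass defined by Eqs. (6)–(10); single-head attention, a one-layer MLP, and all dimensions and weights are illustrative simplifications, not the published algorithm.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(q, k, v):
    # Eq. (8): scaled dot-product self-attention (single head for brevity).
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def encoder_layer(z, w_q, w_k, w_v, w_mlp):
    h = layer_norm(z)                                       # LN(z_{l-1})
    z_prime = attention(h @ w_q, h @ w_k, h @ w_v) + z      # Eq. (9): MSA + residual
    return np.tanh(layer_norm(z_prime) @ w_mlp) + z_prime   # Eq. (10): MLP + residual

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))                  # toy 64x64 RGB leaf image
patches = [img[i:i + 8, j:j + 8].reshape(-1)   # Eq. (6): flatten 8x8 patches
           for i in range(0, 64, 8) for j in range(0, 64, 8)]
w_e, b_e = rng.normal(size=(192, 32)) * 0.1, np.zeros(32)
z = np.stack(patches) @ w_e + b_e              # Eq. (7): linear projection
for _ in range(2):                             # two encoder layers
    z = encoder_layer(z, *(rng.normal(size=(32, 32)) * 0.1 for _ in range(4)))
print(z.shape)                                 # (64, 32): one embedding per patch
```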

Fig. 8
figure 8

Vision Transformer architecture (Bazi et al. 2021)

8.2 Generative adversarial network (GAN)

Ian Goodfellow and colleagues (2014) introduced the generative adversarial network (GAN), which consists of a generator and a discriminator network, as shown in Fig. 9. The generator creates content while the discriminator checks the generated material: the generator makes images that appear natural, and the discriminator determines whether a given image looks natural. GAN training is regarded as a two-player minimax game. GANs use convolutional and feed-forward neural networks (Goodfellow et al. 2014).

Compared with explicit modelling, the implicit modelling approach of GANs can produce superb images while avoiding much of the complexity. Because GANs can model high-dimensional data distributions and deliver top-notch image generation, they are currently the leading approach among generative models. Zhu et al. (2020) produced high-quality, fine-grained RGB plant images by conditioning a CDCGAN architecture on the desired class labels. After data augmentation with these images, the classification performance of the recognizer improved dramatically: the F1 score increased by 0.23, supporting the hypothesis. The outcomes were comparable to those of a larger training set without adding any real data. A novel technique was also presented that uses images produced by deep convolutional generative adversarial networks (DCGAN) to augment the data for leaf disease identification (Wu et al. 2020). Using the real images as GoogLeNet’s input, this model attained an average identification accuracy of 94.33%. However, there is still much room for improving the precision and quality of the disease images produced by the techniques mentioned above for classification (Zhang et al. 2022). Additionally, several related efforts used updated or upgraded DL architectures to produce better outcomes and build software for disease identification systems.
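The two-player minimax training described above can be sketched as follows in PyTorch; the tiny fully connected generator and discriminator and the random stand-in "images" are purely illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy setup: 64-dimensional "images", only to show the alternating minimax updates.
G = nn.Sequential(nn.Linear(16, 64), nn.Tanh())     # generator: noise -> sample
D = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())   # discriminator: sample -> P(real)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, 64)   # stand-in for a batch of real images
for step in range(100):
    # Discriminator step: label real samples 1, generated samples 0.
    fake = G(torch.randn(32, 16)).detach()          # detach so only D is updated here
    loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make D label generated samples as real.
    fake = G(torch.randn(32, 16))
    loss_g = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```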

Fig. 9
figure 9

Block Diagram of Generative Adversarial Networks (GAN) (Aggarwal et al. 2021)

8.3 Vision-language models

Vision-language models (VLMs) first came to light in 2019, bridging the gap between computer vision and natural language processing. By combining image analysis with text comprehension, these models overcome the limitations of older object recognition systems. That year saw major strides in transformer architectures and dual-stream frameworks, which helped VLMs enable new applications like image captioning and visual question answering. This progress has sparked continued innovation in artificial intelligence, pushing the field forward (Li et al. 2019a, b).

In 2020, there was a burst of innovation in VLMs. Researchers made great strides in refining how these models are pre-trained to better understand the relationship between images and text. For example, VL-BERT worked on creating flexible visual-linguistic representations that could adapt to various tasks (Su et al. 2019). XGPT focused on improving image captioning by enhancing cross-modal generative pre-training. Pixel-BERT took a creative approach by aligning individual image pixels with the right text components (Huang et al. 2020). The Multimodal Framework (MMF) provided researchers with a robust set of tools tailored for vision and language studies (Xing et al. 2021). OSCAR emphasized the importance of connecting visual elements with their corresponding text descriptions (Li et al. 2020a, b). UNITER aimed to standardize the way images and text are formatted, reflecting a broader trend towards more adaptable VLM solutions (Chen et al. 2019a, b). These advancements collectively expanded what VLMs could do, paving the way for more advanced applications.

Recent advancements in vision-language models, such as CLIP (Contrastive Language-Image Pre-training) and ALIGN (A Large-scale ImaGe and Noisy-text embedding), have significantly enhanced our ability to connect visual and textual information (Zhang et al. 2021). These models are designed to interpret and generate images from textual descriptions, enabling functions like zero-shot classification, image captioning, and visual question answering (Radford et al. 2021). By learning from extensive datasets that pair images with text, these models exhibit impressive versatility and adaptability, often excelling in a wide range of tasks without the need for specialized fine-tuning.
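A sketch of how such a model performs zero-shot classification is given below: class names are written as text prompts, and the image is assigned to the prompt whose embedding is most similar. The random embeddings and the temperature constant are stand-ins; a real system would obtain the embeddings from the model's image and text encoders.

```python
import numpy as np

# Hypothetical embeddings: in a real system these would come from the model's
# image and text encoders (e.g., CLIP); here they are random stand-ins.
rng = np.random.default_rng(1)
image_emb = rng.normal(size=512)
prompts = ["a photo of a healthy leaf",
           "a photo of a leaf with bacterial blight",
           "a photo of a leaf with rust"]
text_embs = rng.normal(size=(3, 512))

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Zero-shot classification: cosine similarity between the image embedding and
# each class-prompt embedding, then softmax over classes.
sims = normalize(text_embs) @ normalize(image_emb)
logits = sims * 100                       # illustrative temperature constant
probs = np.exp(logits) / np.exp(logits).sum()
print(prompts[int(np.argmax(probs))])     # predicted class description
```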

8.4 Foundation models

Large foundation models, like those from OpenAI and DeepMind, have become a major trend in the field. Trained on vast datasets and capable of handling various tasks, these models exemplify the move toward creating general-purpose AI systems. They utilize transfer learning to apply insights from one area to another, resulting in highly adaptable models that can tackle multiple tasks with minimal extra training. Their influence on computer vision has been significant, setting new performance standards and expanding the possibilities of AI (Li et al. 2023a, b, c).

A model like DALL-E could generate visual simulations of disease progression on crops, helping researchers understand how different diseases spread and evolve, leading to better preventive measures. GPT-3 is capable of creating detailed descriptions for images. For example, if you show it a picture of a busy city street, it can craft a vivid narrative that captures the essence of the scene, including the people, buildings, and overall atmosphere.

8.5 Advanced and hybrid computer-vision deep learning architectures

Major downsides of many existing deep neural network architectures include a huge number of parameters, lengthy training, and expensive storage and processing costs. New or modified DL structures are utilized to detect leaf disease, and a flow chart of their implementation is given in Fig. 4. Table 5 below summarizes current studies on enhancing DL in plant disease diagnosis.

Table 5 Imaging-based disease identification/detection/recognition/classification using advanced and hybrid computer-vision deep learning architectures

9 Plant disease public datasets

An image dataset consists of digitized pictures that have been carefully selected for training, testing, and assessing the performance of computer vision algorithms. The datasets used to analyze leaves are built from primary data gathered in the field. Because the data are based on visible characteristics of the leaves, they have a high degree of trustworthiness. Additionally, the datasets are separated into easily understandable portions.

For instance, the research by Atila et al. (2021) divides the study into sections according to diseases such as sheath blight (SB), rice blast (RB), and bacterial leaf blight (BLB). In this study, another dataset, PlantVillage, was also used, comprising 54,306 photos of 14 distinct crops representing 26 plant diseases. The photographs in the collection depict leaves of different colours; examples from the PlantVillage dataset are shown in Fig. 10. The hues represent the areas of the leaves afflicted by the diseases being researched (Geetharamani and Pandian 2019).

Fig. 10
figure 10

Plant-Village data set: illustrations of diverse plant phenotypes (Hughes and Salathe 2015).

Additionally, the ImageNet data collection was employed in the research, and combining different approaches produced high-quality research results (Atila et al. 2021). The need to demonstrate image-based detection algorithms for coffee leaf diseases promotes the utilization of coffee leaf datasets in research (Esgario et al. 2020). The researchers employed large numerical datasets and information-rich colour datasets to present their gathered data (Proctor and He 2015).

In PlantVillage, descriptions of the leaves both before and during the disease’s impact are included. The dataset displays healthy leaves alongside those dented by attacks of septoria leaf blight, frog eye leaf spot, and downy mildew. The dataset is understandable and well organized; it is unambiguous and shows how many leaves in total were examined and divided into four groups. The PlantVillage dataset compiled by Sharada P. Mohanty and colleagues in 2016 comprises 87,000 RGB photos of healthy and diseased plant leaves divided into 38 groups. They selected only 25 classes to test their algorithm, and Table 6 displays these classes.

Table 6 Dataset specifications (Hughes and Salathe 2015)

Some findings from this investigation are summarized in the subsequent paragraphs. Finding leaf photos of particular plant diseases is challenging, so the accessible plant datasets are consequently quite modest in size. Only a few papers have submitted thousands of pictures for investigation (Barbedo et al. 2016; Sladojevic et al. 2016; Meunkaewjinda et al. 2008; Pires et al. 2016; Shrivastava et al. 2019; Schikora et al. 2012). The photographs in the databases are taken under extremely restricted environmental settings; to make the algorithms more useful, photos must be collected under real-world circumstances. The current situation calls for effective acquisition of leaf images, and the research community would appreciate databases in which these photos were recorded in real-time circumstances. Images captured using sophisticated mobile devices are becoming more common in recently published works (Chandel et al. 2024). Although several single-click picture solutions have also been presented, the researchers hope to significantly increase the automation of plant disease detection algorithms. The severe problem of database size may be resolved by the move of image-capture systems to smart devices.

A smartphone-assisted disease detection system was developed by Mohanty et al. (2016). The prime motivation for the study was a combination of rising global smartphone usage and recent breakthroughs in computer vision enabled by deep learning. A deep convolutional neural network was trained to identify 14 crop species and 26 diseases using a public dataset of 54,306 photos of healthy and damaged plant leaves taken under controlled settings. These images came from the PlantVillage dataset and were resized to 256 × 256 pixels. Different combinations of training and testing splits were tried, and two approaches were followed in developing the deep learning models: in the first, pre-trained transfer learning models were used for classification (a sketch follows below); in the second, a CNN-based classification model was developed from scratch. The effect of image type on classification accuracy was studied by feeding the network colour images, grayscale images, and background-removed images. The accuracy of the two models was expressed in terms of the F1 score: the average F1 scores of GoogLeNet and AlexNet were 98.86% and 98.48%, respectively, and the classification accuracy of the model reached up to 99.35%.
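As a hedged sketch of the first (transfer learning) approach, the snippet below fine-tunes only the classification head of an ImageNet-pretrained network; ResNet-50 is used for illustration rather than the study's GoogLeNet/AlexNet, and the batch is a random stand-in (torchvision ≥ 0.13 assumed for the weights API).

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 38  # PlantVillage class count used in the study
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the pretrained backbone and replace only the classification head,
# the usual first step of transfer learning.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 256, 256)           # stand-in for a 256x256 batch
labels = torch.randint(0, num_classes, (8,))   # stand-in labels
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```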

A CNN-based deep learning network was developed to classify tubers into five classes, i.e., healthy and four disease classes (Oppenheim and Shani 2017). The dataset contains potatoes of different shapes, sizes, and tones acquired under normal conditions and was labeled manually by a subject matter specialist. The full dataset was split into different train and test combinations to see the effect on model accuracy. The highest classification accuracy was 96%, obtained with a training set of 90% and a testing set of 10%.

GoogLeNet and Inception v3 models were deployed on TensorFlow to detect two types of pests and three diseases in cassava crops (Ramcharan et al. 2017). A dataset of 11,670 images was used during the training, validation, and testing stages. The confusion matrix was used as a performance metric, and the classification accuracy ranged from 80 to 93%. Waheed et al. (2020) employed an optimized DenseNet model for the detection and classification of three corn leaf diseases. The total number of trainable parameters was 77,612. The optimized DenseNet’s classification accuracy was compared with VGGNet, XceptionNet, EfficientNet, and NASNet. The DL models were trained on a 12,332-image dataset with an image resolution of 250 × 250 × 3. Data augmentation (cropping, padding, and horizontal flipping) increased the amount of relevant data, as sketched below. The accuracy of DenseNet was found to be 98.06%. In the future, the authors intend to develop a mobile application for corn leaf disease detection and classification.
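The augmentation pipeline mentioned above (cropping, padding, and horizontal flipping) could look like the following torchvision sketch; the parameter values are illustrative assumptions, not those of the cited study.

```python
from torchvision import transforms

# Padding, random cropping back to the 250x250 input size, and random
# horizontal flipping; applying this to each PIL image yields a slightly
# different sample every epoch, effectively enlarging the training set.
augment = transforms.Compose([
    transforms.Pad(8),                       # padding
    transforms.RandomCrop((250, 250)),       # cropping
    transforms.RandomHorizontalFlip(p=0.5),  # horizontal flipping
    transforms.ToTensor(),
])
```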

AlexNet, VGG16, and VGG19 CNN models were used for disease detection in rice fields (Sethy et al. 2020). The models were trained to detect and classify four types of paddy disease, i.e., blast, brown spot, tungro, and bacterial blight, with a support vector machine (SVM) performing the final classification. In this study, the CNN models acted as feature extractors on 5,932 disease images; in the second stage, the extracted features were fed to the SVM classifier. Among the combinations tested, deep features from ResNet50 with the SVM classifier performed best, with an F1 score of 98.38% (a sketch of this two-stage pipeline is given below). A CNN-based deep learning model was trained with 10,000 labeled images for cassava crop disease detection (Sambasivam and Opiyo 2021); the images were processed using the adaptive histogram equalization technique, and the accuracy of the model varied from 76.9 to 99.30%. CNN and ANN models were used for plant disease detection (Shin et al. 2020): feature extraction with a CNN followed by ANN classification generated better results than other supervised machine learning models under different leaf positions and angles in field conditions.
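A sketch of this two-stage deep-feature-plus-SVM pipeline is given below; the ResNet50 backbone matches the best-performing combination reported, while the image tensors, labels, and linear kernel are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

# Stage 1: a pretrained CNN as a fixed feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()   # drop the classification head, keep 2048-d features
backbone.eval()

# Stand-ins for real disease images and their labels.
images = torch.randn(40, 3, 224, 224)
labels = torch.randint(0, 4, (40,)).numpy()   # four rice disease classes

with torch.no_grad():
    features = backbone(images).numpy()       # one 2048-d feature vector per image

# Stage 2: an SVM classifier trained on the extracted deep features.
clf = SVC(kernel="linear").fit(features, labels)
print(clf.score(features, labels))            # training accuracy on the toy data
```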

It has been argued that deep learning-based image analysis techniques surpass traditional methods for visually assessing disease severity. However, these imaging systems are not without flaws. The quality of the training data significantly impacts a system’s performance; in plant disease automation, the training images and the extracted features largely determine how well a system works, and a well-trained system requires high-quality training data. Moreover, a definite set of conditions must be met for most existing systems to function correctly; if these requirements are not fulfilled, the method may produce false findings, resulting in incorrect disease detection. Some generalized techniques that function in diverse situations still need to be adapted, and in-depth knowledge of the techniques and appropriate tool use are also required to increase productivity. Table 7 compares and summarizes the most recent findings for diagnosing plant diseases using various datasets and techniques.

Table 7 Disease detection and identification using image dataset

10 Performance metrics for classification and object detection models

Many measures have been introduced in research, each addressing specific aspects of an algorithm’s performance. Consequently, for every machine learning problem, researchers need a suitable set of measures for performance assessment.

In this study, several common metrics were collected to capture crucial information about the effectiveness of algorithms in categorization tasks, and a side-by-side comparison was performed. These metrics include precision, recall (Powers 2011), F1-score (Sasaki 2007), accuracy, ROC-AUC score, IoU (Breton and Eng 2019), mAP, and the confusion matrix (Fawcett 2006; Brown and Davis 2006).

10.1 Confusion matrix

This matrix is the most useful and clearest criterion for defining a machine learning algorithm’s accuracy and correctness. It is mostly used for classification problems where the output can fall into two or more classes. It consists of true and false negatives (TN and FN) and true and false positives (TP and FP), as shown in Fig. 11. The precision and recall performance metrics are derived from the confusion matrix.

Fig. 11
figure 11

Confusion matrix with actual and predicted classes

10.2 Precision

Precision indicates how many of the selected data items are relevant; in other words, what fraction of the observations predicted positive by a machine learning system are truly positive. Formula (11) specifies that precision is calculated by dividing the number of true positives by the sum of false and true positives (Powers 2011):

$$\:Precision=\frac{TP}{TP+FP}$$
(11)

10.3 Recall

Recall displays the fraction of relevant data items that were chosen; in other words, how many of the truly positive observations the algorithm predicted as positive. Formula (12) states that recall is calculated by dividing the count of true positives by the sum of false negatives and true positives (Powers 2011).

$$\:Recall=\frac{TP}{TP+FN}$$
(12)

10.4 F1-score

Often known as F-measure or F-score, this metric measures algorithm performance by considering both recall and precision. It is the harmonic mean of precision and recall, written mathematically as follows (Sasaki 2007):

$$\:F1-score=2\times\:\frac{Precision\times\:Recall}{Precision+Recall}$$
(13)

10.5 Accuracy

Accuracy is probably the most common and the first way to evaluate an algorithm’s classification performance. It measures the percentage of successfully predicted data points among all observations (Formula (14)). Despite being widely applicable across fields, accuracy may not be the best performance measure when the dataset’s target classes are imbalanced (Breton and Eng 2019).

$$\:Accuracy=\frac{TN+TP}{FP+TP+FN+TN}$$
(14)
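The four classification metrics above can be computed directly from the confusion matrix counts, as in this short sketch with illustrative counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Eqs. (11)-(14): metrics derived from the confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Toy example: a disease detector evaluated on 100 leaf images.
print(classification_metrics(tp=40, fp=5, fn=10, tn=45))
# -> precision 0.889, recall 0.8, F1 0.842, accuracy 0.85
```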

10.6 ROC-AUC score

The ROC (receiver operating characteristic) curve plots the true positive rate against the false positive rate (1 − specificity) and is used to compute this statistic (Breton and Eng 2019). The area under the ROC curve (ROC-AUC) is a binary classification statistic that demonstrates how effectively a model can discriminate between negative and positive target classes. The ROC-AUC score can be a helpful indicator of performance when the negative and positive classes are of equal relevance for a specific problem.

10.7 IoU

The deep learning community uses Intersection over Union (IoU), also known as area ratio overlap, to measure the effectiveness of object detection models. By considering the size of the bounding boxes, this measure goes one step further than the Centroid of Rectangles (CoR); IoU can be thought of as a generalization of the CoR. In practice, the (x, y) coordinates of a detector’s predicted bounding box will almost certainly not exactly match those of the corresponding ground truth bounding box. This assessment measure therefore rewards overlap between the ground truth and predicted boxes. IoU ranges from 0 (no overlap) to 1 (perfect overlap) (Breton and Eng 2019).

$$\:IOU\:\left(GT,\:PB\right)=\:\frac{TP}{TP+FP+FN}$$
(15)

PB is the Predicted bounding box, GT is the Ground truth bounding box, TP & FP are True Positive and False Positive, and FN is False Negative.

10.8 mAP

The mean Average Precision (mAP) is an indicator of object detection accuracy across all classes in a given database (Padilla et al. 2020).

$$\:mAP=\frac{1}{N}\sum\:_{i=1}^{N}A{P}_{i}$$
(16)

Where N is the total number of classes being assessed and APi is the average precision for the i-th class.
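A minimal sketch of the mAP computation follows: per-class average precision is approximated as the area under a precision-recall curve, and Eq. (16) averages these values over classes. The precision-recall points below are hypothetical.

```python
def average_precision(recalls, precisions):
    """Approximate AP for one class as the area under its precision-recall
    curve, summing precision-weighted recall increments."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# Hypothetical per-class precision-recall points for three disease classes.
per_class = [
    ([0.2, 0.5, 1.0], [1.0, 0.9, 0.7]),
    ([0.3, 0.6, 1.0], [0.8, 0.8, 0.6]),
    ([0.5, 1.0],      [0.9, 0.5]),
]
aps = [average_precision(r, p) for r, p in per_class]
mAP = sum(aps) / len(aps)   # Eq. (16): mean of per-class APs
print(round(mAP, 3))
```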

11 Challenges and the way forward

The decline in agricultural production and productivity adversely affects human beings and animals, and addressing this issue will require the application of modern technology. This investigation demonstrates that a wide range of parameters affect image segmentation-based technology. The method can identify diseases that cause visible dents and alterations on plants, in contrast to diseases whose damage cannot be seen in photographs (Loey et al. 2020). This investigation also reveals the lack of a sufficient database that could provide context for comparing captured photos (Barbedo 2018b). The segmentation of large-scale images in complex, real-world natural scenes will remain a challenging focus of research, because various environmental factors, such as wind velocity, illumination, temperature, humidity, and background, influence the acquisition of disease images. Another difficulty is that different diseases may exhibit symptoms and features that are quite similar (Barbedo 2018a).

Another problem is the absence of appropriate tools for image-based detection. Most field specialists lack the necessary tools to interpret the photos they collect, making it challenging to gather precise information and recognize diseases (Ashqar and Abu-Naser 2019). Stringent data validity measures have led to low adoption rates of agricultural technologies in certain areas. For instance, during the fourth and sixth international conferences on soft computing and machine learning, several rules hindered the use of machine learning in specific areas (Durmus et al. 2017): because some of the outcomes of the ML functions do not comply with the necessary criteria, the rules forbid their usage in actual applications.

The above-mentioned difficulties demonstrate the wide range of applications for image-based detection but also limit its practicality. The first option is to provide sufficient information that can be used to reliably identify diseases without confusing closely related ones. Numerous diseases that have not been officially reported have been brought on by weather changes, global warming, and other effects; the answer is to expand the number of scientists involved and advocate for improved methods of information gathering (Sladojevic et al. 2016). Improved methods for recording information about diseases are another potential option: if data captioning were enhanced to include the fine features of the captured photos and the distinctions that define them, the problem of insufficient information about a disease could be resolved (Barbedo 2018a). The images should be carefully examined to determine whether a plant is affected or damaged.

The image-based detection method simplifies extracting and detecting diseases thanks to its high accuracy, minimal hassle, and reduced data duplication. For particular plants, such as tomatoes, a high accuracy rate is required to use photos to identify the diseases affecting them and the extent of the damage (Fuentes et al. 2017). Utilizing contemporary information storage techniques might provide a solution; for instance, cloud computing might improve accessibility and storage accuracy. Another solution is to train the personnel responsible for research and information analysis, since the precision of the technique is increased by a well-trained DL algorithm (Rangarajan et al. 2018). A further option would be to better understand the phenotypes employed in disease detection (Ubbens and Stavness 2018); typically, the phenotype used to identify diseases is a product of the weather and climate (Stewart et al. 2019; Rangarajan et al. 2018). Keeping the systems updated would also help guarantee that the data gathered is recent. The considerable uncertainty surrounding disease detection influences the utilization of the technology; for instance, various uncertainties are connected with Bayesian DL (Hernandez and López 2020), which implies that this strategy is unreliable when used alone. CNN techniques may also be useful in dealing with inaccurate and sluggish disease detection procedures (Singh et al. 2018); these techniques have various advantages and have been used to identify diseases affecting rice (Li et al. 2019a, b).

Procedural inefficiencies might be reduced by combining approaches. For instance, using deep learning models with meta-architectures offers remedies for the problems encountered when utilizing individual techniques for disease identification (Saleem et al. 2020a, b). Deep convolutional generative adversarial networks provide an alternative method for generating and analyzing images (Li et al. 2019a, b); the participation of adversarial networks improves the detection method’s accuracy. ViTs can also be applied to disease identification, as they have the potential to perform better on large datasets. Since image-based detection requires considerable resources, the authorities should ensure that these are available.

12 Discussion

Deep learning techniques have significantly improved plant disease identification by extracting intricate features from images and learning hierarchical representations. This review explores computer vision and deep learning and their collaborative application in plant disease identification, highlighting the need for efficient methods to monitor and diagnose plant health. It discusses the role of various imaging technologies, including RGB and hyperspectral imaging, in capturing detailed visual information about plants, and evaluates imaging cameras and sensors for plant disease detection, highlighting their advantages and limitations. It also covers the historical perspective of traditional image-processing techniques and the transformative impact of deep learning, and explores advanced computer vision models, such as RNNs and CNNs, and their impact on the accuracy and robustness of plant disease identification.

Academic research in object detection has yielded models that significantly benefit agricultural applications, and knowing their licensing is crucial for practical implementation. This paper highlights some notable models, their uses in agriculture, and their licenses.

AlexNet and VGG-16 are widely used for plant disease identification. AlexNet showcased CNNs’ potential, while VGG-16 improved accuracy with deeper architectures. Both models have permissive licenses (Apache 2.0 for AlexNet and BSD for VGG-16), making them free to use in production. ViTs, developed by Google, divide images into patches and analyze relationships using self-attention mechanisms, capturing long-range dependencies; they are available under the Apache 2.0 license, allowing free use in production environments. Models like StyleGAN and BigGAN generate synthetic images of diseased plants, enhancing training datasets. StyleGAN is available under the NVIDIA Source Code License for non-commercial use, while BigGAN is under the Apache 2.0 license, allowing commercial use. SimCLR and BYOL use large amounts of unlabeled data to learn useful representations, which can be fine-tuned for plant disease detection; both are available under the Apache 2.0 license, suitable for commercial production use. CLIP and ALIGN integrate visual and textual information for zero-shot classification and image captioning and can classify new diseases based on textual descriptions. CLIP is under the MIT license, while ALIGN’s licensing can vary and should be checked before commercial use. EfficientNet and MobileNet balance accuracy and computational efficiency, making them ideal for mobile and edge computing in agriculture; both are available under the Apache 2.0 license, allowing free use in commercial environments.

Understanding the licensing of these models helps practitioners make informed decisions about their implementation. Models like AlexNet, VGG-16, ViT, BigGAN, SimCLR, BYOL, CLIP, EfficientNet, and MobileNet are accessible for commercial use due to their permissive licenses. Utilizing these state-of-the-art models enhances plant disease detection systems, contributing to sustainable and productive farming practices.

However, challenges persist, such as the requirement for diverse and large datasets for model training, which limits the generalizability of models. Future research should focus on creating standardized datasets and fostering collaboration among researchers to address this issue. The interpretability of deep learning models is also crucial, as their inherent complexity poses challenges in understanding their decision-making processes. Addressing this interpretability gap is essential for gaining the trust of end-users, especially in agricultural settings where decisions based on disease identification models directly affect crop yield and food security. Researchers and practitioners must explore model compression techniques, lightweight architectures, and edge computing solutions to make these technologies more accessible and feasible for real-world deployment.

12.1 Outcomes

The thorough analysis of “Deep Learning–Computer Vision for Plant Disease Detection” summarizes the current knowledge of DL and CV, which offers insightful information to agricultural and plant science academicians, practitioners, and policymakers. The review’s outcomes include:

12.1.1 Knowledge synthesis

The study provides a collective resource for researchers initiating or advancing in the field by integrating information on employing deep learning-driven computer vision in plant disease detection.

12.1.2 Guidance for practitioners

A comprehensive evaluation of imaging methodologies, sensor configurations, camera selection, and model construction provides practitioners with the knowledge needed to facilitate well-informed assessments, allowing them to select the most appropriate strategies for their particular use cases.

12.1.3 Dataset evaluation

Researchers acquire a more profound comprehension of the accessible plant disease datasets, enabling them to make knowledgeable judgments regarding dataset selection and emphasizing the significance of filling in the current data diversity and quality gaps.

12.1.4 Performance assessment

In plant disease identification, discussing performance metrics facilitates a standardized method for evaluating the accuracy, precision, and recall of classification and object recognition models. This study helps researchers to examine their models more successfully.

12.1.5 Identification of challenges

Future study initiatives are guided by the identification of constraints, such as a lack of annotated datasets, complex architectures of DL models, and problems with interpretability. In order to solve these issues, the paper promotes interdisciplinary cooperation and the investigation of novel solutions.

12.1.6 Roadmap for future research

The review suggests a research roadmap, highlighting the need for improvements in explainable AI, greater interaction among the plant science, deep learning, and computer vision communities, and an emphasis on practical deployment factors.

13 Conclusion

In contrast to standard image processing techniques, which handle these tasks in separate phases and linkages, deep learning-based plant disease and pest detection approaches combine edge detection and feature extraction and have broad prospects and high potential. Even though the technology for detecting pests and diseases in plants is advancing quickly and has been shaping agricultural research and its applications, there is still some way to go before it is fully developed for use in the natural environment, and some issues still need to be fixed. This review made it possible to map the many deep-learning research studies on disease diagnosis with multiple data modalities. The following conclusions can be drawn from the study:

  • Spectral imaging can be a crucial tool for determining the health of a crop, because the spectral response is related to disease severity, the degree of spectral sensitivity to stress, and variations at different crop growth stages. Hyperspectral and multispectral images are highly helpful for disease identification and offer greater accuracy.

  • Technical features (brightness, resolution, etc.), sample acquisition settings (field or laboratory), and sample features (size, texture, humidity, etc.) can all affect spectral reflectance. More research is needed on reflectance based on crop vegetation indicators across all crop development and infection stages.

  • Intelligent image segmentation and enormous data processing will be of utmost importance for identifying and treating agricultural diseases due to the quick growth of big data, IoT, and artificial intelligence technologies.

  • In the agricultural industry, neural networks and deep learning models have shown extensive potential to monitor crop health and development and to capture abnormalities, exceeding conventional machine learning methods. Combining several crucial components can therefore result in an effective disease detection system.

  • The choice of a learning framework and algorithm is required for multimodal deep learning applications. Multimodal fusion has recently demonstrated significant promise and is being employed more often in various fields, including object identification, sentiment analysis, human-robot interaction, and healthcare.

Agricultural practices can be transformed through enhanced precision, speed, and scalability by integrating DL models into plant disease identification. Eliminating plant diseases would then become easier, since accurate real-time diagnosis translates into increased agricultural productivity and food security. Nonetheless, more research into diverse, high-quality datasets, improvements in model efficiency, and practical deployment strategies across different agricultural contexts is needed if these potential benefits are to be realized in full. To overcome current challenges and unlock the full potential of DL models, researchers, practitioners, and technology developers must work together toward an agricultural future in which advanced technology works hand in hand with the agricultural sector to ensure that global food security issues are adequately addressed.