1 Introduction

Agriculture, as a major source of food production, plays a crucial role in meeting the nutritional needs of the growing human population. In the face of limited agricultural land and an increasing population, enhancing the efficiency of agricultural production becomes imperative to meet the rising food demand. An essential requirement for effective agricultural management is up-to-date information on crop types and their spatial distribution. Knowledge about the specific crop types serves as a fundamental input for analysis in crop management, including crop growth monitoring (Mascolo et al. 2015), estimation of crop area (Ali et al. 2022; Hudait and Patel 2022), and assessment of water requirements (Foster et al. 2019).

The advent of satellite sensors with high spatial resolution has significantly improved the ability to rapidly create accurate crop maps. Consequently, extensive research has been dedicated to automating crop classification using various data sources, such as optical (Niazmardi et al. 2018; Vuolo et al. 2018; Hao et al. 2020; Sakamoto 2021; Xia et al. 2022; Teimouri and Mokhtarzade 2023) and radar images (Bargiel 2017; Hariharan et al. 2018). Identifying and differentiating crops in such images has proven challenging due to factors like diverse environmental conditions, spectral heterogeneity within a class combined with spectral similarity among different classes, and small-scale management practices, such as varying planting and harvesting dates, which lead to complex and spatially diverse signatures across multiple seasons.

On the algorithmic side, deep learning (DL) approaches and in particular convolutional neural networks (CNNs) are currently considered the best methods in image classification (e.g., Heipke and Rottensteiner 2020). Also, methods based on attention mechanisms (Vaswani et al. 2017; Dosovitskiy et al. 2021; Voelsen et al. 2023) have recently made a major impact in the field. However, DL methods require a vast amount of training data in the learning phase to yield good results, and these training data are not always available.

In this paper, we address this problem and suggest a method which automatically generates labels for unlabeled samples, so-called virtual training labels (VTL), from a given set of real training labels (RTL). We show that adding the VTL to the RTL improves crop classification using Sentinel-1 and Sentinel-2 (S1 and S2) time series, i.e. fusing optical and radar imagery. The architecture proposed by Teimouri et al. (2022) is applied in this study to assess the impact of the VTL on the training of 3D-CNNs.

The remainder of this article is structured as follows. Section 2 gives an overview of the state-of-the-art in crop classification, and Sect. 3 discusses the approach of VTL generation and the structure of the 3D-CNN for crop classification using a fusion of optical and radar time series. Section 4 presents the study area, input data, experiments, and the analysis of the results. Finally, Sect. 5 provides the conclusions of the study.

2 State-of-the-Art in Deep Learning for Crop Classification

2.1 CNNs of Various Dimensions

CNNs are capable of learning complex functions, making them a powerful tool for developing accurate classifiers. Depending on the dimension of the convolution operator (one-, two- or three-dimensional (1D, 2D, 3D)), CNNs can extract various types of features, including spatial, spectral, temporal, spatial-spectral, and spatial–temporal features. For example, 1D convolutions have been used to extract spectral features in hyperspectral images (Li et al. 2016) and temporal features in image sequences (Pelletier et al. 2019), while standard 2D convolutions are commonly used for extracting spatial features in single images. 3D convolution operators are typically applied to extract spatial–temporal or spectral-spatial features, as demonstrated in studies by Li et al. (2017), Ji et al. (2018), Han et al. (2020), Sellami et al. (2020), and Fernandez-Beltran et al. (2021). Additionally, some studies have used a combination of 1D and 2D convolutions (Kussul et al. 2017; Zhang et al. 2017) or of 2D and 3D convolutions (Ge et al. 2020; Voelsen et al. 2022).
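To illustrate the difference between these operator dimensions, the following sketch (a hypothetical PyTorch example with synthetic data, not code from any of the cited studies) applies 1D, 2D, and 3D convolutions to a toy Sentinel-2-like patch time series; all tensor shapes and layer sizes are chosen for illustration only.

```python
import torch
import torch.nn as nn

# Synthetic input: batch of 8 samples, 4 spectral bands, 6 epochs, 7 x 7 pixels
x = torch.randn(8, 4, 6, 7, 7)  # (batch, bands, time, height, width)

# 1D convolution along the temporal axis of one pixel's band values (temporal features)
conv1d = nn.Conv1d(in_channels=4, out_channels=16, kernel_size=3, padding=1)
pixel_series = x[:, :, :, 3, 3]            # (batch, bands, time)
temporal_features = conv1d(pixel_series)   # (batch, 16, time)

# 2D convolution over the spatial dimensions of a single epoch (spatial features)
conv2d = nn.Conv2d(in_channels=4, out_channels=16, kernel_size=3, padding=1)
single_epoch = x[:, :, 0, :, :]            # (batch, bands, height, width)
spatial_features = conv2d(single_epoch)    # (batch, 16, 7, 7)

# 3D convolution over time and space simultaneously (spatial-temporal features)
conv3d = nn.Conv3d(in_channels=4, out_channels=16, kernel_size=3, padding=1)
st_features = conv3d(x)                    # (batch, 16, 6, 7, 7)

print(temporal_features.shape, spatial_features.shape, st_features.shape)
```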

2.2 Crop Classification Using Neural Networks

Research on crop classification using networks of different dimensions is briefly reported in this section. Most approaches rely on time series and, besides CNN operations, employ architectures developed for temporal data such as recurrent neural networks (RNN), long short-term memory networks (LSTM), and transformers based on attention mechanisms.

1D-CNN: Rußwurm and Körner (2020) investigated the effectiveness of 1D-CNN, LSTM, and self-attention neural networks for crop classification from S2 time series. Their research demonstrated that both the transformer and the LSTM models outperformed the 1D-CNN. The study by Zhao et al. (2021) aimed at evaluating the performance of five different neural network models, namely 1D-CNN, LSTM, gated recurrent unit (GRU), LSTM-CNN, and GRU-CNN, in classifying crops using S2 time series images. The results showed that GRU-CNN and LSTM-CNN, as well as the 1D-CNN, performed significantly better than the other investigated models.

2D-CNN: Moreno-Revelo et al. (2021) proposed a 2D-CNN for classifying ten agricultural crops in a tropical region from S1 and Landsat 8 images. A major limitation of this method is that the proposed architectures are shallow, which limits their ability to extract more complex features. Mazzia et al. (2019) classified fifteen different crops using a combination of RNNs and CNNs applied to S2 time series images. The proposed network architecture involved feeding the time series images into an LSTM module, followed by concatenating the extracted features and passing them through a 2D-CNN. The reported results were better than those for support vector machines and random forests. Seydi et al. (2022) applied a dual-stream network to classify agricultural and non-agricultural crops. The network consisted of convolutional blocks and attention models.

Garnot et al. (2020) proposed a method for crop classification from the S2 time series. The method involved extracting temporal features using an architecture that relied on self-attention mechanisms, while spatial features were obtained using a pixel-set encoder. Garnot and Landrieu (2021) introduced the Unet-Temporal Attention Encoder (U-TAE) model, which combines multi-scale spatial convolutions and temporal attention, enabling the extraction of spatial–temporal features at various resolutions. Ofori-Ampofo et al. (2021) integrated S1 and S2 time series for crop classification using an attention-based encoder. Garnot et al. (2022) explored various strategies for the fusion of optical and radar time series images for crop classification, covering parcel-based classification as well as semantic and panoptic segmentation; they proposed a mid-fusion scheme that utilizes separate spatial encoders and a shared temporal encoder. Finally, Tarasiou et al. (2023) introduced the spatial–temporal vision transformer, a model based on visual transformers (Dosovitskiy et al. 2021). Additionally, they proposed tokenization schemes to adapt the approach for modeling satellite image time series.

3D-CNN: In the study conducted by Ji et al. (2018), agricultural crops were classified using optical time series images and a 3D-CNN. The network was designed by separately considering time series patches of spectral bands, and then combining the obtained features. Similarly, Teimouri et al. (2022) proposed a 3D-CNN architecture for crop classification using a fusion of S1 and S2 time series images. This architecture was able to extract temporal-spatial-radar-spectral features, and the results showed the high potential of this network for crop classification; it forms the basis of the work reported in this paper.

2.3 Data Augmentation for CNN Training

Studies have shown that the performance of CNNs improves with an increase in the amount of training data (Chen et al. 2016; Li et al. 2016), while traditional methods do not show significant improvements with the same increase in data (Sarker 2021). Thus, having a large number of training samples is essential for improving the accuracy of deep networks. However, labeling high-quality samples manually is expensive and time-consuming. To solve this problem, the related research can be divided into two main categories (Hao et al. 2023): data-driven methods and network-based methods.

Data-driven methods involve the generation of new samples by employing various techniques applied to real training data. These techniques include: (1) Geometric transformations such as rotation, scaling, flipping, and cropping (e.g., Zhang et al. 2017; Acción et al. 2020); (2) Noise disturbance (Ding et al. 2016); (3) Sharpness transformation (Ledig et al. 2017), albeit with limited success; (4) Generative adversarial networks (GANs), originally introduced by Goodfellow et al. (2014) and extended in various ways since; however, these networks still require a large amount of training samples and have a high computational cost; (5) Virtual labels (Chen et al. 2016; Li et al. 2016), which are assigned before network training, thus avoiding the high computing time required by GAN methods.

On the other hand, network-based data augmentation methods focus on modifying the architecture or learning process of CNNs. These methods include: (1) Transfer learning (Wurm et al. 2019; Cui et al. 2020), taking advantage of models pre-trained on large datasets and fine-tuning them for specific tasks; (2) Regularization (Yun et al. 2019), using techniques like dropout, weight decay, and batch normalization; (3) Meta-learning (Li et al. 2021), training the model on multiple tasks to enhance its ability to adapt to new tasks.

Despite the potential benefits of using virtual labels, limited research has focused on their application in remote sensing. Chen et al. (2016) proposed two novel approaches for generating VTL to improve the classification of hyperspectral images using a 3D-CNN. The first method involved multiplying a training sample with a real label by a random factor and adding random noise to create a VTL. The second method considered a combination of two RTL of the same class and added random Gaussian noise. Li et al. (2016) proposed a pixel-pair-based method for generating VTL from hyperspectral images. They utilized VTL along with real training labels for the classification of images. These techniques have the potential to increase the accuracy of classification models without requiring the costly and time-consuming process of manual labeling.

2.4 Summary of State-of-the-Art

In summary, promising results have been obtained in crop classification using image time series and CNNs for the spatial domain as well as CNNs or transformers for the temporal domain. In addition, fusing various data sources, such as optical and radar time series images, led to improved classification results. Notably, decision-level fusion yielded significantly better performance than feature-level fusion. However, the lack of appropriate training data limits the success of these methods. To at least partly overcome this problem, the generation of VTL (i.e., a data-driven method) is suggested in this paper. We fuse optical and radar images and use a 3D-CNN architecture without transformers in our work.

3 Methodology

3.1 Overview

This research proposes a method to overcome the challenge of collecting sufficient training samples through field surveys or other manual processes, which can be both time-consuming and expensive. The aim is to generate VTL that can be used in the training of deep neural networks, enabling the network to classify crops more accurately and ultimately improving its overall performance.

VTL are generated by first sub-dividing the RTL of each class separately into different sub-classes in an unsupervised manner using self-organizing maps (SOM; Kohonen 1995). Unlabeled pixels are then associated with the sub-class they are most similar to using a set of similarity criteria, yielding the VTL.

Subsequently, the networks are trained using VTL and RTL together. Note that at this stage only the original classes, and not the sub-classes, are considered, as otherwise the number of training samples would be too low. 3D convolution operators are used to extract feature vectors, which are then fed as input to the actual classifier. The employed architecture is the one proposed in Teimouri et al. (2022).

The study compares the results obtained by training the network using a combination of RTL and VTL with those achieved using RTL only and using VTL only. The four evaluation metrics are the overall accuracy (OA), the kappa coefficient (KC), the F1-score per class, and the user accuracy (UA).
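For reference, these metrics can be computed with standard tooling; the following sketch uses scikit-learn on a toy example (the arrays and the computation of UA as per-class precision are our assumptions, not part of the original study).

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix, f1_score

# Illustrative reference and predicted labels for a 3-class toy example
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1, 0, 2])

oa = accuracy_score(y_true, y_pred)                    # overall accuracy (OA)
kc = cohen_kappa_score(y_true, y_pred)                 # kappa coefficient (KC)
f1_per_class = f1_score(y_true, y_pred, average=None)  # F1-score per class

# User accuracy (UA) per class: correct predictions / all predictions of that class
cm = confusion_matrix(y_true, y_pred)                  # rows: reference, columns: prediction
ua_per_class = np.diag(cm) / cm.sum(axis=0)

print(oa, kc, f1_per_class, ua_per_class)
```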

3.2 VTL Generation Using SOM

An agricultural crop is affected by many factors, such as sunlight, soil properties, irrigation, and other environmental factors. These effects can lead to differences in growth patterns. Xu et al. (2018) demonstrated how the reflection of a crop in different areas of a study region, as captured in an image, may vary. It can thus be beneficial to divide the training data of the individual classes into several sub-classes based on the highest degree of similarity in the growth cycle.

The method suggested in this paper is based on this observation. The training samples for each class are sub-divided into different sub-classes using an unsupervised classification of the time series images. This clustering step is guided by two constraints: the cluster centers should be as far away from each other as possible, and the clusters should be as compact as possible.

Once these clusters are found, unlabeled pixels are tested as to whether they belong to one of those sub-classes based on a set of similarity criteria, and are labeled according to the sub-class (and thus the class) with the minimum distance. These newly labeled pixels are the desired VTL; see Fig. 1 for an overview.

Fig. 1
figure 1

The flowchart of VTL generation, for details see text

The procedure consists of three main steps: (1) Sub-division of the training samples of each class, (2) Similarity computation, (3) Majority voting and labeling.

Sub-Division of the Training Samples of Each Class: For this clustering problem we use traditional SOM, a well-established unsupervised neural network approach. To reduce the computational load, we only use the first principal component of each optical image of each epoch, as well as the VV and VH polarizations of the radar time series images, as input for clustering, as shown in Fig. 1. The SOM output layer consists of m neurons, meaning that we sub-divide a given class into m different sub-classes (or clusters). Different values for m (2, 3, and 4) were explored, and the best number of clusters was selected separately for each class.
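A minimal sketch of this sub-division step is given below; it assumes the third-party MiniSom package and a feature matrix in which each row stacks the per-epoch principal components and the VV/VH values of one training pixel. All names, parameter settings, and the 1 × m map layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from minisom import MiniSom  # third-party SOM implementation (assumed available)

def subdivide_class(features, m, n_iter=1000, seed=0):
    """Cluster the training samples of one crop class into m sub-classes with a 1 x m SOM."""
    som = MiniSom(1, m, features.shape[1], sigma=0.5, learning_rate=0.5, random_seed=seed)
    som.random_weights_init(features)
    som.train_random(features, n_iter)
    # The index of the winning neuron serves as the sub-class label of each sample
    labels = np.array([som.winner(f)[1] for f in features])
    # Sub-class centers: mean of the assigned samples (fall back to the neuron weight if empty)
    centers = np.array([features[labels == i].mean(axis=0) if np.any(labels == i)
                        else som.get_weights()[0, i] for i in range(m)])
    return labels, centers

# Example: 200 training pixels of one class, 20 features each (6 PCs + 7 VV + 7 VH)
rng = np.random.default_rng(0)
class_features = rng.random((200, 20))
sub_labels, sub_centers = subdivide_class(class_features, m=3)
```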

To determine the best number m of sub-classes for each class, a scoring criterion is defined which considers two constraints. The first constraint, referred to as BC, is based on the distances between the sub-class centers and the class center; it is calculated using Eq. (1).

$$BC=\sum_{i=1}^{m}\left\Vert {\mu }^{i}-\mu \right\Vert^{2}$$
(1)

In this context, \(\mu\) denotes the center of all samples belonging to a class, computed by averaging all samples within that class, and \({\mu }^{i}\) denotes the center of sub-class i.

The second constraint, termed WC, is related to the cluster compactness of each sub-class, see Eq. (2):

$$WC=\sum_{i=1}^{m}\sum_{j=1}^{{L}_{i}}\left\Vert {x}_{j}^{\left(i\right)}-{\mu }^{i}\right\Vert^{2}$$
(2)

\({\mu }^{i}\) denotes the center of the ith sub-class, \({L}_{i}\) represents the number of samples belonging to sub-class i, and \({x}_{j}^{\left(i\right)}\) refers to the jth sample in sub-class i.

Using the scoring criterion given in Eq. (3), based on BC and WC defined above, the best number of sub-classes m is selected for each crop. The number of sub-classes is considered best if the distance between the cluster centers, and thus BC, is largest, while the clusters remain as compact as possible, resulting in a small value for WC. Thus, we minimize the score for each training class as a function of m:

$$Score\left(m\right)=\frac{WC\left(m\right)}{BC\left(m\right)}$$
(3)
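As a concrete illustration, the following numpy sketch evaluates Eqs. (1)–(3) for the candidate numbers of sub-classes, reusing class_features and subdivide_class from the SOM sketch above; it is an illustrative re-implementation, not the original code.

```python
import numpy as np

def score(features, sub_labels, m):
    """Eq. (3): WC(m) / BC(m) for one class, given the sub-class labels from the SOM."""
    mu = features.mean(axis=0)                    # class center
    bc, wc = 0.0, 0.0
    for i in range(m):
        cluster = features[sub_labels == i]
        if len(cluster) == 0:                     # skip empty sub-classes
            continue
        mu_i = cluster.mean(axis=0)               # sub-class center
        bc += np.sum((mu_i - mu) ** 2)            # contribution to BC, Eq. (1)
        wc += np.sum((cluster - mu_i) ** 2)       # contribution to WC, Eq. (2)
    return wc / bc

# Choose the best m per class by minimizing the score over the candidate values 2, 3, 4
candidates = [2, 3, 4]
scores = {m: score(class_features, subdivide_class(class_features, m)[0], m)
          for m in candidates}
best_m = min(scores, key=scores.get)
```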

Similarity Computation: Next, n unlabeled pixels are randomly chosen in the study area, and virtual labels are assigned to these pixels. To do so, the distance to the center of all generated sub-classes is computed according to the following four similarity criteria (Thenkabail et al. 2007): spectral angle mapper, spectral correlation similarity, Euclidean distance, and spectral similarity value. A threshold is then determined for each similarity criterion and each sub-class. These thresholds (Eq. (4)) are calculated as the average value of the respective similarity criterion over the RTL of that sub-class.

$${T}^{(i)}= \frac{1}{{L}_{i}}\sum_{j=1}^{{L}_{i}}{V}_{j}^{(i)}$$
(4)

\({V}_{j}^{(i)}\) denotes the value of the similarity criterion between sample j and the center of sub-class i, \({L}_{i}\) represents the number of samples belonging to that sub-class, and \({T}^{(i)}\) denotes the threshold value of the similarity criterion for that sub-class.

Only pixels whose values lie below the established thresholds for all four similarity criteria are retained for further processing; for each criterion, the sub-class (and thus class) with the smallest value is kept.
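A hedged sketch of this selection step is given below. The exact formulas of the four criteria follow Thenkabail et al. (2007) in the paper; the sketch uses common textbook formulations and expresses the correlation criterion as the dissimilarity 1 − r, so that "smaller is more similar" holds uniformly. These choices are our assumptions, not the authors' implementation.

```python
import numpy as np

def sam(x, c):
    """Spectral angle mapper: angle between pixel vector x and sub-class center c."""
    cos = np.dot(x, c) / (np.linalg.norm(x) * np.linalg.norm(c) + 1e-12)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def euclid(x, c):
    """Euclidean distance between pixel vector and sub-class center."""
    return np.linalg.norm(x - c)

def corr_dissim(x, c):
    """1 - Pearson correlation, so that smaller values mean higher similarity."""
    return 1.0 - np.corrcoef(x, c)[0, 1]

def ssv(x, c):
    """Spectral similarity value: one common formulation combining distance and correlation."""
    r = np.corrcoef(x, c)[0, 1]
    return np.sqrt(euclid(x, c) ** 2 + (1.0 - r ** 2) ** 2)

criteria = [sam, euclid, corr_dissim, ssv]

def thresholds_for_subclass(samples, center):
    """Eq. (4): average criterion value over the real training samples of one sub-class."""
    return [np.mean([crit(s, center) for s in samples]) for crit in criteria]

def passes_all(x, center, thresholds):
    """An unlabeled pixel is kept only if it lies below the threshold for all four criteria."""
    return all(crit(x, center) < t for crit, t in zip(criteria, thresholds))
```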

Majority Voting and Labeling: These computations are carried out separately for the principal components of the optical images, and the VV component and the VH component of the radar images, resulting in one, two or three sets (for the optical, the VV, and the VH bands, respectively) of classes with four entries each (one for each similarity criterion), i.e., up to 12 possible labels for each pixel. The final label is chosen according to the majority voting method. In case of ambiguity, i.e., two or more classes have the same number of votes, and no class has a higher number, the pixel is rejected.
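The voting rule itself is straightforward; a possible Python sketch (illustrative only) is:

```python
from collections import Counter

def majority_vote(candidate_labels):
    """Assign the most frequent class label; reject the pixel (return None) on a tie.

    candidate_labels: up to 12 class labels (one per similarity criterion and data source).
    """
    if not candidate_labels:
        return None
    counts = Counter(candidate_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # ambiguous: two or more classes share the highest vote count
    return counts[0][0]

# Example: the optical channels vote three times for "wheat", the radar votes are split
print(majority_vote(["wheat", "wheat", "wheat", "corn", "oat", "corn"]))  # -> "wheat"
print(majority_vote(["wheat", "corn"]))                                   # -> None (rejected)
```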

3.3 3D-CNN Architecture for Combined Optical and Radar Time Series Image Classification

In this study, 3D-CNNs were employed to extract spatial–temporal, spectral, and intensity features from the optical and radar data, using the architecture from our previous research (Teimouri et al. 2022). In that study, 3D-CNNs were trained using RTL only for crop classification. Here, we extend the approach by incorporating both RTL and VTL to train the network and investigate the impact of the VTL on crop classification.

As shown in Fig. 2, a 3D-CNN with two input branches is used to fuse the optical and radar time series images. One branch takes the optical and the other the radar time series images as input, each linearly normalized to [0, 1]. Each branch consists of twelve 3D convolution operators. Finally, the features extracted from the two data sources are concatenated and fed into the fully connected layers.

Fig. 2
figure 2

The network structure for the fusion of S1 and S2 (adapted from Teimouri et al. 2022)

More specifically, the input channels of each dataset (optical, radar, and fused) are processed separately: the 3D convolution operators are applied to a sequence of three images with stride one in the temporal direction for each channel. Next, the features generated from each time series of a specific channel are concatenated. The network branches for radar and optical data each consist of three convolutional blocks with 32, 32, and 64 kernels, respectively. Each block is followed by a ReLU activation function and maximum pooling with dimensions of 2 × 2 × 1. The final layers of the architecture are two fully connected layers with 128 and 64 neurons, respectively.
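The following PyTorch sketch approximates such a dual-branch network with the block and kernel counts stated above; it omits the per-channel temporal windowing, orders the tensor dimensions in the standard PyTorch convention, and uses ceil-mode pooling so that 7 × 7 patches remain valid. It is a simplified illustration, not the published implementation of Teimouri et al. (2022).

```python
import torch
import torch.nn as nn

class Branch3D(nn.Module):
    """One input branch: three 3D-conv blocks with 32, 32 and 64 kernels (simplified)."""
    def __init__(self, in_channels):
        super().__init__()
        blocks, prev = [], in_channels
        for n_kernels in (32, 32, 64):
            blocks += [
                nn.Conv3d(prev, n_kernels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                # 2 x 2 spatial pooling, no temporal pooling; ceil_mode keeps 7 x 7 patches valid
                nn.MaxPool3d(kernel_size=(1, 2, 2), ceil_mode=True),
            ]
            prev = n_kernels
        self.features = nn.Sequential(*blocks)

    def forward(self, x):                      # x: (batch, channels, time, height, width)
        return torch.flatten(self.features(x), start_dim=1)

class FusionNet(nn.Module):
    """Two-branch 3D-CNN: optical and radar features are concatenated before the classifier."""
    def __init__(self, n_classes=7, optical_bands=4, radar_channels=2,
                 optical_epochs=6, radar_epochs=7):
        super().__init__()
        self.optical = Branch3D(optical_bands)
        self.radar = Branch3D(radar_channels)
        fused_dim = 64 * optical_epochs + 64 * radar_epochs   # spatial size reduced to 1 x 1
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 128), nn.ReLU(inplace=True), nn.Dropout(0.4),
            nn.Linear(128, 64), nn.ReLU(inplace=True), nn.Dropout(0.4),
            nn.Linear(64, n_classes),
        )

    def forward(self, opt, sar):
        return self.classifier(torch.cat([self.optical(opt), self.radar(sar)], dim=1))

# Toy forward pass: 7 x 7 patches, 6 optical and 7 radar epochs
net = FusionNet()
opt = torch.randn(2, 4, 6, 7, 7)    # (batch, bands, epochs, height, width)
sar = torch.randn(2, 2, 7, 7, 7)
print(net(opt, sar).shape)          # -> torch.Size([2, 7])
```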

This architecture is used for pixel-wise classification: the central pixel of each 7 × 7 patch is classified. Patches for training are randomly extracted from the scene, taking care to avoid any overlap between patches in order to decrease possible correlations. For the generation of the semantic segmentation map, each pixel of the scene was classified independently; the test sample patches were again selected randomly in the scene and in such a way that they did not overlap either.
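One possible way to draw such non-overlapping patch centers is sketched below (an illustrative numpy routine, not the authors' sampling code).

```python
import numpy as np

def sample_patch_centers(height, width, n_samples, patch=7, seed=0):
    """Randomly select patch centers so that no two patch x patch windows overlap."""
    rng = np.random.default_rng(seed)
    half = patch // 2
    centers, taken = [], np.zeros((height, width), dtype=bool)
    for idx in rng.permutation(height * width):
        r, c = divmod(idx, width)
        if r < half or c < half or r >= height - half or c >= width - half:
            continue                       # patch would leave the image
        if taken[r - half:r + half + 1, c - half:c + half + 1].any():
            continue                       # patch would overlap an earlier one
        centers.append((r, c))
        taken[r - half:r + half + 1, c - half:c + half + 1] = True
        if len(centers) == n_samples:
            break
    return centers

# Example for the 1593 x 2516 pixel scene and 1050 training samples
centers = sample_patch_centers(1593, 2516, n_samples=1050)
```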

The learning rate, number of epochs, and mini-batch size used in this study are 0.001, 1000, and 500, respectively. Dropout layers with a rate of 0.4 are used after each fully connected layer to reduce the effect of overfitting. The network was trained using the cross-entropy loss function with the Adaptive Moment Estimation (Adam) optimizer (Kingma and Ba 2014) and early stopping. The early stopping criterion was considered to be satisfied when the validation accuracy had consistently decreased for ten consecutive iterations.
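A training loop consistent with these settings could look as follows. The data loaders yielding (optical patch, radar patch, label) triples, the mini-batch size being set in those loaders, and the interpretation of the stopping rule as "ten epochs without improvement" are our assumptions; the sketch reuses the FusionNet class from above.

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, device="cpu",
          lr=1e-3, max_epochs=1000, patience=10):
    """Cross-entropy training with Adam and early stopping on the validation accuracy."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_acc, best_state, bad_epochs = 0.0, copy.deepcopy(model.state_dict()), 0

    for epoch in range(max_epochs):
        model.train()
        for opt_patch, sar_patch, label in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(opt_patch.to(device), sar_patch.to(device)),
                             label.to(device))
            loss.backward()
            optimizer.step()

        # Validation accuracy drives the early-stopping criterion
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for opt_patch, sar_patch, label in val_loader:
                pred = model(opt_patch.to(device), sar_patch.to(device)).argmax(dim=1)
                correct += (pred == label.to(device)).sum().item()
                total += label.numel()
        val_acc = correct / total

        if val_acc > best_acc:
            best_acc, best_state, bad_epochs = val_acc, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:      # stop after ten epochs without improvement
                break

    model.load_state_dict(best_state)
    return model, best_acc

# Usage (loaders are assumed to exist): model, oa = train(FusionNet(), train_loader, val_loader)
```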

4 Experimental Results

4.1 Test Site and Preprocessing

We use images from the region of Catalonia, located in the northeastern part of Spain. The majority of the area is covered by agricultural land, as shown in Fig. 3a. The selected area is dominated by seven different crops (alfalfa, oat, corn, beans, triticale, wheat, and rapeseed). As is usual in crop monitoring, one image per month was used in this work, resulting in seven images between February and August 2018 (Table 1). The optical image of March is partly covered by clouds and was therefore ignored in this study. Each S1 image consists of two polarizations (VV, VH), acquired in Ground Range Detected (GRD) mode. The pre-processing applied to these images included accurate geolocation, removing thermal noise to enhance image quality, performing radiometric calibration to normalize the intensity values, applying speckle filtering to reduce the granular noise, and conducting Range Doppler terrain correction to correct geometric distortions caused by topography. All pre-processing steps were executed using the Sentinel Application Platform (SNAP) software; the necessary parameters were taken from the available orbit files. The radar images were resampled to a spatial resolution of 10 × 10 m². Four spectral bands (red, green, blue, and near-infrared) of each S2 level 2A image were chosen, as these hold significant potential for crop classification (Defourny et al. 2019; Dhau et al. 2021; You et al. 2021). All images consist of 1593 × 2516 pixels. A ground truth map was produced by the Department of Agriculture, Livestock, Fishing and Food of the Generalitat of Catalonia; this map was resampled to 10 × 10 m² as well (Fig. 3b). For network training, 1050 training samples and 490 validation samples were used, and 3500 test samples were employed to evaluate the algorithms (each sample being an individual pixel). Training, validation, and test samples were randomly distributed across the study area, and an equal number of samples was ensured for all classes.

Fig. 3
figure 3

a True color S2 image of the study area in July (Red: B4 band, Green: B3, Blue: B2), b ground truth map, both with a size of 1593 × 2516 pixels at a resolution of 10 × 10 m² (from Teimouri et al. 2022)

Table 1 SAR and optical data collection

4.2 Generating VTL

As described in Sect. 3, there are two different inputs for VTL generation: the six first principal components derived from the optical time series images (one for each epoch), and the seven VV as well as seven VH polarization channels of the radar images. The 1050 pixels, all coming from the training data, were then employed as RTL to generate the VTL.
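Extracting the first principal component of each optical epoch and stacking it with the radar channels could, for instance, be done as follows (a scikit-learn/numpy sketch under assumed array shapes, not the original preprocessing code).

```python
import numpy as np
from sklearn.decomposition import PCA

def first_principal_component(image):
    """Return the first principal component of one optical epoch.

    image: array of shape (height, width, bands), e.g. the four S2 bands of one month.
    """
    h, w, b = image.shape
    pc1 = PCA(n_components=1).fit_transform(image.reshape(-1, b))
    return pc1.reshape(h, w)

def clustering_features(optical_series, vv, vh):
    """Stack the per-epoch first PCs with the VV and VH channels into per-pixel feature rows.

    optical_series: (epochs, height, width, bands); vv, vh: (epochs, height, width).
    """
    pcs = np.stack([first_principal_component(img) for img in optical_series])  # (epochs, h, w)
    stack = np.concatenate([pcs, vv, vh], axis=0)                               # (channels, h, w)
    return stack.reshape(stack.shape[0], -1).T                                  # (pixels, channels)
```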

We generated three different sets of VTL: one for classifying the optical data alone, one for the radar data alone, and one for the classification of both image types together. For the VTL used in classifying the optical data, only RTL from the optical channels were used, and analogously for the radar data and the fused image set.

The number of selected unlabeled pixels was chosen to be approximately three times as large as the number of RTL. The reason is that we wanted to end up with approximately the same number of RTL and VTL; however, some VTL were rejected due to ambiguous results, as mentioned before. The factor of three turned out to be a good choice. Finally, a total of 2100 samples (i.e., RTL + VTL) was used for training in each run.

4.3 Results

A comparison of the results for the 3500 test samples achieved with only RTL and with a combination of RTL and VTL is presented in Tables 2 and 3. They show that for most classes the combination leads to an increase in classification accuracy for the optical and radar data sources as well as for their fusion. Achieving an OA of 92.6% and a KC of 91.4% for the S2 images demonstrates the performance of combining RTL and VTL; the inclusion of VTL improved the OA by 4.0% and the KC by 4.7%. In particular, the VTL generated for corn, oat, wheat, and triticale were highly effective, with corn showing the largest improvement of 15.6% in UA. These improvements are further supported by the F1-score analysis (Table 3), which confirms the enhanced performance of these classes when using RTL and VTL together. However, it should be noted that the addition of VTL resulted in a decrease in UA for beans, although the F1-score indicates an improvement of 2.1%.

Table 2 OA, UA and KC, all in %, for classification with only RTL, combination of RTL + VTL and differences (based on 3500 test samples, RTL only from Teimouri et al. 2022)
Table 3 F1-scores in % for classification with only RTL, combination of RTL + VTL and differences (based on 3500 test samples, RTL only from Teimouri et al. 2022)

Similarly, for radar images the integration of VTL and RTL led to an increase in OA (3.3%) and KC (3.9%). Interestingly, for alfalfa the VTL generated using radar data showed more significant improvements compared to those generated using optical data, with an improvement of approximately 7.2% in UA and 5.0% in F1-score, while the VTL generated from optical data yielded only a 1% gain in UA and a decrease in the F1-score of 0.2%. The integration of VTL also led to improvements in UA for beans, corn, oat, triticale, and wheat; the largest improvements in F1-score were observed for corn, triticale, and wheat, with approximately 11.5%, 2.3%, and 7.5%, respectively. However, for rapeseed, oat, and beans the F1-scores decreased somewhat.

The combination of VTL and RTL also yielded significant improvements in OA and KC when fusing optical and radar data, with values of 93.0% and 91.9%, respectively. Notably, the UA of wheat, corn, beans, and oat improved by 4.4%, 3.2%, 2.6%, and 2.4%, respectively, demonstrating the effectiveness of VTL in accurately identifying and distinguishing between different crops. Furthermore, the F1-score improved for all crops. These results highlight the potential of using VTL, which can significantly improve the accuracy of crop classification, especially for certain crops.

To test the quality of the generated VTL, we also trained the networks with only VTL. While the results were still acceptable, in general, a decrease of about 10% in OA and KC was observed, as was to be expected.

Figures 4 and 5 depict two large subsets of the study area, which were chosen for visual interpretation. The maps produced using a combination of VTL and RTL exhibit a significant level of map uniformity with reduced noise. Additionally, the yellow ellipses illustrate the impact of VTL on crop classification, leading to improved results in most regions.

Fig. 4
figure 4

Results of the fusion of optical and radar time series images from first subset of the study area. a VTL + RTL, b RTL only, c ground truth. Yellow ellipses show areas with particular differences

Fig. 5
figure 5

Results of the fusion of optical and radar time series images from the second subset. a VTL + RTL, b RTL only, c ground truth. Yellow ellipses show areas with particular differences

Finally, although care was taken to only generate correct VTL, there is obviously a probability that some virtual samples have incorrect labels, potentially introducing erroneous information during training. To tackle this challenge, a strategy was designed in which the virtual sample set was randomly divided into multiple subsets. The subsets were then iteratively injected into the RTL. If the OA on the validation set improved, the corresponding VTL subset was combined with the RTL; otherwise, the subset was rejected, which happened in about 30% of the cases. While there is a possibility that, in this way, some virtual samples with correct labels fell into rejected subsets, the primary objective was to identify VTL subsets with high accuracy. The best number of subsets was experimentally found to be 10, with an equal number of samples in each subset.
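One possible reading of this subset-selection strategy is sketched below; the callback train_and_validate, which trains the 3D-CNN on a given training set and returns the validation OA, is a hypothetical placeholder.

```python
import numpy as np

def select_vtl_subsets(vtl_samples, vtl_labels, rtl_samples, rtl_labels,
                       train_and_validate, n_subsets=10, seed=0):
    """Keep only the VTL subsets whose addition improves the validation OA (yielding VTL*)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(vtl_samples))
    subsets = np.array_split(order, n_subsets)      # roughly equal-sized random subsets

    kept_samples, kept_labels = rtl_samples, rtl_labels
    best_oa = train_and_validate(kept_samples, kept_labels)

    for subset in subsets:
        cand_samples = np.concatenate([kept_samples, vtl_samples[subset]])
        cand_labels = np.concatenate([kept_labels, vtl_labels[subset]])
        oa = train_and_validate(cand_samples, cand_labels)
        if oa > best_oa:                            # accept the subset only if the OA improves
            kept_samples, kept_labels, best_oa = cand_samples, cand_labels, oa
        # otherwise the subset is rejected

    return kept_samples, kept_labels, best_oa
```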

The selected virtual samples are referred to as VTL*. Table 4 presents the results obtained by training the 3D-CNN using RTL + VTL and RTL + VTL*, respectively, for the classification of crops from the fusion of S1 and S2 time series images.

Table 4 Comparison of performance between RTL + VTL and RTL + VTL* obtained from the fusion of S1 and S2 time series

According to Table 4, the proposed method demonstrates another significant improvement in classification accuracy. Additionally, the comparison between RTL + VTL and RTL + VTL* highlights that although the generated VTL enhance classification accuracy, some samples may carry incorrect labels. By excluding these samples, higher-quality samples were utilized for training the 3D-CNN, resulting in improvements in OA and KC of approximately 2.3% and 2.6%, respectively. The results obtained with RTL + VTL* exhibit stronger performance than those with RTL + VTL, with improved UA and F1-scores compared to the results presented in Tables 2 and 3 across nearly all classes.

5 Conclusion

This paper presents a novel method for generating VTL to enhance the training of 3D-CNNs for crop classification using fused S1 and S2 time series data. The study revealed that incorporating both VTL and RTL during training leads to higher classification accuracy and a better F1-score for nearly all classes. By training the network using VTL + RTL for the fusion of S1 and S2 time series images, the results demonstrate an improvement in OA and KC of 1.7% and 2.0%, respectively, compared to training the network solely with RTL. Furthermore, it was observed that some of the generated virtual labels were incorrect. However, by iteratively adding only those VTL subsets which increased the OA, and thus training the network with a reduced number of VTL, OA and KC improved by 2.3% and 2.6%, respectively. Consequently, the proposed method can be said to contribute significantly to the improvement of crop classification.

In future work, we will test the method on additional and larger datasets. We also plan to incorporate attention-based approaches for the temporal domain, as well as knowledge of plant phenology, the latter by conditioning the network on this prior information in a suitable way.