1 Introduction

Since the emergence of the COVID-19 pandemic and the resulting surge in the use of personal protective equipment (PPE), mostly face masks and gloves, PPE waste has become a significant issue for governments worldwide. A major consequence of the substantial increase in face mask usage is littering: millions of discarded face masks are dropped in urban and natural environments around the world every day. These littered masks are a major source of microplastic fibre contamination of the environment [17, 52], and this new type of waste creates significant challenges for urban waste management [59]. Discarded face masks can also contribute to the spread of the virus through multiple carriers [4, 28]. Since the COVID-19 virus can remain infectious on face masks for up to a week, transmission through biological and non-biological carriers poses a potential risk. Furthermore, many people, including street cleaners and children, may be unaware of the dangers of discarded masks and are therefore also at risk.

The safe management and treatment of biomedical waste is discussed extensively in the literature [27, 58]. Until now, however, the production of hazardous medical waste has been concentrated at clinics and laboratories. Handling hazardous medical waste in these controlled environments allowed waste collectors to plan for the careful and safe treatment of dangerous and infectious materials. Due to the COVID-19 pandemic, significant quantities of highly contagious discarded face masks are now littered in streets, parks, office buildings, residential areas, and natural environments. This creates an urgent challenge for the waste management community: the detection, collection, and treatment of a large amount of biomedical waste in residential areas. A further difficulty is the scale and urgency of the current situation. Discarded face masks are littered in almost every urban area around the world, and they spread to the natural environment through wind and running water; they are now considered a significant source of pollution for rivers and marine ecosystems. Given the scale of the problem, manual management is impractical, and technology should be harnessed to address it.

Real-time video monitoring and surveillance are gradually becoming part of smart cities. Edge computing and System-on-Chip (SoC) based smart cameras allow computationally intensive vision systems to run at the edges of Internet of Things (IoT) networks [35]. A potential application of these technologies is litter monitoring and waste management in smart cities: real-time video monitoring enables the timely detection and management of hazardous biomedical litter such as COVID-19-contaminated face masks. In this paper, a framework that uses edge video surveillance and location intelligence for the autonomous detection of littered face masks is proposed. A new dataset of littered face masks, called MaskNet, is collected, and a new deep neural network architecture, called LitterNet, is proposed for the fast detection of discarded face masks in various environments. Using the locations of the detected littered masks and a proposed location-based model built on smart city geo-location datasets, areas with a higher probability of hazardous biomedical litter can be predicted for preventive smart waste management. This research makes several novel contributions:

  1. The novel multi-location littered face mask dataset, called MaskNet, is gathered for training machine learning models for the autonomous detection and classification of discarded face masks in various urban and natural environments. The dataset was collected in Steyr, Austria, and Tehran, Iran, and to the best of our knowledge it is the first dataset of littered medical face masks in different environments. Although several face mask datasets are currently available online, the experimental results in this research show that detecting littered face masks using those datasets leads to low accuracy. An accurate deep neural network model for littered mask detection requires face masks with various degrees of degradation and deformation, covered in different levels of dirt, set against varied natural backgrounds, and captured under different lighting conditions. The proposed MaskNet dataset addresses these requirements.

  2. The second contribution of the paper is a new deep neural network architecture, called LitterNet, which automatically detects and classifies environments containing discarded face masks with an accuracy of 96%. The proposed model is 10 times faster than comparable architectures, an important factor for processing the required footage on the SoC processing units of edge surveillance nodes with limited computational resources.

  3. The third contribution of this research is a location intelligence model that combines the output of LitterNet with models interpolated from smart city geo-location datasets in order to provide a series of proposals for hazardous litter management and emergency response in smart cities.

The rest of the paper is organized as follows: Sect. 2 provides a review of the work related to this research. Section 3 presents the proposed framework, the methodology of this research, and the experimental setup. In Sect. 4, the experimental results are detailed and discussed. Finally, Sect. 5 concludes the paper.

2 Litter management in smart eco-cyber-physical systems

In recent years, smart cities have increasingly analysed citizen behaviour and made data-driven decisions for the autonomous management of complex governance tasks. Visual surveillance and monitoring systems are also gradually being integrated into many business environments for security and Business Intelligence (BI) based behaviour analysis. Artificial intelligence algorithms enable smart cities to model human behaviour with high accuracy and provide decision support for governments. In this context, the smart city can be considered an eco-cyber-physical system (ecoCystem) [30, 34]: a combination of the living and non-living components of the ecosystem together with the cyber-physical sensors and intelligent agents in the environment, interacting as a system. In many countries, street cleaning requires sweepers to visit different locations in the city and manually verify whether a street or location needs to be cleaned. This approach is time-consuming and not cost-effective [8]. Governance intelligence platforms can use the data collected from surveillance camera networks and robotic agents for the autonomous modelling of litter patterns. City management can use this data to model littering behaviour and take precautionary measures to protect the environment and the safety of citizens. Furthermore, law enforcement can automatically identify and fine people who endanger community health by littering.

Artificial intelligence and robotics have helped the waste management community significantly increase the speed and scale of waste management [33, 42, 54]. Alvarez et al. [5] proposed a robot-assisted process for electronic equipment recycling. Zhihong et al. [62] used a Region Proposal Network (RPN) and a VGG-16 model for object recognition and pose estimation to sort littered bottles. Karbasi et al. [26] investigated the sorting of used button cell batteries with robotic arms and proposed three different detection techniques for this type of waste. Furthermore, in [54], Wang et al. used neural networks to assist a robot on construction sites, employing Faster R-CNN to detect scattered nails and screws in real time so that the robot could collect them automatically.

Visual litter analysis is one of the primary methods for modelling the quantity and characteristics of litter. Autonomous sorting of litter using image processing helps to separate different types of waste for recycling [9, 20, 31, 41, 44]. Deep neural networks also provide practical visual modelling tools for the detection and classification of various types of waste [10, 12, 31, 37, 44]. Sakr et al. [41] argued that to reduce the contamination of waste by other materials, sorting and separation must be done as early as possible, and that the need to automate this process is a significant driver for waste companies. To this end, the authors used convolutional neural networks (CNN) and Support Vector Machines (SVM) for sorting plastic, paper, and metal waste, with a Raspberry Pi 3 acting as the controller. In [9], Bircanoglou et al. evaluated different deep neural network architectures for intelligent waste sorting. Sreelakshmi et al. [44] investigated the segregation of plastic and non-plastic waste.

During the current pandemic, Medical Waste Management (MWM) has become a major challenge for governments around the world. Kalantary et al. [25] assessed five hospitals in Iran and reported that the COVID-19 pandemic led to an average increase of 102.2% in medical waste in both private and public hospitals. They further reported that the proportion of infectious waste in the medical waste composition increased by 9%, and by 121% compared to pre-pandemic levels. In [51], Tsai et al. analysed and compared the status of medical waste generation before and after the COVID-19 pandemic in Taiwan and found that the difference was significant in the second half of 2020. Furthermore, they suggested that all waste generated by healthcare facilities must be disposed of in officially certified incineration plants operating at temperatures high enough to effectively destroy the virus, ensuring the safe and complete destruction of the COVID-19 virus and preventing further spread from medical waste such as face masks. Hartanto and Mayasari [21] took a different path and used the analytic hierarchy process (AHP) to determine appropriate materials for environmentally friendly face masks. In [43], Sari et al. investigated medical waste from COVID-19 intensive care hospitals and disposable face mask waste generation in Indonesia; the results show that several tons of medical and face mask waste are generated daily and must be safely transported to treatment facilities. Tagle et al. [47] addressed the lack of legislation for the correct disposal of waste during the current pandemic in Mexico and argued for a protocol for managing household waste that, during the pandemic, can be classified as medical waste.

Torres and De-la-Torre [50] addressed the issue of face mask waste management in Peru; their estimate suggests that about 15 million waste face masks are generated daily in the country. A significant portion of these masks is mismanaged, polluting the streets, beaches, and marine environments of regions such as Lima, Ancash, La Libertad, and Piura. Moreover, in [40], it is discussed that medical mask waste can be found mixed with household waste without special treatment and separation, which can be hazardous, as such waste requires a dedicated treatment process. Joes et al. [24] presented an environment-friendly trash bin design for the current pandemic. Their proposed product treats and manages medical mask waste by changing the mask's shape with a hole puncher and further disinfecting it with two disinfectants provided in a storage box inside the trash bin. Even though the proposed product disinfects face masks, the issue of masks littered in different environments and streets remains unsolved.

Tirkolaee and Aydin [7] focused on vehicle operations in waste management, minimizing the total costs of transportation, outsourcing, and vehicle use during the pandemic using the CPLEX solver. In [49], Tirkolaee et al. addressed the location-routing problem and proposed a mixed-integer linear programming (MILP) model along with fuzzy chance-constrained programming to investigate the routing problem for MWM during the current pandemic. Dharmaraj et al. [16] argued that because face masks are made of petroleum-based, non-renewable, non-biodegradable polymers, they are hazardous to the environment and create health issues. Littered face masks end up in rivers and open waters, endangering marine ecosystems and their fauna and flora.

However, existing research has focused either on biomedical waste in controlled environments such as hospitals or on waste already collected at waste management sites. The significant quantities of face masks littered in streets, parks, office buildings, residential areas, and natural environments constitute an important problem that this research addresses for the first time. The scale and urgency of this problem necessitate a scalable and safe technological solution. Using mobile edge computing and installing high-resolution cameras on garbage trucks and other municipal vehicles, street cleaning can be performed more efficiently and faster [61]. Therefore, this research presents a solution that addresses these issues. Table 1 presents a summary of the investigated related works.

Table 1 Summary of related works

3 Location-aware hazardous litter detection

In smart cities, the cloud is commonly located far away from end-users, and therefore, it cannot offer low latency connections. In order to solve this issue, different edge computing methods are employed. The method suitable for the proposed framework is multi-access edge computing (MEC). In this approach, the data is processed between the cloud and the end-user. In our proposed framework, fixed and mobile surveillance cameras across the smart city, acting as modular rapidly deployable decision support agents (MoRaD DSA) [1, 2], provide the system with street footage as well as location and time. The MEC hosts can either be embedded in data collection agents (drones, robots, trucks), or they can be placed in fixed locations. The MEC host uses the proposed LitterNet model to process the surveillance videos and to determine the location of the discarded face masks in the streets. Meanwhile, the respective geo-location data (latitude and longitude) of the input video will also be collected. The geo-location data will be transmitted to the smart city cloud, which is responsible for performing further analysis. By having access to the geolocation data, using the proposed location intelligence model, the smart city authorities can make plans for waste collection and street cleaning and predict which districts, blocks, streets, and locations are going to be most littered with discarded face masks. Therefore, the smart city authorities will be able to make plans for the efficient routing of garbage trucks and street sweepers. If required by the smart city management, more accurate and computationally intensive models will be used on the cloud for better face mask detection or detection of other types of trash. The overall architecture of the proposed framework is presented in Fig. 1.

Fig. 1

Overview of the proposed framework for litter management

The proposed framework consists of two main phases: (1) Data Collection & Preparation and (2) Model Training & Evaluation, which are discussed in the upcoming sections. In the proposed framework, input images with different resolutions are first captured by the surveillance systems and sent to the host together with the location at which each image was acquired. Since the input images come in different resolutions, they are resized to the input shape of the LitterNet or DenseNet201 models (step 2.1 – 540 × 720 pixels). Surveillance footage in the QCIF, CIF, 2CIF, 4CIF, and D1 standards, with sizes of 176 × 120, 352 × 240, 704 × 240, 704 × 480, and 720 × 480 pixels, respectively, should be upscaled before being used in the proposed framework. Formats with higher resolutions, such as 720p HD, 960p HD, 1.3 MP, and 2 MP, can be used at their native resolution. After the input image is resized, it is sent to the LitterNet model (2.2) to determine whether it contains any littered face masks. If a littered mask is detected (2.3), the resized input image, along with the coordinates of the littered face mask's location (2.4), is sent to the cloud. Because the masks are detected on the edge devices, the network latency in this case is very low.
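To make the edge-side flow concrete, the following minimal sketch illustrates steps 2.1–2.4 on a single frame in Keras/OpenCV style; the model file name, the class index for "littered mask", and the cloud endpoint are illustrative assumptions rather than part of the implemented system.

```python
# Minimal sketch of the edge-side pipeline (steps 2.1-2.4).
# The model path, class index, and cloud endpoint below are illustrative assumptions.
import cv2
import numpy as np
import requests
import tensorflow as tf

MODEL_PATH = "litternet.h5"                          # assumed file with trained LitterNet weights
CLOUD_URL = "https://smartcity.example/api/litter"   # hypothetical cloud endpoint
INPUT_HW = (540, 720)                                # model input: height x width (step 2.1)

model = tf.keras.models.load_model(MODEL_PATH)

def process_frame(frame_bgr, latitude, longitude):
    """Resize one surveillance frame, classify it, and report a detection."""
    resized = cv2.resize(frame_bgr, (INPUT_HW[1], INPUT_HW[0]))      # cv2 expects (width, height)
    batch = np.expand_dims(resized.astype("float32") / 255.0, axis=0)
    probs = model.predict(batch, verbose=0)[0]                       # step 2.2: LitterNet inference
    if int(np.argmax(probs)) == 1:                                   # step 2.3: assumed index of the "littered mask" class
        requests.post(CLOUD_URL, timeout=5, json={                   # step 2.4: send coordinates to the cloud
            "lat": latitude, "lon": longitude, "confidence": float(np.max(probs)),
        })
```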

In the proposed framework, the authorities should specify the required level of accuracy for litter detection. The experimental results show that the accuracy of the proposed computationally light LitterNet architecture, which is suitable for running on edge nodes, is 96%. If this accuracy is not acceptable from the authorities' point of view, or if the detection and classification of other forms of litter is required, more computationally intensive neural models must be deployed. In this case, the input at its original resolution (before the resizing in 2.1) is sent to the cloud. Next, in the cloud, the littered mask image is fed to the face mask detection model (3.1) to perform detection and predict the bounding box with high accuracy. Then, based on the input coordinates, proximity locations within 10 to 15 m of the actual location are generated (3.2) for further geolocation analysis. These locations are stored in a centralized server (3.3) for use in later steps. Finally, by integrating all the acquired data, further Exploratory Data Analysis (EDA) and visualizations (3.6) can be performed to provide the authorities and human operators in the smart city with more accurate information and tools for designing plans and optimizing garbage collection.
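As an illustration of step 3.2, the sketch below generates a handful of points roughly 10–15 m around a detection using a simple metres-to-degrees approximation; the number of points and the conversion are assumptions, since the paper does not specify how the proximity locations are sampled.

```python
import math
import random

def proximity_points(lat, lon, n=8, min_m=10.0, max_m=15.0):
    """Generate n points roughly 10-15 m around a detected litter location (step 3.2)."""
    points = []
    for _ in range(n):
        distance = random.uniform(min_m, max_m)            # metres from the detection
        bearing = random.uniform(0.0, 2.0 * math.pi)       # random direction
        # Rough conversion: 1 degree of latitude is about 111,320 m;
        # a degree of longitude shrinks by cos(latitude).
        d_lat = distance * math.cos(bearing) / 111_320.0
        d_lon = distance * math.sin(bearing) / (111_320.0 * math.cos(math.radians(lat)))
        points.append((lat + d_lat, lon + d_lon))
    return points
```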

It is important to mention that if the recorded footage is transmitted from the edge nodes to the cloud, scaling the proposed framework will require significantly higher bandwidth, so this high-accuracy mode should be used only in emergencies; under these conditions the network latency is also high. However, advanced SoCs with neural computing capabilities may soon make this issue irrelevant. Various methods for detection, re-identification, and tracking across multiple cameras without storing the streaming data have also been proposed and may help alleviate this problem [35].

3.1 Data collection and preparation

Since no dataset of littered medical masks existed before this research, the first step in the proposed framework was to collect a new dataset of littered face mask images, called MaskNet. The MaskNet dataset was collected in Austria and Iran in July 2020. It was gathered daily for seven days in Steyr, Austria, and Tehran, Iran, at different times of day from 6 A.M. to 6 P.M. and in different environments such as streets, parks, riverbanks, and the inside of buildings and offices. It consists of 1061 images of surgical masks littered on streets and in other urban areas. In addition, images of streets without discarded masks were also collected. The original size of the dataset is 6.9 GB, and the photos have a resolution of 3024 \(\times\) 4032 pixels. The images were captured with three smartphones: an Apple iPhone 11, a Samsung Galaxy S8, and a Samsung Galaxy S10.

For sampling, the images were acquired with the default zoom of the aforementioned smartphones, with the mask at the centre, top, bottom, or corners of the image, and the discarded masks were left in different states of deformation, with various levels of dirt, and near different objects such as bins and bushes. In order to train the model used in the proposed system efficiently and to normalize the dataset, the images were resized to 540 \(\times\) 720 pixels, which reduces the size of the dataset to 220 MB. For training the models, 70% of the images are used for training the LitterNet deep neural network, 13% for cross-validation, and 17% for testing. Figure 2 shows a few samples of the MaskNet dataset.

Fig. 2

Samples from the MaskNet dataset
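For clarity, a minimal sketch of the preprocessing described above is shown below: resizing the photos to 540 × 720 pixels and splitting the file list 70/13/17. The folder names and the random seed are illustrative assumptions.

```python
import random
from pathlib import Path
from PIL import Image

SRC_DIR = Path("masknet_raw")       # assumed folder of original 3024x4032 photos
DST_DIR = Path("masknet_540x720")   # resized copies used for training

def resize_dataset():
    """Resize every photo to 540x720 pixels (PIL expects width x height)."""
    DST_DIR.mkdir(exist_ok=True)
    for img_path in SRC_DIR.glob("*.jpg"):
        Image.open(img_path).resize((720, 540)).save(DST_DIR / img_path.name)

def split_70_13_17(paths, seed=42):
    """Shuffle the file list and split it into train (70%), validation (13%), and test (17%)."""
    paths = sorted(paths)
    random.Random(seed).shuffle(paths)
    n_train, n_val = int(0.70 * len(paths)), int(0.13 * len(paths))
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]
```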

In order to model realistic surveillance videos, prevent the model from overfitting, and increase the amount of data in the dataset, data augmentation (DA) is used. DA generates additional samples from each image through transformations such as shifting the object's location, rescaling, and rotating the image. Besides increasing the amount of data, this technique helps prevent overfitting by letting the model see varied views of the same object, which increases the accuracy of the neural network model. The DA was performed using Keras's ImageDataGenerator class. To define the best parameters for DA, various combinations of augmentation parameters were examined, and the final DA parameters, presented in Table 2, were chosen. This augmentation makes the data relevant for video monitoring applications, as the changes in scale and rotation simulate footage from fixed and moving surveillance cameras.

Table 2 DA parameters
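As a concrete illustration of the augmentation setup, the sketch below uses Keras's ImageDataGenerator; the transformation values shown are placeholders, the actual values are those reported in Table 2, and the directory layout is an assumption.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation configuration; the exact values used in this work are listed in Table 2.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=30,         # simulates tilted camera viewpoints
    width_shift_range=0.2,     # moves the mask away from the image centre
    height_shift_range=0.2,
    zoom_range=0.3,            # simulates varying distance between camera and litter
    horizontal_flip=True,
)

train_gen = datagen.flow_from_directory(
    "masknet_540x720/train",   # assumed directory layout with one sub-folder per class
    target_size=(540, 720),    # (height, width)
    batch_size=16,
    class_mode="categorical",
)
```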

An important issue in dataset collection is the randomness of the data. Deep learning datasets should be of high quality, cover as many cases as possible, and be a good representation of reality. A model trained on a high-quality dataset can capture the underlying characteristics of the data and generalizes better. In this research, DA is used to oversample minority classes and generate new data, ensuring a high-quality dataset without under-represented classes. One future direction of this research is building better datasets using Generative Adversarial Networks (GANs).

Another dataset used in this research is the TrashNet dataset [56]. The images in this dataset are photos of garbage on a white background. The original dataset is approximately 3.5 GB, with an image size of 2448 \(\times\) 3264 pixels; the resized version is around 40 MB with 512 \(\times\) 384 pixel images. The dataset consists of 2527 images distributed over six categories: cardboard (403 samples), glass (501 samples), metal (410 samples), paper (594 samples), plastic (482 samples), and trash (137 samples). The trash category does not contain sufficient samples and was therefore removed in this work, leaving a final dataset of 2390 images. As before, 70% of the images were used for training, 17% for testing, and the remaining 13% for cross-validation. Figure 3 shows several samples from this dataset.

Fig. 3

Samples from the TrashNet dataset

The same augmentation parameters were used for both the MaskNet and TrashNet datasets. The next section describes the training process of the models used in this research.

3.2 Model training & evaluation

In this research, a light CNN-based architecture called LitterNet is proposed for rapid littered face mask detection. The LitterNet architecture consists of Conv2D and MaxPooling2D layers, as presented in Fig. 4. In this network, layers close to the input have fewer convolutional filters, while layers deeper in the network and closer to the output have more filters. The first Conv2D layer therefore has only 16 filters of size 3 \(\times\) 3 with a ReLU activation function, followed by a MaxPooling2D layer with a 2 \(\times\) 2 pool size. The next Conv2D layer expands to 32 filters, with the other arguments kept the same as in the previous layer. The third and final Conv2D layer has 64 filters, and its output passes through the final MaxPooling2D layer before being flattened. The goal of the architecture design was simplicity and speed while keeping the accuracy as high as possible. To prevent overfitting, a Dropout layer with a probability of 0.5 is used. A 16-unit dense layer with rectified linear unit (ReLU) activation follows, and its output is passed to the last dense layer, which uses SoftMax activation to identify whether there is a littered face mask in the photo or not. Two optimizers, Adam [60] and RMSProp [64], are used as optimization algorithms. If LitterNet finds a discarded mask in the input, the respective geolocation of the input image is marked.

Fig. 4

The architecture of LitterNet and the process of mask and trash detection
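A minimal Keras sketch of the LitterNet architecture described above is given below; the exact placement of the pooling layer after the 32-filter block and the two-class output are assumptions based on the text.

```python
from tensorflow.keras import Sequential, layers

def build_litternet(input_shape=(540, 720, 3), num_classes=2):
    """Sketch of LitterNet: three Conv2D/MaxPooling2D blocks, Dropout, and two dense layers."""
    return Sequential([
        layers.Conv2D(16, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),                             # prevents overfitting
        layers.Dense(16, activation="relu"),
        layers.Dense(num_classes, activation="softmax"), # littered mask / no littered mask
    ])

model = build_litternet()
model.compile(optimizer="adam",                          # RMSProp was evaluated as an alternative
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```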

For the evaluation of the proposed model, several state-of-the-art deep neural network architectures are investigated for the proposed application. DenseNet [22] is a deep neural network architecture with very short paths between the input and output layers. Each layer in a dense block receives the feature maps of all preceding layers and passes its own output to all subsequent layers, with feature maps fused through concatenation. These connections form a dense network of pathways that allow better gradient flow, giving each layer direct access to the gradients of the loss function and to the original input signal. Because of this dense connectivity, the model requires fewer layers, as there is no need to learn redundant feature maps, and the collective knowledge (features learned collectively by the network) is reused.
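The dense connectivity described above can be summarized by the following functional-API sketch, in which each layer's output is concatenated with everything that came before it; the layer sizes are illustrative assumptions.

```python
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=32):
    """Illustrative DenseNet-style block: every layer receives the concatenation
    of all previous feature maps and contributes its own feature maps."""
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, (3, 3), padding="same")(y)
        x = layers.Concatenate()([x, y])   # feature maps are fused by concatenation
    return x
```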

Another investigated architecture is Inception-V3 [46], a CNN trained on more than a million images from the ImageNet dataset; it consists of 48 layers and can classify images into 1000 object categories. The next architecture used for comparison is the Neural Architecture Search Network (NASNet) [63], in which the cells are not predefined but designed through reinforcement learning. In this architecture, a building block is optimized on a small dataset and then transferred to a larger dataset: the best convolutional cell found on CIFAR-10 is selected, and several copies of it are stacked and applied to ImageNet. The next architecture, Xception [11], uses a modified depth-wise separable convolution, namely a pointwise convolution followed by a depth-wise convolution, where the depth-wise convolution is a channel-wise \(n \times n\) spatial convolution. For object detection, EfficientDet [48] is used, which is based on a weighted Bi-directional Feature Pyramid Network (BiFPN) and trained on the MS-COCO dataset. Finally, for transfer learning, the DenseNet121, DenseNet169, DenseNet201, InceptionV3, Xception, NASNetLarge, and NASNetMobile architectures were utilized.
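The transfer-learning comparison models follow the usual pattern of reusing an ImageNet-pre-trained backbone with a small classification head, as in the sketch below; the global-average-pooling head is an assumption, only the backbone architectures are taken from the paper.

```python
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import DenseNet201

# Transfer-learning sketch for one of the comparison models (DenseNet201).
base = DenseNet201(weights="imagenet", include_top=False, input_shape=(540, 720, 3))
x = layers.GlobalAveragePooling2D()(base.output)
outputs = layers.Dense(2, activation="softmax")(x)       # littered mask / no littered mask
model = Model(base.input, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_gen, validation_data=val_gen, epochs=50)   # training setup described in the next paragraph
```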

After choosing the best hyperparameters, the final setup was the same for all trained models on both the MaskNet and TrashNet datasets. The models were trained for 50 epochs with a batch size of 16 using the Adam and RMSProp optimizers and DA. Furthermore, the default image size of each dataset (540 \(\times\) 720 for MaskNet and 512 \(\times\) 384 for TrashNet) was used as the input shape of every trained model. The EfficientDet object detection network was retrained using the TensorFlow Object Detection API. Initially, 40,000 training steps were set; however, after 18,000 steps the loss began to increase slowly, and the process was therefore stopped. The loss was below 0.1 (mostly between 0.08 and 0.095) for the final few hundred steps. The experiments in this study were performed using the Keras library with the TensorFlow backend (version 2.5) on the Google Colab platform, on an Nvidia Tesla K80 GPU.

In total, 31 different models were trained on the MaskNet and TrashNet datasets. For the novel MaskNet dataset, 14 models (7 with the Adam optimizer and 7 with RMSProp) were trained using transfer learning and the aforementioned architectures, plus 2 models using the proposed LitterNet architecture (1 with Adam and 1 with RMSProp) and one object detection model, for a total of 17 models. For the TrashNet dataset, 14 models were trained using the previously discussed methods and architectures. After the models were trained on both datasets, their performance was compared and evaluated in order to choose a final model for each task (face mask detection and trash classification). For this purpose, the performance of the models on the test set was compared; the results of this comparison are presented in the experimental results section.

3.3 Location intelligence for predictive biomedical waste modelling

Using the predicted geospatial locations of litter, smart city management can efficiently plan the routing of garbage trucks and rapidly remove hazardous biomedical litter from the streets. The geo-location data generated by LitterNet can validate and suggest the exact points where face masks are littered. However, locating a significant portion of littered masks through visual surveillance alone is not practical. Therefore, the proposed framework uses a collection of smart city datasets along with the locations provided by LitterNet in order to predict the locations of litter clusters.

The first dataset that can support location intelligence for litter cluster prediction is the location of crowds. Crowds are usually the main source of litter, and the number of people gathered in a place correlates with the amount of litter there. The dataset of geo-locations of people's favourite sites in the city, shown in Fig. 5a, can indicate the possibility of litter clusters in areas with high visitor density.

Fig. 5

a Clusters of people's favourite locations in the city, b height map and locations of buildings of public interest, c clusters of locations where people eat publicly and shop, d geospatial clusters of public litter bins

A dataset of the locations of buildings of public interest is also available in the smart city. This geospatial dataset, presented in Fig. 5b with structure heights applied to better visualize the larger buildings, comes in three categories: (a) buildings of national/international importance, (b) buildings of regional importance, and (c) buildings of local importance. It should be noted that most restaurants and public litter bins are also located in the vicinity of these buildings.

Another important dataset is the locations of restaurants, pubs, cafes, takeaways, hotels, and other public places, as well as supermarkets, presented in Fig. 5c. These locations are another important factor for the density of litter in the streets. A further factor is that some people, due to carelessness, discard their trash outside litter bins, and in many cases the bins are not emptied in time, so trash clusters form in the vicinity of litter bins. Therefore, the final geospatial dataset utilized in this work is the public litter bin location dataset, shown in Fig. 5d. By combining these geospatial locations, probable locations of litter clusters can be generated. This information, combined with the validation and reporting of actual litter obtained from the proposed vision system, enables the authorities to plan the routing of garbage trucks and street sweepers.
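A simple way to combine these datasets into probable litter clusters is to pool all coordinates and run a density-based clustering algorithm over them, as in the sketch below; the file names and the DBSCAN parameters are assumptions, since the paper does not specify the clustering method.

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Illustrative fusion of the four geospatial datasets into probable litter clusters.
sources = ["favourite_places.csv", "public_buildings.csv",
           "food_outlets.csv", "litter_bins.csv"]          # assumed file names
points = pd.concat(
    [pd.read_csv(f)[["latitude", "longitude"]] for f in sources],
    ignore_index=True,
)

# DBSCAN on raw lat/lon degrees; roughly 0.0005 degrees of latitude is about 50 m.
labels = DBSCAN(eps=0.0005, min_samples=5).fit_predict(points[["latitude", "longitude"]])
points["cluster"] = labels
litter_hotspots = points[points["cluster"] != -1].groupby("cluster").mean()
```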

4 Experimental results and discussion

4.1 Mask detection and classification using LitterNet

For training the best model on the MaskNet dataset, different hyperparameters were tested: epochs ranging from 30 to 50 and batch sizes of 8, 16, and 32. The best models were trained for 50 epochs, since fewer epochs led to under-fitting and models trained for 60 epochs were over-fitted (by 10–12%). The final models were trained with a batch size of 16. The best-performing model for classifying the MaskNet dataset was the DenseNet201 + Adam model, which achieved a test accuracy of 98.13% using DA and the Adam optimizer. This model also showed the least overfitting, with a training accuracy of 98.91%. The DenseNet169 + Adam model followed with a test accuracy of 97.04%. The NASNetLarge + RMSProp and Xception + RMSProp models offered similar performance, with test accuracies of 94.39% and 95.33%, respectively. The NASNetMobile and DenseNet121 models performed similarly to each other. The worst performance was observed for the InceptionV3 + Adam model, with a test accuracy of 88.79%.

The proposed LitterNet model offered a test accuracy of 94.39% with both the Adam and RMSProp optimizers; however, the cross-entropy loss was lower for the Adam-trained model (0.14) than for the RMSProp model (0.32). LitterNet shows promising performance: even though it is a simple network compared to the pre-trained, complex models used for comparison, it outperformed the NASNetMobile, InceptionV3, and DenseNet121 architectures on this task. The results of the classification of the MaskNet dataset by the various deep neural network architectures are presented in Table 3.

Table 3 Training and test results for MaskNet dataset
Fig. 6

Performance comparison of LitterNet (a) and DenseNet201 (b) on MaskNet

Table 4 shows the total number of parameters of the LitterNet, DenseNet201, and DenseNet169 models. As shown, the proposed LitterNet model has far fewer parameters, which boosts performance, and its prediction time is much lower than that of the DenseNet201 model. Since the framework is intended to be deployed on edge nodes and the models have to work on video streams, lightness and speed are the most important criteria for selecting the neural network architecture. Given that LitterNet is much faster, the 2–4% difference in accuracy between LitterNet and models such as DenseNet201 is tolerable.

Table 4 Total number of parameters of LitterNet, DenseNet201, and DenseNet169 models

In order to further evaluate the proposed LitterNet, its prediction time on a CPU was compared with that of the DenseNet201 + Adam and DenseNet169 + Adam models; the results are presented in Table 5. The testing was conducted on Google Colab, which provides a single-core hyper-threaded Xeon processor @ 2.00 GHz (1 core, 2 threads). For this task, a 7-s video of littered masks and a clean street was captured with a Samsung Galaxy S10, and the frame rate was reduced to a common 24 FPS for convenience. From this footage, 168 frames were extracted and fed to the models for prediction.

Table 5 Prediction time comparison of LitterNet, DenseNet201, and DenseNet169

In this table, the normal time is measured manually with a stopwatch and runs from the moment the one-line prediction command was submitted to the CPU (to predict the set of 168 photos with the LitterNet model) until the process completed. User is the amount of CPU time spent in user-mode code (outside the kernel) within the process, i.e., only the actual CPU time used in executing the process. Sys is the amount of CPU time spent in the kernel within the process, i.e., CPU time spent in system calls, as opposed to library code, which still runs in user space. Finally, Total indicates how much actual CPU time the prediction process used.
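The sketch below shows one way to reproduce such a measurement in Python, using os.times() to separate user and system CPU time from the wall-clock ("normal") time; it mirrors the quantities in Table 5 but is not the exact measurement script used in the experiments.

```python
import os
import time
import numpy as np

def timed_prediction(model, frames):
    """Measure wall-clock ('normal'), user, and system CPU time for one batch of predictions."""
    wall_start = time.time()
    cpu_start = os.times()                     # cumulative user/sys CPU time of this process
    model.predict(np.stack(frames), verbose=0)
    cpu_end = os.times()
    user = cpu_end.user - cpu_start.user
    sys_ = cpu_end.system - cpu_start.system
    return {"normal": time.time() - wall_start, "user": user, "sys": sys_, "total": user + sys_}
```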

As presented, the LitterNet model, with a total CPU time of 5.13 s, is almost 10 times faster than the DenseNet201 and DenseNet169 models, which predict the same set of 168 photos on the same CPU and setup in 54.6 and 45.5 s, respectively. The results therefore suggest that using the LitterNet model for edge surveillance is computationally cost-effective, so its marginally lower accuracy may be tolerated. Figure 7 shows LitterNet classification results, indicating in which pictures a littered face mask was found and in which no discarded face mask exists.

Fig. 7

LitterNet classification results

However, the model failed to correctly detect and classify a few instances in the dataset. As presented in Fig. 8, the model was unable to correctly classify images in which the mask was further away than in other samples, or a sample containing a littered tissue (left).

Fig. 8

A few samples of wrong predictions of LitterNet

For the EfficientDet detection model, the loss was below 0.1 (mostly between 0.08 and 0.095) for the final few hundred steps. Overall, the trained mask detection model was promising, as it was able to detect surgical masks in different and complex environments. Figure 9 presents the different losses of the mask detection model, namely the classification loss (a), localization loss (b), regularization loss (c), and total loss (d), visualized with TensorBoard.

Fig. 9

Different losses of the mask detection model

Figure 10 presents a collection of correctly detected littered face masks from the MaskNet dataset using EfficientDet. The masks appear against different backgrounds, including a dirt road, a paved road, between leaves, on stairs, and tangled in a fence. All the masks are detected with high confidence.

Fig. 10

Samples of correctly detected littered face masks from the MaskNet dataset using EfficientDet

The next stage of experiments tested the proposed model on images it had never seen before. A series of littered face mask images was collected from the Google Images service, and the automatic detection of face masks in these images using the proposed model was investigated. As presented in Fig. 11, in most cases the proposed system managed to detect the mask with high confidence. The model even detected an N95 mask, which was not available in the dataset and was never seen by the model during training.

Fig. 11

Face mask detection on images from Google Image service

As previously discussed, littered face masks endanger the marine ecosystem, and detecting and collecting this litter can help protect it. As presented in Fig. 12, the proposed system was able to detect littered face masks in the water with high confidence. It is worth noting that the dataset contained no samples of littered masks under water, and the model had not seen such instances during training, yet it detected these litters correctly. Underwater object detection is a challenging task, and image enhancement algorithms are usually required before further processing stages. Colour correction and contrast enhancement are needed to compensate for the scattering and absorption of light by water and suspended particles, which severely affect image quality; images of objects under water usually suffer from degradations such as colour fading, contrast reduction, and detail blurring [32].

Fig. 12

Underwater detection of the face masks
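As an example of the kind of enhancement step mentioned above, the sketch below applies CLAHE-based contrast enhancement on the lightness channel before detection; this preprocessing is not part of the evaluated pipeline and is shown only to illustrate the idea.

```python
import cv2

def enhance_underwater(frame_bgr):
    """Boost local contrast (CLAHE on the L channel of LAB) before running detection."""
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(enhanced, cv2.COLOR_LAB2BGR)
```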

However, there are a few instances in which the model was either unable to detect the masks correctly or able to find only a fraction of the masks present in the images. Figure 13 shows several such samples. As the figure shows, the detection model fails to detect all littered masks in densely littered areas. There are also mislabelled cases with complex backgrounds, a large number of overlapping face masks, and circumstances in which face masks are mislabelled as gloves.

Fig. 13

Samples in which the proposed system was unable to detect the masks correctly or completely

Certain factors influence the accuracy of the proposed system. The limitations of computer vision algorithms, including their reliance on good environmental light and weather, may limit the applicability of the proposed system in bad weather, close to sunset, or during the night. The proposed framework also considers the detection and classification of other forms of litter. To find the best model for this part of the framework, the next section analyses the results of various deep models trained on the TrashNet dataset.

4.2 Classification of other forms of litter

For classifying the TrashNet dataset, the previously described pre-trained models were evaluated. As before, different hyperparameter values were explored, and the best were chosen for the task. The final models were trained for 50 epochs with a batch size of 16. The DenseNet201 + RMSProp model, with a test accuracy of 94.04%, performed best. The DenseNet169 + Adam model followed with 92.72% accuracy on the test set, and NASNetLarge achieved 92.09% and 91.46% with RMSProp and Adam, respectively. NASNetMobile, DenseNet121, and InceptionV3, with test accuracies of about 88–89%, performed worst. Most of the models overfit by 7 to 8 percentage points; the best model overfits by 4%, which is acceptable given the complex task of classifying five different categories of trash. The proposed LitterNet has a comparatively low accuracy of 81% on the TrashNet dataset, due to the similarity of samples across the different TrashNet classes and the simplicity of the LitterNet architecture. The results of the classification of the TrashNet dataset by the various deep neural network architectures are presented in Table 6.

Table 6 Training and test results for TrashNet dataset

The performance of the top two trained models (DenseNet201 + RMSProp and DenseNet169 + Adam) on the TrashNet dataset is presented in Fig. 14. Furthermore, Fig. 15 shows the confusion matrix of the DenseNet201 + RMSProp model. As shown, the model confuses some glass samples with metal and some cardboard with paper.

Fig. 14

Performance comparison of DenseNet201 (a) and DenseNet169 (b) on TrashNet

Fig. 15

Confusion matrix of the DenseNet201 + RMSProp model on the TrashNet dataset

The confusion between these categories is understandable, since, as shown in Fig. 16, some samples might be confused even by humans, for example crumpled foil mistaken for paper, or cardboard mistaken for a piece of paper.

Fig. 16

Mislabelled predictions in TrashNet dataset

There have been several studies of machine-vision-based trash classification in the past few years. In the RecycleNet project [9], various architectures and optimizers were compared and the best model for trash classification was selected. The test accuracy of RecycleNet is 81%, and the DenseNet121 model in that project reached 95% after 200 epochs, whereas the DenseNet201 model in this study achieves 94% accuracy after only 50 epochs. Aral et al. [6] achieved test accuracies of 94% and 95% after more than 120 epochs, whereas the same was accomplished in this study after only 50 epochs. Therefore, the accuracy of the proposed models is comparable with similar models in the literature. In the next section, location intelligence for predictive biomedical waste cluster detection is investigated.

4.3 Location intelligence for predictive biomedical waste cluster detection

For location intelligence, several geo-location datasets publicly provided by Dundee City Council, Scotland, UK, are utilized. In a survey in Dundee, people were asked to pinpoint their favourite locations in the city, resulting in the Dundee Places dataset [13]. These are the most visited locations in the city and therefore also potentially the most polluted. A list of buildings of historic interest in Dundee is also publicly available [14] and used in this study. The locations of restaurants, pubs, takeaways, and hotels, as well as supermarkets and other food shops in Dundee, are provided by the Food Standards Agency (FSA) [18]. The Public Litter Bins dataset, which contains the locations of all public bins within Dundee [15], is used for litter bin locations. As demonstrated in Fig. 17, probable locations of litter clusters can be generated by combining these geospatial locations. This information, combined with the validation and reporting of actual litter from LitterNet, enables authorities to plan the routing of garbage trucks and street sweepers.

Fig. 17

Clusters of probable locations of litter in the smart city

In order to make the location intelligence map more usable for waste management, the litter location information can be clustered to provide per-district information for the smart city authorities, as presented in Fig. 18.

Fig. 18

Per district litter location intelligence

This information enables smart city management to offer optimized solutions for collecting garbage across different areas, blocks, and streets of the city. At the first level of the multi-level assessment model (Level 1), the waste management authorities have litter information for the entire city. Level 2 presents a per-district analysis of the litter information. Level 3 analyses adjacent blocks, and Level 4 a single block. Level 5 focuses on a single point of interest (PoI), which can be a restaurant, supermarket, pub, grocery store, takeaway, hotel, or even a public litter bin. The points provided by LitterNet are used for validation and to add information at Level 6. The multi-level analysis of littering probability in Dundee is presented in Fig. 19.

Fig. 19

Multi-level biomedical litter predictive modelling

The measures taken to manage the littering problem can be categorized into four main groups: preventive measures, mitigating measures, removal measures, and behaviour-changing measures. The proposed system can help identify mask litterers and support preventive, mitigating, and educational solutions for these individuals.

Due to the large scale of the biomedical littering problem, labour-based management of this issue is hard, expensive, and time-consuming. Waste management decision-makers require detailed, real-time knowledge of issues such as the magnitude and sources of biomedical litter. Using this knowledge, smart city management can intervene appropriately to develop rapid cleaning programs, enforce the law, and establish educational plans in the area. City management can use the automatic biomedical litter classification system for rapid, large-scale, city-wide litter analysis and modelling. The proposed system reduces the cost and increases the scale and speed of litter surveying in smart cities. The near-real-time geospatial distribution of biomedical litter and up-to-date region-level models can be visualized for smart city management, providing an encompassing view of the health issues caused by face mask littering in the region. Automatic face recognition and person tracking [39] can be used for the enforcement of littering laws; in this case, the privacy of citizens becomes an issue for which autonomous artificial-intelligence-based privacy-preserving methods provide accurate solutions [3].

Another application of LitterNet is as the vision system of street cleaning robots. Current street sweeping robots can be equipped with a LitterNet-based vision system to automatically detect and collect littered face masks. A robotic trash sorting machine can use LitterNet to sort face masks and separate them from other forms of trash. LitterNet can also be used for the automatic counting and volume estimation of face masks in bins and dumps. Various issues regarding the implementation of the camera network for this task, such as implementation using Software Defined Networks [36], can also be topics of future research. Furthermore, preserving privacy in the collection of data from the smart city is an important issue, and a machine-learning-as-a-service model can be used for training the necessary models while protecting the privacy of citizens [38].

Collecting a larger dataset and training LitterNet on all the situations it may encounter in real-world scenarios is essential for making the proposed system more reliable. This dataset needs to be updated at regular intervals to provide an accurate view of littered face mask patterns. LitterNet can also help automate litter sorting and locate hazardous face masks without posing health risks to waste collectors.

4.4 Performance comparison with related works

There are many face mask datasets publicly available on the internet, such as the Face Mask Detection ~ 12 K Images Dataset (MD-12 K, available at: https://www.kaggle.com/ashishjangra27/face-mask-12k-images-dataset) and the Face Mask Lite Dataset (FML, available at: https://www.kaggle.com/prasoonkottarathil/face-mask-lite-dataset). The MD-12 K dataset consists of six thousand images of people wearing face masks scraped from Google search, plus images of people without face masks taken from the CelebFace dataset (available at: https://www.kaggle.com/jessicali9530), summing to approximately twelve thousand samples in total. The FML dataset consists of twenty thousand samples in total, ten thousand for each category (with/without mask), and all samples (all human faces) are generated by a style-based GAN architecture.

These datasets are mainly focused on identifying whether a person is wearing a face mask or not. This part of the experimental results investigates whether a deep neural network model trained on these datasets is suitable for littered face mask detection. Therefore, the LitterNet model was trained on the MD-12 K and FML datasets, and its performance was compared with that of the model trained on the MaskNet dataset. The new models were trained with the same hyperparameters and setup as the LitterNet model trained on the MaskNet dataset. The models were then compared, and the results are presented in Table 7.

Table 7 Performance of models trained on face mask detection datasets for littered face mask detection

As presented in Table 7, the MD-12 K trained model yielded a test accuracy of 53.19% for detecting littered face masks, and the model trained on the FML dataset offered an accuracy of 51.77%. It can be concluded that the mask detection datasets used in previous research are not suitable for detecting littered face masks in different environments. A few samples of wrong predictions are presented in Fig. 20.

Fig. 20

Wrong predictions from models trained on available face mask detection datasets

It can be concluded that an accurate deep neural network model for littered mask detection requires a dataset containing face masks with various degrees of degradation and deformation, covered in various levels of dirt, set in varied natural environments, and captured under different lighting conditions. The proposed MaskNet dataset addresses these requirements. To present the advantages and contributions of this research, Table 8 compares the proposed framework with some of the methods proposed in 2021 for waste management and face mask detection.

Table 8 Comparison of the proposed framework with existing works

5 Conclusion and future works

Due to the COVID-19 pandemic, littered medical masks currently endanger cities and natural environments. In this paper, a biomedical litter management framework that uses video surveillance and location intelligence for the detection and predictive modelling of littered face masks is proposed. The proposed LitterNet model achieved an accuracy of 96% for detecting medical masks after hyperparameter tuning. The simplicity of the model allows implementation on MEC nodes and high processing speed, and the accuracy of the proposed model is sufficient for practical applications. A location-intelligence-based multi-level predictive litter model that enables authorities to efficiently plan the cleaning of smart city streets is also presented. Future research directions include acquiring more littered medical mask samples from a wider variety of environments, such as beaches, along with other types of face masks; generating additional samples by applying GANs to MaskNet; and building a robot or drone capable of detecting and cleaning littered medical masks in different environments. Finally, the proposed system will be applied in a large-scale real-life scenario in the future.