1 Introduction

Computers have become an indispensable part of daily life. The time spent in front of a computer has increased with the shift to remote working and remote education [1]. As a result, conditions caused by postural disorders, such as tendinitis, myofascial pain syndrome, neck hernia, pulmonary embolism, eye strain, and humpback, continue to increase, reducing quality of life and work efficiency [2]. Recent advances in Artificial Intelligence (AI) provide new approaches in many industries [3, 4]. In this study, pose estimation is performed on images taken from the computer's camera to help prevent these conditions. The pose estimation builds on the OpenPose [5] study, which was developed using deep neural networks and shared as open source in the Python language. This study uses a bottom-up approach. Although this approach is slower than the top-down approach, it was preferred because it gives more accurate results: it can capture spatial dependencies between different people that require global inference. Posture analysis is performed on the values obtained from pose estimation, following the study in [6]. The decision support system notifies the user when a situation poses a risk; the notifications also appear in the notification center.

To increase the efficiency of pose estimation, it is sufficient to detect only the eyes, ears, nose, shoulders, and neck. For this reason, a new dataset containing only the necessary body parts is curated from the MPII Human Pose [7] dataset. The customized pose estimation network of this study is obtained by fine-tuning the model in [5] on this dataset using transfer learning. Real-time posture analysis is performed with the resulting model. Based on the analysis, situations that may decrease working efficiency are reported to the user in real time to improve the user's well-being. To make the system easy for everyone to use, a user interface was designed that the user can customize in every aspect. The interface was then improved by having people of different ages and professions use it.

2 Related Works

Object detection algorithms have improved significantly with the help of AI [8]. Pose estimation is the detection of body parts in images or videos. Most studies [9,10,11,12,13,14,15,16] use a top-down approach for detection and estimation. In this approach, each person is detected first, and then the body parts of that person are estimated, so pose estimation is performed separately for each person. Although this approach is suitable for single-person detection, it cannot capture spatial dependencies between different people that require global inference. Studies [2] and [17] use a bottom-up approach in pose estimation. In this approach, all body parts of every individual in the image are detected first. The detected parts are then grouped and assigned to the individuals they belong to. Although this approach is slower than the top-down approach, it gives more accurate results because it can capture spatial dependencies between different people that require global inference [2]. For this reason, the bottom-up approach was preferred in this study. Figure 1 presents an example of both approaches.

Fig. 1. Top: typical top-down approach. Bottom: typical bottom-up approach [26, 27].

3 Methodology

A real-time decision support application is developed for the well-being of computer users. The pose estimation model of the OpenPose [5] study, which has proven successful for this purpose, was retrained with the transfer learning technique on the customized MPII Human Pose [18] dataset.

3.1 Environment

The camera used for image capture is the BisonCam NB Pro, one of the models built into personal computers. This model was chosen to show that the system can work without any extra financial cost. It is a standard camera with fixed focus, a 60° field of view, and a maximum resolution of 720p at 30 fps. An Nvidia Tesla T4 graphics card, designed specifically for AI applications, was used to shorten the model training phase. This high-end graphics processor has the Turing architecture, 16 GB of GDDR6 memory, 320 Turing Tensor cores, and 2560 CUDA cores. For the testing phase of the pose estimation model, a system with characteristics similar to those of an average computer user was chosen. This system has an Nvidia GTX 970M graphics card, an Intel i7-6700HQ 2.60 GHz central processor, and 16 GB of DDR4L RAM.

3.2 Dataset

The pose detection model was trained on the “MPII Human Pose” dataset via transfer learning. This dataset contains approximately 25000 images of more than 40000 people with annotated body joints. The images were systematically collected using an established taxonomy of everyday human activities. Overall, the dataset covers 410 human activities, and each image is provided with an activity label [18]. In this study, only human pose estimation was performed, without distinguishing between activities.

3.3 Model Development

As with many bottom-up approaches, the proposed model first detects the parts (key points) of every person in the image and then assigns the parts to different people. As shown in Fig. 2, the first few layers of one of the models in studies [19,20,21,22] are used to extract features from the image, producing a set of feature maps named F. F is then fed into a network of two parallel convolutional branches. Stage-1 predicts 18 confidence maps, each representing a specific part of the human pose skeleton. At this stage, the network generates a set of Part Affinity Fields (PAFs), \(L^{1} = \upphi^{1}(F)\), where \(\upphi^{1}\) refers to the CNNs for inference in Stage-1. At each subsequent stage, the predictions from the previous stage and the original image features F are combined and used to produce refined predictions.

Stage-2 predicts 38 PAFs, which represent the degree of association between the parts. The successive stages refine the predictions made by each branch. Bipartite graphs are built between pairs of parts using the part confidence maps. Using the PAF values, weak links in the bipartite graphs are pruned, and the pose skeletons are predicted.
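The pruning of weak links described above can be illustrated with a minimal sketch. It assumes that a candidate link between two parts is scored by sampling the PAF along the segment joining them and averaging its alignment with the segment direction, as in OpenPose [5]. The function names, the nested-list field layout, the number of samples, and the threshold are illustrative assumptions, not the implementation used in this study.

```python
import math


def paf_score(paf_x, paf_y, a, b, num_samples=10):
    """Average alignment between one PAF and the segment from a to b.

    paf_x, paf_y: 2-D lists (rows indexed by y) with the x and y components
    of a single Part Affinity Field. a, b: (x, y) candidate part locations.
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector pointing from a to b
    total = 0.0
    for i in range(num_samples):
        t = i / max(num_samples - 1, 1)  # position along the segment in [0, 1]
        x = int(round(a[0] + t * dx))
        y = int(round(a[1] + t * dy))
        # Dot product of the field vector with the limb direction.
        total += paf_x[y][x] * ux + paf_y[y][x] * uy
    return total / num_samples


def prune_links(candidates_a, candidates_b, paf_x, paf_y, threshold=0.5):
    """Keep candidate pairs whose PAF score exceeds an assumed threshold."""
    links = []
    for i, a in enumerate(candidates_a):
        for j, b in enumerate(candidates_b):
            score = paf_score(paf_x, paf_y, a, b)
            if score > threshold:
                links.append((i, j, score))
    return links
```

A link whose direction agrees with the field receives a score close to 1, while a link perpendicular to the field scores close to 0 and is discarded.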

Fig. 2. Flowchart of OpenPose architecture [5].

$$ L^{\uptau} = \upphi^{\uptau}\left(F, L^{\uptau - 1}\right),\;\; \forall\, 2 \le \uptau \le T_{P} $$
(1)

In Eq. (1), \(\upphi^{\uptau}\) represents the CNNs for inference at stage \(\uptau\), and \(T_{P}\) is the total number of PAF stages. After \(T_{P}\) iterations, the process is repeated for the detection of confidence maps, starting from the most recent PAF estimate.

$$ S^{T_{P}} = \uprho^{\uptau}\left(F, L^{T_{P}}\right),\;\; \forall\, \uptau = T_{P} $$
(2)
$$ S^{\uptau} = \uprho^{\uptau}\left(F, L^{T_{P}}, S^{\uptau - 1}\right),\;\; \forall\, T_{P} < \uptau \le T_{P} + T_{C} $$
(3)

In Eqs. (2) and (3), \(\uprho^{\uptau}\) represents the CNNs for inference at stage \(\uptau\), and \(T_{C}\) is the total number of confidence map stages.
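The stage ordering in Eqs. (1) to (3) can be sketched as a plain loop. The stage functions below are hypothetical callables standing in for the CNNs of each stage; their signatures and the use of `None` for a missing previous estimate are assumptions made for illustration only.

```python
def run_stages(features, paf_stages, conf_stages):
    """Run the PAF stages first, then the confidence map stages.

    paf_stages: list of callables f(features, previous_pafs) -> pafs.
    conf_stages: list of callables g(features, pafs, previous_maps) -> maps.
    """
    # Stage 1: PAFs are predicted from the image features alone.
    pafs = paf_stages[0](features, None)
    # Later PAF stages refine the previous PAF estimate, as in Eq. (1).
    for stage in paf_stages[1:]:
        pafs = stage(features, pafs)
    # First confidence map estimate uses the final PAFs, as in Eq. (2).
    maps = conf_stages[0](features, pafs, None)
    # Remaining confidence map stages also see the previous maps, as in Eq. (3).
    for stage in conf_stages[1:]:
        maps = stage(features, pafs, maps)
    return pafs, maps
```

Note that the PAF estimate is frozen once the confidence map stages begin, which matches the description of starting from the most recent PAF estimate.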

The confidence maps are estimated on top of the most recent, most refined PAF estimates, so the results differ only slightly across the confidence map stages. A loss function is applied at the end of each stage to guide the network to iteratively estimate the PAFs of the body parts in the first branch and the confidence maps in the second branch. An L2 loss is used between the estimated predictions and the ground-truth maps and fields.

$$ f_{L}^{t_{i}} = \sum\nolimits_{c = 1}^{C} \sum\nolimits_{p} W\left( p \right) \cdot \left\| L_{c}^{t_{i}}\left( p \right) - L_{c}\left( p \right) \right\|_{2}^{2} $$
(4)
$$ f_{S}^{t_{k}} = \sum\nolimits_{j = 1}^{J} \sum\nolimits_{p} W\left( p \right) \cdot \left\| S_{j}^{t_{k}}\left( p \right) - S_{j}\left( p \right) \right\|_{2}^{2} $$
(5)

The loss functions in Eqs. (4) and (5) are spatially weighted to address a practical problem: some datasets do not label all people completely. In Eq. (4), Lc is the ground-truth PAF; in Eq. (5), Sj is the ground-truth confidence map; and W is a binary mask with W(p) = 0 when the annotation is missing at pixel p. The mask is used during training to avoid penalizing true positive predictions. Intermediate supervision at each stage replenishes the gradient periodically, which addresses the vanishing gradient problem.
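A minimal sketch of the masked L2 loss in Eqs. (4) and (5) is given below. For simplicity, it assumes that predictions and ground truth are stored as nested lists indexed as [channel][y][x], with the x and y components of each PAF treated as separate channels; the function name and data layout are illustrative and not taken from the original implementation.

```python
def masked_l2_loss(pred, target, mask):
    """Sum of squared errors over channels and pixels, weighted by a mask.

    pred, target: nested lists indexed as [channel][y][x].
    mask: nested list indexed as [y][x]; 0 where the annotation is missing.
    """
    loss = 0.0
    for pred_c, target_c in zip(pred, target):
        for y, mask_row in enumerate(mask):
            for x, w in enumerate(mask_row):
                diff = pred_c[y][x] - target_c[y][x]
                # Pixels with w == 0 contribute nothing to the loss.
                loss += w * diff * diff
    return loss
```

With this weighting, a correct prediction for an unlabeled person falls in a masked region and is therefore not penalized.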

To increase the efficiency of the study, it was considered sufficient to detect only the eyes, ears, nose, shoulders, and neck. To detect only these body parts, a new dataset consisting of the necessary body parts was created from the original dataset. With this dataset, the model was retrained using the transfer learning method to benefit from the pre-trained weights of MobileNetV2 [22].
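The curation step can be pictured as a filter over the annotations. The sketch below is only an illustration under assumed conventions: the part names, the dictionary-based annotation layout, and the `file` and `people` keys are hypothetical and do not reflect the actual annotation format of the dataset.

```python
# Hypothetical part names for the body parts kept in this study.
KEPT_PARTS = frozenset({
    "nose", "neck",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
})


def filter_person(annotation, kept=KEPT_PARTS):
    """Return one person's annotation restricted to the kept parts."""
    return {name: xy for name, xy in annotation.items() if name in kept}


def curate(dataset, kept=KEPT_PARTS):
    """Filter every person; drop people and images with no kept parts."""
    curated = []
    for image in dataset:
        people = [filter_person(person, kept) for person in image["people"]]
        people = [person for person in people if person]
        if people:
            curated.append({"file": image["file"], "people": people})
    return curated
```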

For conditions that may pose a risk of disease, the angle values between the limbs reported in [6] are taken as reference values. The reference values can be changed by ±30% with the slide bar in the program's interface and by ±250% in the advanced settings menu. When a value calculated from pose estimation falls outside the reference limits, the system takes 50 samples at equal intervals over 10 s and calculates their average. The averaged value is then compared with the reference values. When there is a situation that may pose a disease risk, the user is notified in real time, as in Fig. 3.
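The decision logic of this paragraph can be sketched as follows. The sample count, the window length, and the ±30% slider range come from the text; everything else is an assumption made for illustration, including the function names, the way the percentage adjustment is applied to a reference value, and the use of a lower and an upper limit.

```python
import math
import time
from statistics import mean

SAMPLE_COUNT = 50        # samples per check, as stated in the text
WINDOW_SECONDS = 10.0    # sampling window, as stated in the text
MAX_SLIDER_PERCENT = 30  # range of the slide bar in the basic interface


def angle_deg(a, b, c):
    """Angle in degrees at vertex b formed by the points a-b-c, each (x, y)."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return math.degrees(math.atan2(abs(cross), dot))


def adjust_reference(reference, percent, limit=MAX_SLIDER_PERCENT):
    """Scale a reference value by a user percentage clamped to +/- limit."""
    percent = max(-limit, min(limit, percent))
    return reference * (1 + percent / 100)


def collect_samples(read_angle, count=SAMPLE_COUNT,
                    window=WINDOW_SECONDS, sleep=time.sleep):
    """Take `count` readings at equal intervals spread over `window` seconds."""
    interval = window / count
    samples = []
    for _ in range(count):
        samples.append(read_angle())
        sleep(interval)
    return samples


def is_risky(samples, low, high):
    """Flag the posture when the averaged value leaves the range [low, high]."""
    return not (low <= mean(samples) <= high)
```

Averaging over the window means that a brief movement, such as glancing away from the screen, does not trigger a notification on its own.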

Fig. 3. Notification example of the application in a Windows 10 environment.

3.3.1 Creation of the User Interface

The target audience of this study is computer users, so the user interface has been designed so that everyone can use it easily. In addition, the application has been packaged as an executable file with the .exe extension. The interface allows the user to customize it in every aspect, as seen in Fig. 4. Its main features are the selection of the posture tests, customization of the reference values for wrong posture, and customization of the notification frequency. The settings also provide five different profiles for saving the customizations.

Fig. 4. Improved user interface.

4 Results and Discussions

Using the images reserved from the dataset for testing, high-performing models, i.e., PersonLab [23], METU [24], Associative Emb. [25], and OpenPose [5], were compared with the proposed method in Table 1 with respect to accuracy and precision. Speed is the most critical parameter for the applicability of the study, and in this respect the proposed method performed better than the other methods.

Table 1. Performance comparison of the proposed method and the methods in the literature.

5 Conclusion

Health problems are among the significant factors that reduce efficiency in organizations. In addition, healthcare is one of the most critical expense items in most countries. This study aimed to contribute to the country's economy by identifying situations that reduce working efficiency and quality of life. To achieve this aim, an optimized pose estimation model is proposed using the transfer learning technique. With the developed application, posture disorders will be analyzed to help prevent health problems that may occur in the waist, neck, and joint regions. Visual disturbances can likewise be reduced by analyzing the distance to the monitor, the lighting of the working environment, and the usage time. With the help of the proposed model, the work environment will be analyzed dynamically to support healthy working conditions.