Author Contributions
Conceptualization, L.F. and Y.Z.; methodology, L.F.; software, Y.S.; validation, Y.Z., L.F., and W.Z.; formal analysis, J.T.; investigation, L.F.; resources, Y.S.; data curation, Y.S.; writing—original draft preparation, L.F.; writing—review and editing, W.Z.; visualization, L.F.; supervision, Y.Z.; project administration, L.F.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Examples of challenging images: (a) an image occluded by other objects, (b) an image camouflaged by the surroundings, and (c) an image taken at night.
Figure 2.
Workflow of wild feline action recognition.
Figure 3.
Example results of the Outline Mask region-based convolutional neural network (R-CNN) (top: original images of felines in different poses; middle: masks extracted by Mask R-CNN; bottom: outlines extracted by Outline Mask R-CNN).
Figure 4.
Illustration of Tiny Visual Geometry Group (VGG).
Figure 5.
Joint point-labeled skeleton: (a) the eighteen tracked points; (b) the bending degrees of the knee joints (zoomed in).
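As a minimal sketch of how the bending degree in Figure 5b could be computed, the angle at a joint can be derived from three tracked points via the dot product; the point names and 2D array layout below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def joint_angle(p_prox, p_joint, p_dist):
    """Bending angle (degrees) at p_joint formed by the two adjacent
    tracked points, e.g. (hip, knee, ankle) for a knee joint."""
    v1 = np.asarray(p_prox, dtype=float) - np.asarray(p_joint, dtype=float)
    v2 = np.asarray(p_dist, dtype=float) - np.asarray(p_joint, dtype=float)
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Example: a nearly straight leg gives an angle close to 180 degrees.
print(joint_angle((0, 0), (0, 1), (0.1, 2)))  # ~174
```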
Figure 6.
Images in which the felines’ spines are perpendicular to the image plane: from this perspective, the leg joints appear almost straight regardless of the species and state of the feline.
Figure 7.
Video representative frames and variation in the bending angle: the first four rows show representative video frames of three upright actions, and the fifth row shows the variation curves of the bending angles (blue and red represent the front legs and hind legs, respectively) for (a) standing, (b) galloping, and (c) ambling.
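A sketch of how the fifth-row curves in Figure 7 could be produced from per-frame keypoints, reusing the joint_angle helper from the sketch after Figure 5; the keypoint names and data layout are illustrative assumptions rather than the tracker's actual output format.

```python
import matplotlib.pyplot as plt

def angle_curves(keypoints):
    """keypoints: one dict per video frame mapping joint names to (x, y).
    The joint names here are assumed for illustration; joint_angle is the
    helper defined in the sketch after Figure 5."""
    front, hind = [], []
    for kp in keypoints:
        front.append(joint_angle(kp["front_hip"], kp["front_knee"], kp["front_ankle"]))
        hind.append(joint_angle(kp["hind_hip"], kp["hind_knee"], kp["hind_ankle"]))
    return front, hind

def plot_curves(front, hind):
    """Plot the two per-frame angle curves with the colors used in Figure 7."""
    plt.plot(front, color="blue", label="front legs")
    plt.plot(hind, color="red", label="hind legs")
    plt.xlabel("frame")
    plt.ylabel("bending angle (degrees)")
    plt.legend()
    plt.show()
```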
Figure 8.
Example of the dataset.
Figure 9.
Instance segmentation results for four different actions (first row: ResNet98, second row: ResNet50, and last row: ResNet101; from left to right: ambling, two different types of standing, and galloping).
Figure 10.
The convergence curves of the six loss functions for different numbers of layers: the blue, green, and red lines correspond to ResNet50, ResNet98, and ResNet101, i.e., networks with 50, 98, and 101 layers, respectively.
Figure 11.
Convergence curves of the loss functions for the three tiny convolutional networks: the blue, red, and green lines represent Tiny VGG, Tiny Inception V3, and Tiny MobileNet V2, respectively.
Figure 12.
Accuracy and training time for the three models tracking joint points: (a) violin plots denote the overall error distribution, with error bars denoting the 25th and 75th percentiles; (b) the red, blue, and green boxes plot the training times of LEAP (LEAP Estimates Animal Pose), DeepLabCut, and DeepPoseKit, respectively.
Figure 13.
Confusion matrix for the three proposed models: all correct predictions lie on the diagonal of the matrix, so feline actions with incorrect predictions are easy to inspect visually as values off the diagonal.
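A minimal sketch of how a confusion matrix like Figure 13 can be built and displayed with scikit-learn; the action labels come from Tables 5 and 6, while the y_true/y_pred lists are dummies standing in for the test-split ground truth and model predictions.

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

actions = ["galloping", "standing", "ambling"]

# Dummy labels for illustration; in practice these come from the test split.
y_true = ["galloping", "standing", "ambling", "standing"]
y_pred = ["galloping", "standing", "standing", "standing"]

# Rows are true actions, columns are predicted actions; correct
# predictions accumulate on the diagonal.
cm = confusion_matrix(y_true, y_pred, labels=actions)
ConfusionMatrixDisplay(cm, display_labels=actions).plot()
```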
Table 1.
The structures of the ResNets: ResNet50, ResNet98, and ResNet101 denote networks with 50, 98, and 101 layers, respectively.
Layer | Output Size | ResNet50 | ResNet101 | ResNet98
---|---|---|---|---
Conv_1 | | | |
Conv_2 | | | |
Conv_3 | | | |
Conv_4 | | | |
Conv_5 | | | |
Average Pooling | | 1000 dimensions | |
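Since Table 1's per-stage settings did not survive extraction, here is a sketch of how the three depths could be instantiated with torchvision. ResNet50 and ResNet101 follow the standard definitions; the [3, 4, 22, 3] block split for the 98-layer variant is a hypothetical choice (3 × 32 bottleneck blocks + 2 = 98 layers), not taken from the paper.

```python
from torchvision.models import resnet50, resnet101
from torchvision.models.resnet import ResNet, Bottleneck

backbone50 = resnet50()    # blocks per stage: [3, 4, 6, 3]  -> 50 layers
backbone101 = resnet101()  # blocks per stage: [3, 4, 23, 3] -> 101 layers

# Hypothetical 98-layer variant: 3 * (3 + 4 + 22 + 3) + 2 = 98 layers.
backbone98 = ResNet(Bottleneck, [3, 4, 22, 3])
```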
Table 2.
Structure of Tiny VGG: the convolutional layer parameters are denoted as “Conv (receptive field size [49]) (number of channels)”; the ReLU [50] activation function is not shown for brevity.
Layer | Patch Size | Stride
---|---|---
Conv1_64 | 3 × 3 | 1
Max Pooling | — | 2
Conv3_128 | 3 × 3 | 1
Max Pooling | 2 × 2 | 2
Conv3_256 | 3 × 3 | 1
Max Pooling | 2 × 2 | 2
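A minimal PyTorch sketch of the Tiny VGG stack in Table 2. The input channel count (3), the padding of 1 (to preserve spatial size), and the patch size of the first pooling layer (assumed 2 × 2, matching the later ones) are assumptions; the classifier head is omitted, as in the table.

```python
import torch.nn as nn

# Conv/pool stack following the rows of Table 2.
tiny_vgg = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),    # Conv1_64
    nn.MaxPool2d(kernel_size=2, stride=2),                              # patch size assumed 2 x 2
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1), nn.ReLU(),  # Conv3_128
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(), # Conv3_256
    nn.MaxPool2d(kernel_size=2, stride=2),
)
```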
Table 3.
Structure of Tiny Inception V3: the building blocks are shown, with the filter bank sizes and strides listed.
Layer | Filter Shape/Stride
---|---
Conv2d_bn | 1 × 1 × 32/1
Conv2d_bn | 3 × 3 × 32/1
Conv2d_bn | 3 × 3 × 64/2
Max Pooling | Pool 3 × 3/2
Conv2d_bn_1_1 | 1 × 1 × 64/1
Conv2d_bn_1_5 | 1 × 1 × 48/1
Conv2d_bn_1_5 | 5 × 5 × 64/1
Conv2d_bn_1_3 | 1 × 1 × 64/1
Conv2d_bn_1_3 | 3 × 3 × 96/1
Average Pooling | Pool 1 × 1/1
Conv2d_bn_Pool | 1 × 1 × 32/1
Conv2d_bn_2_1 | 1 × 1 × 64/1
Conv2d_bn_2_5 | 1 × 1 × 48/1
Conv2d_bn_2_5 | 5 × 5 × 64/1
Conv2d_bn_2_3 | 1 × 1 × 64/1
Conv2d_bn_2_3 | 3 × 3 × 96/1
Conv2d_bn_2_3 | 3 × 3 × 96/1
Average Pooling | —
Conv2d_bn_Pool | 1 × 1 × 64
Max Pooling | —
SoftMax | classifier
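A sketch of one Inception-style parallel block corresponding to the Conv2d_bn_1_* rows of Table 3. The channel counts follow the table; the paddings and the grouping of rows into four parallel branches are assumptions made to keep the branch outputs concatenable, following the standard Inception V3 design.

```python
import torch
import torch.nn as nn

def conv_bn(c_in, c_out, k, s=1, p=0):
    """Conv2d_bn from Table 3: convolution + batch normalization + ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=s, padding=p, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class InceptionBlock(nn.Module):
    """Four parallel branches (1x1, 1x1->5x5, 1x1->3x3, pool->1x1) whose
    outputs are concatenated along the channel dimension."""
    def __init__(self, c_in):
        super().__init__()
        self.branch1x1 = conv_bn(c_in, 64, 1)
        self.branch5x5 = nn.Sequential(conv_bn(c_in, 48, 1), conv_bn(48, 64, 5, p=2))
        self.branch3x3 = nn.Sequential(conv_bn(c_in, 64, 1), conv_bn(64, 96, 3, p=1))
        self.branch_pool = nn.Sequential(
            nn.AvgPool2d(3, stride=1, padding=1), conv_bn(c_in, 32, 1))

    def forward(self, x):
        return torch.cat([self.branch1x1(x), self.branch5x5(x),
                          self.branch3x3(x), self.branch_pool(x)], dim=1)
```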
Table 4.
Structure of Tiny MobileNet V2: each line describes a sequence of identical layers repeated n times; all layers in the same sequence have the same number c of output channels; the first layer of each sequence uses stride s, and all others use stride 1; the expansion factor t is always applied to the input size.
Operator | t | c | n | s
---|---|---|---|---
Conv2d 3 × 3 | — | 32 | 1 | 2
Bottleneck | 1 | 16 | 1 | 2
Bottleneck | 6 | 24 | 2 | 2
Conv2d 1 × 1 | — | 64 | 1 | 2
MaxPool 7 × 1 | — | — | 1 | —
Conv2d 1 × 1 | — | 3 | — | —
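The bottleneck rows of Table 4 refer to the MobileNet V2 inverted-residual block; below is a minimal sketch of that operator under the standard design. The ReLU6 activations and batch-norm placement follow the original MobileNet V2 paper rather than anything stated in the table.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNet V2 bottleneck: expand channels by t, apply a depthwise
    3x3 convolution with stride s, then linearly project down to c_out."""
    def __init__(self, c_in, c_out, t, s):
        super().__init__()
        hidden = c_in * t
        self.use_residual = (s == 1 and c_in == c_out)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),            # expand (t * input)
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=s, padding=1,
                      groups=hidden, bias=False),              # depthwise, stride s
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),           # linear projection
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```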
Table 5.
Results of the three outline-based convolutional networks: recognition accuracy varies with both the action and the network.
Action | Tiny MobileNet V2 | Tiny Inception V3 | Tiny VGG
---|---|---|---
Galloping | 79% | 77% | 96%
Standing | 88% | 89% | 95%
Ambling | 82% | 74% | 84%
Average accuracy | 83% | 80% | 92%
Table 6.
Accuracy (%) of the ablation study: the outline-only method applies Outline Mask R-CNN to extract the target feline’s outline and Tiny VGG for action recognition; the skeleton-only method uses LEAP to obtain the target feline’s skeleton and Long Short-Term Memory (LSTM) for action recognition; and the two-stream method combines the two for action recognition.
Method | Galloping | Standing | Ambling | Average Accuracy
---|---|---|---|---
Outline-only method | 96% | 95% | 84% | 92%
Skeleton-only method | 83% | 93% | 91% | 89%
Two-stream method | 97% | 97% | 93% | 95%
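Table 6's two-stream method combines the outline and skeleton streams, but the exact fusion rule is not given here; the sketch below assumes a simple late fusion that averages the two streams' per-class probabilities, which is one common way such streams are combined.

```python
import numpy as np

def fuse_predictions(p_outline, p_skeleton, w=0.5):
    """Assumed late fusion: weighted average of per-class probabilities
    from the outline stream (Tiny VGG) and the skeleton stream (LSTM)."""
    p = w * np.asarray(p_outline) + (1 - w) * np.asarray(p_skeleton)
    return int(np.argmax(p))

actions = ["galloping", "standing", "ambling"]
# Dummy per-class probabilities for a single clip.
print(actions[fuse_predictions([0.7, 0.2, 0.1], [0.4, 0.5, 0.1])])  # "galloping"
```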