Abstract
In this paper, we consider the problem of detecting object take and release actions from untrimmed egocentric videos in an industrial domain. Rather than requiring actions to be recognized as they are observed, in an online fashion, we propose a quasi-online formulation in which take and release actions can be recognized shortly after they are observed, while keeping latency low. We contribute a problem formulation, an evaluation protocol, and a baseline approach that relies on state-of-the-art components. Experiments on ENIGMA, a newly collected dataset of untrimmed egocentric videos of human-object interactions in an industrial scenario, and on THUMOS’14 show that the proposed approach achieves promising performance on quasi-online take/release action recognition and outperforms methods for online detection of action start on THUMOS’14 by \(+8.64\%\) when an average latency of 2.19s is allowed. Code and supplementary material are available at https://github.com/fpv-iplab/Quasi-Online-Detection-Take-Release.
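As a rough illustration of what a quasi-online evaluation might look like, the sketch below matches predicted take/release timestamps to ground-truth ones within a temporal tolerance and reports the average latency between the moment an action occurs and the moment its prediction is emitted. This is a minimal sketch, not the paper's actual protocol: the `Prediction` structure, the greedy matching criterion, and the `tolerance` parameter are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    action_time: float  # predicted temporal position of the action (s)
    emit_time: float    # video time at which the prediction was emitted (s)
    score: float        # detection confidence

def evaluate_quasi_online(preds, gt_times, tolerance=1.0):
    """Hypothetical point-level evaluation: greedily match high-confidence
    predictions to unmatched ground-truth timestamps within `tolerance`
    seconds, then report precision, recall, and the average latency
    (emission time minus ground-truth action time)."""
    unmatched = set(range(len(gt_times)))
    tp, latencies = 0, []
    for p in sorted(preds, key=lambda p: -p.score):
        best, best_dist = None, tolerance
        for i in unmatched:
            dist = abs(p.action_time - gt_times[i])
            if dist <= best_dist:
                best, best_dist = i, dist
        if best is not None:
            unmatched.discard(best)
            tp += 1
            latencies.append(p.emit_time - gt_times[best])
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gt_times) if gt_times else 0.0
    avg_latency = sum(latencies) / len(latencies) if latencies else 0.0
    return precision, recall, avg_latency

# Example: a take action at t=10.2s, localized at t=10.5s but only
# emitted at t=12.4s -> a correct detection with roughly 2.2s latency.
p, r, lat = evaluate_quasi_online(
    [Prediction(action_time=10.5, emit_time=12.4, score=0.9)],
    gt_times=[10.2],
)
print(p, r, lat)  # 1.0 1.0 ~2.2
```

Under this reading, the abstract's "average latency of 2.19s" would correspond to the mean of such per-detection latencies across the test videos.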
Notes
- 1.
- 2. For detailed statistics regarding the dataset, please refer to the supplementary material.
- 3. See the supplementary material for a study on the influence of the different parameters on THUMOS.
Acknowledgements
This research has been supported by Next Vision s.r.l., by the project MISE - PON I&C 2014-2020 - Progetto ENIGMA - Prog. n. F/190050/02/X44 - CUP: B61B19000520008, and by Research Program Pia.ce.ri. 2020/2022 Linea 2 - University of Catania.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Scavo, R., Ragusa, F., Farinella, G.M., Furnari, A. (2023). Quasi-Online Detection of Take and Release Actions from Egocentric Videos. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14234. Springer, Cham. https://doi.org/10.1007/978-3-031-43153-1_2
DOI: https://doi.org/10.1007/978-3-031-43153-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43152-4
Online ISBN: 978-3-031-43153-1
eBook Packages: Computer Science, Computer Science (R0)