Abstract
Video conferencing, which includes both video and audio content, has contributed to dramatic increases in Internet traffic, as the COVID-19 pandemic forced millions of people to work and learn from home. Because of this, efficient and accurate video quality tools are needed to monitor and perceptually optimize telepresence traffic streamed via Zoom, Webex, Meet, and similar services. However, existing models are limited in their ability to predict the quality of multi-modal, live streaming telepresence content. Here we address the significant challenges of Telepresence Video Quality Assessment (TVQA) in several ways. First, we mitigated the dearth of subjectively labeled data by collecting ~2k telepresence videos from different countries, on which we crowdsourced ~80k subjective quality labels. Using this new resource, we created a first-of-its-kind online video quality prediction framework for live streaming, built on a multi-modal learning architecture with separate pathways that compute visual and audio quality predictions. Our all-in-one model provides accurate quality predictions at the patch, frame, clip, and audiovisual levels. It achieves state-of-the-art performance on both existing quality databases and our new TVQA database, at considerably lower computational expense, making it an attractive solution for mobile and embedded systems.
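To make the abstract's hierarchy concrete, the sketch below illustrates how patch-level visual predictions could be pooled into frame- and clip-level scores and then fused with an audio-pathway score into a single audiovisual prediction. The mean pooling and the convex-combination fusion (with a hypothetical weight `w_video`) are simplifying assumptions for illustration only, not the paper's learned model.

```python
# Hypothetical sketch of hierarchical, multi-modal quality pooling:
# patch scores -> frame score -> clip score -> audiovisual fusion.
from statistics import mean

def frame_score(patch_scores):
    """Pool per-patch quality predictions into a frame-level score."""
    return mean(patch_scores)

def clip_score(frame_scores):
    """Pool frame-level scores over time into a clip-level score."""
    return mean(frame_scores)

def audiovisual_score(video_score, audio_score, w_video=0.6):
    """Fuse visual and audio pathway scores with a convex combination.

    The weight w_video is an illustrative assumption; a learned model
    would replace this fixed fusion rule.
    """
    return w_video * video_score + (1.0 - w_video) * audio_score

# Example: 2 frames with 3 patch predictions each, plus an audio score.
frames = [[72.0, 68.0, 70.0], [66.0, 64.0, 65.0]]
v = clip_score([frame_score(p) for p in frames])  # 67.5
q = audiovisual_score(v, audio_score=60.0)        # 64.5
print(round(q, 2))
```

In practice, each pathway would be a learned network and the fusion weights trained end to end; the point here is only the patch/frame/clip/audiovisual granularity the model exposes.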
Acknowledgments
This work was supported by Meta Platforms, Inc. A.C. Bovik was supported in part by the National Science Foundation AI Institute for Foundations of Machine Learning (IFML) under Grant 2019844.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ying, Z., Ghadiyaram, D., Bovik, A. (2022). Telepresence Video Quality Assessment. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_19
DOI: https://doi.org/10.1007/978-3-031-19836-6_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19835-9
Online ISBN: 978-3-031-19836-6
eBook Packages: Computer Science, Computer Science (R0)