An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | IEEE Conference Publication | IEEE Xplore