Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data (IEEE Conference Publication)