Learning Contextually Fused Audio-Visual Representations For Audio-Visual Speech Recognition | IEEE Conference Publication | IEEE Xplore