Improving Text-to-Video Retrieval Performance Using Video Features Combined with Generated Caption Information

허찬; 이동훈; 박혜영; 박상효

doi:10.5626/KTCP.2024.30.2.084

Journal information

Korean Institute of Information Scientists and Engineers
KIISE Transactions on Computing Practices (academic journal)
KIISE Transactions on Computing Practices, Vol. 30, No. 2
February 2024, pp. 84-90 (7 pages)
DOI : 10.5626/KTCP.2024.30.2.084

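Author information

허찬 (Kyungpook National University)
이동훈 (Kyungpook National University)
박혜영 (Kyungpook National University)
박상효 (Kyungpook National University)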

Abstract and keywords

Text-to-video retrieval is an emerging research area concerned with retrieving relevant video clips in response to a text query. Joint-embedding methods are widely used for this purpose: they construct a shared embedding space that represents the semantics of both text and video data and perform retrieval in that space. However, because the two modalities have inherently different characteristics, their distributions in the joint embedding space differ, which can degrade retrieval performance. To overcome this problem, we propose a novel video feature representation method that combines the visual and linguistic features of a video. Specifically, the proposed model generates captions from the video and combines them with the video's visual features to produce an enhanced video feature vector that carries both linguistic and visual information. At inference time, these enhanced feature vectors reduce the modality gap between the query text and the candidate videos, thereby improving retrieval performance. In experiments on two benchmark datasets, the proposed method improved the Recall@sum metric over the baseline model by 3.7% (MSR-VTT) and 0.7% (VATEX).
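The abstract does not state how the caption and visual features are combined, nor how Recall@sum is defined. As a rough illustration only, the following Python sketch assumes a weighted average of L2-normalized embeddings for the fusion step and the common convention of summing Recall@1, Recall@5, and Recall@10 for Recall@sum. The function names, the `alpha` weight, and the random toy embeddings are all hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def fuse_video_features(visual, caption, alpha=0.5):
    """Hypothetical fusion (an assumption, not the paper's method):
    convex combination of normalized visual and caption embeddings."""
    fused = (1.0 - alpha) * l2_normalize(visual) + alpha * l2_normalize(caption)
    return l2_normalize(fused)


def recall_at_k(similarity, k):
    """similarity[i, j] scores text query i against video j.
    The ground-truth video for query i is assumed to be video i."""
    correct = np.diag(similarity)[:, None]
    # Rank of the correct video = number of videos scoring strictly higher.
    ranks = (similarity > correct).sum(axis=1)
    return 100.0 * float(np.mean(ranks < k))


def recall_sum(similarity, ks=(1, 5, 10)):
    """Assumed definition: sum of Recall@K (in percent) over the given cutoffs."""
    return sum(recall_at_k(similarity, k) for k in ks)


# Toy usage with random embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
text_emb = l2_normalize(rng.normal(size=(8, 16)))   # text query embeddings
visual_emb = rng.normal(size=(8, 16))               # visual features per video
caption_emb = rng.normal(size=(8, 16))              # generated-caption embeddings
video_emb = fuse_video_features(visual_emb, caption_emb, alpha=0.3)
similarity = text_emb @ video_emb.T                 # cosine similarity matrix
print(recall_sum(similarity))
```

In this sketch the text query is compared against the fused video vector rather than the purely visual one, which is the step the abstract credits with narrowing the modality gap at inference time.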

#text-video retrieval #joint-embedding space #video feature representation #video captioning