ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation

Tejedor, Javier; Toledano, Doroteo T.; Lopez-Otero, Paula; Docio-Fernandez, Laura; Proença, Jorge; Perdigão, Fernando; García-Granada, Fernando; Sanchis, Emilio; Pompili, Anna; Abad, Alberto

doi:10.1186/s13636-018-0125-9

Research
Open access
Published: 13 April 2018

ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation

Javier Tejedor¹,
Doroteo T. Toledano²,
Paula Lopez-Otero³,
Laura Docio-Fernandez⁴,
Jorge Proença⁵,
Fernando Perdigão⁵,
Fernando García-Granada⁶,
Emilio Sanchis⁶,
Anna Pompili⁷ &
…
Alberto Abad⁷

EURASIP Journal on Audio, Speech, and Music Processing volume 2018, Article number: 2 (2018) Cite this article

3638 Accesses
2 Citations
Metrics details

Abstract

Query-by-example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given an acoustic (spoken) query containing the term of interest as the input. This paper presents the systems submitted to the ALBAYZIN QbE STD 2016 Evaluation held as a part of the ALBAYZIN 2016 Evaluation Campaign at the IberSPEECH 2016 conference. Special attention was given to the evaluation design so that a thorough post-analysis of the main results could be carried out. Two different Spanish speech databases, which cover different acoustic and language domains, were used in the evaluation: the MAVIR database, which consists of a set of talks from workshops, and the EPIC database, which consists of a set of European Parliament sessions in Spanish. We present the evaluation design, both databases, the evaluation metric, the systems submitted to the evaluation, the results, and a thorough analysis and discussion. Four different research groups participated in the evaluation, and a total of eight template matching-based systems were submitted. We compare the systems submitted to the evaluation and make an in-depth analysis based on some properties of the spoken queries, such as query length, single-word/multi-word queries, and in-language/out-of-language queries.

1 Introduction

The huge amount of heterogeneous speech data stored in audio and audiovisual repositories makes it necessary to develop efficient methods for speech information retrieval. There are different speech information retrieval tasks, including spoken document retrieval (SDR), keyword spotting (KWS), spoken term detection (STD), and query-by-example spoken term detection (QbE STD).

Spoken term detection aims at finding individual words or sequences of words within audio archives. It is based on a text-based input, commonly the word/phone transcription of the search term. For this reason, STD is also called text-based STD. Query-by-example spoken term detection is similar, but is based on an acoustic (spoken) input. In QbE STD, we consider the scenario in which the user has found a segment of speech which contains terms of interest within a speech data repository, and their purpose is to find similar speech segments within that repository. The speech segment found is the query, and the system outputs other similar segments from the repository, which we will henceforth refer to as utterances. Alternatively, the query can be uttered by the user. This is a highly valuable task for blind people or devices that do not have a text-based input, and consequently, the query must be given in other format such as speech.

The STD systems are typically composed of three different stages: (1) the audio is decoded into word/subword lattices using an automatic speech recognition (ASR) subsystem trained for the target language (which makes the STD system language-dependent), (2) a term detection subsystem searches the terms within those word/subword lattices to hypothesize detections, and (3) confidence measures are computed to rank detections. The STD systems are normally language-dependent and require large amounts of resources in the form of transcribed corpora to be built.

QbE STD has been mainly addressed from three different approaches: methods based on the word/subword transcription of the query, methods based on template matching of features, and hybrid approaches. These approaches are described below.

1.1 Methods based on the word/subword transcription of the query

These methods make use of the text-based STD technology. In order to do this, they need to transcribe the query into word/subword units. The errors produced in this transcription can lead to significant performance degradation. [1, 2] employ a Viterbi-based search on Hidden Markov Models (HMMs). [3–6] employ dynamic time warping (DTW) or variants of DTW, e.g., non-segmental dynamic time warping (NS-DTW) from phone recognition. [7–10] employ word and syllable speech recognizers. Hou et al. [11] employs a phone-based speech recognizer and a weight finite-state transducer (WFST)-based search. Vavrek et al. [12] uses multilingual phone-based speech recognition, from supervised and unsupervised acoustic models and sequential dynamic time warping for search.

1.2 Methods based on template matching of features

These methods extract a set of features from the query and the speech repository, and a search of these features produces the query detections. Regarding the features used for query/utterance representation, [5, 13–15] employ Gaussian posteriorgrams; [16] proposes an i-vector-based approach for feature extraction; [17] uses phone log-likelihood ratio-based features; [18] employs posteriorgrams derived from various unsupervised tokenizers, supervised tokenizers, and semi-supervised tokenizers; [19] employs posteriorgrams derived from a Gaussian mixture model (GMM) tokenizer, phoneme recognition, and acoustic segment modelling; [11, 15, 20–26] use phoneme posteriorgrams; [11, 27–29] employ bottleneck features; [30] employs posteriorgrams from non-parametric Bayesian models; [31] employs articulatory class-based posteriorgrams; [32] proposes an intrinsic spectral analysis; and [33] is based on the unsupervised segment-based bag of an acoustic words framework.

All these studies employ the standard DTW algorithm for query search, except for [13], which employs the NS-DTW algorithm, [15, 24, 25, 28, 30], which employ the subsequence DTW (S-DTW) algorithm, [14], which presents a variant of the S-DTW algorithm, and [26], which employs the segmental DTW algorithm.

These methods were found to outperform subword transcription-based techniques in QbE STD [34]. This approach can be employed effectively to build language-independent STD systems, since prior knowledge of the language involved in the speech data is not necessary.

1.3 Hybrid approach

A powerful way of enhancing performance relies on building hybrid (fused) systems that combine the two individual methods. [35–37] propose a logistic regression-based fusion of acoustic keyword spotting and DTW-based systems using language-dependent phoneme recognizers. [38–41] use a logistic regression-based fusion on DTW- and phone-based systems. Oishi et al. [42] uses a DTW-based search at the HMM state-level from syllables obtained from a word-based speech recognizer and a deep neural network (DNN) posteriorgram-based rescoring, and [43] adds a logistic regression-based approach for detection rescoring. Obara et al. [44] employs a syllable-based speech recognizer and dynamic programming at the triphone-state level to output detections and DNN posteriorgram-based rescoring.

1.4 Motivation and organization of this paper

The increasing interest from within the speech research community in speech information retrieval has allowed the successful organization of several international evaluations related to SDR [45, 46], STD [47, 48], and QbE STD [49, 50]. In 2012 and 2014, the first two QbE STD evaluations in Spanish were held in the context of the ALBAYZIN 2012 and 2014 evaluation campaigns. These campaigns are internationally open sets of evaluations supported by the Spanish Network of Speech Technologies (RTTH)^{Footnote 1} and the ISCA Special Interest Group on Iberian Languages (SIG-IL)^{Footnote 2}, which have been held every 2 years since 2006. These evaluation campaigns provide an objective mechanism for the comparison of different systems and the promotion of research into different speech technologies such as audio segmentation [51], speaker diarization [52], language recognition [53], spoken term detection [54], query-by-example spoken term detection [55, 56], and speech synthesis [57].

The Spanish language is widespread throughout the world, and significant research has been conducted into it for ASR [58–60], KWS [61, 62], and STD [62–64]. This, combined with the success of the ALBAYZIN QbE STD evaluations held in 2012 and 2014, have encouraged us to organize a new QbE STD evaluation for the 2016 ALBAYZIN evaluation campaign which aims to evaluate the progress in this technology in Spanish. Compared with the previous evaluations, the third ALBAYZIN QbE STD evaluation incorporated stricter rules regarding the evaluation queries, e.g., in-vocabulary (INV) vs. out-of-vocabulary (OOV) queries, and employs two different databases to cover different acoustic conditions and topics to provide a more comprehensive evaluation. In addition, all the queries and the database employed in the QbE STD evaluation held in 2014 are kept, thus enabling a comparison between the systems submitted to both evaluations on the common set of queries.

The remainder of the paper is organized as follows: The following section presents a description of the QbE STD evaluation. Section 3 presents the different systems submitted to the evaluation. The results and discussion are then presented, and the paper is concluded in the final section.

2 ALBAYZIN QbE STD 2016 evaluation

2.1 Evaluation description

The ALBAYZIN QbE STD 2016 evaluation involves searching for audio content within audio content using an audio content query. The input to the system is an acoustic example per query; therefore, prior knowledge of the correct word/subword transcription corresponding to each query is not available. The target participants are the research groups or companies working on speech indexing, speech retrieval, and speech recognition.

The evaluation consists of searching a development query list within development speech data, and searching two different test query lists within two different sets of test speech data (MAVIR and EPIC databases, which will be explained later). The evaluation result ranking is based on the system performance when searching the query terms within the test speech data corresponding to the MAVIR database. Any kind of data, except for the MAVIR test data and the EPIC data, can be used by the participants for system training and development. The systems could be fine-tuned for each of the two databases individually. To facilitate the system construction, the participants were provided with MAVIR data, which can only be used as defined by the training, development, and test subsets.

This evaluation defines two different sets of queries for each database: the in-vocabulary query set and the out-of-vocabulary query set. The OOV query set was defined to simulate the out-of-vocabulary words of a Large Vocabulary Continuous Speech Recognition (LVCSR) system. If the participants employed an LVCSR system for processing the audio, these OOV queries must be removed from the system dictionary. Therefore, other methods must be used for searching the OOV queries. Conversely, the INV queries can appear in the dictionary of the LVCSR system.

The evaluation participants could submit a primary system and up to two contrastive systems. No manual intervention was allowed to generate the final output file, and hence, all the systems had to be fully automatic. Listening to the test data, or any other human interaction with the test data, was forbidden before all the evaluation results had been sent to the participants. The standard XML-based format corresponding to the National Institute of Standards and Technology (NIST) STD evaluation tool [65] was used to build the system output file.

The participants were given approximately 3 months to construct the system. The training and development data were released by the end of June 2016. The test data were released at the beginning of September 2016. The final system submission was due by mid-October 2016. The evaluation results were discussed at the IberSPEECH 2016 conference at the end of November 2016.

2.2 Evaluation metric

In QbE STD, a hypothesized occurrence is called a detection; if the detection corresponds to an actual occurrence, it is called a hit; otherwise it is a false alarm. If an actual occurrence is not detected, this is called a miss. The actual term-weighted value (ATWV) proposed by NIST [65] was used as the main metric for the evaluation. This metric integrates the hit rate and the false alarm rate of each query into a single metric and is then averaged over all the queries:

$$ ATWV=\frac{1}{|\Delta|}\sum_{K \in \Delta}{\left(\frac{N^{K}_{\text{hit}}}{N^{K}_{\text{true}}} - \beta \frac{N^{K}_{\text{FA}}}{T-N^{K}_{\text{true}}}\right)}, $$

(1)

where Δ denotes the set of queries and |Δ| is the number of queries in this set. $N^{K}_{\text {hit}}$ and $N^{K}_{\text {FA}}$ represent the numbers of hits and false alarms of query K, respectively, and $N^{K}_{\text {true}}$ is the number of actual occurrences of K in the audio. T denotes the audio length in seconds, and β is a weight factor set at 999.9, as in the ATWV proposed by NIST [66]. This weight factor causes an emphasis to be placed on recall compared to the precision in the ratio 10:1.

The ATWV represents the term-weighted value (TWV) for the threshold set by the system (usually tuned on development data). An additional metric, called maximum term-weighted value (MTWV) [65], can also be used to evaluate the performance of a QbE STD system. The MTWV is the ATWV the system would obtain with the optimum threshold. The MTWV results are presented to evaluate threshold selection.

In addition to the ATWV and the MTWV, NIST also proposed a detection error trade-off (DET) curve [67] to evaluate the system performance at various miss/FA ratios. Although the DET curves were not used for the evaluation, they are also presented in this paper for a comparison of the systems.

The NIST STD evaluation tool [68] was employed to compute the MTWV, the ATWV, and the DET curves.

2.3 Database

Two different databases that comprise different acoustic conditions and domains were employed for the evaluation. For comparison, the same MAVIR database employed in the ALBAYZIN QbE STD evaluation held in 2014 was used. The second database was the EPIC database distributed by ELRA^{Footnote 3}. For the MAVIR database, three separate datasets, i.e., training, development, and test, were given to the participants. For the EPIC database, only the test data were provided. The MAVIR and EPIC data could only be used for the intended purpose of the corresponding subset (training, development, and test). The use of two different domains was permitted to compare the system performance across the two different domains and enabled the examination of the performance degradation of the systems depending on the nature of the speech data, the acoustic conditions, the training/development and testing mismatch, and the over-fitting issues.

The MAVIR database consists of a set of Spanish talks taken from the MAVIR workshops^{Footnote 4} held in 2006, 2007, and 2008 that contain speakers from Spain and Latin America.

The MAVIR Spanish data consist of spontaneous speech files, each containing different speakers, amounting to approximately 7 h of speech. These data were further divided for the purpose of this evaluation into training, development, and test sets. The data were also manually annotated in an orthographic form, but the timestamps were only set for the phrase boundaries. To prepare the data for the evaluation, the organizers manually added the timestamps for the approximately 1600 occurrences of the spoken terms used in the development and test evaluation sets. The training data were made available to the participants and included the orthographic transcription and the timestamps for the phrase boundaries^{Footnote 5}.

The MAVIR speech data were originally recorded in several audio formats, e.g., pulse code modulation (PCM) mono and stereo, MP3, 22.05 KHz, and 48 KHz. The data were converted to PCM, 16 KHz, single channel, 16 bits per sample using the SoX tool^{Footnote 6}. Except for one, all the recordings were made with the same equipment, a Digital TASCAM DAT model DA-P1. Different microphones were used for the different recordings. In most cases, they were tabletop or floor standing microphones, but in one case, a lavalier microphone was used. The distance from the mouth of the speaker to the microphone varied and was not particularly controlled but in most cases was less than 50 cm. The recordings contain spontaneous speech from the MAVIR workshops in a real setting. The recordings were made in large conference rooms with capacity of over a hundred people, and a large number of people were in the conference room. This poses additional challenges including background noise, in particular ‘babble noise’ and reverberation. The realistic settings and the different nature of the spontaneous speech in this database made it appealing and challenging enough for the evaluation. Table 1 includes some database features such as the division in training, development, test data of the speech files, the number of word occurrences, the file duration, and the p.563 Mean Opinion Score (MOS) [69] which gives and indication of the quality of each speech file. The p.563 standard estimates the quality of the human voice without a reference signal, for which no reference signal is necessary. The MOS values are in the range of 1–5, 1 representing the worst quality and 5 the best [69].

Table 1 Summary of MAVIR database

ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation

Abstract

1 Introduction

1.1 Methods based on the word/subword transcription of the query

1.2 Methods based on template matching of features

1.3 Hybrid approach

1.4 Motivation and organization of this paper

2 ALBAYZIN QbE STD 2016 evaluation

2.1 Evaluation description

2.2 Evaluation metric

2.3 Database

2.3.1 Query list selection

2.4 Comparison with other QbE STD evaluations

3 Systems

3.1 A-GTM-UVigo-Three feature+DTW-based fusion QbE STD system (A-GTM-UVigo-3-fea+DTW fusion)

3.1.1 Feature extraction

3.1.2 Search

3.1.3 Fusion

3.2 B-L2F-Four phone log-likelihood ratio feature+DTW-based fusion QbE STD system (B-L2F-4-pllr fea+DTW fusion)

3.2.1 Speech segmentation

3.2.2 Feature extraction

3.2.3 Search

3.2.4 Fusion

3.3 C-L2F-Four likelihood feature+DTW-based fusion QbE STD system (C-L2F-4-likel fea+DTW fusion)

3.4 D-ELiRF-UPV-Posteriorgram+DTW-based QbE STD system (D-ELiRF-UPV-Post+DTW)

3.5 E-ELiRF-UPV-Posteriorgram+DTW-based normalized QbE STD system (E-ELiRF-UPV-Post+DTWNorm)

3.6 F-SPL-IT-UC-Four phoneme recognizer+DTW-based fusion QbE STD system (F-SPL-IT-UC-4-phnrec+DTW fusion)

3.6.1 Feature extraction

3.6.2 Search

3.6.3 Fusion

3.7 G-SPL-IT-UC-Three phoneme recognizer+DTW-based fusion QbE STD system (G-SPL-IT-UC-3-phnrec+DTW fusion)

3.8 H-SPL-IT-UC-Two language-independent phoneme recognizer+DTW-based fusion QbE STD system (H-SPL-IT-UC-2-LIphnrec+DTW fusion)

3.9 I-Text-based STD system

3.9.1 Word-based STD system

3.9.2 Phone-based STD system

3.10 System comparison

4 Results and discussion

4.1 Development data results

4.2 Test data results

4.2.1 MAVIR test data

4.2.2 EPIC test data

4.3 Development and test data DET curves

4.4 System performance analysis based on query length

4.5 System performance analysis based on single-word/multi-word queries

4.6 System performance analysis based on in-language/out-of-language queries

4.7 Comparison with the ALBAYZIN QbE STD 2014 evaluation

4.8 Towards a language-independent STD system

5 Conclusions

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords