1 Introduction

Over the last six consecutive years, the global recorded music market has seen increasing revenues; specifically, it grew by 7.4% in 2020. This growth was driven by a continued rise in paid subscription streaming revenues, which offset a decline in physical and performance rights revenues probably caused by the COVID-19 pandemic [13]. In this scenario, several companies profit from applications able to automatically extract metadata from broadcast media (delivered in several ways, e.g., radio broadcasting, internet streaming, live concerts, music played in public places) and, specifically, from applications able to recognize copyrighted material in real time by analyzing short excerpts of audio signals. Several application scenarios generate revenues: monitoring at the distributor side, the transmission channel or the consumer end; added-value services; integrity verification systems; and so on [7].

The research whose results are partially reported here takes its cue from a collaboration with a European-based company that provides a music recognition service to different kinds of customers, such as radio stations and advertisers. Its core business is to measure the airtime of songs with the purpose of generating music charts and measuring song popularity. The current commercial framework consists of a fingerprint database, a recognition instance and an FM transceiver. The primary goal of the collaboration was to improve the recognition algorithm the company uses in terms of accuracy, efficiency and quantity of manageable data, even though its approach is kept secret. Moreover, the proposed solution has to satisfy the following mandatory constraints, considered fundamental by the company for its business:

  • ability to quickly add new songs to the dataset of songs to be recognized;

  • ability to recognize short excerpts of audio in real time;

  • ability to locate the time position of the short excerpt inside the recognized song in real time.

As detailed in Section 2, much research has dealt with song recognition, and open-source implementations of specific algorithms exist. To the best of our knowledge, none of these works matches the constraints listed above while, at the same time, achieving good accuracy when very short excerpts of audio have to be recognized. Accordingly, the main contributions of this work are:

  • analyzing the accuracy of available open-source song recognition tools on very short excerpts of audio;

  • comparing the performance measured in the previous point with the performance obtained by our approach and by the algorithm currently used by the company;

  • performing the previous comparisons using a publicly available dataset of songs.

As part of the collaboration, the company provided us with a testing database containing thousands of songs used in its daily industrial practice; we used it in our experimentation in addition to the public MTG-Jamendo dataset [6].

The common approach to recognizing chunks of songs is based on the on-the-fly extraction of fingerprints characterizing short pieces of songs, which are then searched in a reference database storing the fingerprints of the original songs [7]. According to [7], the requirements of an audio fingerprinting application include: accuracy, reliability, robustness, granularity, security, versatility, scalability, complexity and fragility.

A song recognition algorithm able to obtain high accuracy with high granularity can correctly identify very short excerpts of audio. The analysis of this feature and of the related performance, together with the ability to add newly published songs to the dataset on the fly, is the main motivation of this work. Short excerpts of copyrighted audio are often played inside advertisements and radio jingles, and one of the challenges for companies working in the song recognition field is the ability to recognize every small excerpt of copyrighted audio on airplay.

Indeed, in the music industry, the Performing Rights Organizations (PROs) do the administrative work of collecting performance royalties and distributing them to the proper artists or their representatives. The public broadcasting royalty payout system works as follows: the broadcaster purchases a blanket license from the local PRO; the license allows the broadcaster to play all music represented by the PRO; the broadcaster reports the songs it has broadcast back to the PRO; the PRO uses those data to allocate and distribute the royalties due to the right artists and/or their representatives. Often the royalties are related not only to the air play-time and the number of times a specific song is played, but also to the specific hour of the day and the specific day, which can affect the audience of the broadcast. Broadcasters are obligated to provide the PRO with a record of every song they have put on the air; this record is called a “broadcaster log”. Given the scale of the operation, these logs are often riddled with missing details and errors (misspelled artist names, missing track data, placeholders such as “Track #” instead of the song’s title). Moreover, the transmission of different advertisements or jingles is often identified with a single entry like “adv”. Corrupted broadcast logs mean that the PROs cannot identify the right artists, so the royalties collected will never be paid out; incomplete broadcast logs mean that artists around the globe miss out on millions in potential revenue.

The ability to identify short excerpts of copyrighted material, also when played inside jingles and advertisements, permits more accurate monitoring by companies that provide airplay monitoring services, like the company we collaborate with. Moreover, no prior association between advertisements/jingles and copyrighted material is necessary, because the known copyrighted audio will be automatically identified regardless of the context in which it is played.

Specifically, in this paper, we compare the accuracy in recognizing very short excerpts of audio material (lasting from 1 to 5 seconds) obtained by an extension and refinement of the approach we introduced in [8, 19] and fully described in [20] against five baseline algorithms. The first one is a Shazam-like approach founded on the landmark-based fingerprint method presented in [29] and available as an open-source implementation named Audfprint [11, 30]. The second one is Dejavu, another open-source implementation of the Shazam-like approach [29] that uses the constellation algorithm [10]. The third and fourth ones are, respectively, a new implementation of the classic Shazam algorithm [29], named Olaf, and an updated version of the algorithm described in [24], named Panako, which uses the Gabor transform to move from the time domain to the spectral domain; both Panako and Olaf are distributed as open-source software. The last one is the algorithm, based on a Philips-like approach, used for its business by the company we are collaborating with.

The paper is organized as follows: Section 2 introduces the most relevant works related to this research. In Section 3, we summarize the components and the methodology of our proposed approach. The evaluation of our algorithm in comparison with the baselines and the company algorithm is presented in Section 4. Finally, Section 5 concludes the paper and introduces future work.

2 Related work

Although song recognition can be considered an audio classification or annotation task [12, 16] and, accordingly, might appear solvable with a “representation learning approach” [4, 18], it has very specific peculiarities:

  1. audio/music classification algorithms try to detect, for instance, the music genre or the mood of a track [14, 26] or whether the excerpt of audio under test contains the sound of a specific musical instrument, a specific set of musical instruments and/or the voice of a singer [21, 22]; song recognition algorithms, instead, try to detect whether a played excerpt is extracted from a copyrighted audio track, an identifier of that track (such as the song title and the artist name) and, usually, also the time position of the excerpt inside the original audio track;

  2. classification algorithms deal with a small number of classes, while song recognition algorithms deal with hundreds of thousands or millions of tracks to detect;

  3. classification algorithms deal with a constant number of classes that does not vary over time, while song recognition algorithms deal with a number of tracks that usually grows over time (due to the publication of new titles).

Applications of audio classification include intelligent recommender systems, a promising technology for music search that aims to assist users in exploring large-scale music collections by identifying suitable songs based on their preferences [9]; as mentioned in the previous section, song recognition is instead usually applied to airplay monitoring services.

In the literature, the song recognition problem is usually solved by means of a fingerprinting approach; therefore, in this section, we focus only on relevant works covering song recognition based on fingerprints.

Fingerprint generation approaches in song recognition systems can be divided into three different types [33]: the first one describes the energy differences between adjacent frequency bands [15]; the second one locates spectral peaks, using either the relationship with other peaks [23, 24, 27, 29] or the energy information around the peaks to form a fingerprint [1]; the last one uses image retrieval techniques [3, 32].

Recently, some works, such as those of Yao et al. [33, 34], have used a fingerprint extraction approach based on the technique proposed by Philips in [15], whereas Sonnleitner and Widmer [27] introduced a compact four-dimensional, continuous hash representation of quadruples of points called quads. Although this latter approach can efficiently identify audio in large song collections, and it is robust to noise and audio quality degradation as well as to severe distortions of speed, tempo and frequency, the generation of each quad appears rather complex, as reported in [33], making it difficult to use in real-time applications; moreover, the memory required to store the data structures used to recognize the songs is very large.

In [25], the fundamental frequency components extracted from the audio are matched in the frame-fundamental frequency domain and used to compose what the authors call a fundamental frequency map (FFMAP). The authors also employ a new hashing method named spatial adaptive hashing (SAH) in the similarity calculation process to compare the audio contents. Even though this approach appears less complex than the quad-based one, it works on an entire song, unlike the approach we proposed [5, 19], which is capable of locating the time position of very short snippets inside songs. The authors of [17] present an audio fingerprinting method based on locally linear embedding (LLE). In their approach, the bands around each peak in the frequency domain are divided into four groups of sub-regions and the energy of every sub-region is computed; LLE is performed in each group and the audio fingerprint is encoded by comparing adjacent energies. Moreover, a matching strategy based on dynamic time warping (DTW) is adopted to cope with the distortion due to linear speed changes.

The authors of [2] introduced an unsupervised deep learning framework for generating audio fingerprints based on a Sequence-to-Sequence Autoencoder (SA) model composed of two linked Recurrent Neural Networks (RNN). This latter work is, to the best of our knowledge, the only approach that reports results for queries shorter than 3 s. Although the experimental results in [2] appear very good (100% accuracy using excerpts of 1 s), in our opinion this research has two drawbacks: 1) the dataset used in the experimentation (VoxCeleb1) is a speech corpus, so it is not clear how performance is affected in the context of music recognition; 2) the complexity appears high when compared to a simple fingerprint-based approach: it is not stated whether the SA has to be retrained when the dataset grows with the insertion of new songs; in that case, the computation time could be too high to perform real-time recognition of several audio tracks.

3 The proposed approach

3.1 Mel-PSD audio fingerprinting

Our proposed fingerprints are based on the estimation of the short-time power spectral density (STPSD) of the audio signal computed on a Mel frequency scale [28]. Starting from an incoming audio stream of sufficient time length, a fingerprint F can be built on the fly by extracting NF adjoining linekeys:

$$ F=\left[l_{0},l_{1},\ldots,l_{N_{F}-1}\right]. $$
(1)

A linekey is a string of B bits, built by exploiting both a Welch-like approach and an adaptive frequency-variant threshold, that represents the content of a fixed short piece of audio. We generate a linekey from the samples extracted from a window Wσ(tn) starting at time tn whose length in the time domain is Lσ; in this way, we generate a linekey characterizing an Lσ-long piece of music at position tn inside the song. Following Welch’s approach, we use a shifting subwindow Ws of length Ls inside Wσ(tn) to compute K modified periodograms \({\mathscr{I}}_{k}(t_{n})\), with \(k=0,\dots ,K-1\). We evaluate the periodogram \({\mathscr{I}}_{k}(t_{n})\) of the k-th subwindow by applying a Hamming window to each subwindow and computing the squared magnitude of the FFT over M > 16 ⋅ B points, according to (2)

$$ \mathscr{I}_{k}(t_{n})[m]=\left\lvert \sum\limits_{j=0}^{N-1}x_{j}\cdot w_{j}\cdot {e^{-\frac{i2\pi m j}{N}}} \right\rvert^{2}, \quad \forall m=0,\ldots,M-1 $$
(2)

where N is the number of samples in each subwindow Ws, xj are the samples in each subwindow and wj are the samples of the Hamming window. A Mel frequency-scale bank of B filters is then applied to \({\mathscr{I}}_{k}(t_{n})\) in order to obtain the energy contained in each of the B sub-bands \({\mathscr{E}}_{k}(t_{n})\)

$$ \mathscr{E}_{k}(t_{n})[i]=\sum\limits_{m=M_{i-1}}^{M_{i}-1} \mathscr{I}_{k}(t_{n})[m], \quad \forall i=1,\dots,B $$
(3)

The last step is to sum \({\mathscr{E}}_{k}(t_{n})\) over all the K periodograms and to convert the values to decibels, obtaining an estimation of the power spectral density \({\mathscr{P}}(t_{n})\) over the B frequency sub-bands

$$ \mathscr{P}(t_{n})[i]=10\log_{10}\sum\limits_{k=1}^{K} \mathscr{E}_{k}(t_{n})[i], \quad \forall i=1,\dots,B $$
(4)
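
To make the pipeline of (2)-(4) concrete, the following Python sketch estimates \({\mathscr{P}}(t_{n})\) for a single window Wσ(tn). It is only an illustration under our own assumptions: the function names, the rectangular Mel band summation, the non-overlapping subwindows and the default parameter values (K = 8, B = 32, M = 1024) are ours and are not taken from the implementation described in [20].

```python
import numpy as np

def mel_band_edges(B, M, fs):
    """Bin indices M_0..M_B delimiting the B Mel-scaled sub-bands over the M//2+1 FFT bins."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges_hz = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), B + 1))
    return np.round(edges_hz / (fs / 2.0) * (M // 2)).astype(int)

def mel_psd(window, fs, K=8, B=32, M=1024):
    """Estimate P(t_n) over B Mel sub-bands from one L_sigma-long window, following (2)-(4)."""
    N = len(window) // K                       # samples per subwindow W_s (non-overlapping here)
    ham = np.hamming(N)
    edges = mel_band_edges(B, M, fs)
    energy = np.zeros(B)
    for k in range(K):                         # K modified periodograms, eq. (2)
        x = window[k * N:(k + 1) * N] * ham
        I_k = np.abs(np.fft.rfft(x, n=M)) ** 2
        for i in range(B):                     # eq. (3): Mel sub-band energies, summed over k
            energy[i] += I_k[edges[i]:edges[i + 1]].sum()
    return 10.0 * np.log10(energy + 1e-12)     # eq. (4): PSD estimate in dB
```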

Starting from \({\mathscr{P}}(t_{n})\), we need a threshold against which the binary value at each frequency of the spectrum is set. To this purpose, we adopt an adaptive frequency-variant threshold. Specifically, we build the threshold from an exponential approximation of the \({\mathscr{P}}(t_{n})\) trend, obtained with the least-squares exponential fitting approach [31]. According to this approach, the points of the fitting curve y[i](tn) are obtained by means of (5)

$$ y[i](t_{n})=a(t_{n})\cdot e^{b(t_{n})\cdot i},~\forall i=1,\dots,B $$
(5)

where a(tn) and b(tn) are adaptive parameters that are functions of the current \({\mathscr{P}}(t_{n})\). In order to avoid generating different linekeys due to small oscillations around the fitting curve, we add a constant margin value m to the fitting curve y[i] to derive the threshold values T(tn)[i]:

$$ T(t_{n})[i]=y[i]+m, \quad \forall i=1,\ldots,B. $$
(6)

Exploiting both the \({\mathscr{P}}(t_{n})[i]\) values and the frequency-variant threshold T(tn)[i], the binary sequence of B bits is evaluated as

$$ l(t_{n})[i] = \left\{\begin{array}{ll} 1, & \text{if} \quad \mathscr{P}(t_{n})[i] > T(t_{n})[i]\\ 0, & \text{otherwise} \end{array},\right. \forall i=1,\ldots,B. $$
(7)
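
The threshold construction and binarization of (5)-(7) can then be sketched as follows; for simplicity we fit the exponential curve with SciPy’s nonlinear least squares rather than with the closed-form linearization of [31], and the default margin value and the byte packing are illustrative choices of ours.

```python
import numpy as np
from scipy.optimize import curve_fit

def linekey_bits(P, margin=3.0):
    """Turn a Mel-PSD estimate P (in dB, length B) into a B-bit linekey, following (5)-(7)."""
    i = np.arange(1, len(P) + 1, dtype=float)
    # eq. (5): least-squares exponential fit y[i] = a * exp(b * i) of the PSD trend
    (a, b), _ = curve_fit(lambda x, a, b: a * np.exp(b * x), i, P,
                          p0=(P[0] if P[0] != 0 else 1.0, -0.01), maxfev=10000)
    T = a * np.exp(b * i) + margin             # eq. (6): margin-shifted threshold
    return (P > T).astype(np.uint8)            # eq. (7): one bit per Mel sub-band

def pack_linekey(bits):
    """Pack the B bits into bytes so that a linekey can be used as a lookup key."""
    return np.packbits(bits).tobytes()
```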

Figure 1 shows an example of bit sequence generation from \({\mathscr{P}}(t_{n})\) estimation and the related threshold.

Fig. 1  Binary linekey extraction

3.2 Song recognition by binary hamming distance measure

Exploiting the methodology described in Section 3.1, we represent songs as ordered sets of linekeys, extracting one linekey every time interval τ from the beginning of a song. A song Sa, whose time length is La, is thus represented as

$$S_{a}=\left\{l_{a}(0), l_{a}(\tau), \ldots, l_{a}(n_{a}\cdot\tau)\right\},$$

with \(n_{a}=\lceil \frac {L_{a}}{\tau }\rceil \). We denote the number of songs in the collection with Ns, and we refer to them with the corresponding indices.

Our song recognition process is based on another data structure \({\mathscr{L}}\) in which we store all the linekeys generated by the whole song collection and associate to each linekey \({\mathscr{L}}_{i}\), with i = 0,⋯ ,Nl − 1, where Nl is the overall number of linekeys, a list \(\mathcal {C}_{i}\) of couples (id,p); a couple represents the index id of a song where the linekey \({\mathscr{L}}_{i}\) is located and its position p inside the song.
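
A minimal sketch of this reference structure, under the assumption that linekeys are stored as packed byte strings, is a hash map from each distinct linekey to its list \(\mathcal{C}\) of (id, p) couples; the names and the use of a plain Python dictionary are our own illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# L: every distinct linekey -> list C of (song id, position) couples
LinekeyIndex = Dict[bytes, List[Tuple[int, int]]]

def build_index(songs: Dict[int, List[bytes]]) -> LinekeyIndex:
    """Build the reference structure from songs given as ordered lists of linekeys."""
    index: LinekeyIndex = defaultdict(list)
    for song_id, linekeys in songs.items():
        for position, lk in enumerate(linekeys):   # position p, i.e., multiples of tau
            index[lk].append((song_id, position))
    return index
```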

When a fingerprint \(F=(l_{0},l_{1},\ldots ,l_{N_{F}-1})\) is generated for recognition purposes, we compare each of its linekeys with the linekeys in \({\mathscr{L}}\), searching for the nearest ones in terms of Hamming distance \(d({\mathscr{L}}_{j}, l_{i})\). We estimate the song F belongs to with two different approaches.

Approach 1:

The first approach simply counts the number of times a song is referenced: i.e., we increase the counters associated to the songs whose id is in the list \(\mathcal {C}_{j}\) such that \(d({\mathscr{L}}_{j}, l_{i})\) is minimum; the song with the highest counter is marked as recognized. If more than one song obtains the same score, the song with the smallest index is picked.

Approach 2:

The second approach introduces a penalty for linekeys that are far (in terms of Hamming distance) from the searched one.

A set of Ns counters is used, one for each song in the corpus. For each \(l_{i} \in F\), the counter of a song is increased by the distance \(d({\mathscr{L}}_{j}, l_{i})\) when its identifier id is found in the list \(\mathcal {C}_{j}\) such that \(d({\mathscr{L}}_{j}, l_{i})\) is minimum; otherwise, it is increased by B (i.e., the maximum possible distance). The song whose counter has the lowest value is selected as recognized. If more than one song obtains the same score, the song with the smallest index is picked.
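
Both scoring rules might be sketched as follows. The nearest-linekey search is written as an exhaustive scan over \({\mathscr{L}}\) for clarity only, whereas a practical system would use a faster search in the Hamming space; the function names and the representation of linekeys as packed bytes are our own assumptions.

```python
from typing import Dict, List, Tuple

def hamming(a: bytes, b: bytes) -> int:
    """Hamming distance between two B-bit linekeys stored as packed byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def recognize(fingerprint: List[bytes],
              index: Dict[bytes, List[Tuple[int, int]]],
              n_songs: int, B: int, approach: int = 2) -> int:
    """Return the index of the recognized song for a query fingerprint F."""
    votes = [0] * n_songs                          # Approach 1: reference counters
    penalties = [0] * n_songs                      # Approach 2: accumulated distances
    for l_i in fingerprint:
        # nearest stored linekey in Hamming distance (exhaustive scan for clarity)
        d_min, nearest = min((hamming(l_i, lk), lk) for lk in index)
        referenced = {song_id for song_id, _ in index[nearest]}
        for song_id in range(n_songs):
            if song_id in referenced:
                votes[song_id] += 1
                penalties[song_id] += d_min
            else:
                penalties[song_id] += B            # maximum possible distance
    if approach == 1:                              # highest counter, smallest index on ties
        return max(range(n_songs), key=lambda s: (votes[s], -s))
    return min(range(n_songs), key=lambda s: (penalties[s], s))
```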

3.3 Discussion about the parameters

The behaviour of the algorithm and, accordingly, its accuracy are affected by the value assigned to each parameter. Users can tune these values in order to meet the requirements of the task at hand. Specifically, the characteristic parameters of the algorithm are the following (an illustrative configuration sketch is given after the list):

  • Lσ - The time length of the audio chunk used to evaluate a single linekey; the value should be short enough to include only a stationary audio signal (typically not greater than 150 ms): small values increase the granularity but, on the other hand, also increase the size of the dataset;

  • M - The resolution of the FFT used to evaluate each periodogram; this value should be tuned taking into account both the sampling rate fs of the audio signal and the number of bits B which form the linekey; in any case, it has to be greater than 16 ⋅ B in order to have enough values to obtain at least one bin inside each filter of the Mel filter bank; depending on the FFT algorithm used, it should be a power of 2; it should also be greater than or equal to the number of samples inside an Ls-long piece of audio, i.e., greater than or equal to N = fsLs.

  • B - The number of bits making up a linekey; this value should be a multiple of 8 (one byte) and corresponds to the number of filters used in the Mel filter bank; it affects the frequency resolution of the linekey: greater values give higher resolution and, accordingly, a wider spread of the linekey space; a wider linekey space increases the recognition accuracy for the same number of linekeys per fingerprint, but it generates a larger dataset and increases the execution time (both linekey generation time and linekey search time).

  • K - The number of periodograms used in the Welch approach to estimate the power spectral density of the audio signal; according to the theory, its value should be chosen as a compromise between the variance of the estimate and the frequency resolution.

  • τ - The time step used to extract each linekey from the audio signal; its value heavily affects the number of linekeys extracted from the audio signal; it can be smaller than, equal to, or greater than Lσ: in the first case, overlapped linekeys are generated, i.e., greater resolution and granularity but a larger dataset and longer search time; in the second case, adjoining linekeys are generated, i.e., the minimum value that does not lose any portion of the audio signal; in the last case, linekeys are generated with time jumps in between, i.e., with uncovered pieces of audio; this case should be avoided because both granularity and accuracy are negatively affected, even if both the dataset size and the search time are reduced.

  • m (margin value) - Used to tune the effect of the magnitude of each frequency component during linekey generation: when this value increases, the gap between the magnitude of the frequency component and the frequency-variant fitting curve must be larger in order to switch the corresponding bit of the linekey from 0 to 1; in this way the robustness of the linekeys improves, but the number of different generated linekeys is reduced and, accordingly, the recognition accuracy obtained using few linekeys will be lower;

  • NF - The size of a fingerprint in terms of linekeys: greater values mean more audio material is used to build a fingerprint and, accordingly, accuracy is positively affected; on the other hand, the time necessary to carry out the recognition increases proportionally and it will not be possible to recognize audio excerpts shorter than the fingerprint.
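
For concreteness, the characteristic parameters can be collected into a single configuration object, as in the sketch below; the numeric values are only plausible placeholders consistent with the constraints above and are not the settings reported in Table 2.

```python
from dataclasses import dataclass

@dataclass
class LinekeyConfig:
    """Characteristic parameters of the fingerprinting algorithm (placeholder values)."""
    fs: int = 44100         # Hz, audio sampling rate
    L_sigma: float = 0.10   # s, audio chunk per linekey (typically <= 150 ms)
    L_s: float = 0.02       # s, length of each Welch subwindow W_s
    K: int = 5              # number of periodograms per linekey
    B: int = 32             # bits per linekey = number of Mel filters (multiple of 8)
    M: int = 1024           # FFT size: power of 2, > 16*B and >= fs*L_s
    tau: float = 0.10       # s, time step between consecutive linekeys (= L_sigma: adjoining)
    margin: float = 3.0     # dB, margin m added to the fitted threshold curve
    N_F: int = 10           # linekeys per fingerprint (~1 s of audio with tau = 0.1 s)

cfg = LinekeyConfig()
assert cfg.M > 16 * cfg.B and cfg.M >= int(cfg.fs * cfg.L_s)
```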

The reader who wants to study the proposed methodology in more depth can refer to [20].

4 Results evaluation

4.1 Songs corpus and queries material

In order to evaluate the performance of the different approaches, we used both a subset of the MTG-Jamendo dataset [6] and a specific corpus consisting of Ns = 7000 commercial songs. The first one is an open dataset for music auto-tagging built with music available on Jamendo under Creative Commons licenses, grouped into 100 folders labeled from 00 to 99 (we used only folder 00, which contains 586 songs, because of the very long time spent in the training phase by the Audfprint, Dejavu, Olaf and Panako implementations as provided by their authors). The second one is a subset of the larger corpus used for its business by the company involved in the project; it was carefully built to contain a broad range of musical genres belonging to the contemporary pop music currently played and broadcast in Italy. To evaluate the impact of corpus size in terms of number of songs, we split this corpus into 7 nested subsets from 1000 to 7000 songs with a step of 1000. We used each subset of the corpus to perform separate recognition experiments, as explained later.

The tests were performed by extracting four different excerpts of audio from each song of the datasets. The starting point of each excerpt was randomly selected in the range [0 s, tmax − 4 s], where tmax is the time length of the song. We extracted excerpts of 1, 2, 3 and 4 seconds starting from the same initial point.

4.2 Performance evaluation

The following terms are used in defining our performance measures: tp (true positives) is the number of cases in which the correct reference is identified from the query. Recognition accuracy is the proportion of queries whose reference is correctly identified, and it is defined as

$$ Accuracy=\frac{t_{p}}{N} $$

where N is the total number of queries. For true positive queries, we evaluate the error e of the estimated position of the excerpt inside the song as

$$ e=p_{r}-p_{e} $$

where pr is the correct position and pe is the estimated one. We then evaluate the root-mean-square error (RMSE) of the time-positioning as

$$ RMSE=\sqrt{\frac{{\sum}_{n=1}^{N_{t_{p}}}{e_{n}^{2}}}{N_{t_{p}}}} $$

where en is the error on the n-th query with a true positive result and \(N_{t_{p}}\) is the total number of queries with a true positive result.
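
Both measures reduce to a few lines of code; the sketch below assumes each query outcome is available as a (predicted id, true id, estimated position, true position) tuple, which is our own representation of the experiment logs.

```python
import numpy as np

def evaluate(results):
    """results: iterable of (predicted_id, true_id, estimated_pos, true_pos) tuples."""
    results = list(results)
    hits = [(p_e, p_r) for pred, true, p_e, p_r in results if pred == true]
    accuracy = len(hits) / len(results)                   # t_p / N
    errors = np.array([p_r - p_e for p_e, p_r in hits])   # e = p_r - p_e per true positive
    rmse = float(np.sqrt(np.mean(errors ** 2))) if hits else float("nan")
    return accuracy, rmse
```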

For comparison purposes, on the MTG-Jamendo subset 00, we used as baselines: 1) Audfprint [11], with two different parameter settings, denoted as A1 and A2 from here on, where in the latter we modified the “density” parameter in order to obtain the same number of linekeys per second as in our approach; 2) Dejavu [10], labelled as D; 3) the algorithms implemented in the last released version of Panako [24], i.e., Olaf, labelled as O, and Panako, labelled as P. We kept all the parameters of the baseline algorithms at their implementation defaults except those specified in Table 1, chosen so that the reject option is turned off. Table 2 summarizes the values used for the parameters of our approach, denoted as ME1 and ME2.

Table 1 Modified parameters in baseline algorithms
Table 2 Set of parameters used in the experimentation

Table 3 shows the accuracy obtained on the MTG-Jamendo subset. As expected, performance increases with the size of the excerpts for every analysed approach. Both Panako and Olaf were unable to identify queries with 1-second-long samples, and both approaches obtain very poor results also for excerpts with longer time lengths. Audfprint obtains higher accuracy than the previous approaches and is affected by the “density” parameter, in particular for short excerpts. The accuracy obtained with Dejavu is higher than 90% for all excerpt sizes. In any case, both variants of our proposed approach outperform all the others. In particular, for queries with 1-second-long samples, the accuracy of our proposed approaches is almost 8% higher than that obtained with Dejavu, which is the best of the baselines.

Table 3 Comparison of accuracy performance (%) obtained on the MTG-Jamendo-00 dataset for different excerpt lengths

Results in terms of time-positioning are shown in Table 4. The proposed approach obtains the best result for excerpts of 1 s, while Olaf and Panako give better results for longer excerpts. In any case, the poor accuracy of the Panako and Olaf algorithms should be kept in mind when comparing these results. The proposed approach shows a remarkable improvement over the Dejavu and Audfprint algorithms, which have comparable accuracy.

Table 4 Comparison of time-positioning RMSE (s) obtained on the MTG-Jamendo-00 dataset for different excerpt lengths

Figure 2 shows the empirical cumulative distribution functions (CDFs) of the time-positioning errors obtained with the considered algorithms for all the excerpt lengths. Each curve represents the probability that the time-positioning error is lower than the value on the abscissa; accordingly, at the same abscissa, a higher value corresponds to better performance. Specifically, Fig. 2(a) shows the CDFs of the time-positioning errors for excerpts lasting 1 s: it contains only the curves related to Audfprint, Dejavu and the proposed approach because Panako and Olaf are not able to recognize such small excerpts; Fig. 2(b) shows the CDFs for excerpts lasting 2 s: in this case, only the curve related to Panako is missing, for the same reason as in the previous subfigure; Fig. 2(c) and Fig. 2(d) show the CDFs for excerpts lasting 3 s and 4 s, respectively. The trend of these curves provides important information about the time-positioning errors; of course, these results should be analyzed taking into account the accuracy of each algorithm. The curves of the proposed approach make it evident that its time-positioning error is lower than 0.1 seconds in 90% of the queries with a true positive result. The results of the Olaf and Panako approaches always appear slightly better than those of the proposed one, but their accuracy is significantly worse. The proposed approach always behaves better than Audfprint and Dejavu, which have almost the same accuracy.

Fig. 2  Cumulative distribution functions (CDFs) of time-positioning errors for excerpts of (a) 1 s, (b) 2 s, (c) 3 s and (d) 4 s

We then ran an experiment on the larger commercial song corpus using as baselines A1, A2 and the algorithm currently used by the company for its business. We did not consider P and O, due to the very poor accuracy they obtained in the previous experiment, nor D, due to the very long training time of its open-source implementation. The company algorithm was provided as a “black box”: it generates 100 linekeys per second and each linekey is 64 bits long. Two different parameter settings were used, labelled as C1 and C2.

Results on this second dataset are shown in Fig. 3, where the recognition ratio is depicted varying Ns for different excerpt sizes. They show that the analyzed approaches are not particularly affected by the corpus size, which makes them robust with respect to the number of songs in the database. Moreover, our approach outperforms both Audfprint and the company algorithm for excerpts of length less than or equal to 2 seconds. Figure 3(b) shows that A2 obtains an accuracy not greater than 90%, while both our approaches, ME1 and ME2, and the second company setting (C2) obtain an accuracy slightly lower than 100%. The difference in performance is even more pronounced for excerpts of 1 s (Fig. 3(a)): in this condition, the accuracy of both Audfprint settings falls dramatically below 50% and both company settings obtain an accuracy slightly greater than 60%, while both ME1 and ME2 keep the recognition ratio close to 100%. In order to better show the results of our approaches, we depict the details of the accuracy results in Fig. 4. We obtain a recognition ratio higher than 98.8% in all the experiments. Specifically, ME2 outperforms ME1 and obtains an accuracy higher than 99.2% for all corpus sizes, even with a granularity of 1 s. With shorter excerpts, the accuracy of both our approaches degrades; however, this degradation is always lower than 0.3%.

Fig. 3  Accuracy varying the number of songs using the company corpus and excerpts of (a) 1 s, (b) 2 s, (c) 3 s and (d) 4 s

Fig. 4  Accuracy level of the ME approaches vs. the number of songs in the company corpus with different excerpt sizes

To explain the good accuracy of our approach, we investigated the characteristics of the generated linekeys. We evaluated the number of unique linekeys obtained while increasing the size of the corpus. The results are shown in Fig. 5(a): as can be seen, the number of unique linekeys (Nl) grows linearly with the corpus size (Ns). This means that a linekey is peculiarly associated with one song, or a few at most; accordingly, only a few linekeys are needed to correctly classify a song. For comparison purposes, Fig. 5(b) shows the number of unique hashes obtained by the Audfprint A1 approach on the same corpora: it is considerably smaller and, most importantly, it saturates at about \(10^{6}\) as the number of songs increases.

Fig. 5  (a) Number of unique linekeys obtained with the proposed approach and (b) number of unique hashes obtained with Audfprint A1, varying the size of the company corpus

Finally, Fig. 6 shows results under noisy conditions. We added white noise to the audio queries in order to obtain SNRs of 20, 15, 10 and 5 dB. Figure 6(a) shows the results using excerpts of 4 s: our approaches and the company ones outperform the landmark-based approach for SNRs higher than 5 dB; a slight performance degradation of the C1 and ME1 approaches is observable at 5 dB, while the ME2 and C2 approaches obtain the best results also at 5 dB. Figure 6(b) shows the accuracy under the same noise conditions using the proposed approaches and excerpts of 1, 2, 3 and 4 seconds. For SNRs higher than 10 dB, the excerpt size only slightly affects the accuracy (which stays higher than 92.5%). Instead, as expected, we obtained a higher degradation for smaller excerpts (Fig. 6(b)).

Fig. 6  Accuracy performance comparisons adding white noise to the queries at different SNRs: (a) the proposed approach vs. Audfprint and the company approach with excerpts of 4 s; (b) the proposed approach with excerpts shorter than 4 s

5 Conclusions and future work

In this paper, we compared the behaviour of a new approach for generating song fingerprints with other well-known algorithms when short excerpts of audio are considered. Comparisons with the considered baseline algorithms on the MTG-Jamendo subset showed that our approach outperforms the others and maintains an accuracy close to 100% over all the excerpt lengths (1 s, 2 s, 3 s and 4 s). We also investigated the time-positioning error of the small excerpts inside the original track, showing that our approach exhibits an error lower than 0.1 s in 90% of the correct classifications. We then extended our experiments to a large dataset of 7000 commercial songs, highlighting that our proposed approach is not affected by the size of the dataset. Moreover, we added white noise to the audio queries at several SNRs and showed that the accuracy is higher than 98% in clean conditions and remains higher than 90% for SNRs ≥ 10 dB for all excerpt sizes.

Our future research will focus on the complexity of the classification algorithm and on its improvement by means of fast search in the Hamming space. Moreover, we want to investigate the performance of the proposed approach on larger datasets of songs (with more than one hundred thousand songs) and with pitch/tempo variations of the excerpts to be recognized.