Abstract
This paper describes a language-independent method for automatic syllabification of speech signals. The method uses valleys in the short-time energy (STE) contour and the locations of vowel onset points (VOPs) to mark syllable boundaries, and proceeds in three steps. First, long silence/pause regions are marked using speech/non-speech detection. Second, VOPs are located from the Hilbert envelope of the linear prediction (LP) residual. More than one VOP in a continuous speech region (as identified in the first step) indicates syllable boundaries within that region, and the location of minimum energy in the STE contour between two consecutive VOPs is taken as the syllable boundary. Because the automatic VOP detection algorithm fails to detect some VOPs, certain syllable boundaries are missed. Therefore, in the third step, additional syllable boundaries are detected from the STE contour using a valley threshold equal to the mean STE of each speech region between two consecutive syllable boundaries. The method is evaluated on 50 sentences each of read, extempore, and conversational speech in Malayalam and Bengali. An overall accuracy of 80% is obtained within a ±50 ms tolerance of manually marked syllable boundaries for this database. The method also shows good accuracy on TIMIT and NTIMIT data without tuning of thresholds or other parameters. It is useful for applications that require a meaningful separation of syllables rather than exact syllable boundaries. Application of the technique to prosody-based emotion recognition is illustrated using the Emo-DB German emotional database.
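The second step can be illustrated with a minimal sketch, assuming the VOP locations are already available as frame indices (VOP detection from the Hilbert envelope of the LP residual and speech/non-speech detection are not shown). The function names `short_time_energy` and `boundaries_between_vops`, the frame length, the hop size, and the toy contour are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def short_time_energy(signal, frame_len, hop):
    """Sum of squared samples in each frame; one value per hop."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.array([
        np.sum(signal[i * hop : i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])

def boundaries_between_vops(ste, vop_frames):
    """For each pair of consecutive VOPs, pick the frame of minimum
    STE strictly between them as a candidate syllable boundary."""
    vops = sorted(vop_frames)
    boundaries = []
    for left, right in zip(vops[:-1], vops[1:]):
        if right - left < 2:
            continue  # no frame lies strictly between the two VOPs
        segment = ste[left + 1 : right]
        boundaries.append(left + 1 + int(np.argmin(segment)))
    return boundaries

# Toy STE contour (hypothetical values) with assumed VOPs at frames 1 and 5.
toy_ste = np.array([1.0, 5.0, 9.0, 4.0, 0.5, 3.0, 8.0, 6.0, 2.0])
print(boundaries_between_vops(toy_ste, [1, 5]))  # [4]
```

In this toy contour, the energy valley at frame 4 lies between the two assumed VOPs and is returned as the boundary. The third step of the method, which compares STE valleys against the mean STE of each region, is not covered by this sketch.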
Acknowledgements
The authors would like to thank the Kerala State Council for Science, Technology and Environment (KSCSTE), Government of Kerala, India, for their support.
Cite this article
Mary, L., Antony, A.P., Babu, B.P. et al. Automatic syllabification of speech signal using short time energy and vowel onset points. Int J Speech Technol 21, 571–579 (2018). https://doi.org/10.1007/s10772-018-9517-6
DOI: https://doi.org/10.1007/s10772-018-9517-6