ABSTRACT
A method is presented for segmenting text into subtopic areas. The lexical similarity of adjacent windows of text is measured as the proportion of word pairs, one word drawn from each window, that are related. Related words are identified through the lexical cohesion relations of reiteration and collocation. These relations are located automatically using a combination of three linguistic features: word repetition, collocation and relation weights. The method is shown to detect known subject changes in text successfully and to correspond well with the segmentations produced by test subjects.
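The window comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed window size, the hand-supplied `collocations` set and the rule of reporting the lowest-similarity gap are all assumptions made for illustration, and relation weights are not modelled at all.

```python
from itertools import product


def related(a, b, collocations):
    """Treat two words as related if they repeat or form a listed collocation.

    `collocations` is a hypothetical set of frozenset word pairs supplied by
    the caller; the abstract does not say how such pairs are obtained.
    """
    return a == b or frozenset((a, b)) in collocations


def window_similarity(left, right, collocations):
    """Proportion of cross-window word pairs that are related."""
    if not left or not right:
        return 0.0
    hits = sum(related(a, b, collocations) for a, b in product(left, right))
    return hits / (len(left) * len(right))


def similarity_profile(tokens, size, collocations):
    """Similarity of the two windows on either side of each candidate gap."""
    return [
        window_similarity(tokens[i - size:i], tokens[i:i + size], collocations)
        for i in range(size, len(tokens) - size + 1)
    ]


def lowest_similarity_gaps(tokens, size, collocations):
    """Token offsets where adjacent windows are least similar (assumed rule)."""
    profile = similarity_profile(tokens, size, collocations)
    if not profile:
        return []
    lowest = min(profile)
    return [i + size for i, score in enumerate(profile) if score == lowest]


if __name__ == "__main__":
    tokens = ["cat", "dog", "cat", "dog", "ship", "sea", "ship", "sea"]
    pairs = {frozenset(("ship", "sea"))}
    print(similarity_profile(tokens, 2, pairs))
    print(lowest_similarity_gaps(tokens, 2, pairs))
```

In this toy input the similarity drops to zero at the gap between the animal words and the nautical words, which is where a subject change would be proposed. A real system would need a principled source of collocation pairs and a more robust boundary criterion than a single global minimum.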
- Text segmentation using reiteration and collocation