Shallow semantic labeling using two-phase feature-enhanced string matching
Introduction
Automatic information extraction has received a great deal of attention in recent developments in information retrieval. While a plethora of issues relating to accuracy and efficiency have been thoroughly discussed, the problem of extracting meaning from natural language has scarcely been addressed. Given the size and quantity of documents available on the Internet, the demand for a highly efficient system that identifies semantic meaning is clear. The case frame is one of the most important structures used to represent the meaning of sentences (Fillmore, 1968). One could consider a case frame to be a special, or distinguishing, form of knowledge structure about sentences. Although several criteria for recognizing case frames in sentences have been considered in the past, none of them serves as a completely adequate decision procedure. Most studies in natural language processing (NLP) do not provide any hints on how to map input sentences into case frames automatically. As a result, both the efficiency and the robustness of the techniques used in information extraction are highly in doubt when they are applied to real-world applications.
The objective of this research is twofold. First, a shallow but effective sentence chunking process is developed. The process extracts all the phrases from the input sentences without getting bogged down in deep semantic parsing and understanding. Second, a novel semantic labeling technique based on the syntactic and semantic tags of the latest Treebank is constructed (CKIP, 2004). One of our primary goals in this research is to design a shallow but robust mechanism that can annotate sentences using a set of semantic labels (Her, 1990). While classical syntactic and semantic analysis is extremely difficult, if not impossible, to systematize in current NLP research, our approach does not require any deep linguistic analysis to be formalized. The annotation provides the underlying semantic labels of the sentence without being mired in the formalism.
The paper is organized as follows. Related work in semantic labeling and sentence chunking is described in Section 2. In this research, each word in a sentence has two attributes, namely its part-of-speech (POS) and its semantic class (SC), and any input sentence is first transformed into a feature-enhanced string. How the two-phase feature-enhanced string matching algorithm is applied to semantic labeling is discussed in detail in Section 3. The system has been implemented in Java. To demonstrate its capability, an experiment with 10,000 sentences is conducted. A detailed evaluation is presented in Section 4, followed by a conclusion.
Related work
Following the framework of case grammar originally proposed by Fillmore in 1968, it has been accepted that every nominal constituent in every language bears a single syntactic–semantic case relation (Dowty, 1991; Jackendoff, 1983). Among earlier systems, Somers describes a prototype computer program that attempts to map surface strings of English onto a formalism representing one level of a deep structure (Somers, 1982). Weischedel, Meteer, Schwartz, Ramshaw, and Palmucci (1993) predict
Two-phase feature-enhanced string matching algorithm
In this section, we first outline the concept of edit operations, which are essential components of our feature-enhanced string matching algorithm. The two-phase shallow semantic labeling is then discussed thoroughly in Section 3.2.
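To make the idea of edit operations over feature-enhanced strings concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each symbol of a feature-enhanced string is a word carrying a POS tag and a semantic class, and that the distance is the usual dynamic-programming edit distance with insertion, deletion, and substitution. The class name `FeatureEditDistance`, the cost constants, and the tag values (`N`, `V`, `human`, `food`, and so on) are hypothetical and chosen only for illustration; the snippet above does not state the cost function used in the paper. The code is written in current Java rather than JDK 1.4.2.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only: edit distance over feature-enhanced strings. */
final class FeatureEditDistance {

    /** One symbol of a feature-enhanced string: a word plus its POS tag and semantic class. */
    static final class Token {
        final String word;
        final String pos;
        final String sc;

        Token(String word, String pos, String sc) {
            this.word = word;
            this.pos = pos;
            this.sc = sc;
        }
    }

    // Assumed costs, chosen for illustration; they are not taken from the paper.
    static final double INDEL_COST = 1.0;
    static final double POS_MISMATCH = 0.5;
    static final double SC_MISMATCH = 0.5;

    /** Substitution costs less the more features two tokens share; the word form is ignored. */
    static double substitutionCost(Token a, Token b) {
        double cost = 0.0;
        if (!a.pos.equals(b.pos)) {
            cost += POS_MISMATCH;
        }
        if (!a.sc.equals(b.sc)) {
            cost += SC_MISMATCH;
        }
        return cost;
    }

    /** Minimum total cost of insertions, deletions, and substitutions turning source into target. */
    static double distance(List<Token> source, List<Token> target) {
        int n = source.size();
        int m = target.size();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            d[i][0] = i * INDEL_COST; // delete every source token
        }
        for (int j = 1; j <= m; j++) {
            d[0][j] = j * INDEL_COST; // insert every target token
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double substitute = d[i - 1][j - 1]
                        + substitutionCost(source.get(i - 1), target.get(j - 1));
                double delete = d[i - 1][j] + INDEL_COST;
                double insert = d[i][j - 1] + INDEL_COST;
                d[i][j] = Math.min(substitute, Math.min(delete, insert));
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        // Made-up tags: the two sentences differ only in the semantic class of the last word.
        List<Token> a = Arrays.asList(
                new Token("I", "N", "human"),
                new Token("eat", "V", "action"),
                new Token("rice", "N", "food"));
        List<Token> b = Arrays.asList(
                new Token("he", "N", "human"),
                new Token("drinks", "V", "action"),
                new Token("tea", "N", "drink"));
        System.out.println(distance(a, b));
    }
}
```

Under these assumed costs, two sentences with identical POS and SC sequences have distance 0 even when their words differ, and the example in `main` yields 0.5 because only one semantic class differs. Whether the original algorithm weights POS and SC equally is not stated in the text shown here.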
Implementation and experimental results
We have implemented the system in Java using Sun Microsystems' JDK 1.4.2. The whole system was designed using the Unified Modeling Language (UML). In our design, for every input sentence, the best-matching tree in the Treebank, that is, the tree with the minimum edit distance, is calculated. The information-based case grammar (ICG) in the Sinica Chinese Treebank stipulates that each lexical item contains both semantic and syntactic features. The grammar indicates the way that lexical items in the sentences are
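The best-match step mentioned in this section, selecting the Treebank tree with the minimum edit distance to the input sentence, might be sketched as follows. This is an illustration under simplifying assumptions, not the system described in the paper: each tree is reduced to a hypothetical `TreeEntry` holding an identifier and the POS tags of its leaves, the tags themselves are made up, and a plain unit-cost Levenshtein distance over tag sequences stands in for the feature-enhanced costs.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only: pick the tree whose leaf tags are closest to the input sentence. */
final class BestTreeMatcher {

    /** Hypothetical stand-in for a Treebank entry: an identifier and the POS tags of its leaves. */
    static final class TreeEntry {
        final String id;
        final String[] leafTags;

        TreeEntry(String id, String... leafTags) {
            this.id = id;
            this.leafTags = leafTags;
        }
    }

    /** Unit-cost Levenshtein distance over tag sequences (a simplification of feature-based costs). */
    static int editDistance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            int[] cur = new int[b.length + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int substitute = prev[j - 1] + (a[i - 1].equals(b[j - 1]) ? 0 : 1);
                cur[j] = Math.min(substitute, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.length];
    }

    /** Returns the entry with the smallest distance; ties go to the earliest entry, null if none. */
    static TreeEntry bestMatch(String[] sentenceTags, List<TreeEntry> treebank) {
        TreeEntry best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (TreeEntry entry : treebank) {
            int d = editDistance(sentenceTags, entry.leafTags);
            if (d < bestDistance) {
                bestDistance = d;
                best = entry;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<TreeEntry> treebank = Arrays.asList(
                new TreeEntry("t1", "N", "V", "N"),
                new TreeEntry("t2", "N", "V"),
                new TreeEntry("t3", "ADV", "V", "N", "N"));
        String[] sentence = {"N", "ADV", "V", "N"};
        System.out.println(bestMatch(sentence, treebank).id);
    }
}
```

A linear scan over all trees is the simplest choice for a sketch; the two-phase design described in the paper shortlists candidate trees first, which this example does not attempt to reproduce.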
Conclusion
In this paper, we have illustrated a shallow technique in which semantic labels are extracted in the form of chunks of phrases or words using a two-phase feature-enhanced string matching algorithm. The first phase shortlists the potential trees in the Treebank; in the second phase, chunks are further tagged with semantic labels. Based on the linguist’s conception of phrase structure, our approach does not require a full syntactic parse to pursue semantic analysis and the recursively
Acknowledgement
The work described in this paper was partially supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Nos. CUHK470605 and CUHK440607).
References (33)
- et al. Performance structures: A psycholinguistic and linguistic appraisal. Cognitive Psychology (1983).
- Parsing by chunks
- Bod, R., Scha, R., & Sima’an, K. (2003). Data-Oriented Parsing. Stanford: California,...
- Carreras, X., & Màrquez, L. (2005). Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings...
- et al. Sinica Treebank. Computational Linguistics and Chinese Language Processing (2000).
- Chen, K.-J., Huang, C.-R., Chang, L.-P., & Hsu, H.-L. (1996). Sinica Corpus: Design Methodology for Balanced Corpora....
- Church, K. (1988). A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of second...
- CKIP (2004). Sinica Chinese Treebank: An Introduction of Design Methodology, Academic...
- Thematic proto-roles and argument selection. Language (1991).
- Erik Tiong, K. S., Canisius, S., van den Bosch, A., & Bogers, T. (2005). Applying spelling error correction techniques...