Shallow semantic labeling using two-phase feature-enhanced string matching

https://doi.org/10.1016/j.eswa.2009.02.031

Abstract

A two-phase annotation method for semantic labeling in natural language processing is proposed. The dynamic programming approach emphasizes non-exact string matching that takes full advantage of the underlying grammatical structure of the parse trees in a Treebank. The first phase of the labeling is a coarse-grained syntactic parsing, which is complementary to the semantic dissimilarity analysis in the second phase. The approach goes beyond shallow parsing to a deeper level of case role identification while preserving robustness, without being bogged down in a complete linguistic analysis. The paper presents experimental results for recognizing more than 50 different semantic labels in 10,000 sentences. Results show that the approach improves the labeling even with incomplete information. Detailed evaluations are discussed in order to justify its significance.

Introduction

Automatic information extraction has received a great deal of attention in recent developments in information retrieval. While a plethora of issues relating to accuracy and efficiency have been thoroughly discussed, the problem of extracting meaning from natural language has scarcely been addressed. Given the size and quantity of documents available on the Internet, the demand for a highly efficient system that identifies semantic meaning is clear. The case frame is one of the most important structures used to represent the meaning of sentences (Fillmore, 1968). One could consider a case frame to be a special, or distinguishing, form of knowledge structure about sentences. Although several criteria for recognizing case frames in sentences have been considered in the past, none of them serves as a completely adequate decision procedure. Most studies in natural language processing (NLP) do not provide any hints on how to map input sentences into case frames automatically. As a result, both the efficiency and the robustness of the techniques used in information extraction are in serious doubt when they are applied to real-world applications.

The objective of this research is twofold. First, a shallow but effective sentence chunking process is developed. The process extracts all the phrases from the input sentences without being bogged down in deep semantic parsing and understanding. Second, a novel semantic labeling technique based on the syntactic and semantic tags of the latest Treebank is constructed (CKIP, 2004). One of our primary goals in this research is to design a shallow but robust mechanism that can annotate sentences using a set of semantic labels (Her, 1990). While classical syntactic and semantic analysis is extremely difficult, if not impossible, to systematize in current NLP research, our approach does not require any deep linguistic analysis to be formalized. The annotation provides the underlying semantic labels of the sentence without being mired in the formalism.

The organization of the paper is as follows. Related work in semantic labeling and sentence chunking is first described in Section 2. In this research, each word in a sentence has two attributes, namely part-of-speech (POS) and semantic class (SC). Any input sentence is first transformed into a feature-enhanced string. A detailed discussion of how the two-phase feature-enhanced string matching algorithm can be applied to semantic labeling is given in Section 3. The system has been implemented in Java. To demonstrate the capability of our system, an experiment with 10,000 sentences is conducted. A detailed evaluation is presented in Section 4, followed by a conclusion.

Section snippets

Related work

Following the framework of case grammar originally proposed by Fillmore (1968), it has been accepted that every nominal constituent in every language bears a single syntactic–semantic case relation (Dowty, 1991; Jackendoff, 1983). Among earlier systems, Somers (1982) describes a prototype computer program that attempts to map surface strings of English onto a formalism representing one level of deep structure. Weischedel, Meteer, Schwartz, Ramshaw, and Palmucci (1993) predict

Two-phase feature-enhanced string matching algorithm

In this section, we first outline the concept of edit operations, which are essential components of our feature-enhanced string matching algorithm. The two-phase shallow semantic labeling is then discussed thoroughly in Section 3.2.
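
As a rough illustration of how edit operations might work over feature-enhanced strings, the following Python sketch computes a dynamic-programming edit distance between two token sequences in which each token carries a POS tag and a semantic class. It is not the authors' implementation: the `Token` class, the cost weights, the tag values, and the `best_match` helper are all assumptions made for this example, and candidate trees are simplified to flat token sequences.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Token:
    """One symbol of a feature-enhanced string (hypothetical layout)."""
    word: str
    pos: str  # part-of-speech tag (illustrative values only)
    sc: str   # semantic class (illustrative values only)


def substitution_cost(a: Token, b: Token,
                      w_pos: float = 0.5, w_sc: float = 0.5) -> float:
    """Assumed cost: 0 when both features agree, plus a weight per mismatch.

    The surface word is deliberately ignored; only POS and SC are compared.
    """
    cost = 0.0
    if a.pos != b.pos:
        cost += w_pos
    if a.sc != b.sc:
        cost += w_sc
    return cost


def edit_distance(source: Sequence[Token], target: Sequence[Token],
                  indel_cost: float = 1.0) -> float:
    """Classic edit-distance table with a feature-aware substitution cost."""
    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel_cost          # delete every source token
    for j in range(1, n + 1):
        d[0][j] = j * indel_cost          # insert every target token
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost,                      # deletion
                d[i][j - 1] + indel_cost,                      # insertion
                d[i - 1][j - 1]
                + substitution_cost(source[i - 1], target[j - 1]),
            )
    return d[m][n]


def best_match(sentence: Sequence[Token],
               candidates: List[Sequence[Token]]) -> Sequence[Token]:
    """Return the candidate sequence with the minimum edit distance."""
    return min(candidates, key=lambda c: edit_distance(sentence, c))
```

Because the assumed substitution cost looks only at POS and SC, two sentences with different words but identical feature sequences are at distance zero, which is one way a non-exact match could still retrieve a structurally similar tree.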

Implementation and experimental results

We have implemented the system in Java, using JDK 1.4.2 from Sun Microsystems. The system was designed using the Unified Modeling Language (UML). In our design, for every input sentence, the best-matching tree with the minimum edit distance in the Treebank is calculated. The information-based case grammar (ICG) in the Sinica Chinese Treebank stipulates that each lexical item contains both semantic and syntactic features. The grammar indicates the way that lexical items in the sentences are

Conclusion

In this paper, we have illustrated a shallow technique in which semantic labels are extracted in the form of chunks of phrases or words using a two-phase feature-enhanced string matching algorithm. The first phase shortlists the potential trees in the Treebank, and the second phase tags the chunks with semantic labels. Based on the linguist’s conception of phrase structure, our approach does not require a full syntactic parse to pursue semantic analysis and the recursively

Acknowledgement

The work described in this paper was partially supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Nos. CUHK470605 and CUHK440607).

References (33)

  • Gee, J., et al. (1983). Performance structures: A psycholinguistic and linguistic appraisal. Cognitive Psychology.
  • Abney, S. Parsing by chunks.
  • Bod, R., Scha, R., & Sima’an, K. (2003). Data-Oriented Parsing. Stanford: California,...
  • Carreras, X., & Màrquez, L. (2005). Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings...
  • Chen, F.-Y., et al. (2000). Sinica Treebank. Computational Linguistics and Chinese Language Processing.
  • Chen, K.-J., Huang, C.-R., Chang, L.-P., & Hsu, H.-L. (1996). Sinica Corpus: Design Methodology for Balanced Corpora....
  • Church, K. (1988). A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of second...
  • CKIP (2004). Sinica Chinese Treebank: An Introduction of Design Methodology, Academic...
  • Dowty, D. (1991). Thematic proto-roles and argument selection. Language.
  • Erik Tiong, K. S., Canisius, S., van den Bosch, A., & Bogers, T. (2005). Applying spelling error correction techniques...
  • Fillmore, C. J. (1968). The case for case. In E. Bach & R. T. Harms (Eds.), Universals in linguistic theory (pp. 1–90)....
  • Gusfield, D. (1997). Algorithms on strings, trees, and sequences: Computer science and computational biology.
  • Haghighi, A., Toutanova, K., & Manning, C. D. (2005). A joint model for semantic role labeling. In Proceedings of the...
  • Her, O. S. (1990). Grammatical functions and verb subcategorization in Mandarin Chinese.
  • Jackendoff, R. (1983). Semantics and cognition.
  • Kurohashi, S., et al. (1994). A method of case structure analysis for Japanese sentences based on examples in case frame dictionary. IEICE Transactions on Information and Systems.