HyperLex: lexical cartography for information retrieval

https://doi.org/10.1016/j.csl.2004.05.002

Abstract

This article describes an algorithm called HyperLex that is capable of automatically determining word uses in a textbase without recourse to a dictionary. The algorithm makes use of the specific properties of word cooccurrence graphs, which are shown to have "small world" properties. Unlike earlier dictionary-free methods based on word vectors, it can isolate highly infrequent uses (as rare as 1% of all occurrences) by detecting "hubs" and high-density components in the cooccurrence graphs. The algorithm is applied here to information retrieval on the Web, using a set of highly ambiguous test words. An evaluation of the algorithm showed that it omitted only a very small number of relevant uses. In addition, HyperLex offers automatic tagging of word uses in context with excellent precision (97%, compared to 73% for baseline tagging, with an 82% recall rate). Remarkably good precision (96%) was also achieved on a selection of the 25 most relevant pages for each use (including highly infrequent ones). Finally, HyperLex is combined with a graphic display technique that allows the user to navigate visually through the lexicon and explore the various domains detected for each word use.

Introduction

Keyword-based information retrieval on the Web, or any other large textbase, runs into the problem of the multiple uses of most words. The inescapable homography and polysemy of human languages generate considerable noise in the results. A query on the French word barrage, for example, may return pages on dams, play-offs, barriers, roadblocks, police cordons, barricades, etc. depending on the global frequencies and the particular ranking techniques used by search engines. Retrieving infrequent uses can prove quite tricky.

Of course, users can usually narrow down their queries by combining keywords with Boolean operators, but this is not always a straightforward task. Continuing with the above example, combining the word barrage with the word match does not necessarily produce all of the pages about "matchs de barrage" (play-offs): many pages about this topic do not contain the word match. To get around this, one would have to list all lexical possibilities and write a query of the type barrage AND (jouer OR jeu OR championnat OR rencontre OR football OR basket-ball OR…) (play OR game OR championship OR encounter OR soccer OR basketball OR…), which is cumbersome and may still not produce the desired results. Besides, the general public is not very skillful when it comes to formulating such complicated queries. In a large-scale study of the Excite search engine, Spink et al. (2001) showed that less than 5% of queries contained Boolean operators, and that half of those were incorrectly formulated. Less than 1% of queries contained nested operators (as in the above example). Spink et al. even concluded:

For an overwhelming number of Web users, the advanced search features do not exist. The low use of advanced search features raises questions of their usability, functionality, and even desirability, as currently presented in search engines.

It thus seems worthwhile to carefully reconsider the applicability of word sense disambiguation methods to search engines. In the past few years, the idea has been circulating that word sense disambiguation, and more generally natural language processing (NLP) techniques, are useless in information retrieval (IR), and may even lower performance. I will show below that this claim rests on an erroneous interpretation of repeatedly cited articles such as Voorhees (1999), and the present study will hopefully demonstrate that it is false.

To be useful, a word sense disambiguation technique must exhibit sufficiently high performance. Many recent studies conducted in the Senseval competition (Kilgarriff, 1998) proposed substantial improvements in the available techniques and resources (see also Stevenson and Wilks, 2001). However, to my mind, one of the main problems in word sense disambiguation lies upstream, in the very sense lists used by systems. Conventional dictionaries are not suited to this task: they usually contain definitions that are too general (in our barrage example, the "act of blocking"), and there is no guarantee that they reflect the exact content of the particular textbase being queried. I showed experimentally that linguists have trouble matching the "senses" found in a dictionary with the occurrences found in a corpus (Véronis, 1998). What is more, textbase documents would still have to be automatically categorized on the basis of the dictionary "senses", an extremely difficult task that has resisted half a century of research effort and on which progress has been only recent (see Ide and Véronis, 1998, for a detailed description of the state of the art, and Stevenson and Wilks, 2001, for recent developments).

Schütze (1998) proposed a method based on "word vectors" that automatically extracts the list of "senses" (I prefer to speak of "uses") from a given corpus, while also offering a robust categorization technique. However, vector-based techniques come up against a major and highly crippling problem: large frequency differences between the uses of the same word cause most of the useful distinctions to fall below the model's noise threshold and be thrown out.

In the present article, I propose a radically different algorithm, HyperLex, capable of automatically determining the uses of a word in a textbase without recourse to a dictionary. The algorithm exploits the specific properties of word cooccurrence graphs, which, as I will show below, turn out to be special graphs called "small worlds" (Watts and Strogatz, 1998; Barabási and Albert, 1999). Unlike the earlier word-vector methods, this approach can isolate highly infrequent uses (as rare as 1% of all occurrences) by detecting graph "hubs" and high-density components. The algorithm is applied here to information retrieval on the Web, using a set of highly ambiguous test words. An evaluation of the algorithm showed that it omitted only a very small number of relevant uses. In addition, HyperLex offers automatic tagging of word uses in context, with an excellent precision level (97% compared to 73% for baseline tagging, with an 82% recall rate). Remarkably good precision (96%) was also achieved when the 25 most relevant pages were selected for each use (including the highly infrequent ones). Finally, HyperLex comes with a graphic display technique that allows the user to navigate visually through the lexicon and explore the various domains detected for each use.


Past research

Word sense disambiguation techniques were first applied to IR about 30 years ago, with Weiss's work (1973), but it was not until the 1990s that this type of application was tested on a full scale (Krovetz and Croft, 1992; Voorhees, 1993; Wallis, 1993). The results obtained so far have been modest, and some studies have even reported a decline in performance. As mentioned above in the introduction, it was no doubt these studies – especially the widely cited one by Voorhees (1999) – that gave rise to the idea that word sense disambiguation is of no use to information retrieval.

Small lexical worlds

One can construct a graph for each word to be disambiguated in a corpus (or target word – details on how such a graph is generated will be presented later). The graph's nodes are the words that cooccur with the target word (in a window of a given size, e.g., a sentence or a paragraph). An edge connects two nodes, A and B, whenever the corresponding words cooccur with each other. For instance, in the graph of the target word barrage (Fig. 4), the nodes corresponding to production and électricité are connected by an edge, because these two words cooccur in contexts of barrage.
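To make the construction concrete, here is a minimal sketch of how such a graph might be assembled from the contexts of a target word. The function name, the `min_freq` threshold, and the context format are illustrative assumptions, not the exact parameters used in this study.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence_graph(contexts, min_freq=2):
    """Sketch of cooccurrence-graph construction for one target word.

    `contexts` is a list of token lists, one per window (sentence or
    paragraph) containing the target word, with the target itself removed.
    Nodes are the cooccurring words; an edge links two words that appear
    together in at least `min_freq` contexts.
    """
    edge_freq = Counter()
    for tokens in contexts:
        types = set(tokens)  # count each word at most once per context
        for a, b in combinations(sorted(types), 2):
            edge_freq[a, b] += 1

    graph = defaultdict(set)
    for (a, b), f in edge_freq.items():
        if f >= min_freq:  # discard one-off cooccurrences (assumed threshold)
            graph[a].add(b)
            graph[b].add(a)
    return graph
```

On the barrage example, the contexts would be windows such as ["eau", "rivière", "crue"], and the resulting graph would link eau to rivière, rivière to crue, and so on.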

Detection of high-density components

The basic assumption underlying the method proposed here is that the different uses of a target word form highly interconnected "bundles" in a small world of cooccurrences or, in graph-theoretic terms, high-density components. Accordingly, barrage (in the sense of a hydraulic dam) must cooccur frequently with eau, ouvrage, rivière, crue, irrigation, production, électricité (water, work, river, flood, irrigation, production, electricity), etc., and these words themselves are likely to be strongly interconnected with one another.
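A greedy sketch of the hub-detection idea follows. It uses raw degree where the full algorithm relies on frequency and weighted-degree criteria, so the `min_degree` threshold and the function name are simplifying assumptions.

```python
def find_hubs(graph, min_degree=5):
    """Greedy sketch of high-density component detection: repeatedly
    take the best-connected word not yet covered, treat it as the hub
    of a new component, and absorb its neighbourhood."""
    hubs, covered = [], set()
    for word in sorted(graph, key=lambda w: len(graph[w]), reverse=True):
        if word in covered or len(graph[word]) < min_degree:
            continue
        hubs.append(word)
        covered.add(word)
        covered.update(graph[word])  # the hub's neighbours form the component core
    return hubs
```

For barrage, such a pass should pull out one hub per use: for instance eau for the dam component, then a hub for the play-off pages, and so on down to the infrequent uses.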

Disambiguation

The minimum spanning tree can be used to easily construct a disambiguator for tagging target word occurrences in the corpus. Each tree node $v$ is assigned a score vector $s$ with as many dimensions as there are components:

$$s_i = \begin{cases} \dfrac{1}{1 + d(h_i, v)} & \text{if } v \text{ belongs to component } i,\\ 0 & \text{otherwise.} \end{cases} \tag{11}$$

In (11), $d(h_i, v)$ is the distance between root hub $h_i$ and node $v$ in the tree.

Formula (11) assigns a score of 1 to root hubs, whose distance from themselves is 0. The score gradually approaches 0 as the nodes move away from their root hub in the tree.
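A sketch of this disambiguator, under one stated assumption: the artificial root attaching the hubs has been removed, so the MST is a forest in which each root hub spans exactly its own component. Summing the score vectors of the words in a context and picking the best component is one natural way to tag occurrences; the function names are illustrative.

```python
from collections import deque

def score_vectors(mst, hubs):
    """Compute formula (11) for every node of the spanning forest:
    s_i = 1 / (1 + d(h_i, v)) inside component i, 0 elsewhere.
    `mst` maps each node to its tree neighbours; a BFS from hub h_i
    therefore reaches exactly the nodes of component i."""
    scores = {v: [0.0] * len(hubs) for v in mst}
    for i, hub in enumerate(hubs):
        dist, queue = {hub: 0}, deque([hub])
        while queue:  # breadth-first search gives tree distances
            v = queue.popleft()
            scores[v][i] = 1.0 / (1.0 + dist[v])
            for w in mst[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
    return scores

def tag_occurrence(context_tokens, scores, hubs):
    """Add the score vectors of the words in one context and tag the
    occurrence with the highest-scoring use (or none if all are zero)."""
    total = [0.0] * len(hubs)
    for t in context_tokens:
        for i, s in enumerate(scores.get(t, [])):
            total[i] += s
    best = max(range(len(hubs)), key=total.__getitem__)
    return hubs[best] if total[best] > 0 else None
```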

Viewing and navigation

Difficult viewing problems arise for large graphs, given that most graph-drawing problems are NP-hard. We use a method recently devised by Munzner (2000), which allows a very fast display using hyperbolic trees. We feed the algorithm with the MST previously described, adjusted by a number of heuristics.
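For completeness, here is a compact version of Kruskal's construction (Kruskal, 1956), the classic way to obtain such an MST. The (weight, word, word) edge triples are an assumed input format, since the snippets above do not show the paper's exact edge-weighting scheme.

```python
def kruskal_mst(nodes, weighted_edges):
    """Kruskal's algorithm: scan edges in order of increasing weight and
    keep each edge that joins two previously separate components."""
    parent = {v: v for v in nodes}

    def find(v):  # union-find root lookup with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst = {v: set() for v in nodes}
    for weight, a, b in sorted(weighted_edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # keeping this edge creates no cycle
            parent[root_a] = root_b
            mst[a].add(b)
            mst[b].add(a)
    return mst
```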

Fig. 10 shows the main view (top of the tree) for the word barrage. The user can navigate from domain to domain inside the tree.

Evaluation

The results of the HyperLex algorithm were evaluated on the Web page corpus, using the list of ten test words described above.

The first step was to make sure that the algorithm correctly extracted most uses in the corpus, irrespective of any tagging of the contexts. This subtask is interesting in its own right (for proposing query refinements to users, for example).

The next step was to assess the quality of the context tagging by drawing a random sample from the corpus and taking the standard precision and recall measures.
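The standard measures on such a sample reduce to a few lines. A sketch, assuming the hand-tagged sample is a dict from occurrence ids to uses, and that occurrences the tagger leaves untagged are simply absent from its output:

```python
def precision_recall(gold, predicted):
    """Precision = correct / attempted; recall = correct / total.
    `gold` maps each sampled occurrence id to its hand-assigned use;
    `predicted` omits occurrences the tagger left untagged."""
    attempted = [oid for oid in gold if oid in predicted]
    correct = sum(1 for oid in attempted if predicted[oid] == gold[oid])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```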

Conclusion

This article presented an efficient algorithm for disambiguating the senses of words in information retrieval tasks. The algorithm, HyperLex, makes use of the particular structure of cooccurrence graphs, shown here to be "small worlds", a topic of extensive research in recent years. As in previously proposed methods (word vectors), the algorithm automatically extracts a use list for the words in a corpus (here, the Web), a feature that sidesteps the problems brought about by recourse to a dictionary.

Epilogue

After this study was conducted, I discovered, thanks to the (impressive) review by Albert and Barabási (2002), that other researchers have been independently investigating the possibility of modelling the semantic aspects of human language using small-world networks. Albert and Barabási cite a study in their own group, by Yook, Jeong and Barabási, which shows that the network of synonyms extracted from the Merriam-Webster Dictionary exhibits a small-world and scale-free structure (however this

Acknowledgements

The author is grateful to Vivian Waltz for her help with the translation, and to anonymous reviewers for their very useful comments.

References

  • A.M. Collins et al., Retrieval time from semantic memory, Journal of Verbal Learning and Verbal Behaviour (1969)
  • B.J. Jansen et al., Real life, real users, and real needs: a study and analysis of user queries on the web, Information Processing and Management (2000)
  • S.F. Weiss, Learning to disambiguate, Information Storage and Retrieval (1973)
  • T.E. Ahlswede, Sense disambiguation strategies for humans and machines
  • T.E. Ahlswede, Word sense disambiguation by human informants
  • T.E. Ahlswede et al., The ambiguity questionnaire: a study of lexical disambiguation by human informants
  • R. Albert et al., Statistical mechanics of complex networks, Reviews of Modern Physics (2002)
  • Amsler, R.A., White, J.S., 1979. Development of a computational methodology for deriving natural language semantic...
  • A.-L. Barabási et al., Emergence of scaling in random networks, Science (1999)
  • Bruce, R., Wiebe, J., 1998. Word sense distinguishability and inter-coder agreement. In: Proceedings of the 3rd...
  • C. Fellbaum et al., Performance and confidence in a semantic annotation task
  • R. Ferrer i Cancho et al., The small world of human language, Proceedings of the Royal Society of London. Series B, Biological Sciences (2001)
  • Gelbukh, A., Sidorov, G., Chanona-Hernandez, L., 2003. Is word sense disambiguation useful in information retrieval?...
  • Z.S. Harris, Distributional structure, Word (1954)
  • Hornby, A.S., 1954. A guide to patterns and usage in English. London,...
  • Hornby, A.S., Gatenby, E.V., Wakefield, H., 1942. Idiomatic and Syntactic English Dictionary. [Photographically...
  • N.M. Ide et al., Introduction to the special issue on word sense disambiguation: the state of the art, Computational Linguistics (1998)
  • J. Jorgensen, The psychological reality of word senses, Journal of Psycholinguistic Research (1990)
  • A. Kilgarriff, SENSEVAL: an exercise in evaluating word sense disambiguation programs
  • R. Krovetz et al., Lexical ambiguity and information retrieval, ACM Transactions on Information Systems (1992)
  • Kruskal, J.B., 1956. On the shortest spanning subtree of a graph and the traveling salesman problem. In: Proceedings of...
  • A. Meillet (1926)

    This article is an extended version of a paper given at the TALN'2003 conference (Véronis, 2003).
