Introduction

Swedish differs from English in many respects. Properties of Swedish not shared by English include a high frequency of compounds, the use of glue morphemes in compounds, and a high proportion of homographs. Moreover, Swedish is inflectionally more complex than English. Thus, results obtained for English information retrieval cannot be directly applied to Swedish.

The morphological variation of query terms in the document collection is a well-known problem in information retrieval (IR) research. IR researchers have attempted to counteract it by applying different conflation methods in the indexing process, such as stemming (which usually refers to the removal of suffixes, inflectional and derivational, from word forms) and normalization (the transformation of inflected word forms to their base forms). We use the term conflation for the process of grouping morphological term variants (Frakes, 1992). Stemming and normalization attempt to group the morphological variants in the documents by associating them with a common form. This form acts as a representative of the variants and can be placed in the index, instead of the variants, with pointers to the documents where the variants occur. If the representative form is also used in the query, documents that contain different variants are retrieved. Stemming can be considered a more liberal method than normalization, and it is possible for semantically unrelated words to be associated with the same stem (the outcome of the stemming process).
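The grouping of variants under a representative form can be sketched as follows. The conflation table and documents below are invented for illustration only; in practice, the mapping would come from a stemmer or a normalizer:

```python
# Illustrative sketch of conflation at indexing time.
# Each surface form is mapped to a representative form; the index stores
# the representative with pointers to documents where any variant occurs.

def build_conflated_index(docs, conflate):
    """docs: {doc_id: [tokens]}; conflate: surface form -> representative form."""
    index = {}
    for doc_id, tokens in docs.items():
        for token in tokens:
            index.setdefault(conflate(token), set()).add(doc_id)
    return index

# Toy conflation table for English variants (assumed, for illustration only)
table = {"cities": "city", "city": "city", "organizes": "organize"}
docs = {1: ["cities"], 2: ["city"], 3: ["organizes"]}
index = build_conflated_index(docs, lambda t: table.get(t, t))
print(sorted(index["city"]))  # [1, 2]: both variant-bearing documents are found
```

With this index, a query on the representative form city retrieves documents containing either variant, which is the effect the conflation methods above aim for.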

In this paper, we look at the problem of morphological variation of query terms in the document collection. The study involves four indexing strategies. The baseline strategy is to place each word form that occurs in the texts of the collection in the index as such. In particular, all the inflected variants of a given word are placed in the index unchanged. The application of the baseline strategy yields an inflected word form index. Two of the other three indexing strategies are based on normalization: each inflected word form that occurs in the text collection is transformed to a base form, which is then placed in the index. An index generated in this manner is called a base word form index, and such an index, in principle, contains no inflected word forms. The two normalization strategies involve compound splitting: a compound that occurs in a document is split into its components, and these components (in base form) and the compound itself are placed in the index, pointing to the same address. In this study, compound splitting was also performed at the query phase. The fourth indexing strategy is to stem the word forms that occur in the texts and place the resulting stems in the index. In this way, a stem index is generated.

This study compares seven different combinations of indexing strategies with query terms with respect to retrieval effectiveness. The IR system used in the experiment is a best match (probabilistic) system. The aim of the study is to generate knowledge of how the seven combinations behave with respect to Swedish texts and retrieval effectiveness. The study is based on Ahlgren (2004); however, some methods tested here were not tested in that work.

The remainder of this paper is organized as follows. In Section 2, some properties of Swedish related to IR are given. Section 3 reports earlier research, while data and methods are described in Section 4. Findings are reported in Section 5, and these are discussed in Section 6. Conclusions, finally, are given in Section 7.

Some properties of the Swedish language related to IR

For nouns, a phenomenon that is interesting from an IR point of view is mutation (Thorell, 1977). Consider the transition from the base form stad (“city”) to the plural städer (“cities”). If we apply right hand truncation (which is applied in this study, see Section 4.3) to stad, truncating immediately to the right of the final d, at least one inflectional form of stad would clearly be missed. Alternatively, if a stemming algorithm for Swedish does not associate these two word forms with the same stem, the query stad would miss städer.
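The effect of mutation on right hand truncation can be illustrated with a small sketch, where truncation is simulated as prefix matching over a toy vocabulary of inflected word forms:

```python
# Simulate right-hand truncation as prefix matching over a vocabulary of
# inflected word forms (toy vocabulary for illustration only).
def truncation_match(stem, vocabulary):
    """Return all vocabulary strings that begin with the truncation stem."""
    return sorted(w for w in vocabulary if w.startswith(stem))

vocab = ["stad", "staden", "städer", "städerna"]
print(truncation_match("stad", vocab))
# ['stad', 'staden']: the mutated plural forms with ä are missed
```

The truncation stem stad fails to match städer and städerna precisely because of the vowel mutation described above.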

Swedish verbs can be divided into weak and strong, according to the way the past tense is formed (Thorell, 1977). So-called strong verbs, i.e., verbs whose past tense is formed by changing the vowel of the stem, give rise to retrieval problems similar to those caused by mutation. Consider the strong verb ligga (“lie”): ligga, låg (“lay”) and legat (“lain”). Again, if right hand truncation is applied, with ligga truncated immediately to the right of the second occurrence of g, both the past tense form låg and the supine form legat are missed.

In contrast to English, in Swedish the process of compounding normally yields a single word. For example, the Swedish equivalent of jazz music is jazzmusik. This property of Swedish, that compounding normally yields single words, is relevant to IR. A document in Swedish about jazz music, indexed by jazzmusik but neither by jazz nor musik, would neither be retrieved by the query musik nor by the query jazz. However, if jazzmusik is split into its components during indexing, the document would be retrieved by both queries. Thus, without compound splitting relevant documents may not be found.

In Swedish, derivational suffixes play a more important role than derivational prefixes (Malmgren, 1994). Most Swedish suffix derivatives belong to one of the parts of speech noun, adjective or verb, and the part of speech of the derivative is determined by the suffix (Malmgren, 1994). For instance, -het is a substantival, -aktig an adjectival and -era a verbal suffix. If we combine these with nykter (“sober”), röd (“red”) and analys (“analysis”), respectively, we get the derivatives nykterhet (“soberness”), rödaktig (“reddish”) and analysera (“analyze”).

Another property of Swedish relevant to IR is the use of glue morphemes in compounding. One such glue morpheme is -s. If compound splitting is performed during indexing, it is desirable to split a compound like märkesvaror (“proprietary (branded) goods”) without retaining the -s. The reason is that märke, and not märkes, would be used as a query term. Finally, homography is a very common phenomenon in Swedish (Malmgren, 1994). For example, plan (“airplane”) and plan (“open space”) are homographs. If a user interested in airplanes uses plan in the query, irrelevant documents about open spaces may be retrieved.

Earlier research

Several earlier studies have shown that conflation can be useful in relation to retrieval effectiveness. The research by Lennon et al. (1981), Krovetz (1993) and Hull (1996) supports the idea that stemming is useful for the English language. Harman (1991), though, did not see any significant differences between stemming and non-stemming, and the same holds for the control experiment performed by Popovič and Willett (1992).

The picture is clearer for morphologically more complex languages than for English. Kalamboukis (1995) and Popovič and Willett (1992) reported improvements in retrieval effectiveness due to stemming for Greek and Slovene, respectively. With respect to Dutch, Kraaij and Pohlman (1996) report that for each recall measure used in the study, an inflectional algorithm, extended with compound splitting and compound generation, was significantly better than non-stemming.

Savoy (1999) used both long and short French test documents. Not only conflation was studied, but also stopwording with regard to retrieval effectiveness. It was observed that, for the test collection with short documents, conflation in the form of stemming was beneficial for most of the weighting scheme/ranking function combinations used, irrespective of whether a stop list was applied or not. For Turkish, Ekmekçioglu and Willett (2000) applied a morphological analyzer at the query phase and compared conflation to non-conflation. They report that conflation was beneficial, and the differences in effectiveness between conflation and non-conflation were statistically significant.

Alkula (2001) dealt with Finnish and used three indices, an inflected word form index and two base word form indices. With regard to one of these latter indices, compound splitting was performed. For both precision and recall, methods that involved normalization performed best. However, the baseline was manual truncation of search words, and a Boolean IR system was used. Kettunen et al. (2005) compared normalization with compound splitting and stemming as indexing strategies for Finnish in a probabilistic IR environment. Queries were processed according to the indexing strategy. Further, they compared truncation (automatic) in an inflected word form index to the former methods. Their results showed that normalization performed significantly better than the other methods.

Carlberger et al. (2001) used a database of Swedish news articles and a Swedish stemmer. It turned out that stemming improved precision as well as recall. In Tomlinson (2002), the effects of stemming on retrieval effectiveness were investigated for several European languages, among them Swedish. CLEF 2002 test collections were used in the experiments. For Swedish, stemming improved mean average precision by 4%. The improvement was not statistically significant, though. The corresponding (statistically significant) improvement for German was as high as 27%. However, German compounds were split, both during indexing and at the query phase, while Swedish compounds were not.

Tomlinson (2003) compared non-conflation, inflectional stemming (lexicon-based) and inflectional/derivational stemming, using CLEF 2003 test collections. With respect to inflectional/derivational stemming, Porter stemmers were used. The study involved several languages, among them Swedish (for which compound splitting was not employed). For this language, inflectional stemming had a significantly better performance (mean average precision) than non-conflation, and inflectional/derivational stemming had a significantly better performance than inflectional stemming. However, with regard to mean precision at DCVs 5, 10 and 20, the differences between the latter two approaches were fairly small.

Savoy (2003) used CLEF 2003 test collections. The study dealt with several languages, and tested the effect of several variables (like indexing strategy, retrieval model and pseudo-relevance feedback). Further, inflectional stemming was utilized. For Swedish, three indexing strategies were compared: a word-based, a compound splitting (also with respect to queries) and a 4-gram strategy. For seven of the 11 tested retrieval models, the compound splitting strategy performed best with respect to mean average precision. This approach outperformed the word-based indexing strategy for each retrieval model. However, the maximal difference was no more than 2.26 percentage units.

Finally, Braschler and Ripplinger (2004) dealt with German and studied the effects of conflation and compound splitting on retrieval effectiveness. Compound splitting took place both during indexing and at query phase. Further, both short and long queries were used. Methods that used conflation in combination with compound splitting had better effectiveness than methods that only used conflation, and these latter methods performed better than the baseline of non-conflation and non-splitting. However, the improvements obtained, in relation to the baseline, were higher for short queries than for long.

In this study we compare indexing strategies utilizing grammatical morphological analysis (normalization) to non-grammatical approaches (stemming and inflected word form indices) for Swedish. Further, we consider several grammatical and non-grammatical, intellectual and automatic methods for handling query terms in the query construction phase, for each index type. Since Swedish is a morphologically relatively complex language, it is reasonable to expect that conflation would enhance retrieval effectiveness also with respect to the documents and requests used in this study. Moreover, the results reported in Savoy (2003) and Braschler and Ripplinger (2004) support the hypothesis that compound splitting, both during indexing and at the query phase, would be fruitful.

Data and methods

The indexing strategies of the study

The first indexing strategy of the study, which we call INFL, is to place the document terms as such in the index. In particular, all the inflected variants of a given word are placed in the index as such. The application of INFL yields an inflected word form index. The second indexing strategy, SPLIT, employs compound splitting and can be described as follows: normalize each inflected word form and place the resulting base form(s) in the index. For compounds, split them and place the components (in base form) in the index. The application of SPLIT yields a base word form index.

Consider a compound like stålindustrin (“the steel industry”). An index generated by SPLIT will have an entry, not just for stålindustri, but also for stål and industri. stålindustrin is first normalized to stålindustri. stålindustri is then split up into its components (in base form), stål and industri. Finally, stålindustri, stål and industri are placed in the index, pointing to the same address. With an index generated by SPLIT, a user can employ the components of a compound as separate query terms. Such an index can therefore be said to address, besides the problem of inflectional variation, the problem of (in compounds) embedded query terms.
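For a single token, the SPLIT strategy can be sketched as below. The normalization and splitting functions are hard-coded stand-ins here; in the study they are provided by the morphological analyzer described next:

```python
# Sketch of the SPLIT strategy for one token: normalize, split the compound,
# and index the base form together with its components (same address).
def split_index_terms(surface, normalize, split):
    base = normalize(surface)           # e.g. stålindustrin -> stålindustri
    components = split(base)            # e.g. stålindustri -> [stål, industri]
    return sorted({base, *components})  # all entries point to the same address

terms = split_index_terms(
    "stålindustrin",
    normalize=lambda w: {"stålindustrin": "stålindustri"}[w],
    split=lambda b: {"stålindustri": ["stål", "industri"]}[b],
)
print(terms)  # ['industri', 'stål', 'stålindustri']
```

A document containing stålindustrin is thus reachable through stål and industri as well, which addresses the embedded-query-term problem described above.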

In this study, the indexing strategies based on normalization employ SWETWOL (Karlsson, 1992), a morphological analyzer for the Swedish language based on TWOL, the two-level model created by Koskenniemi (1983). SWETWOL is meant to be used as a basic morphological tool in, e.g., information indexing (Karlsson, 1992). For a given word form as input, the output of SWETWOL is a cohort. A cohort has two parts: the input word form and zero or more readings. A reading consists of a base form of the input word form and a description of the input word with respect to part of speech and other grammatical categories. The following cohort is produced by SWETWOL as a response to the input word form organisationerna (“the organizations”):

    "<organisationerna>"
        "organisation" N UTR DEF PL NOM

SWETWOL analyzes organisationerna as a noun (N), with gender non-neuter (UTR), in definite form (DEF) plural (PL), in nominative case (NOM), and with the base form organisation (“organization”).

When a compound is given as input to SWETWOL, it is split in the output. The cohort for pappersbruken (“the paper-mills”) is

    "<pappersbruken>"
        "pappers#bruk" N NEU DEF PL NOM

where NEU stands for gender neuter. Note that the glue morpheme -s is retained. Obviously, it is desirable to place papper (“paper”) and not pappers (“papers”) in an index like SPLIT. In this study, SWETWOL has been tuned to return base forms also of components of compounds. In the present case, pappers would be normalized to papper, which is then placed in the index.

We define a word form to be morphologically ambiguous if SWETWOL gives more than one reading for it. As an example of a morphologically ambiguous word form, consider the compound kulturdebatt (“cultural debate”), with the following cohort:

    "<kulturdebatt>"
        "kultur#debatt" N UTR INDEF SG NOM
      * "kult#ur#debatt" N UTR INDEF SG NOM
      * "kul#tur#debatt" N UTR INDEF SG NOM

In this cohort, the first reading is appropriate, the other two readings are spurious. (We use a star, “*”, as a prefix for spurious readings.) The precision, i.e., the fraction of the appropriate readings of a cohort in relation to all the readings of a cohort, is in this case 1/3. One can ask if the precision of the compound mechanism of SWETWOL is high enough for the indexing strategy SPLIT to be fruitful with respect to retrieval effectiveness. Given a compound as input to SWETWOL, low precision may characterize the output, as we saw in the kulturdebatt example. If SPLIT is used in such a situation, the components of a spurious compound reading are placed in the index in base form. If these components in base form are semantically unrelated to the content of the source document of the input compound, retrieval effectiveness (precision, in this case) may be affected negatively if the components (in base form) are used as query terms. Let us consider the compound marinbiologer (“marine biologists”). Given this compound as input, SWETWOL gives the following cohort as output:

    "<marinbiologer>"
        "marin#biolog" N UTR INDEF PL NOM
      * "marin#bio#loge" N UTR INDEF PL NOM

The last reading is spurious. Its last component means “barn”, “box” or “lodge”. Assume that marinbiologer occurs in a document about marine biologists. If SPLIT is used, the following terms are placed in the index:

bio biolog loge marin marinbiolog marinbiologe

bio (“cinema”), e.g., is then placed in the index, with a pointer to the document about marine biologists. A user interested in cinema documents and employing bio as a query term will retrieve at least one document about marine biologists.

Local disambiguation refers to the elimination of readings in a cohort (Karlsson, 1992). Such morphological disambiguation uses no cohort-external information. Only the information in the cohort itself is used for deciding which of the readings in the cohort should be eliminated.

To reduce, for a given document d and in relation to SPLIT, the number of index entries e such that (1) the term t of e is semantically unrelated to d, and (2) d is represented at e, local disambiguation was applied in a third indexing strategy, SPLIT-EL. Like SPLIT, SPLIT-EL involves compound splitting. The sole difference between the two is that SPLIT-EL makes use of the Compound Elimination Principle (Karlsson, 1992):

If a cohort C contains readings with n and m compound boundaries, discard all readings with m compound boundaries if m > n.

This means that only the readings with a minimum number of compound boundaries are considered; the other readings are eliminated. For the term marinbiologer, SPLIT-EL eliminates the spurious last reading, and only the terms

biolog marin marinbiolog

are taken to the index.
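The Compound Elimination Principle amounts to keeping, within a cohort, only the readings with the fewest compound boundaries. A minimal sketch, with boundaries marked by “#” as in the SWETWOL readings shown earlier:

```python
# Keep only the readings with the minimum number of compound boundaries
# ('#') in a cohort; all other readings are eliminated.
def eliminate_compounds(readings):
    fewest = min(r.count("#") for r in readings)
    return [r for r in readings if r.count("#") == fewest]

print(eliminate_compounds(["marin#biolog", "marin#bio#loge"]))
# ['marin#biolog']
print(eliminate_compounds(["kultur#debatt", "kult#ur#debatt", "kul#tur#debatt"]))
# ['kultur#debatt']
```

In both example cohorts, the surviving reading is the appropriate one, which is exactly the behavior the principle relies on.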

Terms that are not recognized by SWETWOL are preceded by the character “@” in the indices. This markup facilitates access to terms not recognized. Further, it has an impact on query construction (see Section 4.3.3).

The remaining indexing strategy is to stem the word forms that occur in the documents, and then place the resulting stems in the index. We let STEM denote this strategy. In this study, stemming was performed by the Snowball stemmer for Swedish (Porter, 2001).

Combinations of indexing strategies with query terms

Seven IR methods were evaluated in the study. These methods constitute the combinations of indexing strategies with query terms. In the following list, the combinations are described and numbered.

  1. Application of the indexing strategy INFL and use of original words, from a topic text, as query terms. This is the baseline combination of the study, and it is denoted by INFL-orig.

  2. Application of the indexing strategy INFL and use of truncation stems as query terms. The combination is denoted by INFL-trunc.

  3. Application of the indexing strategy SPLIT and use of intellectually selected words in base form (compounds split) as query terms. The combination is denoted by SPLIT-split-int.

  4. Application of the indexing strategy SPLIT and use of automatically selected words in base form (compounds split) as query terms. The combination is denoted by SPLIT-split-aut.

  5. Application of the indexing strategy SPLIT-EL and use of intellectually selected words in base form (compounds split) as query terms. The combination is denoted by SPLIT-EL-split-int.

  6. Application of the indexing strategy SPLIT-EL and use of automatically selected words in base form (compounds split and the Compound Elimination Principle applied) as query terms. The combination is denoted by SPLIT-EL-split-el-aut.

  7. Application of the indexing strategy STEM and use of automatically generated word stems as query terms. The combination is denoted by STEM-stems.

Test collection, IR system and construction of queries

We used the Swedish CLEF 2003 test collection. The document set of this collection consists of 142,819 news articles (352 MB) from Tidningarnas Telegrambyrå (1994/1995). The number of topics in the collection is 60, and the topics are numbered 141–200. Since six topics had no (known) relevant documents, they were excluded. Thus, the sample size of the study is n = 54. The number of known relevant documents for the 54 topics is 1,006.

InQuery (Version 3.1), a probabilistic system based on the inference network model (Turtle, 1990; Turtle and Croft, 1990, 1991), was used as the test system. InQuery uses a best match technique and has a wide range of operators. (For a detailed description, see Allan et al., 1997; Callan et al., 1995; Rajashekar and Croft, 1995.)

For each topic, a word list was constructed from the description field (see Appendix 1). In this process, a stop list for Swedish was used. A word list L_i for a topic T_i is such that each word in L_i occurs in the description field of T_i. For each of the seven tested combinations of indexing strategies with query terms, a set of 54 queries, to be run against the method's corresponding index, was generated on the basis of the word lists.

Queries for INFL-orig

The baseline method of the study, the combination INFL-orig, uses original words from the word lists as query terms. Let L_i be the word list obtained from a topic T_i. The query, corresponding to T_i, for the combination INFL-orig was constructed by placing each word w in L_i in the query, as an argument of the #sum operator of InQuery. Informally, #sum specifies that the more of its arguments that are satisfied by a document, the better the document matches. The query thus has the form

$$ \#\mathbf{sum}(Q_1, Q_2, \ldots, Q_n) $$

where each Q_i is a word from L_i. Clearly, no attempt to counteract the problem of word variation in the document collection is made.

Queries for INFL-trunc

With regard to INFL-trunc, an attempt to counteract the problem of word variation in the document collection was made by using a search expert. The 60 topics were submitted to the search expert, together with the 60 word lists and instructions. For each word list, the search expert performed right hand truncation on the words he found appropriate to truncate (see Appendix 1). The character “,” was used as a boundary character.

Like the queries for INFL-orig (and for the combinations based on normalization and STEM-stems, see below), a query for INFL-trunc is a #sum expression. Since InQuery does not support truncation, truncated words could not be used directly in queries. Instead, retrieval using truncated words as query terms was simulated.

Let L_i be the word list obtained from topic T_i. Further, let w be a word that occurs in L_i. The query, corresponding to T_i, for INFL-trunc was constructed in the following way. Assume that the search expert has right hand truncated w. Let s be the truncation stem of w, i.e., the sequence of characters to the left of the boundary character. Each string in the inflected word form index that begins with s was retrieved, with the aid of a Unix script, and the retrieved strings were placed inside the #syn operator of InQuery. This operator treats its arguments as different expressions for the same concept, i.e., as synonyms. The resulting #syn expression was then placed in the query, as an argument of #sum. If w was not right hand truncated, then w itself was placed in the query, as an argument of #sum. The query thus has the form

$$ \#\mathbf{sum}(Q_1, Q_2, \ldots, Q_n) $$

where each Q_k is either a word from L_i, or a #syn expression whose arguments are terms retrieved from the inflected word form index of the study.
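The simulated truncation can be sketched as follows. The vocabulary and word lists are toy data, and the query is built as a plain string in the #sum/#syn notation used here; in the study, the matching index strings were retrieved with a Unix script:

```python
# Sketch of simulating truncation for InQuery: expand each truncated stem
# (marked by a trailing ',') against the index vocabulary, wrap the hits
# in #syn, and combine all query terms with #sum.
def build_trunc_query(terms, vocabulary):
    parts = []
    for term in terms:
        if term.endswith(","):          # ',' marks the truncation boundary
            stem = term[:-1]
            hits = sorted(w for w in vocabulary if w.startswith(stem))
            parts.append("#syn(" + " ".join(hits) + ")")
        else:
            parts.append(term)          # untruncated words are used as-is
    return "#sum(" + " ".join(parts) + ")"

vocab = ["stad", "staden", "musik", "jazzmusik"]
print(build_trunc_query(["stad,", "musik"], vocab))
# #sum(#syn(stad staden) musik)
```

The #syn expansion replays, inside the query, exactly the prefix matches that a native truncation operator would have produced against the inflected word form index.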

Queries for the combinations based on normalization

Let L_i be the word list generated from a topic T_i. Further, let w be a word that occurs in L_i. The query, corresponding to T_i, for SPLIT-split-int and SPLIT-EL-split-int was intellectually constructed. The aim of the intellectual construction was to detect whether omitting the semantically spurious terms bound to appear in the automatic query construction process would enhance effectiveness. w was given to SWETWOL as input. First, assume that w was recognized. Irrespective of whether w is a compound or not, we selected the contextually right reading and used the corresponding base form for the query. If w is a non-compound, the base form was placed in the query, as an argument of #sum. If w is a compound, we placed the base form together with its components (in base form) inside the #syn operator. The resulting #syn expression was then placed in the query, as an argument of #sum. We normalized a compound component ourselves if its base form was not given by SWETWOL. Second, if w was not recognized, w itself, preceded by “@”, was placed in the query, as an argument of #sum. The query thus has the form

$$ \#\mathbf{sum}(Q_1, Q_2, \ldots, Q_n) $$

where each Q_i is a word from L_i in base form, a #syn expression whose arguments are a compound from L_i in base form and the components (in base form) of the compound, or an unrecognized word from L_i preceded by “@”.

The query, corresponding to T_i, for SPLIT-split-aut was automatically constructed. w was given to SWETWOL as input. Assume that w was recognized. If only one base form occurred in (the readings part of) the cohort, the base form was placed in the query as an argument of #sum, unless the base form was analyzed as a compound. In that case, the base form and the compound components that occurred in the cohort were placed inside #syn, and the resulting #syn expression was placed in the query, as an argument of #sum. If more than one base form occurred in the cohort, each of the base forms was placed inside #syn, and the resulting #syn expression was placed in the query, as an argument of #sum. In the case of compounds, each compound component that occurred in the cohort was also placed inside #syn. If w was not recognized, w was treated as in the intellectual case. The query thus has the form

$$ \#\mathbf{sum}(Q_1, Q_2, \ldots, Q_n) $$

where each Q_i is a word from L_i in base form, a #syn expression whose arguments are one or more base forms given by SWETWOL, possibly together with compound components in the form given by SWETWOL, or an unrecognized word from L_i preceded by “@”.
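The automatic construction for SPLIT-split-aut can be sketched as below. The cohort structures are simplified stand-ins for SWETWOL output: a list of (base form, components) readings, or None for an unrecognized word. The grouping into #syn and #sum follows the description above:

```python
# Sketch of automatic query construction from simplified cohorts.
def query_term(word, cohort):
    """cohort: list of (base_form, components) readings, or None if unrecognized."""
    if cohort is None:
        return "@" + word               # unrecognized words keep the '@' markup
    terms = []
    for base, components in cohort:
        terms.append(base)
        terms.extend(components)
    seen = []                           # deduplicate while preserving order
    for t in terms:
        if t not in seen:
            seen.append(t)
    # a single non-compound base form stands alone; otherwise wrap in #syn
    return seen[0] if len(seen) == 1 else "#syn(" + " ".join(seen) + ")"

def build_query(analyzed):
    return "#sum(" + " ".join(query_term(w, c) for w, c in analyzed) + ")"

analyzed = [
    ("marinbiologer", [("marinbiolog", ["marin", "biolog"]),
                       ("marinbiologe", ["marin", "bio", "loge"])]),
    ("xyz", None),
]
print(build_query(analyzed))
# #sum(#syn(marinbiolog marin biolog marinbiologe bio loge) @xyz)
```

Note how the spurious reading contributes bio and loge to the #syn group; SPLIT-EL-split-el-aut differs only in applying the Compound Elimination Principle to the cohort first, which removes that reading.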

Like the query for SPLIT-split-aut, the query for SPLIT-EL-split-el-aut was automatically constructed. However, if w was recognized, the Compound Elimination Principle was applied to the generated cohort. After this application, the procedure described above for SPLIT-split-aut was followed. If w was not recognized, w was treated as in the intellectual case. The query thus has the same form as the query for SPLIT-split-aut. (See Appendix 1 for example queries for the combinations based on normalization.)

Queries for STEM-stems

For the combination STEM-stems, the query corresponding to T_i was constructed in the following way. Each word w in L_i was stemmed by the employed stemmer. The resulting word stem, say w_s, was then placed in the query, as an argument of #sum. The query thus has the form

$$ \#\mathbf{sum}(Q_1, Q_2, \ldots, Q_n) $$

where each Q_i is a stem obtained from a word from L_i.

Findings

In this section, we report the results of the experiment and the outcome of the statistical analysis. The retrieval effectiveness of the seven combinations of indexing strategies with query terms was evaluated as mean uninterpolated average precision (MAP), heavily used in the TREC environment (Voorhees, 2004), and as (interpolated) precision at 11 standard recall levels. Uninterpolated average precision (AP) for an IR method and a given topic is the sum of the precision values, obtained after each new known relevant document is observed in the ranking of m documents associated with the IR method and the topic, divided by the number of known relevant documents for the topic. In our case, m was set to 1000.
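The AP measure just described can be sketched as follows, with a toy ranking and toy relevance judgments:

```python
# Uninterpolated average precision (AP): sum the precision at each rank
# where a known relevant document appears, then divide by the total number
# of known relevant documents for the topic.
def average_precision(ranking, relevant, m=1000):
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking[:m], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Toy example: relevant docs A and B; A retrieved at rank 1, B at rank 3
print(average_precision(["A", "X", "B"], {"A", "B"}))  # (1/1 + 2/3) / 2 ≈ 0.833
```

MAP is then simply the mean of these per-topic AP values over the 54 topics.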

MAP and precision at standard recall levels

The MAP values for the seven combinations of indexing strategies with query terms are given in Table 1. The combinations are ordered according to descending MAP values. The improvements in percent relative to the baseline (INFL-orig), where no attempt was made to counteract the problem of morphological variation of query terms in the document database, are reported inside parentheses. The best performing combination with respect to MAP is INFL-trunc, which uses manually generated truncation stems as query terms. A performance gain of 32.1% is obtained by this combination. The stemming combination, STEM-stems, performs second best, and obtains a performance gain of 22.5%.

Table 1 MAP values for the seven combinations of indexing strategies with query terms

The four combinations based on normalization (the SPLIT combinations) employ compound splitting, both during indexing and at the query phase. These combinations also outperform the baseline, and they exhibit similar performance. The percentage improvements are between 19% and 22%, approximately. Note that the difference in absolute MAP values between the best performing combination among these four (SPLIT-split-int) and the worst performing combination (SPLIT-split-aut) is less than one percentage unit. The former combination involves intellectually constructed queries, while the latter involves automatically constructed queries. Neither of these combinations makes use of the Compound Elimination Principle.

With respect to precision at standard recall levels (Fig. 1), INFL-trunc has the highest precision values at 10 of the 11 recall levels. The only exception occurs at the lowest level, 0.0, where STEM-stems performs slightly better than INFL-trunc. STEM-stems and the normalization/splitting combinations exhibit similar behavior over the recall levels. However, at the lower levels 0.0, 0.1, 0.2 and 0.3, SPLIT-split-aut is below STEM-stems and the other SPLIT combinations by a few percentage units. The poor performance of the baseline is clearly reflected in Fig. 1, where the curve for INFL-orig is below every other curve.

Fig. 1 Mean interpolated precision for the combinations at 11 standard recall levels

Comparison to the median AP

To get a more detailed view of the differences between the combinations, we also made a topic-by-topic comparison to the median AP of all seven combinations (see Appendix 2). All normalization and splitting (SPLIT) combinations deviate only a little from the median AP. The combinations with inflected indices (INFL) and stemming (STEM) show more variation, both positive and negative. For INFL-orig, the number of negatively affected topics, and also the extent of the effect, is large (see Appendix 2 and Table 2). INFL-trunc has more positive than negative cases, and the negative effect is also less extensive; only for two topics (original CLEF numbers 165 and 185) is the negative deviation from the median large. In these two cases there was only one known relevant document per topic, which the combination ranked rather low. By contrast, INFL-orig suffered from missing relevant documents in general. The stemmed combination has more negative than positive cases; however, the negative effect is less extensive than the positive.

Table 2 Number of topics in which a combination has positive, negative or no effect compared to the median AP

Significance testing

Associated with a given combination of indexing strategies with query terms is a distribution of 54 AP values, one for each topic. We explored the seven distributions of AP values, graphically and statistically. The exploration gave evidence against the normality assumption made in one-way analysis of variance for repeated measures. Therefore, we decided to use the Friedman test, a non-parametric counterpart to one-way analysis of variance for repeated measures (Siegel and Castellan 1988). The Friedman test uses ranks as data, and tests the null hypothesis that the population medians for the treatment conditions are identical, i.e., that the treatment conditions have identical effects. In the present context, (1) the ranks are obtained from the AP values, and (2) the null hypothesis states that the different combinations of indexing strategies with query terms have no differential effect on AP. The significance level was set to 0.05.
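A minimal sketch of such a test, run with scipy on synthetic AP data (the actual per-topic AP values are not reproduced here), might look like this:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Synthetic AP values: 54 topics (rows) x 7 combinations (columns).
rng = np.random.default_rng(0)
base = rng.uniform(0.1, 0.6, size=(54, 1))
ap = np.clip(base + rng.normal(0.0, 0.05, size=(54, 7)), 0.0, 1.0)
ap[:, 0] *= 0.5  # make the first combination (a "baseline") clearly worse

# Friedman test: are the median APs of the seven combinations identical?
stat, p = friedmanchisquare(*(ap[:, j] for j in range(7)))
print(f"Fr = {stat:.2f}, p = {p:.2e}")
```

The test ranks the seven AP values within each topic and compares the rank sums; a small p value leads to rejecting the null hypothesis of identical effects, as happened in the study.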

The outcome of the Friedman test showed that there were significant differences between the combinations (Fr = 24.73, df = 7 − 1 = 6, n = 54, p < 0.001). The rank sums over the combinations are shown in Table 3, where the combinations are ordered by descending rank sum. Clearly, the rank sum of INFL-orig is considerably lower than the rank sums of the other combinations. Note that STEM-stem, which has the second best MAP value, has the second lowest rank sum.

Table 3 Rank sums over the seven combinations of indexing strategies with query terms

Since the differences turned out to be significant, multiple comparisons between the combinations were performed in order to find out which combinations differ from each other. The adopted test compares the absolute value of the difference between the rank sums Ri and Rj (where Ri (Rj) is the rank sum for the ith (jth) treatment condition, and i ≠ j) to a critical z value (Siegel and Castellan 1988). Also for these tests, the significance level was set to 0.05. It turned out that the performance of the baseline combination INFL-orig is significantly worse than that of the combinations INFL-trunc, SPLIT-split-int and SPLIT-split-aut. However, the baseline is not significantly worse than the combinations STEM-stem, SPLIT-EL-split-int and SPLIT-EL-split-aut. For any pair of combinations both distinct from INFL-orig, the difference was non-significant.
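The pairwise comparison of rank sums can be sketched as follows. The rank sums below are hypothetical (chosen only to total n · k · (k + 1)/2 = 1512, as 54 topics and 7 conditions require), not the values of Table 3, and the critical-difference formula follows the Siegel and Castellan procedure for rank sums:

```python
import math
from itertools import combinations
from scipy.stats import norm

k, n, alpha = 7, 54, 0.05  # conditions, topics, significance level

# Hypothetical rank sums (they must total n * k * (k + 1) / 2 = 1512).
rank_sums = {"INFL-trunc": 245, "STEM-stem": 235, "SPLIT-split-int": 232,
             "SPLIT-EL-split-int": 228, "SPLIT-split-aut": 225,
             "SPLIT-EL-split-aut": 222, "INFL-orig": 125}

# Critical difference: |Ri - Rj| >= z * sqrt(n * k * (k + 1) / 6),
# with z taken at alpha adjusted for the k * (k - 1) / 2 comparisons.
z = norm.ppf(1 - alpha / (k * (k - 1)))
crit = z * math.sqrt(n * k * (k + 1) / 6)

significant = [(a, b) for a, b in combinations(rank_sums, 2)
               if abs(rank_sums[a] - rank_sums[b]) >= crit]
```

With these invented rank sums, only the pairs involving INFL-orig exceed the critical difference; which pairs come out significant with the real Table 3 values is exactly what is reported in the text above.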

Discussion

With respect to Swedish, the results show that if no action against the morphological variation of words is taken at the index construction phase, such action becomes necessary at the query phase. By comparing the INFL-orig and INFL-trunc combinations we can see that unprocessed query terms run against an inflected word form index yield a significantly worse result than truncated query terms. The truncated query terms, on the other hand, provide the best effectiveness in the overall comparison (see Fig. 1 and Table 1). This might seem surprising, since truncation is bound to expand queries intensively. For example, in topic 152 the truncation stem barn (“child”) had 997 corresponding index entries starting with the string barn. Here, collecting the query terms into synonym groups is essential; the query structure restrains the retrieval of irrelevant documents.
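Prefix truncation against an inflected word form index can be sketched as below; the index entries are a small invented sample rather than the 997 real matches for barn:

```python
# A truncated query term such as "barn*" matches every index entry that
# begins with the string, and the matches are collected into a single
# synonym group so that any of them satisfies the query term.
index_entries = ["barn", "barnen", "barnet", "barnbok", "barnomsorg",
                 "bok", "boken"]

def expand_truncated(stem: str, entries: list[str]) -> list[str]:
    """Return all index entries beginning with the truncation stem."""
    return [e for e in entries if e.startswith(stem)]

synonym_group = expand_truncated("barn", index_entries)
```

Treating the expansion as one synonym group, rather than as many independent query terms, is what keeps the heavy expansion from flooding the result list with irrelevant documents.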

A topic by topic analysis of INFL-orig and INFL-trunc reveals their tendency to deviate from the median performance (see Appendix 2); for the former the deviation is mainly negative due to missed relevant documents; for the latter the deviation is more often positive, yet in some cases relevant documents are ranked low with a loss in AP.

The two other indexing approaches, normalization and stemming, are very similar in MAP performance. The differences between the SPLIT and STEM combinations range from about 0.4 to 1 percentage unit, so the practical meaning of the differences is not consequential (see Table 1). This is an interesting result, because for Finnish, which is also a highly inflectional language, normalization performed significantly better than Snowball stemming (Kettunen et al. 2005). If we compare normalization and stemming topic by topic, it is obvious that stemming is less steady: it deviates much more, both positively and negatively, from the median performance. In the case of normalization, it is also worth noting that intellectual and automatic query construction yield practically identical results in terms of MAP.

The results are in accordance with earlier studies, confirming that any method of conflation, as well as truncation, is better than using query terms as such against an inflected word form index. Although the test was executed in a Swedish collection only, the results may be indicative for indexing and retrieval on the Internet: inflected word form indices exist for most languages (including highly inflectional ones), and there is typically no support for overcoming morphological variation at the query phase.

Conclusion

In this study we have considered problems of word form variation in Swedish full text retrieval. Four different indexing methods were combined with different methods for processing query terms, and the retrieval effectiveness of the combinations was tested in the CLEF 2003 collection with 54 topics. The first indexing method was to store all text words as they appear in the texts, resulting in an inflected word form index. Two query types were combined with this index: first, query terms taken as they appear in the topics (original query terms); second, intellectual truncation of query terms. These two combinations produced the best and the worst performance: the inflected word form index with original query terms was significantly worse than the combinations INFL-trunc, SPLIT-split-int and SPLIT-split-aut; the combination with truncated query terms had the best performance overall, though it differed significantly only from the worst combination.

The second indexing method was to normalize text words, split compound words and normalize the compound components. The third method was a further variation of this: certain readings of compound components were eliminated from the cohort in order to omit spurious readings. These two normalized indices were combined with intellectually formulated normalized queries, resulting in the SPLIT-split-int and SPLIT-EL-split-int combinations. Automatic query construction with query term normalization and splitting (SPLIT-split-aut and SPLIT-EL-split-aut) was also tested, to determine whether there is a difference in performance between the intellectual and the automatic method. All normalized combinations were very close in performance, whether compared in terms of MAP or topic by topic.

Stemming text words was the fourth indexing method, combined with stemmed query terms. This combination had the second best MAP overall. Stemming, although a grammatically cruder method of conflation than normalization, proved to be a good alternative for Swedish indexing. However, the number of negatively affected topics with respect to the median AP is greater for stemming than for normalization.