Introduction

Swedish differs from English in many respects. Properties of Swedish not shared by English include a high frequency of compounds, the use of glue morphemes in compounds, and a high proportion of homographs. Moreover, Swedish is inflectionally more complex than English. Thus, results obtained for English information retrieval cannot be directly applied to Swedish.

The morphological variation of query terms in the document collection is a well-known problem in information retrieval (IR) research. IR researchers have attempted to counteract it by applying different conflation methods in the indexing process, such as stemming (which usually refers to the removal of suffixes, inflectional and derivational, from word forms) and normalization (the transformation of inflected word forms to their base forms). We use the term conflation for the process of grouping morphological term variants (Frakes, 1992). Stemming and normalization attempt to group the morphological variants in the documents by associating them with a common form. This form acts as a representative of the variants and can be placed in the index, instead of the variants, with pointers to the documents where the variants occur. If the representative form is also used in the query, documents that contain different variants are retrieved. Stemming can be considered a more liberal method than normalization, and it is possible for semantically unrelated words to be associated with the same stem (the outcome of the stemming process).
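The grouping of variants under a representative form can be sketched as follows. The conflation table and documents below are invented for illustration only; in practice, the mapping would come from a stemmer or a normalizer:

```python
# Illustrative sketch of conflation at indexing time.
# Each surface form is mapped to a representative form; the index stores
# the representative with pointers to documents where any variant occurs.

def build_conflated_index(docs, conflate):
    """docs: {doc_id: [tokens]}; conflate: surface form -> representative form."""
    index = {}
    for doc_id, tokens in docs.items():
        for token in tokens:
            index.setdefault(conflate(token), set()).add(doc_id)
    return index

# Toy conflation table for English variants (assumed, for illustration only)
table = {"cities": "city", "city": "city", "organizes": "organize"}
docs = {1: ["cities"], 2: ["city"], 3: ["organizes"]}
index = build_conflated_index(docs, lambda t: table.get(t, t))
print(sorted(index["city"]))  # [1, 2]: both variant-bearing documents are found
```

With this index, a query on the representative form city retrieves documents containing either variant, which is the effect the conflation methods above aim for.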

In this paper, we look at the problem of morphological variation of query terms in the document collection. The study involves four indexing strategies. The baseline strategy is to place each word form that occurs in the texts of the collection in the index as such. In particular, all the inflected variants of a given word are placed in the index unchanged. The application of the baseline strategy yields an inflected word form index. Two of the other three indexing strategies are based on normalization: each inflected word form that occurs in the text collection is transformed to a base form, which is then placed in the index. An index generated in this manner is called a base word form index, and such an index, in principle, contains no inflected word forms. The two normalization strategies involve compound splitting: a compound that occurs in a document is split into its components, and these components (in base form) and the compound itself are placed in the index, pointing to the same address. In this study, compound splitting was also performed at the query phase. The fourth indexing strategy is to stem the word forms that occur in the texts and place the resulting stems in the index. In this way, a stem index is generated.

This study compares seven different combinations of indexing strategies with query terms with respect to retrieval effectiveness. The IR system used in the experiment is a best match (probabilistic) system. The aim of the study is to generate knowledge of how the seven combinations behave with respect to Swedish texts and retrieval effectiveness. The study is based on Ahlgren (2004); however, some methods tested here were not tested in that work.

The remainder of this paper is organized as follows. In Section 2, some properties of Swedish related to IR are given. Section 3 reports earlier research, while data and methods are described in Section 4. Findings are reported in Section 5, and these are discussed in Section 6. Conclusions, finally, are given in Section 7.

Some properties of the Swedish language related to IR

For nouns, a phenomenon that is interesting from an IR point of view is mutation (Thorell, 1977). Consider the transition from the base form stad (“city”) to the plural städer (“cities”). If we apply right hand truncation (which is applied in this study, see Section 4.3) to stad, truncating immediately to the right of the final d, at least one inflectional form of stad would clearly be missed. Alternatively, if a stemming algorithm for Swedish does not associate these two word forms with the same stem, the query stad would miss städer.
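The effect of mutation on right hand truncation can be illustrated with a small sketch, where truncation is simulated as prefix matching over a toy vocabulary of inflected word forms:

```python
# Simulate right-hand truncation as prefix matching over a vocabulary of
# inflected word forms (toy vocabulary for illustration only).
def truncation_match(stem, vocabulary):
    """Return all vocabulary strings that begin with the truncation stem."""
    return sorted(w for w in vocabulary if w.startswith(stem))

vocab = ["stad", "staden", "städer", "städerna"]
print(truncation_match("stad", vocab))
# ['stad', 'staden']: the mutated plural forms with ä are missed
```

The truncation stem stad fails to match städer and städerna precisely because of the vowel mutation described above.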

Swedish verbs can be divided into weak and strong, according to the way the past tense is formed (Thorell, 1977). So-called strong verbs, i.e., verbs whose past tense is formed by changing the vowel of the stem, give rise to retrieval problems similar to those caused by mutation. Consider the strong verb ligga (“lie”): ligga, låg (“lay”) and legat (“lain”). Again, if right hand truncation is applied, with ligga truncated immediately to the right of the second occurrence of g, both the past tense form låg and the supine form legat are missed.

In contrast to English, in Swedish the process of compounding normally yields a single word. For example, the Swedish equivalent of jazz music is jazzmusik. This property of Swedish, that compounding normally yields single words, is relevant to IR. A document in Swedish about jazz music, indexed by jazzmusik but neither by jazz nor musik, would neither be retrieved by the query musik nor by the query jazz. However, if jazzmusik is split into its components during indexing, the document would be retrieved by both queries. Thus, without compound splitting relevant documents may not be found.

In Swedish, derivational suffixes play a more important role than derivational prefixes (Malmgren, 1994). Most Swedish suffix derivatives belong to one of the parts of speech noun, adjective or verb, and the part of speech of the derivative is determined by the suffix (Malmgren, 1994). For instance, -het is a substantival, -aktig an adjectival and -era a verbal suffix. If we combine these with nykter (“sober”), röd (“red”) and analys (“analysis”), respectively, we get the derivatives nykterhet (“soberness”), rödaktig (“reddish”) and analysera (“analyze”).

Another property of Swedish relevant to IR is the use of glue morphemes in compounding. One such glue morpheme is -s. If compound splitting is performed during indexing, it is desirable to split a compound like märkesvaror (“proprietary (branded) goods”) without retaining the -s. The reason is that märke, and not märkes, would be used as a query term. Finally, homography is a very common phenomenon in Swedish (Malmgren, 1994). For example, plan (“airplane”) and plan (“open space”) are homographs. If a user interested in airplanes uses plan in the query, irrelevant documents about open spaces may be retrieved.

Earlier research

Several earlier studies have shown that conflation can be useful in relation to retrieval effectiveness. The research by Lennon et al. (1981), Krovetz (1993) and Hull (1996) supports the idea that stemming is useful for the English language. Harman (1991), though, did not see any significant differences between stemming and non-stemming, and the same holds for the control experiment performed by Popovič and Willett (1992).

The picture is clearer for morphologically more complex languages than for English. Kalamboukis (1995) and Popovič and Willett (1992) reported improvements in retrieval effectiveness due to stemming for Greek and Slovene, respectively. With respect to Dutch, Kraaij and Pohlman (1996) report that for each recall measure used in the study, an inflectional algorithm, extended with compound splitting and compound generation, was significantly better than non-stemming.

Savoy (1999) used both long and short French test documents. Not only conflation was studied, but also stopwording with regard to retrieval effectiveness. It was observed that, for the test collection with short documents, conflation in the form of stemming was beneficial for most of the weighting scheme/ranking function combinations used, irrespective of whether a stop list was applied or not. For Turkish, Ekmekçioglu and Willett (2000) applied a morphological analyzer at the query phase and compared conflation to non-conflation. They report that conflation was beneficial, and the differences in effectiveness between conflation and non-conflation were statistically significant.

Alkula (2001) dealt with Finnish and used three indices, an inflected word form index and two base word form indices. With regard to one of these latter indices, compound splitting was performed. For both precision and recall, methods that involved normalization performed best. However, the baseline was manual truncation of search words, and a Boolean IR system was used. Kettunen et al. (2005) compared normalization with compound splitting and stemming as indexing strategies for Finnish in a probabilistic IR environment. Queries were processed according to the indexing strategy. Further, they compared truncation (automatic) in an inflected word form index to the former methods. Their results showed that normalization performed significantly better than the other methods.

Carlberger et al. (2001) used a database of Swedish news articles and a Swedish stemmer. It turned out that stemming improved precision as well as recall. In Tomlinson (2002), the effects of stemming on retrieval effectiveness were investigated for several European languages, among them Swedish. CLEF 2002 test collections were used in the experiments. For Swedish, stemming improved mean average precision by 4%. The improvement was not statistically significant, though. The corresponding (statistically significant) improvement for German was as high as 27%. However, German compounds were split, both during indexing and at the query phase, while Swedish compounds were not.

Tomlinson (2003) compared non-conflation, inflectional stemming (lexicon-based) and inflectional/derivational stemming, using CLEF 2003 test collections. With respect to inflectional/derivational stemming, Porter stemmers were used. The study involved several languages, among them Swedish (for which compound splitting was not employed). For this language, inflectional stemming had a significantly better performance (mean average precision) than non-conflation, and inflectional/derivational stemming had a significantly better performance than inflectional stemming. However, with regard to mean precision at DCVs 5, 10 and 20, the differences between the latter two approaches were fairly small.

Savoy (2003) used CLEF 2003 test collections. The study dealt with several languages, and tested the effect of several variables (like indexing strategy, retrieval model and pseudo-relevance feedback). Further, inflectional stemming was utilized. For Swedish, three indexing strategies were compared: a word-based, a compound splitting (also with respect to queries) and a 4-gram strategy. For seven of the 11 tested retrieval models, the compound splitting strategy performed best with respect to mean average precision. This approach outperformed the word-based indexing strategy for each retrieval model. However, the maximal difference was no more than 2.26 percentage units.

Finally, Braschler and Ripplinger (2004) dealt with German and studied the effects of conflation and compound splitting on retrieval effectiveness. Compound splitting took place both during indexing and at query phase. Further, both short and long queries were used. Methods that used conflation in combination with compound splitting had better effectiveness than methods that only used conflation, and these latter methods performed better than the baseline of non-conflation and non-splitting. However, the improvements obtained, in relation to the baseline, were higher for short queries than for long.

In this study we compare indexing strategies utilizing grammatical morphological analysis (normalization) to non-grammatical approaches (stemming and inflected word form indices) for Swedish. Further, we consider several grammatical and non-grammatical, intellectual and automatic methods for handling query terms in the query construction phase, for each index type. Since Swedish is a morphologically relatively complex language, it is reasonable to expect that conflation would enhance retrieval effectiveness also with respect to the documents and requests used in this study. Moreover, the results reported in Savoy (2003) and Braschler and Ripplinger (2004) support the hypothesis that compound splitting, both during indexing and at the query phase, would be fruitful.

Data and methods

The indexing strategies of the study

The first indexing strategy of the study, which we call INFL, is to place the document terms as such in the index. In particular, all the inflected variants of a given word are placed in the index as such. The application of INFL yields an inflected word form index. The second indexing strategy, SPLIT, employs compound splitting and can be described as follows: normalize each inflected word form and place the resulting base form(s) in the index. For compounds, split them and place the components (in base form) in the index. The application of SPLIT yields a base word form index.

Consider a compound like stålindustrin (“the steel industry”). An index generated by SPLIT will have an entry, not just for stålindustri, but also for stål and industri. stålindustrin is first normalized to stålindustri. stålindustri is then split up into its components (in base form), stål and industri. Finally, stålindustri, stål and industri are placed in the index, pointing to the same address. With an index generated by SPLIT, a user can employ the components of a compound as separate query terms. Such an index can therefore be said to address, besides the problem of inflectional variation, the problem of (in compounds) embedded query terms.
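For a single token, the SPLIT strategy can be sketched as below. The normalization and splitting functions are hard-coded stand-ins here; in the study they are provided by the morphological analyzer described next:

```python
# Sketch of the SPLIT strategy for one token: normalize, split the compound,
# and index the base form together with its components (same address).
def split_index_terms(surface, normalize, split):
    base = normalize(surface)           # e.g. stålindustrin -> stålindustri
    components = split(base)            # e.g. stålindustri -> [stål, industri]
    return sorted({base, *components})  # all entries point to the same address

terms = split_index_terms(
    "stålindustrin",
    normalize=lambda w: {"stålindustrin": "stålindustri"}[w],
    split=lambda b: {"stålindustri": ["stål", "industri"]}[b],
)
print(terms)  # ['industri', 'stål', 'stålindustri']
```

A document containing stålindustrin is thus reachable through stål and industri as well, which addresses the embedded-query-term problem described above.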

In this study, the indexing strategies based on normalization employ SWETWOL (Karlsson, 1992), a morphological analyzer for the Swedish language based on TWOL, the two-level model created by Koskenniemi (1983). SWETWOL is meant to be used as a basic morphological tool in, e.g., information indexing (Karlsson, 1992). For a given word form as input, the output of SWETWOL is a cohort. A cohort has two parts: the input word form and zero or more readings. A reading consists of a base form of the input word form and a description of the input word with respect to part of speech and other grammatical categories. The following cohort is produced by SWETWOL as a response to the input word form organisationerna (“the organizations”):

    "<organisationerna>"
        "organisation" N UTR DEF PL NOM

SWETWOL analyzes organisationerna as a noun (N), with gender non-neuter (UTR), in definite form (DEF) plural (PL), in nominative case (NOM), and with the base form organisation (“organization”).

When a compound is given as input to SWETWOL, it is split in the output. The cohort for pappersbruken (“the paper-mills”) is

    "<pappersbruken>"
        "pappers#bruk" N NEU DEF PL NOM

where NEU stands for gender neuter. Note that the glue morpheme -s is retained. Obviously, it is desirable to place papper (“paper”) and not pappers (“papers”) in an index like SPLIT. In this study, SWETWOL has been tuned to return base forms also of components of compounds. In the present case, pappers would be normalized to papper, which is then placed in the index.

We define a word form to be morphologically ambiguous if SWETWOL gives more than one reading for it. As an example of a morphologically ambiguous word form, consider the compound kulturdebatt (“cultural debate”), with the following cohort:

    "<kulturdebatt>"
        "kultur#debatt" N UTR INDEF SG NOM
      * "kult#ur#debatt" N UTR INDEF SG NOM
      * "kul#tur#debatt" N UTR INDEF SG NOM

In this cohort, the first reading is appropriate, the other two readings are spurious. (We use a star, “*”, as a prefix for spurious readings.) The precision, i.e., the fraction of the appropriate readings of a cohort in relation to all the readings of a cohort, is in this case 1/3. One can ask if the precision of the compound mechanism of SWETWOL is high enough for the indexing strategy SPLIT to be fruitful with respect to retrieval effectiveness. Given a compound as input to SWETWOL, low precision may characterize the output, as we saw in the kulturdebatt example. If SPLIT is used in such a situation, the components of a spurious compound reading are placed in the index in base form. If these components in base form are semantically unrelated to the content of the source document of the input compound, retrieval effectiveness (precision, in this case) may be affected negatively if the components (in base form) are used as query terms. Let us consider the compound marinbiologer (“marine biologists”). Given this compound as input, SWETWOL gives the following cohort as output:

    "<marinbiologer>"
        "marin#biolog" N UTR INDEF PL NOM
      * "marin#bio#loge" N UTR INDEF PL NOM

The last reading is spurious. Its last component means “barn”, “box” or “lodge”. Assume that marinbiologer occurs in a document about marine biologists. If SPLIT is used, the following terms are placed in the index:

bio biolog loge marin marinbiolog marinbiologe

bio (“cinema”), e.g., is then placed in the index, with a pointer to the document about marine biologists. A user interested in cinema documents and employing bio as a query term will retrieve at least one document about marine biologists.

Local disambiguation refers to the elimination of readings in a cohort (Karlsson, 1992). Such morphological disambiguation uses no cohort-external information. Only the information in the cohort itself is used for deciding which of the readings in the cohort should be eliminated.

To reduce, for a given document d and in relation to SPLIT, the number of index entries e such that (1) the term t of e is semantically unrelated to d, and (2) d is represented at e, local disambiguation was applied in a third indexing strategy, SPLIT-EL. Like SPLIT, SPLIT-EL involves compound splitting. The sole difference between the two is that SPLIT-EL makes use of the Compound Elimination Principle (Karlsson, 1992):

If a cohort C contains readings with n and m compound boundaries, discard all readings with m compound boundaries if m > n.

This means that only the readings with a minimum number of compound boundaries are considered; the other readings are eliminated. For the term marinbiologer, SPLIT-EL eliminates the spurious last reading, and only the terms

biolog marin marinbiolog

are taken to the index.
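The Compound Elimination Principle amounts to keeping, within a cohort, only the readings with the fewest compound boundaries. A minimal sketch, with boundaries marked by “#” as in the SWETWOL readings shown earlier:

```python
# Keep only the readings with the minimum number of compound boundaries
# ('#') in a cohort; all other readings are eliminated.
def eliminate_compounds(readings):
    fewest = min(r.count("#") for r in readings)
    return [r for r in readings if r.count("#") == fewest]

print(eliminate_compounds(["marin#biolog", "marin#bio#loge"]))
# ['marin#biolog']
print(eliminate_compounds(["kultur#debatt", "kult#ur#debatt", "kul#tur#debatt"]))
# ['kultur#debatt']
```

In both example cohorts, the surviving reading is the appropriate one, which is exactly the behavior the principle relies on.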

Terms that are not recognized by SWETWOL are preceded by the character “@” in the indices. This markup facilitates access to terms not recognized. Further, it has an impact on query construction (see Section 4.3.3).

The remaining indexing strategy is to stem the word forms that occur in the documents, and then place the resulting stems in the index. We let STEM denote this strategy. In this study, stemming was performed by the Snowball stemmer for Swedish (Porter, 2001).

Combinations of indexing strategies with query terms

Seven IR methods were evaluated in the study. These methods constitute the combinations of indexing strategies with query terms. In the following list, the combinations are described and numbered.

  1. Application of the indexing strategy INFL and use of original words, from a topic text, as query terms. This is the baseline combination of the study, and it is denoted by INFL-orig.

  2. Application of the indexing strategy INFL and use of truncation stems as query terms. The combination is denoted by INFL-trunc.

  3. Application of the indexing strategy SPLIT and use of intellectually selected words in base form (compounds split) as query terms. The combination is denoted by SPLIT-split-int.

  4. Application of the indexing strategy SPLIT and use of automatically selected words in base form (compounds split) as query terms. The combination is denoted by SPLIT-split-aut.

  5. Application of the indexing strategy SPLIT-EL and use of intellectually selected words in base form (compounds split) as query terms. The combination is denoted by SPLIT-EL-split-int.

  6. Application of the indexing strategy SPLIT-EL and use of automatically selected words in base form (compounds split and the Compound Elimination Principle applied) as query terms. The combination is denoted by SPLIT-EL-split-el-aut.

  7. Application of the indexing strategy STEM and use of automatically generated word stems as query terms. The combination is denoted by STEM-stems.

Test collection, IR system and construction of queries

We used the Swedish CLEF 2003 test collection. The document set of this collection consists of 142,819 news articles (352 MB) from Tidningarnas Telegrambyrå (1994/1995). The number of topics in the collection is 60, and the topics are numbered 141–200. Since six topics had no (known) relevant documents, they were excluded. Thus, the sample size of the study is n = 54. The number of known relevant documents for the 54 topics is 1,006.

InQuery (Version 3.1), a probabilistic system based on the inference network model (Turtle, 1990; Turtle and Croft, 1990, 1991), was used as the test system. InQuery uses a best match technique and has a wide range of operators. (For a detailed description, see Allan et al., 1997; Callan et al., 1995; Rajashekar and Croft, 1995.)

For each topic, a word list was constructed from the description field (see Appendix 1). In this process, a stop list for Swedish was used. A word list L_i for a topic T_i is such that each word in L_i occurs in the description field of T_i. For each of the seven tested combinations of indexing strategies with query terms, a set of 54 queries, to be run against the method's corresponding index, was generated on the basis of the word lists.

Queries for INFL-orig

The baseline method of the study, the combination INFL-orig, uses original words from the word lists as query terms. Let L_i be the word list obtained from a topic T_i. The query, corresponding to T_i, for the combination INFL-orig was constructed by placing each word w in L_i in the query, as an argument of the #sum operator of InQuery. Informally, #sum specifies that the more of its arguments that are satisfied by a document, the better the document matches. The query thus has the form

$$ \#\mathbf{sum}(Q_1, Q_2, \ldots, Q_n) $$

where each Q_i is a word from L_i. Clearly, no attempt to counteract the problem of word variation in the document collection is made.

Queries for INFL-trunc

With regard to INFL-trunc, an attempt to counteract the problem of word variation in the document collection was made by using a search expert. The 60 topics were submitted to the search expert, together with the 60 word lists and instructions. For each word list, the search expert performed right hand truncation on the words he found appropriate to truncate (see Appendix 1). The character “,” was used as a boundary character.

Like the queries for INFL-orig (and for the combinations based on normalization and STEM-stems, see below), a query for INFL-trunc is a #sum expression. Since InQuery does not support truncation, truncated words could not be used directly in queries. Instead, retrieval using truncated words as query terms was simulated.

Let L_i be the word list obtained from topic T_i. Further, let w be a word that occurs in L_i. The query, corresponding to T_i, for INFL-trunc was constructed in the following way. Assume that the search expert has right hand truncated w. Let s be the truncation stem of w, i.e., the sequence of characters to the left of the boundary character. Each string in the inflected word form index that begins with s was retrieved, with the aid of a Unix script, and the retrieved strings were placed inside the #syn operator of InQuery. This operator treats its arguments as different expressions for the same concept, i.e., as synonyms. The resulting #syn expression was then placed in the query, as an argument of #sum. If w was not right hand truncated, then w itself was placed in the query, as an argument of #sum. The query thus has the form

$$ \#\mathbf{sum}(Q_1, Q_2, \ldots, Q_n) $$

where each Q_k is either a word from L_i, or a #syn expression whose arguments are terms retrieved from the inflected word form index of the study.
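The simulated truncation can be sketched as follows. The vocabulary and word lists are toy data, and the query is built as a plain string in the #sum/#syn notation used here; in the study, the matching index strings were retrieved with a Unix script:

```python
# Sketch of simulating truncation for InQuery: expand each truncated stem
# (marked by a trailing ',') against the index vocabulary, wrap the hits
# in #syn, and combine all query terms with #sum.
def build_trunc_query(terms, vocabulary):
    parts = []
    for term in terms:
        if term.endswith(","):          # ',' marks the truncation boundary
            stem = term[:-1]
            hits = sorted(w for w in vocabulary if w.startswith(stem))
            parts.append("#syn(" + " ".join(hits) + ")")
        else:
            parts.append(term)          # untruncated words are used as-is
    return "#sum(" + " ".join(parts) + ")"

vocab = ["stad", "staden", "musik", "jazzmusik"]
print(build_trunc_query(["stad,", "musik"], vocab))
# #sum(#syn(stad staden) musik)
```

The #syn expansion replays, inside the query, exactly the prefix matches that a native truncation operator would have produced against the inflected word form index.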

Queries for the combinations based on normalization

Let L_i be the word list generated from a topic T_i. Further, let w be a word that occurs in L_i. The query, corresponding to T_i, for SPLIT-split-int and SPLIT-EL-split-int was intellectually constructed. The aim of the intellectual construction was to detect whether omitting the semantically spurious terms bound to appear in the automatic query construction process would enhance effectiveness. w was given to SWETWOL as input. First, assume that w was recognized. Irrespective of whether w is a compound or not, we selected the contextually right reading and used the corresponding base form for the query. If w is a non-compound, the base form was placed in the query, as an argument of #sum. If w is a compound, we placed the base form together with its components (in base form) inside the #syn operator. The resulting #syn expression was then placed in the query, as an argument of #sum. We normalized a compound component ourselves if its base form was not given by SWETWOL. Second, if w was not recognized, w itself, preceded by “@”, was placed in the query, as an argument of #sum. The query thus has the form

$$ \#\mathbf{sum}(Q_1, Q_2, \ldots, Q_n) $$

where each Q_i is a word from L_i in base form, a #syn expression whose arguments are a compound from L_i in base form and the components (in base form) of the compound, or an unrecognized word from L_i preceded by “@”.

The query, corresponding to T_i, for SPLIT-split-aut was automatically constructed. w was given to SWETWOL as input. Assume that w was recognized. If only one base form occurred in (the readings part of) the cohort, the base form was placed in the query as an argument of #sum, unless the base form was analyzed as a compound. In that case, the base form and the compound components that occurred in the cohort were placed inside #syn, and the resulting #syn expression was placed in the query, as an argument of #sum. If more than one base form occurred in the cohort, each of the base forms was placed inside #syn, and the resulting #syn expression was placed in the query, as an argument of #sum. In the case of compounds, each compound component that occurred in the cohort was also placed inside #syn. If w was not recognized, w was treated as in the intellectual case. The query thus has the form

$$ \#\mathbf{sum}(Q_1, Q_2, \ldots, Q_n) $$

where each Q_i is a word from L_i in base form, a #syn expression whose arguments are one or more base forms given by SWETWOL, possibly together with compound components in the form given by SWETWOL, or an unrecognized word from L_i preceded by “@”.
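The automatic construction for SPLIT-split-aut can be sketched as below. The cohort structures are simplified stand-ins for SWETWOL output: a list of (base form, components) readings, or None for an unrecognized word. The grouping into #syn and #sum follows the description above:

```python
# Sketch of automatic query construction from simplified cohorts.
def query_term(word, cohort):
    """cohort: list of (base_form, components) readings, or None if unrecognized."""
    if cohort is None:
        return "@" + word               # unrecognized words keep the '@' markup
    terms = []
    for base, components in cohort:
        terms.append(base)
        terms.extend(components)
    seen = []                           # deduplicate while preserving order
    for t in terms:
        if t not in seen:
            seen.append(t)
    # a single non-compound base form stands alone; otherwise wrap in #syn
    return seen[0] if len(seen) == 1 else "#syn(" + " ".join(seen) + ")"

def build_query(analyzed):
    return "#sum(" + " ".join(query_term(w, c) for w, c in analyzed) + ")"

analyzed = [
    ("marinbiologer", [("marinbiolog", ["marin", "biolog"]),
                       ("marinbiologe", ["marin", "bio", "loge"])]),
    ("xyz", None),
]
print(build_query(analyzed))
# #sum(#syn(marinbiolog marin biolog marinbiologe bio loge) @xyz)
```

Note how the spurious reading contributes bio and loge to the #syn group; SPLIT-EL-split-el-aut differs only in applying the Compound Elimination Principle to the cohort first, which removes that reading.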

Like the query for SPLIT-split-aut, the query for SPLIT-EL-split-el-aut was automatically constructed. However, if w was recognized, the Compound Elimination Principle was applied to the generated cohort. After this application, the procedure described above for SPLIT-split-aut was followed. If w was not recognized, w was treated as in the intellectual case. The query thus has the same form as the query for SPLIT-split-aut. (See Appendix 1 for example queries for the combinations based on normalization.)

Queries for STEM-stems

For the combination STEM-stems, the query corresponding to T_i was constructed in the following way. Each word w in L_i was stemmed by the employed stemmer. The resulting word stem, say w_s, was then placed in the query, as an argument of #sum. The query thus has the form

$$ \#\mathbf{sum}(Q_1, Q_2, \ldots, Q_n) $$

where each Q_i is a stem obtained from a word from L_i.

Findings

In this section, we report the results of the experiment and the outcome of the statistical analysis. The retrieval effectiveness of the seven combinations of indexing strategies with query terms was evaluated as mean uninterpolated average precision (MAP), heavily used in the TREC environment (Voorhees, 2004), and as (interpolated) precision at 11 standard recall levels. Uninterpolated average precision (AP) for an IR method and a given topic is the sum of the precision values, obtained after each new known relevant document is observed in the ranking of m documents associated with the IR method and the topic, divided by the number of known relevant documents for the topic. In our case, m was set to 1000.
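The AP measure just described can be sketched as follows, with a toy ranking and toy relevance judgments:

```python
# Uninterpolated average precision (AP): sum the precision at each rank
# where a known relevant document appears, then divide by the total number
# of known relevant documents for the topic.
def average_precision(ranking, relevant, m=1000):
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking[:m], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Toy example: relevant docs A and B; A retrieved at rank 1, B at rank 3
print(average_precision(["A", "X", "B"], {"A", "B"}))  # (1/1 + 2/3) / 2 ≈ 0.833
```

MAP is then simply the mean of these per-topic AP values over the 54 topics.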

MAP and precision at standard recall levels

The MAP values for the seven combinations of indexing strategies with query terms are given in Table 1. The combinations are ordered according to descending MAP values. The improvements in percent relative to the baseline (INFL-orig), where no attempt was made to counteract the problem of morphological variation of query terms in the document database, are reported inside parentheses. The best performing combination with respect to MAP is INFL-trunc, which uses manually generated truncation stems as query terms. A performance gain of 32.1% is obtained by this combination. The stemming combination, STEM-stems, performs second best, and obtains a performance gain of 22.5%.

Table 1 MAP values for the seven combinations of indexing strategies with query terms

The four combinations based on normalization (the SPLIT combinations) employ compound splitting, both during indexing and at the query phase. These combinations also outperform the baseline, and they exhibit similar performance. The percentage improvements are between 19% and 22%, approximately. Note that the difference in absolute MAP values between the best performing combination among these four (SPLIT-split-int) and the worst performing combination (SPLIT-split-aut) is less than one percentage unit. The former combination involves intellectually constructed queries, while the latter involves automatically constructed queries. Neither of these combinations makes use of the Compound Elimination Principle.

With respect to precision at standard recall levels (Fig. 1), INFL-trunc has the highest precision values at 10 of the 11 recall levels. The only exception occurs at the lowest level, 0.0, where STEM-stems performs slightly better than INFL-trunc. STEM-stems and the normalization/splitting combinations exhibit similar behavior over the recall levels. However, at the lower levels 0.0, 0.1, 0.2 and 0.3, SPLIT-split-aut is below STEM-stems and the other SPLIT combinations by a few percentage units. The poor performance of the baseline is clearly reflected in Fig. 1, where the curve for INFL-orig is below every other curve.

Fig. 1 Mean interpolated precision for the combinations at 11 standard recall levels

Comparison to the median AP

To get a more detailed view of the differences between the combinations, we also made a topic-by-topic comparison to the median AP of all seven combinations (see Appendix 2). All normalization and splitting (SPLIT) combinations deviate only a little from the median AP. The combinations with inflected indices (INFL) and stemming (STEM) show more variation, both positive and negative. For INFL-orig, the number of negatively affected topics, and also the extent of the effect, is large (see Appendix 2 and Table 2). INFL-trunc has more positive than negative cases, and the negative effect is also less extensive; only for two topics (original CLEF numbers 165 and 185) is the negative deviation from the median large. In these two cases there was only one known relevant document per topic, which the combination ranked rather low. By contrast, INFL-orig suffered from missing relevant documents in general. The stemmed combination has more negative than positive cases; however, the negative effect is less extensive than the positive.

Table 2 Number of topics in which a combination has positive, negative or no effect compared to the median AP

Significance testing

Associated with a given combination of indexing strategies with query terms is a distribution of 54 AP values, one for each topic. We explored the seven distributions of AP values, graphically and statistically. The exploration gave evidence against the normality assumption made in one-way analysis of variance for repeated measures. Therefore, we decided to use the Friedman test, a non-parametric counterpart to one-way analysis of variance for repeated measures (Siegel and Castellan 1988). The Friedman test uses ranks as data, and tests the null hypothesis that the population medians for the treatment conditions are identical, i.e., that the treatment conditions have identical effects. In the present context, (1) the ranks are obtained from the AP values, and (2) the null hypothesis states that the different combinations of indexing strategies with query terms have no differential effect on AP. The significance level was set to 0.05.
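A minimal sketch of such a test, run with scipy on synthetic AP data (the actual per-topic AP values are not reproduced here), might look like this:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Synthetic AP values: 54 topics (rows) x 7 combinations (columns).
rng = np.random.default_rng(0)
base = rng.uniform(0.1, 0.6, size=(54, 1))
ap = np.clip(base + rng.normal(0.0, 0.05, size=(54, 7)), 0.0, 1.0)
ap[:, 0] *= 0.5  # make the first combination (a "baseline") clearly worse

# Friedman test: are the median APs of the seven combinations identical?
stat, p = friedmanchisquare(*(ap[:, j] for j in range(7)))
print(f"Fr = {stat:.2f}, p = {p:.2e}")
```

The test ranks the seven AP values within each topic and compares the rank sums; a small p value leads to rejecting the null hypothesis of identical effects, as happened in the study.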

The outcome of the Friedman test showed that there were significant differences between the combinations (Fr = 24.73, df = 7 − 1 = 6, n = 54, p < 0.001). The rank sums over the combinations are shown in Table 3, where the combinations are ordered by descending rank sum. Clearly, the rank sum of INFL-orig is considerably lower than the rank sums of the other combinations. Note that STEM-stem, which has the second best MAP value, has the second lowest rank sum.

Table 3 Rank sums over the seven combinations of indexing strategies with query terms

Since the differences turned out to be significant, multiple comparisons between the combinations were performed in order to find out which combinations differ from each other. The adopted test compares the absolute value of the difference between the rank sums Ri and Rj (where Ri (Rj) is the rank sum for the ith (jth) treatment condition, and i ≠ j) to a critical z value (Siegel and Castellan 1988). Also for these tests, the significance level was set to 0.05. It turned out that the performance of the baseline combination INFL-orig is significantly worse than that of the combinations INFL-trunc, SPLIT-split-int and SPLIT-split-aut. However, the baseline is not significantly worse than the combinations STEM-stem, SPLIT-EL-split-int and SPLIT-EL-split-aut. For any pair of combinations both distinct from INFL-orig, the difference was non-significant.
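The pairwise comparison of rank sums can be sketched as follows. The rank sums below are hypothetical (chosen only to total n · k · (k + 1)/2 = 1512, as 54 topics and 7 conditions require), not the values of Table 3, and the critical-difference formula follows the Siegel and Castellan procedure for rank sums:

```python
import math
from itertools import combinations
from scipy.stats import norm

k, n, alpha = 7, 54, 0.05  # conditions, topics, significance level

# Hypothetical rank sums (they must total n * k * (k + 1) / 2 = 1512).
rank_sums = {"INFL-trunc": 245, "STEM-stem": 235, "SPLIT-split-int": 232,
             "SPLIT-EL-split-int": 228, "SPLIT-split-aut": 225,
             "SPLIT-EL-split-aut": 222, "INFL-orig": 125}

# Critical difference: |Ri - Rj| >= z * sqrt(n * k * (k + 1) / 6),
# with z taken at alpha adjusted for the k * (k - 1) / 2 comparisons.
z = norm.ppf(1 - alpha / (k * (k - 1)))
crit = z * math.sqrt(n * k * (k + 1) / 6)

significant = [(a, b) for a, b in combinations(rank_sums, 2)
               if abs(rank_sums[a] - rank_sums[b]) >= crit]
```

With these invented rank sums, only the pairs involving INFL-orig exceed the critical difference; which pairs come out significant with the real Table 3 values is exactly what is reported in the text above.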

Discussion

With respect to Swedish, the results show that if no action against the morphological variation of words is taken at the index construction phase, such action becomes necessary at the query phase. By comparing the INFL-orig and INFL-trunc combinations we can see that unprocessed query terms run against an inflected word form index yield a significantly worse result than truncated query terms. The truncated query terms, on the other hand, provide the best effectiveness in the overall comparison (see Fig. 1 and Table 1). This might seem surprising, since truncation is bound to expand queries intensively. For example, in topic 152 the truncation stem barn (“child”) had 997 corresponding index entries starting with the string barn. Here, collecting the query terms into synonym groups is essential; the query structure restrains the retrieval of irrelevant documents.
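Prefix truncation against an inflected word form index can be sketched as below; the index entries are a small invented sample rather than the 997 real matches for barn:

```python
# A truncated query term such as "barn*" matches every index entry that
# begins with the string, and the matches are collected into a single
# synonym group so that any of them satisfies the query term.
index_entries = ["barn", "barnen", "barnet", "barnbok", "barnomsorg",
                 "bok", "boken"]

def expand_truncated(stem: str, entries: list[str]) -> list[str]:
    """Return all index entries beginning with the truncation stem."""
    return [e for e in entries if e.startswith(stem)]

synonym_group = expand_truncated("barn", index_entries)
```

Treating the expansion as one synonym group, rather than as many independent query terms, is what keeps the heavy expansion from flooding the result list with irrelevant documents.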

A topic by topic analysis of INFL-orig and INFL-trunc reveals their tendency to deviate from the median performance (see Appendix 2); for the former the deviation is mainly negative due to missed relevant documents; for the latter the deviation is more often positive, yet in some cases relevant documents are ranked low with a loss in AP.

The two other indexing approaches, normalization and stemming, are very similar in MAP performance. The differences between the SPLIT and STEM combinations range from about 0.4 to 1 percentage unit, so the practical meaning of the differences is not consequential (see Table 1). This is an interesting result, because for Finnish, which is also a highly inflectional language, normalization performed significantly better than Snowball stemming (Kettunen et al. 2005). If we compare normalization and stemming topic by topic, it is obvious that stemming is less steady: it deviates much more, both positively and negatively, from the median performance. In the case of normalization, it is also worth noting that intellectual and automatic query construction yield practically identical results in terms of MAP.

The results are in accordance with earlier studies, confirming that any method of conflation, as well as truncation, is better than using query terms as such against an inflected word form index. Although the test was executed in a Swedish collection only, the results may be indicative for indexing and retrieval on the Internet: inflected word form indices exist for most languages (including highly inflectional ones), and there is typically no support for overcoming morphological variation at the query phase.

Conclusion

In this study we have considered problems of word form variation in Swedish full text retrieval. Four different indexing methods were combined with different methods for processing query terms, and the retrieval effectiveness of the combinations was tested in the CLEF 2003 collection with 54 topics. The first indexing method was to store all text words as they appear in the texts, resulting in an inflected word form index. Two query types were combined with this index: first, query terms taken as they appear in the topics (original query terms); second, intellectual truncation of query terms. These two combinations produced the best and the worst performance: the inflected word form index with original query terms was significantly worse than the combinations INFL-trunc, SPLIT-split-int and SPLIT-split-aut; the combination with truncated query terms had the best performance overall, though it differed significantly only from the worst combination.

The second indexing method was to normalize text words, split compound words and normalize the compound components. The third method was a further variation of this: certain readings of compound components were eliminated from the cohort in order to omit spurious readings. These two normalized indices were combined with intellectually formulated normalized queries, resulting in the SPLIT-split-int and SPLIT-EL-split-int combinations. Automatic query construction with query term normalization and splitting (SPLIT-split-aut and SPLIT-EL-split-aut) was also tested, to determine whether there is a difference in performance between the intellectual and the automatic method. All normalized combinations were very close in performance, whether compared in terms of MAP or topic by topic.

Stemming text words was the fourth indexing method, combined with stemmed query terms. This combination had the second best MAP overall. Stemming, although a grammatically cruder method of conflation than normalization, proved to be a good alternative for Swedish indexing. However, the number of negatively affected topics with respect to the median AP is greater for stemming than for normalization.