Classical retrieval and overlap measures satisfy the requirements for rankings based on a Lorenz curve
Introduction
Presence–absence data are a basic way of representing categorical data. As presence is usually coded as 1 and absence as 0, such data are also known as 1–0 data. Their binary nature makes them especially suited for computing, and their use in science is ubiquitous. This is also the case in the information sciences (Huot, Quoniam, & Dou, 1992; Magurran, 1991). Indeed, a standard approach to information retrieval and the study of overlap represents documents by an array of keywords and/or phrases. Note that we chose the term ‘array’ and not ‘vector’ because these entities are not vectors in the strict mathematical sense of the word. If keywords are chosen in a particular order, a document is represented by a presence–absence (1–0) array. Similarity between documents is then determined by comparing these document representations. In information retrieval and in overlap studies it is customary not to take common zeros into account (Salton & McGill, 1983). Indeed, keywords or phrases that occur in neither of the two documents do not make the documents more similar: two mathematical articles are not more similar because neither uses the keyword “Tanzania”. For this reason we will refer to this approach as the zero insensitive case; it is the only case studied in this article.
Another way of representing such data is through set theory. An entity such as a document is represented by the set of properties it possesses (here keywords or phrases present in the document, or by which the document is indexed). Note that absence data are usually not explicitly represented in the set–theoretic approach. This corresponds to the fact that in the array representation, common zeros are not considered. Fig. 1 illustrates these different approaches.
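The two representations described above, and the fact that a common zero plays no role in either, can be made concrete in a short sketch (the vocabulary and documents below are hypothetical, chosen only for illustration):

```python
# Hypothetical 4-keyword vocabulary, fixed in a particular order.
keywords = ["retrieval", "overlap", "Lorenz", "Tanzania"]

# Set-theoretic representation: a document is the set of keywords it possesses;
# absences are implicit.
doc_r = {"retrieval", "overlap"}
doc_s = {"retrieval", "Lorenz"}

# Presence-absence (1-0) array representation over the fixed keyword order.
array_r = [1 if k in doc_r else 0 for k in keywords]  # [1, 1, 0, 0]
array_s = [1 if k in doc_s else 0 for k in keywords]  # [1, 0, 1, 0]

# Zero-insensitive comparison: "Tanzania", absent from both documents,
# contributes nothing, so the two views agree on what the documents share.
shared_set = doc_r & doc_s                                 # {'retrieval'}
shared_arr = sum(x * y for x, y in zip(array_r, array_s))  # 1
```

Note that the common zero (the last position of both arrays) never enters the comparison, which is exactly what the set representation makes automatic.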
We will show how a Lorenz curve, consisting of just three line segments, leads to an intrinsic notion of similarity. These Lorenz similarity curves define a partial order on the set of presence–absence data representations. We will further show that for zero insensitive presence–absence data well-known measures such as the Jaccard index, the Dice coefficient and Salton’s cosine measure respect this partial order, thereby revealing new good properties of these important measures.
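For zero insensitive 1–0 data, the three measures just named have standard closed forms in terms of a (common presences), b (presences only in r) and c (presences only in s); a minimal sketch (function names are ours):

```python
import math

def counts(r, s):
    """Return (a, b, c): common presences, presences only in r, only in s.
    Common zeros are never counted, which makes all three measures zero insensitive."""
    a = sum(1 for x, y in zip(r, s) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(r, s) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(r, s) if x == 0 and y == 1)
    return a, b, c

def jaccard(r, s):
    a, b, c = counts(r, s)
    return a / (a + b + c)          # |intersection| / |union|

def dice(r, s):
    a, b, c = counts(r, s)
    return 2 * a / (2 * a + b + c)

def cosine(r, s):
    a, b, c = counts(r, s)
    return a / math.sqrt((a + b) * (a + c))  # Salton's cosine for 1-0 arrays
```

For example, for r = (1, 0, 1) and s = (1, 1, 0) one finds a = b = c = 1, giving Jaccard 1/3, Dice 1/2 and cosine 1/2.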
Lorenz similarity curves
Let r = (x_1, …, x_N) and s = (y_1, …, y_N) be two presence–absence arrays of length N (in short: N-arrays) for the documents r and s. The similarity of D = {r, s} must not depend on the order in which we consider r and s. Nor may it depend on the order in which the keywords of r and s are enumerated. Of course, in practice one uses a particular order (always the same for the two items involved), but the point is that this order must not influence their similarity. These requirements are also imposed
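The two invariance requirements just stated can be checked directly on any of the classical measures; a sketch using the Jaccard index (the function name and the sample arrays are ours):

```python
import random

def jaccard(r, s):
    # Zero insensitive: common zeros enter neither the intersection nor the union.
    inter = sum(1 for x, y in zip(r, s) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(r, s) if x == 1 or y == 1)
    return inter / union

r = [1, 0, 1, 1, 0]
s = [1, 1, 0, 1, 0]

# Symmetry: the similarity of D = {r, s} does not depend on the order of r and s.
assert jaccard(r, s) == jaccard(s, r)

# Keyword-order invariance: enumerating the keywords in a different order means
# applying the SAME permutation to both arrays, which changes nothing.
perm = list(range(len(r)))
random.shuffle(perm)
r_p = [r[i] for i in perm]
s_p = [s[i] for i in perm]
assert jaccard(r, s) == jaccard(r_p, s_p)
```

Both assertions hold because the measure depends only on the counts of matching positions, not on where those positions sit or which array is listed first.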
The partial order derived from Lorenz similarity curves
For D = {r, s} and D′ = {r′, s′} we will say that the Lorenz similarity curve L_D(x) is situated below the Lorenz similarity curve L_D′(x) if for every x ∈ [0, 1] either L_{r,s}(x) ⩽ L_{r′,s′}(x), with strict inequality in at least one point (and hence in infinitely many), or L_{s,r}(x) ⩽ L_{r′,s′}(x), with strict inequality in at least one point. Note that by the previous observation L_{r,s}(x) ⩽ L_{r′,s′}(x) automatically implies that L_{s,r}(x) ⩽ L_{s′,r′}(x). Similarly, we say that the Lorenz similarity curve L_D(x) coincides with
General properties of Lorenz similarity
In this section we prove properties that are intrinsically true for Lorenz similarity, i.e. properties that hold whatever similarity function one uses. Such properties are the most basic and important ones for the notion of Lorenz similarity. Theorem. Lorenz similarity is replication invariant. Replication means that every presence–absence value is repeated f times, with f a natural number larger than one. For example, the duo D = {(1, 0, 1), (1, 1, 0)} is transformed to D′ = {(1, 1, 0, 0, 1, 1), (1, 1, 1, 1, 0, 0)} (for f = 2
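The replication step, and its invariance for a concrete measure such as the Jaccard index, can be verified on the duo from the text (the helper names `replicate` and `jaccard` are ours):

```python
def replicate(arr, f):
    """Repeat every presence-absence value f times, as in the theorem."""
    return [v for v in arr for _ in range(f)]

def jaccard(r, s):
    inter = sum(1 for x, y in zip(r, s) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(r, s) if x == 1 or y == 1)
    return inter / union

r, s = [1, 0, 1], [1, 1, 0]
r2, s2 = replicate(r, 2), replicate(s, 2)

# This reproduces the duo D' of the text (case f = 2).
assert r2 == [1, 1, 0, 0, 1, 1] and s2 == [1, 1, 1, 1, 0, 0]

# Replication multiplies intersection and union by the same factor f,
# so the Jaccard value is unchanged.
assert jaccard(r, s) == jaccard(r2, s2)
```

The same argument applies to the Dice coefficient and the cosine measure, since replication scales a, b and c by the common factor f.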
Lorenz similarity functions
Definition. A Lorenz similarity function, f, is a real-valued function mapping a duo D to its Lorenz similarity value f(D). This function must, moreover, respect the Lorenz similarity partial order. This means that, if … In the case the function f only satisfies the requirement … then we say that f is a weak Lorenz similarity function.
These requirements imply that f({r, s}) = f({s, r}). Hence, expressing
The relation between retrieval and overlap measures, and Lorenz similarity
We first show that the cases where strictly different Lorenz similarity measures lead to the same overlap value do not co-occur for O1 and O2. Proposition (using the same notation as above). Assume that D′ > D (strictly); then either … or … Proof. We know already that under this assumption … and … If the conclusion of this proposition were not correct, then … and … Under this assumption … and …, implying that … Now three cases are possible
Conclusion
We have shown that classical measures used in information retrieval, studies of indexer consistency and overlap studies can be characterized by Lorenz similarity curves. This provides a visual, geometric picture of similarity, different from the geometric approach based on iso-similarity curves, as studied by Jones and Furnas (1987). We have shown that the Jaccard index, the Dice coefficient and Salton’s cosine measure respect the partial order determined by these Lorenz similarity curves,
References (14)
On the measurement of inequality, Journal of Economic Theory (1970)
Construction of concentration measures for general Lorenz curves using Riemann–Stieltjes integrals, Mathematical and Computer Modelling (2002)
A universal method of information retrieval evaluation: the missing link M and the universal IR surface, Information Processing and Management (2004)
et al., Strong similarity measures for ordered sets of documents in information retrieval, Information Processing and Management (2002)
The measurement of the inequality of incomes, The Economic Journal (1920)
et al., Introduction to informetrics (1990)
et al., Symmetric and asymmetric theory of relative concentration and applications, Scientometrics (2001)