Classical retrieval and overlap measures satisfy the requirements for rankings based on a Lorenz curve
Introduction
Presence–absence data are a basic way of representing categorical data. As presence is usually coded as 1 and absence as 0, such data are also known as 1–0 data. Their binary nature makes them especially suited for computing, and their use in science is ubiquitous. This is also the case in the information sciences (Huot, Quoniam, & Dou, 1992; Magurran, 1991). Indeed, a standard approach to information retrieval and the study of overlap represents documents by an array of keywords and/or phrases. Note that we chose the term ‘array’ and not ‘vector’ because these entities are not vectors in the strict mathematical sense of the word. If keywords are chosen in a particular order, a document is represented by a presence–absence (1–0) array. Similarity between documents is then determined by comparing these document representations. In information retrieval and in overlap studies it is customary not to take common zeros into account (Salton & McGill, 1983). Indeed, keywords or phrases that occur in neither of the two documents do not make the documents more similar: two mathematical articles are not more similar because neither uses the keyword “Tanzania”. For this reason we will refer to this approach as the zero insensitive case; it is the only case studied in this article.
Another way of representing such data is through set theory. An entity such as a document is represented by the set of properties it possesses (here keywords or phrases present in the document, or by which the document is indexed). Note that absence data are usually not explicitly represented in the set–theoretic approach. This corresponds to the fact that in the array representation, common zeros are not considered. Fig. 1 illustrates these different approaches.
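The two representations described above, and the fact that a common zero plays no role in either, can be made concrete in a short sketch (the vocabulary and documents below are hypothetical, chosen only for illustration):

```python
# Hypothetical 4-keyword vocabulary, fixed in a particular order.
keywords = ["retrieval", "overlap", "Lorenz", "Tanzania"]

# Set-theoretic representation: a document is the set of keywords it possesses;
# absences are implicit.
doc_r = {"retrieval", "overlap"}
doc_s = {"retrieval", "Lorenz"}

# Presence-absence (1-0) array representation over the fixed keyword order.
array_r = [1 if k in doc_r else 0 for k in keywords]  # [1, 1, 0, 0]
array_s = [1 if k in doc_s else 0 for k in keywords]  # [1, 0, 1, 0]

# Zero-insensitive comparison: "Tanzania", absent from both documents,
# contributes nothing, so the two views agree on what the documents share.
shared_set = doc_r & doc_s                                 # {'retrieval'}
shared_arr = sum(x * y for x, y in zip(array_r, array_s))  # 1
```

Note that the common zero (the last position of both arrays) never enters the comparison, which is exactly what the set representation makes automatic.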
We will show how a Lorenz curve, consisting of just three line segments, leads to an intrinsic notion of similarity. These Lorenz similarity curves define a partial order on the set of presence–absence data representations. We will further show that for zero insensitive presence–absence data well-known measures such as the Jaccard index, the Dice coefficient and Salton’s cosine measure respect this partial order, thereby revealing new good properties of these important measures.
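For zero insensitive 1–0 data, the three measures just named have standard closed forms in terms of a (common presences), b (presences only in r) and c (presences only in s); a minimal sketch (function names are ours):

```python
import math

def counts(r, s):
    """Return (a, b, c): common presences, presences only in r, only in s.
    Common zeros are never counted, which makes all three measures zero insensitive."""
    a = sum(1 for x, y in zip(r, s) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(r, s) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(r, s) if x == 0 and y == 1)
    return a, b, c

def jaccard(r, s):
    a, b, c = counts(r, s)
    return a / (a + b + c)          # |intersection| / |union|

def dice(r, s):
    a, b, c = counts(r, s)
    return 2 * a / (2 * a + b + c)

def cosine(r, s):
    a, b, c = counts(r, s)
    return a / math.sqrt((a + b) * (a + c))  # Salton's cosine for 1-0 arrays
```

For example, for r = (1, 0, 1) and s = (1, 1, 0) one finds a = b = c = 1, giving Jaccard 1/3, Dice 1/2 and cosine 1/2.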
Lorenz similarity curves
Let r = (x_1, …, x_N) and s = (y_1, …, y_N) be two presence–absence arrays of length N (in short: N-arrays) for the documents r and s. The similarity of D = {r, s} must not depend on the order in which we consider r and s. Nor may it depend on the order in which the keywords of r and s are enumerated. Of course, in practice one uses a particular order (always the same for the two items involved), but the point is that this order must not influence their similarity. These requirements are also imposed
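The two invariance requirements just stated can be checked directly on any of the classical measures; a sketch using the Jaccard index (the function name and the sample arrays are ours):

```python
import random

def jaccard(r, s):
    # Zero insensitive: common zeros enter neither the intersection nor the union.
    inter = sum(1 for x, y in zip(r, s) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(r, s) if x == 1 or y == 1)
    return inter / union

r = [1, 0, 1, 1, 0]
s = [1, 1, 0, 1, 0]

# Symmetry: the similarity of D = {r, s} does not depend on the order of r and s.
assert jaccard(r, s) == jaccard(s, r)

# Keyword-order invariance: enumerating the keywords in a different order means
# applying the SAME permutation to both arrays, which changes nothing.
perm = list(range(len(r)))
random.shuffle(perm)
r_p = [r[i] for i in perm]
s_p = [s[i] for i in perm]
assert jaccard(r, s) == jaccard(r_p, s_p)
```

Both assertions hold because the measure depends only on the counts of matching positions, not on where those positions sit or which array is listed first.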
The partial order derived from Lorenz similarity curves
For D = {r, s} and D′ = {r′, s′} we will say that the Lorenz similarity curve L_D(x) is situated below the Lorenz similarity curve L_D′(x) if for every x ∈ [0, 1] either L_{r,s}(x) ⩽ L_{r′,s′}(x), with strict inequality in at least one point (and hence in infinitely many), or L_{s,r}(x) ⩽ L_{r′,s′}(x), with strict inequality in at least one point. Note that by the previous observation L_{r,s}(x) ⩽ L_{r′,s′}(x) automatically implies that L_{s,r}(x) ⩽ L_{s′,r′}(x). Similarly, we say that the Lorenz similarity curve L_D(x) coincides with
General properties of Lorenz similarity
In this section we prove properties that are intrinsically true for Lorenz similarity, i.e. properties that hold whatever similarity function one uses. Such properties are the most basic and important ones for the notion of Lorenz similarity. Theorem. Lorenz similarity is replication invariant. Replication means that every presence–absence value is repeated f times, with f a natural number larger than one. For example, the duo D = {(1, 0, 1), (1, 1, 0)} is transformed to D′ = {(1, 1, 0, 0, 1, 1), (1, 1, 1, 1, 0, 0)} (for f = 2
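The replication step, and its invariance for a concrete measure such as the Jaccard index, can be verified on the duo from the text (the helper names `replicate` and `jaccard` are ours):

```python
def replicate(arr, f):
    """Repeat every presence-absence value f times, as in the theorem."""
    return [v for v in arr for _ in range(f)]

def jaccard(r, s):
    inter = sum(1 for x, y in zip(r, s) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(r, s) if x == 1 or y == 1)
    return inter / union

r, s = [1, 0, 1], [1, 1, 0]
r2, s2 = replicate(r, 2), replicate(s, 2)

# This reproduces the duo D' of the text (case f = 2).
assert r2 == [1, 1, 0, 0, 1, 1] and s2 == [1, 1, 1, 1, 0, 0]

# Replication multiplies intersection and union by the same factor f,
# so the Jaccard value is unchanged.
assert jaccard(r, s) == jaccard(r2, s2)
```

The same argument applies to the Dice coefficient and the cosine measure, since replication scales a, b and c by the common factor f.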
Lorenz similarity functions
Definition. A Lorenz similarity function, f, is a real-valued function mapping a duo D to its Lorenz similarity value f(D). This function must, moreover, respect the Lorenz similarity partial order. This means that, if … In the case the function f only satisfies the requirement … then we say that f is a weak Lorenz similarity function.
These requirements imply that f({r, s}) = f({s, r}). Hence, expressing
The relation between retrieval and overlap measures, and Lorenz similarity
We first show that the cases where strictly different Lorenz similarity measures lead to the same overlap value do not co-occur for O1 and O2. Proposition (using the same notation as above). Assume that D′ > D (strictly); then either … or … Proof. We know already that under this assumption … and … If the conclusion of this proposition were not correct, then … and … Under this assumption … and …, implying that … Now three cases are possible
Conclusion
We have shown that classical measures used in information retrieval, studies of indexer consistency and overlap studies can be characterized by Lorenz similarity curves. This provides a visual, geometric picture of similarity, different from the geometric approach based on iso-similarity curves, as studied by Jones and Furnas (1987). We have shown that the Jaccard index, the Dice coefficient and Salton’s cosine measure respect the partial order determined by these Lorenz similarity curves,
References (14)
On the measurement of inequality, Journal of Economic Theory (1970)
Construction of concentration measures for general Lorenz curves using Riemann–Stieltjes integrals, Mathematical and Computer Modelling (2002)
A universal method of information retrieval evaluation: the missing link M and the universal IR surface, Information Processing and Management (2004)
et al., Strong similarity measures for ordered sets of documents in information retrieval, Information Processing and Management (2002)
The measurement of the inequality of incomes, The Economic Journal (1920)
et al., Introduction to informetrics (1990)
et al., Symmetric and asymmetric theory of relative concentration and applications, Scientometrics (2001)