Term-relevance computations and perfect retrieval performance

doi:10.1016/0306-4573(95)00011-5

Information Processing & Management

Volume 31, Issue 4, July 1995, Pages 491-498

https://doi.org/10.1016/0306-4573(95)00011-5

Abstract

Computing formulas for binary independent (BI) term relevance weights are evaluated as a function of query representations and retrieval expectations in the CF database. Query representations consist of the limited set of terms appearing in each query statement and the complete set of terms appearing in the database. Retrieval expectations include comprehensive searches, for which many relevant documents are sought, and specific searches, for which only a few documents have merit. Conventional computing equations, which are known to overestimate term relevance weights, are shown to produce mediocre results for all combinations of query representations and retrieval expectations. Modified computing equations, which do not overestimate relevance weights, produce essentially perfect retrieval results for both comprehensive and specific searches when the query representation is complete. Probabilistic retrieval, based on BI assumptions and applied to simple subject descriptions of documents and queries, can retrieve all relevant documents and only relevant documents when term relevance weights are computed accurately.
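As a rough illustration of what a BI term relevance weight looks like, the sketch below uses the standard log-odds form, where p_k is the probability that term k occurs in a relevant document and u_k the probability that it occurs in a non-relevant one. The function names, the toy weights, and the 0.5-adjusted point estimate (commonly attributed to Robertson and Sparck Jones, 1976) are assumptions made for illustration; the abstract does not state the conventional or modified equations the article evaluates, and the modified equations are not reproduced here.

```python
import math


def term_weight(p_k: float, u_k: float) -> float:
    """Log-odds relevance weight w_k = log[p_k (1 - u_k) / (u_k (1 - p_k))].

    Requires 0 < p_k < 1 and 0 < u_k < 1.
    """
    return math.log(p_k * (1.0 - u_k) / (u_k * (1.0 - p_k)))


def adjusted_weight(r: int, n: int, R: int, N: int, c: float = 0.5) -> float:
    """Point estimate of w_k from counts, with an additive correction c.

    r: relevant documents containing the term
    n: all documents containing the term
    R: relevant documents for the query
    N: documents in the collection
    Assumption: c = 0.5 is the commonly used correction; it is not taken
    from the abstract.
    """
    return math.log(((r + c) * (N - n - R + r + c)) / ((n - r + c) * (R - r + c)))


def score(doc_terms: set, weights: dict) -> float:
    """BI-style document score: sum of weights of the terms present."""
    return sum(w for term, w in weights.items() if term in doc_terms)


# Hypothetical toy weights and documents, for illustration only.
weights = {
    "fibrosis": term_weight(0.8, 0.2),   # favours relevant documents
    "protein": term_weight(0.5, 0.5),    # no discriminating power
    "survey": term_weight(0.1, 0.4),     # favours non-relevant documents
}
docs = {
    "d1": {"fibrosis", "protein"},
    "d2": {"survey", "protein"},
}
ranking = sorted(docs, key=lambda d: score(docs[d], weights), reverse=True)
```

Documents are ranked by descending score, so a term that occurs mostly in relevant documents raises a document's rank and a term that occurs mostly in non-relevant documents lowers it.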

Cited by (31)

Social media analysis by innovative hybrid algorithms with label propagation
2022, Expert Systems with Applications
Citation Excerpt :
Here, pk denotes the probability that the word k appears in a relevant text, uk shows the probability of word k appearing in a non-relevant text, and wk presents the relevance weight of term k. According to discussions and analysis in the literature (Shaw Jr, 1995), if term k occurs frequently in relevant texts while it rarely occurs in non-relevant texts, then this means that term k has the capability of discriminating relevant texts from non-relevant texts, which is called the distinguishing characteristic of relevance computation. A positive value of wk means that k appears in relevant texts, while a negative value of wk means that k appears in non-relevant documents.
Due to the huge size of the data accumulated on microblogging sites, two fundamental questions have recently become very popular: 1) What percentage of this accumulated data has positive or negative sentiment polarity? 2) How is this accumulated data distributed across different topics? Motivated by these needs, this paper presents several algorithms based on the Label Propagation Algorithm (LPA) to handle the two corresponding tasks: sentiment polarity detection and topic-based text classification. These algorithms are the Label Propagated-Relevance Frequency Classifier (LP-RFC) and the LP-Abstract Frequency Classifier (LP-AFC). They can be defined as new semantic smoothing classifiers, which take advantage of the semantic connections among terms in the label propagation phase of the LPA. Additionally, another classifier, namely LP-Com_RFC+AFC, was built. LP-Com_RFC+AFC is a weighted summation classifier of the individual LP-RFC and LP-AFC. Furthermore, considering the shortage of labeled data in real-world scenarios, a semi-supervised version of LP-RFC and LP-AFC, namely “Merging Unlabeled and Labeled Instances with Semantic Values of Terms” (MULIS), was designed and implemented. For the experiments on the sentiment polarity detection task, three different datasets were used, and for the experiments on the topic-based text classification task, a self-collected tweet dataset was used. According to the experimental results, the suggested algorithms, and their composite form, LP-Com_RFC+AFC, generated higher F1 scores than all of the baseline algorithms at nearly all of the training splits on the datasets.
An evaluation study on text categorization using automatically generated labeled dataset
2017, Neurocomputing
Naïve Bayes, k-nearest neighbors, Adaboost, support vector machines and neural networks are five commonly used text classifiers, among others. Evaluation of these classifiers involves a variety of factors, including the benchmark used, feature selection, parameter settings of the algorithms, and the measurement criteria employed. Researchers have demonstrated that some algorithms outperform others on some corpora; however, inconsistency of human labeling and high dimensionality of feature spaces are two issues to be addressed in text categorization. This paper focuses on evaluating the five commonly used text classifiers by using an automatically generated text document collection which is labeled by a group of experts to alleviate the subjectivity of human category assignments, and at the same time on examining the influence of the number of features on the performance of the algorithms.
Achieving efficient and privacy-preserving multi-feature search for mobile sensing
2015, Computer Communications
Citation Excerpt :
But it is usually difficult for a search entity to express its information need precisely; thus the value defined by itself may not be accurate. To overcome this impreciseness, the technique of relevance feedback is used [9–11]. It is the process of automatically adjusting an existing query using information feedback by the search entity about the preference of previously retrieved documents.
Currently, more and more mobile terminals embed a number of sensors and generate massive amounts of data. Effective utilization of such information can enable people to get more personalized services, and can also help service providers to sell their products accurately. As the information may contain private information about people, it is typically encrypted before being transmitted to the service providers. This, however, significantly limits the usability of the data due to the difficulty of searching over encrypted data. To address these issues, in this paper, we first leverage the secure kNN technique to propose an efficient and privacy-preserving multi-feature search scheme for mobile sensing. Furthermore, we propose an extended scheme, which can personalize queries based on historical search information and return more accurate results. Using analysis, we prove the security of the proposed scheme with respect to privacy protection of the index and trapdoor and unlinkability of the trapdoor. Via extensive experiments on real-world cloud systems, we validate the performance of the proposed scheme in terms of functionality, computation and communication overhead.
Feature selection on hierarchy of web documents
2003, Decision Support Systems
The paper describes feature subset selection used in learning on text data (text learning) and gives a brief overview of feature subset selection commonly used in machine learning. Several known and some new feature scoring measures appropriate for feature subset selection on large text data are described and related to each other. Experimental comparison of the described measures is given on real-world data collected from the Web. Machine learning techniques are used on data collected from Yahoo, a large text hierarchy of Web documents. Our approach includes some original ideas for handling a large number of features, categories and documents. The high number of features is reduced by feature subset selection and additionally by using a ‘stop-list’, pruning low-frequency features and using a short description of each document given in the hierarchy instead of using the document itself. Documents are represented as feature-vectors that include word sequences instead of including only single words as commonly used when learning on text data. An efficient approach to generating word sequences is proposed. Based on the hierarchical structure, we propose a way of dividing the problem into subproblems, each representing one of the categories included in the Yahoo hierarchy. In our learning experiments, for each of the subproblems, a naive Bayesian classifier was used on text data. The result of learning is a set of independent classifiers, each used to predict the probability that a new example is a member of the corresponding category. Experimental evaluation on real-world data shows that the proposed approach gives good results. The best performance was achieved by the feature selection based on a feature scoring measure known from information retrieval called Odds ratio and using a relatively small number of features.
A feature mining based approach for the classification of text documents into disjoint classes
2002, Information Processing and Management
This paper proposes a new approach for classifying text documents into two disjoint classes. The new approach is based on extracting patterns, in the form of two logical expressions, which are defined on various features (indexing terms) of the documents. The pattern extraction is aimed at providing descriptions (in the form of two logical expressions) of the two classes of positive and negative examples. This is achieved by means of a data mining approach, called One Clause At a Time (OCAT), which is based on mathematical logic. The application of a logic-based approach to text document classification is critical when one wishes to be able to justify why a particular document has been assigned to one class versus the other class. This situation occurs, for instance, in declassifying documents that have been previously considered important to national security and thus are currently being kept as secret. Some computational experiments have investigated the effectiveness of the OCAT-based approach and compared it to the well-known vector space model (VSM). These tests also have investigated finding the best indexing terms that could be used in making these classification decisions. The results of these computational experiments on a sample of 2897 text documents from the TIPSTER collection indicate that the first approach has many advantages over the VSM approach for solving this type of text document classification problem. Moreover, a guided strategy for the OCAT-based approach is presented for deciding which document one needs to consider next while building the training example sets.
Performance standards and evaluations in IR test collections: Vector-space and other retrieval models
1997, Information Processing and Management
Low performance standards for each query and for the group of queries in 13 traditional and four TREC test collections have been computed. Predicted by the hypergeometric distribution, the standards represent the highest level of retrieval effectiveness attributable to chance. Operational levels of performance for vector-space, ad-hoc-feature-based, probabilistic, and other retrieval models have been compared to the standards. The effectiveness of these techniques in small, traditional test collections can be explained by retrieving a few more relevant documents for most queries than expected by chance, and the effectiveness of retrieval techniques in the large TREC test collections can only be explained by retrieving many more relevant documents for most queries than expected by chance. The discrepancy between deviations from chance in traditional and TREC text collections is due to a decrease in performance standards for large test collections, not to an increase in operational performance. Retrieving a few more relevant documents than expected by chance leads to mediocre levels of performance; recall and precision are rarely greater than 0.50 for any retrieval strategy in any test collection. However, marginal improvements to expectations based on chance may be sufficient to initiate successful interactions between an end-user and the next generation of retrieval systems, in which relevance judgments will be automatically translated into progressively improving estimates of the capacity of terms and other features to discriminate between relevant and non-relevant documents. Realization of such systems would be enhanced by abandoning uninformative performance summaries and focusing on effectiveness and improvements in effectiveness of individual queries.
