Efficient processing of probabilistic set-containment queries on uncertain set-valued data
Introduction
The database community has long realized that set-valued data provide a concise and natural way of modeling complex objects [42], and there have been a handful of research studies on indexing and querying data objects with set-valued attributes [2], [41], [19], [35], [34], [36]. Unlike real-valued (or numerical) data, set-valued data do not have a natural order. Set-valued data are ubiquitous and can easily represent complex objects in many real applications. For example, a set-valued attribute can represent the set of products purchased in a transaction record (market basket databases), the set of objects appearing in a picture (image databases), or the set of symptoms and/or illnesses that a patient suffers from (medical databases). Moreover, set-valued data types are an integral part of the object-relational model, and are supported natively by most modern commercial object-oriented [8] and object-relational database management systems [48].
More recent developments in many advanced applications, such as location-based services, pattern recognition, and medical diagnosis, have called for database techniques that handle inherent uncertainty in data. The uncertainty may stem from causes ranging from the limitations of measuring equipment and data noise to human observation. There is already a substantial body of research on modeling, indexing, and querying uncertain data. However, most of these studies focus on real-valued uncertain data. Considering the popularity of set-valued data in representing complex objects, there is a strong demand for modeling and query processing techniques for uncertain set-valued data. We illustrate the need for uncertain set-valued data with three examples:
- In a medical diagnosis application, a patient record contains a set of symptoms (items) that a patient suffers from. In many cases, the symptoms, which are obtained from a medical assessment or derived from historical data statistics, would best be represented by probabilities indicating their presence in the patient's record.
- In content-based image/video analysis [39], objects (items) are extracted from satellite images or surveillance video clips by object recognition tools. The presence of an interesting object in images or video clips, such as the mark of a terrorist weapon, might be represented by an existential probability, due to limited resolution or noise.
- For personalized product recommendation, a supermarket keeping a long history of transaction records can model each customer as a probabilistic itemset, which contains all the products that he/she has ever bought. Each item in this itemset is associated with a probability indicating the likelihood of it being purchased by that particular customer. For example, customer A buys chocolate once and milk twice out of the 10 transactions that he made. Thus the itemset of A should contain chocolate and milk with existential probabilities of 10% and 20% respectively.
In the above examples, the existence of an item in a set is associated with a likelihood measure or a probability. We call such an item an uncertain item. Without loss of generality, we consider relations containing uncertain set-valued attributes, where the value of each uncertain set-valued attribute for each tuple is a set of uncertain items. Motivated by the prevalence of possible world semantics [21], [1], [44], [16] in describing data uncertainty, we present a data model for uncertain set-valued attributes that relies on possible world semantics. In our uncertainty model, uncertainty in set-valued data is represented by existential probabilities and interpreted as a set of possible worlds, where each possible world has a probability indicating its existential likelihood.
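The customer example and the possible world interpretation can be illustrated with a minimal sketch. It assumes that items exist independently of one another, which this introduction does not state explicitly; the function names `existential_probabilities` and `possible_worlds` are hypothetical and chosen only for illustration.

```python
from itertools import product


def existential_probabilities(purchase_counts, num_transactions):
    """Turn per-item purchase counts into existential probabilities."""
    return {item: count / num_transactions
            for item, count in purchase_counts.items()}


def possible_worlds(uncertain_set):
    """Yield (world, probability) pairs for an uncertain set.

    Assumption (not taken from the paper): items exist independently,
    so a world's probability is a product of per-item factors.
    """
    items = list(uncertain_set)
    for mask in product((False, True), repeat=len(items)):
        prob = 1.0
        for item, present in zip(items, mask):
            p = uncertain_set[item]
            prob *= p if present else 1.0 - p
        world = frozenset(i for i, present in zip(items, mask) if present)
        yield world, prob


# Customer A: chocolate once and milk twice in 10 transactions.
customer_a = existential_probabilities({"chocolate": 1, "milk": 2}, 10)
worlds = dict(possible_worlds(customer_a))
```

An uncertain set with n items yields 2^n possible worlds under this interpretation, and their probabilities sum to 1.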
A primitive operator for set-valued data is set containment [41], [34], [36], which examines the inclusion relation between sets. For example, in keyword search, set containment is often used to retrieve documents containing the user-specified keyword set. However, set containment cannot adequately express the extent of inclusion in many applications, as it only checks whether the boolean inclusion relation holds. This limitation motivates a generalization of conventional set containment called Jaccard containment [7], [44], [3]. Specifically, given a query set and a reference set, the Jaccard containment of the query set in the reference set is defined as the ratio of the size of the intersection of the two sets to the size of the query set. Jaccard containment provides a natural error-tolerant semantics for set containment [3]. Therefore, it has been widely adopted as an alternative operator for set-valued data in many applications such as record matching and fuzzy keyword search.
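For exact (certain) sets, the definition above translates directly into a few lines of code. The sketch below is illustrative only; the function name `jaccard_containment` and the choice to reject an empty query set are assumptions, since the text does not say how an empty query is handled.

```python
def jaccard_containment(query, reference):
    """Return |query ∩ reference| / |query| for two exact sets."""
    query, reference = set(query), set(reference)
    if not query:
        # Assumption: the ratio is undefined for an empty query set.
        raise ValueError("Jaccard containment is undefined for an empty query set")
    return len(query & reference) / len(query)


keywords = {"database", "index", "query"}
document = {"database", "query", "join"}

strict = keywords <= document                       # boolean set containment
relaxed = jaccard_containment(keywords, document)   # two of three keywords match
```

Boolean containment rejects the document outright, while the Jaccard containment reports that two thirds of the query keywords are covered.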
Due to the popularity of both set containment and Jaccard containment operations, it is natural to seek probabilistic formulations of set containment and Jaccard containment on uncertain set-valued attributes. In this paper, we define novel semantics for set containment and Jaccard containment on uncertain set-valued attributes, from the perspective of possible world semantics. Specifically, set containment between uncertain set-valued attributes, called probabilistic set containment, is a predicate augmented by a probability representing the likelihood that exact set containment holds across the possible worlds. The Jaccard containment over uncertain set-valued data is a random variable whose distribution is given by the exact Jaccard containment values in all possible worlds. We introduce the notion of expected Jaccard containment, which is defined as the mathematical expectation of Jaccard containment over all possible worlds. Just as exact Jaccard containment generalizes set containment, we treat expected Jaccard containment as the generalization of probabilistic set containment in the rest of the paper.
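Both notions can be evaluated literally by enumerating pairs of possible worlds. The sketch below does exactly that and is exponential in the set sizes; it is meant to pin down the semantics, not to reflect the paper's algorithms. It assumes independent items, independent attribute values, and that a possible world with an empty query set contributes 0 to the expectation. All function names are hypothetical.

```python
from itertools import product


def worlds(uncertain_set):
    """Enumerate (world, probability) pairs, assuming independent items."""
    items = list(uncertain_set)
    for mask in product((0, 1), repeat=len(items)):
        prob = 1.0
        for item, present in zip(items, mask):
            prob *= uncertain_set[item] if present else 1.0 - uncertain_set[item]
        yield {i for i, present in zip(items, mask) if present}, prob


def containment_probability(r, s):
    """Total probability of world pairs in which the world of r is a subset of the world of s."""
    return sum(pr * ps
               for wr, pr in worlds(r)
               for ws, ps in worlds(s)
               if wr <= ws)


def expected_jaccard_containment(q, r):
    """Expectation of |Q ∩ R| / |Q| over all world pairs.

    Assumption: worlds with an empty Q contribute 0.
    """
    total = 0.0
    for wq, pq in worlds(q):
        if not wq:
            continue
        for wr, pr in worlds(r):
            total += pq * pr * len(wq & wr) / len(wq)
    return total


r = {"a": 0.5}
s = {"a": 0.5, "b": 0.9}
p_contain = containment_probability(r, s)   # fails only if a is in r's world but not in s's
ejc = expected_jaccard_containment(r, s)
```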
Based on the above two probabilistic containment operations, our paper proposes two novel probabilistic set containment queries, namely probability threshold containment query (PTCQ) and probability threshold containment join (PTCJ). Given an uncertain set-valued querying object and a user-specified threshold τ, a PTCQ retrieves a set of uncertain set-valued tuples, such that their probabilities of containing the querying object (or their Jaccard containment similarities with respect to the querying object) are no less than τ. Similarly, given two relations with uncertain set-valued attributes and a threshold τ, a PTCJ returns a set of uncertain set-valued tuple pairs, such that their probabilities of probabilistic set containments (or expected Jaccard containments) are no less than τ.
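The two query types can be sketched as naive scans with a pluggable scoring function. This is a baseline for illustration, not the paper's query processing technique. The default score is a closed-form containment probability that holds under an independence assumption not stated in this introduction; `ptcq`, `ptcj`, and the sample data are hypothetical.

```python
def containment_probability(r, s):
    """P(r is contained in s), assuming all items exist independently.

    Containment fails for an item only if it is present in r and absent in s.
    """
    prob = 1.0
    for item, p in r.items():
        prob *= 1.0 - p * (1.0 - s.get(item, 0.0))
    return prob


def ptcq(query, relation, tau, score=containment_probability):
    """Return ids of tuples whose score with respect to the query is at least tau."""
    return [tid for tid, value in relation.items() if score(query, value) >= tau]


def ptcj(containee, containing, tau, score=containment_probability):
    """Return (r_id, s_id) pairs whose score is at least tau (nested-loop scan)."""
    return [(rid, sid)
            for rid, r in containee.items()
            for sid, s in containing.items()
            if score(r, s) >= tau]


patients = {
    "p1": {"fever": 0.9, "cough": 0.8},
    "p2": {"fever": 0.3},
}
disease = {"fever": 1.0, "cough": 0.5}

query_result = ptcq(disease, patients, tau=0.5)
join_result = ptcj({"d1": disease}, patients, tau=0.5)
```

Passing an expected Jaccard containment function as `score` would give the generalized variant of either query.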
Both PTCQ and PTCJ have applications in many scenarios. Consider the medical diagnosis example. Suppose a typical disease causes several symptoms (items) whose presence is uncertain (judged by human experience), and doctors need to find all patients who possibly suffer from that disease. They can issue a PTCQ with probabilistic set containment semantics (using the set of uncertain symptoms of that disease as the query object) to retrieve the patients whose symptoms contain those of the disease with sufficiently high confidence. In the satellite image database example, given a list of query images, each of which contains several uncertain target objects (items), an analyst can issue a PTCJ with expected Jaccard containment semantics to produce all (query image, result image) pairs in which the query image and the result image are similar with respect to the uncertain target objects appearing in them.
The efficient processing of PTCQ and PTCJ over uncertain set-valued attributes raises non-trivial challenges. Note that the manipulation of probabilistic data usually considers the possible world semantics, where each possible world is a materialized instance of probabilistic data that can occur in the real world. Therefore, existing techniques for set containment [36], [34] and Jaccard containment [3] over exact set-valued attributes cannot be easily adapted to answer PTCQ and PTCJ over uncertain set-valued attributes. A straightforward solution enumerates all possible worlds. However, since the number of possible worlds of an uncertain set-valued attribute is exponential in its size, this can be computationally expensive. Therefore, it is important to answer PTCQ and PTCJ efficiently under the possible world semantics.
Our work is the first to study the problem of processing probability threshold containment query and probability threshold containment join over uncertain set-valued data. We propose efficient query processing approaches for these two queries. Our contributions can be summarized as follows:
- We propose a model for representing uncertain set-valued data. We present novel probabilistic set containment semantics and its generalization (expected Jaccard containment semantics) over uncertain set-valued attributes, and then introduce two probabilistic set containment queries, namely the probability threshold containment query (PTCQ) and the probability threshold containment join (PTCJ).
- We derive interesting mathematical results based on the definition of probabilistic set containment. These results are utilized to provide an efficient scheme to compute probabilistic set containment with linear computation complexity. Based on this, we also propose efficient query processing algorithms for answering PTCQ and PTCJ with probabilistic set containment semantics.
- We study the efficient computation of expected Jaccard containment, and present a dynamic programming based scheme, which reduces the exponential computation complexity to polynomial, to evaluate expected Jaccard containment efficiently.
- Last but not least, we perform extensive experiments to demonstrate the efficiency of the proposed solutions.
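To see why a linear-time evaluation of probabilistic set containment is plausible, consider the following short derivation. It is a sketch under an assumption that is not stated in this introduction, namely that items exist independently within and across the two uncertain sets, and it is not claimed to be the paper's own lemma. The symbols W_r and W_s (the random possible worlds of r and s) and p_r(i), p_s(i) (existential probabilities, with p_s(i) = 0 when i is not listed in s) are introduced here only for illustration.

```latex
% Containment fails exactly when some item is present in W_r but absent from W_s.
% Under the independence assumption, these per-item failure events are independent.
\begin{align*}
P(r \subseteq_p s)
  &= P\bigl(\forall i \in r:\ \neg\,(i \in W_r \wedge i \notin W_s)\bigr) \\
  &= \prod_{i \in r} \Bigl(1 - P(i \in W_r)\,P(i \notin W_s)\Bigr) \\
  &= \prod_{i \in r} \Bigl(1 - p_r(i)\,\bigl(1 - p_s(i)\bigr)\Bigr)
\end{align*}
```

With constant-time lookup of p_s(i), the product has one factor per item of r, so it avoids enumerating the exponentially many possible world pairs.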
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents an uncertainty model for set-valued data and the problem definition. Section 4 evaluates probabilistic set containment semantics. Sections 5 and 6 present approaches for answering PTCQs and PTCJs with probabilistic set containment semantics, respectively. Section 7 presents the efficient computation of the generalization of probabilistic set containment. Section 8 presents the experimental results. Finally, Section 9 concludes the paper.
Section snippets
Related work
In this section, we review the highly related work from the following three aspects.
Data model and problem definitions
In this section, we first present an uncertainty model for set-valued data. Then we formulate the problem definition.
Evaluating probabilistic set containment
In this section, we present important results which can be used for efficiently evaluating the probabilistic set containment between two USVA values r and s. Lemma 1 The probability of probabilistic set containment P(r ⊆_p s) can be computed as follows: Proof Consider the inner in Eq. (2), for each possible world of r, denoted by , we have: As is 1 if , and 0 otherwise, we can see that is the
Processing PTCQs with probabilistic set containment
In this section, we propose query processing techniques for PTCQs with probabilistic set containment semantics on USVAs. The key idea of our query processing is to efficiently compute the probability of probabilistic set containment without considering all possible world pairs. Throughout this section, we assume that a USVA query q is performed on a relation R with one USVA R·r.
Given a tuple t ∈ R, the set containment probability P(q ⊆_p t·r) can be computed by multiplying a number of linear
Processing PTCJs with probabilistic set containment
We shall now consider the method of processing probability threshold containment joins with probabilistic set containment semantics. Formally, we shall compute over two relations R and S, where R·r and S·s are their respective uncertain set-valued attributes. In this case, S is called the containing relation, and R is called the containee relation. We first present a straightforward approach which adopts the idea of index nested loop join. Subsequently, we propose a more efficient
Evaluating expected Jaccard containment
In this section, we mainly study the efficient computation of expected Jaccard containment, and we also discuss the query processing of PTCQs and PTCJs with expected Jaccard containment semantics briefly.
Although the EJC is a similarity measure naturally defined between two USVA values, it is very difficult to compute. In fact, evaluating Eq. (3) directly has exponential computational complexity, since it has to enumerate all cross-product possible world pairs of two USVA values and evaluate their Jaccard
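One way to avoid the exponential enumeration is a dynamic program over the joint distribution of the query world size and the intersection size. The sketch below is an illustrative polynomial-time scheme and is not claimed to be the paper's algorithm. It assumes independent items, independent attribute values, and that a possible world with an empty query set contributes 0 to the expectation; the function name is hypothetical.

```python
def expected_jaccard_containment(q, r):
    """Expected |Q ∩ R| / |Q| for uncertain sets q and r (item -> probability).

    dp[(k, m)] is the probability that, among the items of q processed so far,
    exactly k are present in the world of q and m of those are also present
    in the world of r. Items that appear only in r cannot affect the ratio.
    """
    dp = {(0, 0): 1.0}
    for item, pq in q.items():
        pr = r.get(item, 0.0)
        # Three disjoint outcomes per item: (change in k, change in m, probability).
        outcomes = (
            (0, 0, 1.0 - pq),         # item absent from the query world
            (1, 1, pq * pr),          # item present in both worlds
            (1, 0, pq * (1.0 - pr)),  # item present in the query world only
        )
        nxt = {}
        for (k, m), p in dp.items():
            for dk, dm, w in outcomes:
                if w > 0.0:
                    key = (k + dk, m + dm)
                    nxt[key] = nxt.get(key, 0.0) + p * w
        dp = nxt
    # Assumption: states with k == 0 (empty query world) contribute 0.
    return sum(p * m / k for (k, m), p in dp.items() if k > 0)


q = {"a": 0.5, "b": 0.5}
r = {"a": 1.0}
ejc = expected_jaccard_containment(q, r)
```

The table holds at most O(n^2) states for a query with n items, and each item updates every state a constant number of times, so the running time is O(n^3) rather than exponential.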
Experiments
In this section, we present our experimental studies of the proposed query processing techniques. We first discuss the experiment setup, and then present the detailed experiment results of processing PTCQs and PTCJs with probabilistic set containment, in two separate sets of experiments. Finally, we present the results of evaluating expected Jaccard containment.
Conclusion
Due to the wide availability of uncertain data in many real applications, uncertain data management has become an essential topic in the database community. In this paper, we considered the uncertainty in set-valued attributes, and reported our comprehensive investigation of the problem of probabilistic set containment queries over uncertain set-valued attributes. In summary, the novelties of our work are threefold. First, we introduced an uncertainty model for set-valued attributes, and proposed
Acknowledgment
This work was supported in part by the National Science Foundation of China (NSFC Grant No. 60970124, 61003050, and 61170034).
References (57)
- et al., Finding the least influenced set in uncertain databases, Information Systems (2011)
- et al., Combining intensional with extensional query evaluation in tuple independent probabilistic databases, Information Sciences (2011)
- et al., Ranking uncertain sky: the probabilistic top-k skyline operator, Information Systems (2011)
- S. Abiteboul, P.C. Kanellakis, G. Grahne, On the representation and querying of sets of possible worlds, in:...
- C.C. Aggarwal, J.L. Wolf, P.S. Yu, A new method for similarity indexing of market basket data, in: Proceedings of ACM...
- P. Agrawal, A. Arasu, R. Kaushik, On indexing error-tolerant set containment, in: Proceedings of ACM SIGMOD...
- A. Arasu, V. Ganti, R. Kaushik, Efficient exact set-similarity joins, in: Proceedings of International Conference on...
- et al., Scalable probabilistic similarity ranking in uncertain databases, Transactions on Knowledge and Data Engineering (TKDE) (2010)
- et al., Efficient search for the top-k probable nearest neighbors in uncertain databases, The Proceedings of the VLDB Endowment (PVLDB) (2008)
- A. Broder, On the resemblance and containment of documents, in: Proceedings of the Compression and Complexity of...
- The Object Database Standard: ODMG 2.0
- Probabilistic reverse nearest neighbor queries on uncertain data, Transactions on Knowledge and Data Engineering (TKDE)
- Data integration with uncertainty, The International Journal on Very Large Data Bases (VLDBJ)
- Models for incomplete and probabilistic information, IEEE Data Engineering Bulletin
- Rule-based systems, Communications of the ACM
- A performance study of four index structures for set-valued attributes of low cardinality, The International Journal on Very Large Data Bases (VLDBJ)
- Ranking queries on uncertain data, The International Journal on Very Large Data Bases (VLDBJ)
- Incomplete information in relational databases, ACM Journal
- Semantics of ranking queries for probabilistic data, Transactions on Knowledge and Data Engineering (TKDE)
- The art of computer programming