Impact analysis of data placement strategies on query efforts in distributed RDF stores

doi:10.1016/j.websem.2018.02.002

Journal of Web Semantics

Volume 50, May 2018, Pages 21-48

https://doi.org/10.1016/j.websem.2018.02.002

Abstract

In recent years, scalable RDF stores in the cloud have been developed, in which graph data is distributed over compute and storage nodes to scale query processing effort and memory needs. One main challenge in these RDF stores is the data placement strategy, which can be formalized in terms of graph covers. These graph covers determine whether (a) the triple distribution is well balanced over all storage nodes (storage balance), (b) different query results may be computed on several compute nodes in parallel (vertical parallelization), and (c) individual query results can be produced from triples assigned to only a few storage nodes, ideally one (horizontal containment). We analyse the impact of the three most commonly used graph cover strategies in these terms and find that balancing the query workload reduces query execution time more than reducing data transfer over the network. To this end, we present our novel benchmark and open source evaluation platform Koral.

Introduction

In recent years, a need has emerged for RDF stores that can cope with several trillion triples. For instance, the number of Schema.org-based facts extracted from the Web has reached three trillion [2]. Another example is the European Bioinformatics Institute (EMBL-EBI), which would like to convert its datasets into RDF, resulting in a graph of several trillion triples. To date, no RDF store of this scale exists, and the current EBI RDF Platform can handle only 10 billion triples [3].

We pursue the development of a scalable RDF store in the cloud, in which graph data is distributed over compute and storage nodes to scale query processing effort and memory needs. The main challenges to be investigated for such a development are: (i) strategies for data placement over compute and storage nodes, (ii) strategies for distributed query processing, and (iii) strategies for handling failures of compute and storage nodes. In this paper, we focus on comparing the performance of data placement strategies.

Strategies for data placement may be formalized in terms of graph covers. Each compute and storage node hosts a graph chunk. Each triple is assigned to at least one graph chunk, and the union of all graph chunks defines a (possibly redundant) graph cover. When a query is submitted to an RDF store in the cloud, the query is distributed over the different compute and storage nodes. Each node applies the query operators assigned to it to its local data. If the query requires the combination of data from different chunks, the required information has to be transferred between compute nodes.
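The definition above can be made concrete with a minimal sketch. It assumes triples are plain (subject, property, object) tuples and chunks are sets of triples; the function names and the toy graph with its `ex:` terms are hypothetical and are not taken from the paper or from Koral.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def is_graph_cover(graph: Set[Triple], chunks: List[Set[Triple]]) -> bool:
    """Return True if the union of all chunks equals the graph, i.e. every
    triple is in at least one chunk and no chunk holds a foreign triple."""
    return set().union(*chunks) == graph


def replicated_triples(chunks: List[Set[Triple]]) -> int:
    """Count the extra triple copies stored because chunks overlap."""
    return sum(len(c) for c in chunks) - len(set().union(*chunks))


# Hypothetical toy graph; the terms are placeholders for illustration only.
graph = {
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:carol", "ex:owns", "ex:bello"),
}
chunk_1 = {("ex:alice", "ex:knows", "ex:bob"), ("ex:bob", "ex:knows", "ex:carol")}
chunk_2 = {("ex:bob", "ex:knows", "ex:carol"), ("ex:carol", "ex:owns", "ex:bello")}

print(is_graph_cover(graph, [chunk_1, chunk_2]))  # the two chunks cover the graph
print(replicated_triples([chunk_1, chunk_2]))     # one triple is stored twice
```

In this sketch the cover is redundant because one triple appears in both chunks; a cover with zero replicated triples is a partition of the graph.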

One commonly used graph cover strategy is the hash cover, which assigns triples to compute and storage nodes according to the hash value of, e.g., their subject (used, e.g., by Virtuoso Clustered Edition [4], YARS2 [5], [6], Clustered TDB [7] and Trinity.RDF [8]). In order to reduce the number of transferred intermediate results, hierarchical hash has been proposed as an extension of the hash cover strategy that computes the hash only on IRI prefixes [9]. Another commonly used graph cover strategy is the minimal edge-cut cover, which assigns vertices to similarly sized partitions such that the number of edges connecting vertices assigned to different partitions is minimized (used, e.g., by [10], [11], [12]). Furthermore, the vertical cover strategy is inspired by relational databases. It partitions the dataset by storing all triples with the same property in one table. These tables are then distributed among all compute and storage nodes according to the hash of the property. It is used by, e.g., HadoopRDF [13], Jena-HBase [14] and [15]. In order to reduce the number of transferred intermediate results, [11] proposed to replicate triples at the border of the graph chunks. This idea is also used by systems like VB-Partitioner [16] and D-SPARQ [17].
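The three hash-based placements described above can be sketched as follows. This is a simplified illustration, not the implementation of any of the cited systems: the helper names, the use of CRC32 as a stable hash, the rule "prefix = everything up to the last `/` or `#`", and the example IRIs are all assumptions. In particular, the hierarchical hash of [9] is reduced here to hashing a single subject prefix.

```python
import zlib
from typing import Callable, Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def stable_hash(term: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return zlib.crc32(term.encode("utf-8"))


def build_cover(triples: Iterable[Triple], num_chunks: int,
                key: Callable[[Triple], str]) -> Dict[int, Set[Triple]]:
    """Assign each triple to the chunk selected by the hash of key(triple)."""
    chunks: Dict[int, Set[Triple]] = {i: set() for i in range(num_chunks)}
    for triple in triples:
        chunks[stable_hash(key(triple)) % num_chunks].add(triple)
    return chunks


def iri_prefix(iri: str) -> str:
    """Assumed prefix rule: keep everything up to the last '/' or '#'."""
    cut = max(iri.rfind("/"), iri.rfind("#"))
    return iri[: cut + 1] if cut >= 0 else iri


def hash_cover(triples, num_chunks):
    return build_cover(triples, num_chunks, key=lambda t: t[0])  # subject


def hierarchical_hash_cover(triples, num_chunks):
    return build_cover(triples, num_chunks, key=lambda t: iri_prefix(t[0]))


def vertical_cover(triples, num_chunks):
    return build_cover(triples, num_chunks, key=lambda t: t[1])  # property


# Hypothetical example data.
EX = "http://example.org/"
triples = [
    (EX + "west/alice", EX + "knows", EX + "west/bob"),
    (EX + "west/alice", EX + "worksFor", EX + "west/institute"),
    (EX + "west/bob", EX + "knows", EX + "gesis/carol"),
    (EX + "gesis/carol", EX + "owns", EX + "gesis/bello"),
]
for name, cover in [("hash", hash_cover(triples, 3)),
                    ("hierarchical", hierarchical_hash_cover(triples, 3)),
                    ("vertical", vertical_cover(triples, 3))]:
    print(name, {i: len(chunk) for i, chunk in cover.items()})
```

All three variants share one placement routine and differ only in the key that is hashed, which is why they produce non-redundant covers. A minimal edge-cut cover cannot be expressed this way, since it requires a graph partitioner that looks at the whole graph rather than at one triple at a time.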

It is a commonly held belief that query completion is optimized by approaches that emphasize local computation, such as minimal edge-cut (cf. [11], [12], [16]). The first major contribution of this paper is to challenge this assumption with new experiments. Our results indicate that, contrary to this belief, query answering with hash covers may outperform query answering with, e.g., minimal edge-cut covers, since the load on the different machines is more balanced. Furthermore, when triples are replicated on several computers, the high number of duplicate computations may outweigh the benefits of the reduced data transfer over the network and lead to worse query performance.

We have designed our experiments to understand the interdependencies involved in query processing. We have therefore devised new measures and compare graph cover strategies not only in terms of query processing time but also along the following dimensions:

• Load time describes the time it takes to create a graph cover. This is an indicator of how well a graph cover strategy can scale by horizontal scaling of the cloud.
• Storage balance describes to which extent graph chunks are of similar size. This is an indicator that memory needs can be met with increasing data size by horizontal scaling of the cloud.
• Horizontal containment describes to which extent the computation of individual query results is local to one (or a few) graph chunk(s). This is an indicator that query processing is (to some extent) robust when the cloud is scaled horizontally.
• Vertical parallelization describes to which extent different query results may be computed in parallel on different compute nodes. This is an indicator that query processing can scale with growing result set sizes by horizontal scaling of the cloud.
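To give an intuition for two of these dimensions, the following sketch computes simple stand-in indicators. These are not the paper's formal definitions (those are given in its Section 4.1, which is not reproduced here); the function names, the "largest chunk divided by mean chunk size" ratio, and the chunk count per query result are assumptions chosen for illustration, and the second function assumes a non-redundant cover.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def imbalance_ratio(chunk_sizes: List[int]) -> float:
    """Illustrative storage-balance indicator: largest chunk / mean chunk size.
    1.0 means perfectly balanced; larger values mean one node is overloaded."""
    mean = sum(chunk_sizes) / len(chunk_sizes)
    return max(chunk_sizes) / mean if mean > 0 else 1.0


def chunks_touched(result_triples: Iterable[Triple],
                   chunks: List[Set[Triple]]) -> int:
    """Illustrative horizontal-containment indicator: number of chunks that
    contribute triples to one query result (assumes each triple is stored in
    exactly one chunk). 1 means the result is computed fully locally."""
    touched = set()
    for triple in result_triples:
        for index, chunk in enumerate(chunks):
            if triple in chunk:
                touched.add(index)
                break
    return len(touched)


# Hypothetical cover of a three-triple graph over two chunks.
t1 = ("ex:alice", "ex:knows", "ex:bob")
t2 = ("ex:bob", "ex:knows", "ex:carol")
t3 = ("ex:carol", "ex:owns", "ex:bello")
chunks = [{t1, t2}, {t3}]

print(imbalance_ratio([len(c) for c in chunks]))  # 2 / 1.5
print(chunks_touched([t1, t2], chunks))           # result local to one chunk
print(chunks_touched([t2, t3], chunks))           # result spans two chunks
```

A cover can score well on one indicator and poorly on the other: placing all triples in a single chunk gives perfect containment but the worst possible balance.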

Using these measurements, we derive the second important contribution of this paper. Our analysis of query processing with different graph cover strategies shows that vertical parallelization (i.e., a well-distributed workload) may be more important than horizontal containment (i.e., minimal data transport) for efficient query processing, even in a commodity network environment (1 GB/s). Furthermore, our analysis revealed that previous experiments such as [11], [12] and [16] suffered from a setting with highly inefficient methods for data transfer, i.e., methods based on the Hadoop/HDFS infrastructure (see [18]).

In order to determine to which extent graph cover strategies lead to efficient query answering, they have to be implemented and evaluated in distributed RDF stores. For instance, [8] and [12] evaluate various RDF stores that use different graph cover strategies, but these evaluations compared the RDF stores as a whole. Thus, their results also reflect the effects of, e.g., the different indexing strategies and persistence strategies (i.e., main memory vs. hard disk) used by the different stores. In order to focus on the effects of the different graph cover strategies, other evaluations [16], [19] have used the same system to measure the execution time. These systems use technologies like Hadoop or HDFS that cause an overhead for data transfer. To avoid the bias of this overhead, the third important contribution of this paper is the flexible open source platform Koral. It executes queries on arbitrary graph covers and transfers the intermediate results within the network.

In short, the contributions of this paper are:

1. An explanation of why previous evaluations concluded that the amount of data transfer caused by a graph cover strategy is crucial for the query execution effort (Section 6).
2. An analysis indicating that (i) hash covers outperform minimal edge-cut covers and vertical covers, (ii) vertical parallelization is more important than horizontal containment, and (iii) triple replication reduces query performance due to a high number of duplicate computations (Section 5).
3. A benchmark methodology and its implementation that allow for a detailed understanding of the interdependencies of the graph cover strategy and the query processing (Section 4).

This paper is an extended version of the six-page workshop paper [1].

Section snippets

Formalization of graph cover strategies

In order to illustrate different graph cover strategies, we use Fig. 1 as our running example. The graph represents the knows relationship between two employees of the university institute WeST and one employee of the Leibniz institute GESIS. Additionally, the graph includes the ownership of the dog Bello. The terms r:, e:, w:, g:, and f: abbreviate IRI prefixes.

To formalize the problem, we define RDF graphs as in [20]. Assume a signature $σ = (I, B, L)$, where $I$, $B$ and $L$ are the pairwise disjoint

Formalization of query execution strategy

For the impact analysis in this paper, we have extended a state-of-the-art asynchronous execution mechanism such as the one realized in TriAD [10]. The extensions render the query execution mechanism independent of the underlying graph cover. In order to formalize our query execution mechanism in Section 3.2, we first introduce the required formal definitions of a small subset of the standard query language SPARQL in the next section. This section contains only common definitions to provide the

Methodology for benchmarking graph cover strategies

When defining a methodology for investigating the effects of graph cover strategies on distributed RDF stores, several challenges arise. Beyond overall performance for the processing of SPARQL queries [31], we want to observe indications that contribute to understanding how graph cover strategies may relate to scalability. Section 1 has already explained several high-level indicators, which are formally defined in Section 4.1.

Ideally, the graph cover strategy would be the only independent input

Evaluation

The experimental setup we have used for the impact analysis of different graph cover strategies on the query execution effort is explained in Section 5.1. Our results are described in Section 5.2.

Related work

There are two categories of work related to our study. The first consists of other graph cover evaluations and is described in Section 6.1. The second consists of graph cover strategies that we have not yet evaluated. They are described in Section 6.2.

Conclusion

We have presented a comprehensive methodology and its implementation for analysing the impact of graph cover strategies on the performance of distributed RDF stores in the cloud. Our systematically varied, broad set of experiments has revealed that, contrary to the common assumption, the minimal edge-cut cover may have a worse overall query execution performance than hash-based data placement strategies. With the provided set of varying metrics, we found that balancing the query workload across

References (60)

D. Janke et al., On data placement strategies in distributed RDF stores
P. Norvig, The semantic web and the semantics of the web: where does meaning come from?
J. McMurry, S. Jupp, J. Malone, T. Burdett, A. Jenkinson, H. Parkinson, M. Davies, M. Brandizi, et al., Report on the...
O. Erling, I. Mikhailov, Towards web scale RDF, in: 4th Int. Workshop on Scalable Semantic Web Knowledge Base Systems...
A. Harth et al., Optimized index structures for querying RDF from the web
A. Harth, J. Umbrich, A. Hogan, S. Decker, YARS2: A federated repository for querying graph structured data from the...
A. Owens et al., Clustered TDB: A Clustered Triple Store for Jena
K. Zeng et al., A distributed graph engine for web scale RDF data, PVLDB (2013)
K. Lee et al., Scaling queries over big RDF graphs with semantic hash partitioning, PVLDB (2013)
S. Gurajada et al., TriAD: A distributed shared-nothing RDF engine based on asynchronous message passing
J. Huang et al., Scalable SPARQL querying of large RDF graphs, PVLDB (2011)
X. Zhang, L. Chen, Y. Tong, M. Wang, EAGRE: Towards scalable I/O efficient SPARQL query evaluation on the cloud, in:...
M. Farhan Husain, L. Khan, M. Kantarcioglu, B. Thuraisingham, Data intensive query processing for large RDF graphs...
V. Khadilkar, M. Kantarcioglu, B.M. Thuraisingham, P. Castagna, Jena-HBase: A distributed, scalable and efficient RDF...
X. Zhang et al., Towards efficient join processing over large RDF graph using MapReduce
K. Lee et al., Efficient data partitioning model for heterogeneous graphs in the cloud
R. Mutharaju, S. Sakr, A. Sala, P. Hitzler, D-SPARQ: Distributed, scalable and efficient RDF query engine, in: ISWC...
D. Jiang et al., The performance of MapReduce: An in-depth study, PVLDB (2010)
O. Curé, H. Naacke, M.A. Baazizi, B. Amann, On the evaluation of RDF distribution algorithms implemented over apache...
C. Gutierrez et al., Foundations of semantic web databases
L. Galarraga, K. Hose, R. Schenkel, Partout: A distributed engine for efficient RDF processing, CoRR abs/1212.5, 2012,...
M. Bröcheler, A. Pugliese, V.S. Subrahmanian, COSI: Cloud oriented subgraph identification in massive social networks,...
C. Basca, A. Bernstein, Distributed SPARQL throughput increase: On the effectiveness of Workload-driven RDF...
K. Hose, R. Schenkel, WARP: Workload-aware replication and partitioning for RDF, in: Data Engineering Workshops, ICDEW,...
B. Wu, Y. Zhou, P. Yuan, H. Jin, L. Liu, SemStore: A Semantic-Preserving Distributed RDF Triple Store, in: CIKM-2014,...
A. Khandelwal et al., ZipG: A Memory-efficient Graph Store for Interactive Queries
G. Karypis et al., A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM J. Sci. Comput. (1998)
A. Potter, B. Motik, I. Horrocks, Querying distributed RDF graphs: The effects of partitioning, in: Workshop on...
D.J. Abadi et al., Scalable semantic web data management using vertical partitioning
D. Janke et al.