doi:10.1016/j.parco.2003.06.002
Copyright © 2003 Elsevier B.V. All rights reserved.
Data webs for earth science data
Asvin Ananthanarayan, Rajiv Balachandran, Robert Grossman, Yunhong Gu, Xinwei Hong, Jorge Levera and Marco Mazzucco
Laboratory for Advanced Computing, University of Illinois at Chicago, M/C 249, 851 South Morgan Street, Chicago, IL 60607, USA
Received 28 May 2002; accepted 16 June 2003; available online 16 September 2003.
Abstract
We describe high performance data webs for earth science data, which are designed for interactively analyzing small to moderate size remote data sets, as well as mining distributed data sets. Achieving high performance required developing specialized high performance transport services as well as specialized high performance middleware services for merging multiple data streams. Data webs complement data grids, which are grid based infrastructures designed to support arbitrary distributed computation over distributed data using a trusted computing model.
Author Keywords: Data webs; Data grids; High performance web services; Grid data; Correlation keys
Fig. 1. To achieve high end-to-end performance, the local parallel I/O streams are sent in parallel over a long haul network and managed by the application on the individual nodes of a cluster. In the architecture described in this paper, the data transport is instead managed by specialized high performance transport services such as PSockets and SABUL. The striping of the data over DSTP and the merging of the data by UCKs are managed by specialized high performance middleware services such as P-DSTP and P-Merge.
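The idea of striping data over parallel streams, as in Fig. 1, can be illustrated with a minimal sketch. It is not the PSockets, SABUL, or P-DSTP implementation: the function names `stripe` and `reassemble`, the round-robin assignment, and the fixed chunk size are all assumptions made for illustration, and in-memory lists stand in for network streams.

```python
def stripe(data: bytes, n_streams: int, chunk: int = 4):
    """Split data into fixed-size chunks and deal them round-robin to streams.

    Each chunk is tagged with its sequence number so the receiver can
    restore the original order regardless of arrival order.
    """
    streams = [[] for _ in range(n_streams)]
    for offset in range(0, len(data), chunk):
        seq = offset // chunk
        streams[seq % n_streams].append((seq, data[offset:offset + chunk]))
    return streams


def reassemble(streams):
    """Merge the chunks from all streams back into one byte string."""
    chunks = sorted(
        (item for stream in streams for item in stream),
        key=lambda item: item[0],
    )
    return b"".join(payload for _, payload in chunks)


if __name__ == "__main__":
    payload = bytes(range(23))
    parts = stripe(payload, n_streams=3)
    print([len(s) for s in parts])
    print(reassemble(parts) == payload)
```

In a real deployment each list would be a separate socket, and the sequence numbers are what allow the receiving side to tolerate streams that progress at different rates.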
Fig. 2. A sample page of the DSTP-based data web client.
Fig. 3. A 3D plot from the DSTP-based data web client.
Table 1. Today, access to geographical data is primarily through data archives

Data archives are beginning to be supplemented by data webs and data grids. The table summarizes some of the basic distinctions between data archives, data webs, and grids. AAA is an abbreviation for authorization, authentication, and access.
Table 2. This table lists the commands used by the data space transfer protocol (DSTP)

Table 3. Our design of high performance data webs for geographical data introduces two new layers into a standard layered network model

The first new layer defines specialized high performance transport protocols over standard protocols such as TCP and UDP. The second new layer defines specialized high performance network services for working with multidimensional data over standard network services such as HTTP, SOAP, and newer protocols such as DSTP.
Table 4. Performance timings for PHP web client

Table 5. This table summarizes the performance gain from using specialized transport services such as PSockets and SABUL when moving data between NCAR and Ann Arbor

Both Iperf and PSockets use TCP for data transmission. Using Iperf along with TCP window tuning we obtained a throughput of 83.3 Mb/s. PSockets detected the best number of parallel sockets to be 19. The maximum throughput obtained using PSockets was found to be 85.27 Mb/s. The maximum packet loss percentage we have observed during a TCP transmission over a well tuned Abilene network was 1 percent. Therefore SABUL was tuned to run with a maximum packet loss of 1 percent. The percentage of packet loss is given in the last column. Table 5 shows that SABUL’s performance is superior to the throughput results obtained by Iperf and PSockets. The data is from [19].
Table 6. This table summarizes the performance of the specialized middleware services we developed for merging multiple streams of data by their UCKs

The algorithm is a windowed merge from [16]. The overall processing rate for merging two data sets by latitude, longitude and time is about 170 Mb/s. If the data is ordered, then merging can sometimes be done at line speed. As the data becomes more disordered (measured by the randomness column rand), either the processing speed must be reduced or some unordered records (measured by the match rate column) must be passed through unmerged. This is a basic tradeoff in exploratory data analysis: interactivity versus accuracy. The data is from [16]. All times are in seconds.
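The tradeoff between window size and match rate can be made concrete with a small sketch. This is not the algorithm from [16]; it is a generic bounded-buffer join written for illustration. The name `windowed_merge`, the `window` parameter, and the assumption that keys are unique within each stream are all hypothetical. Keys stand in for UCKs such as (latitude, longitude, time).

```python
from collections import OrderedDict
from itertools import zip_longest


def windowed_merge(left, right, window=4):
    """Join two (key, value) streams using one bounded buffer per stream.

    Returns (matched, unmatched). matched holds (key, left_value, right_value).
    unmatched holds (side, key, value) for records that left a buffer
    without finding a partner. Assumes keys are unique within each stream.
    """
    buffers = {"left": OrderedDict(), "right": OrderedDict()}
    matched, unmatched = [], []

    def feed(side, other, record):
        key, value = record
        if key in buffers[other]:
            # Partner is still inside the other stream's window: emit a match.
            other_value = buffers[other].pop(key)
            if side == "left":
                matched.append((key, value, other_value))
            else:
                matched.append((key, other_value, value))
            return
        buffers[side][key] = value
        if len(buffers[side]) > window:
            # Window is full: pass the oldest record through unmerged.
            old_key, old_value = buffers[side].popitem(last=False)
            unmatched.append((side, old_key, old_value))

    missing = object()
    # Alternate between the two streams, one record from each per step.
    for l_rec, r_rec in zip_longest(left, right, fillvalue=missing):
        if l_rec is not missing:
            feed("left", "right", l_rec)
        if r_rec is not missing:
            feed("right", "left", r_rec)

    # Whatever remains buffered at the end never found a partner.
    for side in ("left", "right"):
        for key, value in buffers[side].items():
            unmatched.append((side, key, value))
    return matched, unmatched


if __name__ == "__main__":
    a = [(k, "A") for k in [0, 1, 2, 3]]
    b = [(k, "B") for k in [3, 2, 1, 0]]  # same keys, reversed order
    for w in (4, 1):
        m, u = windowed_merge(a, b, window=w)
        print(w, len(m), len(u))
```

With fully reversed input, a window of 4 matches every record, while a window of 1 matches only half of them and passes the rest through. A smaller window bounds memory and per-record work but lowers the match rate on disordered data.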