Workload analysis for scientific literature digital libraries

Li, Huajing; Lee, Wang-Chien; Sivasubramaniam, Anand; Giles, C. Lee

doi:10.1007/s00799-008-0043-z

Regular Paper
Published: 23 September 2008

Volume 9, pages 139–149 (2008)
International Journal on Digital Libraries

Huajing Li¹,
Wang-Chien Lee¹,
Anand Sivasubramaniam¹ &
…
C. Lee Giles²

Abstract

Workload studies of large-scale systems can help locate potential bottlenecks and improve performance. However, previous workload analyses of Web applications have typically focused on generic platforms, neglecting the unique characteristics of individual application domains. Different application domains have intrinsically heterogeneous characteristics, which directly affect system performance. In this study, we present an extensive analysis of the workload of scientific literature digital libraries, unveiling their temporal and user interest patterns. Logs of a computer science literature digital library, CiteSeer, are collected and analyzed. We intentionally remove service details specific to CiteSeer, and we believe our analysis is applicable to other systems with similar characteristics. While many of our findings are consistent with previous Web analyses, we discover several characteristics unique to scientific literature digital library workloads. Furthermore, we discuss how to use our findings to improve system performance.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
Huajing Li, Wang-Chien Lee & Anand Sivasubramaniam
Department of Computer Science and Engineering, College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA, 16802, USA
C. Lee Giles

Corresponding author

Correspondence to Huajing Li.

About this article

Cite this article

Li, H., Lee, WC., Sivasubramaniam, A. et al. Workload analysis for scientific literature digital libraries. Int J Digit Libr 9, 139–149 (2008). https://doi.org/10.1007/s00799-008-0043-z

Published: 23 September 2008
Issue Date: November 2008
DOI: https://doi.org/10.1007/s00799-008-0043-z
