
Workload analysis for scientific literature digital libraries

  • Regular Paper
International Journal on Digital Libraries

Abstract

Workload studies of large-scale systems can help locate possible bottlenecks and improve performance. However, previous workload analyses of Web applications have typically focused on generic platforms, neglecting the unique characteristics of individual application domains. It has been observed that different application domains have intrinsically heterogeneous characteristics, which directly affect system performance. In this study, we present an extensive analysis of the workload of scientific literature digital libraries, unveiling their temporal and user interest patterns. Logs of a computer science literature digital library, CiteSeer, are collected and analyzed. We intentionally remove service details specific to CiteSeer, and we believe our analysis is applicable to other systems with similar characteristics. While many of our findings are consistent with previous Web analyses, we discover several characteristics unique to scientific literature digital library workloads. Furthermore, we discuss how to use our findings to improve system performance.

Corresponding author

Correspondence to Huajing Li.

About this article

Cite this article

Li, H., Lee, WC., Sivasubramaniam, A. et al. Workload analysis for scientific literature digital libraries. Int J Digit Libr 9, 139–149 (2008). https://doi.org/10.1007/s00799-008-0043-z

  • DOI: https://doi.org/10.1007/s00799-008-0043-z
