Abstract
Workload studies of large-scale systems can help locate potential bottlenecks and improve performance. However, previous workload analyses of Web applications have typically focused on generic platforms, neglecting the characteristics unique to particular application domains. Different application domains have intrinsically heterogeneous characteristics, which directly affect system performance. In this study, we present an extensive analysis of the workload of scientific literature digital libraries, unveiling their temporal and user interest patterns. We collect and analyze logs from CiteSeer, a computer science literature digital library. We intentionally remove service details specific to CiteSeer. We believe our analysis is applicable to other systems with similar characteristics. While many of our findings are consistent with previous Web workload analyses, we discover several characteristics unique to scientific literature digital library workloads. Furthermore, we discuss how to use our findings to improve system performance.
Cite this article
Li, H., Lee, WC., Sivasubramaniam, A. et al. Workload analysis for scientific literature digital libraries. Int J Digit Libr 9, 139–149 (2008). https://doi.org/10.1007/s00799-008-0043-z
DOI: https://doi.org/10.1007/s00799-008-0043-z