Summarization – compressing data into an informative representation

  • Regular Paper
Knowledge and Information Systems

Abstract

In this paper, we formulate the problem of summarization of a data set of transactions with categorical attributes as an optimization problem involving two objective functions – compaction gain and information loss. We propose metrics to characterize the output of any summarization algorithm. We investigate two approaches to address this problem. The first approach is an adaptation of clustering and the second approach makes use of frequent itemsets from the association analysis domain. We illustrate one application of summarization in the field of network data where we show how our technique can be effectively used to summarize network traffic into a compact but meaningful representation. Specifically, we evaluate our proposed algorithms on the 1998 DARPA Off-Line Intrusion Detection Evaluation data and network data generated by SKAION Corp for the ARDA information assurance program.
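
The abstract names the two objectives but does not define them. As a minimal sketch, assume that compaction gain is the ratio of the number of transactions to the number of summary entries, and that information loss is the average fraction of attribute values generalized away by the best summary entry covering each transaction. Both definitions, the function names, and the toy network-flow records are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List, Optional

# A transaction maps each categorical attribute to a value.
Transaction = Dict[str, str]
# A summary entry maps attributes to a value, or None when the
# attribute has been generalized away (a wildcard).
Summary = Dict[str, Optional[str]]


def covers(summary: Summary, transaction: Transaction) -> bool:
    """True if every non-wildcard value in the summary matches the transaction."""
    return all(v is None or transaction.get(k) == v for k, v in summary.items())


def compaction_gain(transactions: List[Transaction], summaries: List[Summary]) -> float:
    """Assumed definition: number of transactions per summary entry."""
    return len(transactions) / len(summaries)


def information_loss(transactions: List[Transaction], summaries: List[Summary]) -> float:
    """Assumed definition: mean fraction of attributes lost per transaction,
    using the covering summary entry that loses the least. A transaction
    covered by no entry is counted as fully lost (1.0)."""
    total = 0.0
    for t in transactions:
        losses = [
            sum(1 for k in t if s.get(k) is None) / len(t)
            for s in summaries
            if covers(s, t)
        ]
        total += min(losses) if losses else 1.0
    return total / len(transactions)


# Hypothetical flow records with three categorical attributes.
flows = [
    {"src": "10.0.0.1", "dst_port": "80", "proto": "tcp"},
    {"src": "10.0.0.2", "dst_port": "80", "proto": "tcp"},
    {"src": "10.0.0.3", "dst_port": "53", "proto": "udp"},
]
# Two summary entries: the first merges both web flows by dropping "src".
summary = [
    {"src": None, "dst_port": "80", "proto": "tcp"},
    {"src": "10.0.0.3", "dst_port": "53", "proto": "udp"},
]

print(compaction_gain(flows, summary))   # 1.5
print(information_loss(flows, summary))  # 2/9, about 0.222
```

Merging more transactions into fewer, more general entries raises the compaction gain but also raises the information loss under these assumed definitions, which is the tension the optimization problem has to balance.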

Author information

Corresponding author

Correspondence to Varun Chandola.

Additional information

Vipin Kumar is currently William Norris Professor and Head of the Computer Science and Engineering Department at the University of Minnesota. His research interests include high-performance computing and data mining. He has authored over 200 research articles, and has coedited or coauthored nine books including the widely used text books Introduction to Parallel Computing and Introduction to Data Mining, both published by Addison Wesley. He has served as chair/co-chair for many conferences/workshops in the area of data mining and parallel computing, including the IEEE International Conference on Data Mining (2002) and the 15th International Parallel and Distributed Processing Symposium (2001). He serves as the chair of the steering committee of the SIAM International Conference on Data Mining, and is a member of the steering committee of the IEEE International Conference on Data Mining. Dr. Kumar serves or has served on the editorial boards of several journals including Knowledge and Information Systems, Journal of Parallel and Distributed Computing and IEEE Transactions of Data and Knowledge Engineering (1993–1997). He is a Fellow of the ACM and IEEE, and a member of SIAM.

Varun Chandola received his BTech degree in Computer Science from the Indian Institute of Technology, Madras, India, in 2002. He is currently a PhD student in the Computer Science and Engineering Department at the University of Minnesota. His research interests include data mining, cyber-security and machine learning.

About this article

Cite this article

Chandola, V., Kumar, V. Summarization – compressing data into an informative representation. Knowl Inf Syst 12, 355–378 (2007). https://doi.org/10.1007/s10115-006-0039-1

  • DOI: https://doi.org/10.1007/s10115-006-0039-1
