DOI: 10.1145/1014052.1014137
Article

Cluster-based concept invention for statistical relational learning

Published: 22 August 2004

ABSTRACT

We use clustering to derive new relations that augment the database schema used for automatic generation of predictive features in statistical relational learning. Entities derived from clusters increase the expressivity of feature spaces by creating new first-class concepts that contribute to the creation of new features. For example, in CiteSeer, papers can be clustered based on words or citations, giving "topics", and authors can be clustered based on the documents they co-author, giving "communities". Such cluster-derived concepts become part of more complex feature expressions. Out of the large number of generated features, those that improve predictive accuracy are kept in the model, as decided by statistical feature selection criteria. We present results demonstrating improved accuracy on two tasks, venue prediction and link prediction, using CiteSeer data.
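
The idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy papers, the citation pairs, the names `paper_topic` and `same_topic_citations`, and the use of plain k-means with fixed seeds are not taken from the paper, whose actual clustering method and feature generator are not described in the abstract. The sketch clusters papers by their words, stores the result as a new relation, and then uses that relation inside a candidate feature. The statistical feature selection step is left out.

```python
from collections import Counter

# Hypothetical toy data: paper id -> title words (not real CiteSeer records).
PAPER_WORDS = {
    "p1": ["neural", "network", "learning"],
    "p2": ["neural", "learning", "gradient"],
    "p3": ["database", "query", "index"],
    "p4": ["database", "index", "transaction"],
}
# Hypothetical citation relation: (citing paper, cited paper).
CITES = [("p1", "p2"), ("p2", "p1"), ("p3", "p4"), ("p4", "p1")]


def vectorize(docs):
    """Turn each word list into a count vector over a shared, sorted vocabulary."""
    vocab = sorted({w for words in docs.values() for w in words})
    vectors = {}
    for pid, words in docs.items():
        counts = Counter(words)
        vectors[pid] = [counts[w] for w in vocab]
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, seeds, iters=10):
    """Plain k-means; centers start at the given seed papers, so runs are deterministic."""
    centers = [list(vectors[s]) for s in seeds]
    assign = {}
    for _ in range(iters):
        # Assignment step: each paper goes to its nearest center.
        assign = {
            pid: min(range(len(centers)), key=lambda c: sq_dist(v, centers[c]))
            for pid, v in vectors.items()
        }
        # Update step: each center becomes the mean of its members.
        for c in range(len(centers)):
            members = [vectors[p] for p, a in assign.items() if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# New cluster-derived relation: paper_topic(paper, topic id).
paper_topic = kmeans(vectorize(PAPER_WORDS), seeds=["p1", "p3"])


def same_topic_citations(paper):
    """Candidate feature: how many papers cited by `paper` share its topic."""
    return sum(
        1
        for src, dst in CITES
        if src == paper and paper_topic[dst] == paper_topic[paper]
    )
```

Here `paper_topic` plays the role of an invented concept: it did not exist in the original schema (papers, words, citations), yet once added it can be joined with `CITES` like any other relation. In a full system, many such candidate features would be generated and only those passing a selection criterion would be kept.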

      Published in

      KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2004
      874 pages
      ISBN: 1581138881
      DOI: 10.1145/1014052

      Copyright © 2004 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
