Semi-supervised Constrained Clustering: An Expert-Guided Data Analysis Methodology

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6230)

Abstract

This paper presents a methodology for expert-guided analysis of large data sets, including large text corpora. Its main ingredient is an algorithm for semi-supervised data clustering with cluster size constraints, which implements several improvements over existing constrained k-means clustering algorithms. First, it allows for a larger set of user-defined cluster size constraints of different types (lower- and upper-bound constraints). Second, it allows for dynamic re-assignment of predefined constraints to clusters during iterative cluster computation and optimization, thus improving the results of constrained clustering. Third, it allows for expert-guided cluster optimization, achieved by combining constrained clustering and data visualization, which gives the expert finer-grained control over the clustering process and further improves the quality of the obtained clustering solutions. Incorporating data visualization into the clustering process allows the user to select referential points, which act as constraint anchors in the course of iterative cluster computation. The proposed semi-supervised constrained clustering methodology has been implemented in the service-oriented data mining environment Orange4WS and evaluated on different document corpora.
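To make the idea of lower- and upper-bound cluster size constraints concrete, the following is a minimal sketch of a k-means loop whose assignment step respects such bounds. It is not the algorithm from the paper: the assignment step here is a simple greedy heuristic chosen for brevity, it does not cover the dynamic re-assignment of constraints or the visualization-based anchors, and all function and parameter names (`constrained_assign`, `constrained_kmeans`, `lower`, `upper`) are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def constrained_assign(points, centroids, lower, upper):
    """Greedy assignment with lower[j] <= size of cluster j <= upper[j].

    Heuristic only (an assumption of this sketch): it yields a feasible
    assignment, not necessarily the cheapest one.
    """
    n, k = len(points), len(centroids)
    if any(lo > up for lo, up in zip(lower, upper)):
        raise ValueError("each lower bound must not exceed its upper bound")
    if not sum(lower) <= n <= sum(upper):
        raise ValueError("size bounds are infeasible for this data set")

    # All (cost, point index, cluster index) triples, cheapest first.
    pairs = sorted(
        (sq_dist(p, c), i, j)
        for i, p in enumerate(points)
        for j, c in enumerate(centroids)
    )
    labels = [None] * n
    sizes = [0] * k

    # Phase 1: satisfy every lower bound using the cheapest available pairs.
    for _, i, j in pairs:
        if labels[i] is None and sizes[j] < lower[j]:
            labels[i] = j
            sizes[j] += 1

    # Phase 2: send each remaining point to its nearest non-full cluster.
    for _, i, j in pairs:
        if labels[i] is None and sizes[j] < upper[j]:
            labels[i] = j
            sizes[j] += 1
    return labels


def constrained_kmeans(points, k, lower, upper, iters=20, seed=0):
    """Alternate constrained assignment and centroid updates."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = constrained_assign(points, centroids, lower, upper)
        if new_labels == labels:  # assignments stable: stop early
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
            (0.0, 0.4), (0.4, 0.1), (10.0, 10.0), (10.2, 9.9)]
    labels, _ = constrained_kmeans(data, k=2, lower=[3, 3], upper=[5, 5])
    print([labels.count(j) for j in range(2)])
```

In the toy data, unconstrained k-means would most likely produce clusters of sizes 6 and 2; with bounds of 3 to 5 per cluster, some points from the larger group are forced into the smaller one. An exact assignment step could instead be posed as a minimum-cost flow or assignment problem, which this sketch deliberately avoids.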

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Podpečan, V., Grčar, M., Lavrač, N. (2010). Semi-supervised Constrained Clustering: An Expert-Guided Data Analysis Methodology. In: Zhang, BT., Orgun, M.A. (eds) PRICAI 2010: Trends in Artificial Intelligence. PRICAI 2010. Lecture Notes in Computer Science, vol 6230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15246-7_22

  • DOI: https://doi.org/10.1007/978-3-642-15246-7_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15245-0

  • Online ISBN: 978-3-642-15246-7

  • eBook Packages: Computer Science, Computer Science (R0)
