Abstract
This paper presents a methodology for expert-guided analysis of large data sets, including large text corpora. Its main ingredient is an algorithm for semi-supervised data clustering with cluster size constraints, which implements several improvements over existing constrained k-means clustering algorithms. First, it allows for a larger set of user-defined cluster size constraints of different types (lower- and upper-bound constraints). Second, it allows for dynamic re-assignment of predefined constraints to clusters during iterative cluster computation/optimization, thus improving the results of constrained clustering. Third, it allows for expert-guided cluster optimization by combining constrained clustering with data visualization, which gives the expert finer-grained control over the clustering process and further improves the quality of the resulting clustering solutions. Incorporating data visualization into the clustering process allows the user to select reference points that act as constraint anchors in the course of iterative cluster computation. The proposed semi-supervised constrained clustering methodology has been implemented in the service-oriented data mining environment Orange4WS and evaluated on different document corpora.
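To make the idea of lower- and upper-bound cluster size constraints concrete, the following minimal Python sketch runs k-means with a size-bounded assignment step. It is an illustration under stated assumptions, not the algorithm proposed in the paper: the assignment uses a simple greedy heuristic rather than an optimal (e.g. network-flow) formulation, constraints are fixed to cluster indices rather than dynamically re-assigned, and the names `assign_with_size_bounds` and `constrained_kmeans` are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign_with_size_bounds(points, centroids, lower, upper):
    """Greedy assignment honouring per-cluster size bounds (illustrative only)."""
    n, k = len(points), len(centroids)
    if any(lo > up for lo, up in zip(lower, upper)):
        raise ValueError("each lower bound must not exceed its upper bound")
    if not (sum(lower) <= n <= sum(upper)):
        raise ValueError("size bounds are infeasible for this many points")

    # All (distance, point index, cluster index) triples, closest first.
    pairs = sorted(
        (dist(p, c), i, j)
        for i, p in enumerate(points)
        for j, c in enumerate(centroids)
    )
    labels = [None] * n
    sizes = [0] * k

    # Pass 1: fill every cluster up to its lower bound.
    for _, i, j in pairs:
        if labels[i] is None and sizes[j] < lower[j]:
            labels[i] = j
            sizes[j] += 1

    # Pass 2: put each remaining point in the nearest cluster with room left.
    for _, i, j in pairs:
        if labels[i] is None and sizes[j] < upper[j]:
            labels[i] = j
            sizes[j] += 1
    return labels


def constrained_kmeans(points, centroids, lower, upper, iterations=20):
    """Alternate size-bounded assignment and centroid updates."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        labels = assign_with_size_bounds(points, centroids, lower, upper)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Coordinate-wise mean of the cluster members.
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(old)  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # centroids stopped moving
        centroids = updated
    return labels, centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    labels, centres = constrained_kmeans(
        data, [(0, 0), (10, 10)], lower=[2, 2], upper=[3, 3]
    )
    print(labels, centres)
```

In the toy data above, unconstrained k-means would isolate the single outlier in its own cluster; the lower bound of two forces the second cluster to take in one further point. A greedy pass of this kind always yields a feasible assignment when the bounds are consistent, but it does not guarantee the minimum total distance.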
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Podpečan, V., Grčar, M., Lavrač, N. (2010). Semi-supervised Constrained Clustering: An Expert-Guided Data Analysis Methodology. In: Zhang, BT., Orgun, M.A. (eds) PRICAI 2010: Trends in Artificial Intelligence. PRICAI 2010. Lecture Notes in Computer Science, vol 6230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15246-7_22
DOI: https://doi.org/10.1007/978-3-642-15246-7_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15245-0
Online ISBN: 978-3-642-15246-7
eBook Packages: Computer Science, Computer Science (R0)