Abstract
The rapid adoption of XML has created a need for a proper representation of semi-structured documents, one that takes the document's semantic structural information into account to support more precise document analysis. To analyze the information in XML documents efficiently, research on XML document clustering is actively in progress. The key issue is how to define the similarity measure between XML documents that the clustering uses. Because XML documents are hierarchically structured, a general-purpose document similarity measure is not appropriate for clustering them. Dimension reduction plays an important role in handling massive quantities of high-dimensional data, such as large collections of semantically structured documents. In this paper, we introduce distance dimension reduction (DDR) based on the QR factorization (DDR/QR) or the Cholesky factorization (DDR/C). DDR generates lower-dimensional representations of high-dimensional XML documents that exactly preserve the Euclidean distances and cosine similarities between every pair of XML documents in the original space. After projecting the XML documents onto the lower-dimensional space obtained from DDR, our proposed method, the QR fuzzy c-mean (QR-FCM), runs the document-analysis clustering algorithm in that space. DDR can substantially reduce the computing time and/or memory requirements of a given document-analysis clustering algorithm, especially when the algorithm must be run many times to estimate parameters or to search for a better solution.
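The distance-preservation claim can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes each XML document has already been encoded as a column of a tall feature matrix `A` (the encoding itself is not described in the abstract), and the helper names `ddr_qr`, `ddr_cholesky`, `pairwise_distances`, and `cosine_similarities` are hypothetical. With m features and n documents (m much larger than n), both factorizations yield an n-by-n triangular matrix whose columns have the same pairwise Euclidean distances and cosine similarities as the columns of `A`.

```python
import numpy as np

def ddr_qr(A):
    """DDR/QR sketch: A = QR with orthonormal Q, so R = Q^T A keeps all
    inner products between columns. Returns the n x n factor R."""
    # mode="r" asks NumPy for the triangular factor only
    return np.linalg.qr(A, mode="r")

def ddr_cholesky(A):
    """DDR/C sketch: factor the Gram matrix A^T A = C^T C.
    Assumes A has full column rank, so A^T A is positive definite."""
    L = np.linalg.cholesky(A.T @ A)  # lower triangular, A^T A = L L^T
    return L.T                       # upper triangular C

def pairwise_distances(X):
    """Euclidean distances between all pairs of columns of X."""
    diff = X[:, :, None] - X[:, None, :]
    return np.linalg.norm(diff, axis=0)

def cosine_similarities(X):
    """Cosine similarities between all pairs of columns of X."""
    Xn = X / np.linalg.norm(X, axis=0)
    return Xn.T @ Xn

# Synthetic stand-in for document vectors: 500 features, 12 documents.
rng = np.random.default_rng(0)
A = rng.random((500, 12))

R = ddr_qr(A)        # 12 x 12 instead of 500 x 12
C = ddr_cholesky(A)  # 12 x 12 instead of 500 x 12

print("reduced shapes:", R.shape, C.shape)
print("QR keeps distances:", np.allclose(pairwise_distances(A), pairwise_distances(R)))
print("QR keeps cosines:", np.allclose(cosine_similarities(A), cosine_similarities(R)))
print("Cholesky keeps distances:", np.allclose(pairwise_distances(A), pairwise_distances(C)))
print("Cholesky keeps cosines:", np.allclose(cosine_similarities(A), cosine_similarities(C)))
```

Any clustering algorithm that depends only on Euclidean distances or cosine similarities, fuzzy c-means included, could then be run on the columns of `R` or `C` instead of `A`. The Cholesky route works from the Gram matrix, which squares the condition number of `A`, so the QR route is usually the numerically safer choice.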
Copyright information
© 2009 ICST Institute for Computer Science, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Chang, HK., Jou, IC. (2009). Distance Dimension Reduction on QR Factorization for Efficient Clustering Semantic XML Document Using the QR Fuzzy C-Mean (QR-FCM). In: Mueller, P., Cao, JN., Wang, CL. (eds) Scalable Information Systems. INFOSCALE 2009. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 18. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10485-5_20
DOI: https://doi.org/10.1007/978-3-642-10485-5_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10484-8
Online ISBN: 978-3-642-10485-5
eBook Packages: Computer Science, Computer Science (R0)