Distance Dimension Reduction on QR Factorization for Efficient Clustering Semantic XML Document Using the QR Fuzzy C-Mean (QR-FCM)

Chang, Hsu-Kuang; Jou, I-Chang

doi:10.1007/978-3-642-10485-5_20

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 18)

Included in the following conference series:

International Conference on Scalable Information Systems

Abstract

The rapid growth of XML adoption has created a need for a proper representation of semi-structured documents, one that takes a document's semantic structural information into account to support more precise document analysis. To analyze the information represented in XML documents efficiently, research on XML document clustering is actively in progress. The key issue is how to devise the similarity measure between XML documents that is used for clustering. Because XML documents have a hierarchical structure, clustering them with a general document similarity measure is not appropriate. Dimension reduction plays an important role in handling massive quantities of high-dimensional data, such as large collections of semantic structural documents. In this paper, we introduce distance dimension reduction (DDR) based on the QR factorization (DDR/QR) or the Cholesky factorization (DDR/C). DDR generates lower-dimensional representations of high-dimensional XML documents that exactly preserve the Euclidean distances and cosine similarities between any pair of XML documents in the original space. After projecting the XML documents onto the lower-dimensional space obtained from DDR, our proposed method, QR fuzzy c-mean (QR-FCM), carries out the document clustering. DDR can substantially reduce the computing time and/or memory requirement of a given document clustering algorithm, especially when the algorithm must be run many times to estimate parameters or to search for a better solution.
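The distance-preserving property claimed in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`ddr_qr`, `ddr_cholesky`, `fuzzy_c_means`), the synthetic data, the column-per-document layout, and the fuzzifier value `m = 2` are all assumptions made for illustration, and the step that turns XML documents into feature vectors is not shown. The idea is that for an m x n matrix A with m >= n, the reduced QR factorization A = QR has orthonormal columns in Q, so the n x n factor R satisfies R^T R = A^T A and its columns keep every pairwise Euclidean distance and cosine similarity of the columns of A.

```python
import numpy as np


def ddr_qr(A):
    """Reduce an m x n matrix (columns = documents, m >= n) to n x n via QR."""
    # mode="r" returns only the triangular factor R of A = Q R.
    return np.linalg.qr(A, mode="r")


def ddr_cholesky(A):
    """Same reduction via Cholesky of the Gram matrix (needs full column rank)."""
    L = np.linalg.cholesky(A.T @ A)  # A^T A = L L^T, L lower triangular
    return L.T                       # columns of L^T are the reduced documents


def pairwise_sq_dists(X):
    """Squared Euclidean distances between all pairs of columns of X."""
    G = X.T @ X
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2.0 * G


def cosine_matrix(X):
    """Cosine similarities between all pairs of columns of X."""
    norms = np.linalg.norm(X, axis=0)
    return (X.T @ X) / np.outer(norms, norms)


def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means on the columns of X (illustrative, not tuned)."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[1]))
    U /= U.sum(axis=0, keepdims=True)          # memberships sum to 1 per point
    for _ in range(n_iter):
        W = U ** m
        centers = (X @ W.T) / W.sum(axis=1)    # weighted means, shape d x c
        d = np.linalg.norm(X[:, None, :] - centers[:, :, None], axis=0)
        d = np.maximum(d, 1e-12)               # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


# Synthetic stand-in for document vectors: 200 features, 30 documents, 2 groups.
rng = np.random.default_rng(0)
A = np.hstack([rng.normal(0.0, 1.0, (200, 15)),
               rng.normal(5.0, 1.0, (200, 15))])
R = ddr_qr(A)                                  # 30 x 30 instead of 200 x 30

_, U_full = fuzzy_c_means(A, c=2)              # clustering in the original space
_, U_reduced = fuzzy_c_means(R, c=2)           # clustering in the reduced space
```

Because fuzzy c-means depends on the data only through distances to weighted means of the points, running it on R from the same initial memberships should give the same memberships as running it on A, up to rounding, while each iteration works with n-dimensional rather than m-dimensional vectors.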

Author information

Authors and Affiliations

Institute of Engineering Science and Technology, National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan
Hsu-Kuang Chang & I-Chang Jou
Department of Information Engineering, I-Shou University, Kaohsiung, Taiwan
Hsu-Kuang Chang

Editor information

Editors and Affiliations

IBM Zurich Research Laboratory, Saeumerstr. 4, 8803, Rueschlikon, Switzerland
Peter Mueller
Department of Computing, Hung Hom, Hong Kong Polytechnic University, Kowloon, Hong Kong
Jian-Nong Cao
Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong
Cho-Li Wang

About this paper

Cite this paper

Chang, HK., Jou, IC. (2009). Distance Dimension Reduction on QR Factorization for Efficient Clustering Semantic XML Document Using the QR Fuzzy C-Mean (QR-FCM). In: Mueller, P., Cao, JN., Wang, CL. (eds) Scalable Information Systems. INFOSCALE 2009. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 18. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10485-5_20

DOI: https://doi.org/10.1007/978-3-642-10485-5_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10484-8
Online ISBN: 978-3-642-10485-5
eBook Packages: Computer Science, Computer Science (R0)
