
Distance Dimension Reduction on QR Factorization for Efficient Clustering Semantic XML Document Using the QR Fuzzy C-Mean (QR-FCM)

  • Conference paper
Scalable Information Systems (INFOSCALE 2009)

Abstract

The rapid adoption of XML has created a need for a proper representation of semi-structured documents, one that takes a document's semantic structural information into account so as to support more precise document analysis. To analyze the information represented in XML documents efficiently, research on XML document clustering is actively in progress. The key issue is how to devise the similarity measure between XML documents that is used for clustering. Because XML documents have a hierarchical structure, it is not appropriate to cluster them with a general document similarity measure. Dimension reduction plays an important role in handling massive quantities of high-dimensional data, such as large collections of semantic structural documents. In this paper, we introduce distance dimension reduction (DDR) based on the QR factorization (DDR/QR) or the Cholesky factorization (DDR/C). DDR generates lower-dimensional representations of high-dimensional XML documents that exactly preserve the Euclidean distances and cosine similarities between any pair of XML documents in the original space. After the XML documents are projected onto the lower-dimensional space obtained from DDR, our proposed method, which we call QR fuzzy c-mean (QR-FCM), executes the document-analysis clustering algorithm. DDR can substantially reduce the computing time and/or memory requirement of a given document-analysis clustering algorithm, especially when the algorithm must be run many times to estimate parameters or to search for a better solution.
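The distance-preservation claim can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each XML document has already been encoded as a numeric feature vector, that there are more features than documents, and that the vectors are linearly independent. The function names `ddr_qr`, `euclidean`, and `cosine` are hypothetical, and a plain-Python modified Gram-Schmidt stands in for a library QR routine. If the document vectors are the columns of A = QR, where Q has orthonormal columns, then column j of R is a reduced representation of document j, and inner products (hence Euclidean distances and cosine similarities) are unchanged.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def ddr_qr(docs):
    """Reduce n document vectors of length m to n vectors of length n.

    Hypothetical helper: runs modified Gram-Schmidt over the documents and
    returns the columns of the triangular factor R (one column per document).
    Assumes m >= n and linearly independent documents.
    """
    n = len(docs)
    work = [list(map(float, d)) for d in docs]   # residual of each document
    reduced = [[0.0] * n for _ in range(n)]      # reduced[j][k] holds R[k][j]
    for k in range(n):
        norm = math.sqrt(dot(work[k], work[k]))
        if norm < 1e-12:
            raise ValueError("this sketch needs linearly independent documents")
        q = [x / norm for x in work[k]]          # k-th orthonormal direction
        reduced[k][k] = norm
        for j in range(k + 1, n):
            r = dot(q, work[j])
            reduced[j][k] = r
            # Remove the component along q from the remaining documents.
            work[j] = [x - r * y for x, y in zip(work[j], q)]
    return reduced


# Three toy "documents" in a 6-dimensional feature space (made-up values).
docs = [
    [1, 0, 2, 0, 1, 3],
    [0, 1, 1, 2, 0, 1],
    [2, 1, 0, 1, 1, 0],
]
reduced = ddr_qr(docs)  # three vectors of length 3
```

Any distance-based clustering step, such as a fuzzy c-means iteration, could then be run on `reduced` instead of `docs`; the saving grows with the gap between the feature dimension and the number of documents.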




Copyright information

© 2009 ICST Institute for Computer Science, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Chang, HK., Jou, IC. (2009). Distance Dimension Reduction on QR Factorization for Efficient Clustering Semantic XML Document Using the QR Fuzzy C-Mean (QR-FCM). In: Mueller, P., Cao, JN., Wang, CL. (eds) Scalable Information Systems. INFOSCALE 2009. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 18. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10485-5_20


  • DOI: https://doi.org/10.1007/978-3-642-10485-5_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-10484-8

  • Online ISBN: 978-3-642-10485-5

  • eBook Packages: Computer Science, Computer Science (R0)
