Abstract
We consider the online data mining problem of continuously extracting all structural features among words from an infinite sequence of tree-structured documents. First, to represent structural features among words appearing in tree-structured documents, we introduce the consecutive path pattern (CPP) on a list of words. A CPP is a sequence of consecutive leaf-to-leaf paths. We then give a matching function over CPPs that takes into account the recent frequency of a CPP, the recency of a CPP, and the viewing time of the tree-structured document in which the CPP appears. Second, we present an online algorithm, based on a sliding window strategy, that continuously extracts all maximal CPPs as characteristic structural features from an infinite sequence of tree-structured documents. Finally, we report experimental results showing that the algorithm performs well.
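The sliding window strategy mentioned above can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not define the matching function or the CPP representation, so the sketch only counts, for each pattern, how many of the most recent documents contain it. The class name `SlidingWindowCounter`, its parameters, and the string stand-ins for patterns are all hypothetical.

```python
from collections import Counter, deque
from typing import Deque, FrozenSet, Hashable, Iterable, Set


class SlidingWindowCounter:
    """Hypothetical helper: per-pattern document counts over the last N documents."""

    def __init__(self, window_size: int) -> None:
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        self.window: Deque[FrozenSet[Hashable]] = deque()
        self.counts: Counter = Counter()

    def push(self, patterns: Iterable[Hashable]) -> None:
        """Add one document (given as its set of patterns) and expire the oldest if needed."""
        doc = frozenset(patterns)
        self.window.append(doc)
        self.counts.update(doc)
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            for pattern in expired:
                self.counts[pattern] -= 1
                if self.counts[pattern] == 0:
                    del self.counts[pattern]

    def frequent(self, min_count: int) -> Set[Hashable]:
        """Patterns contained in at least min_count documents of the current window."""
        return {p for p, c in self.counts.items() if c >= min_count}


# Assumption: each document is pre-processed into a set of hashable patterns;
# the strings below are placeholders, not real CPPs.
stream = [{"a-b", "b-c"}, {"a-b"}, {"b-c", "c-d"}, {"a-b", "c-d"}]
counter = SlidingWindowCounter(window_size=3)
for document in stream:
    counter.push(document)
print(sorted(counter.frequent(min_count=2)))  # ['a-b', 'c-d']
```

Keeping the window as a deque of pattern sets lets each arriving document update the counts incrementally, without rescanning older documents. A recency-weighted score, as the abstract describes, would replace the plain count with a function of each pattern's position in the window.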
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ishida, K., Uchida, T., Kawamoto, K. (2006). Extracting Structural Features Among Words from Document Data Streams. In: Sattar, A., Kang, Bh. (eds) AI 2006: Advances in Artificial Intelligence. AI 2006. Lecture Notes in Computer Science, vol 4304. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11941439_37
DOI: https://doi.org/10.1007/11941439_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49787-5
Online ISBN: 978-3-540-49788-2
eBook Packages: Computer Science (R0)