Extracting Structural Features Among Words from Document Data Streams

Ishida, Kumiko; Uchida, Tomoyuki; Kawamoto, Kayo

doi:10.1007/11941439_37

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4304)

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence

Abstract

We consider the online data mining problem of continuously extracting all structural features among words from an infinite sequence of tree-structured documents. First, to represent structural features among words appearing in tree-structured documents, we introduce a consecutive path pattern (CPP, for short) on a list of words. A CPP is a sequence of consecutive paths from leaves to leaves. We then give a matching function over CPPs that takes into account the recent frequency of a CPP, the recency of a CPP, and the viewing time of the tree-structured document in which the CPP appears. Second, we present an online algorithm, based on a sliding window strategy, that continuously extracts all maximal CPPs as characteristic structural features from an infinite sequence of tree-structured documents. Finally, we report experimental results showing that the algorithm performs well.
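
As a rough illustration of the sliding window strategy mentioned in the abstract, the following Python sketch keeps per-pattern counts over the most recent documents of a stream. It is not the authors' algorithm: it assumes each document has already been reduced to a set of hashable pattern labels, it ignores recency weighting, viewing time, and maximality, and the names `SlidingWindowCounter`, `window_size`, and `min_count` are hypothetical.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Count how many of the last `window_size` documents contain each pattern.

    Hypothetical helper for illustration only; patterns are opaque hashable labels.
    """

    def __init__(self, window_size, min_count):
        self.window_size = window_size
        self.min_count = min_count
        self.window = deque()    # pattern sets of the documents currently in the window
        self.counts = Counter()  # pattern -> number of window documents containing it

    def push(self, patterns):
        """Add one document and return the patterns frequent in the updated window."""
        doc = frozenset(patterns)
        self.window.append(doc)
        self.counts.update(doc)
        if len(self.window) > self.window_size:
            # The oldest document slides out; undo its contribution to the counts.
            expired = self.window.popleft()
            for p in expired:
                self.counts[p] -= 1
                if self.counts[p] == 0:
                    del self.counts[p]
        return {p for p, c in self.counts.items() if c >= self.min_count}


# Toy stream: each document is represented by the set of patterns it contains.
stream = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}]
counter = SlidingWindowCounter(window_size=3, min_count=2)
results = [counter.push(doc) for doc in stream]
```

Each call to `push` does work proportional to the size of the arriving and the expiring document, so the cost per document does not grow with the length of the stream. That property is what makes a window-based approach suitable for an unbounded input.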

Author information

Authors and Affiliations

Depart. of Computer and Media Tech., Hiroshima City University, Japan
Kumiko Ishida
Faculty of Information Sciences, Hiroshima City University, Japan
Tomoyuki Uchida & Kayo Kawamoto

Editor information

Editors and Affiliations

DisPRR, National ICT Australia Ltd, QLD, Australia
Abdul Sattar
School of Computing, University of Tasmania, Sandy Bay, 7005, Tasmania, Australia
Byeong-ho Kang

About this paper

Cite this paper

Ishida, K., Uchida, T., Kawamoto, K. (2006). Extracting Structural Features Among Words from Document Data Streams. In: Sattar, A., Kang, Bh. (eds) AI 2006: Advances in Artificial Intelligence. AI 2006. Lecture Notes in Computer Science, vol 4304. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11941439_37

DOI: https://doi.org/10.1007/11941439_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49787-5
Online ISBN: 978-3-540-49788-2
eBook Packages: Computer Science, Computer Science (R0)
