A Game-Theory-Based Algorithm for Extracting Theme Blocks from Web Pages

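With the rapid development of information and network technology, the Internet has become the largest body of data in existence. Owing to technological progress and the explosive growth in the number of users, tens of thousands of web pages are created every day, and the Internet has become users' largest source of information. In this age of information explosion, finding material on a specific topic within such a huge volume of data has become an important research problem. This thesis therefore proposes a Game-theoRy-based Algorithm for extracting theme-Block from a web page (GRAB), which automatically identifies the theme blocks a user is interested in and converts them into structured data that is easy to store, retrieve, and analyze, making them convenient to use on different platforms (e.g., mobile phones and PDAs). A prototype system was built for the proposed GRAB algorithm, and two experiments were designed to test and verify the algorithm's performance on real web page data. The results of the first experiment show that GRAB achieves 70% to 90% performance on HTML pages covering 10 topics, and works particularly well on news pages. The second experiment takes the 3 best-performing topics from the first experiment as its data set and compares GRAB with three existing web block extraction methods. The results show that GRAB's overall performance is better than that of the three existing methods: on structured web pages its performance is about the same as theirs, although it still outperforms them on news pages; on unstructured web pages it achieves 87% to 95% performance, better than all three existing web block extraction methods.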

Keywords

web page extraction; data extraction; theme block; game theory

Parallel Abstract

With the rapid development of information and network technology, the Internet has become the largest body of information. Owing to technological progress and the explosion in the number of users, tens of thousands of pages are produced every day, and the Internet has become the largest source of information. In this age of information explosion, searching for data on a specific topic within such a huge volume of data has become an important research topic. For this reason, this paper presents a Game-theoRy-based Algorithm for extracting theme-Block from a web page (GRAB). It automatically identifies the topic blocks a user is interested in and converts them into structured data that is easy to store, retrieve, and analyze, so that they can be used conveniently on different platforms (e.g., mobile phones and PDAs). A prototype system based on the GRAB algorithm is provided, and two experiments were designed to verify the effectiveness of the algorithm on actual web page data. The results of the first experiment show that GRAB achieves an accuracy rate of 70% to 90% on HTML web pages covering ten topics, and performs especially well on news web pages. The second experiment uses the three best-performing topics from the first experiment as its data sets and compares GRAB with three existing web block extraction methods. The results show that GRAB's overall performance is better than that of the three existing methods. On structured web pages its performance is similar to that of the existing methods, although it still outperforms them on news web pages. On unstructured web pages it achieves an accuracy rate of 87% to 95%, better than all three existing web block extraction methods.

Parallel Keywords

web extraction; information retrieval; theme-based block; game theory

