Skip to main content
Log in

An efficient content extraction method for webpage based on tag-line-block analysis

  • Mathematical methods in data science
  • Published:
Soft Computing Aims and scope Submit manuscript

Abstract

World Wide Web is a vast information resource that can be used in a broad range of applications. Web content is an efficient way to derive valuable information from webpages, and many efforts have been made on this subject. However, due to the increasing complexity of webpage technology, the existing methods cannot match quite well the requirements for the content extraction of webpages. This paper proposed an improved content extraction method for webpage based on Cx-Extractor, which is capable of dealing with content extraction for different types of webpages. Several improvements have been made for the proposed method: (1) The hyperlink tags are not removed directly to avoid mistaking the dense hyperlink groups for the main content. (2) The starting point of the main content is taken as the line number of tag-line-block whose size exceeds the threshold and thus the first few short texts of the main content can be retained. (3) The threshold value of tag-line-block for the main content is calculated automatically instead of being set manually. The above can improve the accuracy of the extracted content. Moreover, (4) the blank spaces in the original text of webpage are retained, which can increase the readability of the extracted content by avoiding connecting English words into pieces. (5) The multimedia information (e.g., pictures and videos) can be selectively retained by users, allowing for maximum flexibility and usage in multiple industries. The experimental results conducted on real-world webpages show that the proposed content extraction method works well for both single-content and multi-content webpages. Furthermore, the performance of the proposed content extraction method was compared with the Chinese extraction method called Cx-Extractor and the English extraction method called Readability. It is found that the proposed method in this study outperforms these two methods in precision, recall, and readability. In addition, the extraction efficiency of the proposed method is superior to that of the Readability method.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

Data availability

Enquiries about data availability should be directed to the authors.

References

  • Baroni M, Chantree F, Kilgarriff A et al (2008) Cleaneval: a competition for cleaning web pages. In: Proceedings of the 6th international conference on language resources and evaluation, pp 638–643

  • Cai D, Yu S, Wen J R, et al (2003) Extracting content structure for web pages based on visual representation. In: Proceedings of the 5th Asia-pacific web conference on web technologies and applications, pp 406–417

  • Cardoso E, Jabour I, Laber E, et al (2011) An efficient language-independent method to extract content from news webpages. In: Proceedings of the 11th ACM symposium on document engineering, pp 121–128

  • Chen X (2011) Universal web content extraction based on row block distribution function. https://code.google.com/p/cx-extractor

  • Crescenzi V, Mecca G, Merialdo P (2001) Roadrunner: towards automatic data extraction from large web sites. In: Proceedings of the 27th international conference on very large data bases, vol. 1, pp 109–118

  • Ferrara E, De Meo P, Fiumara G et al (2014) Web data extraction, applications and techniques: a survey. Knowl-Based Syst 70:301–323

    Article  Google Scholar 

  • Gan L, Ye B, Huang Z et al (2023) Knowledge graph construction based on ship collision accident reports to improve maritime traffic safety. Ocean Coast Manag 240:106660

    Article  Google Scholar 

  • Gibson D, Punera K, Tomkins A (2005) The volume and evolution of web page templates. In: Special interest tracks and posters of the 14th international conference on World Wide Web, pp 830–839

  • Gottron T (2008) Combining content extraction heuristics: the CombinE system. In: Proceedings of the 10th international conference on information integration and web-based applications and services, pp 591–595

  • Gu Y, Gao Y, Gao B et al (2014) Research on deep web information extraction based on template and domain ontology. Comput Eng Des 35:327–332

    Google Scholar 

  • Gupta S, Kaiser G, Neistadt D et al (2003) DOM-based content extraction of html documents. In: Proceedings of the 12th international conference on World Wide Web, pp 207–214

  • Hammer J, McHugh J, Garcia-Molina H (1997) Semistructured data: the TSIMMIS experience. In: Proceedings of the 1th East-European symposium on advances in databases and information systems, vol. 1, pp 1–13

  • IDC, Statista (2022) Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2020, with forecasts from 2021 to 2025 (in zettabytes). https://www.statista.com/statistics/871513/worldwide-data-created/

  • Joe Dhanith PR, Surendiran B (2022) An ontology learning based approach for focused web crawling using combined normalized pointwise mutual information and Resnik algorithm. Int J Comput Appl 44(12):1123–1129

    Google Scholar 

  • Karthikeyan T, Sekaran K, Ranjith D et al (2019) Personalized content extraction and text classification using effective web scraping techniques. Int J Web Port 11(2):41–52

    Article  Google Scholar 

  • Laber ES, de Souza CP, Jabour IV et al (2009) A fast and simple method for extracting relevant content from news webpages. In: Proceedings of the 18th ACM conference on information and knowledge management, pp 1685–1688

  • Liang D, Yang Y, Wei Z (2018) Information extraction of web pages based on support vector machine. Comput Mod 9:21–26

    Google Scholar 

  • Liu L, Pu C, Han W (2000) XWRAP: an XML-enabled wrapper construction system for web information sources. In: Proceedings of the 16th international conference on data engineering, pp 611–621

  • Rahman A, Alam H, Hartono R (2001) Content extraction from html documents. In: Proceedings of the 1st international workshop on web document analysis, pp 1–4

  • Ramakrishna M, Gowdar L, Havanur MS et al (2010) Web mining: key accomplishments, applications and future directions. In: Proceedings of the 2010 international conference on data storage and data engineering, pp 187–191

  • Samuel MO, Tolulope AI, Oyejoke OO (2019) A systematic review of current trends in web content mining. In: Proceedings of the 3th international conference on science and sustainable development, vol. 1299, p 012040

  • Sandeep KS, Patil N (2018) A multidimensional approach to blog mining. progress in intelligent computing techniques: theory, practice, and applications. Adv Intell Syst Comput 719:51–58

    Article  Google Scholar 

  • Sestito S, Dillon T (1993) Knowledge acquisition of conjunctive rules using multilayered neural networks. Int J Intell Syst 8(7):779–805

    Article  Google Scholar 

  • Sun F, Song D, Liao L (2011) Dom based content extraction via text density. In: Proceedings of the 34th international ACM SIGIR conference on research and development in information retrieval, pp 245–254

  • Sun C, Guan Y (2004) A statistical approach for content extraction from web page. J Chin Inf Process 18(5):17–22

    Google Scholar 

  • Tan Z, He C, Fang Y et al (2018) Title-based extraction of news contents for text mining. IEEE Access 6:64085–64095

    Article  Google Scholar 

  • Waldherr A, Maier D, Miltner P et al (2017) Big data, big noise: the challenge of finding issue networks on the web. Soc Sci Comput Rev 35(4):427–443

    Article  Google Scholar 

  • Wang Q, Fang Y, Ravula A, et al (2022) Webformer: the web-page transformer for structure information extraction. In: Proceedings of the 2022 ACM web conference, pp 3124–3133

  • Weninger T, Hsu WH, Han J (2010) CETR: content extraction via tag ratios. In: Proceedings of the 19th international conference on World Wide Web, pp 971–980

  • Wu Y (2016) Language independent web news extraction system based on text detection framework. Inf Sci 342:132–149

    Article  MathSciNet  Google Scholar 

  • Yu M, Chen T, Xu H (2005) Research and design of HTML parser based on page segmentation. J Comput Appl 25(4):974–976

    Google Scholar 

  • Yunis H, Stein B, Kiesel J et al (2016) Content extraction from webpages using machine learning. Bauhaus-Universitaet Weimar

    Google Scholar 

  • Zhang H, Li L, Hu W et al (2019) Visualization of location-referenced web textual information based on map mashups. IEEE Access 7:40475–40487

    Article  Google Scholar 

  • Zhang Z, Yu B, Liu T, et al. (2023) Learning structural co-occurrences for structured web data extraction in low-resource settings. In: Proceedings of the 2023 ACM web conference, pp 1683–1692

Download references

Funding

This research was funded by National Key Research and Development Program of China, Grant Number 2021YFD1300101.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ruizhi Sun.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Chen, Z., Zhou, J. & Sun, R. An efficient content extraction method for webpage based on tag-line-block analysis. Soft Comput 27, 14631–14645 (2023). https://doi.org/10.1007/s00500-023-09076-x

Download citation

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00500-023-09076-x

Keywords

Navigation