DOI: 10.1145/3178876.3186028
research-article

Inferring Missing Categorical Information in Noisy and Sparse Web Markup

Published: 10 April 2018

ABSTRACT

Embedded markup of Web pages has seen widespread adoption in recent years, driven by standards such as RDFa and Microdata and initiatives such as schema.org; recent studies show adoption by 39% of all Web pages as of 2016. While this constitutes an important information source for tasks such as Web search, Web page classification, or knowledge graph augmentation, individual markup nodes are usually sparsely described and often lack essential information. For instance, of the 26 million nodes describing events within the Common Crawl in 2016, 59% provide fewer than six statements and only 257,000 (0.96%) are typed with more specific event subtypes. Nevertheless, given the scale and diversity of Web markup data, nodes that do provide the missing information can be obtained from the Web in large quantities, in particular for categorical properties. Such data constitutes potential training data for inferring missing information and thereby significantly augmenting sparsely described nodes. In this work, we introduce a supervised approach for inferring missing categorical properties in Web markup. Our experiments, conducted on properties of events and movies, show F1 scores of 79% and 83%, respectively, significantly outperforming existing baselines.
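
To make the core idea concrete, the following is a minimal sketch of supervised inference for a missing categorical property: nodes that already carry the property serve as training data, and the fitted model predicts a value for nodes that lack it. It is not the approach evaluated in the paper, whose features and classifier the abstract does not specify. The naive Bayes model, the toy nodes, and the property names `name` and `type` are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def tokens(node, target):
    """Lower-cased word tokens from every property value except the target."""
    out = []
    for prop, value in node.items():
        if prop != target:
            out.extend(str(value).lower().split())
    return out


def train(nodes, target):
    """Fit a multinomial naive Bayes model on nodes that already carry `target`."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for node in nodes:
        label = node.get(target)
        if label is None:
            continue  # nodes lacking the property cannot serve as training data
        class_counts[label] += 1
        for tok in tokens(node, target):
            token_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, token_counts, vocab


def predict(model, node, target):
    """Return the most probable value of `target` for a node that lacks it."""
    class_counts, token_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens(node, target):
            if tok in vocab:  # ignore tokens never seen in training
                # Laplace-smoothed token likelihood
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best


# Hypothetical toy nodes; real markup nodes are far noisier and sparser.
nodes = [
    {"name": "Jazz night live concert", "type": "MusicEvent"},
    {"name": "Rock concert in the park", "type": "MusicEvent"},
    {"name": "City marathon race", "type": "SportsEvent"},
    {"name": "Football cup final match", "type": "SportsEvent"},
]
model = train(nodes, "type")
untyped = {"name": "Open air concert"}
predicted = predict(model, untyped, "type")
```

On this toy data the untyped node shares the token "concert" with the music examples, so the sketch assigns it the music subtype. A realistic setup would need to handle noisy values, many more classes, and richer features than word tokens.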

                    Published in

                    WWW '18: Proceedings of the 2018 World Wide Web Conference
                    April 2018
                    2000 pages
                    ISBN: 9781450356398

                    Copyright © 2018 ACM

                    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                    Publisher

                    International World Wide Web Conferences Steering Committee

                    Republic and Canton of Geneva, Switzerland

                    Acceptance Rates

                    WWW '18 Paper Acceptance Rate: 170 of 1,155 submissions, 15%
                    Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
