Inferring Missing Categorical Information in Noisy and Sparse Web Markup

Authors:
Nicolas Tempelmeier, Leibniz Universität, Hannover, Germany
Elena Demidova, Leibniz Universität, Hannover, Germany
Stefan Dietze, Leibniz Universität, Hannover, Germany

WWW '18: Proceedings of the 2018 World Wide Web Conference, April 2018, Pages 1297–1306. https://doi.org/10.1145/3178876.3186028

Published: 10 April 2018

ABSTRACT

Embedded markup of Web pages has seen widespread adoption in recent years, driven by standards such as RDFa and Microdata and initiatives such as schema.org; recent studies show adoption by 39% of all Web pages as early as 2016. While this markup constitutes an important information source for tasks such as Web search, Web page classification, and knowledge graph augmentation, individual markup nodes are usually sparsely described and often lack essential information. For instance, of the 26 million nodes describing events within the Common Crawl in 2016, 59% provide fewer than six statements and only 257,000 nodes (0.96%) are typed with more specific event subtypes. Nevertheless, given the scale and diversity of Web markup data, nodes that do provide the missing information can be obtained from the Web in large quantities, in particular for categorical properties. Such nodes constitute potential training data for inferring missing information and thereby significantly augmenting sparsely described nodes. In this work, we introduce a supervised approach for inferring missing categorical properties in Web markup. Our experiments, conducted on properties of events and movies, show F1 scores of 79% and 83%, respectively, significantly outperforming existing baselines.
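
To make the idea of supervised inference of a categorical property concrete, the following is a minimal sketch and not the approach evaluated in the paper. It assumes that nodes carrying a specific event subtype serve as training data, that features are simple bag-of-words tokens drawn from a node's statements, and that a small multinomial Naive Bayes classifier stands in for the model. The toy nodes and the helper names `tokens` and `NaiveBayes` are hypothetical; `MusicEvent` and `SportsEvent` are existing schema.org event subtypes.

```python
import math
from collections import Counter, defaultdict


def tokens(node):
    """Turn a node's {property: value} statements into bag-of-words features."""
    feats = []
    for prop, value in node.items():
        for word in str(value).lower().split():
            feats.append(f"{prop}={word}")
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, nodes, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for node, label in zip(nodes, labels):
            for feat in tokens(node):
                self.feat_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, node):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each feature.
            score = math.log(count / total)
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for feat in tokens(node):
                score += math.log((self.feat_counts[label][feat] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Hypothetical training nodes: markup nodes that already state a subtype.
train = [
    ({"name": "Jazz night live concert", "location": "Blue Note club"}, "MusicEvent"),
    ({"name": "Rock festival concert", "location": "City arena"}, "MusicEvent"),
    ({"name": "Championship football match", "location": "City stadium"}, "SportsEvent"),
    ({"name": "Marathon race", "location": "Riverside park"}, "SportsEvent"),
]
model = NaiveBayes().fit([n for n, _ in train], [y for _, y in train])

# A sparsely described node with no subtype; the classifier fills it in.
sparse_node = {"name": "Summer concert"}
print(model.predict(sparse_node))
```

In this sketch the sparsely described node has a single statement, yet the tokens it shares with typed nodes are enough to suggest a subtype. Real Web markup is far noisier, so feature design and model choice would matter much more than they do here.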


Published in
WWW '18: Proceedings of the 2018 World Wide Web Conference
April 2018
2000 pages
ISBN: 9781450356398
General Chairs:
Pierre-Antoine Champin, Université Claude Bernard Lyon 1, France
Fabien Gandon, Inria, Université Côte d'Azur, CNRS, I3S, France
Lionel Médini, Université Claude Bernard Lyon 1, France
Program Chairs:
Mounia Lalmas, Spotify, UK
Panagiotis G. Ipeirotis, New York University, USA
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
International World Wide Web Conferences Steering Committee
Republic and Canton of Geneva, Switzerland
Author Tags
information inferring
supervised learning
web markup
Qualifiers
- research-article
Acceptance Rates
WWW '18 Paper Acceptance Rate: 170 of 1,155 submissions, 15%. Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%.