ABSTRACT
Embedded markup of Web pages has seen widespread adoption in recent years, driven by standards such as RDFa and Microdata and initiatives such as schema.org; recent studies show that 39% of all Web pages already contained markup in 2016. While this markup constitutes an important information source for tasks such as Web search, Web page classification, or knowledge graph augmentation, individual markup nodes are usually sparsely described and often lack essential information. For instance, of the 26 million nodes describing events within the Common Crawl in 2016, 59% provide fewer than six statements and only 257,000 nodes (0.96%) are typed with more specific event subtypes. Nevertheless, given the scale and diversity of Web markup data, nodes that do provide the missing information can be obtained from the Web in large quantities, in particular for categorical properties. Such data constitutes potential training data for inferring missing information and thereby significantly augmenting sparsely described nodes. In this work, we introduce a supervised approach for inferring missing categorical properties in Web markup. Our experiments, conducted on properties of events and movies, show F1 scores of 79% and 83%, respectively, significantly outperforming existing baselines.
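The general idea of using well-described nodes as training data for sparsely described ones can be illustrated with a small sketch. This is not the approach evaluated in this work, whose features and classifier are not described in the abstract. It is a plain multinomial Naive Bayes model over word tokens, and the node layout (a dictionary with a hypothetical "type" key holding the event subtype) and the sample nodes are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def tokens(node, target):
    """Bag of lowercase word tokens from every property except the target."""
    out = []
    for prop, value in node.items():
        if prop != target:
            out.extend(str(value).lower().split())
    return out


def train(nodes, target):
    """Count labels and per-label token frequencies over nodes that carry the target."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    for node in nodes:
        if target in node:
            label = node[target]
            label_counts[label] += 1
            token_counts[label].update(tokens(node, target))
    return label_counts, token_counts


def predict(model, node, target):
    """Return the most probable label under Naive Bayes with add-one smoothing."""
    label_counts, token_counts = model
    vocab = {t for counts in token_counts.values() for t in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(token_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for t in tokens(node, target):
            score += math.log((token_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical markup nodes: four carry a subtype, the last one lacks it.
nodes = [
    {"name": "Jazz night live concert", "type": "MusicEvent"},
    {"name": "Rock concert with live band", "type": "MusicEvent"},
    {"name": "City marathon race", "type": "SportsEvent"},
    {"name": "Football cup final match", "type": "SportsEvent"},
    {"name": "Open air concert"},
]

model = train(nodes, "type")
inferred = predict(model, nodes[-1], "type")
print(inferred)
```

Nodes that already state the categorical property act as labeled examples, and the node lacking it is classified from its remaining statements. A realistic setting would need far more training nodes and would have to cope with the noise typical of Web markup.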
Index Terms
- Inferring Missing Categorical Information in Noisy and Sparse Web Markup