From tables to frames
Introduction
Turning the current Web into a Semantic Web requires automatic approaches for annotating existing data, since manual annotation approaches such as those presented in [9] will not scale in general. More scalable (semi-)automatic approaches known from ontology learning (cf. [16]) deal with the extraction of ontologies from natural language texts. However, a large amount of data is stored in tables, which require additional effort.
Here we present an approach for the automatic generation of F-Logic frames [14] from tables, which in turn supports the automatic population of ontologies from table-like structures. Even successful Web search engines currently do not make the content of tables searchable to users. Applying our approach allows, for example, querying over a heterogeneous set of table-like structures.
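To make the target representation more concrete, the following minimal sketch renders a small table as an F-Logic-style signature frame. The sample table, the helper names `guess_range` and `table_to_frame`, the crude type-guessing rule and the exact surface syntax are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List


def guess_range(values: List[str]) -> str:
    """Crude, hypothetical type guess for the values of one column."""
    def is_number(text: str) -> bool:
        try:
            float(text)
            return True
        except ValueError:
            return False

    return "NUMBER" if values and all(is_number(v) for v in values) else "STRING"


def table_to_frame(name: str, header: List[str], rows: List[List[str]]) -> str:
    """Render a header-plus-rows table as an F-Logic-style signature frame."""
    # Collect the values of each column under its header label.
    columns: Dict[str, List[str]] = {
        attr: [row[i] for row in rows] for i, attr in enumerate(header)
    }
    slots = "; ".join(f"{attr} => {guess_range(vals)}" for attr, vals in columns.items())
    return f"{name}[{slots}]"


frame = table_to_frame(
    "Tour",
    ["Code", "Price"],
    [["DP9LAX01AB", "1150"], ["DP9LAX02AB", "1250"]],
)
print(frame)  # Tour[Code => STRING; Price => NUMBER]
```

A real system would have to handle nested headers, spanning cells and richer value types. The sketch only shows the shape of the output that makes heterogeneous tables queryable in a uniform way.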
Our approach consists of a methodology, an accompanying implementation and a thorough evaluation. It is based on a grounded cognitive table model, which our methodology instantiates step by step. In practice it is hard to cover every existing type of table. We therefore identified a small number of the most relevant table types, which were used in the experimental setting during the evaluation of our approach.
In this paper we use HTML tables as examples. We would like to point out that the methodology is in general independent of the incoming document type (e.g. text, PDF, Excel) and can be applied to any table-equivalent structure. The implementation is almost as generic as the methodology. To apply it to formats other than HTML, one would only need to adapt the implementation of the first methodological step (cf. Fig. 2).
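The design choice described here, a format-specific first step followed by format-independent processing, can be sketched as follows. The intermediate `Grid` type, the function names and the comma-separated input format are assumptions made for illustration only. `later_steps` is a trivial stand-in for the remaining steps of the methodology, not a description of them.

```python
from typing import Callable, Dict, List

Grid = List[List[str]]  # assumed intermediate form: a rectangular matrix of cell strings


def normalize_csv(raw: str) -> Grid:
    """Hypothetical format-specific first step for comma-separated text."""
    rows = [[cell.strip() for cell in line.split(",")] for line in raw.strip().splitlines()]
    width = max((len(row) for row in rows), default=0)
    # Pad ragged rows with empty cells so that the grid is rectangular.
    return [row + [""] * (width - len(row)) for row in rows]


def later_steps(grid: Grid) -> Dict[str, List[str]]:
    """Stand-in for the format-independent steps: first row as attributes, rest as values."""
    if not grid:
        return {}
    header, body = grid[0], grid[1:]
    return {attr: [row[i] for row in body] for i, attr in enumerate(header)}


def run_pipeline(raw: str, first_step: Callable[[str], Grid]) -> Dict[str, List[str]]:
    """Only `first_step` depends on the input format."""
    return later_steps(first_step(raw))


result = run_pipeline("Code, Price\nA1, 1150\nB2", normalize_csv)
print(result)  # {'Code': ['A1', 'B2'], 'Price': ['1150', '']}
```

Supporting another input format would then mean writing one more function with the signature of `normalize_csv` and passing it to `run_pipeline`.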
The paper is structured as follows. In Section 2 we first introduce the grounding table model, which forms the basis of our stepwise approach to generating frames from tables. Subsequently we explain each step in detail and show the relevant substeps. In Section 3 we present a thorough evaluation of the accompanying implementation. Before concluding and giving future directions, we present related work.
Methodological approach
Linguistic models traditionally describe natural language in terms of syntax, semantics and pragmatics. There also exist models to describe tables in similar ways (cf. [11], [12]) where tables are analyzed along the following dimensions: (i) graphical—the image level description of the pixels, lines and text or other content areas, (ii) physical—the description of inter-cell relative location, (iii) structural—the organization of cells as an indicator of their navigational relationship, (iv)
Evaluation
In order to evaluate our approach, we compare the automatically generated frames with frames manually created by two different subjects in terms of Precision, Recall and F-Measure. In particular, we considered 21 different tables in our experiment and asked each of 14 subjects to manually create frames for three different tables, so that each table in our dataset was annotated by two different subjects with the appropriate frame. In what follows we first describe the dataset used in
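As a minimal sketch of the three measures named above, the following code compares one generated frame with one manually created frame. It assumes that a frame can be encoded as a set of (attribute, range) pairs and that slots are compared by exact match. Both are simplifying assumptions for illustration, and the sample frames are made up.

```python
from typing import Set, Tuple

Slot = Tuple[str, str]  # (attribute name, range): an assumed encoding of a frame


def precision_recall_f1(generated: Set[Slot], reference: Set[Slot]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F-measure over frame slots."""
    correct = len(generated & reference)
    precision = correct / len(generated) if generated else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    # F-measure is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical frames: two of three generated slots match the four reference slots.
generated = {("Code", "STRING"), ("Price", "NUMBER"), ("Date", "STRING")}
reference = {("Code", "STRING"), ("Price", "NUMBER"), ("Date", "DATE"), ("City", "STRING")}
p, r, f = precision_recall_f1(generated, reference)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # P=0.667 R=0.500 F=0.571
```

With two annotators per table, one option would be to compute these scores against each manual frame separately and average them. Whether this matches the aggregation actually used in the evaluation is not stated in this excerpt.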
Related work
A very recent systematic overview of related work on table recognition, transformation, and inferences can be found in [32]. Several conclusions can be drawn from this survey. Firstly, only a few table models have been described explicitly. Apart from the table model of Hurst, which we applied in our approach [11], [12], the most prominent other model is Wang’s [25]. However, the model of Hurst is better suited for our purpose since it is targeted towards table recognition and transformation
Conclusion
We have presented an approach which stepwise instantiates a formal table model consisting of Physical, Structural, Functional and Semantic components. The core steps of the methodology are (i) Cleaning and Normalization, (ii) Structure Detection, (iii) Building of the Functional Table Model (FTM) and (iv) Semantic Enriching of the FTM. We have further demonstrated and evaluated the successful automatic generation of frames from HTML tables. Additionally, our experimental results show that from
Acknowledgments
This work has been supported by the IST projects Dot.Kom (Designing adaptive infOrmation exTraction from text for KnOwledge Management, IST-2001-34038) and SEKT (Semantically Enabled Knowledge Technologies, IST-2004-506826), sponsored by the EC as part of the frameworks V and VI, respectively. During his stay at the AIFB, Aleksander Pivk has been supported by a Marie Curie Fellowship of the European Community program ‘Host Training Sites’ and by the Slovenian Ministry of Education, Science and
References (32)
- et al. Table structure understanding and its performance evaluation. Pattern Recogn. (2004)
- et al. Mining tables from large scale HTML texts
- A relational model for large shared databanks. Commun. ACM (1970)
- et al. A comparison of string distance metrics for name-matching tasks
- et al. A flexible learning system for wrapping tables and lists in html documents
- et al. Ontobroker: ontology based access to distributed and semi-structured information
- et al. Layout and language: list and tables in technical documents
- et al. Automatically extracting ontologically specified data from html tables with unknown structure
- WordNet, An Electronic Lexical Database (1998)
- Evaluating the performance of table processing algorithms. Int. J. Document Anal. Recogn.
- Layout and language: beyond simple text for information interaction - modelling the table
- Layout and language: challenges for table understanding on the web
- Logical foundations of object-oriented and frame-based languages. J. ACM
- Wrapper maintenance: a machine learning approach. J. Artif. Intell. Res.