Abstract
Locating specific chunks (records) of information within documents on the web is an interesting and nontrivial problem. If the problem of locating and separating records can be solved well, the longstanding problem of grouping extracted values into appropriate relationships in a record structure can be more easily resolved. Our solution is a hybrid of two well-established techniques: (1) ontology-based extraction [ECJ+99] and (2) vector space modeling [SM83]. To show that the technique has merit, we apply it to the particularly challenging task of locating and separating records in genealogical web documents, which tend to vary considerably in layout and format. In our experiments, the technique yields an average of 92% recall and 93% precision for locating and separating genealogical records in web documents.
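As a rough illustration of the vector space idea named in the abstract, the sketch below scores candidate chunks of a page by the cosine similarity between their field counts and the counts expected for one record. It is a minimal sketch under stated assumptions, not the paper's algorithm: the field list, the expected vector, the candidate counts, and the function name `cosine_similarity` are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty chunk matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical field order: [name, birth date, death date, marriage date].
# Assumed counts an ontology might expect for a single record.
expected = [1.0, 1.0, 1.0, 1.0]

# Assumed counts of recognized values in two candidate chunks.
record_like_chunk = [1, 1, 1, 0]  # looks like one person's record
dates_only_chunk = [0, 3, 0, 0]   # a run of dates, probably not a record

record_score = cosine_similarity(expected, record_like_chunk)
dates_score = cosine_similarity(expected, dates_only_chunk)
```

Under these assumed counts the record-like chunk scores higher than the dates-only chunk, which is the kind of signal a record separator could use to choose among candidate boundaries.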
References
Buttler, D., Liu, L., Pu, C.: A fully automated object extraction system for the world wide web. In: Proceedings of the 21st International Conference on Distributed Computing Systems (ICDCS 2001), Mesa, Arizona (April 2001)
Embley, D.W., Campbell, D.M., Jiang, Y.S., Liddle, S.W., Lonsdale, D.W., Ng, Y.-K., Smith, R.D.: Conceptual-model-based data extraction from multiple-record web pages. Data & Knowledge Engineering 31(3), 227–251 (1999)
Embley, D.W., Jiang, Y.S., Ng, Y.-K.: Record-boundary discovery in web documents. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD 1999), Philadelphia, Pennsylvania, 31 May - 3 June, pp. 467–478 (1999)
Embley, D.W., Kurtz, B.D., Woodfield, S.N.: Object-oriented Systems Analysis: A Model-Driven Approach. Prentice Hall, Englewood Cliffs (1992)
Embley, D.W.: Programming with data frames for everyday data items. In: Proceedings of the 1980 National Computer Conference, Anaheim, California, May 1980, pp. 301–305 (1980)
Embley, D.W., Xu, L.: Record location and reconfiguration in unstructured multiple-record web documents. In: Proceedings of the Third International Workshop on the Web and Databases (WebDB 2000), Dallas, Texas, May 2000, pp. 123–128 (2000)
Kuhlins, S., Tredwell, R.: Toolkits for generating wrappers – a survey of software toolkits for automated data extraction from websites. In: Aksit, M., Mezini, M., Unland, R. (eds.) Objects, Components, Architectures, Services, and Applications for a Networked World – Proceedings of the 2002 International NetObjectDays Conference, Erfurt, Germany, October 2002, pp. 184–198 (2002)
Laender, A.H.F., Ribeiro-Neto, B.A., da Silva, A.S., Teixeira, J.S.: A brief survey of web data extraction tools. SIGMOD Record 31(2), 84–93 (2002)
Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Walker, T., Embley, D.W. (2004). Automatic Location and Separation of Records: A Case Study in the Genealogical Domain. In: Wang, S., et al. Conceptual Modeling for Advanced Application Domains. ER 2004. Lecture Notes in Computer Science, vol 3289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30466-1_28
DOI: https://doi.org/10.1007/978-3-540-30466-1_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23722-8
Online ISBN: 978-3-540-30466-1
eBook Packages: Springer Book Archive