Automatic Location and Separation of Records: A Case Study in the Genealogical Domain

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3289)

Abstract

Locating specific chunks (records) of information within documents on the web is an interesting and nontrivial problem. If the problem of locating and separating records can be solved well, the longstanding problem of grouping extracted values into appropriate relationships in a record structure can be resolved more easily. Our solution is a hybrid of two well-established techniques: (1) ontology-based extraction [ECJ+99] and (2) vector space modeling [SM83]. To show that the technique has merit, we apply it to the particularly challenging task of locating and separating records in genealogical web documents, which tend to vary considerably in layout and format. Our experiments show that this technique yields an average of 92% recall and 93% precision for locating and separating genealogical records in web documents.
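
The abstract does not spell out how the two techniques are combined, so the following is only a minimal sketch of the general vector space idea, not the authors' method. It assumes that an extraction ontology can be approximated by a few keyword patterns, that each candidate chunk of text is turned into a vector of match counts, and that the chunk is compared with an expected per-record vector by cosine similarity. The patterns, the expected counts, and the names `ONTOLOGY_PATTERNS`, `ONTOLOGY_VECTOR`, `field_vector`, `cosine`, and `score` are all hypothetical.

```python
import math
import re

# Hypothetical keyword patterns standing in for an extraction ontology's
# fields; these are illustrative only and are not taken from the paper.
ONTOLOGY_PATTERNS = {
    "name": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),
    "birth": re.compile(r"\bborn\b", re.IGNORECASE),
    "death": re.compile(r"\bdied\b", re.IGNORECASE),
    "year": re.compile(r"\b1[6-9]\d{2}\b"),
}

# Assumed number of times each field is expected to occur in one record.
ONTOLOGY_VECTOR = {"name": 1.0, "birth": 1.0, "death": 1.0, "year": 2.0}


def field_vector(text):
    """Count the matches of each ontology pattern in a chunk of text."""
    return {
        field: float(len(pattern.findall(text)))
        for field, pattern in ONTOLOGY_PATTERNS.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a chunk with no matches cannot resemble a record
    return dot / (norm_u * norm_v)


def score(text):
    """Similarity between a candidate chunk and the expected record vector."""
    return cosine(field_vector(text), ONTOLOGY_VECTOR)


if __name__ == "__main__":
    for chunk in (
        "John Smith was born in 1820 and died in 1890.",
        "Home | About | Contact",
    ):
        print(f"{score(chunk):.2f}  {chunk}")
```

Under these assumptions, a chunk whose field counts are proportional to the expected vector scores near 1, while navigation text with no matches scores 0. Cosine similarity ignores vector length, so a practical record separator would also need a magnitude check to tell one record from several concatenated ones.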

References

  1. Buttler, D., Liu, L., Pu, C.: A fully automated object extraction system for the World Wide Web. In: Proceedings of the 21st International Conference on Distributed Computing Systems (ICDCS 2001), Mesa, Arizona (April 2001)

  2. Embley, D.W., Campbell, D.M., Jiang, Y.S., Liddle, S.W., Lonsdale, D.W., Ng, Y.-K., Smith, R.D.: Conceptual-model-based data extraction from multiple-record web pages. Data & Knowledge Engineering 31(3), 227–251 (1999)

  3. Embley, D.W., Jiang, Y.S., Ng, Y.-K.: Record-boundary discovery in web documents. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD 1999), Philadelphia, Pennsylvania, 31 May - 3 June, pp. 467–478 (1999)

  4. Embley, D.W., Kurtz, B.D., Woodfield, S.N.: Object-oriented Systems Analysis: A Model-Driven Approach. Prentice Hall, Englewood Cliffs (1992)

  5. Embley, D.W.: Programming with data frames for everyday data items. In: Proceedings of the 1980 National Computer Conference, Anaheim, California, May 1980, pp. 301–305 (1980)

  6. Embley, D.W., Xu, L.: Record location and reconfiguration in unstructured multiple-record web documents. In: Proceedings of the Third International Workshop on the Web and Databases (WebDB 2000), Dallas, Texas, May 2000, pp. 123–128 (2000)

  7. Kuhlins, S., Tredwell, R.: Toolkits for generating wrappers – a survey of software toolkits for automated data extraction from websites. In: Aksit, M., Mezini, M., Unland, R. (eds.) Objects, Components, Architectures, Services, and Applications for a Networked World – Proceedings of the 2002 International NetObjectDays Conference, Erfurt, Germany, October 2002, pp. 184–198 (2002)

  8. Laender, A.H.F., Ribeiro-Neto, B.A., da Silva, A.S., Teixeira, J.S.: A brief survey of web data extraction tools. SIGMOD Record 31(2), 84–93 (2002)

  9. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Walker, T., Embley, D.W. (2004). Automatic Location and Separation of Records: A Case Study in the Genealogical Domain. In: Wang, S., et al. Conceptual Modeling for Advanced Application Domains. ER 2004. Lecture Notes in Computer Science, vol 3289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30466-1_28

  • DOI: https://doi.org/10.1007/978-3-540-30466-1_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23722-8

  • Online ISBN: 978-3-540-30466-1

  • eBook Packages: Springer Book Archive
