Elsevier

Computers & Geosciences

Volume 96, November 2016, Pages 23-34

Review article
A survey on the geographic scope of textual documents

https://doi.org/10.1016/j.cageo.2016.07.017

Highlights

  • A structured and comprehensive view on contributions to the problem of determining the geographic scope of documents.

  • Review of relevant definitions and concepts.

  • A discussion of the main techniques currently used to address the Geographic Scope Resolution (GSR) problem and its subproblems.

  • A proposal for a future research agenda in the area.

Abstract

Recognizing references to places in texts is needed in many applications, such as search engines, location-based social media and document classification. In this paper we present a survey of methods and techniques for the recognition and identification of places referenced in texts. We discuss concepts and terminology, and propose a classification of the solutions given in the literature. We introduce a definition of the Geographic Scope Resolution (GSR) problem, dividing it into three steps: geoparsing, reference resolution, and grounding references. Solutions to the first two steps are organized according to the method used, and solutions to the third step are organized according to the type of output produced. We found that it is difficult to compare existing solutions directly to one another, because they often create their own benchmarking data, targeted to their own problem.

Introduction

The demand for geographic data in applications on the Web is increasing. One of the most important resources to support this increased interest is the ability to recognize references to places in Web documents. If documents can be correctly and efficiently linked to places mentioned directly or indirectly in them, it becomes possible to improve and innovate in directions such as geographic indexing and querying, finding relationships based on spatial proximity or containment, and detecting localized trends for events and phenomena mentioned in social media.

A large share of the information available on the Web is geographically specific (Delboni et al., 2007, Vaid et al., 2005, Vasardani et al., 2013). References to geographic locations appear in the form of place names, postal addresses, postcodes, historical dates, demonyms, ethnicity, typical food and others. Many queries include place names and other geographic terms (Delboni et al., 2007, Sanderson and Kohler, 2005, Silva et al., 2006). Therefore, there is demand for mechanisms to search for documents both thematically (for instance, using a set of keywords) and geographically, based on places mentioned or referenced by the text (Zong et al., 2005). Similar techniques and resources can also apply to streaming data, such as Twitter messages or RSS feeds, providing the opportunity to index content in near-real-time, based on references to places.

However, ambiguity and uncertainty arise when finding references to places in Web documents. Places can share a name with other places (Paris, besides being the capital of France, refers to more than sixty places around the world). Places are named using common language words (Park, Hope and Independence are American cities) and proper names (Washington, Houston and San Francisco). The first type of ambiguity, in which one place name refers to multiple places, is called Geo/Geo ambiguity, or referent ambiguity. The second type, called Geo/Non-Geo ambiguity (referent class ambiguity), occurs when a location and a non-location share the same name (Amitay et al., 2004). Clough et al. (2004) suggest a third type of ambiguity, named reference ambiguity, which occurs when a place is associated with many names, like New York, NYC or The Big Apple. Ambiguity makes the resolution of references to places intrinsically context-based. Although there is important work on place-based information integration and retrieval, areas such as disambiguation are still in their infancy (Vasardani et al., 2013).
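The three kinds of ambiguity described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy gazetteer, the place identifiers, the list of words with non-geographic senses, and the function name `ambiguity_types` are hypothetical and do not come from any of the surveyed systems.

```python
# Hypothetical toy gazetteer: (place name, containing region, place id).
# Illustrative entries only, not taken from a real gazetteer service.
GAZETTEER = [
    ("Paris", "France", "paris-fr"),
    ("Paris", "Texas, USA", "paris-tx"),
    ("Hope", "Arkansas, USA", "hope-ar"),
    ("New York", "New York, USA", "nyc"),
    ("NYC", "New York, USA", "nyc"),
    ("The Big Apple", "New York, USA", "nyc"),
]

# Assumed list of names that also have a non-geographic sense
# (common words or personal names).
NON_PLACE_SENSES = {"hope", "park", "independence", "washington"}


def ambiguity_types(name):
    """Return the set of ambiguity types that apply to a place name."""
    key = name.casefold()
    referents = {pid for n, _, pid in GAZETTEER if n.casefold() == key}
    types = set()

    # Geo/Geo (referent ambiguity): one name, several distinct places.
    if len(referents) > 1:
        types.add("geo/geo")

    # Geo/Non-Geo (referent class ambiguity): the name is a place
    # and also has a non-geographic meaning.
    if referents and key in NON_PLACE_SENSES:
        types.add("geo/non-geo")

    # Reference ambiguity: a referent is known by more than one name.
    for pid in referents:
        names = {n.casefold() for n, _, p in GAZETTEER if p == pid}
        if len(names) > 1:
            types.add("reference")

    return types
```

With this toy data, "Paris" is flagged as Geo/Geo, "Hope" as Geo/Non-Geo, and "NYC" as a case of reference ambiguity. A real system would need context from the surrounding text to decide which referent is meant, which is why resolution is described as context-based.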

An important resource to address disambiguation is the determination of the geographic scope of the document, i.e., the set of places referenced by and relevant to the contents of the document. “Every document has a geographical scope” (Andogah et al., 2012). Even keyword queries to search engines can have a geographic scope (Alexopoulos and Ruiz, 2012, Silva et al., 2006), since query words embed the user's intentions in the search.

References to places may be straightforward and unambiguous, as with geographic coordinates, but often they are not. Other sources of geographic location information can be structured (postal addresses) or unstructured (place descriptions in text). They can also be direct (place names) or indirect (references to cultural characteristics associated with places), explicit (news headers) or implicit (“9/11”). Humans are often able to recognize references to places based on such evidence, but this association does not come so easily to automated systems. Addressing this problem is one of the pressing tasks for Geographic Information Retrieval (GIR) research.

GIR extends Information Retrieval (Baeza-Yates and Ribeiro-Neto, 1999) with the use of geographic locations and metadata (Jones and Purves, 2009), taking it beyond the use of keywords. GIR studies methods and techniques for the retrieval of information from unstructured or partially structured sources, including relevance ranking, based on queries that specify both theme and geographic scope (Jones and Purves, 2008, Jones and Purves, 2009). One of the most important current research subjects in GIR involves recognizing references to places, including implicit references, in regular text and in other media, such as photos and videos (Luo et al., 2011). The recognition of references to places in media other than text documents is beyond the scope of this paper.

Many initiatives to tackle the GIR problem of recognizing references to places in text have arisen in the recent past, usually with varied or conflicting descriptions or terminologies, targeting various applications, or using a range of reference data. The main contributions of this survey are (1) a structured and comprehensive view on contributions to the problem of determining the geographic scope of documents, (2) a review of relevant definitions and concepts, (3) a discussion of the main techniques currently used to address the Geographic Scope Resolution (GSR) problem, and (4) a proposal for a future research agenda in the area.

This paper is organized as follows. Section 2 introduces the main application areas related to the determination of the geographic scope of documents. Section 3 discusses the terminological variations found in the literature and presents them in a structured way. Section 4 focuses on the GSR problem, presenting a set of methods and a discussion of existing techniques from the literature. Finally, Section 5 shows our conclusions and indicates future research directions for the field.

Section snippets

Main application areas

In this section, we present a high level description of what can be gained if efficient methods for determining the geographic scope of documents are available. The main application areas are divided into two groups: (1) contributions to IR, in the form of tools and techniques that incorporate geographic variables, and (2) contributions to Web data mining.

The geographic scope resolution problem

In this section, we review a set of concepts and the terminology used in the description of algorithms and techniques addressing the Geographic Scope Resolution (GSR) problem. Key and ground terms and their respective definitions are extracted from the relevant literature and adjusted to the proposed definition of the GSR problem, presented in more detail in Section 3.2.
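As a rough illustration of the three steps named in the abstract (geoparsing, reference resolution, and grounding references), the sketch below chains three small functions. It is not the definition given in Section 3.2, which is not reproduced here. The gazetteer contents, the approximate coordinates, the co-occurrence heuristic used for resolution, and all function names are assumptions made only for this example.

```python
import re

# Hypothetical mini-gazetteer: name -> candidate referents.
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "Paris": [
        {"id": "paris-fr", "country": "France", "lat": 48.86, "lon": 2.35},
        {"id": "paris-tx", "country": "United States", "lat": 33.66, "lon": -95.56},
    ],
    "France": [{"id": "fr", "country": "France", "lat": 46.0, "lon": 2.0}],
    "Texas": [{"id": "tx", "country": "United States", "lat": 31.0, "lon": -99.0}],
}


def geoparse(text):
    """Step 1 (assumed form): find candidate place names by gazetteer lookup."""
    mentions = []
    for name in GAZETTEER:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            mentions.append((m.start(), name))
    # Return names in order of appearance in the text.
    return [name for _, name in sorted(mentions)]


def resolve(mentions):
    """Step 2 (assumed heuristic): prefer the candidate whose country
    matches an unambiguous mention in the same text; otherwise fall back
    to the first candidate listed in the gazetteer."""
    context = {
        GAZETTEER[n][0]["country"] for n in mentions if len(GAZETTEER[n]) == 1
    }
    resolved = []
    for n in mentions:
        candidates = GAZETTEER[n]
        matching = [c for c in candidates if c["country"] in context]
        resolved.append(matching[0] if matching else candidates[0])
    return resolved


def ground(referents):
    """Step 3 (one possible output type): map each resolved place id
    to a coordinate pair."""
    return {r["id"]: (r["lat"], r["lon"]) for r in referents}


def geographic_scope(text):
    """Chain the three steps into a single call."""
    return ground(resolve(geoparse(text)))
```

Calling `geographic_scope("Paris is a city in Texas.")` selects the Texan Paris because "Texas" is unambiguous and supplies the country context, whereas a text that mentions only "Paris" falls back to the first gazetteer candidate. Coordinates are only one possible output. The abstract notes that solutions to the third step are organized by the type of output they produce.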

Proposed solutions to the geographic scope resolution problem

This section presents a number of proposals from the literature for the solution of the GSR problem. The contributions are divided into two groups. First, Section 4.1 presents proposals that cover the entire GSR problem, although sometimes using concepts and steps that are slightly different from the definitions presented in Section 3. Next, Section 4.2 covers contributions that are specific to the geoparsing step. Section 4.3 presents proposals connected to the reference resolution and Section

Conclusions and future work

The goal of this survey was to provide a comprehensive and structured view on contributions to the geographic scope resolution problem. We presented a definition of the GSR problem, dividing the problem into three steps: geoparsing, reference resolution and grounding references. Solutions to the first two steps were organized according to the method used, and proposals to the third step were organized according to the type of output produced. We would like to emphasize that the classes in each

Acknowledgments

This work was partially supported by FAPEMIG (grant CEX-PPM-00679/15) and CNPq (grants 303532/2015-7, 459818/2014-2 and 401822/2013-3), Brazilian agencies in charge of fostering research and development.

References (102)

  • G. Andogah et al., 2012. Every document has a geographical scope. Data Knowl. Eng.
  • M.J. Silva et al., 2006. Adding geographic scopes to web resources. Comput. Environ. Urban Syst.
  • Adelfio, M.D., Samet, H., 2013. Geowhiz: toponym resolution using common categories. In: Proceedings of the 21st ACM...
  • Ahlers, D., 2013. Assessment of the accuracy of Geonames gazetteer data, In: Proceedings of the 7th Workshop on...
  • Alencar, R.O., Davis Jr., C.A., 2011. Advancing Geoinformation Science for a Changing World, vol. 1, Springer Berlin...
  • Alencar, R.O., Davis Jr., C.A., Gonçalves, M.A., 2010. Geographical classification of documents using evidence from...
  • Alexopoulos, P., Ruiz, C., 2012. Optimizing geographical entity and scope resolution in texts using non-geographical...
  • P. Alexopoulos et al., 2013. KLocator: an ontology-based framework for scenario-driven geographical scope resolution. Int. J. Adv. Intell. Syst.
  • Amitay, E., Har'El, N., Sivan, R., Soffer, A., 2004. Web-a-where: Geotagging web content, In: Proceedings of the 27th...
  • Anastácio, I., Martins, B., Calado, P., 2009. Progress in Artificial Intelligence: 14th Portuguese Conference on...
  • Anastácio, I., Martins, B., Calado, P., 2009. A comparison of different approaches for assigning geographic scopes to...
  • Backstrom, L., Sun, E., Marlow, C., 2010. Find me if you can: improving geographical prediction with social and spatial...
  • R.A. Baeza-Yates et al., 1999. Modern Information Retrieval.
  • Borges, K.A.V., Laender, A.H.F., Medeiros, C.B., Davis Jr., C.A., 2007. Discovering geographic locations in Web pages...
  • K.A.V. Borges et al., 2011. Ontology-driven discovery of geospatial evidence in web pages. GeoInformatica.
  • D. Buscaldi et al., 2008. A conceptual density-based approach for the disambiguation of toponyms. Int. J. Geogr. Inf. Sci.
  • Buscaldi, D., Rosso, P., 2008. Map-based vs. knowledge-based toponym disambiguation. In: Proceedings of the 2nd...
  • Buyukkokten, O., Cho, J., Garcia-Molina, H., Gravano, L., Shivakumar, N., 1999. Exploiting geographical location...
  • Campelo, C.E.C., Baptista, C.S., 2008. Geographic scope modeling for Web documents. In: Proceedings of the 2nd...
  • Cardoso, N., Silva, M.J., Santos, D., 2008. Handling implicit geographic evidence for geographic IR. In: Proceedings of...
  • N. Cardoso, 2011. Evaluating geographic information retrieval. SIGSPATIAL Spec.
  • R. Chasin et al., 2013. Extracting and displaying temporal and geospatial entities from articles on historical events. Comput. J.
  • Chen, M., Lin, X., Zhang, Y., Wang, X., Yu, H., 2010. Assigning geographical focus to documents. In: 18th Conference on...
  • Clough, P., Sanderson, M., Joho, H., 2004. Extraction of Semantic Annotations from Textual Web Pages, Technical Report...
  • Clough, P., 2005. Extracting metadata for spatially-aware information retrieval on the internet. In: Proceedings of the...
  • Curran, J.R., Clark, S., 2003. Language independent NER using a maximum entropy tagger. In: Proceedings of the Seventh...
  • C.A. Davis et al., 2007. Assessing the certainty of locations produced by an address geocoding system. GeoInformatica.
  • C.A. Davis et al., 2011. Inferring the location of Twitter messages based on user relationships. Trans. GIS.
  • T.M. Delboni et al., 2007. Semantic expansion of geographic Web queries based on natural language positioning expressions. Trans. GIS.
  • DeLozier, G., Baldridge, J., London, L., 2015. Gazetteer-independent toponym resolution using geographic word profiles....
  • Ding, J. Gravano, L., Shivakumar, N., 2000. Computing geographical scopes of Web resources. In: Proceedings of the 26th...
  • Drymonas, E., Pfoser, D., 2010. Geospatial route extraction from texts. In: Proceedings of the 1st ACM SIGSPATIAL...
  • Fu, G., Jones, C.B., Abdelmoty, A.I., 2005. Building a geographical ontology for intelligent spatial search on the Web....
  • Y. Fujiwara et al., 2009. Fast likelihood search for hidden Markov models. ACM Trans. Knowl. Discov. Data (TKDD).
  • Garbin, E., Mani, I., 2005. Disambiguating toponyms in news. In: Proceedings of the Conference on Human Language...
  • D.W. Goldberg et al., 2007. From text to geographic coordinates: the current state of geocoding. URISA J. (J. Urban Reg. Inf. Assoc.).
  • Gouvêa, C., Loh, S., Garcia, L.F.F., Fonseca, E.B., Wendt, I., 2008. Discovering location indicators of toponyms from...
  • Gravano, L., Hatzivassiloglou, V., Lichtenstein, R., 2003. Categorizing Web queries according to geographical locality....
  • Habib, M.B., van Keulen, M., 2011. Named entity extraction and disambiguation: The reinforcement effect. In:...
  • Habib, M.B., van Keulen, M., 2012. Web Engineering. In: Proceedings of 12th International Conference, ICWE 2012,...
  • Habib, M.B., van Keulen, M., 2013. A hybrid approach for robust multilingual toponym extraction and disambiguation. In:...
  • T.C. Hart et al., 2013. Reference data and geocoding quality: examining completeness and positional accuracy of street geocoded crime incidents. Policing: Int. J. Police Strateg. Manag.
  • M.A. Hearst, 1998. Support vector machines. IEEE Intell. Syst.
  • Hill, L.L., 2000. Research and Advanced Technology for Digital Libraries. In: Proceedings of 4th European Conference,...
  • L.L. Hill, 2006. Georeferencing: The Geographic Associations of Information.
  • C.B. Jones et al., 2008. Geographical information retrieval. Int. J. Geogr. Inf. Sci.
  • C.B. Jones et al., 2009. Encyclopedia of Database Systems.
  • J.L. Leidner et al., 2011. Detecting geographical references in the form of place names and associated spatial natural language. SIGSPATIAL Spec.
  • Leidner, J.L., Sinclair, G., Webber, B., 2003. Grounding spatial named entities for information extraction and question...
  • Leidner, J.L., 2007. Toponym resolution in text: annotation, evaluation and applications of spatial grounding of place...