Review article
A survey on the geographic scope of textual documents
Introduction
The demand for geographic data in applications on the Web is increasing. One of the most important resources to support this increased interest is the ability to recognize references to places in Web documents. If documents can be correctly and efficiently linked to places mentioned directly or indirectly in them, it becomes possible to improve and innovate in directions such as geographic indexing and querying, finding relationships based on spatial proximity or containment, and detecting localized trends for events and phenomena mentioned in social media.
A large share of the information available on the Web is geographically specific (Delboni et al., 2007, Vaid et al., 2005, Vasardani et al., 2013). References to geographic locations appear in the form of place names, postal addresses, postcodes, historical dates, demonyms, ethnicity, typical food and others. Many queries include place names and other geographic terms (Delboni et al., 2007, Sanderson and Kohler, 2005, Silva et al., 2006). Therefore, there is demand for mechanisms to search for documents both thematically (for instance, using a set of keywords) and geographically, based on places mentioned or referenced by the text (Zong et al., 2005). Similar techniques and resources can also apply to streaming data, such as Twitter messages or RSS feeds, providing the opportunity to index content in near-real-time, based on references to places.
However, finding references to places in Web documents involves ambiguity and uncertainty. Places can share a name with other places (Paris, besides being the capital of France, refers to more than sixty places around the world). Places are also named using common language words (Park, Hope and Independence are American cities) and proper names (Washington, Houston and San Francisco). The first type of ambiguity, which occurs when a place name references multiple places, is called Geo/Geo ambiguity, or referent ambiguity. The second type is called Geo/Non-Geo ambiguity (referent class ambiguity), which occurs when both a location and a non-location share the same name (Amitay et al., 2004). Clough et al. (2004) suggest a third type of ambiguity, named reference ambiguity, which occurs when a place is associated with many names, like New York, NYC or The Big Apple. Ambiguity makes the resolution of references to places intrinsically context-based. Although there is important work on place-based information integration and retrieval, areas such as disambiguation are still in their infancy (Vasardani et al., 2013).
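The three kinds of ambiguity can be made concrete with a small sketch. The gazetteer entries, the lexicon of non-geographic senses and the function name `ambiguity_types` below are illustrative assumptions, not resources or methods taken from the surveyed literature.

```python
from collections import defaultdict

# Hypothetical toy gazetteer: (place name, place identifier) pairs.
GAZETTEER = [
    ("Paris", "Paris, France"),
    ("Paris", "Paris, Texas, USA"),
    ("Hope", "Hope, Arkansas, USA"),
    ("Houston", "Houston, Texas, USA"),
    ("New York", "New York City, USA"),
    ("NYC", "New York City, USA"),
    ("The Big Apple", "New York City, USA"),
]

# Hypothetical lexicon of names that also have a non-location sense
# (common words or personal names).
NON_GEO_SENSES = {"hope", "park", "independence", "washington", "houston"}

# Index the gazetteer in both directions.
PLACES_BY_NAME = defaultdict(set)
NAMES_BY_PLACE = defaultdict(set)
for name, place in GAZETTEER:
    PLACES_BY_NAME[name].add(place)
    NAMES_BY_PLACE[place].add(name)


def ambiguity_types(name):
    """Return the set of ambiguity types that apply to a place name."""
    types = set()
    places = PLACES_BY_NAME.get(name, set())
    # Geo/Geo (referent ambiguity): one name, several places.
    if len(places) > 1:
        types.add("geo/geo")
    # Geo/Non-Geo (referent class ambiguity): the name is also a non-location.
    if places and name.lower() in NON_GEO_SENSES:
        types.add("geo/non-geo")
    # Reference ambiguity: a place the name refers to has other names too.
    if any(len(NAMES_BY_PLACE[p]) > 1 for p in places):
        types.add("reference")
    return types


for candidate in ("Paris", "Hope", "NYC", "Atlantis"):
    print(candidate, sorted(ambiguity_types(candidate)))
```

A real system would rely on a full gazetteer and on the surrounding context rather than on lookups alone, since the same name can fall into more than one category.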
An important resource to address disambiguation is the determination of the geographic scope of the document, i.e., the set of places referenced by and relevant to the contents of the document. “Every document has a geographical scope” (Andogah et al., 2012). Even keyword queries to search engines can have a geographic scope (Alexopoulos and Ruiz, 2012, Silva et al., 2006), since query words embed the user's intentions in the search.
References to places can be straightforward and unambiguous, as with geographic coordinates, or not. Other sources of geographic location information can be structured (postal addresses) or unstructured (place descriptions in text). They can also be direct (place names) or indirect (references to cultural characteristics associated with places), explicit (news headers) or implicit (“9/11”). Humans are often able to recognize references to places based on such evidence, but this association does not come so easily to automated systems. Addressing this problem is one of the pressing tasks for Geographic Information Retrieval (GIR) research.
GIR extends Information Retrieval (Baeza-Yates and Ribeiro-Neto, 1999) with the use of geographic locations and metadata (Jones and Purves, 2009), taking it beyond the use of keywords. GIR studies methods and techniques for the retrieval of information from unstructured or partially structured sources, including relevance ranking, based on queries that specify both theme and geographic scope (Jones and Purves, 2008, Jones and Purves, 2009). One of the most important current research subjects in GIR involves recognizing references to places, including implicit references, in regular text and in other media, such as photos and videos (Luo et al., 2011). The recognition of references to places in media other than text documents is beyond the scope of this paper.
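As a minimal sketch of a query that specifies both theme and geographic scope, the example below filters a toy collection by keywords and by place. The `Document` class, the `search` function and the sample documents are assumptions made for illustration; the scopes are taken as already resolved, and relevance ranking is left out.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str
    scope: frozenset  # place identifiers, assumed to be already resolved


def search(docs, keywords, places):
    """Return ids of documents that contain every keyword (theme)
    and whose geographic scope includes at least one queried place."""
    wanted = {k.lower() for k in keywords}
    targets = set(places)
    hits = []
    for doc in docs:
        tokens = set(doc.text.lower().split())
        if wanted <= tokens and doc.scope & targets:
            hits.append(doc.doc_id)
    return hits


# Hypothetical sample collection.
DOCS = [
    Document("d1", "flood warning issued for the river", frozenset({"Paris, France"})),
    Document("d2", "flood damages crops", frozenset({"Paris, Texas, USA"})),
    Document("d3", "museum opening hours", frozenset({"Paris, France"})),
]

print(search(DOCS, ["flood"], ["Paris, France"]))
```

Keyword matching alone would return both flood documents here; the geographic filter is what separates the two places named Paris.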
Many initiatives to tackle the GIR problem of recognizing references to places in text have arisen in the recent past, usually with varied or conflicting descriptions or terminologies, targeting various applications, or using a range of reference data. The main contributions of this survey are (1) a structured and comprehensive view on contributions to the problem of determining the geographic scope of documents, (2) a review of relevant definitions and concepts, (3) a discussion of the main techniques currently used to address the Geographic Scope Resolution (GSR) problem, and (4) a proposal for a future research agenda in the area.
This paper is organized as follows. Section 2 introduces the main application areas related to the determination of the geographic scope of documents. Section 3 discusses the terminological variations found in the literature and presents them in a structured way. Section 4 focuses on the GSR problem, presenting a set of methods and a discussion of existing techniques from the literature. Finally, Section 5 shows our conclusions and indicates future research directions for the field.
Section snippets
Main application areas
In this section, we present a high-level description of what can be gained if efficient methods for determining the geographic scope of documents are available. The main application areas are divided into two groups: (1) contributions to IR, in the form of tools and techniques that incorporate geographic variables, and (2) contributions to Web data mining.
The geographic scope resolution problem
In this section, we review a set of concepts and the terminology used in the description of algorithms and techniques addressing the Geographic Scope Resolution (GSR) problem. Key and ground terms and their respective definitions are extracted from the relevant literature and adjusted to the proposed definition of the GSR problem, presented in more detail in Section 3.2.
Proposed solutions to the geographic scope resolution problem
This section presents a number of proposals from the literature for the solution of the GSR problem. The contributions are divided into two groups. First, Section 4.1 presents proposals that cover the entire GSR problem, although sometimes using concepts and steps that are slightly different from the definitions presented in Section 3. Next, Section 4.2 covers contributions that are specific to the geoparsing step. Section 4.3 presents proposals connected to the reference resolution and Section
Conclusions and future work
The goal of this survey was to provide a comprehensive and structured view of contributions to the geographic scope resolution problem. We presented a definition of the GSR problem, dividing the problem into three steps: geoparsing, reference resolution and grounding references. Solutions to the first two steps were organized according to the method used, and proposals for the third step were organized according to the type of output produced. We would like to emphasize that the classes in each
Acknowledgments
This work was partially supported by FAPEMIG (grant CEX-PPM-00679/15) and CNPq (grants 303532/2015-7, 459818/2014-2 and 401822/2013-3), Brazilian agencies in charge of fostering research and development.
References (102)
- et al. Every document has a geographical scope. Data Knowl. Eng. (2012)
- et al. Adding geographic scopes to web resources. Comput. Environ. Urban Syst. (2006)
- Adelfio, M.D., Samet, H., 2013. Geowhiz: toponym resolution using common categories. In: Proceedings of the 21st ACM...
- Ahlers, D., 2013. Assessment of the accuracy of Geonames gazetteer data. In: Proceedings of the 7th Workshop on...
- Alencar, R.O., Davis Jr., C.A., 2011. Advancing Geoinformation Science for a Changing World, vol. 1, Springer Berlin...
- Alencar, R.O., Davis Jr., C.A., Gonçalves, M.A., 2010. Geographical classification of documents using evidence from...
- Alexopoulos, P., Ruiz, C., 2012. Optimizing geographical entity and scope resolution in texts using non-geographical...
- et al. KLocator: an ontology-based framework for scenario-driven geographical scope resolution. Int. J. Adv. Intell. Syst. (2013)
- Amitay, E., Har'El, N., Sivan, R., Soffer, A., 2004. Web-a-where: Geotagging web content. In: Proceedings of the 27th...
- Anastácio, I., Martins, B., Calado, P., 2009. Progress in Artificial Intelligence: 14th Portuguese Conference on...
- Modern Information Retrieval
- Ontology-driven discovery of geospatial evidence in web pages. GeoInformatica
- A conceptual density-based approach for the disambiguation of toponyms. Int. J. Geogr. Inf. Sci.
- Evaluating geographic information retrieval. SIGSPATIAL Spec.
- Extracting and displaying temporal and geospatial entities from articles on historical events. Comput. J.
- Assessing the certainty of locations produced by an address geocoding system. GeoInformatica
- Inferring the location of Twitter messages based on user relationships. Trans. GIS
- Semantic expansion of geographic Web queries based on natural language positioning expressions. Trans. GIS
- Fast likelihood search for hidden Markov models. ACM Trans. Knowl. Discov. Data (TKDD)
- From text to geographic coordinates: the current state of geocoding. URISA J. (J. Urban Reg. Inf. Assoc.)
- Reference data and geocoding quality: examining completeness and positional accuracy of street geocoded crime incidents. Policing: Int. J. Police Strateg. Manag.
- Support vector machines. IEEE Intell. Syst.
- Georeferencing: The Geographic Associations of Information
- Geographical information retrieval. Int. J. Geogr. Inf. Sci.
- Encyclopedia of Database Systems
- Detecting geographical references in the form of place names and associated spatial natural language. SIGSPATIAL Spec.