A survey on the geographic scope of textual documents

doi:10.1016/j.cageo.2016.07.017

Computers & Geosciences

Volume 96, November 2016, Pages 23-34

https://doi.org/10.1016/j.cageo.2016.07.017

Highlights

• A structured and comprehensive view on contributions to the problem of determining the geographic scope of documents.
• Review of relevant definitions and concepts.
• A discussion of the main techniques currently used to address the Geographic Scope Resolution (GSR) problem and its subproblems.
• A proposal for a future research agenda in the area.

Abstract

Recognizing references to places in texts is needed in many applications, such as search engines, location-based social media and document classification. In this paper we present a survey of methods and techniques for the recognition and identification of places referenced in texts. We discuss concepts and terminology, and propose a classification of the solutions given in the literature. We introduce a definition of the Geographic Scope Resolution (GSR) problem, dividing it into three steps: geoparsing, reference resolution, and grounding references. Solutions to the first two steps are organized according to the method used, and solutions to the third step are organized according to the type of output produced. We found that it is difficult to compare existing solutions directly to one another, because they often create their own benchmarking data, targeted to their own problem.

Introduction

The demand for geographic data in applications on the Web is increasing. One of the most important resources to support this increased interest is the ability to recognize references to places in Web documents. If documents can be correctly and efficiently linked to places mentioned directly or indirectly in them, it becomes possible to improve and innovate in directions such as geographic indexing and querying, finding relationships based on spatial proximity or containment, and detecting localized trends for events and phenomena mentioned in social media.

A large share of the information available on the Web is geographically specific (Delboni et al., 2007, Vaid et al., 2005, Vasardani et al., 2013). References to geographic locations appear in the form of place names, postal addresses, postcodes, historical dates, demonyms, ethnicity, typical food and others. Many queries include place names and other geographic terms (Delboni et al., 2007, Sanderson and Kohler, 2005, Silva et al., 2006). Therefore, there is demand for mechanisms to search for documents both thematically (for instance, using a set of keywords) and geographically, based on places mentioned or referenced by the text (Zong et al., 2005). Similar techniques and resources can also apply to streaming data, such as Twitter messages or RSS feeds, providing the opportunity to index content in near-real-time, based on references to places.

However, ambiguity and uncertainty arise when finding references to places in Web documents. Places can share a name with other places (Paris, besides being the capital of France, is the name of more than sixty places around the world). Places are also named using common language words (Park, Hope and Independence are American cities) and proper names (Washington, Houston and San Francisco). The first type of ambiguity, called Geo/Geo ambiguity or referent ambiguity, occurs when a place name references multiple places. The second type, called Geo/Non-Geo ambiguity or referent class ambiguity, occurs when a location and a non-location share the same name (Amitay et al., 2004). Clough et al. (2004) suggest a third type, named reference ambiguity, which occurs when a place is associated with many names, such as New York, NYC or The Big Apple. Ambiguity makes the resolution of references to places intrinsically context-based. Although there is important work on place-based information integration and retrieval, areas such as disambiguation are still in their infancy (Vasardani et al., 2013).
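The three ambiguity types can be made concrete with a toy lookup. The sketch below is only an illustration: the gazetteer entries, the list of common words, and the `ambiguity_types` helper are hypothetical, and real systems rely on much larger gazetteers and on context.

```python
from collections import defaultdict

# Toy gazetteer (hypothetical data): (place id, name used for that place).
GAZETTEER = [
    ("paris-fr", "Paris"),
    ("paris-tx", "Paris"),
    ("hope-ar", "Hope"),
    ("nyc", "New York"),
    ("nyc", "NYC"),
    ("nyc", "The Big Apple"),
]

# Toy list of ordinary words (hypothetical), used to flag Geo/Non-Geo cases.
COMMON_WORDS = {"hope", "park", "independence"}


def build_indexes(entries):
    """Index the gazetteer by lowercased name and by place id."""
    places_by_name = defaultdict(set)
    names_by_place = defaultdict(set)
    for place_id, name in entries:
        places_by_name[name.lower()].add(place_id)
        names_by_place[place_id].add(name)
    return places_by_name, names_by_place


def ambiguity_types(name, entries=GAZETTEER, common_words=COMMON_WORDS):
    """Return the set of ambiguity types that apply to a candidate name."""
    places_by_name, names_by_place = build_indexes(entries)
    key = name.lower()
    candidates = places_by_name.get(key, set())
    found = set()
    if len(candidates) > 1:
        found.add("geo/geo")        # one name, several places
    if candidates and key in common_words:
        found.add("geo/non-geo")    # the name is also an ordinary word
    if any(len(names_by_place[p]) > 1 for p in candidates):
        found.add("reference")      # one place, several names
    return found
```

With this toy data, "Paris" is flagged as Geo/Geo, "Hope" as Geo/Non-Geo, and "NYC" as a case of reference ambiguity. A name can fall into more than one class once the gazetteer grows.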

An important resource to address disambiguation is the determination of the geographic scope of the document, i.e., the set of places referenced by and relevant to the contents of the document. “Every document has a geographical scope” (Andogah et al., 2012). Even keyword queries to search engines can have a geographic scope (Alexopoulos and Ruiz, 2012, Silva et al., 2006), since query words embed the user's intentions in the search.

References to places may or may not be straightforward and unambiguous, as geographic coordinates are. Other sources of geographic location information can be structured (postal addresses) or unstructured (place descriptions in text). They can also be direct (place names) or indirect (references to cultural characteristics associated with places), explicit (news headers) or implicit (“9/11”). Humans are often able to recognize references to places based on such evidence, but this association does not come so easily to automated systems. Addressing this problem is one of the pressing tasks for Geographic Information Retrieval (GIR) research.

GIR extends Information Retrieval (Baeza-Yates and Ribeiro-Neto, 1999) with the use of geographic locations and metadata (Jones and Purves, 2009), taking it beyond the use of keywords. GIR studies methods and techniques for the retrieval of information from unstructured or partially structured sources, including relevance ranking, based on queries that specify both theme and geographic scope (Jones and Purves, 2008, Jones and Purves, 2009). One of the most important research subjects currently in GIR involves recognizing references to places in regular text, and also in other media, such as photos and videos (Luo et al., 2011), including implicit references. The recognition of references to places in media other than text documents is beyond the scope of this paper.
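As a rough illustration of a query that specifies both theme and geographic scope, the sketch below filters a small document collection by keyword and by region, then ranks the matches by keyword count. The `BBox`, `Document` and `search` names, the bounding-box representation of scope, and the term-count ranking are all assumptions made for this example; none of them is taken from the surveyed systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box in degrees (assumed scope representation)."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def intersects(self, other):
        """True when the two boxes share at least one point."""
        return (self.min_lon <= other.max_lon and other.min_lon <= self.max_lon
                and self.min_lat <= other.max_lat and other.min_lat <= self.max_lat)


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    scope: BBox  # geographic scope, assumed to be already determined


def search(docs, keywords, region):
    """Return ids of documents matching the theme and the region.

    Ranking is a plain count of keyword occurrences (highest first),
    with ties broken by document id.
    """
    terms = [k.lower() for k in keywords]
    hits = []
    for doc in docs:
        words = doc.text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0 and doc.scope.intersects(region):
            hits.append((score, doc.doc_id))
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [doc_id for _, doc_id in hits]
```

A keyword-only search would return every document that mentions the theme. The geographic filter keeps only those whose scope overlaps the queried region, which is where correctly determined document scopes matter.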

Many initiatives to tackle the GIR problem of recognizing references to places in text have arisen in the recent past, usually with varied or conflicting descriptions or terminologies, targeting various applications, or using a range of reference data. The main contributions of this survey are (1) a structured and comprehensive view on contributions to the problem of determining the geographic scope of documents, (2) a review of relevant definitions and concepts, (3) a discussion of the main techniques currently used to address the Geographic Scope Resolution (GSR) problem, and (4) a proposal for a future research agenda in the area.

This paper is organized as follows. Section 2 introduces the main application areas related to the determination of the geographic scope of documents. Section 3 discusses the terminological variations found in the literature and presents them in a structured way. Section 4 focuses on the GSR problem, presenting a set of methods and a discussion of existing techniques from the literature. Finally, Section 5 shows our conclusions and indicates future research directions for the field.

Section snippets

Main application areas

In this section, we present a high level description of what can be gained if efficient methods for determining the geographic scope of documents are available. The main application areas are divided into two groups: (1) contributions to IR, in the form of tools and techniques that incorporate geographic variables, and (2) contributions to Web data mining.

The geographic scope resolution problem

In this section, we review a set of concepts and the terminology used in the description of algorithms and techniques addressing the Geographic Scope Resolution (GSR) problem. Key and ground terms and their respective definitions are extracted from the relevant literature and adjusted to the proposed definition of the GSR problem, presented in more detail in Section 3.2.
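The three steps named in the abstract (geoparsing, reference resolution, and grounding references) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the definition given in the paper: the toy gazetteer, the single-word matching, the country-voting heuristic, and the choice of a bounding box as the output are all hypothetical.

```python
import re

# Hypothetical toy gazetteer: lowercased name -> candidate places.
GAZETTEER = {
    "paris": [
        {"id": "paris-fr", "country": "France", "lat": 48.86, "lon": 2.35},
        {"id": "paris-tx", "country": "United States", "lat": 33.66, "lon": -95.56},
    ],
    "lyon": [{"id": "lyon-fr", "country": "France", "lat": 45.76, "lon": 4.84}],
    "houston": [{"id": "houston-tx", "country": "United States", "lat": 29.76, "lon": -95.37}],
}


def geoparse(text):
    """Step 1 (geoparsing): return gazetteer names found in the text, in order."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return [t for t in tokens if t in GAZETTEER]


def resolve(mentions):
    """Step 2 (reference resolution): pick one candidate per mention.

    Assumed heuristic: each mention votes for every country among its
    candidates, and the candidate in the most-voted country wins.
    On a tie, the first candidate listed in the gazetteer is kept.
    """
    votes = {}
    for m in mentions:
        for country in {c["country"] for c in GAZETTEER[m]}:
            votes[country] = votes.get(country, 0) + 1
    return [max(GAZETTEER[m], key=lambda c: votes[c["country"]]) for m in mentions]


def ground(places):
    """Step 3 (grounding): bounding box (min_lon, min_lat, max_lon, max_lat)."""
    if not places:
        return None
    lons = [p["lon"] for p in places]
    lats = [p["lat"] for p in places]
    return (min(lons), min(lats), max(lons), max(lats))


def geographic_scope(text):
    """Run the three steps in sequence."""
    return ground(resolve(geoparse(text)))
```

In this sketch the context decides the outcome: "Paris" resolves to the French city when it appears next to "Lyon" and to the Texan one when it appears next to "Houston". A bounding box is only one possible output; other output types, such as a single place or a region name, would change only the last step.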

Proposed solutions to the geographic scope resolution problem

This section presents a number of proposals from the literature for the solution of the GSR problem. The contributions are divided into groups. First, Section 4.1 presents proposals that cover the entire GSR problem, although sometimes using concepts and steps that are slightly different from the definitions presented in Section 3. Next, Section 4.2 covers contributions that are specific to the geoparsing step. Section 4.3 presents proposals connected to the reference resolution and Section

Conclusions and future work

The goal of this survey was to provide a comprehensive and structured view on contributions to the geographic scope resolution problem. We presented a definition of the GSR problem, dividing the problem into three steps: geoparsing, reference resolution and grounding references. Solutions to the first two steps were organized according to the method used, and proposals to the third step were organized according to the type of output produced. We would like to emphasize that the classes in each

Acknowledgments

This work was partially supported by FAPEMIG (grant CEX-PPM-00679/15) and CNPq (grants 303532/2015-7, 459818/2014-2 and 401822/2013-3), Brazilian agencies in charge of fostering research and development.

References (102)

Andogah, G., et al., 2012. Every document has a geographical scope. Data Knowl. Eng.
Silva, M.J., et al., 2006. Adding geographic scopes to web resources. Comput. Environ. Urban Syst.
Adelfio, M.D., Samet, H., 2013. Geowhiz: toponym resolution using common categories. In: Proceedings of the 21st ACM...
Ahlers, D., 2013. Assessment of the accuracy of Geonames gazetteer data. In: Proceedings of the 7th Workshop on...
Alencar, R.O., Davis Jr., C.A., 2011. Advancing Geoinformation Science for a Changing World, vol. 1, Springer Berlin...
Alencar, R.O., Davis Jr., C.A., Gonçalves, M.A., 2010. Geographical classification of documents using evidence from...
Alexopoulos, P., Ruiz, C., 2012. Optimizing geographical entity and scope resolution in texts using non-geographical...
Alexopoulos, P., et al., 2013. KLocator: an ontology-based framework for scenario-driven geographical scope resolution. Int. J. Adv. Intell. Syst.
Amitay, E., Har'El, N., Sivan, R., Soffer, A., 2004. Web-a-where: Geotagging web content. In: Proceedings of the 27th...
Anastácio, I., Martins, B., Calado, P., 2009. Progress in Artificial Intelligence: 14th Portuguese Conference on...
Anastácio, I., Martins, B., Calado, P., 2009. A comparison of different approaches for assigning geographic scopes to...
Backstrom, L., Sun, E., Marlow, C., 2010. Find me if you can: improving geographical prediction with social and spatial...
Baeza-Yates, R.A., et al., 1999. Modern Information Retrieval.
Borges, K.A.V., Laender, A.H.F., Medeiros, C.B., Davis Jr., C.A., 2007. Discovering geographic locations in Web pages...
Borges, K.A.V., et al., 2011. Ontology-driven discovery of geospatial evidence in web pages. GeoInformatica.
Buscaldi, D., et al., 2008. A conceptual density-based approach for the disambiguation of toponyms. Int. J. Geogr. Inf. Sci.
Buscaldi, D., Rosso, P., 2008. Map-based vs. knowledge-based toponym disambiguation. In: Proceedings of the 2nd...
Buyukkokten, O., Cho, J., Garcia-Molina, H., Gravano, L., Shivakumar, N., 1999. Exploiting geographical location...
Campelo, C.E.C., Baptista, C.S., 2008. Geographic scope modeling for Web documents. In: Proceedings of the 2nd...
Cardoso, N., Silva, M.J., Santos, D., 2008. Handling implicit geographic evidence for geographic IR. In: Proceedings of...
Cardoso, N., 2011. Evaluating geographic information retrieval. SIGSPATIAL Spec.
Chasin, R., et al., 2013. Extracting and displaying temporal and geospatial entities from articles on historical events. Comput. J.
Chen, M., Lin, X., Zhang, Y., Wang, X., Yu, H., 2010. Assigning geographical focus to documents. In: 18th Conference on...
Clough, P., Sanderson, M., Joho, H., 2004. Extraction of Semantic Annotations from Textual Web Pages, Technical Report...
Clough, P., 2005. Extracting metadata for spatially-aware information retrieval on the internet. In: Proceedings of the...
Curran, J.R., Clark, S., 2003. Language independent NER using a maximum entropy tagger. In: Proceedings of the Seventh...
Davis, C.A., et al., 2007. Assessing the certainty of locations produced by an address geocoding system. GeoInformatica.
Davis, C.A., et al., 2011. Inferring the location of Twitter messages based on user relationships. Trans. GIS.
Delboni, T.M., et al., 2007. Semantic expansion of geographic Web queries based on natural language positioning expressions. Trans. GIS.
DeLozier, G., Baldridge, J., London, L., 2015. Gazetteer-independent toponym resolution using geographic word profiles....
Ding, J., Gravano, L., Shivakumar, N., 2000. Computing geographical scopes of Web resources. In: Proceedings of the 26th...
Drymonas, E., Pfoser, D., 2010. Geospatial route extraction from texts. In: Proceedings of the 1st ACM SIGSPATIAL...
Fu, G., Jones, C.B., Abdelmoty, A.I., 2005. Building a geographical ontology for intelligent spatial search on the Web....
Fujiwara, Y., et al., 2009. Fast likelihood search for hidden Markov models. ACM Trans. Knowl. Discov. Data (TKDD).
Garbin, E., Mani, I., 2005. Disambiguating toponyms in news. In: Proceedings of the Conference on Human Language...
Goldberg, D.W., et al., 2007. From text to geographic coordinates: the current state of geocoding. URISA J. (J. Urban Reg. Inf. Assoc.).
Gouvêa, C., Loh, S., Garcia, L.F.F., Fonseca, E.B., Wendt, I., 2008. Discovering location indicators of toponyms from...
Gravano, L., Hatzivassiloglou, V., Lichtenstein, R., 2003. Categorizing Web queries according to geographical locality....
Habib, M.B., van Keulen, M., 2011. Named entity extraction and disambiguation: The reinforcement effect. In:...
Habib, M.B., van Keulen, M., 2012. Web Engineering. In: Proceedings of 12th International Conference, ICWE 2012,...
Habib, M.B., van Keulen, M., 2013. A hybrid approach for robust multilingual toponym extraction and disambiguation. In:...
Hart, T.C., et al., 2013. Reference data and geocoding quality: examining completeness and positional accuracy of street geocoded crime incidents. Policing: Int. J. Police Strateg. Manag.
Hearst, M.A., 1998. Support vector machines. IEEE Intell. Syst.
Hill, L.L., 2000. Research and Advanced Technology for Digital Libraries. In: Proceedings of 4th European Conference,...
Hill, L.L., 2006. Georeferencing: The Geographic Associations of Information.
Jones, C.B., et al., 2008. Geographical information retrieval. Int. J. Geogr. Inf. Sci.
Jones, C.B., et al., 2009. Encyclopedia of Database Systems.
Leidner, J.L., et al., 2011. Detecting geographical references in the form of place names and associated spatial natural language. SIGSPATIAL Spec.
Leidner, J.L., Sinclair, G., Webber, B., 2003. Grounding spatial named entities for information extraction and question...
Leidner, J.L., 2007. Toponym resolution in text: annotation, evaluation and applications of spatial grounding of place...
