Public News Archive: A Searchable Sub-archive to Portuguese Past News Articles

Campos, Ricardo; Correia, Diogo; Jatowt, Adam

doi:10.1007/978-3-031-28241-6_16

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13982)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

Over the past few decades, the amount of information generated has turned the Web into the largest knowledge infrastructure to date. Web archives have been at the forefront of data preservation, preventing the loss of data significant to humankind. Different snapshots of the web are saved every day, enabling users to surf the past web and to travel through it over time. Despite these efforts, many people are not aware that the web is being preserved, and often find these infrastructures unattractive or difficult to use compared to common search engines. In this paper, we take a step towards making use of this preserved information by developing “Public Archive”, an intuitive interface that enables end-users to search and analyze a large-scale collection of 67,242 preserved news articles belonging to a Portuguese reference newspaper (“Jornal Público”). The collection was obtained by scraping 10,976 versions of the homepage of “Jornal Público” preserved by the Portuguese web archive infrastructure (Arquivo.pt) between 2010 and 2021. By doing this, we aim not only to take a stand on making use of this preserved information, but also to offer an easy-to-follow solution, the Public Archive Python package, which lays the groundwork to be used (with minor adaptations) by other news source providers interested in offering their readers access to past news articles.
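
The step from many homepage snapshots to one collection of distinct articles can be illustrated with a minimal sketch. It is not the Public Archive package itself: the function names, the `looks_like_article` heuristic, the `example.org` base URL, and the sample HTML are all hypothetical, and the snapshots are assumed to be already downloaded as (timestamp, HTML) pairs. The sketch extracts links from each snapshot and keeps every distinct article URL together with the earliest snapshot that links to it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def looks_like_article(url):
    # Hypothetical heuristic (an assumption, not the paper's rule):
    # a URL path with at least two segments is treated as an article link.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return len(segments) >= 2


def collect_articles(snapshots, base_url):
    """Map each distinct article URL to the earliest snapshot timestamp linking to it.

    snapshots: iterable of (timestamp, html) pairs, where timestamp is a
    sortable string such as 'YYYYMMDDhhmmss'.
    """
    first_seen = {}
    for timestamp, html in sorted(snapshots):  # oldest snapshot first
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.hrefs:
            url = urljoin(base_url, href)  # resolve relative links
            if looks_like_article(url) and url not in first_seen:
                first_seen[url] = timestamp
    return first_seen


# Made-up sample data: two snapshots that share one article link.
snapshots = [
    ("20110105120000", '<a href="/politica/noticia-b">B</a><a href="/politica/noticia-a">A</a>'),
    ("20100101120000", '<a href="/politica/noticia-a">A</a><a href="/sobre">About</a>'),
]
articles = collect_articles(snapshots, "https://example.org/")
print(len(articles))
```

Deduplicating by URL is what lets the number of articles stay far below the number of links seen, since consecutive homepage versions largely repeat the same headlines. A real pipeline would also need site-specific rules for telling article links apart from navigation links.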

Acknowledgments

This work is financed by National Funds through the Portuguese funding agency, FCT – Fundação para a Ciência e a Tecnologia, within project LA/P/0063/2020 and by the project Text2Story, financed by the ERDF – European Regional Development Fund through the Norte Portugal Regional Operational Programme – NORTE 2020 under the Portugal 2020 Partnership Agreement and by National Funds through the FCT – Fundação para a Ciência e a Tecnologia, I.P. (Portuguese Foundation for Science and Technology) within project Text2Story, with reference PTDC/CCI-COM/31857/2017 (NORTE-01-0145-FEDER-031857).

Author information

Authors and Affiliations

LIAAD – INESCTEC, Porto, Portugal
Ricardo Campos
Polytechnic Institute of Tomar, Ci2 – Smart Cities Research Center, Tomar, Portugal
Ricardo Campos & Diogo Correia
University of Innsbruck, Innsbruck, Austria
Adam Jatowt

Authors

Ricardo Campos
Diogo Correia
Adam Jatowt

Corresponding author

Correspondence to Ricardo Campos.

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Jaap Kamps
Université Grenoble-Alpes, Saint-Martin-d’Hères, France
Lorraine Goeuriot
Università della Svizzera Italiana, Lugano, Switzerland
Fabio Crestani
University of Copenhagen, Copenhagen, Denmark
Maria Maistro
University of Tsukuba, Ibaraki, Japan
Hideo Joho
Dublin City University, Dublin, Ireland
Brian Davis
Dublin City University, Dublin, Ireland
Cathal Gurrin
Universität Regensburg, Regensburg, Germany
Udo Kruschwitz
Dublin City University, Dublin, Ireland
Annalina Caputo

About this paper

Cite this paper

Campos, R., Correia, D., Jatowt, A. (2023). Public News Archive: A Searchable Sub-archive to Portuguese Past News Articles. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13982. Springer, Cham. https://doi.org/10.1007/978-3-031-28241-6_16

DOI: https://doi.org/10.1007/978-3-031-28241-6_16
Published: 16 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28240-9
Online ISBN: 978-3-031-28241-6
eBook Packages: Computer Science, Computer Science (R0)
