Public News Archive: A Searchable Sub-archive to Portuguese Past News Articles

  • Conference paper
Advances in Information Retrieval (ECIR 2023)

Abstract

Over the past few decades, the amount of information generated has turned the Web into the largest knowledge infrastructure existing to date. Web archives have been at the forefront of data preservation, preventing the loss of data significant to humankind. Different snapshots of the web are saved every day, enabling users to surf the past web and to travel through it over time. Despite these efforts, many people are not aware that the web is being preserved, and often find these infrastructures unattractive or difficult to use when compared to common search engines. In this paper, we take a step towards making use of this preserved information by developing “Public Archive”, an intuitive interface that enables end-users to search and analyze a large-scale collection of 67,242 preserved past news articles belonging to a Portuguese reference newspaper (“Jornal Público”). The collection was obtained by scraping 10,976 versions of the homepage of “Jornal Público” preserved by the Portuguese web archive infrastructure (Arquivo.pt) between 2010 and 2021. By doing this, we aim not only to take a stand on making use of this preserved information, but also to provide an easy-to-follow solution, the Public Archive Python package, which lays the groundwork to be used (with minor adaptations) by other news providers interested in offering their readers access to past news articles.
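As a rough illustration of the collection step described in the abstract, the sketch below merges the articles found in many preserved homepage versions into one collection, keeping each article once. It is not the Public Archive package's code: the function names, the article dictionary layout, the keep-first-appearance rule, and the Wayback-style replay URL pattern are all assumptions made for this example, and the actual scraping of each snapshot is left out.

```python
from datetime import datetime

# Assumption: preserved pages are addressed Wayback-style as
# <prefix>/<14-digit timestamp>/<original URL>.
REPLAY_PREFIX = "https://arquivo.pt/wayback"
HOMEPAGE = "http://www.publico.pt"


def replay_url(timestamp: str, original_url: str) -> str:
    """Build the address of one preserved version of a page."""
    return f"{REPLAY_PREFIX}/{timestamp}/{original_url}"


def in_period(timestamp: str, first_year: int = 2010, last_year: int = 2021) -> bool:
    """Check that a YYYYMMDDHHMMSS timestamp falls within the given years."""
    year = datetime.strptime(timestamp, "%Y%m%d%H%M%S").year
    return first_year <= year <= last_year


def merge_snapshots(snapshots):
    """Merge articles from many homepage versions, keeping each URL once.

    `snapshots` is an iterable of (timestamp, articles) pairs, where
    `articles` is a list of dicts with at least a "url" key (hypothetical
    layout). The earliest snapshot in which an article appears wins.
    """
    collection = {}
    # 14-digit timestamps sort chronologically as plain strings.
    for timestamp, articles in sorted(snapshots, key=lambda s: s[0]):
        if not in_period(timestamp):
            continue
        for article in articles:
            if article["url"] not in collection:
                collection[article["url"]] = {
                    **article,
                    "first_seen": timestamp,
                    "snapshot": replay_url(timestamp, HOMEPAGE),
                }
    return list(collection.values())
```

Deduplicating by article URL is one plausible reason the number of articles (67,242) is far smaller than the number of homepage versions times the headlines shown on each; the paper itself should be consulted for the rule actually used.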

Notes

  1. http://arquivopublico.ipt.pt

  2. http://www.publico.pt

  3. http://arquivo.pt

  4. https://github.com/diogocorreia01/PublicNewsArchive/

Acknowledgments

This work is financed by National Funds through the Portuguese funding agency, FCT – Fundação para a Ciência e a Tecnologia, within project LA/P/0063/2020 and by the project Text2Story, financed by the ERDF – European Regional Development Fund through the Norte Portugal Regional Operational Programme – NORTE 2020 under the Portugal 2020 Partnership Agreement and by National Funds through the FCT – Fundação para a Ciência e a Tecnologia, I.P. (Portuguese Foundation for Science and Technology) within project Text2Story, with reference PTDC/CCI-COM/31857/2017 (NORTE-01-0145-FEDER-031857).

Author information

Corresponding author

Correspondence to Ricardo Campos.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Campos, R., Correia, D., Jatowt, A. (2023). Public News Archive: A Searchable Sub-archive to Portuguese Past News Articles. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13982. Springer, Cham. https://doi.org/10.1007/978-3-031-28241-6_16

  • DOI: https://doi.org/10.1007/978-3-031-28241-6_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28240-9

  • Online ISBN: 978-3-031-28241-6

  • eBook Packages: Computer Science, Computer Science (R0)
