Contents and time sensitive document ranking of scientific literature

https://doi.org/10.1016/j.joi.2014.04.006

Highlights

  • A new ranking framework better models the behaviour of readers of scientific papers.

  • Dynamic adjustments of random walk parameters according to a paper's contents and age.

  • Scoring a paper with awareness of its diversity of topics and how they evolve.

  • More effective than PageRank in ranking papers’ potential scientific utility.

  • High practical value in retrieving and placing useful papers at the top of the ranking.

Abstract

A new link-based document ranking framework is devised with, at its heart, a contents and time sensitive random literature explorer designed to model the behaviour of readers of scientific documents more accurately. In particular, our ranking framework dynamically adjusts its random walk parameters according to both the contents and the age of encountered documents, thus incorporating the diversity of topics and how they evolve over time into the score of a scientific publication. Our random walk framework results in a ranking of scientific documents which is shown to be more effective in facilitating literature exploration than PageRank, as measured against a proxy gold standard based on papers’ potential usefulness in facilitating later research. One of its many strengths lies in its practical value in reliably retrieving potentially useful papers and placing them at the top of its ranking.

Section snippets

Introduction and motivation

The explosive growth of the Internet and the overabundance of data fuel the creation and development of information networks, which constantly poses new challenges for information retrieval. As the searched domains expand, even queries targeted at some niche field retrieve a large volume of potentially relevant information that far exceeds human processing capabilities. Ranking addresses the challenge of information overload by identifying material of the highest “quality” among all “relevant”

Scientific document ranking

Scientific document ranking is a challenging task whose core problem is to quantify the importance of academic publications. Citation count based metrics have a long lineage, tracing back to the pioneering work done by Garfield on citation analysis in the 1970s ([Garfield, 1972], [Garfield, 1979]), and they are still widely used today. However, citation count has been challenged for being a quantitative measure of the popularity of a scientific document that fails to properly capture

Problem statement

We aim at ranking documents in a citation network to help researchers identify papers of high scientific utility in their field, a paper's usefulness being acknowledged in the kind of incoming citations it receives from later work. A scientific citation network has the same abstract structure as any other directed network, but it is distinctively static in nature: the contents of a document and the references it includes are frozen at the time of publication, imposing a strict temporal

Our approach

Given the recent relative success of applying PageRank to scientific document ranking ([King et al., 2013], [Li et al., 2008], [Sayyadi and Getoor, 2009], [Walker et al., 2007]), we propose to use PageRank as a starting point for our link-based ranking framework. But as argued in Section 3, some modifications are required to adapt PageRank and properly deal with the ageing factor and the intricate topic dynamics in a citation network in order to make the resulting ranking more useful. PageRank
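For reference, the baseline named above can be sketched as follows. This is a minimal illustration of plain PageRank computed by power iteration on a citation network, not the authors' framework: it uses a single fixed damping factor and ignores both contents and age. The function name `pagerank`, the damping value of 0.85, the uniform handling of papers without references, and the toy network are all assumptions made for illustration.

```python
def pagerank(citations, damping=0.85, tol=1e-10, max_iter=200):
    """Plain PageRank by power iteration (baseline only, not RALEX).

    `citations` maps each paper to the list of papers it cites.
    """
    # Collect every paper that appears as a citing or a cited document.
    papers = sorted(set(citations) | {p for refs in citations.values() for p in refs})
    n = len(papers)
    scores = {p: 1.0 / n for p in papers}

    for _ in range(max_iter):
        # Papers that cite nothing spread their score uniformly (an assumption).
        dangling = sum(scores[p] for p in papers if not citations.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_scores = {p: base for p in papers}

        # Each citing paper shares its damped score equally among its references.
        for citing, refs in citations.items():
            if refs:
                share = damping * scores[citing] / len(refs)
                for cited in refs:
                    new_scores[cited] += share

        delta = sum(abs(new_scores[p] - scores[p]) for p in papers)
        scores = new_scores
        if delta < tol:
            break
    return scores


# Hypothetical toy network: newer papers cite only older ones.
toy_citations = {
    "D": ["B", "C"],
    "C": ["A", "B"],
    "B": ["A"],
    "A": [],
}
ranking = sorted(pagerank(toy_citations).items(), key=lambda kv: kv[1], reverse=True)
```

In this toy network the oldest paper accumulates the highest score and the newest, which has had no chance to be cited, the lowest. This is the kind of age effect that a fixed-parameter random walk does not correct for.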

A proxy gold standard

Without a notion of gold standard, evaluating ranking results on intrinsic measures, e.g., the potential scientific utility of academic publications, is infeasible ([King et al., 2013], [Walker et al., 2007]). However, the construction of a perfect gold standard would inevitably involve large-scale human evaluations, which have been prohibitively expensive for all studies to date. A common practice to get around this problem that is well accepted in the information retrieval community is the

Conclusion and future work

In this paper we described RALEX, a random-walk based document ranking framework for scientific literature that helps researchers identify publications with a high potential for scientific utility in their domains of interest, a task that is all the more challenging as those domains are at the frontier of science and technology. To the best of our knowledge, this work represents the first attempt to rank scientific papers in decreasing order of potential usefulness taking both topical contents and age

Acknowledgement

The authors would like to thank the anonymous reviewers for their valuable comments and constructive suggestions.

References (30)

  • E. Garfield

    Citation analysis as a tool in journal evaluation

    Science

    (1972)
  • E. Garfield

    Citation indexing: Its theory and application in science, technology, and humanities

    Information sciences series

    (1979)
  • D.F. Gleich et al.

    Tracking the random surfer: Empirically measured teleportation parameters in PageRank

    (2010)
  • T.L. Griffiths et al.

    Finding scientific topics

    Proceedings of the National Academy of Sciences

    (2004)
  • D. Hall et al.

    Studying the history of ideas using topic models
