Experimental comparison of first and second-order similarities in a scientometric context

Scientometrics

Abstract

The measurement of similarity between objects plays a role in several scientific areas. In this article, we deal with document–document similarity in a scientometric context. We compare experimentally, using a large dataset, first-order with second-order similarities with respect to the overall quality of partitions of the dataset, where the partitions are obtained on the basis of optimizing weighted modularity. The quality of a partition is defined in terms of textual coherence. The results show that the second-order approach consistently outperforms the first-order approach. Each difference between the two approaches in overall partition quality values is significant at the 0.01 level.
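The distinction between the two similarity orders, and the weighted modularity used to score a partition, can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: first-order similarity is taken to be the cosine between document term vectors, second-order similarity is taken to be the cosine between rows of the first-order similarity matrix (diagonal included), and modularity follows the standard weighted form Q = (1/2m) Σ_ij [w_ij − k_i k_j / 2m] δ(c_i, c_j). The function names and the toy data are hypothetical and are not taken from the article.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def first_order(doc_vectors):
    """Assumed first-order similarity: cosine between document term vectors."""
    n = len(doc_vectors)
    return [[cosine(doc_vectors[i], doc_vectors[j]) for j in range(n)]
            for i in range(n)]


def second_order(sim):
    """Assumed second-order similarity: cosine between the similarity
    profiles (rows of the first-order matrix) of two documents."""
    n = len(sim)
    return [[cosine(sim[i], sim[j]) for j in range(n)] for i in range(n)]


def weighted_modularity(weights, labels):
    """Standard weighted modularity of a partition of a symmetric weight matrix."""
    n = len(weights)
    strength = [sum(row) for row in weights]  # k_i: total edge weight at node i
    two_m = sum(strength)                     # 2m: twice the total edge weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += weights[i][j] - strength[i] * strength[j] / two_m
    return q / two_m


# Hypothetical toy term-frequency vectors over four terms t1..t4.
docs = [
    [1, 1, 0, 0],  # document A uses t1, t2
    [0, 1, 1, 0],  # document B uses t2, t3
    [0, 0, 1, 1],  # document C uses t3, t4
]
s1 = first_order(docs)
s2 = second_order(s1)

# Two disconnected pairs of nodes, each pair placed in its own cluster.
pairs = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
q_pairs = weighted_modularity(pairs, [0, 0, 1, 1])
```

In this toy case documents A and C share no terms, so their first-order similarity `s1[0][2]` is zero, while their second-order similarity `s2[0][2]` is positive because both are similar to B. This only illustrates the mechanism. Which approach gives better partitions is an empirical question, and on the article's dataset the second-order approach performed better.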

Author information

Corresponding author

Correspondence to Cristian Colliander.

About this article

Cite this article

Colliander, C., Ahlgren, P. Experimental comparison of first and second-order similarities in a scientometric context. Scientometrics 90, 675–685 (2012). https://doi.org/10.1007/s11192-011-0491-x

  • DOI: https://doi.org/10.1007/s11192-011-0491-x
