
Fast algorithms for the calculation of Kendall’s τ


Summary

Traditional algorithms for the calculation of Kendall’s τ between two datasets of n samples have a calculation time of O(n²). This paper presents a suite of algorithms with expected calculation time of O(n log n) or better, using a combination of sorting and balanced tree data structures. The literature, e.g. Dwork et al. (2001), has alluded to the existence of O(n log n) algorithms without giving any analysis: this paper gives explicit descriptions of such algorithms for general use, both with and without duplicate values in the data. Execution times for sample data are reduced from 3.8 hours to around 1–2 seconds for one million data pairs.
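For orientation, the quantity being accelerated is the usual Kendall correlation, τ = (C − D) / (n(n−1)/2), where C and D are the numbers of concordant and discordant pairs. The sketch below (in Python, not taken from the paper, and assuming no tied values, i.e. the τ-a form) shows the traditional O(n²) calculation that the paper’s algorithms replace.

    # Minimal reference sketch (not from the paper): the traditional O(n^2)
    # computation of Kendall's tau for tie-free data (the tau-a form).
    def kendall_tau_naive(x, y):
        n = len(x)
        assert len(y) == n and n > 1
        concordant = discordant = 0
        for i in range(n):
            for j in range(i + 1, n):
                s = (x[i] - x[j]) * (y[i] - y[j])
                if s > 0:
                    concordant += 1
                elif s < 0:
                    discordant += 1
        return (concordant - discordant) / (n * (n - 1) / 2)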


Notes

  1. Whilst samples from continuous distributions may contain duplicates at the accuracy with which they are represented numerically, the algorithm presented will treat them as logically unequal, with (effectively) a random ordering being selected for them. The chances of this occurring in practice at any sensible machine-representation accuracy are sufficiently slight that further analysis of this case has not been performed.

  2. The descriptions throughout this document assume ascending sorts are used. Obviously, if descending-sorted data are available, the algorithms can be modified accordingly rather than re-sorting the data.

References

  • Adel’son-Vel’skii, G. M. and Landis, E. M. (1962), An Algorithm for the Organization of Information. Soviet Mathematics Doklady, 3, 1259–1262.


  • Dwork, C., Kumar, R., Naor, M. and Sivakumar, D. (2001), Rank Aggregation Revisited. Proc. 10th International World Wide Web Conference, 613–622.

  • Knuth, D.E. (1998), The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley, 2nd edition.

  • Lindskog, F., McNeil, A. and Schmock, U. (2001), Kendall’s τ for Elliptical Distributions. Working paper from http://www.math.ethz.ch/~mcneil/pub_list.html

  • Press, W.H., Flannery, B.P., Teukolsky, S.A. and Vetterling, W.T. (1993), Numerical Recipes, Cambridge University Press.



Appendices

Appendix 1: Use of binary trees

This appendix is provided for those unfamiliar with the use of binary trees to provide O(log n) lookup of data. It presents no new results and can be safely skipped by those familiar with such data structures.

At several locations within the algorithms in the main paper, we wish to find a value Yi among those that have already been encountered and, if it is not there, add it. Furthermore, we also wish to maintain a count of the number of values encountered so far which have Y < Yi.

A simple approach to this would maintain a list or array of values found so far. In the list case the search is O(n) and the insertion is in constant time. In the array case the search can be O(log n) but the insertion is O(n). Either way, the overall search and insert process is O(n). Given that we are performing this n times in the original algorithm, this produces an overall execution time of O(n²), precisely what we are trying to avoid.

The standard computer science solution to this problem is to use a binary tree. In this data structure, each item is stored in a “node” which, as well as storing the item’s value, also refers to (up to) two other nodes, a “left node” and a “right node”. These nodes themselves refer to further nodes, so that each node has two “subtrees” under it. The rule for organising such a tree is that the value at a node is greater than the values at all nodes in its left subtree, and less than the values at all nodes in its right subtree. For example:

[Figure 1: an example binary tree]

The efficiency of the data structure comes from the fact that the depth of the tree is at most 1 + log₂(n) provided that the tree is balanced, i.e. that the routes from root to leaf nodes are all approximately the same length. Ensuring that the tree remains perfectly balanced is not a trivial process, and the normal solution is to use the AVL algorithm described in Adel’son-Vel’skii and Landis (1962) which keeps the tree “near enough balanced” and maintains O(log n) performance for both searching and insertion.
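As a concrete illustration of the structure just described, the following sketch (in Python, purely illustrative and not taken from the paper’s listings) stores each value in a node together with a subtree-size count, so that inserting Yi and counting how many previously seen values are less than Yi can both be done in a single descent of the tree. For brevity it assumes distinct values and omits the AVL rebalancing rotations referred to above, so the O(log n) bound only holds for reasonably balanced input; a production version would rebalance as in Adel’son-Vel’skii and Landis (1962).

    class Node:
        # One tree node: the stored value, left/right children, and the size
        # of the subtree rooted here (used to answer "how many values < y?").
        __slots__ = ("value", "left", "right", "size")
        def __init__(self, value):
            self.value = value
            self.left = None
            self.right = None
            self.size = 1

    def insert_and_count_less(root, y):
        # Insert y (assumed not already present) and return
        # (new_root, number of previously inserted values strictly less than y).
        if root is None:
            return Node(y), 0
        root.size += 1
        if y < root.value:
            root.left, less = insert_and_count_less(root.left, y)
            return root, less
        else:
            left_size = root.left.size if root.left else 0
            root.right, less = insert_and_count_less(root.right, y)
            # the root and its whole left subtree are all smaller than y
            return root, less + left_size + 1

For example, feeding the sequence 5, 2, 8, 6, 1 into repeated calls of insert_and_count_less yields the running counts 0, 0, 2, 2, 0.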

Appendix 2: Code listings

Since SDTau seems to be the best all-round performer, only that algorithm is included here. It also assumes the availability of Quicksort and AVLTree implementations. However, all of the algorithms described and an implementation of AVLTree are available on request from the author at d.christensen@emb.co.uk.

[Figure 2: SDTau code listing]
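As a self-contained, purely illustrative sketch of how an O(n log n) running time can be achieved for tie-free data (this is not the author’s SDTau routine, which relies on Quicksort and AVLTree implementations), one can sort the data pairs by X and then count the discordant pairs as the number of inversions in the resulting Y sequence, which merge sort counts in O(n log n):

    def count_inversions(a):
        # Merge sort that also counts pairs (i, j) with i < j and a[i] > a[j].
        if len(a) <= 1:
            return list(a), 0
        mid = len(a) // 2
        left, inv_left = count_inversions(a[:mid])
        right, inv_right = count_inversions(a[mid:])
        merged, inversions = [], inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # every remaining element of `left` exceeds right[j]
                inversions += len(left) - i
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inversions

    def kendall_tau_fast(x, y):
        # O(n log n) Kendall's tau for tie-free data: sort by x, then the
        # discordant pairs are exactly the inversions left in the y sequence.
        n = len(x)
        y_by_x = [yy for _, yy in sorted(zip(x, y))]
        _, discordant = count_inversions(y_by_x)
        total = n * (n - 1) // 2
        return (total - 2 * discordant) / total

On small tie-free samples this agrees with the O(n²) reference sketch given after the Summary.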


About this article

Cite this article

Christensen, D. Fast algorithms for the calculation of Kendall’s τ. Computational Statistics 20, 51–62 (2005). https://doi.org/10.1007/BF02736122

