research-article

On power law distributions in large-scale taxonomies

Authors:
Rohit Babbar, Cornelia Metzig, Ioannis Partalas, Eric Gaussier, Massih-Reza Amini

Université Grenoble Alpes, CNRS, Grenoble, France

ACM SIGKDD Explorations Newsletter, Volume 16, Issue 1, June 2014, pp. 47–56. https://doi.org/10.1145/2674026.2674033

Published: 25 September 2014

Abstract

Fat-tailed distributions occur in many large-scale physical and social complex systems, and different generating mechanisms have been proposed for them. In this paper, we study models for generating power law distributions in the evolution of large-scale taxonomies such as the Open Directory Project, which consist of websites assigned to one of tens of thousands of categories. The categories in such taxonomies are arranged in tree- or DAG-structured configurations with parent-child relations among them. We first quantitatively analyse the formation process of such taxonomies, which leads to power law distributions as the stationary distributions. In the context of designing classifiers for large-scale taxonomies, which automatically assign unseen documents to leaf-level categories, we highlight how the fat-tailed nature of these distributions can be leveraged to analytically study the space complexity of such classifiers. Empirical evaluation of the space complexity on publicly available datasets demonstrates the applicability of our approach.
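As a rough illustration of the kind of generating mechanism the abstract refers to, the sketch below simulates a Simon-style preferential attachment process. Each new document either opens a new category or joins an existing one with probability proportional to that category's current size. This is a generic textbook mechanism, not necessarily the model analysed in the paper. The function name `simulate_category_sizes`, the parameter `p_new_category`, and its default value are assumptions made purely for illustration.

```python
import random
from collections import Counter


def simulate_category_sizes(n_documents, p_new_category=0.05, seed=0):
    """Simulate a Simon-style growth process (illustrative assumption,
    not the paper's model) and return category sizes, largest first."""
    rng = random.Random(seed)
    assignments = [0]  # the first document seeds category 0
    n_categories = 1
    for _ in range(n_documents - 1):
        if rng.random() < p_new_category:
            # The new document opens a brand-new category.
            assignments.append(n_categories)
            n_categories += 1
        else:
            # Copying the category of a uniformly chosen earlier document
            # selects a category with probability proportional to its size.
            assignments.append(rng.choice(assignments))
    return sorted(Counter(assignments).values(), reverse=True)


sizes = simulate_category_sizes(20000)
median_size = sizes[len(sizes) // 2]
print("categories:", len(sizes))
print("largest category size:", sizes[0])
print("median category size:", median_size)
```

Under these assumptions a few categories collect most of the documents while the majority stay small, which is the fat-tailed shape the abstract describes. Plotting `sizes` against rank on log-log axes is one simple way to inspect this.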

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Published in
ACM SIGKDD Explorations Newsletter Volume 16, Issue 1
Special issue on big data
June 2014
63 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/2674026
Editors:
Charu C. Aggarwal,
Haixun Wang,
Hanghang Tong,
Ankur Teredesai
University of Washington, Seattle, Washington
Copyright © 2014 Authors
Publisher
Association for Computing Machinery
New York, NY, United States