Abstract
Fat-tailed distributions occur in many large-scale physical and social complex systems, and several generating mechanisms have been proposed to explain them. In this paper, we study models that generate power-law distributions in the evolution of large-scale taxonomies such as the Open Directory Project, in which each website is assigned to one of tens of thousands of categories. The categories in such taxonomies are arranged in a tree or DAG structure, with parent-child relations among them. We first quantitatively analyse the formation process of such taxonomies, which leads to power-law distributions as the stationary distributions. In the context of designing classifiers for large-scale taxonomies, which automatically assign unseen documents to leaf-level categories, we show how the fat-tailed nature of these distributions can be leveraged to study the space complexity of such classifiers analytically. Empirical evaluation of the space complexity on publicly available datasets demonstrates the applicability of our approach.
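As an illustration of how a growth process can produce fat-tailed category sizes, the following is a minimal sketch of a Yule/Simon-style preferential attachment simulation. It is not the model analysed in the paper: the function name `simulate_category_sizes`, the parameter `p_new` (probability that a new document opens a new category), and all numeric values are assumptions chosen for illustration only.

```python
import random
from collections import Counter


def simulate_category_sizes(n_docs, p_new=0.05, seed=0):
    """Assign n_docs documents to categories by preferential attachment.

    Hypothetical illustration: with probability p_new a document opens a
    new category; otherwise it joins an existing category chosen with
    probability proportional to that category's current size.
    """
    rng = random.Random(seed)
    # owners[i] is the category of the i-th document. Picking a uniformly
    # random earlier document and copying its category selects a category
    # with probability proportional to its size.
    owners = [0]
    n_categories = 1
    for _ in range(n_docs - 1):
        if rng.random() < p_new:
            owners.append(n_categories)
            n_categories += 1
        else:
            owners.append(rng.choice(owners))
    return Counter(owners)


sizes = simulate_category_sizes(20000)
# Category sizes in decreasing order (rank-size view of the distribution).
ranked = sorted(sizes.values(), reverse=True)
print("categories:", len(ranked))
print("largest:", ranked[0], "median:", ranked[len(ranked) // 2])
```

Under these assumptions a few categories collect most documents while the typical category stays very small, which is the qualitative signature of a fat-tailed size distribution.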
Index Terms
- On power law distributions in large-scale taxonomies