research-article

On power law distributions in large-scale taxonomies

Published: 25 September 2014

Abstract

Fat-tailed distributions occur in many large-scale physical and social complex systems, and different generating mechanisms have been proposed for them. In this paper, we study models that generate power law distributions in the evolution of large-scale taxonomies such as the Open Directory Project, which consists of websites assigned to one of tens of thousands of categories. The categories in such taxonomies are arranged in tree- or DAG-structured configurations with parent-child relations among them. We first quantitatively analyse the formation process of such taxonomies, which leads to power law distributions as the stationary distributions. In the context of designing classifiers for large-scale taxonomies, which automatically assign unseen documents to leaf-level categories, we highlight how the fat-tailed nature of these distributions can be leveraged to analytically study the space complexity of such classifiers. Empirical evaluation of the space complexity on publicly available datasets demonstrates the applicability of our approach.
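To make the idea of a growth process that yields fat-tailed category sizes concrete, the following is a minimal sketch of a generic Simon-style preferential attachment process. It is an illustration under stated assumptions and not necessarily the model analysed in the paper: the function name `simulate_category_sizes`, the parameter `p_new` (probability that an arriving document opens a new category), and the chosen values are all hypothetical.

```python
import random
from collections import Counter


def simulate_category_sizes(n_docs, p_new=0.05, seed=0):
    """Grow category sizes with a Simon-style preferential attachment process.

    Assumption (illustrative only): each arriving document opens a new
    category with probability p_new; otherwise it joins an existing
    category chosen with probability proportional to that category's size.
    """
    rng = random.Random(seed)
    sizes = [1]   # sizes[k] = number of documents in category k
    owners = [0]  # owners[d] = category index of document d
    for _ in range(n_docs - 1):
        if rng.random() < p_new:
            # Open a new category holding this single document.
            sizes.append(1)
            owners.append(len(sizes) - 1)
        else:
            # Copying the category of a uniformly chosen earlier document
            # selects a category with probability proportional to its size.
            k = owners[rng.randrange(len(owners))]
            sizes[k] += 1
            owners.append(k)
    return sizes


sizes = simulate_category_sizes(50_000)
histogram = Counter(sizes)  # histogram[s] = number of categories of size s
```

Plotting `histogram` on log-log axes would be one way to inspect the tail: under these assumptions most categories stay very small while a few grow very large, which is the qualitative signature of a fat-tailed distribution.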


Published in

ACM SIGKDD Explorations Newsletter, Volume 16, Issue 1
Special issue on big data
June 2014
63 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/2674026

Copyright © 2014 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
