DOI: 10.1145/1242572.1242642
Article

Robust methodologies for modeling web click distributions

Published: 08 May 2007

ABSTRACT

Metrics such as click counts are vital to online businesses, but measuring them has been problematic because the data include high-variance robot traffic. We posit that by applying statistical methods more rigorous than those employed to date, we can build a robust model of the distribution of clicks, from which we can set probabilistically sound thresholds to address outliers and robots. Prior research in this domain has used inappropriate statistical methodology to model distributions, and current industrial practice eschews this research in favor of conservative ad hoc click-level thresholds. The prevailing belief is that such distributions are scale-free power-law distributions, but using more rigorous statistical methods we find that the best description of the data is instead provided by a scale-sensitive Zipf-Mandelbrot mixture distribution. Our results are based on ten data sets from various verticals in the Yahoo domain. Since mixture models can overfit the data, we take care to use the BIC log-likelihood method, which penalizes overly complex models. Using a mixture model in the web activity domain makes sense because there are likely multiple classes of users. In particular, we have noticed that there is a significantly large set of "users" that visit the Yahoo portal exactly once a day. We surmise these may be robots testing internet connectivity by pinging the Yahoo main website.
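
As an illustration of how a BIC comparison penalizes extra parameters, the following is a minimal Python sketch and not the authors' implementation. It assumes a single-component Zipf-Mandelbrot law with probability proportional to (k + q)^(-s) on a truncated support 1..N_MAX, fits it by a crude grid search, and compares it with a plain Zipf law (q = 0). The click counts, the grids, the truncation point, and all function names are hypothetical; the mixture model described in the abstract is not implemented here.

```python
import math


def zm_log_norm(s, q, n_max):
    """Log of the normalising constant sum_{j=1..n_max} (j + q)^(-s)."""
    return math.log(sum((j + q) ** (-s) for j in range(1, n_max + 1)))


def zm_pmf(k, s, q, n_max):
    """Truncated Zipf-Mandelbrot probability of observing exactly k clicks."""
    return math.exp(-s * math.log(k + q) - zm_log_norm(s, q, n_max))


def zm_log_likelihood(counts, s, q, n_max):
    """Log-likelihood of independent click counts under the truncated law."""
    log_norm = zm_log_norm(s, q, n_max)
    return sum(-s * math.log(k + q) for k in counts) - len(counts) * log_norm


def bic(log_lik, n_params, n_obs):
    """Schwarz criterion, k * ln(n) - 2 * ln(L); lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


def fit_by_grid(counts, s_grid, q_grid, n_max):
    """Crude maximum-likelihood fit by exhaustive grid search (illustration only)."""
    return max(
        (zm_log_likelihood(counts, s, q, n_max), s, q)
        for s in s_grid
        for q in q_grid
    )


# Made-up clicks per user; NOT data from the paper.
clicks = [1] * 40 + [2] * 25 + [3] * 15 + [5] * 10 + [8] * 6 + [20] * 3 + [150]
N_MAX = 1000                                  # assumed truncation of the support
S_GRID = [0.5 + 0.1 * i for i in range(30)]   # candidate exponents s
Q_GRID = [0.5 * i for i in range(11)]         # candidate shifts q (0.0 gives plain Zipf)

ll_zipf, s_zipf, _ = fit_by_grid(clicks, S_GRID, [0.0], N_MAX)
ll_zm, s_zm, q_zm = fit_by_grid(clicks, S_GRID, Q_GRID, N_MAX)

# One free parameter for Zipf (s), two for Zipf-Mandelbrot (s and q).
bic_zipf = bic(ll_zipf, 1, len(clicks))
bic_zm = bic(ll_zm, 2, len(clicks))
```

Because the Zipf-Mandelbrot grid contains q = 0, its best log-likelihood can never be lower than that of plain Zipf; the BIC decides whether the gain justifies the additional parameter. A mixture would add further parameters (component weights and per-component s and q), which the same criterion would penalize in the same way.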

Backing up our quantitative analysis is graphical analysis in which empirical distributions are plotted against theoretical distributions in log-log space using robust cumulative distribution plots. This methodology has two advantages: first, plotting in log-log space allows one to visually differentiate the various exponential distributions; second, cumulative plots are much more robust to outliers. We plan to use the results of this work in applications for robot removal from web metrics business intelligence systems.
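
The cumulative plotting idea can be sketched in Python as follows. This is an assumption-laden illustration rather than the paper's procedure: it uses the complementary cumulative distribution P(K >= k), a common choice for heavy-tailed count data, and the function names and sample values are hypothetical. It only computes the points; any plotting library could then draw them.

```python
import math
from collections import Counter


def empirical_ccdf(counts):
    """Return sorted (k, P(K >= k)) pairs for the distinct values in counts."""
    n = len(counts)
    freq = Counter(counts)
    points = []
    remaining = n  # number of observations with value >= current k
    for k in sorted(freq):
        points.append((k, remaining / n))
        remaining -= freq[k]
    return points


def to_log_log(points):
    """Map (k, p) pairs to (log10 k, log10 p) coordinates for a log-log plot."""
    return [(math.log10(k), math.log10(p)) for k, p in points]


# Hypothetical click counts; NOT data from the paper.
sample = [1, 1, 2, 5]
ccdf_points = empirical_ccdf(sample)
log_points = to_log_log(ccdf_points)
```

Each cumulative point aggregates every observation at or above k, so a single extreme count shifts the curve only slightly, whereas a raw histogram would show it as an isolated, noisy bin. A pure power law appears as a straight line in these coordinates, so curvature in the empirical points hints at a different family.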


Published in

WWW '07: Proceedings of the 16th International Conference on World Wide Web
May 2007
1382 pages
ISBN: 9781595936547
DOI: 10.1145/1242572

Copyright © 2007 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 1,899 of 8,196 submissions, 23%
