Robust methodologies for modeling web click distributions

Authors: Kamal Ali (Yahoo!), Mark Scarr (Yahoo!)

WWW '07: Proceedings of the 16th international conference on World Wide Web, May 2007, Pages 511–520, https://doi.org/10.1145/1242572.1242642

Published: 08 May 2007

ABSTRACT

Metrics such as click counts are vital to online businesses, but their measurement has been problematic due to the inclusion of high-variance robot traffic. We posit that by applying statistical methods more rigorous than those employed to date, we can build a robust model of the distribution of clicks, following which we can set probabilistically sound thresholds to address outliers and robots. Prior research in this domain has used inappropriate statistical methodology to model distributions, and current industrial practice eschews this research in favor of conservative ad hoc click-level thresholds. The prevailing belief is that such distributions are scale-free power-law distributions, but using more rigorous statistical methods we find that the best description of the data is instead provided by a scale-sensitive Zipf-Mandelbrot mixture distribution. Our results are based on ten data sets from various verticals in the Yahoo domain. Since mixture models can overfit the data, we take care to use the BIC log-likelihood method, which penalizes overly complex models. Using a mixture model in the web activity domain makes sense because there are likely multiple classes of users. In particular, we have noticed that there is a significantly large set of "users" that visit the Yahoo portal exactly once a day. We surmise these may be robots testing internet connectivity by pinging the Yahoo main website.
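To make the model-selection idea concrete, the following is a minimal sketch, not the authors' fitting procedure. It assumes a single Zipf-Mandelbrot component with probability proportional to (k + q)^(-s) on a finite support, a coarse grid search in place of a proper optimizer, and invented toy click counts. It compares a pure power law (q fixed at 0, one free parameter) against Zipf-Mandelbrot (two free parameters) using Schwarz's BIC, which adds a penalty of ln(n) per parameter. All function names are hypothetical.

```python
import math
from collections import Counter


def zipf_mandelbrot_pmf(s, q, n_max):
    """Return [p(1), ..., p(n_max)] with p(k) proportional to (k + q) ** -s."""
    weights = [(k + q) ** -s for k in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(counts, pmf):
    """Log-likelihood of observed values; counts maps value -> frequency."""
    return sum(freq * math.log(pmf[k - 1]) for k, freq in counts.items())


def bic(log_lik, n_params, n_obs):
    """Schwarz's criterion: n_params * ln(n_obs) - 2 * log_lik (lower is better)."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


def fit_by_grid(counts, n_max, s_grid, q_grid):
    """Return (best log-likelihood, s, q) over a coarse parameter grid."""
    best = None
    for s in s_grid:
        for q in q_grid:
            ll = log_likelihood(counts, zipf_mandelbrot_pmf(s, q, n_max))
            if best is None or ll > best[0]:
                best = (ll, s, q)
    return best


# Invented toy data: clicks per user (illustration only, not from the paper).
clicks = [1] * 50 + [2] * 30 + [3] * 20 + [4] * 14 + [5] * 10 + [6] * 7 + [8] * 4 + [12] * 2 + [20]
counts = Counter(clicks)
n_obs, n_max = len(clicks), max(clicks)

s_grid = [0.5 + 0.1 * i for i in range(46)]   # exponents 0.5 .. 5.0
q_grid = [0.5 * i for i in range(21)]         # shifts 0.0 .. 10.0

ll_zipf, s_zipf, _ = fit_by_grid(counts, n_max, s_grid, [0.0])   # pure power law
ll_zm, s_zm, q_zm = fit_by_grid(counts, n_max, s_grid, q_grid)   # Zipf-Mandelbrot

bic_zipf = bic(ll_zipf, 1, n_obs)
bic_zm = bic(ll_zm, 2, n_obs)
print(f"power law:       s={s_zipf:.1f}         BIC={bic_zipf:.1f}")
print(f"Zipf-Mandelbrot: s={s_zm:.1f}, q={q_zm:.1f}  BIC={bic_zm:.1f}")
```

Because the power law is the q = 0 special case, the two-parameter fit can never have a lower likelihood; the BIC penalty is what decides whether the extra parameter is justified. A mixture of several such components would add further parameters, each penalized in the same way.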

Backing up our quantitative analysis is a graphical analysis in which empirical distributions are plotted against theoretical distributions in log-log space using robust cumulative distribution plots. This methodology has two advantages: first, plotting in log-log space allows one to visually differentiate the various exponential distributions; second, cumulative plots are much more robust to outliers. We plan to use the results of this work in applications for robot removal from web metrics business intelligence systems.
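The graphical comparison described above can be sketched as follows. This is an illustrative assumption rather than the paper's plotting code: it uses the complementary cumulative distribution P(X >= x), one common choice of cumulative plot, and returns log10 coordinates that any plotting library could draw. The function names and sample values are hypothetical.

```python
import math
from collections import Counter


def empirical_loglog_ccdf(samples):
    """Return [(log10 x, log10 P(X >= x))] for each distinct positive sample x."""
    counts = Counter(samples)
    n = len(samples)
    remaining = n                      # number of samples with value >= current x
    points = []
    for x in sorted(counts):
        points.append((math.log10(x), math.log10(remaining / n)))
        remaining -= counts[x]
    return points


def pmf_loglog_ccdf(pmf):
    """Same coordinates for a theoretical pmf, where pmf[i] is P(X = i + 1)."""
    tails = []
    running = 0.0
    for p in reversed(pmf):            # suffix sums avoid cancellation in the tail
        running += p
        tails.append(running)
    tails.reverse()
    return [(math.log10(k), math.log10(t))
            for k, t in enumerate(tails, start=1) if t > 0]


# Invented sample of clicks per user, for illustration only.
sample = [1, 1, 2, 10]
for log_x, log_p in empirical_loglog_ccdf(sample):
    print(f"log10(x)={log_x:.3f}  log10(P(X>=x))={log_p:.3f}")
```

Each cumulative point aggregates every observation at or beyond x, so a single extreme count shifts the tail far less than it would distort a sparsely populated histogram bin. Overlaying the two point sets for a fitted model and the data gives the kind of visual check the abstract describes.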

Index Terms

Mathematics of computing > Probability and statistics > Statistical paradigms > Regression analysis > Robust regression

Published in
WWW '07: Proceedings of the 16th international conference on World Wide Web
May 2007
1382 pages
ISBN: 9781595936547
DOI: 10.1145/1242572
General Chairs: Carey Williamson (University of Calgary, Canada), Mary Ellen Zurko (IBM, USA)
Program Chairs: Peter Patel-Schneider (Bell Labs Research, USA), Prashant Shenoy (University of Massachusetts at Amherst, USA)
Copyright © 2007 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
distribution fitting
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%