Published by De Gruyter, March 30, 2013

Estimating player contribution in hockey with regularized logistic regression

  • Robert B. Gramacy, Shane T. Jensen and Matt Taddy

Abstract

We present a regularized logistic regression model for evaluating player contributions in hockey. The traditional metric for this purpose is the plus-minus statistic, which allocates a single unit of credit (for or against) to each player on the ice for a goal. However, plus-minus scores measure only the marginal effect of players, do not account for sample size, and provide a very noisy estimate of performance. We investigate a related regression problem: what does each player on the ice contribute, beyond aggregate team performance and other factors, to the odds that a given goal was scored by their team? Due to the large-p (number of players) and imbalanced design setting of hockey analysis, a major part of our contribution is a careful treatment of prior shrinkage in model estimation. We showcase two recently developed techniques – for posterior maximization or simulation – that make such analysis feasible. Each approach is accompanied with publicly available software and we include the simple commands used in our analysis. Our results show that most players do not stand out as measurably strong (positive or negative) contributors. This allows the stars to really shine, reveals diamonds in the rough overlooked by earlier analyses, and argues that some of the highest paid players in the league are not making contributions worth their expense.


Corresponding author: Robert B. Gramacy, Booth School of Business, The University of Chicago, 5807 S Woodlawn Ave, Chicago, IL 60637, USA, Tel.: +773-702-0739

Appendix

A Estimation details and software

Although L1 penalty methods and their Bayesian interpretation have been known for some time, it is only recently that joint penalty–coefficient posterior computation and simulation methods have become available for datasets of the size encountered here.

Probably the most well-known publicly available library for L1 penalty inference in logistic regression is the glmnet package (Friedman, Hastie, and Tibshirani, 2010) for R. Conditional on a single shared value of λ, this implementation estimates a sparse set of coefficients. A convenient wrapper routine called cv.glmnet allows one to choose λ by cross-validation (CV). Unfortunately, CV works poorly in our setting of large, sparse, and imbalanced XP, where each model fit is relatively expensive and there is often little overlap between nonzero covariates in the training and validation sets. Moreover, when maximizing (rather than sampling from) the posterior, a single shared λ penalty leads to over-shrinkage of significant βj as penalty choice is dominated by a large number of spurious predictors. But use of CV to choose unique λj for each covariate would imply an impossibly large search.
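As a concrete illustration of the shared-penalty strategy, the following Python sketch (an analogue of cv.glmnet, not the R code used in the paper) chooses a single L1 penalty by cross-validation; the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Synthetic data: a few strong signals among many spurious predictors.
rng = np.random.default_rng(0)
n, p = 500, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -2.0, 1.5]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

# One shared penalty (C = 1/lambda) chosen on a grid by 5-fold CV,
# analogous to cv.glmnet's treatment of a single shared lambda.
fit = LogisticRegressionCV(Cs=10, cv=5, penalty="l1",
                           solver="liblinear", max_iter=1000).fit(X, y)
n_nonzero = int(np.sum(fit.coef_ != 0))
```

With a single shared penalty the strong signals are recovered, but the chosen λ must balance shrinkage of those coefficients against exclusion of the many spurious ones, which is precisely the over-shrinkage issue described above.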

Instead, we propose two approaches, both accompanied by publicly available software in packages for R: joint MAP inference with textir, and posterior simulation with reglogit.

A.1 Fast variable selection and MAP inference with textir

Taddy (2012a) proposes a gamma-lasso framework for MAP estimation in logistic regression, wherein coefficients and their independent L1 penalties are inferred under a conjugate gamma hyperprior. An efficient coordinate descent algorithm is derived, including conditions for global convergence, and the resulting estimation is shown in Taddy (2012a) as superior, in both predictive performance and computation time, to the more common strategy of CV lasso estimation under a single shared λ (as in glmnet). Results in this paper were all obtained using the publicly available textir package for R (Taddy 2012b), which uses the slam (Hornik, Meyer, and Buchta, 2011) package’s simple-triplet matrices to take advantage of design sparsity.
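To fix ideas, here is a minimal Python sketch of coordinate descent for L1-penalized logistic regression with coordinate-specific penalties λj. It is not the gamma-lasso of Taddy (2012a), which additionally updates each λj under its gamma hyperprior, and it ignores the sparse-matrix machinery of textir; here the λj are simply held fixed.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator: the proximal map of t * |.|"""
    return np.sign(z) * max(abs(z) - t, 0.0)

def cd_logistic_l1(X, y, lam, sweeps=100):
    """Coordinate descent for L1-penalized logistic regression (no intercept).

    Minimizes  -loglik(beta) + sum_j lam[j] * |beta[j]|  using, for each
    coordinate, a quadratic bound with curvature h_j = 0.25 * sum_i x_ij^2
    (valid because the logistic variance p(1-p) never exceeds 1/4).
    """
    n, p = X.shape
    beta = np.zeros(p)
    h = 0.25 * (X ** 2).sum(axis=0)
    for _ in range(sweeps):
        for j in range(p):
            prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
            g = X[:, j] @ (prob - y)              # gradient in coordinate j
            beta[j] = soft(h[j] * beta[j] - g, lam[j]) / h[j]
    return beta

# Small demonstration: two real effects, the rest noise.
rng = np.random.default_rng(1)
n, p = 400, 20
X = rng.normal(size=(n, p))
true = np.zeros(p)
true[:2] = [1.5, -1.5]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true)))
bhat = cd_logistic_l1(X, y, lam=np.full(p, 20.0))
```

Each coordinate update is a guaranteed-descent majorize-minimize step, so the sweeps converge monotonically; the true signals survive the threshold while the noise coordinates are set to zero or heavily shrunk.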

Prior specification in the gamma-lasso attaches independent gamma Ga(λj; s, r) hyperpriors on each L1 penalty, with E[λj] = s/r and var[λj] = s/r², such that, for j = 1, …, p,

π(βj, λj) = π(βj | λj) Ga(λj; s, r),  where  π(βj | λj) = (λj/2σ) exp(−λj|βj|/σ).

Laplace priors are often motivated through estimation utility—the prior spike at zero corresponds to a preference for eliminating regressors from the model in absence of significant evidence. Our hyperprior is motivated by complementary considerations: for strong signals and large ∣βj∣, expected λj shrinks in the joint distribution to reduce estimation bias.

This leads to the joint negative log posterior minimization objective

l(β) + Σj=1…p [ λj|βj|/σ + r λj − s log λj ],

where l(β) is the negative log likelihood and s, r > 0. We have set σ = 1 and r = 1/2 throughout Section 3. In choosing s = E[λj]/2, we focus on the conditional prior standard deviation, SD(βj | λj) = √2 σ/λj, for the coefficients. Hence our value of s = 7.5, for E[λj] = 15, implies expected SD(βj) ≈ 0.095. To put this in context, exp[3 × 0.095] ≈ 1.33, implying that a single player increasing his team’s for-vs-against odds by 1/3 is three deviations away from the prior mean.
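The arithmetic behind these prior choices is easy to verify; assuming the conditional prior on βj is Laplace with scale σ/λj (so that SD(βj | λj) = √2 σ/λj), the implied quantities are:

```python
import math

sigma, r = 1.0, 0.5
E_lam = 15.0
s = E_lam * r                                # s = E[lambda_j] * r = 7.5
sd_beta = math.sqrt(2.0) * sigma / E_lam     # ~0.094: expected prior SD of beta_j
odds_mult = math.exp(3.0 * sd_beta)          # ~1.33: odds multiplier three SDs out
```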

As an illustration of the implementation, the following snippets show commands to run our main team-player model (see ?mnlm for details). With X=cbind(XP,XT) as defined in Section 2.1, the list of 30 ridge and np gamma-lasso penalties is specified

and the model is then fit

where X is not normalized since this would up-weight players with little ice-time.

A.2 Full posterior inference via reglogit

Extending a well-known result by Holmes and Held (2006), Gramacy and Polson (2012) showed that three sets of latent variables could be employed to obtain samples from the full posterior distribution using a standard Gibbs strategy (Geman and Geman, 1984). The full conditionals required for the Gibbs sampler are given below for the L1 and λj = λ case (note that β includes α here for notational convenience).

Note that N+ indicates the normal distribution truncated to the positive real line, and Ω = diag(ω1, …, ωn). Holmes and Held (2006) give a rejection sampling algorithm for ωi | zi; however, Gramacy and Polson (2012) argue that it is actually more efficient to draw ωi | yi, λ (i.e., marginalizing over zi) as follows. A proposal built from K independent draws ɛk ∼ Exp(1) may be accepted with probability min{1, Ai}. Larger K improves the approximation, although K = 100 usually suffices. Everything extends to the L2 case upon fixing the appropriate latent variables. Extending to separate λj is future work.
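For intuition, drawing from the truncated distribution N+ is itself straightforward; here is a minimal sketch using the inverse-CDF method (an illustration only, not the implementation in reglogit):

```python
import numpy as np
from scipy.stats import norm

def draw_npos(mu, sd, rng):
    """Draw from N(mu, sd^2) truncated to the positive real line
    by inverting the CDF restricted to (0, infinity)."""
    lo = norm.cdf(0.0, loc=mu, scale=sd)   # probability mass below zero
    u = rng.uniform(lo, 1.0)               # uniform over the retained mass
    return norm.ppf(u, loc=mu, scale=sd)

rng = np.random.default_rng(2)
draws = np.array([draw_npos(-1.0, 1.0, rng) for _ in range(2000)])
```

Every draw is positive even though the untruncated mean is negative; for means far below zero a tail-specific rejection sampler is numerically preferable to CDF inversion.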

We use the implementation of this Gibbs sampler provided in the reglogit package (Gramacy, 2012a) for R. For the λ prior we use the package defaults of a = 2 and b = 0.1, which are shared amongst several other fully Bayesian samplers for the ordinary linear regression context [e.g., blasso in the monomvn package (Gramacy, 2012b)]. Usage is similar to that described for mnlm. To obtain T samples from the posterior, simply call:

Estimating the full posterior distribution for β allows posterior means to be calculated, as well as component-wise variances and correlations between βj’s. However, unlike MAP estimates, none of the samples or the overall mean is sparse. Gramacy and Polson (2012) show how their algorithm can be extended to calculate the MAP via simulation, but this only provides sparse estimators in the limit.
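Given a matrix of posterior draws, these summaries are one-liners; the array below is a hypothetical stand-in for sampler output, not output from reglogit:

```python
import numpy as np

# Hypothetical stand-in for T posterior samples of a p-vector beta.
rng = np.random.default_rng(3)
T, p = 1000, 4
samples = rng.standard_normal(size=(T, p))

post_mean = samples.mean(axis=0)                # componentwise posterior means
post_var = samples.var(axis=0, ddof=1)          # componentwise variances
post_corr = np.corrcoef(samples, rowvar=False)  # correlations between beta_j's
```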

Another option is to use the posterior mean for λ obtained by Gibbs sampling on the entire covariate set, and use it as a guide in MAP estimation. Indeed, this is what is done in Section 3, where the mean of our shared λ from Gibbs sampling was considered when setting priors for E[λj].

B Extension to player-player interactions

We explore the possibility of on-ice synergies or mismatches between players by adding interaction terms into our model. The extension is easy to write down: simply add columns to the design matrix that are row-wise products of unique pairs of columns in the original XP. However, this implies substantial computational cost: see Gramacy and Polson (2012) for examples of regularized logistic regression estimators that work well for estimating main effects but break down in the presence of interaction terms.
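The construction can be sketched with sparse matrices in Python (the paper itself relies on textir's simple-triplet matrices in R); the tiny design below is illustrative only:

```python
import numpy as np
from itertools import combinations
from scipy import sparse

# Toy stand-in for the player design XP: rows are goals, columns players.
XP = sparse.csc_matrix(np.array([[1, 0, 1, 0],
                                 [0, 1, 1, 0],
                                 [1, 1, 0, 1]], dtype=float))

# Row-wise products of unique column pairs give the interaction columns.
pairs = list(combinations(range(XP.shape[1]), 2))
inter = sparse.hstack([XP[:, i].multiply(XP[:, j]) for i, j in pairs],
                      format="csc")
X_expanded = sparse.hstack([XP, inter], format="csc")
```

Because a product column is nonzero only where both player columns are nonzero, the expanded design is even sparser than XP, which is what keeps the storage and computation manageable at full scale.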

There is a further representational problem in the context of our hockey application. There are about 27K unique player pairs observed in the data. Combining these interaction terms with the original np players, ng goals and thirty teams, we have a double-precision data storage requirement of nearly a gigabyte. While this is not a storage issue for modern desktops, it does lead to a computational bottleneck since temporary space required for the linear algebra routines cannot fit in fast memory. Fortunately, our original design matrix has a high degree of sparsity, which means that the interaction-expanded design matrix is even more sparse, and the sparse capabilities of textir make computation feasible. Inference on the fully interaction-expanded design takes about 20 s on an Apple Mac Desktop. All other methods we tried, including reglogit, failed in initialization stages.

While the computational feat is impressive, our results indicate little evidence of significant player interaction effects. Figure 9 shows results for estimation with E[λj] = 15: only four non-zero interactions are found when augmenting the team-player model to include player-player interaction terms.6 Importantly, there is a negligible effect of including these interactions on the individual player effect estimates (which are our primary interest). The only player with a visually detectable connecting line is Joe Thornton, and to see it you may need a magnifying glass. We guess from the neighboring interaction term that his ability is enhanced when Patrick Marleau is on the ice. The most interesting result from this analysis involves the pairing of David Krejci and Dennis Wideman, which has a large negative interaction. One could caution Bruins coach Claude Julien against having this pair of players on the ice at the same time.

Figure 9 Comparing (non-zero) main effects for the team-player model to their values in the interaction-expanded team-player model (dots). The lines point to the unexpanded estimates, and the x-axis is ordered by the dots.

Using reglogit to obtain samples from the full posterior distribution of the interaction-expanded model is not feasible, but we can still glean some insight by sampling from the posterior distribution for the simpler model with the original team-player design augmented to include the four significant interactions found from our initial MAP analysis. Figure 10 compares pairwise abilities for the same three players used as examples in Figure 6. We observe minimal changes to the estimates obtained under the original team-player design, echoing the results from the MAP analysis. In total, we regard the ability to entertain interactions as an attractive feature of our methodology, but it does not change how we view the relative abilities of players.

Figure 10 Comparing the ability of Datsyuk, Roloson, and Marchant to the 60-odd other players with non-zero coefficients in the team-player model, showing coefficients under the interaction-expanded model as well.

Many thanks to Samuel Ventura and Andrew C. Thomas for supplying a cleaned version of the nhl.com data. We are grateful to the editor, an associate editor, and one referee whose comments on an early version of the manuscript led to many improvements.

  1. Note that we include goalies in our analysis.

  2. Fitted in R using the command fit <- glm(goals ~ XP, family="binomial").

  3. We used forward step-wise regression with the Bayes information criterion (BIC).

  4. This is the lowest possible budget from which lines can be formed satisfying (4).

  5. Sweater sales is another matter.

  6. We omitted goalie-skater and goalie-goalie interaction terms.

References

Awad, T. 2009. “Numbers On Ice: Fixing Plus/Minus.” Hockey Prospectus. See www.puckprospectus.com.

Friedman, J. H., T. Hastie, and R. Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33(1): 1–22. doi:10.18637/jss.v033.i01.

Geman, S. and D. Geman. 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 721–741. doi:10.1109/TPAMI.1984.4767596.

Gramacy, R. 2012a. reglogit: Simulation-based Regularized Logistic Regression. R package version 1.1.

Gramacy, R. B. 2012b. monomvn: Estimation for Multivariate Normal and Student-t Data with Monotone Missingness. R package version 1.8-9.

Gramacy, R. and N. Polson. 2012. “Simulation-Based Regularized Logistic Regression.” Bayesian Analysis 7: 1–24. doi:10.1214/12-BA719.

Hoerl, A. E. and R. W. Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12: 55–67.

Holmes, C. and L. Held. 2006. “Bayesian Auxiliary Variable Models for Binary and Multinomial Regression.” Bayesian Analysis 1(1): 145–168.

Hornik, K., D. Meyer, and C. Buchta. 2011. slam: Sparse Lightweight Arrays and Matrices. R package version 0.1-23.

Ilardi, S. and A. Barzilai. 2004. “Adjusted Plus-Minus Ratings: New and Improved for 2007–2008.” 82games.com.

Macdonald, B. 2010. “A Regression-based Adjusted Plus-Minus Statistic for NHL Players.” Tech. rep., arXiv:1006.4310.

Rosenbaum, D. T. 2004. “Measuring How NBA Players Help Their Teams Win.” 82games.com.

Schuckers, M. E., D. F. Lock, C. Wells, C. J. Knickerbocker, and R. H. Lock. 2010. “National Hockey League Skater Ratings Based upon All On-Ice Events: An Adjusted Minus/Plus Probability (AMPP) Approach.” Tech. rep., St. Lawrence University.

R Development Core Team. 2010. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Taddy, M. 2012a. “Multinomial Inverse Regression for Text Analysis.” Journal of the American Statistical Association, accepted for publication.

Taddy, M. 2012b. textir: Inverse Regression for Text. R package version 1.8-6.

Thomas, A. C., S. L. Ventura, S. Jensen, and S. Ma. 2012. “Competing Process Hazard Function Models for Player Ratings in Ice Hockey.” Tech. rep., arXiv:1208.0799.

Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society, Series B 58: 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x.

Vollman, R. 2010. “Howe and Why: Ten Ways to Measure Defensive Contributions.” Hockey Prospectus.


©2013 by Walter de Gruyter Berlin Boston

https://www.degruyter.com/document/doi/10.1515/jqas-2012-0001/html