Consumer credit-risk models via machine-learning algorithms☆
Introduction
One of the most important drivers of macroeconomic conditions and systemic risk is consumer spending, which accounted for over two-thirds of US gross domestic product as of October 2008. With $13.63 trillion of consumer credit outstanding as of the fourth quarter of 2008 ($10.47 trillion in mortgages, $2.59 trillion in other consumer debt), the opportunities and risk exposures in consumer lending are equally outsized. For example, as a result of the recent financial crisis, the overall charge-off rate in all revolving consumer credit across all US lending institutions reached 10.1% in the third quarter of 2009, far exceeding the average charge-off rate of 4.72% during 2003–2007. With a total of $874 billion of revolving consumer credit outstanding in the US economy as of November 2009, and with 46.1% of all families carrying a positive credit-card balance in 2007, the potential for further systemic dislocation in this sector has made the economic behavior of consumers a topic of vital national interest.
The large number of decisions involved in the consumer lending business makes it necessary to rely on models and algorithms rather than human discretion, and to base such algorithmic decisions on “hard” information, e.g., characteristics contained in consumer credit files collected by credit bureau agencies. Models are typically used to generate numerical “scores” that summarize the creditworthiness of consumers. In addition, it is common for lending institutions and credit bureaus to create their own customized risk models based on private information about borrowers. The type of private information usually consists of both “within-account” as well as “across-account” data regarding customers’ past behavior. However, while such models are generally able to produce reasonably accurate ordinal measures, i.e., rankings, of consumer creditworthiness, these measures adjust only slowly over time and are relatively insensitive to changes in market conditions. Given the apparent speed with which consumer credit can deteriorate, there is a clear need for more timely cardinal measures of credit risk by banks and regulators.
In this paper, we propose a cardinal measure of consumer credit risk that combines traditional credit factors such as debt-to-income ratios with consumer banking transactions, which greatly enhances the predictive power of our model. Using a proprietary dataset from a major commercial bank (which we shall refer to as the “Bank” throughout this paper to preserve confidentiality) from January 2005 to April 2009, we show that conditioning on certain changes in a consumer’s bank-account activity can lead to considerably more accurate forecasts of credit-card delinquencies in the future. For example, in our sample, the unconditional probability of customers falling 90-days-or-more delinquent on their payments over any given 6-month period is 5.3%, but customers experiencing a recent decline in income (as measured by sharp drops in direct deposits) have a 10.8% probability of 90-days-or-more delinquency over the subsequent 6 months. Such conditioning variables are statistically reliable throughout the entire sample period, and our approach is able to generate many variants of these transactions-based predictors and combine them in nonlinear ways with credit bureau data to yield even more powerful forecasts. By analyzing patterns in consumer expenditures, savings, and debt payments, we are able to identify subtle nonlinear relationships that are difficult to detect in these massive datasets using standard consumer credit-default models such as logit, discriminant analysis, or credit scores.
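The comparison between unconditional and conditional delinquency rates described above can be illustrated with a minimal sketch. The records are toy values, not the Bank's data; the field names and the 50% cutoff used to define a "sharp drop" in direct deposits are assumptions made purely for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # All field names are hypothetical; the paper does not specify a schema.
    deposits_prev: float     # direct deposits in the earlier window
    deposits_recent: float   # direct deposits in the most recent window
    delinquent_90d: bool     # 90+ days delinquent within the following 6 months


def income_dropped(c, threshold=0.5):
    """Flag a sharp drop: recent deposits below `threshold` times the earlier level.

    The 0.5 threshold is an illustrative assumption.
    """
    return c.deposits_prev > 0 and c.deposits_recent < threshold * c.deposits_prev


def delinquency_rate(customers):
    """Share of customers who became 90+ days delinquent."""
    customers = list(customers)
    if not customers:
        return float("nan")
    return sum(c.delinquent_90d for c in customers) / len(customers)


# Toy sample (invented values).
sample = [
    Customer(3000, 3100, False),
    Customer(2500, 1000, True),
    Customer(4000, 3900, False),
    Customer(2000, 800, False),
    Customer(3500, 3400, False),
    Customer(2800, 2900, True),
    Customer(3200, 1200, True),
    Customer(2600, 2700, False),
]

unconditional = delinquency_rate(sample)
conditional = delinquency_rate(c for c in sample if income_dropped(c))
print(f"unconditional: {unconditional:.3f}, given income drop: {conditional:.3f}")
```

In this toy sample the conditional rate exceeds the unconditional one, which is the qualitative pattern reported in the text (10.8% versus 5.3%); the magnitudes here carry no empirical meaning.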
We use an approach known as “machine learning” in the computer science literature, which refers to a set of algorithms specifically designed to tackle computationally intensive pattern-recognition problems in extremely large datasets. These techniques include radial basis functions, tree-based classifiers, and support-vector machines, and are ideally suited for consumer credit-risk analytics because of the large sample sizes and the complexity of the possible relationships among consumer transactions and characteristics. The extraordinary speed-up in computing in recent years, coupled with significant theoretical advances in machine-learning algorithms, has created a renaissance in computational modeling, of which our consumer credit-risk model is just one of many recent examples.
One measure of the forecast power of our approach is to compare the machine-learning model’s forecasted scores of those customers who eventually default during the forecast period with the forecasted scores of those who do not. Significant differences between the forecasts of the two populations are an indication that the forecasts have genuine discriminating power. Over the sample period from May 2008 to April 2009, the average forecasted score among individuals who do become 90-days-or-more delinquent during the 6-month forecast period is 61.9, while the average score across all customers is 2.1. The practical value added of such forecasts can be estimated by netting the cost savings from credit reductions to high-risk borrowers against the lost revenues from “false positives”; under a conservative set of assumptions, we estimate the potential net benefits of these forecasts to be 6–25% of total losses.
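A minimal sketch of the two calculations in the preceding paragraph follows: comparing average forecasted scores across populations, and a stylized net-benefit estimate. The scores, the cutoff, and the per-account dollar amounts are hypothetical, and the net-benefit function is a deliberately simplified stand-in rather than the value-added calculation developed later in the paper.

```python
from statistics import mean

# (forecasted score, became 90+ days delinquent?) -- toy values only.
forecasts = [
    (2.0, False), (5.0, False), (70.0, True), (1.0, False),
    (55.0, True), (3.0, False), (40.0, False), (0.5, False),
]


def mean_score(rows, delinquent=None):
    """Average score over all rows, or over one outcome group if `delinquent` is set."""
    return mean(s for s, d in rows if delinquent is None or d == delinquent)


def net_benefit(rows, cutoff, saving_per_hit, cost_per_false_alarm):
    """Stylized net benefit of cutting credit for accounts scoring at or above `cutoff`.

    Correctly flagged delinquent accounts yield a saving; flagged accounts
    that stay current ("false positives") cost lost revenue. All three
    parameters are illustrative assumptions.
    """
    hits = sum(1 for s, d in rows if s >= cutoff and d)
    false_alarms = sum(1 for s, d in rows if s >= cutoff and not d)
    return hits * saving_per_hit - false_alarms * cost_per_false_alarm


avg_delinquent = mean_score(forecasts, delinquent=True)
avg_all = mean_score(forecasts)
benefit = net_benefit(forecasts, cutoff=30.0, saving_per_hit=1000.0,
                      cost_per_false_alarm=200.0)
print(avg_delinquent, avg_all, benefit)
```

A wide gap between `avg_delinquent` and `avg_all` corresponds to the discriminating power discussed above; the net benefit depends heavily on the assumed cutoff and dollar amounts.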
More importantly, by aggregating individual forecasts, it is possible to construct a measure of systemic risk in the consumer-lending sector, which accounts for one of the largest components of US economic activity. As Büyükkarabacak and Valev (2010) observe, private credit expansions are an early indicator of potential banking crises. By decomposing private credit into household and enterprise credit, they argue that household-credit growth increases debt without much effect on future income, while enterprise-credit expansion typically results in higher future income. Accordingly, they argue that rapid household-credit expansions are more likely to generate vulnerabilities that can precipitate a banking crisis than enterprise-credit expansion. Therefore, a good understanding of consumer choice and early warning signs of overheating in consumer finance are essential to effective macroprudential risk management policies. We show that the time-series properties of our machine-learning forecasts are highly correlated with realized credit-card delinquency rates (linear-regression R² values of 85%), implying that a considerable portion of the consumer credit cycle can be forecasted 6–12 months in advance.
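The aggregation step described above can be sketched as follows: individual delinquency probabilities are averaged within each period to obtain a forecasted aggregate rate, which is then compared with the realized rate through the R² of a simple linear regression. The monthly figures are invented for illustration, and averaging is only one plausible aggregation rule, assumed here for simplicity.

```python
def r_squared(x, y):
    """R-squared of an ordinary least-squares fit of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1.0 - ss_res / syy


# Hypothetical individual forecasts (probability of 90+ day delinquency), by month.
individual_forecasts = {
    "2008-05": [0.02, 0.10, 0.03],
    "2008-06": [0.04, 0.12, 0.05],
    "2008-07": [0.05, 0.20, 0.08],
}
# Hypothetical realized aggregate delinquency rates for the same months.
realized_rates = [0.045, 0.075, 0.105]

# Aggregate by averaging the individual forecasts within each month.
forecast_rates = [sum(p) / len(p) for p in individual_forecasts.values()]
r2 = r_squared(forecast_rates, realized_rates)
print(f"R^2 of realized on forecasted rates: {r2:.3f}")
```

With only three toy observations the resulting R² is not informative; the sketch is meant only to show the mechanics of comparing the forecasted and realized series.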
In Section 2, we describe our dataset, discuss the security issues surrounding it, and document some simple but profound empirical trends. Section 3 outlines our approach to constructing useful variables or feature vectors that will serve as inputs to the machine-learning algorithms we employ. In Section 4, we describe the machine-learning framework for combining multiple predictors to create more powerful forecast models, and present our empirical results. Using these results, we provide two applications in Section 5, one involving model-based credit-line reductions and the other focusing on systemic risk measures. We conclude in Section 6.
The data
In this study, we use a unique dataset consisting of transaction-level, credit bureau, and account-balance data for individual consumers. This data is obtained for a subset of the Bank’s customer base for the period from January 2005 to April 2009. Integrating transaction, credit bureau, and account-balance data allows us to compute and update measures of consumer credit risk much more frequently than the slower-moving credit-scoring models currently being employed in the industry and by
Constructing feature vectors
The objective of any machine-learning model is the identification of statistically reliable relationships between certain features of the input data and the target variable or outcome. In the models that we construct in later sections, the features we use include data items such as total inflow, total income, credit-card balance, etc., and the target variable is a binary outcome that indicates whether an account is delinquent by 90 days or more within the subsequent 3-, 6-, or 12-month window.
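To make the target variable concrete, the sketch below builds the binary 90-days-or-more delinquency label for 3-, 6-, and 12-month windows from a monthly days-past-due history, and pairs it with a small feature vector. The data layout, function names, and toy values are assumptions for illustration; the paper's actual feature set is considerably richer.

```python
def delinquency_label(days_past_due, as_of, horizon_months):
    """Return 1 if the account is 90+ days past due in any of the
    `horizon_months` months following month index `as_of`, else 0.

    `days_past_due` is a hypothetical list with one entry per month.
    """
    window = days_past_due[as_of + 1 : as_of + 1 + horizon_months]
    return int(any(d >= 90 for d in window))


def feature_vector(account):
    """Order a few of the features named in the text into a fixed-length tuple."""
    return (
        account["total_inflow"],
        account["total_income"],
        account["credit_card_balance"],
    )


# Toy account observed at month index 2 (all values invented).
history = [0, 0, 30, 60, 0, 0, 0, 90, 120, 0, 0, 0, 0]
account = {"total_inflow": 4200.0, "total_income": 3800.0,
           "credit_card_balance": 1500.0}

x = feature_vector(account)
labels = {h: delinquency_label(history, as_of=2, horizon_months=h)
          for h in (3, 6, 12)}
print(x, labels)
```

Only months strictly after the observation date enter the label, so the features and the target do not overlap in time.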
Modeling methodology
In this section, we describe the machine-learning algorithms we use to construct delinquency forecast models for the Bank’s consumer credit and transactions data from January 2005 to April 2009. This challenge is well suited to be formulated as a supervised learning problem, which is one of the most
Applications
In this section, we apply the models and methods of Sections 3 and 4 to two specific challenges in consumer credit-risk management: deciding when and how much to cut individual-account credit lines, and forecasting aggregate consumer credit delinquencies for the purpose of enterprise-wide and macroprudential risk management.
The former application is of particular interest from the start of 2008 through early 2009 as banks and other financial
Conclusion
In the aftermath of one of the worst financial crises in modern history, it has become clear that consumer behavior has played a central role at every stage: in sowing the seeds of crisis, causing cascades of firesale liquidations, and bearing the brunt of the economic consequences. Therefore, any prospective insights regarding consumer credit that can be gleaned from historical data have become a national priority.
In this study, we develop a machine-learning model for consumer credit default and
☆ The views and opinions expressed in this article are those of the authors only, and do not necessarily represent the views and opinions of AlphaSimplex Group, MIT, any of their affiliates and employees, or any of the individuals acknowledged below. We thank Jayna Cummings, Tanya Giovacchini, Alan Kaplan, Paul Kupiec, Frank Moss, Deb Roy, Ahktarur Siddique, Roger Stein, two referees, the editor Ike Mathur, and seminar participants at Citigroup, the FDIC, the MIT Center for Future Banking, the MIT Media Lab, and Moody’s Academic Advisory and Research Committee for many helpful comments and discussion. Research support from the MIT Laboratory for Financial Engineering and the Media Lab’s Center for Future Banking is gratefully acknowledged.