INTRODUCTION

Database marketing continues to grow in importance as technology developments make it possible for even the smallest businesses to track customers and prospects efficiently.1, 2, 3, 4 In 2006, database marketing services grew by more than 14 per cent.5 Since 2001, digital marketing and database marketing have continued to grow in importance as advertising strategies. Between 2001 and 2009, digital marketing (of which database marketing is a major component) represented 33 per cent of all advertising spending; in 2008, digital marketing represented 88 per cent of all advertising expenditures.6

An important task in database marketing is to create effective market segments or customer segments for the purpose of identifying appropriate targets for communication and advertising campaigns. Statistical models are created to determine lifetime value of customers, assess customer loyalty and value, and identify customers and prospects for acquisition and retention marketing programs. The business objective for creating database models is to increase advertising efficiency by targeting households or individuals that are the most likely to respond to specific advertising offers, thus reducing communication expenses while maintaining or increasing customer response or sales.

In the database world of model building, two popular analytic activities are model comparison and model validation. More specifically, database marketing analysts frequently construct multiple response models and need a measurement tool to compare and assess model performance. Once a candidate model is selected, it is prudent to validate the model with an independent data file and determine whether the model's performance is reliable (consistent) and sample independent. Database marketing acquisition models typically generate low response rates. Small incremental improvements in response rates can have a very significant impact on campaign profitability. Thus, it is important to create response models that maximize performance, assess performance with meaningful metrics and evaluate the significance of performance differences between competing models.

From a business perspective, a common objective is to create a response model that generates the highest return. The performance of the selected model should not only be superior to that of other models, but should also be significantly superior. Currently, the methods employed by practitioners7, 8, 9, 10, 11 for assessing model performance or model reliability are not statistically rigorous: the database analyst cannot assess the significance of measured differences. Gains and lift charts are valuable for evaluating some aspects of marketing campaigns (for example, response/profit comparisons of different decile segments); however, they are cumbersome when comparing overall model performance or when validating a model with different data files. It is possible to add statistical rigor to individual decile computations with bootstrapping or jackknifing sampling methods, yet this requires a significant programming effort.

This article provides remedies for these two situations through an explication of the merits of the Gini statistic, often used in social sciences and more recently applied in direct marketing research.2, 4, 12 Our contribution is twofold. First, we support the utilization of the Gini statistic as a performance measure for assessing database marketing response models. We show how the statistic is related to many of the popular descriptive methods used by practitioners. Second, we present a formula for approximating the standard error of the Gini statistic. This provides analysts with the ability to compare statistical differences between competing models and to validate response models with different data files. Our methodology is derived from a regression analysis of over 1000 different data conditions created from a Monte Carlo simulation in which we varied the value of the Gini coefficient, sample file size and sample response rate. It is important to note that the simulation was created completely independently of any specific type of underlying response model (regression, CHAID, neural networks and so on). Our assessment of the standard error of Gini depends only on file size, response rate and the actual Gini obtained, regardless of the model that created the Gini. Essentially, we assume a very liberal variation in response (the simulation utilizes a uniform distribution of responses within deciles), which should serve as a conservative estimate.

The methodology that we introduce for computing the standard error of Gini is most appropriate when response models are applied to relatively large data files (n>15 000). Files of this size or larger are commonly used in database marketing applications. In order to employ a response model assessment in a database marketing context, an analyst might follow a process such as: scoring all data records with the model; sorting all records on the file by model score (predicted values); classifying each record into the appropriate n-tile (typically deciles – 10 equal-size segments – are used); aggregating or averaging the measure of interest (for example, responses, inquiries, sales) for each decile; and displaying the results in tabular or graphical form. The Gini statistic is a single number that represents the area under the cumulative lift chart (the cumulative percentage of responses captured) relative to the area under a uniform distribution.
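The scoring-and-deciling process described above can be sketched in a few lines of Python. This is a minimal illustration of the mechanics, not code from the article, and the generated data below are hypothetical.

```python
import random

def decile_summary(scores, responses, n_tiles=10):
    """Sort records by model score (best first), split the file into
    n equal-size segments, and report count, responses and response
    rate per segment."""
    ranked = sorted(zip(scores, responses), key=lambda rec: -rec[0])
    n = len(ranked)
    bounds = [round(n * (i + 1) / n_tiles) for i in range(n_tiles)]
    rows, start = [], 0
    for tile, end in enumerate(bounds, start=1):
        seg = [resp for _, resp in ranked[start:end]]
        rows.append((tile, len(seg), sum(seg), sum(seg) / len(seg)))
        start = end
    return rows

# Hypothetical file: 1000 records whose response odds rise with score
rng = random.Random(0)
scores = [rng.random() for _ in range(1000)]
responses = [1 if rng.random() < s * 0.1 else 0 for s in scores]
for tile, n, resp, rate in decile_summary(scores, responses):
    print(tile, n, resp, round(rate, 4))
```

With a well-separating model, the printed response rates decline from the top decile to the bottom one, which is exactly the pattern the tabular and graphical summaries display.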

It is not uncommon for database marketers to utilize two related statistics, the Area Under the Curve (AUC) and the receiver operating characteristic (ROC) curve. AUC is equivalent to Gini up to a linear rescaling (Gini=2 × AUC−1). A benefit of Gini is that it is scaled from 0 to 1, where 0 indicates no difference between segments and 1 indicates a maximum difference. ROC curves are designed to assess two competing conditions simultaneously rather than one. Situations such as signal versus noise when interpreting radar, or illness versus extraneous symptoms when diagnosing patients, benefit from ROC analysis. One can use ROC to balance two types of potential errors (false positives and false negatives). However, when there is one dominant objective (maximizing response rate), a measure like Gini is simpler and more useful. The Gini statistic has been recognized in the literature by different marketers2, 3, 9 and is used by practitioners.

Our approach to deriving the standard error of Gini differs from other methods discussed in the literature. Giles13 describes a regression approach for the case in which the data represent individual points sorted according to their unique values, although there is still some controversy about that approach.14, 15, 16 Moreover, database marketing summaries (decile aggregates) do not possess the properties required for computing the standard error of Gini using Giles's method.13

The remaining portion of this article is organized as follows. We discuss response model evaluation, summarize the descriptive methods used by practitioners, describe how descriptive methods are related to the Gini statistic, describe a method for approximating Gini, present a procedure for estimating the standard error of Gini and, lastly, illustrate our suggested procedure using three data sets, two available from the Direct Marketing Association (DMA) and the third a subset of a proprietary data file from a large, national insurance company.

BACKGROUND

Database model evaluation

Traditionally, statistical response models used in database marketing have been evaluated based on some form of goodness of fit. Assumptions are made regarding underlying data distributions and models are evaluated based on how well predicted data values from the response model actually fit the observed data values from a sample data set. Various statistical measures (R2, the F statistic, the Chi Square statistic, classification indices and so on) are used to evaluate the goodness of fit. In database marketing, the goals are more focused on developing response models to meet business objectives.17, 18 Although fit is often useful, the goal is differentiating consumer units (for example people, households) based on their likelihood of responding to a specific offer. As such, marketers create metrics that measure the magnitude of separation between productive market segments and non-productive market segments.

Practitioners engaged in direct marketing efforts have been using alternative metrics for evaluating model performance for at least 25 years.7, 9 The metrics that are most commonly used are the decile chart,7, 8, 11 the gains and cumulative gains table,8 the lift chart and the cumulative lift chart.1, 6, 8, 11 Decile analysis is used in database marketing in order to more easily visualize data files consisting of thousands to hundreds of thousands of data records. For each decile segment, the number and percentage of responses are recorded. Gains indices are commonly created. A common form of the index is created by forming the ratio of the segment response rate to the average response rate of the total sample, multiplied by 100. This provides the analyst with a method for determining which segments perform significantly better than average. A gains index of 120 indicates a segment that performs 20 per cent better than average. The segments, and the corresponding households that fall in high-performing segments, are the best candidates for targeting. The graph shown in Figure 1, Panel A illustrates both a gains chart and a cumulative gains chart. The index (y axis value) represents the gains index, the ratio of the response rate of a segment to the overall response rate, multiplied by 100.

Figure 1: Gains chart, cumulative gains chart and comparison. (a) Gains chart and cumulative gains chart. (b) Comparison of two gains charts.

Although the gains index is quite useful for identifying better-performing segments, a difficulty arises when one tries to compare the performance of two models or to validate the performance of a model with two different sample files (see Figure 1, Panel B). In order to determine whether two models perform similarly, practitioners rely on ‘eyeball’ judgments or prior experience. There is no formal or rigorous procedure to claim that one of the models is superior to the other in terms of performance. In addition, if the two graphs represent one model applied to two different sample files, there is no statistical test to conclude that the model is reliable, consistent or sample independent. Another issue that frequently occurs when response models are built with small samples or low response rates is over-fitting. Providing a Gini statistic alone does not warn the analyst of potential model failure owing to an over-fit (sample-specific) condition. By supplying the standard error for Gini, the analyst is made aware of this potential risk. This is a very important point: research papers that describe model performance with Gini should always include the standard error.13

Descriptive metrics used by practitioners

A brief survey among practitioners (13 members of the DMA Research Council) was conducted in 2004 to determine their preferred method of choice for evaluating response models. The gains and cumulative gains chart were the most popular metrics for assessing model performance, followed by measures of lift and cumulative lift. The applicability of these metrics is also supported in the literature.1, 8, 19, 20 The descriptive metrics used by practitioners are illustrated in Table 1.

Table 1 Descriptive metrics used by industry practitioners

Column 1 shows 10 segments used in decile analysis. Each decile has an equal number of customers, 10 000 in this case (column 3). The responses (column 2) divided by the customers (column 3) determine the response rate (column 4). The gains index (column 5) is derived by dividing the response rate (column 4) by the average response rate (0.01307 in this example) and multiplying by 100. The cumulative gains index is computed by first determining the cumulative response rate (column 6), which for decile i is the sum of all responses from decile 1 to decile i, divided by all customers from decile 1 to decile i. Once the cumulative response rate is computed, the cumulative gains index (column 7) is the cumulative response rate (column 6) divided by the overall response rate (0.01307), multiplied by 100. Lift (column 8) is the number of responses in a decile (column 2) divided by total responses. Cumulative lift (column 10) is computed by accumulating responses from all prior deciles (column 9) inclusively and dividing the accumulated responses by the total number of responses (for example, 1307).
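As a sketch, the column definitions above translate directly into code. The per-decile response counts below are hypothetical, chosen only so that the totals match the 10 000 customers per decile, 1307 total responses and 0.01307 overall response rate used in the text; they are not the actual Table 1 values.

```python
def gains_lift_table(decile_responses, decile_customers):
    """Compute response rate, gains index, cumulative gains index,
    lift and cumulative lift for each decile, per the Table 1 column
    definitions."""
    total_resp = sum(decile_responses)
    overall_rate = total_resp / sum(decile_customers)
    rows, cum_resp, cum_cust = [], 0, 0
    for resp, cust in zip(decile_responses, decile_customers):
        cum_resp += resp
        cum_cust += cust
        rows.append({
            "response_rate": resp / cust,
            "gains_index": 100 * (resp / cust) / overall_rate,
            "cum_gains_index": 100 * (cum_resp / cum_cust) / overall_rate,
            "lift": resp / total_resp,
            "cum_lift": cum_resp / total_resp,
        })
    return rows

# Hypothetical decile response counts totalling 1307 responses
responses = [400, 250, 180, 130, 100, 80, 60, 45, 35, 27]
table = gains_lift_table(responses, [10_000] * 10)
print(round(table[0]["gains_index"]), round(table[-1]["cum_lift"], 2))  # → 306 1.0
```

A top decile with a 0.04 response rate against a 0.01307 average yields a gains index of roughly 306, that is, a segment performing about three times better than average.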

THE RELATIONSHIP AMONG THE GAIN, LIFT AND THE MODIFIED GINI

The original Gini coefficient

Although there are many derivations of the Gini coefficient, we start with the original coefficient as articulated by Corrado Gini in 1921, which is the form utilized in the direct marketing community. The original Gini coefficient19 is computed by sorting scores from a distribution from low to high, determining the corresponding cumulative lift (Lorenz curve), computing the area between the cumulative uniform distribution (AU) and the Lorenz curve (AL) on the interval [0, 1], and dividing the result by the area under the cumulative uniform distribution on the same interval.

The modified Gini coefficient

The modified Gini coefficient21 (Γ) measures the relative difference between two areas, that is, the area under a Lorenz curve (cumulative percentage of responses) or cumulative lift curve (AL) and the area under a cumulative uniform distribution (AU), relative to the area under the cumulative uniform distribution. The Gini we discuss is very similar to the original Gini used by economists, except that in the original Gini scores are ranked from low to high, whereas in our modified form scores are ranked from high to low. In either case, Gini represents the area between the same two curves, computed in such a way that it is always positive. Otherwise, the calculation for our modified Gini is the same as the original Gini: the relative area between the two curves, divided by 0.5. This allows Gini to range from 0 to 1:

Γ = (AL − AU)/AU

and, because AU=0.5,

Γ = (AL − 0.5)/0.5 = 2AL − 1.

The Gini coefficient is minimized when the responses are spread equally across all deciles (Gini=0) and maximized when all of the responses are in the top decile (Gini=1). Theoretically, Gini could be negative, but no practitioner or researcher would either use or continue to investigate the results of a model that performed worse than random chance. The faster the cumulative lift reaches 1 (see Figure 2), the greater the AUC will be and the greater the Gini coefficient will be.

Figure 2: Gini graph.

COMPUTING THE GINI STATISTIC

In practice, population response distributions are unknown and are estimated from sampled data. The Gini statistic is a function of the area under the Lorenz curve (area under the cumulative lift).

Trapezoids are frequently used to approximate the area between the empirical response curve and the cumulative uniform distribution. Using trapezoids derived from deciles (n=10), letting Fi denote the cumulative lift through decile i, and noting that F0=0 and F10=1,

AL ≈ Σi=1..10 0.1 × (Fi−1 + Fi)/2 = 0.1 × (Σi=1..10 Fi − 0.5)

so that

ΓT = 2AL − 1 = 0.2 × (Σi=1..10 Fi − 0.5) − 1.

The Gini coefficient is computed directly from the cumulative lift (AL), which is a linear transformation of the cumulative gains index (AG), one of the descriptive techniques frequently used in database model building. Every cumulative gains index Gi corresponds to a unique cumulative lift value Fi,

Fi = (i × Gi)/1000,

where i represents the decile segment of interest.

Thus, with substitution, the Gini coefficient can also be expressed in terms of cumulative gains indices.

Whereas the cumulative gains graph and cumulative lift graph provide a visual display of response model performance, the Gini statistic provides a single number, which explains the degree of separation between segments in terms of response rate.
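For equal-size deciles, the linear transformation between the cumulative gains index and the cumulative lift can be sketched as follows; the function name is ours, and the relation follows directly from the definitions of the two indices.

```python
def cum_gains_to_cum_lift(cum_gains_index, i, n_tiles=10):
    """Convert the cumulative gains index G_i for segment i (percentage
    scale, 100 = average) into the cumulative lift F_i (fraction of all
    responses captured through segment i), assuming equal-size segments."""
    return cum_gains_index * i / (100 * n_tiles)

# A cumulative gains index of 300 at decile 1 means the top decile
# captures 30 per cent of all responses; at decile 10 the cumulative
# gains index is always 100, so the cumulative lift is always 1.
print(cum_gains_to_cum_lift(300, 1))   # → 0.3
print(cum_gains_to_cum_lift(100, 10))  # → 1.0
```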

Quick calculation for Gini

The Gini coefficient can be calculated using the trapezoid approximation for computing the area under a continuous function: sum the cumulative lift values, subtract 0.5, multiply by 0.2 (assuming deciles) and subtract 1. The more segments used in the summary chart, the more accurate the approximation to the true area. The trapezoid Gini for the data in Table 1 is 0.281.

In order to better approximate the area under the curve, we offer the following correction, based on the simulated data in Table 2. To recalibrate Gini, first calculate the trapezoid Gini and then apply the formula shown in Table 2.

Table 2 Relationship between Gini coefficient and trapezoid estimation of Gini coefficient

where ΓT is the Gini statistic calculated with trapezoids and Γ is the Gini calculated with a continuous function. Thus, based on a trapezoid Gini of 0.281, Gini is recalibrated to 0.288.

COMPUTING THE STANDARD ERROR FOR GINI

Knowing the standard error of Gini is useful for assessing the reliability of a model and determining the superiority of one model over another. First, we describe how we determined the standard error of Gini.

Monte Carlo simulation

In order to estimate the standard error of Gini, we ran a Monte Carlo simulation. We created data files of 100 000 records, and each record was assigned a 1 to reflect response or a 0 to reflect non-response. For a given Gini and response rate, we selected the appropriate proportion of 0s and 1s to match that Gini and response rate. We then created 200 sample files, ranging between 5000 and 50 000 records, to compute the standard error associated with a variety of Gini values, response rates and sample file sizes. We tested a total of 1620 conditions: the response rate was varied from 0.01 to 0.1 in increments of 0.01 and then from 0.1 to 0.9 in increments of 0.1 (18 distinct rates); nine levels of the Gini coefficient were used; and file sizes were varied from 5000 to 50 000 records in increments of 5000 (10 sizes), for a total of 18 × 9 × 10=1620 test cases. For each test case, we extracted 200 random samples. We selected a variety of conditions that we felt would simulate actual database marketing situations, as well as some extreme cases. We recognize that database analysts frequently have hundreds of thousands and even millions of records available for analysis. However, as they fine-tune their campaigns, they may test tens or hundreds of sub-file conditions, reducing their analysis to files ranging from thousands to tens of thousands of records. Details on the simulation implementation are shown in the Appendix.
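One cell of such a simulation can be sketched as follows. This is our own minimal reconstruction, not the authors' code: it draws repeated samples from a scored population and measures the spread of the sample Ginis, but it does not reproduce the authors' method of constructing a population to match a target Gini and response rate (the synthetic population below is hypothetical).

```python
import random
import statistics

def trapezoid_gini_from_file(sorted_responses, n_tiles=10):
    """Decile-based trapezoid Gini for a 0/1 response file already
    sorted by model score, best scores first."""
    n, total = len(sorted_responses), sum(sorted_responses)
    bounds = [round(n * (i + 1) / n_tiles) for i in range(n_tiles)]
    cum_lift, cum, start = [], 0, 0
    for end in bounds:
        cum += sum(sorted_responses[start:end])
        cum_lift.append(cum / total)
        start = end
    return (2.0 / n_tiles) * (sum(cum_lift) - 0.5) - 1.0

def gini_standard_error(population, sample_size, n_samples=200, seed=0):
    """Draw repeated random samples from a population of (score,
    response) pairs and return the standard deviation of the sample
    Ginis, an empirical estimate of the standard error."""
    rng = random.Random(seed)
    ginis = []
    for _ in range(n_samples):
        sample = rng.sample(population, sample_size)
        sample.sort(key=lambda rec: -rec[0])
        ginis.append(trapezoid_gini_from_file([resp for _, resp in sample]))
    return statistics.stdev(ginis)

# Hypothetical population: 100 000 records, response odds rising with score
rng = random.Random(1)
population = []
for _ in range(100_000):
    score = rng.random()
    population.append((score, 1 if rng.random() < score * 0.04 else 0))
print(round(gini_standard_error(population, 10_000, n_samples=50), 4))
```

Re-running the outer loop over a grid of sample sizes, response rates and population Ginis yields a table of empirical standard errors of the kind the regression formula is fitted to.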

For each of the 1620 test conditions, we computed the standard error of the Gini statistic. Standard errors increase when the file structures are more unstable. As shown in Figure 3, standard errors increase as file sizes, response rates and Gini coefficients decrease. For example, if the response rate is low, there are few 1s in the file, and small changes in the number of 1s in any segment cause a high degree of variability in the sample calculation of Gini. The figure shows six panels representing various conditions. The three panels on the left reflect response rates from 0.01 to 0.09 and the three panels on the right reflect response rates from 0.1 to 0.9. The first panel row represents an n of 10 000, the second row an n of 25 000 and the third row an n of 40 000. Cells with darker coloring reflect larger standard errors. The figure shows that higher standard errors are found when (1) Gini is small, (2) the response rate is low (<0.02) or (3) the file size is small. When models are constructed with file sizes (n) of less than 30 000, the standard errors of Gini should be included in any report. One should also consider including the standard error when response rates are lower than 0.04 and file sizes are less than 50 000.

Figure 3: Boundary conditions for Gini coefficient standard error.

The normality of the distribution of sample Ginis was checked in 95 of the test cases, representing the range of combinations of file size (n), response rate (r) and Gini. In 78 of the 95 cases tested, normality could not be rejected (P=0.05) based on the Anderson–Darling and Kolmogorov–Smirnov tests. The cases in which normality could not be assumed fell on the boundaries of the test conditions: Gini was extremely small (<0.2), the file size was small (<10 000), the response rate was small (<=0.01) or a combination of all three factors applied. We observed that if Gini is computed with a file size of at least 20 000 records, a response rate of at least 1 per cent and a Gini >0.2, it is reasonable to assume that the sample Gini is normally distributed. This provides an opportunity to make statistical inferences about Gini.

Estimating the standard error of Gini with Ordinary Least Squares

A regression equation was created to predict standard errors for the Gini statistic as a function of sample size, response rate and the Gini coefficient. The data set for building the regression equation was based on 1136 different parametric conditions. We eliminated the boundary conditions that are unrealistic for customer acquisition.

The regression formula that was used to estimate the standard error of the Gini coefficient is shown in Table 3. To validate the formula, we compared the predicted values to actual sample values. In absolute terms, 98.6 per cent of the standard error estimates are within 0.01 of the sample-based values. Confidence intervals for the Gini coefficient can be calculated as Γ ± z(α/2) × σΓ, where σΓ is the standard error of the Gini coefficient, approximated with the regression equation shown in Table 3.
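Given a Gini and its estimated standard error, a confidence interval and a simple two-model comparison follow from the normal approximation. The sketch below is ours (it does not reproduce the Table 3 regression itself), and the numerical values are hypothetical.

```python
import math

def gini_confidence_interval(gini, se, z=1.96):
    """95 per cent confidence interval for Gini, assuming the sample
    Gini is approximately normally distributed."""
    return (gini - z * se, gini + z * se)

def gini_difference_z(g1, se1, g2, se2):
    """z statistic for the difference between two Ginis measured on
    independent samples; |z| > 1.96 indicates a significant difference
    at the 0.05 level."""
    return (g1 - g2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical: Gini of 0.288 with an estimated standard error of 0.012
low, high = gini_confidence_interval(0.288, 0.012)
print(round(low, 3), round(high, 3))  # → 0.264 0.312

# Hypothetical comparison of two competing models
z = gini_difference_z(0.320, 0.010, 0.288, 0.012)
print(round(z, 2))
```

The same z statistic can be used for split-half validation: if the two halves are scored with one model and the resulting Ginis do not differ significantly, the model's performance can be considered sample independent.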

Table 3 Gini standard error formula estimate

APPLICATION OF GINI AND ITS STANDARD ERROR

In this section we demonstrate the accuracy of the Gini standard error using a model calculated on an entire data set. First, we calculate Gini for the entire data set. We then take 30 random samples from the file and compute two estimates of the standard error of Gini: one from our formula and the other from the sample of 30 Ginis. We compare the differences between the two standard errors. We performed both of these calculations on two different data sets provided by the DMA (a catalog customer file and a non-profit contributor file). We then show how including a standard error can assist analysts with model assessment, reliability assessment and model selection. These data files were selected because they are similar to the types of files that database marketers routinely use, and they have also been used in various academic research studies.

Validation

We test our methodology with three different data files, two from the DMA and one representing a national insurance company. Regression models are built to predict response, and we then compute the Gini statistic as outlined in this article and the standard error of the Gini statistic with the regression equation that we described. Finally, we compare our computed value with the sample standard error generated from 30 random samples (each sample was randomly selected from the original data file). The results are presented in Table 4.

Table 4 Standard error validation results

Using the DMA data sets, Table 4 shows the validation results that compare standard errors generated from the sample Gini statistics to the standard error estimated by the regression formula. Comparisons are made for the catalog and non-profit data sets. The results show that the estimated standard error is very close to the standard error computed from sampling 30 Ginis. These sample files were relatively small. For larger files, the standard error becomes very small, that is, Gini becomes quite stable. Table 5 demonstrates confidence intervals for the two DMA data files and the insurance data file. The size of the data files varies from 8600 to 39 000. As the sample sizes increase the confidence intervals shrink dramatically.

Table 5 Gini, standard errors and confidence intervals

CONCLUSIONS

In this article we discussed why the Gini index is a useful measure for assessing model performance. Gini can also be used to measure the consistency (reliability) of a response model. In order to add statistical rigor to model assessment, we included a method for approximating the standard error of the Gini statistic. Our methodology is most relevant when data files range in size from 20 000 to 60 000 records. When files are small (n<20 000), Gini becomes unreliable; when files are larger than 60 000 records, Gini becomes very stable and the standard errors will be very small.

We recommend that the Gini statistic be used as the performance measure of choice when a split-half validation is performed to assess model reliability, or when two different models must be compared on the same data file. It is easy to compute and easy to apply, its scale is easy to understand, and one can construct confidence intervals and hypothesis tests. This gives the Gini statistic a decided advantage over the current descriptive techniques employed by both researchers and practitioners. Inclusion of the Gini standard error gives analysts the opportunity to determine whether differences in model performance are statistically significant.

Limitations

Applying the Gini statistic and the appropriate standard error is limited to data files that have reasonable response rates, file sizes and Gini values. When Gini is less than 0.2, one must question whether utilizing a model will provide any significant benefit. When file sizes are small (<20 000), model results will be unstable if the response rate is also small. Marketing efforts designed to acquire new customers frequently experience very low response rates. Under these circumstances, one would need to greatly increase the training file size to gain any confidence in the accuracy of the response model. The results of this article are based on a large simulation. For future research it would be interesting to develop a closed-form solution for the standard error of Gini.