Using the Gini coefficient to characterize the shape of computational chemistry error distributions

Pernot, Pascal; Savin, Andreas

doi:10.1007/s00214-021-02725-0

Regular Article
Published: 15 February 2021

Theoretical Chemistry Accounts, Volume 140, article number 24 (2021)

Abstract

The distribution of errors is a central object in the assessment and benchmarking of computational chemistry methods. The popular and often blind use of the mean unsigned error as a benchmarking statistic leads one to ignore features of the distribution that affect the reliability of the tested methods. We explore how the Gini coefficient offers a global representation of the error distribution but, except for extreme values, does not enable an unambiguous diagnosis. We propose to resolve the ambiguity by applying the Gini coefficient to mode-centered error distributions. This version can usefully complement benchmarking statistics and flag error sets with potentially problematic shapes.
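
As a rough illustration of the quantities named in the abstract, the following Python sketch computes a Gini coefficient for a set of absolute errors and for the same errors after centering on an estimate of their mode. It is a minimal sketch under stated assumptions, not the authors' implementation: the Gini formula is the standard rank-based sample estimate without small-sample correction, the mode is estimated with a half-sample mode procedure chosen here for illustration, the Gini coefficient is applied to absolute values of the centered errors, and the function names `gini`, `half_sample_mode` and `gini_mode_centered` are hypothetical.

```python
import numpy as np


def gini(values):
    """Rank-based sample Gini coefficient of non-negative values.

    Uses G = 2 * sum(i * x_(i)) / (n * sum(x)) - (n + 1) / n,
    with x_(i) sorted in increasing order (no small-sample correction).
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total <= 0 or x[0] < 0:
        raise ValueError("need non-negative values with a positive sum")
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)


def half_sample_mode(values):
    """Half-sample mode estimate (assumed estimator, chosen for illustration).

    Repeatedly keeps the shortest window containing about half of the
    remaining points, then averages the last one or two points.
    """
    x = np.sort(np.asarray(values, dtype=float))
    if x.size == 0:
        raise ValueError("need at least one value")
    while x.size > 2:
        h = (x.size + 1) // 2                    # window of ceil(n/2) points
        widths = x[h - 1:] - x[: x.size - h + 1]  # width of each window
        start = int(np.argmin(widths))
        x = x[start:start + h]
    return float(x.mean())


def gini_mode_centered(errors):
    """Gini coefficient of absolute errors after subtracting the mode estimate."""
    e = np.asarray(errors, dtype=float)
    return gini(np.abs(e - half_sample_mode(e)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    errors = rng.normal(loc=1.0, scale=0.5, size=1000)  # synthetic, biased errors
    print("MUE:", np.mean(np.abs(errors)))
    print("Gini of |errors|:", gini(np.abs(errors)))
    print("Gini of mode-centered |errors|:", gini_mode_centered(errors))
```

Because a constant shift of the errors moves the mode estimate by the same amount, the mode-centered value in this sketch is unaffected by a systematic bias, whereas the Gini coefficient of the raw absolute errors is not.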

Data availability statement

The data and codes that support the findings of this study are openly available at the following URL: https://doi.org/10.5281/zenodo.4333217

Author information

Authors and Affiliations

Institut de Chimie Physique, UMR8000 CNRS, Université Paris-Saclay, 91405, Orsay, France
Pascal Pernot
Laboratoire de Chimie Théorique, CNRS and UPMC Université Paris 06, Sorbonne Universités, 75252, Paris, France
Andreas Savin

Corresponding author

Correspondence to Pascal Pernot.

Additional information

This work is dedicated to Ramon Carbó-Dorca on the occasion of his 80th birthday. It reflects his interest in cross-disciplinary aspects of science, especially the role of mathematics in chemistry.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Published as part of the special collection of articles “Festschrift in honour of Prof. Ramon Carbó-Dorca”.

Supplementary Information

Below is the link to the electronic supplementary material. Statistics, ECDFs and Lorenz curves for the literature datasets are also openly available at the following URL: https://doi.org/10.5281/zenodo.4333217

Supplementary material 1 (pdf 3173 KB)

Supplementary material 2 (csv 3173 KB)

About this article

Cite this article

Pernot, P., Savin, A. Using the Gini coefficient to characterize the shape of computational chemistry error distributions. Theor Chem Acc 140, 24 (2021). https://doi.org/10.1007/s00214-021-02725-0

Received: 17 December 2020
Accepted: 29 January 2021
Published: 15 February 2021
DOI: https://doi.org/10.1007/s00214-021-02725-0
