Kybernetika 57 no. 6, 879-907, 2021

On the Jensen-Shannon divergence and the variation distance for categorical probability distributions

Jukka Corander, Ulpu Remes and Timo Koski
DOI: 10.14736/kyb-2021-6-0879

Abstract:

We establish a decomposition of the Jensen-Shannon divergence into a linear combination of a scaled Jeffreys' divergence and a reversed Jensen-Shannon divergence. Upper and lower bounds for the Jensen-Shannon divergence are then found in terms of the squared (total) variation distance. The derivations rely upon the Pinsker inequality and the reverse Pinsker inequality. We use these bounds to prove the asymptotic equivalence of the maximum likelihood estimate and minimum Jensen-Shannon divergence estimate as well as the asymptotic consistency of the minimum Jensen-Shannon divergence estimate. These are key properties for likelihood-free simulator-based inference.
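
To make the relation between the two quantities concrete, the following Python sketch (purely illustrative and not taken from the paper) computes the Jensen-Shannon divergence in nats and the total variation distance TV(P,Q) = (1/2) Σ_i |p_i − q_i| for random categorical distributions, and checks two elementary consequences of the definitions: the Pinsker-based lower bound JSD ≥ TV²/2 and the crude upper bound JSD ≤ TV. The sharper constants obtained in the paper via the reverse Pinsker inequality are not reproduced here.

    import numpy as np

    rng = np.random.default_rng(0)

    def tv(p, q):
        # Total variation distance: (1/2) * sum_i |p_i - q_i|.
        return 0.5 * np.abs(p - q).sum()

    def kl(p, q):
        # Kullback-Leibler divergence in nats; terms with p_i = 0 contribute 0.
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    def jsd(p, q):
        # Jensen-Shannon divergence: (1/2) KL(P||M) + (1/2) KL(Q||M) with M = (P+Q)/2.
        m = 0.5 * (p + q)
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    for _ in range(10_000):
        k = rng.integers(2, 8)            # number of categories
        p = rng.dirichlet(np.ones(k))     # two random categorical distributions
        q = rng.dirichlet(np.ones(k))
        v, j = tv(p, q), jsd(p, q)
        # Pinsker applied to KL(P||M) and KL(Q||M) gives JSD >= TV^2 / 2;
        # ln(1 + x) <= x together with |p - q| <= p + q gives JSD <= TV.
        assert 0.5 * v ** 2 - 1e-12 <= j <= v + 1e-12
    print("Elementary bounds TV^2/2 <= JSD <= TV hold on all sampled examples.")

Such a check only illustrates the qualitative relationship; the contribution of the paper lies in the precise bounds and their use in proving the asymptotic properties of the minimum Jensen-Shannon divergence estimate.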

Keywords:

blended divergences, Chan-Darwiche metric, likelihood-free inference, implicit maximum likelihood, reverse Pinsker inequality, simulator-based inference

Classification:

62B10, 62H05, 94A17

References:

  1. N. S. Barnett and S. S. Dragomir: A survey of recent inequalities for $\phi$-divergences of discrete probability distributions. In: Advances in Inequalities from Probability Theory and Statistics (N. S. Barnett and S. S. Dragomir, eds.), Nova Science Publishing, New York 2008, pp. 1-85.
  2. M. Basseville: Divergence measures for statistical data processing $-$ An annotated bibliography. Signal Processing 93 (2013), 621-633.   DOI:10.1016/j.sigpro.2012.09.003
  3. D. Berend and A. Kontorovich: A sharp estimate of the binomial mean absolute deviation with applications. Stat. Probab. Lett. 83 (2013), 1254-1259.
  4. BOLFI Tutorial and Manual: https://elfi.readthedocs.io/en/latest/usage/BOLFI.html, 2017.
  5. U. Böhm, P. F. Dahm, B. F. McAllister and I. F. Greenbaum: Identifying chromosomal fragile sites from individuals: a multinomial statistical model. Human Genetics 95 (1995), 249-256.
  6. H. Chan and A. Darwiche: A distance measure for bounding probabilistic belief change. Int. J. Approx. Reasoning 38 (2005), 149-174.   DOI:10.1016/j.ijar.2004.07.001
  7. H. Chan and A. Darwiche: On the revision of probabilistic beliefs using uncertain evidence. Artif. Intell. 163 (2005), 67-90.
  8. C. D. Charalambous, I. Tzortzis, S. Loyka and T. Charalambous: Extremum problems with total variation distance and their applications. IEEE Trans. Automat. Control 59 (2014), 2353-2368.   DOI:10.1109/TAC.2014.2321951
  9. J. Corander, C. Fraser, M. U. Gutmann, B. Arnold, W. P. Hanage, S. D. Bentley, M. Lipsitch and N. J. Croucher: Frequency-dependent selection in vaccine-associated pneumococcal population dynamics. Nature Ecology & Evolution 1 (2017), 1950-1960.   DOI:10.1038/s41559-017-0337-x
  10. Th. M. Cover and J. A. Thomas: Elements of Information Theory. Second edition. John Wiley and Sons, New York 2012.
  11. K. Cranmer, J. Brehmer and G. Louppe: The frontier of simulation-based inference. Proc. Natl. Acad. Sci. USA 117 (2020), 30055-30062.   DOI:10.1073/pnas.1912789117
  12. I. Csiszár and Z. Talata: Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Trans. Inform. Theory 52 (2006), 1007-1016.   DOI:10.1109/TIT.2005.864431
  13. I. Csiszár and P. C. Shields: Information Theory and Statistics: A tutorial. Now Publishers Inc, Delft 2004.
  14. L. Devroye: The equivalence of weak, strong and complete convergence in $L_1$ for kernel density estimates. Ann. Statist. 11 (1983), 896-904.
  15. P. J. Diggle and R. J. Gratton: Monte Carlo methods of inference for implicit statistical models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 46 (1984), 193-212.
  16. D. M. Endres and J. E. Schindelin: A new metric for probability distributions. IEEE Trans. Inform. Theory 49 (2003), 1858-1860.   DOI:10.1109/TIT.2003.813506
  17. A. A. Fedotov, P. Harremoës and F. Topsøe: Refinements of Pinsker's inequality. IEEE Trans. Inform. Theory 49 (2003), 1491-1498.   DOI:10.1109/TIT.2003.811927
  18. A. L. Gibbs and F. E. Su: On choosing and bounding probability metrics. Int. Stat. Rev. 70 (2002), 419-435.   DOI:10.1111/j.1751-5823.2002.tb00178.x
  19. A. Guntuboyina: Lower bounds for the minimax risk using $ f $-divergences, and applications. IEEE Trans. Inform. Theory 57 (2011), 2386-2399.   DOI:10.1109/TIT.2011.2110791
  20. M. U. Gutmann and J. Corander: Bayesian optimization for likelihood-free inference of simulator-based statistical models. J. Mach. Learn. Res. 17 (2016), 4256-4302.
  21. M. Gyllenberg, T. Koski, E. Reilink and M. Verlaan: Non-uniqueness in probabilistic numerical identification of bacteria. J. App. Prob. 31 (1994), 542-548.   DOI:10.1017/S0021900200045034
  22. M. Gyllenberg and T. Koski: Numerical taxonomy and the principle of maximum entropy. J. Classification 13 (1996), 213-229.   DOI:10.1007/BF01246099
  23. I. Holopainen: Evaluating Uncertainty with Jensen-Shannon Divergence. Master's Thesis, Faculty of Science, University of Helsinki 2021.
  24. C-D. Hou, J. Chiang and J. J. Tai: Identifying chromosomal fragile sites from a hierarchical-clustering point of view. Biometrics 57 (2001), 435-440.   DOI:10.1111/j.0006-341X.2001.00435.x
  25. M. Janžura and P. Boček: A method for knowledge integration. Kybernetika 34 (1998), 41-55.
  26. N. Jardine and R. Sibson: Mathematical Taxonomy. J. Wiley and Sons, London 1971.
  27. M. Khosravifard, D. Fooladivanda and T. A. Gulliver: Exceptionality of the variational distance. In: 2006 IEEE Information Theory Workshop (ITW'06), Chengdu 2006, pp. 274-276.
  28. T. Koski: Probability Calculus for Data Science. Studentlitteratur, Lund 2020.
  29. V. Kůs: Blended $\phi$-divergences with examples. Kybernetika 39 (2003), 43-54.
  30. V. Kůs, D. Morales and I. Vajda: Extensions of the parametric families of divergences used in statistical inference. Kybernetika 44 (2008), 95-112.
  31. L. LeCam: On the assumptions used to prove asymptotic normality of maximum likelihood estimates. Ann. Math. Statist. 41 (1970), 802-828.   DOI:10.1214/aoms/1177696960
  32. F. Liese and I. Vajda: On divergences and informations in statistics and information theory. IEEE Trans. Inform. Theory 52 (2006), 4394-4412.   DOI:10.1109/TIT.2006.881731
  33. K. Li and J. Malik: Implicit maximum likelihood estimation. arXiv preprint arXiv:1809.09087, 2018.
  34. J. Lin: Divergence measures based on the Shannon entropy. IEEE Trans. Inform. Theory 37 (1991), 145-151.   DOI:10.1109/18.61115
  35. J. Lintusaari, M. U. Gutmann, R. Dutta, S. Kaski and J. Corander: Fundamentals and recent developments in approximate Bayesian computation. Systematic Biology 66 (2017), e66-e82.
  36. J. Lintusaari, H. Vuollekoski, A. Kangasrääsiö, K. Skytén, M. Järvenpää, P. Marttinen, M. U. Gutmann, A. Vehtari, J. Corander and S. Kaski: ELFI: Engine for likelihood-free inference. J. Mach. Learn. Res. 19 (2018), 1-7.
  37. D. Morales, L. Pardo and I. Vajda: Asymptotic divergence of estimates of discrete distributions. J. Statist. Plann. Inference 48 (1995), 347-369.   DOI:10.1016/0378-3758(95)00013-Y
  38. S. Nowozin, B. Cseke and R. Tomioka: f-GAN: Training generative neural samplers using variational divergence minimization. In: Advances in Neural Information Processing Systems (2016), pp. 271-279.
  39. M. Okamoto: Some inequalities relating to the partial sum of binomial probabilities. Ann. Inst. Statist. Math. 10 (1959), 29-35.   DOI:10.1007/BF02883985
  40. I. Sason: On f-divergences: Integral representations, local behavior, and inequalities. Entropy 20 (2018), 383-405.   DOI:10.3390/e20050383
  41. I. Sason and S. Verdú: $f$-divergence inequalities. IEEE Trans. Inform. Theory 62 (2016), 5973-6006.   DOI:10.1109/TIT.2016.2603151
  42. M. Shannon: Properties of f-divergences and f-GAN training. arXiv preprint arXiv:2009.00757, 2020.
  43. R. Sibson: Information radius. Z. Wahrsch. Verw. Geb. 14 (1969), 149-160.   DOI:10.1007/BF00537520
  44. M. Sinn and A. Rawat: Non-parametric estimation of Jensen-Shannon divergence in generative adversarial network training. In: International Conference on Artificial Intelligence and Statistics 2018, pp. 642-651.
  45. I. J. Taneja: On mean divergence measures. In: Advances in Inequalities from Probability Theory and Statistics (N. S. Barnett and S. S. Dragomir, eds.), Nova Science Publishing, New York 2008, pp. 169-186.
  46. F. Topsøe: Information-theoretical optimization techniques. Kybernetika 15 (1979), 8-27.
  47. F. Topsøe: Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inform. Theory 46 (2000), 1602-1609.   DOI:10.1109/18.850703
  48. I. Vajda: Note on discrimination information and variation (Corresp.). IEEE Trans. Inform. Theory 16 (1970), 771-773.   DOI:10.1109/TIT.1970.1054557
  49. I. Vajda: Theory of Statistical Inference and Information. Kluwer Academic Publ., Dordrecht 1989.
  50. I. Vajda: On metric divergences of probability measures. Kybernetika 45 (2009), 885-900.
  51. J. I. Yellott Jr.: The relationship between Luce's choice axiom, Thurstone's theory of comparative judgment, and the double exponential distribution. J. Math. Psych. 15 (1977), 109-144.   DOI:10.1016/0022-2496(77)90026-8
  52. F. Österreicher and I. Vajda: Statistical information and discrimination. IEEE Trans. Inform. Theory 39 (1993), 1036-1039.   DOI:10.1109/18.256536