research-article

Open Access

The simple essence of automatic differentiation

Author: Conal Elliott (Target, USA)

Proceedings of the ACM on Programming Languages, Volume 2, Issue ICFP, Article No. 70, pp. 1–29. https://doi.org/10.1145/3236765

Published: 30 July 2018

Abstract

Automatic differentiation (AD) in reverse mode (RAD) is a central component of deep learning and other uses of large-scale optimization. Commonly used RAD algorithms such as backpropagation, however, are complex and stateful, hindering deep understanding, improvement, and parallel execution. This paper develops a simple, generalized AD algorithm calculated from a simple, natural specification. The general algorithm is then specialized by varying the representation of derivatives. In particular, applying well-known constructions to a naive representation yields two RAD algorithms that are far simpler than previously known. In contrast to commonly used RAD implementations, the algorithms defined here involve no graphs, tapes, variables, partial derivatives, or mutation. They are inherently parallel-friendly, correct by construction, and usable directly from an existing programming language with no need for new data types or programming style, thanks to use of an AD-agnostic compiler plugin.
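
The abstract describes an AD algorithm in which derivatives are values whose representation can vary. As a rough illustration only (the paper itself works in Haskell, and this is not its code), the Python sketch below pairs each scalar function with its derivative at a point, represents that derivative as a function standing in for a linear map, and composes two such functions by the chain rule. The names `Linear`, `Diff`, `compose`, `square`, and `sine` are hypothetical, chosen for this example.

```python
import math
from typing import Callable, Tuple

# Assumption for this sketch: scalars only, and a derivative (a linear map)
# is represented as a plain Python function from float to float.
Linear = Callable[[float], float]
Diff = Callable[[float], Tuple[float, Linear]]  # point -> (value, derivative there)


def compose(g: Diff, f: Diff) -> Diff:
    """Sequential composition; the derivative follows the chain rule."""
    def h(a: float) -> Tuple[float, Linear]:
        b, df = f(a)  # value and derivative of f at a
        c, dg = g(b)  # value and derivative of g at f(a)
        return c, lambda da: dg(df(da))
    return h


def square(a: float) -> Tuple[float, Linear]:
    # d/da (a^2) at a is the linear map da -> 2 * a * da
    return a * a, lambda da: 2.0 * a * da


def sine(a: float) -> Tuple[float, Linear]:
    # d/da sin(a) at a is the linear map da -> cos(a) * da
    return math.sin(a), lambda da: math.cos(a) * da


# Hypothetical example: differentiate sin(a^2) at a = 3.
sin_of_square = compose(sine, square)
value, derivative = sin_of_square(3.0)
slope = derivative(1.0)  # apply the linear map to a unit perturbation
```

No tape, graph, or mutable state appears in this sketch. Swapping the representation behind `Linear` (for example a matrix or a continuation) is the kind of variation the abstract alludes to, though the paper's actual constructions are not reproduced here.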

Supplemental Material

a70-elliott.webm (webm, 88.8 MB)

Index Terms

1. Mathematics of computing
  1. Continuous mathematics
    1. Calculus
      1. Differential calculus
2. Theory of computation
  1. Semantics and reasoning
    1. Program reasoning
      1. Program specifications

Published in

Proceedings of the ACM on Programming Languages Volume 2, Issue ICFP
September 2018
1133 pages
EISSN: 2475-1421
DOI: 10.1145/3243631

Copyright © 2018 Owner/Author
This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
automatic differentiation
category theory
program calculation