Abstract
Automatic differentiation (AD) in reverse mode (RAD) is a central component of deep learning and other uses of large-scale optimization. Commonly used RAD algorithms such as backpropagation, however, are complex and stateful, hindering deep understanding, improvement, and parallel execution. This paper develops a simple, generalized AD algorithm calculated from a simple, natural specification. The general algorithm is then specialized by varying the representation of derivatives. In particular, applying well-known constructions to a naive representation yields two RAD algorithms that are far simpler than previously known. In contrast to commonly used RAD implementations, the algorithms defined here involve no graphs, tapes, variables, partial derivatives, or mutation. They are inherently parallel-friendly, correct by construction, and usable directly from an existing programming language with no need for new data types or programming style, thanks to use of an AD-agnostic compiler plugin.
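The compositional idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the paper works in a typed functional setting, and the names used here (`compose`, `fork`, `mul`, `sin_d`, `identity`) are hypothetical choices made for illustration. The sketch assumes a differentiable function is represented as an ordinary function that returns its value together with its derivative at that point, where the derivative is a linear map (here, a plain Python function). It shows only the general compositional form, not the reverse-mode specializations the abstract mentions.

```python
import math

# Assumption: a "differentiable function" maps x to (f(x), f'(x)),
# where f'(x) is a linear map represented as a Python function.

def compose(g, f):
    """Sequential composition g . f; the chain rule composes the linear maps."""
    def h(a):
        b, df = f(a)
        c, dg = g(b)
        return c, lambda da: dg(df(da))
    return h

def fork(f, g):
    """Feed one input to both f and g, pairing the results and the derivatives."""
    def h(a):
        b, df = f(a)
        c, dg = g(a)
        return (b, c), lambda da: (df(da), dg(da))
    return h

def mul(p):
    """Multiplication of a pair; its derivative is the product rule."""
    a, b = p
    return a * b, lambda d: b * d[0] + a * d[1]

def sin_d(x):
    """Sine together with its derivative as a linear map."""
    return math.sin(x), lambda dx: math.cos(x) * dx

def identity(x):
    """The identity function is linear, so it is its own derivative."""
    return x, lambda dx: dx

# x * sin(x), built only from the combinators above.
x_sin_x = compose(mul, fork(identity, sin_d))

value, deriv = x_sin_x(2.0)
slope = deriv(1.0)  # apply the derivative (a linear map) to the unit perturbation
```

No graph, tape, or mutable state appears in this sketch: each combinator builds the derivative of a composite directly from the derivatives of its parts. Here `slope` should agree with the hand-derived expression sin(x) + x cos(x) at x = 2.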
Index Terms
- The simple essence of automatic differentiation