research-article
Open Access

The simple essence of automatic differentiation

Published: 30 July 2018

Abstract

Automatic differentiation (AD) in reverse mode (RAD) is a central component of deep learning and other uses of large-scale optimization. Commonly used RAD algorithms such as backpropagation, however, are complex and stateful, hindering deep understanding, improvement, and parallel execution. This paper develops a simple, generalized AD algorithm calculated from a simple, natural specification. The general algorithm is then specialized by varying the representation of derivatives. In particular, applying well-known constructions to a naive representation yields two RAD algorithms that are far simpler than previously known. In contrast to commonly used RAD implementations, the algorithms defined here involve no graphs, tapes, variables, partial derivatives, or mutation. They are inherently parallel-friendly, correct by construction, and usable directly from an existing programming language with no need for new data types or programming style, thanks to use of an AD-agnostic compiler plugin.
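To make the idea of "varying the representation of derivatives" concrete, here is a minimal sketch, not the paper's implementation. It assumes one particular representation: a differentiable function maps an input to a pair of its value and its derivative at that point, with the derivative given as a linear map encoded by a plain Python function. All names (`compose`, `fork`, `d_mul`, `d_sin`, `d_id`) are hypothetical and chosen only for illustration.

```python
import math

# Assumption for this sketch: a "differentiable function" is a Python function
#   x -> (f(x), df_x)
# where df_x is the derivative of f at x, represented as a linear map
# (itself just a Python function from tangent to tangent).

def compose(g, f):
    """Sequential composition; the derivative follows the chain rule."""
    def h(x):
        y, df = f(x)
        z, dg = g(y)
        return z, lambda dx: dg(df(dx))
    return h

def fork(f, g):
    """Feed the same input to f and g, pairing results and derivatives."""
    def h(x):
        y, df = f(x)
        z, dg = g(x)
        return (y, z), lambda dx: (df(dx), dg(dx))
    return h

def d_id(x):
    """Identity: its derivative is the identity linear map."""
    return x, lambda dx: dx

def d_sin(x):
    """Sine: its derivative at x scales the tangent by cos(x)."""
    return math.sin(x), lambda dx: math.cos(x) * dx

def d_mul(pair):
    """Multiplication of a pair: derivative is the product rule."""
    a, b = pair
    return a * b, lambda d: d[0] * b + a * d[1]

# f(x) = x * sin(x), built only from the combinators above.
x_sin_x = compose(d_mul, fork(d_id, d_sin))

value, deriv = x_sin_x(2.0)
slope = deriv(1.0)  # apply the linear map to the unit tangent
```

Nothing here records a graph or tape, and nothing is mutated: derivatives are assembled purely by composing functions. Swapping the function-valued linear maps for another representation (for example matrices) would change only the primitives and combinators, not how `x_sin_x` is written.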


Supplemental Material

a70-elliott.webm (webm, 88.8 MB)


Published in

Proceedings of the ACM on Programming Languages, Volume 2, Issue ICFP
September 2018
1133 pages
EISSN: 2475-1421
DOI: 10.1145/3243631

Copyright © 2018 Owner/Author

This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License.

Publisher

Association for Computing Machinery
New York, NY, United States
