Elsevier

Computational Materials Science

Volume 152, September 2018, Pages 60-69
Matminer: An open source toolkit for materials data mining

https://doi.org/10.1016/j.commatsci.2018.05.018

Abstract

As materials data sets grow in size and scope, the role of data mining and statistical learning methods to analyze these materials data sets and build predictive models is becoming more important. This manuscript introduces matminer, an open-source, Python-based software platform to facilitate data-driven methods of analyzing and predicting materials properties. Matminer provides modules for retrieving large data sets from external databases such as the Materials Project, Citrination, Materials Data Facility, and Materials Platform for Data Science. It also provides implementations for an extensive library of feature extraction routines developed by the materials community, with 47 featurization classes that can generate thousands of individual descriptors and combine them into mathematical functions. Finally, matminer provides a visualization module for producing interactive, shareable plots. These functions are designed in a way that integrates closely with machine learning and data analysis packages already developed and in use by the Python data science community. We explain the structure and logic of matminer, provide a description of its various modules, and showcase several examples of how matminer can be used to collect data, reproduce data mining studies reported in the literature, and test new methodologies.

Introduction

Recently, the materials community has placed a renewed emphasis on collecting and organizing large data sets for research, materials design, and the eventual application of statistical or “machine learning” techniques. For example, the mining of databases composed of density functional theory (DFT) calculations has been used to identify materials for batteries [1], [2], to aid the design of metal alloys [3], [4], and for many other applications [5]. Importantly, such data sets present new opportunities to develop predictive models through machine learning techniques: rather than requiring models to be designed and programmed manually, such techniques produce predictive models by learning from a body of examples. Machine learning models have been demonstrated to predict properties of crystalline materials much faster than DFT [6], [7], [8], [9], estimate properties that are difficult to access via other computational tools [10], [11], and guide the search for new materials [12], [13], [14], [15], [16]. With the continued development of general-purpose data mining methods for many types of materials data [17], [18], [19] and the proliferation of material property databases [20], this emerging field of “materials informatics” is positioned to have a continued impact on materials design.

In this paper, we describe a new software library, “matminer”, for applying data-driven techniques to the materials domain. The main roles of matminer are depicted in Fig. 1: matminer assists the user in retrieving large data sets from common databases, extracts features to transform the raw data into representations suitable for machine learning, and produces interactive visualizations of the data for exploratory analysis. We note that matminer does not itself implement common machine learning algorithms; industry-standard tools (e.g., scikit-learn or Keras) are already developed and maintained by the larger data science community for this purpose. Instead, matminer's role is to connect these advanced machine learning tools to the materials domain.
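The retrieval role described above can be pictured with a minimal sketch. It assumes two hypothetical data sources that return the same kind of information in different shapes, and normalizes both into flat rows with shared keys. The class names, the `get_rows` method, and the canned payloads are illustrative assumptions, not matminer's API or real database responses, and the numerical values are placeholders rather than measured properties.

```python
# Canned payloads stand in for network responses from two hypothetical sources.
# All values are placeholders, not real property data.
RAW_A = [
    {"formula": "NaCl", "properties": {"band_gap": 1.0}},
    {"formula": "Fe2O3", "properties": {"band_gap": 2.0}},
]
RAW_B = [["MgO", 3.0], ["NaCl", 1.5]]  # positional records: formula, band gap


class SourceARetriever:
    """Adapter for a source that returns nested dictionaries (hypothetical)."""

    name = "source_a"

    def get_rows(self, formulas):
        return [
            {"formula": rec["formula"], "band_gap": rec["properties"]["band_gap"]}
            for rec in RAW_A
            if rec["formula"] in formulas
        ]


class SourceBRetriever:
    """Adapter for a source that returns positional records (hypothetical)."""

    name = "source_b"

    def get_rows(self, formulas):
        return [
            {"formula": formula, "band_gap": gap}
            for formula, gap in RAW_B
            if formula in formulas
        ]


def collect(retrievers, formulas):
    """Query every source and return one list of rows with identical keys."""
    rows = []
    for retriever in retrievers:
        for row in retriever.get_rows(formulas):
            rows.append({"source": retriever.name, **row})
    return rows


table = collect([SourceARetriever(), SourceBRetriever()], {"NaCl", "MgO"})
for row in table:
    print(row)
```

Because every row carries the same keys, a list like `table` could be handed directly to `pandas.DataFrame(table)` for downstream analysis; the sketch avoids that dependency only to stay self-contained.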

Matminer solves many problems encountered when conducting data-driven research. For example, learning the Application Programming Interface (API) for each data source and preprocessing retrieved data adds significant complexity to the task of building new machine learning models. Matminer provides a simplified interface that abstracts the details of these API interactions, making it easy for the user to query and organize large data sets into the standard pandas [21] data format used by the Python data science community. Furthermore, as discussed later in the text, matminer implements a suite of 47 distinct feature extraction modules capable of producing thousands of physically relevant descriptors that machine learning algorithms can leverage to determine input-output relationships more efficiently. Although many such feature extraction methods are reported in the literature, many lack an open source implementation. Matminer not only implements these domain-specific feature extraction methods but also provides a unified interface for their use, making it trivial to reproduce or compare (and, eventually, extend) these methods. Finally, matminer contains many pre-defined visualization recipes for exploring and discovering different data relationships. In aggregate, these features allow cutting-edge materials informatics research to be conducted with a high-level, easy-to-use interface.
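To illustrate what a unified interface for feature extraction buys, the following is a minimal sketch using only the Python standard library. The base class, the two toy featurizers, the `featurize_records` helper, and the simplified formula parser are all assumptions made for illustration; they are not matminer's implementation, and the parser handles only flat formulas such as `Fe2O3` (no parentheses or charges).

```python
import re
from abc import ABC, abstractmethod

# Matches an element symbol followed by an optional amount, e.g. "Fe2" or "O".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(formula):
    """Parse a flat formula such as 'Fe2O3' into {'Fe': 2.0, 'O': 3.0}."""
    counts = {}
    for symbol, amount in _TOKEN.findall(formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


class Featurizer(ABC):
    """Common interface: every featurizer returns values plus matching labels."""

    @abstractmethod
    def featurize(self, formula):
        """Return a list of feature values for one formula."""

    @abstractmethod
    def feature_labels(self):
        """Return one label per value produced by featurize()."""


class ElementCount(Featurizer):
    def featurize(self, formula):
        return [len(parse_formula(formula))]

    def feature_labels(self):
        return ["n_elements"]


class ElementFraction(Featurizer):
    def __init__(self, elements):
        self.elements = list(elements)

    def featurize(self, formula):
        counts = parse_formula(formula)
        total = sum(counts.values())
        return [counts.get(el, 0.0) / total for el in self.elements]

    def feature_labels(self):
        return [f"frac_{el}" for el in self.elements]


def featurize_records(records, featurizers, column="formula"):
    """Apply every featurizer to every record, adding one key per feature."""
    out = []
    for record in records:
        row = dict(record)
        for featurizer in featurizers:
            labels = featurizer.feature_labels()
            values = featurizer.featurize(record[column])
            row.update(zip(labels, values))
        out.append(row)
    return out


rows = featurize_records(
    [{"formula": "Fe2O3"}, {"formula": "NaCl"}],
    [ElementCount(), ElementFraction(["Fe", "O"])],
)
for row in rows:
    print(row)
```

Because each featurizer exposes the same two methods, swapping one descriptor set for another, or comparing several, only changes the list passed to the helper and leaves the surrounding pipeline untouched.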

We note that prior efforts have produced software for computing features for materials (e.g., Magpie [22], [23], pyMKS [24]), building deep learning models of molecular materials (e.g., deepchem [25], [26]), providing turnkey machine learning estimates of various properties, or integrating machine learning with other software [27], [28], [29]. In contrast to these prior efforts (which have their own intended applications and scope), matminer is designed to interact and integrate with standard Python data mining tools such as pandas and scikit-learn [30], implements a library of feature generation methods (“featurizers”) for a wide variety of materials science entities (e.g., compositions, crystal structures, and electronic structures), and includes tools to assist with data retrieval and visualization.

The source code for the version of matminer described in this manuscript (version 0.3.2) and examples of its use are available as supplementary information. Updated versions are regularly published to the Python Package Index (https://pypi.python.org/pypi/matminer). The actively developed version of matminer is available on GitHub at https://github.com/hackingmaterials/matminer. Matminer also includes a dedicated repository of examples and tutorials (many in an interactive, runnable Jupyter notebook format [31]) for using the data retrieval, featurization, and visualization tools, located at https://github.com/hackingmaterials/matminer_examples. Full documentation for matminer is also available from https://hackingmaterials.github.io/matminer/. The matminer code currently contains 109 unit tests to ensure the integrity of the code, which are run automatically with each code commit through a continuous integration process. A help forum for matminer is available at: https://groups.google.com/forum/#!forum/matminer.

Software architecture and design principles

A guiding principle of matminer is to integrate domain-specific knowledge and data about materials into the larger ecosystem of Python data analysis software. The Python community has developed a rich suite of interoperable tools for data science, which are broadly used across the data science community and occasionally known as the “PyData” or “SciPy” stacks [32]. These libraries include NumPy and SciPy [33], which provide a suite of high-performance numerical methods, and Jupyter [31], which

Components of matminer

We now describe the main functions of matminer. We describe each of the three major components (data retrieval, featurization, and visualization) separately.

Examples of using matminer

Next, we present four usage examples that showcase the capabilities of matminer. The source code for these and other examples is available as part of the matminer_examples GitHub repository (https://github.com/hackingmaterials/matminer_examples). Users can download, inspect, and execute the full code for these examples themselves and modify it for their own applications.

Conclusion

Performing materials informatics requires developing a data pipeline that encompasses data retrieval, feature extraction, and visualization prior to the actual machine learning step. The matminer software described in this manuscript is designed to facilitate the development, reuse, and reproducibility of data pipelines for materials informatics applications. We have designed matminer to connect the domain-specific aspects of materials informatics (i.e., materials data extraction, feature

Acknowledgements

This code was intellectually led and primarily developed using funding provided by the U.S. Department of Energy, Office of Basic Energy Sciences, Early Career Research Program, which funded the efforts of AJ, AD, AF, SB, and QW. LW and IF were supported by financial assistance award 70NANB14H012 from the U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Material Design (CHiMaD), by the National Science Foundation as part of the Midwest

References (92)

  • C. Nyshadham et al., Acta Mater. (2017)
  • S. Kirklin et al., Acta Mater. (2016)
  • L. Ward et al., Curr. Opin. Solid State Mater. Sci. (2017)
  • Z. Wu et al., Chem. Sci. (2018)
  • A. Khorshidi et al., Comput. Phys. Commun. (2016)
  • S.P. Ong et al., Comput. Mater. Sci. (2013)
  • S. Curtarolo et al., Comput. Mater. Sci. (2012)
  • S.P. Ong et al., Comput. Mater. Sci. (2015)
  • K. Mathew et al., Comput. Mater. Sci. (2017)
  • T. Fast et al., Acta Mater. (2011)
  • R.F. Zhang et al., Comput. Phys. Commun. (2016)
  • X. Yang et al., Prediction of high-entropy stabilized solid-solution in multi-component alloys, Mater. Chem. Phys. (2012)
  • H. Chen et al., Chem. Mater. (2012)
  • M. Aykol et al., Nat. Commun. (2016)
  • A. Jain et al., APL Mater. (2016)
  • L. Ward et al., Phys. Rev. B (2017)
  • M. Rupp et al., Phys. Rev. Lett. (2012)
  • J. Carrete et al., Phys. Rev. X (2014)
  • J.C. Mauro et al., Chem. Mater. (2016)
  • E.W. Bucholz et al., Tribol. Lett. (2012)
  • T.D. Sparks et al., Scr. Mater. (2015)
  • R. Yuan et al., Adv. Mater. (2018)
  • A. Mannodi-Kanakkithodi et al., Mater. Today (2017)
  • F.A. Faber et al., Phys. Rev. Lett. (2016)
  • F. Ren et al., Sci. Adv. (2018)
  • A. Seko et al., Phys. Rev. B (2017)
  • R. Ramprasad et al., Npj Comput. Mater. (2017)
  • S.R. Kalidindi, ISRN Mater. Sci. (2012)
  • J. Hill et al., MRS Bull. (2016)
  • W. McKinney, Proc. 9th Python Sci. Conf. 1697900 (2010)...
  • L. Ward et al., Npj Comput. Mater. (2016)
  • ...
  • W. Daniel, B. David, F. Tony, K. Surya, R. Andrew, PyMKS: Materials Knowledge System in Python, 2014. doi:...
  • ...
  • E. Gossett et al. (2017)
  • ...
  • F. Pedregosa et al., J. Mach. Learn. Res. (2011)
  • F. Perez et al., Comput. Sci. Eng. (2007)
  • K.J. Millman et al., Comput. Sci. Eng. (2011)
  • S. van der Walt et al., Comput. Sci. Eng. (2011)
  • ...
  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat,...
  • A. Frantzen et al., Angew. Chemie Int. Ed. (2004)
  • Y. Xu et al., Jpn. J. Appl. Phys. (2011)
  • A. Jain et al., APL Mater. (2013)
  • ...