
Ecological Informatics

Volume 36, November 2016, Pages 237-246

Ensuring the quality of data packages in the LTER network data management system

https://doi.org/10.1016/j.ecoinf.2016.08.001

Highlights

  • LTER Network developed a screening system for data packages with metadata in EML.

  • The system is driven by the community's existing Best Practices.

  • All LTER data are accompanied by a report describing completeness and quality.

  • Site-contributed data from LTER are 95% compliant with the current checks.

Abstract

Many data analyses now rely on automated workflows to ingest data from public repositories, and such workflows require data packages of high structural quality. The Long Term Ecological Research (LTER) Network now screens all packages entering its long-term archive to ensure completeness and quality, and to ascertain that metadata and data are structurally congruent, i.e., that the data typing and formats expressed in metadata agree with what is found in the data entities. The EML Congruence Checker (ECC) system is a component of the LTER Provenance Aware Synthesis Tracking Architecture (PASTA), and operates on data tables in packages described with Ecological Metadata Language (EML) using the EML Data Manager Library, written in Java. Checking is extensible to other data types and customizable via a template. Reports are retained as part of the submitted data package, and the summaries presented here reflect the general usability of LTER data for a variety of purposes. On average in 2015, site-contributed data in the LTER catalog were 95% compliant (valid) with the current suite of checks.

Introduction

Data sets are an important contribution from Long Term Ecological Research (LTER) sites to the LTER Network Information System (NIS); they are intended to be used in cross-site synthesis projects and for dissemination to federated catalogs and national repositories, and their long-term nature makes them irreplaceable for tracking environmental change (Gosz et al., 2010, Peters, 2010). In 2002, the LTER Network adopted an XML specification for its data exchange, Ecological Metadata Language or EML (Fegraus et al., 2005), followed closely by narrative guidelines for usage and recommendations for completeness (LTER, 2011). By mid-decade all LTER sites were contributing metadata records to a central catalog, and by 2009 were fully populating EML records (Michener et al., 2011, Porter, 2010). As with many other format specifications, an EML record supports machine reading and interpretation of its associated data entities, and code generators have been developed for ingesting EML-described data entities into statistical, processing and database environments (e.g., Lin et al., 2008, Porter et al., 2012). The first-order quality standard for XML records is schema compliance, and for EML, additional parsing code checks that internal identifiers and their references adhere to specific rules (EML Project, 2008). However, experience with automated use of LTER data packages indicated that a significant fraction did not have metadata and data of sufficient structural detail (Leinfelder et al., 2008, Leinfelder et al., 2010). Clearly, any automated use required a higher level of metadata and data quality and congruence, i.e., the data typing and formats expressed in metadata must agree with what is found in the data entities. To assist data contributors as they prepare datasets, we developed a mechanism to provide them with feedback on congruence and potential usability.
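
For concreteness, the sketch below illustrates the kind of structural congruence that automated use depends on: values in a data table column must parse according to the storage type declared in the metadata. It is a minimal, hypothetical illustration in Java (the attribute names, types, and data values are invented for the example), not the parsing performed by the EML Data Manager Library.

```java
// Minimal sketch of a metadata-data congruence test: every value in a column
// must parse as the type the metadata declares for that column. A production
// check would also consult declared missing-value codes (e.g., "NA") before
// flagging a value as incongruent.
import java.util.List;
import java.util.Map;

public class CongruenceSketch {

    /** Returns true if every value parses as the declared storage type. */
    static boolean columnMatchesDeclaredType(List<String> values, String declaredType) {
        for (String v : values) {
            try {
                switch (declaredType) {
                    case "integer": Long.parseLong(v.trim()); break;
                    case "real":    Double.parseDouble(v.trim()); break;
                    case "date":    java.time.LocalDate.parse(v.trim()); break; // ISO 8601 assumed
                    default:        break; // treat anything else as free text
                }
            } catch (RuntimeException e) {
                return false; // metadata and data disagree for this value
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical metadata: attribute name -> declared storage type.
        Map<String, String> declared = Map.of("siteId", "string", "meanTempC", "real");
        // Hypothetical data column that the metadata declares as "real".
        List<String> meanTempC = List.of("12.4", "13.1", "NA");
        System.out.println("meanTempC congruent: "
                + columnMatchesDeclaredType(meanTempC, declared.get("meanTempC")));
    }
}
```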

The Internet, together with foresighted policies, has fostered enormous growth in the amount of scientific data available for download in all domains (e.g., AAAS, 2011). Further, the adoption of common, well-structured formats and metadata specifications allows for sophisticated machine reading. Some communities have developed conventions for specific data types or uses, usually as lists of defined fields with recommendations for use (e.g., CF Conventions, Eaton et al., 2011). But in many research domains, practices for the handling and delivery of data are still evolving. As the LTER Network has found, simple recommendations are inadequate, and benchmarks and compliance metrics are a necessary continuation of these efforts.

Generally, assessments have focused on metadata content. NOAA (2014) has developed rubrics and metrics for metadata, which have been extended with summaries for specific use cases, e.g., for spatial data in GEOSS (Zabala et al., 2013). Habermann (2014) extends that approach further by evaluating XML metadata records using abstract metadata concepts (mapped to individual metadata specifications) against community-defined compliance levels and rubrics. That system has been implemented for some types of FGDC and ISO-19139 documents, with results aggregated into summaries for review. For Linked Open Data (LOD), the "LOD Laundromat" (Beek et al., 2014) converts idiosyncratic input to a "cleaned sibling" (their term), and so removes the contributor from the cleansing process. The set of heuristics applied is mainly syntactic, with semantic interpretation of content to detect duplicate triples. Results can be aggregated into bulk reports on input quality. The LTER efforts presented here represent an intermediate approach: we examine metadata from one specification (EML), the data entities therein, plus their agreement (termed "congruence"). This strategy means we can examine datasets more deeply than the Habermann approach does, but we do not attempt to correct metadata (or data) as the LOD Laundromat does. The reasoning is two-fold: first, LTER needed to assure more than just the presence of metadata elements; it was important to assure that data entities could be machine read, hence the need to examine congruence. Second, most ecological data are so complex that heuristics for their repair were simply beyond the scope of this project. Hence, our system informs submitters of its findings for their judgment and repair.

The LTER Network has developed the EML Congruence Checker (ECC) to inform the dataset contributor about the structure of the data package, and indicate whether the asserted metadata accurately defines the data entity (table), i.e. to ensure “structural congruence”. There is minimal semantic checking. The ECC was developed with considerable community involvement, concomitant with the development of other advanced software for the LTER NIS (Servilla et al., 2016). All code and schemas are open source, and the system has been in production since 2013. Today, every incoming data package is subjected to up to 32 distinct checks, which encompass a variety of data and metadata features ranging from simple confirmation that certain metadata XPaths are present to assurance of congruence between metadata and data.
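
As a hedged illustration of the simplest category of check noted above, confirming that a required metadata XPath is present, the sketch below evaluates one XPath against a parsed EML document and maps the result to a status label. The XPath, the "valid"/"warn" labels, and the namespace handling are simplifications chosen for this example, not the ECC's actual check definitions.

```java
// Sketch of a "presence" check: does the EML document contain at least one
// node at a required XPath? Expects the path to an EML file as args[0].
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.File;

public class PresenceCheckSketch {

    /** Returns "valid" if the XPath matches at least one node, otherwise "warn". */
    static String checkXPathPresent(Document eml, String xpath) throws Exception {
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, eml, XPathConstants.NODESET);
        return hits.getLength() > 0 ? "valid" : "warn";
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(false); // simplification: ignore EML namespaces here
        Document eml = dbf.newDocumentBuilder().parse(new File(args[0]));
        // Illustrative check: does the dataset declare at least one keyword?
        System.out.println("keywordPresent: "
                + checkXPathPresent(eml, "//dataset/keywordSet/keyword"));
    }
}
```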

Approximately 60 checks are awaiting consideration, and implementation continues within other management constraints. The system has the potential to produce additional descriptive material for the data values themselves, which may be developed at some later date, e.g., value ranges, frequency distributions, and qualitative comparisons to metadata content. We include here summaries of the reports to date and discuss error modes, highlighting ways that reports may provide input to the design of specific tools or help identify gaps in a data management system.
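
To suggest what such value-level descriptions might look like, the sketch below computes the observed range of a numeric column and compares it with declared bounds; the column, the bounds, and the warning wording are assumptions made for illustration and are not implemented ECC checks.

```java
// Sketch of a possible future value-level summary: report the observed range
// of a numeric column and warn when it falls outside bounds declared in the
// metadata (both the data and the bounds here are hypothetical).
import java.util.List;

public class ValueSummarySketch {

    public static void main(String[] args) {
        List<Double> depthM = List.of(0.5, 1.0, 2.5, 150.0); // hypothetical data column
        double declaredMin = 0.0, declaredMax = 50.0;         // hypothetical metadata bounds

        double observedMin = depthM.stream().mapToDouble(Double::doubleValue).min().orElse(Double.NaN);
        double observedMax = depthM.stream().mapToDouble(Double::doubleValue).max().orElse(Double.NaN);

        System.out.printf("observed range: [%.2f, %.2f]%n", observedMin, observedMax);
        if (observedMin < declaredMin || observedMax > declaredMax) {
            System.out.println("warn: observed values fall outside the declared bounds");
        }
    }
}
```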

Section snippets

Community input

An initial outline for data package checking was constructed in 2009 by a group of LTER site data managers and NIS developers, plus representatives from the National Center for Ecological Analysis and Synthesis (NCEAS), a major partner in the development of EML and its associated code. The ECC was developed as an LTER product, but throughout the process a broader community of data practitioners was engaged to ensure the ECC would be widely usable, with progress reports and/or breakouts formed …

Results

The history of development is presented in Fig. 3. By Spring 2012, a total of 72 checks had been proposed, and many were finalized during the workshop itself. Checks are continually maintained in a collaborative online document and archived as needed (Dataset: O'Brien et al., 2016). At the time of this writing, the total stands at 91 checks entered, with 32 implemented and 21 designated as "deprecated" or "postponed". Deprecated checks were sometimes obviated by other entries, but their record …

New checks and future development

The system will accommodate new checks as the community determines these are necessary, and Fig. 3 shows 19 checks added since the initial implementation in 2013. Further, existing checks may require modification; for example, a check with a response status of "warn" in 2013 may require reclassification to return an "error", or conversely a check's response status may be relaxed. The declarative format of the template allows such changes to be made with ease; simple edits to the XML file can modify a …
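
As a sketch of this declarative idea, the example below parses a single, hypothetical template entry and reads the severity it declares; the element and attribute names are invented for illustration and do not reproduce the actual ECC template schema. Reclassifying such a check from "warn" to "error" would then require only an edit to the XML, not a code change.

```java
// Parse a hypothetical declarative check entry and report the severity it
// declares. Editing statusType in the template would change how a failing
// check is reported, with no changes to the checking code itself.
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class TemplateSketch {

    public static void main(String[] args) throws Exception {
        String template =
                "<qualityCheck statusType=\"warn\">"
              + "<name>attributeNamesUnique</name>"
              + "</qualityCheck>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(template.getBytes(StandardCharsets.UTF_8)));
        Element check = doc.getDocumentElement();
        String name = check.getElementsByTagName("name").item(0).getTextContent();
        String severity = check.getAttribute("statusType");
        System.out.println(name + " reports failures as: " + severity);
    }
}
```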

Conclusion

Overall, the uniformity and usability of datasets in the PASTA catalog are higher than what was found in the first-generation LTER data catalog. We cannot quantify that improvement because experiences recorded with the first-generation catalog were anecdotal, and any checking was ad hoc or cursory at best. However, simply setting rules for admission to a catalog, with the involvement of the contributing community, has improved the landscape by encouraging redesign of data packaging systems, and …

Funding

This work was supported by the National Science Foundation [grant numbers OCE-0620276, OCE-1232779, and Cooperative Agreements DEB-0832652 and DEB-0936498].

Acknowledgements

We are grateful to the participants in the 2012 workshop and to the members of the LTER Information Management working group on quality assessment: S. Bohm, E. Boose, J. Downing, M. Gastil-Buhl, C. Gries, B. Leinfelder.

References

  • Fegraus, E.H., et al., 2005. Maximizing the value of ecological data with structured metadata: an introduction to Ecological Metadata Language (EML) and principles for metadata creation. Bull. Ecol. Soc. Am.

  • Gosz, J.R., et al., 2010. Twenty-eight years of the US-LTER program: experience, results and research questions.
