
Ecological Informatics

Volume 36, November 2016, Pages 237-246

Ensuring the quality of data packages in the LTER network data management system

https://doi.org/10.1016/j.ecoinf.2016.08.001

Highlights

  • LTER Network developed a screening system for data packages with metadata in EML.

  • The system is driven by the community's existing Best Practices.

  • All LTER data are accompanied by a report describing completeness and quality.

  • Site-contributed data from LTER are 95% compliant with the current checks.

Abstract

Many data analyses now rely on automated workflows to ingest data from public repositories, and such workflows require data packages of high structural quality. The Long Term Ecological Research (LTER) Network now screens all packages entering its long-term archive to ensure completeness and quality, and to ascertain that metadata and data are structurally congruent, i.e., that the data typing and formats expressed in metadata agree with what is found in the data entities. The EML Congruence Checker (ECC) system is a component of the LTER Provenance Aware Synthesis Tracking Architecture (PASTA), and operates on data tables in packages described with Ecological Metadata Language (EML) using the EML Data Manager Library, written in Java. Checking is extensible to other data types and customizable via a template. Reports are retained as part of the submitted data package, and the summaries presented here reflect the general usability of LTER data for a variety of purposes. On average in 2015, site-contributed data in the LTER catalog were 95% compliant (valid) with the current suite of checks.

Introduction

Data sets are an important contribution from Long Term Ecological Research (LTER) sites to the LTER Network Information System (NIS); they are intended to be used in cross-site synthesis projects and for dissemination to federated catalogs and national repositories, and their long-term nature makes them irreplaceable for tracking environmental change (Gosz et al., 2010, Peters, 2010). In 2002, the LTER Network adopted an XML specification for its data exchange, Ecological Metadata Language or EML (Fegraus et al., 2005), followed closely by narrative guidelines for usage and recommendations for completeness (LTER, 2011). By mid-decade all LTER sites were contributing metadata records to a central catalog, and by 2009 were fully populating EML records (Michener et al., 2011, Porter, 2010). As with many other format specifications, an EML record supports machine reading and interpretation of its associated data entities, and code generators have been developed for ingesting EML-described data entities into statistical, processing and database environments (e.g., Lin et al., 2008, Porter et al., 2012). The first-order quality standard for XML records is schema compliance, and for EML, additional parsing code checks that internal identifiers and their references adhere to specific rules (EML Project, 2008). However, experience with automated use of LTER data packages indicated that a significant fraction did not have metadata and data of sufficient structural detail (Leinfelder et al., 2008, Leinfelder et al., 2010). Clearly, any automated use required a higher level of metadata and data quality and congruence, i.e., the data typing and formats expressed in metadata must agree with what is found in the data entities. To assist data contributors as they prepare datasets, we developed a mechanism to provide them with feedback on congruence and potential usability.
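
For concreteness, the sketch below illustrates the kind of structural congruence that automated use depends on: values in a data table column must parse according to the storage type declared in the metadata. It is a minimal, hypothetical illustration in Java (the attribute names, types, and data values are invented for the example), not the parsing performed by the EML Data Manager Library.

```java
// Minimal sketch of a metadata-data congruence test: every value in a column
// must parse as the type the metadata declares for that column. A production
// check would also consult declared missing-value codes (e.g., "NA") before
// flagging a value as incongruent.
import java.util.List;
import java.util.Map;

public class CongruenceSketch {

    /** Returns true if every value parses as the declared storage type. */
    static boolean columnMatchesDeclaredType(List<String> values, String declaredType) {
        for (String v : values) {
            try {
                switch (declaredType) {
                    case "integer": Long.parseLong(v.trim()); break;
                    case "real":    Double.parseDouble(v.trim()); break;
                    case "date":    java.time.LocalDate.parse(v.trim()); break; // ISO 8601 assumed
                    default:        break; // treat anything else as free text
                }
            } catch (RuntimeException e) {
                return false; // metadata and data disagree for this value
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical metadata: attribute name -> declared storage type.
        Map<String, String> declared = Map.of("siteId", "string", "meanTempC", "real");
        // Hypothetical data column that the metadata declares as "real".
        List<String> meanTempC = List.of("12.4", "13.1", "NA");
        System.out.println("meanTempC congruent: "
                + columnMatchesDeclaredType(meanTempC, declared.get("meanTempC")));
    }
}
```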

The Internet, together with foresighted policies, has fostered enormous growth in the amount of scientific data available for download in all domains (e.g., AAAS, 2011). Further, the adoption of common, well-structured formats and metadata specifications allows for sophisticated machine reading. Some communities have developed conventions for specific data types or uses, usually as lists of defined fields with recommendations for use (e.g., CF Conventions, Eaton et al., 2011). But in many research domains, practices for the handling and delivery of data are still evolving. As the LTER Network has found, simple recommendations are inadequate, and benchmarks and compliance metrics are a necessary continuation of these efforts.

Generally, assessments have focused on metadata content. NOAA (2014) has developed rubrics and metrics for metadata, which have been extended with summaries for specific use cases, e.g., for spatial data in GEOSS (Zabala et al., 2013). Habermann (2014) extends that approach further by evaluating XML metadata records using abstract metadata concepts (mapped to individual metadata specifications) against community-defined compliance levels and rubrics. That system has been implemented for some types of FGDC and ISO-19139 documents, with results aggregated into summaries for review. For Linked Open Data (LOD), the "LOD Laundromat" (Beek et al., 2014) converts idiosyncratic input to a "cleaned sibling" (their term), and so removes the contributor from the cleansing process. The set of heuristics applied is mainly syntactic, with semantic interpretation of content to detect duplicate triples. Results can be aggregated into bulk reports on input quality. The LTER efforts presented here represent an intermediate approach: we examine metadata from one specification (EML), the data entities therein, plus their agreement (termed "congruence"). This strategy means we can examine datasets more deeply than the Habermann approach does, but we do not attempt to correct metadata (or data) as the LOD Laundromat does. The reasoning is two-fold: first, LTER needed to assure more than just the presence of metadata elements; it was important to assure that data entities could be machine read, hence the need to examine congruence. Second, most ecological data are so complex that heuristics for their repair were simply beyond the scope of this project. Hence, our system informs submitters of its findings for their judgment and repair.

The LTER Network has developed the EML Congruence Checker (ECC) to inform the dataset contributor about the structure of the data package, and indicate whether the asserted metadata accurately defines the data entity (table), i.e. to ensure “structural congruence”. There is minimal semantic checking. The ECC was developed with considerable community involvement, concomitant with the development of other advanced software for the LTER NIS (Servilla et al., 2016). All code and schemas are open source, and the system has been in production since 2013. Today, every incoming data package is subjected to up to 32 distinct checks, which encompass a variety of data and metadata features ranging from simple confirmation that certain metadata XPaths are present to assurance of congruence between metadata and data.
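
As a hedged illustration of the simplest category of check noted above, confirming that a required metadata XPath is present, the sketch below evaluates one XPath against a parsed EML document and maps the result to a status label. The XPath, the "valid"/"warn" labels, and the namespace handling are simplifications chosen for this example, not the ECC's actual check definitions.

```java
// Sketch of a "presence" check: does the EML document contain at least one
// node at a required XPath? Expects the path to an EML file as args[0].
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.File;

public class PresenceCheckSketch {

    /** Returns "valid" if the XPath matches at least one node, otherwise "warn". */
    static String checkXPathPresent(Document eml, String xpath) throws Exception {
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, eml, XPathConstants.NODESET);
        return hits.getLength() > 0 ? "valid" : "warn";
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(false); // simplification: ignore EML namespaces here
        Document eml = dbf.newDocumentBuilder().parse(new File(args[0]));
        // Illustrative check: does the dataset declare at least one keyword?
        System.out.println("keywordPresent: "
                + checkXPathPresent(eml, "//dataset/keywordSet/keyword"));
    }
}
```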

Approximately 60 checks are awaiting consideration, and implementation continues within other management constraints. The system has the potential to produce additional descriptive material for the data values themselves, which may be developed at some later date, e.g., value ranges, frequency distributions, and qualitative comparisons to metadata content. We include here summaries of the reports to date and discuss error modes, highlighting ways that reports may provide input to the design of specific tools or help identify gaps in a data management system.
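
To suggest what such value-level descriptions might look like, the sketch below computes the observed range of a numeric column and compares it with declared bounds; the column, the bounds, and the warning wording are assumptions made for illustration and are not implemented ECC checks.

```java
// Sketch of a possible future value-level summary: report the observed range
// of a numeric column and warn when it falls outside bounds declared in the
// metadata (both the data and the bounds here are hypothetical).
import java.util.List;

public class ValueSummarySketch {

    public static void main(String[] args) {
        List<Double> depthM = List.of(0.5, 1.0, 2.5, 150.0); // hypothetical data column
        double declaredMin = 0.0, declaredMax = 50.0;         // hypothetical metadata bounds

        double observedMin = depthM.stream().mapToDouble(Double::doubleValue).min().orElse(Double.NaN);
        double observedMax = depthM.stream().mapToDouble(Double::doubleValue).max().orElse(Double.NaN);

        System.out.printf("observed range: [%.2f, %.2f]%n", observedMin, observedMax);
        if (observedMin < declaredMin || observedMax > declaredMax) {
            System.out.println("warn: observed values fall outside the declared bounds");
        }
    }
}
```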

Section snippets

Community input

An initial outline for data package checking was constructed in 2009 by a group of LTER site data managers and NIS developers, plus representatives from the National Center for Ecological Analysis and Synthesis (NCEAS), a major partner in the development of EML and its associated code. The ECC was developed as an LTER product, but throughout the process a broader community of data practitioners was engaged to ensure the ECC would be widely usable, with progress reports and/or breakouts formed …

Results

The history of development is presented in Fig. 3. By Spring 2012, a total of 72 checks had been proposed, and many were finalized during the workshop itself. Checks are continually maintained in a collaborative online document and archived as needed (Dataset: O'Brien et al., 2016). At the time of this writing, the total stands at 91 checks entered, with 32 implemented and 21 designated as "deprecated" or "postponed". Deprecated checks were sometimes obviated by other entries, but their record …

New checks and future development

The system will accommodate new checks as the community determines these are necessary, and Fig. 3 shows 19 checks added since the initial implementation in 2013. Further, existing checks may require modification; for example, a check with a response status of "warn" in 2013 may require reclassification to return an "error", or conversely a check's response status may be relaxed. The declarative format of the template allows such changes to be made with ease; simple edits to the XML file can modify a …
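
As a sketch of this declarative idea, the example below parses a single, hypothetical template entry and reads the severity it declares; the element and attribute names are invented for illustration and do not reproduce the actual ECC template schema. Reclassifying such a check from "warn" to "error" would then require only an edit to the XML, not a code change.

```java
// Parse a hypothetical declarative check entry and report the severity it
// declares. Editing statusType in the template would change how a failing
// check is reported, with no changes to the checking code itself.
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class TemplateSketch {

    public static void main(String[] args) throws Exception {
        String template =
                "<qualityCheck statusType=\"warn\">"
              + "<name>attributeNamesUnique</name>"
              + "</qualityCheck>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(template.getBytes(StandardCharsets.UTF_8)));
        Element check = doc.getDocumentElement();
        String name = check.getElementsByTagName("name").item(0).getTextContent();
        String severity = check.getAttribute("statusType");
        System.out.println(name + " reports failures as: " + severity);
    }
}
```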

Conclusion

Overall, the uniformity and usability of datasets in the PASTA catalog are higher than what was found in the first-generation LTER data catalog. We cannot quantify that improvement because experiences recorded with the first-generation catalog were anecdotal, and any checking was ad hoc or cursory at best. However, simply setting rules for admission to a catalog, with the involvement of the contributing community, has improved the landscape by encouraging redesign of data packaging systems, and …

Funding

This work was supported by the National Science Foundation [grant numbers OCE-0620276, OCE-1232779, and Cooperative Agreements DEB-0832652 and DEB-0936498].

Acknowledgements

We are grateful to the participants in the 2012 workshop and to the members of the LTER Information Management working group on quality assessment: S. Bohm, E. Boose, J. Downing, M. Gastil-Buhl, C. Gries, B. Leinfelder.

References

  • Fegraus, E.H., et al., 2005. Maximizing the value of ecological data with structured metadata: an introduction to Ecological Metadata Language (EML) and principles for metadata creation. Bull. Ecol. Soc. Am.

  • Gosz, J.R., et al., 2010. Twenty-eight years of the US-LTER program: experience, results and research questions.
