DOI: 10.1145/2910896.2910906
research-article

Data Curation with a Focus on Reuse

Published: 19 June 2016

ABSTRACT

A dataset from the field of High Performance Computing (HPC) was curated with a focus on facilitating its reuse and appealing to a broader audience beyond HPC specialists. At an early stage in the research project, the curators gathered requirements from prospective users of the dataset, focusing on how and for which research projects they would reuse the data. Users' needs informed which curation tasks to conduct, which included: adding more information elements to the dataset to expand its content scope; removing personal information; and packaging the data in a size, a format, and at a frequency of delivery that are convenient for access and analysis. The curation tasks are embedded in the software that produces the data and are implemented as an automated workflow that spans various HPC resources, in which the dataset is generated, processed, and stored, and the Texas ScholarWorks institutional repository, through which the data is published. Within this distributed architecture, the integrated data creation and curation workflow complies with long-term preservation requirements, and is the first one implemented as a collaboration between the supercomputing center, where the data is created on an ongoing basis, and the University Libraries at UT Austin, where it is published. The targeted curation strategy included the design of proof-of-concept data analyses to evaluate whether the curated data met the reuse scenarios proposed by users. The results suggest that the dataset is understandable, and that researchers can use it to answer some of the research questions they posed. Results also pointed to specific elements of the curation strategy that had to be improved and revealed the difficulties involved in breaking data to new users.


Published in

JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries
June 2016
316 pages
ISBN: 9781450342292
DOI: 10.1145/2910896

Copyright © 2016 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

JCDL '16 paper acceptance rate: 15 of 52 submissions (29%). Overall acceptance rate: 415 of 1,482 submissions (28%).
