ABSTRACT
A dataset from the field of High Performance Computing (HPC) was curated with a focus on facilitating its reuse and appealing to a broader audience beyond HPC specialists. At an early stage in the research project, the curators gathered requirements from prospective users of the dataset, focusing on how and for which research projects they would reuse the data. Users' needs informed which curation tasks to conduct. These included adding more information elements to the dataset to expand its content scope; removing personal information; and packaging the data in a size, a format, and at a frequency of delivery that are convenient for access and analysis. The curation tasks are embedded in the software that produces the data, and are implemented as an automated workflow that spans the various HPC resources in which the dataset is generated, processed, and stored, and the Texas ScholarWorks institutional repository, through which the data is published. Within this distributed architecture, the integrated data creation and curation workflow complies with long-term preservation requirements, and is the first one implemented as a collaboration between the supercomputing center where the data is created on an ongoing basis and the University Libraries at UT Austin, where it is published. The targeted curation strategy included the design of proof-of-concept data analyses to evaluate whether the curated data met the reuse scenarios proposed by users. The results suggest that the dataset is understandable, and that researchers can use it to answer some of the research questions they posed. The results also pointed to specific elements of the curation strategy that had to be improved and revealed the difficulties involved in breaking data to new users.
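Two of the curation tasks named above, removing personal information and packaging the data at a regular delivery frequency, can be illustrated with a minimal sketch. The abstract does not describe the dataset schema or the delivery interval, so the field names (`user`, `account`, `date`), the monthly grouping, and the function name `curate` are all hypothetical and stand in for whatever the actual workflow uses.

```python
from collections import defaultdict

# Hypothetical names of fields that hold personal information;
# the real dataset schema is not given in the abstract.
PERSONAL_FIELDS = ("user", "account")


def curate(records):
    """Drop personal fields and group records into monthly packages.

    Each record is a dict with an ISO-formatted "date" string
    (for example "2015-03-14"); this layout is an assumption.
    """
    packages = defaultdict(list)
    for record in records:
        # Copy every field except those flagged as personal information.
        clean = {k: v for k, v in record.items() if k not in PERSONAL_FIELDS}
        month = clean["date"][:7]  # "YYYY-MM" prefix of the ISO date
        packages[month].append(clean)
    return dict(packages)
```

Grouping by month is only one possible delivery frequency; the point of the sketch is that both steps can run automatically inside the same pass that produces the data, as the abstract describes for the embedded workflow.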
Index Terms
- Data Curation with a Focus on Reuse