"A Framework for the Preservation of a Docker Container"

Iain Emsley and David De Roure have published "A Framework for the Preservation of a Docker Container" in the International Journal of Digital Curation.

Here's an excerpt:

Reliably building and maintaining systems across environments is a continuing problem. A project or experiment may run for years. Software and hardware may change, as can the operating system. Containerisation is a technology that is used in a variety of companies, such as Google, Amazon and IBM, and scientific projects to rapidly deploy a set of services repeatably. Using Dockerfiles to ensure that a container is built repeatably, to allow conformance and easy updating when changes take place, is becoming common within projects. It is seen as part of sustainable software development. Containerisation technology occupies a dual space: it is both a repository of software and software itself. In considering Docker in this fashion, we should verify that the Dockerfile can be reproduced. Using a subset of the Dockerfile specification, a domain specific language is created to ensure that Dockerfiles can be reused at a later stage to recreate the original environment. We provide a simple framework to address the question of the preservation of containers and their environment. We present experiments on an existing Dockerfile and conclude with a discussion of future work. Taking our work, a pipeline was implemented to check that a defined Dockerfile conforms to our desired model and to extract the Docker and operating system details. This will help the reproducibility of results by creating the machine environment and package versions. It also helps development and testing through ensuring that the system is repeatably built and that any changes in the software environment can be equally shared in the Dockerfile. This work supports not only the citation process but also the open scientific one by providing environmental details of the work. As a part of the pipeline to create the container, we capture the processes used and put them into the W3C PROV ontology. This provides the potential for providing it with a persistent identifier and traceability of the processes used to preserve the metadata. Our future work will look at the question of linking this output to a workflow ontology to preserve the complete workflow with the commands and parameters to be given to the containers. We see this provenance within the build process as useful to provide a complete overview of the workflow.
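
The conformance check the excerpt describes can be illustrated with a minimal sketch. It is not the authors' pipeline: the allowed instruction set (ALLOWED), the rule that the first instruction must be FROM, and the function names are all assumptions made for illustration only.

```python
# Hypothetical subset of Dockerfile instructions; the paper's actual model is not given here.
ALLOWED = {"FROM", "RUN", "COPY", "ENV", "WORKDIR", "CMD", "LABEL"}


def logical_lines(text):
    """Yield Dockerfile instructions, joining backslash continuations
    and skipping blank lines and comment lines."""
    buffer = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("\\"):
            buffer += line[:-1] + " "
            continue
        yield buffer + line
        buffer = ""
    if buffer:
        # File ended in the middle of a continued instruction.
        yield buffer.strip()


def check_dockerfile(text):
    """Report whether a Dockerfile stays within ALLOWED and list its base images."""
    instructions = [line.split(None, 1) for line in logical_lines(text)]
    names = [parts[0].upper() for parts in instructions]
    disallowed = sorted(set(names) - ALLOWED)

    base_images = []
    for parts in instructions:
        if parts[0].upper() == "FROM" and len(parts) > 1:
            # Skip option flags such as --platform=... before the image name.
            tokens = [t for t in parts[1].split() if not t.startswith("--")]
            if tokens:
                base_images.append(tokens[0])

    # Assumed rule for this sketch: the first instruction must be FROM.
    starts_with_from = bool(names) and names[0] == "FROM"
    return {
        "conforms": starts_with_from and not disallowed,
        "disallowed": disallowed,
        "base_images": base_images,
    }
```

Calling check_dockerfile on the text of a Dockerfile returns a small dictionary; the base image names it collects are the kind of operating system detail that could then be recorded as preservation metadata.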

Research Data Curation Bibliography, Version 8 | Digital Curation and Digital Preservation Works | Open Access Works | Digital Scholarship | Digital Scholarship Sitemap

"Archiving Large-Scale Legacy Multimedia Research Data: A Case Study"

Claudia Yogeswaran and Kearsy Cormier have published "Archiving Large-Scale Legacy Multimedia Research Data: A Case Study" in the International Journal of Digital Curation.

Here's an excerpt:

In this paper we provide a case study of the creation of the DCAL Research Data Archive at University College London. In doing so, we assess the various challenges associated with archiving large-scale legacy multimedia research data, given the lack of literature on archiving such datasets. We address issues such as the anonymisation of video research data, the ethical challenges of managing legacy data and historic consent, ownership considerations, the handling of large-size multimedia data, as well as the complexity of multi-project data from a number of researchers and legacy data from eleven years of research.


"If These Crawls Could Talk: Studying and Documenting Web Archives Provenance"

Emily Maemura et al. have self-archived "If These Crawls Could Talk: Studying and Documenting Web Archives Provenance."

Here's an excerpt:

This study examines the decision space of web archives and its role in shaping what is and what is not captured in the web archiving process. By comparing how three different web archives collections were created and documented, we investigate how curatorial decisions interact with technical and external factors and we compare commonalities and differences. The findings reveal the need to understand both the social and technical context that shapes those decisions and the ways in which these individual decisions interact. Based on the study, we propose a framework for documenting key dimensions of a collection that addresses the situated nature of the organizational context, technical specificities, and unique characteristics of web materials that are the focus of a collection.


"The State of Assessing Data Stewardship Maturity – An Overview"

Ge Peng has published "The State of Assessing Data Stewardship Maturity – An Overview" in Data Science Journal.

Here's an excerpt:

Data stewardship encompasses all activities that preserve and improve the information content, accessibility, and usability of data and metadata. Recent regulations, mandates, policies, and guidelines set forth by the U.S. government, federal and other funding agencies, scientific societies and scholarly publishers, have levied stewardship requirements on digital scientific data. This elevated level of requirements has increased the need for a formal approach to stewardship activities that supports compliance verification and reporting. Meeting or verifying compliance with stewardship requirements requires assessing the current state, identifying gaps, and, if necessary, defining a roadmap for improvement. This, however, touches on standards and best practices in multiple knowledge domains. Therefore, data stewardship practitioners, especially those at data repositories or data service centers or associated with data stewardship programs, can benefit from knowledge of existing maturity assessment models. This article provides an overview of the current state of assessing stewardship maturity for federally funded digital scientific data. A brief description of existing maturity assessment models and related application(s) is provided. This helps stewardship practitioners to readily obtain basic information about these models. It allows them to evaluate each model’s suitability for their unique verification and improvement needs.


"Text Data Mining from the Author’s Perspective: Whose Text, Whose Mining, and to Whose Benefit?"

Christine L. Borgman has self-archived "Text Data Mining from the Author's Perspective: Whose Text, Whose Mining, and to Whose Benefit?"

Here's an excerpt:

Given the many technical, social, and policy shifts in access to scholarly content since the early days of text data mining, it is time to expand the conversation about text data mining from concerns of the researcher wishing to mine data to include concerns of researcher-authors about how their data are mined, by whom, for what purposes, and to whose benefits.


"The Modern Research Data Portal: A Design Pattern for Networked, Data-Intensive Science"

Kyle Chard et al. have published "The Modern Research Data Portal: A Design Pattern for Networked, Data-Intensive Science" in PeerJ.

Here's an excerpt:

In this article, we first define the problems that research data portals address, introduce the legacy approach, and examine its limitations. We then introduce the MRDP design pattern and describe its realization via the integration of two elements: Science DMZs (Dart et al., 2013) (high-performance network enclaves that connect large-scale data servers directly to high-speed networks) and cloud-based data management and authentication services such as those provided by Globus (Chard, Tuecke & Foster, 2014). We then outline a reference implementation of the MRDP design pattern, also provided in its entirety on the companion web site, https://docs.globus.org/mrdp, that the reader can study—and, if they so desire, deploy and adapt to build their own high-performance research data portal. We also review various deployments to show how the MRDP approach has been applied in practice: examples like the National Center for Atmospheric Research's Research Data Archive, which provides for high-speed data delivery to thousands of geoscientists; the Sanger Imputation Service, which provides for online analysis of user-provided genomic data; the Globus data publication service, which provides for interactive data publication and discovery; and the DMagic data sharing system for data distribution from light sources. We conclude with a discussion of related technologies and summary.


"A Longitudinal Assessment of the Persistence of Twitter Datasets"

Arkaitz Zubiaga has self-archived "A Longitudinal Assessment of the Persistence of Twitter Datasets."

Here's an excerpt:

With social media datasets being increasingly shared by researchers, it also presents the caveat that those datasets are not always completely replicable. Having to adhere to requirements of platforms like Twitter, researchers cannot release the raw data and instead have to release a list of unique identifiers, which others can then use to recollect the data from the platform themselves. This leads to the problem that subsets of the data may no longer be available, as content can be deleted or user accounts deactivated. To quantify the impact of content deletion in the replicability of datasets in a long term, we perform a longitudinal analysis of the persistence of 30 Twitter datasets, which include over 147 million tweets. . . . Even though the ratio of available tweets keeps decreasing as the dataset gets older, we find that the textual content of the recollected subset is still largely representative of the whole dataset that was originally collected. The representativity of the metadata, however, keeps decreasing over time, both because the dataset shrinks and because certain metadata, such as the users' number of followers, keeps changing.


"Andrew W. Mellon Foundation Awards Grant to the Internet Archive for Long Tail Journal Preservation"

The Internet Archive has released "Andrew W. Mellon Foundation Awards Grant to the Internet Archive for Long Tail Journal Preservation."

Here's an excerpt:

The Andrew W. Mellon Foundation has awarded a research and development grant to the Internet Archive to address the critical need to preserve the "long tail" of open access scholarly communications. The project, Ensuring the Persistent Access of Long Tail Open Access Journal Literature, builds on prototype work identifying at-risk content held in web archives by using data provided by identifier services and registries. Furthermore, the project expands on work acquiring missing open access articles via customized web harvesting, improving discovery and access to these materials from within extant web archives, and developing machine learning approaches, training sets, and cost models for advancing and scaling this project’s work.


"Data Sustainability and Reuse Pathways of Natural Resources and Environmental Scientists"

Yi Shen has self-archived "Data Sustainability and Reuse Pathways of Natural Resources and Environmental Scientists."

Here's an excerpt:

This paper presents a multifarious examination of natural resources and environmental scientists' adventures navigating the policy change towards open access and cultural shift in data management, sharing, and reuse. Situated in the institutional context of Virginia Tech, a focus group and multiple individual interviews were conducted exploring the domain scientists' all-around experiences, performances, and perspectives on their collection, adoption, integration, preservation, and management of data. . . . Based on these findings, this study provides suggestions on data modeling and knowledge representation strategies to support the long-term viability, stewardship, accessibility, and sustainability of scientific data.


"Portage Releases Draft Institutional RDM Strategy Template"

The Portage Network has released "Portage Releases Draft Institutional RDM Strategy Template."

Here's an excerpt:

In response to the anticipated Tri-Agency research data management (RDM) policy, the Portage Institutional RDM Strategy Working Group has released a draft template and supporting guidance document that are designed to assist Canadian research institutions in developing an overarching strategy for RDM. These resources will exist as living documents, to be updated by the Working Group as needed.

See also: Template—Institutional Research Data Management Strategy and Institutional Research Data Management Strategy: Guidance Document.


"From Passive to Active, From Generic to Focused: How Can an Institutional Data Archive Remain Relevant in a Rapidly Evolving Landscape?"

Maria Cruz et al. have self-archived "From Passive to Active, From Generic to Focused: How Can an Institutional Data Archive Remain Relevant in a Rapidly Evolving Landscape?"

Here's an excerpt:

Founded in 2008 as an initiative of the libraries of three of the four technical universities in the Netherlands, the 4TU.Centre for Research Data (4TU.Research Data) provides since 2010 a fully operational, cross-institutional, long-term archive that stores data from all subjects in applied sciences and engineering.


"Stewardship in the ‘Age of Algorithms’"

Clifford Lynch has published "Stewardship in the 'Age of Algorithms'" in First Monday.

Here's an excerpt:

This paper explores pragmatic approaches that might be employed to document the behavior of large, complex socio-technical systems (often today shorthanded as "algorithms") that centrally involve some mixture of personalization, opaque rules, and machine learning components. Thinking rooted in traditional archival methodology–focusing on the preservation of physical and digital objects, and perhaps the accompanying preservation of their environments to permit subsequent interpretation or performance of the objects–has been a total failure for many reasons, and we must address this problem. The approaches presented here are clearly imperfect, unproven, labor-intensive, and sensitive to the often hidden factors that the target systems use for decision-making (including personalization of results, where relevant); but they are a place to begin, and their limitations are at least outlined. Numerous research questions must be explored before we can fully understand the strengths and limitations of what is proposed here. But it represents a way forward.


"CLIR Receives Sloan Foundation Grants for Software and Data Curation Fellows, Energy Fellows"

CLIR has released "CLIR Receives Sloan Foundation Grants for Software and Data Curation Fellows, Energy Fellows."

Here's an excerpt:

A $521,200 grant from Sloan's Energy and Environment program—its first to CLIR—will create a cohort of CLIR/Digital Library Federation (DLF) Postdoctoral Fellows in Data Curation for Energy Economics, a new area of focus for the postdoctoral fellowship program. Energy fellows will have joint appointments between energy research centers and libraries at four major universities for two years starting in 2018.

A $925,361 grant from Sloan's Digital Information Technology program, which has funded research data curation fellowships since 2012, will help support eight new scholar-practitioners to take leading roles in the development of sustainable approaches to software and research data curation in the sciences and social sciences.


"ARL Awarded Sloan Grant to Help Preserve Software, Save Cultural Record, Advance Discovery"

ARL has released "ARL Awarded Sloan Grant to Help Preserve Software, Save Cultural Record, Advance Discovery."

Here's an excerpt:

The Association of Research Libraries (ARL) has been awarded a $315,000 grant from the Alfred P. Sloan Foundation to develop and disseminate a Code of Best Practices in Fair Use for Software Preservation. This code will give individuals and institutions clear guidance on the legality of archiving software, in order to ensure continued access to digital files of all kinds and to offer hands-on understanding of the history of technology.


Staffing for Effective Digital Preservation 2017: An NDSA Report

The National Digital Stewardship Alliance has released Staffing for Effective Digital Preservation 2017: An NDSA Report.

Here's an excerpt:

The 2017 Digital Preservation Staffing Survey provides a useful snapshot of the way digital preservation is accomplished in 2017 and how its practitioners feel about the effectiveness of their current organizational structures. It also builds on the 2012 survey and begins to establish data with which the digital preservation community can identify trends in staffing in the field.


"The Evolution, Approval and Implementation of the U.S. Geological Survey Science Data Lifecycle Model"

John L. Faundeen and Vivian B. Hutchison have published "The Evolution, Approval and Implementation of the U.S. Geological Survey Science Data Lifecycle Model" in the Journal of eScience Librarianship.

Here's an excerpt:

This paper details how the U.S. Geological Survey (USGS) Community for Data Integration (CDI) Data Management Working Group developed a Science Data Lifecycle Model, and the role the Model plays in shaping agency-wide policies and data management applications. Starting with an extensive literature review of existing data lifecycle models, representatives from various backgrounds in USGS attended a two-day meeting where the basic elements for the Science Data Lifecycle Model were determined. Refinements and reviews spanned two years, leading to finalization of the model and documentation in a formal agency publication.


"Evolving Roles of Preservation Professionals: Trends in Position Announcements from 2004 to 2015"

Mary M. Miller and Martha Horan have published "Evolving Roles of Preservation Professionals: Trends in Position Announcements from 2004 to 2015" in Library Resources & Technical Services.

Here's an excerpt:

As research libraries continue to expand the scope of content they acquire, manage, and make accessible, the preservation charge within organizations is broadening. Libraries and other cultural heritage institutions must balance the preservation of books, manuscripts, archives, and audiovisual materials with born-digital and digitized content. As preservation challenges and strategies evolve, professional positions in preservation must also evolve to meet the needs of academic and other cultural institutions. The ability to quantify how preservation positions are changing, and to identify the required skill sets and educational backgrounds needed for preservation professionals, is central to navigating this shift. To begin to address this, the authors collected and analyzed announcements for professional preservation positions in libraries and archives from 2004 through 2015. They compared the contents of announcements between earlier and more recent years to identify potential trends in preservation employment.


"Pursuing Best Performance in Research Data Management by Using the Capability Maturity Model and Rubrics"

Jian Qin et al. have published "Pursuing Best Performance in Research Data Management by Using the Capability Maturity Model and Rubrics" in the Journal of eScience Librarianship.

Here's an excerpt:

The RDM CMM [Capability Maturity Model] includes five chapters describing five key process areas for research data management: 1) data management in general; 2) data acquisition, processing, and quality assurance; 3) data description and representation; 4) data dissemination; and 5) repository services and preservation. In each chapter, key data management practices are organized into four groups according to the CMM's generic processes: commitment to perform, ability to perform, tasks performed, and process assessment (combining the original measurement and verification). For each area of practice, the document provides a rubric to help projects or organizations assess their level of maturity in RDM.


Version 8 of the Research Data Curation Bibliography Released

Digital Scholarship has released Version 8 of the Research Data Curation Bibliography. This selective bibliography includes over 680 English-language articles, books, and technical reports that are useful in understanding the curation of digital research data in academic and other research institutions. Printed from the HTML page, it is over 130 pages long.

The Research Data Curation Bibliography covers topics such as research data creation, acquisition, metadata, provenance, repositories, management, policies, support services, funding agency requirements, peer review, publication, citation, sharing, reuse, and preservation.

Most sources have been published from January 2009 through September 2017; however, a limited number of earlier key sources are also included. The bibliography includes links to freely available versions of included works. If such versions are unavailable, links to the publishers' descriptions are provided.

Abstracts are included in this bibliography if a work is under a Creative Commons Attribution License (BY and national/international variations), a Creative Commons public domain dedication (CC0), or a Creative Commons Public Domain Mark and this is clearly indicated in the work.

The Research Data Curation Bibliography is under a Creative Commons Attribution 4.0 International License.

Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works and 2012 Supplement | Digital Curation and Digital Preservation Works | Digital Scholarship | Digital Scholarship Sitemap

"Persistence Statements: Describing Digital Stickiness"

John Kunze et al. have published "Persistence Statements: Describing Digital Stickiness" in Data Science Journal.

Here's an excerpt:

In this paper we present a draft vocabulary for making "persistence statements." These are simple tools for pragmatically addressing the concern that anyone feels upon experiencing a broken web link. Scholars increasingly use scientific and cultural assets in digital form, but choosing which among many objects to cite for the long term can be difficult. There are few well-defined terms to describe the various kinds and qualities of persistence that object repositories and identifier resolvers do or don’t provide. Given an object's identifier, one should be able to query a provider to retrieve human- and machine-readable information to help judge the level of service to expect and help gauge whether the identifier is durable enough, as a sort of long-term bet, to include in a citation. The vocabulary should enable providers to articulate persistence policies and set user expectations.


100% Online Professional Science Master’s Degree in Digital Curation at UNC Chapel Hill Announced

University of North Carolina at Chapel Hill has announced its new Professional Science Master's Degree in Digital Curation.

Here's an excerpt from the announcement:

This innovative, 100% online program is now accepting applications for the initial cohort of students who will begin classes in January 2018. Deadline to apply for January admission is October 10, 2017.


The Role of Research Libraries in the Creation, Archiving, Curation, and Preservation of Tools for the Digital Humanities

RLUK has released The Role of Research Libraries in the Creation, Archiving, Curation, and Preservation of Tools for the Digital Humanities.

Here's an excerpt:

The purpose of this report is to present and discuss the results of the 'Research Libraries and Digital Humanities Tools' project undertaken by RLUK. The project aimed to explore the role that libraries currently have or can potentially have in the creation, archiving, curation, and preservation of tools for Digital Humanities research; it is part of RLUK's goal to understand the role that research libraries play in digital scholarship, identify specific areas where they can add value as well as facilitate the sharing of existing best practice.

Therefore, a survey was conducted where professionals, mostly from research libraries within the RLUK membership, took part and reported on the variety of Digital Humanities projects they support and the different ways in which they engage with scholarly work in the area. Additional discussions with some of these participants not only shed further light into the collaborative activities formed in the context of various initiatives, such as the production and preservation of tools, but also into the different models of involvement in Digital Humanities scholarship.


"Information Scientist Herbert Van de Sompel to Receive Paul Evan Peters Award"

CNI has released "Information Scientist Herbert Van de Sompel to Receive Paul Evan Peters Award."

Here's an excerpt:

An accomplished researcher and information scientist, Van de Sompel is perhaps best known for his role in the development of protocols designed to expose data and make them accessible to other systems, forging links that connect related information, thereby enhancing, facilitating, and deepening the research process. These initiatives include the OpenURL framework (stemming from his earlier work on the SFX link resolver), as well as the Open Archives Initiative (OAI), which included the Protocol for Metadata Harvesting (OAI-PMH) and the Object Reuse and Exchange (OAI-ORE) scheme. Other notable contributions include the Memento protocol, which enables browsers to access earlier versions of the Web easily, and ResourceSync, which allows applications to remain synchronized with evolving content collections.
