Digital Curation & Digital Preservation

"Data Curation in Interdisciplinary and Highly Collaborative Research"

This paper provides a systematic analysis of publications that discuss data curation in interdisciplinary and highly collaborative research (IHCR). Using content analysis methodology, it examined 159 publications and identified patterns in definitions of interdisciplinarity, projects’ participants and methodologies, and approaches to data curation. The findings suggest that data is a prominent component in interdisciplinarity. In addition to crossing disciplinary and other boundaries, IHCR is defined as curating and integrating heterogeneous data and creating new forms of knowledge from it. Using personal experiences and descriptive approaches, the publications discussed challenges that data curation in IHCR faces, including an increased overhead in coordination and management, lack of consistent metadata practices, and custom infrastructure that makes interoperability across projects, domains, and repositories difficult. The paper concludes with suggestions for future research.

https://doi.org/10.2218/ijdc.v17i1.835

Scholarly Communication Librarianship and Open Knowledge

The book consists of three parts. Part I offers definitions of scholarly communication and scholarly communication librarianship and provides an introduction to the social, economic, technological, and policy/legal pressures that underpin and shape scholarly communication work in libraries. These pressures, which have framed ACRL’s understanding of scholarly communication for the better part of the past two decades, have unsettled many foundational assumptions and practices in the field, removing core pillars of scholarly communication as it was practiced in the twentieth century. These pressures have also cleared fresh ground, and scholarly communication practitioners have begun to seed the space with values and practices designed to renew and often improve the field. Part II begins with an introduction to "open," the core response to the pressures described in part I. This part offers a general overview of the idea of openness in scholarly communication followed by chapters on different permutations and practices of open, each edited by a recognized expert of these areas with authors of their selection. Amy Buckland edited chapter 2.1, "Open Access." Brianna Marshall edited chapter 2.2, "Open Data." Lillian Hogendoorn edited chapter 2.3, "Open Education." Micah Vandegrift edited chapter 2.4, "Open Science and Infrastructure." Each of them brought on incredible expertise through contributors whom they identified, through both original contributions and repurposing existing openly licensed work, which is something we want to model where possible. Part III consists of twenty-four concise perspectives, intersections, and case studies from practicing librarians and closely related stakeholders, which we hope will stimulate discussion and reflection on theory and implications for practice. In every single case, we’re really excited by the editors and authors and the ideas they bring to the whole. 
Each contribution features light pedagogical apparatuses like suggested further reading, discussion or reflection prompts, and potential activities. It’s all available for free and openly licensed with a Creative Commons Attribution Non-Commercial (CC BY-NC) license, so anyone is encouraged to grab whatever parts are useful and to adapt and repurpose and improve them to meet specific course goals and student needs within the confines of the license.

https://bit.ly/SCLAOK

"ACME-FAIR: a Guide for Research Performing Organisations (RPO)"

The overall purpose of ACME-FAIR is to help those managing and delivering relevant professional services to self-assess how they are enabling researchers and their colleagues to do just that. Each part deals with one of the key issues that Research Performing Organisations (RPO) face in establishing the capabilities to put the FAIR principles into practice. . . . Each of the 7 guides has a thematic introduction, an overview of the relevant capabilities, and a rubric for assessing the levels of maturity and community engagement for each capability.

https://tinyurl.com/yckfdjtd

"An Approach to Assess the Quality of Jupyter Projects Published by GLAM Institutions"

Jupyter Notebooks have become a powerful tool to foster use of these collections by digital humanities researchers. Based on previous approaches for quality assessment, which have been adapted for cultural heritage collections, this paper proposes a methodology for assessing the quality of projects based on Jupyter Notebooks published by relevant GLAM institutions. A list of projects based on Jupyter Notebooks using cultural heritage data has been evaluated. Common features and best practices have been identified. A detailed analysis, that can be useful for organizations interested in creating their own Jupyter Notebooks projects, has been provided. Open issues requiring further work and additional avenues for exploration are outlined.

https://doi.org/10.1002/asi.24835

"Umbrella Data Management Plans to Integrate FAIR Data: Lessons From the ISIDORe and BY-COVID Consortia for Pandemic Preparedness"

The Horizon Europe project ISIDORe is dedicated to pandemic preparedness and responsiveness research. It brings together 17 research infrastructures (RIs) and networks to provide a broad range of services to infectious disease researchers. An efficient and structured treatment of data is central to ISIDORe’s aim to furnish seamless access to its multidisciplinary catalogue of services, and to ensure that users’ results are treated FAIRly. ISIDORe therefore requires a data management plan (DMP) covering both access management and research outputs, applicable over a broad range of disciplines, and compatible with the constraints and existing practices of its diverse partners.

Here, we describe how, to achieve that aim, we undertook an iterative, step-by-step, process to build a community-approved living document, identifying good practices and processes, on the basis of use cases, presented as proof of concepts. International fora such as the RDA and EOSC, and primarily the BY-COVID project, furnished registries, tools and online data platforms, as well as standards, and the support of data scientists. Together, these elements provide a path for building an umbrella, FAIR-compliant DMP, aligned as fully as possible with FAIR principles, which could also be applied as a framework for data management harmonisation in other large-scale, challenge-driven projects. Finally, we discuss how data management and reuse can be further improved through the use of knowledge models when writing DMPs and, how, in the future, an inter-RI network of data stewards could contribute to the establishment of a community of practice, to be integrated subsequently into planned trans-RI competence centres.

https://doi.org/10.5334/dsj-2023-035

Digital Preservation Coalition: "Digital Preservation Documentation: A Guide"

This guide discusses the importance of documentation, who it is for, and highlights some of the features of good and bad documentation. It goes on to provide some tips on creating documentation, including some of the tools or platforms available. Review and update of documentation is discussed, as are requirements for long term preservation.

http://doi.org/10.7207/documentation-23

"Data Management Plan Tools: Overview and Evaluation"

Data Management Plans (DMPs) are crucial for a structured research data management and often a mandatory part of research proposals. DMP tools support the development of DMPs. Among the variety of tools available, it can be difficult for researchers, data stewards and institutions to choose the one that is most appropriate for their specific needs and context. We evaluated 18 DMP tools according to 31 requirement parameters covering aspects relating to basic functions, technical aspects and user-friendliness. The highest total evaluation scores were reached by Data Stewardship Wizard (703.5), DMPTool (615.5) and RDMO NFDI4Ing (549.5). The tools evaluated satisfied between 10 % and 87 % of the requirement parameters. 11 tools cover at least half of the parameters. In terms of correlation among the tools, which indicates to which degree their scores in the different requirement parameters are alike, we found the highest correlation for ezDMP and GFBio DMPT. Regarding the relatedness between the tools, 85 % of the DMP tools were positively and 16 % negatively correlated. Accounting for the recent developments in the area of DMP tools, this study provides an up-to-date evaluation that can support tool developers in identifying potential improvements, and hosting institutions to select a tool suited to their specific needs.

https://tinyurl.com/yewhv8rn

"Understanding Barriers Affecting the Adoption and Usage of Open Access Data in the Context of Organizations"

Although the benefits of organizational adoption are significant, most OAD-related projects fail because of organizational barriers and resistance to adoption. This study first aims to find these organizational barriers to adopting OAD to raise awareness of the obstacles organizations must overcome. Towards this aim, after conducting a systematic literature review (SLR) and an expert panel, a research model based on the Technology – Organization – Environment (TOE) framework is proposed in this study. As a result of SLR, 97 barriers were identified from ten primary studies. After critically examining these barriers, a research model classifying 22 crucial barriers to organizational OAD adoption based on the TOE framework is proposed.

https://doi.org/10.1016/j.dim.2023.100049

Paywall: "Images as Metadata: A New Perspective for Describing Research Data"

Through studies and work developed over the last few years, we propose a new approach to description, where images can have a preponderant role in the description of data, assuming the role of metadata. We present several pieces of evidence, point out their challenges and determine the opportunities this new perspective can have in the research. Images have specific characteristics that can be leveraged in improving data description. Historical evidence establishes that images have always been used and produced in research, yet their representational ability has never been harnessed to describe data and give more context to the scientific process.

https://doi.org/10.1080/19386389.2023.2252722

"Tracing Data: A Survey Investigating Disciplinary Differences in Data Citation"

Data citations, or citations in reference lists to data, are increasingly seen as an important means to trace data reuse and incentivize data sharing. Although disciplinary differences in data citation practices have been well documented via scientometric approaches, we do not yet know how representative these practices are within disciplines. Nor do we yet have insight into researchers’ motivations for citing — or not citing — data in their academic work. Here, we present the results of the largest known survey (n = 2,492) to explicitly investigate data citation practices, preferences, and motivations, using a representative sample of academic authors by discipline, as represented in the Web of Science (WoS). We present findings about researchers’ current practices and motivations for reusing and citing data and also examine their preferences for how they would like their own data to be cited. We conclude by discussing disciplinary patterns in two broad clusters, focusing on patterns in the social sciences and humanities, and consider the implications of our results for tracing and rewarding data sharing and reuse.

https://doi.org/10.1162/qss_a_00264

"Expanding the Data Ark: An Attempt to Make the Data from Highly Cited Social Science Papers Publicly Available"

Access to scientific data can enable independent reuse and verification; however, most data are not available and become increasingly irrecoverable over time. This study aimed to retrieve and preserve important datasets from 160 of the most highly-cited social science articles published between 2008-2013 and 2015-2018. We asked authors if they would share data in a public repository — the Data Ark — or provide reasons if data could not be shared. Of the 160 articles, data for 117 (73%, 95% CI [67% – 80%]) were not available and data for 7 (4%, 95% CI [0% – 12%]) were available with restrictions. Data for 36 (22%, 95% CI [16% – 30%]) articles were available in unrestricted form: 29 of these datasets were already available and 7 datasets were made available in the Data Ark. Most authors did not respond to our data requests and a minority shared reasons for not sharing, such as legal or ethical constraints. These findings highlight an unresolved need to preserve important scientific datasets and increase their accessibility to the scientific community.

https://doi.org/10.31222/osf.io/w9crz
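
The availability figures in the Data Ark abstract above can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the abstract; the function name `percentage_shares` is made up for illustration, and the confidence intervals are not reproduced because the abstract does not say how they were computed.

```python
# Counts as reported in the "Expanding the Data Ark" abstract.
TOTAL_ARTICLES = 160
counts = {
    "not available": 117,
    "available with restrictions": 7,
    "available unrestricted": 36,
}
already_public = 29       # unrestricted datasets that were already available
added_to_data_ark = 7     # unrestricted datasets newly deposited in the Data Ark


def percentage_shares(category_counts, total):
    """Return each category's share of the total, in percent."""
    return {name: 100 * n / total for name, n in category_counts.items()}


shares = percentage_shares(counts, TOTAL_ARTICLES)

# The three categories should account for every article, and the two
# sub-counts should add up to the unrestricted total.
assert sum(counts.values()) == TOTAL_ARTICLES
assert already_public + added_to_data_ark == counts["available unrestricted"]

for name, share in shares.items():
    print(f"{name}: {counts[name]} of {TOTAL_ARTICLES} ({share:.1f}%)")
```

The shares printed with one decimal place are consistent with the whole-number percentages quoted in the abstract.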

Digital Preservation Coalition: Choosing a Persistent Identifier Type for Your Digital Objects

This report is intended to help you get started using persistent identifiers (PIDs) for digital objects. Its intended audience is people who are involved in digital preservation in heritage and research organizations. The report answers questions such as: "What are persistent identifiers?", "Why are they important?", "Which type should you choose?", "Are you ready for them?", and "How should you implement them?". The report does not specifically cover persistent identifiers for people, organizations, grants, workflows, and so on, but some of the same general concepts would also apply.

http://doi.org/10.7207/twgn23-02

"Towards a Toolbox for Automated Assessment of Machine-Actionable Data Management Plans"

Most research funders require Data Management Plans (DMPs). The review process can be time consuming, since reviewers read text documents submitted by researchers and provide their feedback. Moreover, it requires specific expert knowledge in data stewardship, which is scarce. Machine-actionable Data Management Plans (maDMPs) and semantic technologies increase the potential for automatic assessment of information contained in DMPs. However, the level of automation and new possibilities are still not well-explored and leveraged. This paper discusses methods for the automation of DMP assessment. It goes beyond generating human-readable reports. It explores how the information contained in maDMPs can be used to provide automated pre-assessment or to fetch further information, allowing reviewers to better judge the content. We map the identified methods to various reviewer goals.

https://doi.org/10.5334/dsj-2023-028
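
To make the idea of automated pre-assessment in the maDMP entry above more concrete, here is a minimal sketch of a rule-based check over a machine-actionable DMP held as a Python dictionary. It is not the toolbox described in the paper. The field names (`dmp`, `dataset`, `dataset_id`, `distribution`, `license`) follow my reading of the RDA DMP Common Standard and should be treated as assumptions; the function `preassess`, the example document, and its placeholder DOI are hypothetical.

```python
def preassess(madmp):
    """Return a list of human-readable findings for one maDMP document.

    The rules are illustrative only: they flag datasets without an
    identifier, without any distribution, or with an unlicensed distribution.
    """
    findings = []
    datasets = madmp.get("dmp", {}).get("dataset", [])
    if not datasets:
        findings.append("No datasets are described.")
    for ds in datasets:
        title = ds.get("title", "<untitled>")
        if not ds.get("dataset_id", {}).get("identifier"):
            findings.append(f"{title}: no persistent identifier given.")
        distributions = ds.get("distribution", [])
        if not distributions:
            findings.append(f"{title}: no distribution listed.")
        for dist in distributions:
            if not dist.get("license"):
                findings.append(f"{title}: distribution has no license.")
    return findings


# Hypothetical example document; the DOI is a placeholder, not a real record.
example = {
    "dmp": {
        "title": "Hypothetical project DMP",
        "dataset": [
            {
                "title": "Survey responses",
                "dataset_id": {
                    "identifier": "https://doi.org/10.1234/example",
                    "type": "doi",
                },
                "distribution": [{"title": "CSV export", "license": []}],
            },
            {"title": "Interview transcripts"},
        ],
    }
}

findings = preassess(example)
for finding in findings:
    print(finding)
```

A reviewer would still judge the plan; a check like this only surfaces gaps that can be detected mechanically before the human review starts.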

"Engaging with Researchers and Raising Awareness of FAIR and Open Science through the FAIR+ Implementation Survey Tool (FAIRIST)"

Seven years after the seminal paper on FAIR was published, that introduced the concept of making research outputs Findable, Accessible, Interoperable, and Reusable, researchers still struggle to understand how to implement the principles. For many researchers, FAIR promises long-term benefits for near-term effort, requires skills not yet acquired, and is one more thing in a long list of unfunded mandates and onerous requirements for scientists. Even for those required to, or who are convinced that they must make time for FAIR research practices, their preference is for just-in-time advice properly sized to the scientific artifacts and process. Because of the generality of most FAIR implementation guidance, it is difficult for a researcher to adjust to the advice according to their situation. Technological advances, especially in the area of artificial intelligence (AI) and machine learning (ML), complicate FAIR adoption, as researchers and data stewards ponder how to make software, workflows, and models FAIR and reproducible. The FAIR+ Implementation Survey Tool (FAIRIST) mitigates the problem by integrating research requirements with research proposals in a systematic way. FAIRIST factors in new scholarly outputs, such as nanopublications and notebooks, and the various research artifacts related to AI research (data, models, workflows, and benchmarks). Researchers step through a self-serve survey process and receive a table ready for use in their data management plan (DMP) and/or work plan, while gaining awareness of the FAIR Principles and Open Science concepts. FAIRIST is a model that uses part of the proposal process as a way to do outreach, raise awareness of FAIR dimensions and considerations, while providing timely assistance for competitive proposals.

https://doi.org/10.5334/dsj-2023-032

"The Effects of Research Data Management Services: Associating the Data Curation Lifecycle with Open Research Output"

This study seeks to understand the relationship between research data management (RDM) services framed in the data curation life cycle and the production of open data. An electronic questionnaire was distributed to US researchers and RDM specialists, and the results were analyzed using Chi-Square tests for association. The data curation life cycle does associate with the production of open data and shareable research, but tasks like data management plans have stronger associations with the production of open data. The findings analyze the intersection of these concepts and provide insight into RDM services that facilitate the production of open data and shareable research.

https://doi.org/10.5860/crl.84.5.751
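
For readers unfamiliar with the Chi-Square test for association mentioned in the RDM services entry above, the sketch below computes the Pearson chi-square statistic for a contingency table in plain Python. The 2x2 table is invented for illustration (rows: used an RDM service or not; columns: produced open data or not) and has nothing to do with the study's actual survey responses; the function name `chi_square_statistic` is likewise hypothetical.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts.

    Expected counts assume independence of rows and columns:
    expected[i][j] = row_total[i] * col_total[j] / grand_total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts, NOT data from the study.
hypothetical = [
    [30, 10],  # used the service: open data yes / no
    [20, 40],  # did not use the service: open data yes / no
]
statistic, dof = chi_square_statistic(hypothetical)
print(f"chi-square = {statistic:.2f}, degrees of freedom = {dof}")
```

With one degree of freedom, a statistic above roughly 3.84 is conventionally read as a significant association at the 0.05 level; a statistics library would also report the exact p-value.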

"re3data — Indexing the Global Research Data Repository Landscape Since 2012"

For more than ten years, re3data, a global registry of research data repositories (RDRs), has been helping scientists, funding agencies, libraries, and data centers with finding, identifying, and referencing RDRs. As the world’s largest directory of RDRs, re3data currently describes over 3,000 RDRs on the basis of a comprehensive metadata schema. The service allows searching for RDRs of any type and from all disciplines, and users can filter results based on a wide range of characteristics. The re3data RDR descriptions are available as Open Data accessible through an API and are utilized by numerous Open Science services. re3data is engaged in various initiatives and projects concerning data management and is mentioned in the policies of many scientific institutions, funding organizations, and publishers. This article reflects on the ten-year experience of running re3data and discusses ten key issues related to the management of an Open Science service that caters to RDRs worldwide.

https://doi.org/10.1038/s41597-023-02462-y

"Towards Research Software-ready Libraries"

Software is increasingly acknowledged as valid research output. Academic libraries adapt to this change to become research software-ready. Software publication and citation are key areas in this endeavor. We present and discuss the current state of the practice of software publication and software citation, and discuss four areas of activity that libraries engage in: (1) technical infrastructure, (2) training and support, (3) software management and curation, (4) policies.

https://doi.org/10.1515/abitech-2023-0031

"Data Management Plan Implementation, Assessments, and Evaluations: Implications and Recommendations"

Data management plans (DMPs) have become nearly a worldwide requirement for research funding. To meet these new funding agency expectations, information professionals across domains and the world have worked to create resources and services to successfully implement and sometimes assess DMPs. This essay presents a series of case studies from different institutions across the globe to highlight current practices and share recommendations for future work. A summary of various projects related to DMP implementation, assessment, and evaluation in different contexts provides a useful overview of current practices. The essay concludes with recommendations for practical oversight and scoring to improve DMPs’ utility in enabling the sharing of data.

https://doi.org/10.5334/dsj-2023-027

Digital Preservation Maturity Modelling Tool: "New DPC Resource ‘Level-up with RAM’ Now Available on General Release"

Designed to enable rapid benchmarking of an organization’s digital preservation capability, the DPC RAM is a digital preservation maturity modelling tool which is applicable for organizations of any size in any sector, and for all content of long-term value.

https://tinyurl.com/4xxe3dcu

| Research Data Publication and Citation Bibliography | Research Data Sharing and Reuse Bibliography | Research Data Curation and Management Bibliography | Digital Scholarship |

"Computational Reproducibility of Jupyter Notebooks from Biomedical Publications"

Jupyter notebooks facilitate the bundling of executable code with its documentation and output in one interactive environment, and they represent a popular mechanism to document and share computational workflows. The reproducibility of computational aspects of research is a key component of scientific reproducibility but has not yet been assessed at scale for Jupyter notebooks associated with biomedical publications. We address computational reproducibility at two levels: First, using fully automated workflows, we analyzed the computational reproducibility of Jupyter notebooks related to publications indexed in PubMed Central. We identified such notebooks by mining the articles' full text, locating them on GitHub and re-running them in an environment as close to the original as possible. We documented reproduction success and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or publications. Second, this study represents a reproducibility attempt in and of itself, using essentially the same methodology twice on PubMed Central over two years. Out of 27271 notebooks from 2660 GitHub repositories associated with 3467 articles, 22578 notebooks were written in Python, including 15817 that had their dependencies declared in standard requirement files and that we attempted to re-run automatically. For 10388 of these, all declared dependencies could be installed successfully, and we re-ran them to assess reproducibility. Of these, 1203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones. Running the other notebooks resulted in exceptions. We zoom in on common problems, highlight trends and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.

https://arxiv.org/abs/2308.07333
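
The notebook counts in the reproducibility abstract above form a funnel, and the drop-off at each stage is easier to see as a ratio. The following sketch uses only the numbers quoted in the abstract; the stage labels and the helper `stage_rates` are my own shorthand, not terms from the paper.

```python
# Stage counts as quoted in the abstract, from widest to narrowest.
funnel = [
    ("notebooks found", 27271),
    ("written in Python", 22578),
    ("dependencies declared", 15817),
    ("dependencies installed", 10388),
    ("ran without errors", 1203),
    ("results identical", 879),
]
different_results = 324  # ran without errors but results differed


def stage_rates(stages):
    """Return (stage name, share of the previous stage) for each step."""
    rates = []
    for (_, previous), (name, current) in zip(stages, stages[1:]):
        rates.append((name, current / previous))
    return rates


rates = stage_rates(funnel)

# Identical plus differing results should equal the error-free runs.
assert funnel[-1][1] + different_results == funnel[-2][1]

for name, rate in rates:
    print(f"{name}: {rate:.1%} of the previous stage")
print(f"identical results overall: {funnel[-1][1] / funnel[0][1]:.1%}")
```

The steepest drop is between notebooks whose dependencies installed and notebooks that ran without errors, which matches the abstract's remark that running the other notebooks resulted in exceptions.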

More about Jupyter notebooks.

The Jupyter Notebook is an interactive computing environment that enables users to author notebook documents that include code, interactive widgets, plots, narrative text, equations, images and even video! The Jupyter name comes from 3 programming languages: Julia, Python, and R. It is a popular tool for literate programming. Donald Knuth first defined literate programming as a script, notebook, or computational document that contains an explanation of the program logic in a natural language (e.g. English or Mandarin), interspersed with snippets of macros and source code, which can be compiled and rerun. You can think of it as an executable paper!
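
As a rough illustration of what a notebook document looks like on disk, the sketch below builds a two-cell notebook as a plain Python dictionary and serializes it to JSON, which is the format of `.ipynb` files. The key names reflect my understanding of version 4 of the notebook format; the cell contents are invented, and the structure has not been validated against the official schema here.

```python
import json

# A minimal notebook: one narrative (markdown) cell and one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# A tiny executable paper\n", "Narrative text goes here."],
        },
        {
            "cell_type": "code",
            "execution_count": None,  # not yet executed
            "metadata": {},
            "outputs": [],            # filled in when the cell is run
            "source": ["print(6 * 7)"],
        },
    ],
}

serialized = json.dumps(notebook, indent=1)

# Round-trip through JSON to show the document is plain, portable text.
cell_types = [cell["cell_type"] for cell in json.loads(serialized)["cells"]]
print(cell_types)
```

Because code, prose, and stored outputs live side by side in one text file, a notebook can be shared, versioned, and re-run like the "executable paper" described above.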

Washington University Libraries: "New Grant to Preserve Born-Digital Poetry"

The Washington University Libraries were awarded a two-year grant by the Mellon Foundation to support an exploration of essential questions surrounding the acquisition, discoverability, preservation, and use of born-digital poetry collections. The $250,000 award will enable the University Libraries to develop online resources and systems to process, preserve, and steward the collections of a new generation of digital-native poets. . . .

The first of its kind to focus on issues of acquisition, preservation, and wider access to born-digital materials, the project will process a wide range of digital materials from the archive of poet and academic Mary Jo Bang. Consequently, the project will eventually make it possible for students and researchers to access born-digital collections and gain a better understanding and insight into the unprecedented ways in which poetry is created in a digital era. The project also aims to lay the foundation for new benchmarks and guidelines on preservation and access to born-digital archives at libraries and museums and for personal poetry archives.

https://tinyurl.com/3baahaj5

"Actually Accessible Data: An Update and a Call to Action"

As funder, journal, and disciplinary norms and mandates have foregrounded obligations of data sharing and opportunities for data reuse, the need to plan for and curate data sets that can reach researchers and end-users with disabilities has become even more urgent. We begin by exploring the disability studies literature, describing the need for advocacy and representation of disabled scholars as data creators, subjects, and users. We then survey the landscape of data repositories, curation guidelines, and research-data-related standards, finding little consideration of accessibility for people with disabilities. We suggest three sets of minimal good practices for moving toward truly accessible research data: 1) ensuring Web accessibility for data repositories; 2) ensuring accessibility of common text formats, including those used in documentation; and 3) enhancement of visual and audiovisual materials. We point to some signs of progress in regard to truly accessible data by highlighting exemplary practices by repositories, standards, and data professionals. Accessibility needs to become a mainstream component of curation practice included in every training, manual, and primer.

https://tinyurl.com/2p8p4dau

"A Decade of Surveys on Attitudes to Data Sharing Highlights Three Factors for Achieving Open Science"

Over a 10-year period Carol Tenopir of DataONE and her team conducted a global survey of scientists, managers and government workers involved in broad environmental science activities about their willingness to share data and their opinion of the resources available to do so. . . .

The most surprising result was that a higher willingness to share data corresponded with a decrease in satisfaction with data sharing resources across nations (e.g., skills, tools, training) (Fig.1). That is, researchers who did not want to share data were satisfied with the available resources, and those that did want to share data were dissatisfied. Researchers appear to only discover that the tools are insufficient when they begin the hard work of engaging in open science practices. This indicates that a cultural shift in the attitudes of researchers needs to precede the development of support and tools for data management.

https://tinyurl.com/4sx54c6d

Data Sharing for Research: A Compendium of Case Studies, Analysis, and Recommendations

This report contains eight case studies that look at specific corporate/academic data-sharing partnerships in depth, from initiation through the publication of research findings. These case studies illuminate practical challenges for implementing corporate data sharing with researchers. Some common themes that emerged from the case studies include:

Successful data-sharing partnerships use Data-Sharing Agreements that require both the company and researchers to take steps to protect privacy.

Some of the challenges of data sharing include technical knowledge and infrastructure gaps between companies and researchers, and the continuing need for ethics and privacy review for industry-based research.

Promising practices for data sharing include the use of Privacy Enhancing Technologies and company-created, public-facing data-sharing menus to facilitate new partnerships.

While data sharing has significant costs and inherent risks, the risks can be managed, and the benefits to researchers, companies, and society make data sharing worth the effort.

https://tinyurl.com/a9axcscp

"Internet Archive Responds to Recording Industry Lawsuit Targeting Obsolete Media"

Late Friday, some of the world’s largest record labels, including Sony and Universal Music Group, filed a lawsuit against the Internet Archive and others for the Great 78 Project, a community effort for the preservation, research and discovery of 78 rpm records that are 70 to 120 years old. . . .

Of note, the Great 78 Project has been in operation since 2006 to bring free public access to a largely forgotten but culturally important medium. Through the efforts of dedicated librarians, archivists and sound engineers, we have preserved hundreds of thousands of recordings that are stored on shellac resin, an obsolete and brittle medium. The resulting preserved recordings retain the scratch and pop sounds that are present in the analog artifacts; noise that modern remastering techniques remove.

These preservation recordings are used in teaching and research, including by university professors like Jason Luther of Rowan University, whose students use the Great 78 collection as the basis for researching and writing podcasts for use in class assignments . . . While this mode of access is important, usage is tiny—on average, each recording in the collection is only accessed by one researcher per month.

https://tinyurl.com/bdevycm5