Digital Curation & Digital Preservation – Page 2

“Supporting the Research Data Management Journey of a Postgraduate Student at the University of St Andrews”

Most research funders have requirements for data management plans and open data to foster good research data management practices. In order to embed these practices in the postgraduate research (PGR) student journey we have introduced the requirement for a data management plan as part of the first-year progress review and the encouragement to make data underpinning theses publicly available. To support students through these processes we provide a suite of training workshops and are available for one-to-one consultations. User feedback and frequently asked questions are used to review and improve our support offering.

This brief report discusses the planning and implementation processes for the data management plan requirement and the encouragement to share underpinning data. It dives deeper into the workflows, especially for the data deposit, and describes training and support available to students. Statistics on training uptake, data management plan submissions and annual trends for data deposit are also presented. The report concludes with lessons learnt and the team’s plans for the near future.

https://doi.org/10.2218/ijdc.v19i1.980

“How Will We Prepare for an Uncertain Future? The Value of Open Data and Code for Unborn Generations Facing Climate Change”

What is the unit of knowledge that we would most like to protect for future generations? Is it the scientific publication? Or is it our datasets? Datasets are snapshots in space and time of n-dimensional hypervolumes of information that are resources in and of themselves—each giving numerous insights into the measured world [134,135]. New publishing paradigms, such as Octopus, allow researchers to link multiple ‘Analysis’ and/or ‘Interpretation’ publications to a single ‘Results’ publication as alternative analyses and interpretations of the same data [159]. A more traditional research paper, on the other hand, is one realization of many possible assessments of the data that were originally collected, and a wide diversity of results can be obtained when many individuals analyse one dataset with the same research question in mind [160,161]. That is, publications are one version of an oversimplified projection through n-dimensional space which communicate stories that our human minds can comprehend. Manuscript narratives, by necessity, leave out information to craft such a story.

This is not to say that scientific publications in and of themselves are not useful. On the contrary, they frame our current and historical understanding of the world and put scientific inquiry into the relevant spatial and temporal context. Scientific articles offer analysis and interpretation of data which will allow future generations to understand why certain policies, management actions, or approaches were attempted and/or abandoned. However, if future researchers are not granted access to our (past) data, future humans will have to repeat costly (e.g. time and resources) experiments, laboriously extract information directly from figures, tables and text in the articles themselves (assuming the relevant information is available and detailed enough, although there is evidence that this is not the case in at least some disciplines [55,162]) or will have to trust our analytical procedures and our intuitions and perceptions about the data we collected [160,161].

https://doi.org/10.1098/rspb.2024.1515

“Leveraging Task-Specific Large Language Models to Enhance Research Data Management Services”

Applying prompt engineering and RAG [Retrieval-Augmented Generation] to research data management and sharing activities offers numerous opportunities for enhancing institutional research data support services. Here, we present just a few illustrative examples that highlight how these technologies could significantly improve service efficiencies, reduce researcher burden, and support adherence with evolving policies. These examples aim to inspire further exploration and future work rather than serve as extensive case studies.

Task-Specific, Agent-Based Chatbots for Data Management and Sharing Plans (DMSPs): Agent-based chatbots can assist researchers in drafting DMSPs by prompting for specific information based on funder requirements. This would offer researchers an interactive, guided experience that streamlines the process of developing a DMSP. The chatbot can be pre-loaded with knowledge of DMSP policies, institutional resources, and common pitfalls observed during plan reviews. Moreover, by incorporating review criteria, these chatbots could also provide real-time feedback on draft plans, allowing researchers to refine their submissions before institutional review.

Automated Text Extraction for Structured Compliance Reporting: Using these approaches, institutions can also automate the extraction of key details from narrative-based DMSPs and transform them into structured, formatted fields. This could be particularly useful for converting narrative-based DMSPs into actionable steps for researchers, service providers, and compliance officers, enabling efficient monitoring and follow-up on data management and sharing commitments.

Customized Knowledge Retrieval for Policy Guidance and Updates: Institutions can further leverage these approaches to develop tools that offer researchers up-to-date guidance on data management and sharing policies from major funders and publishers as well as institutional requirements. For instance, a researcher could query these tools to receive the latest mandates, institutional requirements, or best practices related to data management and sharing. This capability would reduce the burden for researchers in tracking down the most recent policy update.

https://tinyurl.com/bdee5u29

Paywall: Data Culture in Academic Libraries: A Practical Guide to Building Communities, Partnerships, and Collaborations

In five parts, Data Culture in Academic Libraries: A Practical Guide to Building Communities, Partnerships, and Collaborations can help you foster an institutional culture that favors the curation, creation, and wider use of datasets.

Data at all Levels

Data Services and Instruction

Data Outreach

Data Communities

Data Partnerships

https://tinyurl.com/ydsmdjbj

“From Data Creator to Data Reuser: Distance Matters”

Sharing research data is necessary, but not sufficient, for data reuse. Open science policies focus more heavily on data sharing than on reuse, yet both are complex, labor-intensive, expensive, and require infrastructure investments by multiple stakeholders. The value of data reuse lies in relationships between creators and reusers. By addressing knowledge exchange, rather than mere transactions between stakeholders, investments in data management and knowledge infrastructures can be made more wisely. Drawing upon empirical studies of data sharing and reuse, we develop the metaphor of distance between data creator and data reuser, identifying six dimensions of distance that influence the ability to transfer knowledge effectively: domain, methods, collaboration, curation, purposes, and time and temporality. We explore how social and socio-technical aspects of these dimensions may decrease – or increase – distances to be traversed between creators and reusers. Our theoretical framing of the distance between data creators and prospective reusers leads to recommendations to four categories of stakeholders on how to make data sharing and reuse more effective: data creators, data reusers, data archivists, and funding agencies. ‘It takes a village’ to share research data – and a village to reuse data. Our aim is to provoke new research questions, new research, and new investments in effective and efficient circulation of research data; and to identify criteria for investments at each stage of data and research life cycles.

https://tinyurl.com/3429p526

“Research Data Lifecycle (RDLC): An Investigation into the Disciplinary Focus, Use Cases, Creator Backgrounds, Stages and Shapes of RDLC Models”

In this paper, we report the results of a study examining 78 Research and Data Lifecycle (RDLC) models located in a review of the literature. Through synthesis-analysis and the nominal group technique, we investigated the RDLC models from the point of view of their disciplinary focus, use cases, model creators, as well as the specific stages and shapes. Our study revealed that the majority of the disciplinary focus for the models was generic, science, or multi-disciplinary. Models originating in the social sciences and humanities are less common. The use cases varied in a wide spectrum, with a total of 34 different scenarios. The creators and authors of the RDLC models came from more than 20 countries with the majority of the models created as a result of collaboration within or across different organizations. Our stage and shape analysis also outlined key characteristics of the RDLC models by showing the commonalities and variations of named stages and varying structures of the models. As one of the first empirical investigations examining the deep substance of the RDLC models, our study provides significant insights into the context and setting where the models were developed, as well as the details with regard to the stages and shapes, and thereby identified gaps that may impact the use and value of the models. As such, our study establishes a foundation for further studies on the practical utilization of the RDLC models in research data management practice and education.

https://doi.org/10.2218/ijdc.v19i1.860

“Research Data Management and Crowdsourcing Personal Histories”

Drawing on experiences of the University of Oxford’s Sustainable Digital Scholarship (SDS) service and the World War Two crowdsourcing project ‘Their Finest Hour’, this paper explores how institutional digital repositories (such as the SDS platform) can be successfully leveraged to publish and sustainably host crowdsourced (‘warm-data’) collections beyond their funding period.

The paper examines the challenges in applying FAIR (Findable, Accessible, Interoperable, Reusable) principles to a collection containing first-hand testimonies and digitised objects of significant sentimental value, addressing both practical and ethical considerations, including the management of copyright, handling of sensitive material, use of AI tools and adherence to good research data management practices, with limited resources.

Reflecting on the importance of a caring approach to data stewardship, the paper examines how the ethos of the Their Finest Hour project, and its commitment to honouring contributors and their families, led organically to an alignment with CARE (Collective Benefit, Authority to Control, Responsibility, Ethics) principles, originally developed for Indigenous data governance. It also explores the potential for the wider application of CARE principles for crowdsourced collections such as the Their Finest Hour Online Archive, while acknowledging and respecting the origins of this framework.

Lastly, it offers some practical ‘lessons learned’ to help GLAM and Higher Education professionals working with crowdsourced collections and personal histories to navigate some of the research data management challenges that they may encounter, while also highlighting the importance of understanding FAIR and CARE principles and how they can be applied to these types of data collections.

https://doi.org/10.5334/johd.265

“Copyright and Licencing for Cultural Heritage Collections as Data”

Cultural Heritage (CH) institutions have been exploring innovative ways to publish digital collections to facilitate reuse, through initiatives like Collections as data and the International GLAM Labs Community. When making a digital collection available for computational use, it is crucial to have reusable and machine-readable open licences and copyright terms. While existing studies address copyright for digital collections, this study focuses specifically on the unique requirements of collections as data. This research highlights both the legal and technical aspects of copyright concerning collections as data. It discusses permissible uses of copyrighted collections, emphasising the need for interoperable, machine-readable licences and open licences. By reviewing current literature and examples, this study presents best practices and examples to help CH institutions better navigate copyright and licencing issues, ultimately enhancing their ability to convert their content into collections as data for computational research.

https://doi.org/10.5334/johd.263

“Towards the Interoperability of Scholarly Repository Registries”

The enactment of Open Science relies on scholarly repositories that make research products findable and accessible, while scholarly repository registries maintain authoritative metadata and persistent identifiers (PIDs) to help researchers and infrastructure providers discover and access needed repositories. However, the proliferation of repositories targeting different research products (e.g., publications, data, and software) or serving specific disciplines has led to the creation of multiple registries whose scope is not mutually exclusive. . . . While favouring the existence of a plurality of registries, this paper advocates for their interoperability, which is essential to eliminate the aforementioned barriers and enable their full, unambiguous utilisation. We analyse the data models of four prominent registries—FAIRsharing, re3data, OpenDOAR, and ROAR—and classify their properties and overlap. We provide a crosswalk between their data models and suggest a common data model shared across the examined registries to pave the way toward interoperability. As a means of validation, we include a coverage evaluation of the proposed data model. The paper adopts a pragmatic approach towards scholarly registry interoperability and suggests a common metadata model to foster the exchange of information across these platforms.

https://doi.org/10.1007/s00799-025-00414-y
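
As a rough illustration of what a crosswalk to a common data model involves, the minimal Python sketch below maps records from two registries onto shared property names and reports which source fields are left unmapped. Every registry label, field name, and property name here is a hypothetical placeholder, not taken from the actual FAIRsharing, re3data, OpenDOAR, or ROAR schemas, and the simple coverage ratio is an assumption rather than the paper's evaluation method.

```python
# All field names below are hypothetical placeholders,
# not the real schemas of any registry discussed in the paper.
CROSSWALK = {
    "registry_a": {"title": "name", "homepage": "url", "subjects": "disciplines"},
    "registry_b": {"repo_name": "name", "repo_url": "url"},
}


def to_common(registry, record):
    """Translate one registry record into the shared model.

    Returns the translated record and the list of source fields
    that the crosswalk does not cover.
    """
    mapping = CROSSWALK[registry]
    common, unmapped = {}, []
    for field, value in record.items():
        if field in mapping:
            common[mapping[field]] = value
        else:
            unmapped.append(field)
    return common, unmapped


def coverage(registry, record):
    """Share of a record's fields that the crosswalk maps (0.0 to 1.0)."""
    _, unmapped = to_common(registry, record)
    return 1 - len(unmapped) / len(record) if record else 0.0


# Invented example record with one field the crosswalk does not know about.
record_b = {
    "repo_name": "Example Repository",
    "repo_url": "https://example.org",
    "software": "DSpace",
}
common, unmapped = to_common("registry_b", record_b)
print(common)
print(unmapped)
print(round(coverage("registry_b", record_b), 2))
```

Keeping the unmapped fields visible, rather than silently dropping them, is one way to see where a proposed common model falls short for a given registry.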

“Conceptualizing Aggregate-Level Description in Web Archives”

Web archives collections are often excluded from archival science discussions, and their description instead focuses on bibliographic approaches to item-level metadata. This article argues that web archives are best understood using approaches of archival description, focusing on a case study of the Danish Netarchive, a long-running national web archive. By capturing and preserving web sites for the purposes of legal deposit, the Netarchive creates and maintains historical records of the web. Examining the Netarchive’s systems and activities through the lens of archival representation, this article develops a typology of representational artifacts that support this work, including the use of database entities, wiki documentation, classification and management via Jira issues, and codes, identifiers, and structures embedded in network protocols themselves. The analysis considers how meaningful aggregations can be understood via these representational schemes, systems and architectures, and how the nature of born-networked records challenges concepts of singular, hierarchical orderings of records aggregations. The closing discussion proposes new modes of description that address these multiple interconnected systems, and raises questions about what this might mean for aggregate-level description in the context of digital and born-networked records more broadly.

https://doi.org/10.5334/johd.265

"Understanding How to Identify and Manage Personal Identifying Information (PII) to Further Data Interoperability"

Respect for research participant rights is a key aspect for consideration when creating and utilizing interoperable data. From that perspective, requirements for sharing research data often call for the data to be de-identified, i.e., the removal of all personal identifying information (PII) prior to data sharing, to ensure that the participant’s data privacy rights are not infringed upon. However, what constitutes PII is often a point of confusion amongst researchers who are not familiar with privacy laws and regulations. This paper hopes to provide some clarity around what makes research data identifiable by presenting it under a different perspective from what most researchers are familiar with. It also provides a framework to help researchers determine where PII could exist within their data that they can use to help with privacy impact evaluations. The goal is to empower researchers to share their data with greater confidence that the privacy rights of their research subjects have been sufficiently protected, enabling access to greater amounts of data for research use.

https://tinyurl.com/2p95xtd2

"Staking Out the Stakeholders: Using NIST’s Research Data Framework within a Public University System"

Purpose: This article first introduces and contextualizes the National Institute of Standards and Technology (NIST) Research Data Framework (RDaF) and then explores its application in a local context.

Setting/Participants: The State University of New York (SUNY) System, both at a system-wide level and at two individual SUNY campuses, developed an approach to applying RDaF to improve research data management (RDM) practices.

Brief Description: As institutions work to establish sound, coordinated services and infrastructure that meet local needs, they look to strategic guidance and established best practices for doing so responsibly and successfully. Modeled after their Cybersecurity and Privacy Frameworks, NIST began developing RDaF in 2019 to address pressing research data community needs. The RDaF provides a comprehensive, structured approach to be used by diverse stakeholders to better understand the benefits, risks, and costs of research data management (RDM).

Results/Outcome: NIST continues to work with other organizations on RDaF’s utility in different contexts, and SUNY’s application offers both a use case and lessons learned that may offer other institutions a practical, grounded approach for leveraging the power of RDaF to improve their RDM strategy.

Conclusions: RDaF’s comprehensive guidance offers a robust, flexible framework for building thorough RDM strategy, whatever an organization’s institutional readiness.

https://tinyurl.com/55v3k7ux

"Persistent Identifiers for Instruments and Facilities: Current State, Challenges, and Opportunities"

Objective: Persistent Identifiers (PIDs) are central to the vision of open science described in the FAIR Principles. However, the use of PIDs for scientific instruments and facilities is decentralized and fragmented. This project aims to develop community-based standards, guidelines, and best practices for how and why PIDs can be assigned to facilities and instruments.

Methods: We hosted several online and in-person focus groups and discussions, culminating in a two-day in-person workshop featuring stakeholders from a variety of organizations and disciplines, such as instrument and facilities operators, PID infrastructure providers, researchers who use instruments and facilities, journal publishers, university administrators, federal funding agencies, and information and data professionals.

Results: Our first-year efforts resulted in four main areas of interest: developing a better understanding of the current PID ecosystem; clarifying how and when PIDs could be assigned to scientific instruments and facilities; challenges and barriers involved with assigning PIDs; incentives for researchers, facility managers, and other stakeholders to encourage the use of PIDs.

Conclusions: The potential for PIDs to facilitate the discovery, connection, and attribution of research instruments and facilities indicates an obvious value in their use. The lack of standards of how and when they are created, assigned, updated, and used is a major barrier to their widespread use. Data and information professionals can work to create relationships with stakeholders, provide relevant education and outreach activities, and integrate PIDs for instruments and facilities into their data curation and publication workflows.

https://tinyurl.com/3b8r6xrx

"In Sharing We Trust. Taking Advantage of a Diverse Consortium to Build a Transparent Data Service in Catalonia "

The Consorci de Serveis Universitaris de Catalunya (CSUC) is a consortium that serves 13 universities and 33 research centers in Catalonia and neighboring communities. In 2017 the Consortium created an Open Science department to collaborate with universities and research centers on facilitating the adoption of Open Science requirements. Even though CSUC also offers services to researchers directly (for example, its supercomputing resources), this report will focus on CSUC’s work with its member institutions to create and offer data management services. We will explain how CSUC has led the creation of a robust shared governance system, and how it takes advantage of the diversity of its members to create useful, high quality, and transparent services for all researchers in the Catalan research system. Through sharing each other’s experiences, values and priorities, the result is better than separate ad-hoc solutions. The process also creates a community of practitioners that develop expertise together with the help of professional development opportunities organized by CSUC, like recurrent self-learning labs focused on data curation tools, techniques and processes.

https://tinyurl.com/r2msbnsv

"Researchers and Research Data: Improving and Incentivising Sharing and Archiving "

There has been a lot of discussion within the scientific community around the issues of reproducibility in research, with questions being raised about the integrity of research due to failure to reproduce or confirm the findings of some of the studies. Researchers need to adhere to the FAIR (findable, accessible, interoperable, and reusable) principles to contribute to collaborative and open science, but these open data principles can also support reproducibility and issues around ensuring data integrity. This article uses observations and metrics from data sharing and research integrity related activities, undertaken by a Research Integrity and Data Specialist at the Francis Crick Institute, to discuss potential reasons behind a slow uptake of FAIR data practices. We then suggest solutions undertaken at the Francis Crick Institute which can be followed by institutes and universities to improve the integrity of research from a data perspective. One major solution discussed is the implementation of a data archive system at the Francis Crick Institute to ensure the integrity of data long term, comply with our funders’ data management requirements, and to safeguard our researchers against any potential research integrity allegations in the future.

https://tinyurl.com/wkhw548z

"Starting with the Digital Doesn’t Make it Easier: Developing Transparent Born Digital Acquisition Policies for Archives "

As organizations continue to overwhelmingly abandon all forms of paper-based record keeping, libraries are still adapting to increased offers of born digital archival donations. Simple misunderstandings or disconnects between the units facilitating donations and those maintaining born-digital collections create pain points for donor relations and can result in a lack of transparency over how donors’ records may be processed. To facilitate better donor transparency and cross-area collaboration over born digital records, Special Collections and archives need comprehensive policies and shifts in training and collaboration paradigms. This paper analyses the intersections of born digital archiving, collection development policies, donor relations, human-supported AI tools, and digital records education within American academic libraries to propose a functional toolkit for born digital acquisitions. Unrealistic expectations of collection processing, retention, growth, and publication onto openly accessible platforms can quickly overwhelm a library’s digital collections team due to size, need for digital forensics work, copyright limitations, or other capacity-related issues. Intertwined within this discussion is an additional discourse over the need to carefully curate our digital spaces not only for practical cost reasons, but due to the environmental costs of massive data storage solutions. Through an analysis of the elements stated above, the paper will reflect on the need to integrate born digital materials into archival acquisition procedures and provide practical solutions to meet this need.

https://tinyurl.com/8r3ucesb

"Trends and Changes in Academic Libraries’ Data Management Functions: A Topic Modeling Analysis of Job Advertisements"

This study aims to (i) track trends in academic library data management positions, (ii) identify key themes in job advertisements related to data management, and (iii) examine how these themes have evolved. Using text mining techniques, this study applied Latent Dirichlet Allocation (LDA) and TF-IDF vectorization to systematically analyze 803 job advertisements related to data management posted on the IFLA LIBJOBS platform from 1996 to 2023. The findings reveal that the development of these positions has undergone three phases: exploration, growth, and adjustment. Four core themes in data management functions emerged: “Cataloging and Metadata Management,” “Data Services and Support,” “Research Data Management,” and “Systems Management and Maintenance.” Over time, these themes have evolved from distinct roles to a more balanced distribution.

https://doi.org/10.1016/j.acalib.2025.103017
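
For readers unfamiliar with the TF-IDF weighting mentioned in the abstract above, here is a minimal Python sketch of the textbook form (term frequency multiplied by the log of inverse document frequency). The three job-advertisement snippets are invented for illustration and are not study data; the study's exact weighting variant, preprocessing, and LDA configuration are not stated in the abstract, so none of that is reproduced here.

```python
import math
from collections import Counter

# Toy job-advertisement snippets (invented for illustration, not study data).
ads = [
    "metadata librarian cataloging metadata records",
    "research data management librarian data support",
    "systems librarian systems maintenance",
]


def tf_idf(documents):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Term frequency times log inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf(ads)
for ad, w in zip(ads, weights):
    top = max(w, key=w.get)
    print(f"{top!r} is the most distinctive term in: {ad}")
```

A word such as "librarian" that appears in every snippet receives a weight of zero, which is why TF-IDF is a common way to surface the terms that distinguish one advertisement from the rest before a topic model is applied.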

"Paris Declaration Calls for Data-Driven Forensics to Spearhead the Fight Against Fake Science"

Supporters of research integrity have signed a new declaration calling for data-driven forensics – known as Forensic Scientometrics (FoSci) – to lead the charge in detecting, exposing and even preventing fake science. . . .

The event involved researchers, experts, and professionals from around the world who are committed to upholding research integrity, many well-known sleuths among them. Attendees signed the declaration over the following weekend. . . .

The FoSci Paris Declaration has made the following key commitments:

Advocate for transformation

Open a dialogue with policymakers to design de-incentivizing strategies to tackle the mass production of problematic papers

Advocate for reform of institutions involved in scientific research based on the group’s findings

Develop expertise and share knowledge

Facilitate training for researchers and professionals exploring these questions

Share and provide research and data in the FoSci community

Establish a regular cycle of professional meetings

Improve the tools and methods of forensic scientometrics

Improve the group’s ability to communicate its findings

Inform editorial boards, publishers, research institutions, governments and all relevant involved parties about the group’s work

Participate in building software and tools to enable the reproducibility of their forensics findings

Establish points of contact between FoSci members and concerned organizations

https://tinyurl.com/mrywc3ch

"A Primer for Applying and Interpreting Licenses for Research Data and Code"

This primer gives data curators an overview of the licenses that are commonly applied to datasets and code, familiarizes them with common requirements in institutional data policies, and makes recommendations for working with researchers who need to apply a license to their research outputs or understand a license applied to data or code they would like to reuse. While copyright issues are highly case-dependent, the introduction to the data copyright landscape and the general principles provided here can help data curators empower researchers to understand the copyright context of their own data.

https://tinyurl.com/34738m4s

Paywall: "Evolution of the “Long-Tail” Concept for Scientific Data"

This paper examines the changing landscape of discussions about long-tail data over time. . . . The review also bridges discussions on data curation in Library & Information Science (LIS) and domain-specific contexts, contributing to a more comprehensive understanding of the long-tail concept’s utility for effective data management outcomes. The review aims to provide a more comprehensive understanding of this concept, its terminological diversity in the literature, and its utility for guiding data management, overall informing current and future information science research and practice.

https://doi.org/10.1002/asi.24967

Framework for Managing University Open Source Software

This document serves as a comprehensive guide for universities looking to develop or refine an open source software framework. It provides the foundational knowledge and tools needed to create an environment that supports open source that is aligned with the unique needs and goals of each institution.

https://doi.org/10.5281/zenodo.14392733

Paywall: "Preserving Digital Humanities Projects Using Principles of Digital Longevity"

This chapter reports on a large, global survey undertaken by the Endings Project and introduces the “Endings compliance” toolbox, guiding librarians and archivists in assisting DH scholars to frame their work for cost-effective preservation. The chapter argues that collaboration among technologists, scholars, librarians, and archivists throughout the project lifecycle is essential to address longevity challenges in DH work, particularly for preserving complex web applications. Clear indicators of project completion are necessary, along with contingency plans for potential disruptions. Libraries and archives can avoid the pitfalls of complex software stacks through such collaboration, and by adhering to known preservation principles.

https://tinyurl.com/rta7scyd

"Early Electronic Journals: A Preservation Survey"

In 1994, the Association of Research Libraries (ARL) published a print directory containing information on every electronic journal that could be identified, anywhere in the world. Thirty years later, this study surveys the current availability and preservation status of those 443 journals. While a significant number of these journals are no longer available, the results indicate that independent preservation efforts by individuals and small groups were a major factor in preserving many of the remaining publications.

https://doi.org/10.1016/j.acalib.2024.102989

"Scaling Up Digital Preservation Workflows With Homegrown Tools and Automation"

At NC State University Libraries, the Special Collections Research Center leverages an integrated system of locally developed applications and open-source technologies to facilitate the long-term preservation of digitized and born-digital archival assets. These applications automate many previously manual tasks, such as creating access derivatives from preservation scans and ingest into preservation storage. They have allowed us to scale up the number of digitized assets we create and publish online; born-digital assets we acquire from storage media, appraise, and package; and total assets in local and distributed preservation storage. The origin of these applications lies in scripted workflows put into use more than a decade ago, and the applications were built in close collaboration with developers in the Digital Library Initiatives department between 2011 and 2023. This paper presents a strategy for managing digital curation and preservation workflows that does not solely depend on standalone and third-party applications. It describes our iterative approach to deploying these tools, the functionalities of each application, and sustainability considerations of managing in-house applications and using Academic Preservation Trust for offsite preservation.

https://tinyurl.com/4mjpzth2