Digital Curation & Digital Preservation

“What’s in a Name? Exploring How Voluntary Library Data Literacy Workshop Titles and Descriptions Affect Learner Motivations to Enroll”

This study examined a large teaching and research-intensive university’s data library that offers several data literacy workshops. Although the data library’s voluntary data literacy workshops can be popular, with some workshops waitlisted, interest ebbs and flows. One way to improve the situation is to better market library workshops through effectively crafting workshop titles and descriptions that encourage engagement. Duke and Tucker (2007) state that it is important to market academic library services to increase service use and meet the needs of its users. Understanding marketing barriers is essential to improving workshop engagement.

https://doi.org/10.1016/j.acalib.2025.103045

“Enhancing FAIR Data Practices in the Norwegian Research Data Archive: Towards Research Objects and Improved Interoperability”

The increasing volume and complexity of research data necessitate robust data management practices to ensure data is Findable, Accessible, Interoperable, and Reusable (FAIR). The Norwegian Research Data Archive (NRDA) is at the forefront of efforts to create a comprehensive platform for researchers to share and archive their data. This paper discusses NRDA’s ongoing initiatives to enhance its infrastructure in alignment with FAIR principles, emphasizing the integration of Research Objects (ROs) and RO-Crate technologies. These improvements aim to facilitate better data discoverability, accessibility, and interoperability, thereby fostering a more integrated and sustainable data ecosystem. The paper also highlights NRDA’s collaborative efforts with other platforms via the use of Research Objects to support data sharing and reuse across repositories. By focusing on standardized metadata, persistent identifiers, and interoperability, NRDA is advancing Open Science practices, ultimately contributing to a more transparent, efficient, and collaborative research environment. The challenges and future directions of these initiatives are also explored, providing insights into the ongoing efforts to create a more open and interconnected scientific landscape.

https://doi.org/10.52825/ocp.v5i.1202

“Portico to Preserve Clarivate’s Ebook Central”

Portico has signed an agreement with Clarivate to preserve books available to academic libraries through Ebook Central. This agreement ensures the long-term preservation of this expansive collection. Portico will also receive new books added to Ebook Central in the future.

https://librarytechnology.org/pr/31206

“The New Zealand Thesis Project: Connecting a Nation’s Dissertations Using Wikidata”

Introduction: Libraries hold large amounts of bibliographic data, with great potential for enrichment with linked open data. The New Zealand Thesis Project explored this potential by uploading thesis metadata records from New Zealand institutional repositories to Wikidata, a collaborative linked data knowledge base.

Description of Project: Nine New Zealand tertiary institutions collaborated with four Wikidata experts to upload a combined national dataset of doctoral and master’s theses. Thesis records, including author and advisor names and richly described with main subject statements, were extracted from each repository, combined, and data cleaned before being uploaded to Wikidata. The team then undertook additional data enrichment, round-tripped Wikidata’s QID identifiers back to individual repositories, and used the new records to cite theses on authors’ Wikipedia pages. Wikidata queries and other visualizations were created to demonstrate how connecting the thesis metadata to records for authors, advisors, institutions, and subjects allows new insights into our collections.

Next Steps: Documentation is being fine-tuned to support future similar projects, and a second combined upload is under discussion to continue growing the New Zealand Thesis Project. There is considerable scope to continue enriching Wikidata records, some of which is already underway by Wikidata volunteers.

https://doi.org/10.31274/jlsc.18295

“The Quest to Share Data”

Data sharing in scientific research is widely acknowledged as crucial for accelerating progress and innovation. Mandates from funders, such as the NIH’s updated Data Sharing Policy, have been beneficial in promoting data sharing. However, the effectiveness of such mandates relies heavily on the motivation of data providers. Despite policy-imposed requirements, many researchers may only comply minimally, resulting in data that is inadequately reusable. Here, we discuss the multifaceted challenges of incentivizing data sharing and the complex interplay of factors involved. Our paper delves into the motivations of various stakeholders, including funders, investigators, and data users, highlighting the differences in perspectives and concerns. We discuss the role of guidelines, such as the FAIR principles, in promoting good data management practices but acknowledge the practical and ethical challenges in implementation. We also examine the impact of infrastructure on data sharing effectiveness, emphasizing the need for systems that support efficient data discovery, access, and analysis. We address disparities in resources and expertise among researchers and concerns related to data misuse and misinterpretation. Here, we advocate for a holistic approach to incentivizing data sharing beyond mere compliance with mandates. It calls for the development of reward systems, financial incentives, and supportive infrastructure to encourage researchers to share data enthusiastically and effectively. By addressing these challenges collaboratively, the scientific community can realize the full potential of data sharing to advance knowledge and innovation.

https://doi.org/10.3389/fninf.2025.1570568

“What Are Journals and Reviewers Concerned about in Data Papers? Evidence From Journal Guidelines and Review Reports”

The evolution of data journals and the increase in data papers call for associated peer review, which is intricately linked yet distinct from traditional scientific paper review. This study investigates the data paper review guidelines of 22 scholarly journals that publish data papers and analyses 131 data papers’ review reports from the journal Data. Peer review is an essential part of scholarly publishing. Although the 22 data journals employ disparate review models, their review purposes and requirements exhibit similarities. Journal guidelines provide authors and reviewers with comprehensive references for reviewing, which cover the entire life cycle of data. Reviewer attitudes predominantly encompass Suggestion, Inquiry, Criticism and Compliment during the specific review process, focusing on 18 key targets including manuscript writing, diagram presentation, data process and analysis, references and review and so forth. In addition, objective statements and other general opinions are also identified. The findings show the distinctive characteristics of data publication assessment and summarise the main concerns of journals and reviewers regarding the evaluation of data papers.

https://doi.org/10.1002/leap.2001

“Are Data Papers Cited as Research Data? Preliminary Analysis on Interdisciplinary Data Paper Citations”

Introduction. Research data sharing and reuse have become increasingly important in modern science, and data papers represent a new academic publication genre aimed at enhancing the visibility, sharing, and reuse of research data. However, whether citations to data papers reflect actual data reuse remains largely unexplored. This paper presents preliminary findings from a project designed to address this gap.

Method. We conducted a content analysis to manually annotate 437 citation sentences from 309 research articles referencing 50 data papers published in Data in Brief, a chief academic journal that only publishes data papers. The data papers were sampled from five knowledge domains based on a paper-level classification system.

Results. Our results show that most citations to all selected data papers (89%) are unrelated to the research data being described in the paper, instead focusing on the research findings or methodologies. This suggests that data papers are being cited similarly to traditional research articles, despite their unique purpose and content.

Conclusion. These findings raise questions about the effectiveness of data papers as representations of research data within the scholarly communication system, as well as their utility in quantitative studies on data reuse.

https://tinyurl.com/3f5u33fs

“Can LLMs Categorize the Specialized Documents from Web Archives in a Better Way?”

The explosive growth of web archives presents a significant challenge: manually curating specialized document collections from this vast data. Existing approaches rely on supervised techniques, but recent advancements in Large Language Models (LLMs) offer new possibilities for automating collection creation. LLMs are demonstrating impressive performance on various tasks even without fine-tuning. This paper investigates the effectiveness of prompt design in achieving results comparable to fine-tuned models. We explore different prompting techniques for collecting specialized documents from web archives like UNT.edu, Michigan.gov, and Texas.gov. We then analyze the performance of LLMs under various prompt configurations. Our findings highlight the significant impact of incorporating task descriptions within prompts. Additionally, including the document type as justification for the search scope leads to demonstrably better results. This research suggests that well-crafted prompts can unlock the potential of LLMs for specialized tasks, potentially reducing reliance on resource-intensive fine-tuning. This research paves the way for automating specialized collection creation using LLMs and prompt engineering.

https://dl.acm.org/doi/10.1145/3677389.3702591

“CODE beyond FAIR”

FAIR principles are a set of guidelines aiming at simplifying the distribution of scientific data to enhance reuse and reproducibility. This article focuses on research software, which significantly differs from data through its living nature, and its relationship with free and open-source software. Based on the second French plan for Open Science, we provide a tiered roadmap to improve the state of research software, which is inclusive to all stakeholders in the research software ecosystem: scientific staff, but also institutions, funders, libraries and publishers.

https://inria.hal.science/hal-04930405

Paywall: “Challenges in Tracking Archive’s Data Reuse in Social Sciences”

Identifying data reuse is challenging, due to technical reasons, and, in particular, incorrect citation practices among scholars. This paper aims to propose an automatic method to track the reuse of data deposited in the archives joined to the CESSDA (Consortium of European Social Science Data Archives) infrastructure. The paper also offers an overview on the identified data to understand the characteristics of the most reused data sets.

https://doi.org/10.1108/DLP-07-2024-0112

“To be FAIR: Theory Specification Needs an Update”

Innovations in open science and meta-science have focused on rigorous *theory testing*, yet methods for specifying, sharing, and iteratively improving theories remain underdeveloped. To address these limitations, we introduce *FAIR theory*: A standard for specifying theories as Findable, Accessible, Interoperable, and Reusable information artifacts. FAIR theories are Findable in well-established archives, Accessible in practical terms and in terms of their ability to be understood, Interoperable for specific purposes, e.g., to guide control variable selection, and Reusable so that they can be iteratively improved through collaborative efforts. This paper adapts the FAIR principles for theory, reflects on the FAIRness of contemporary theoretical practices in psychology, introduces a workflow for FAIRifying theory, and explores FAIR theories’ potential impact in terms of reducing research waste, enabling meta-research on the structure and development of theories, and incorporating theory into reproducible research workflows – from hypothesis generation to simulation studies. We make use of well-established open science infrastructure, including Git for version control, GitHub for collaboration, and Zenodo for archival and search indexing. By applying the principles and infrastructure that have already revolutionized sharing of data and publications to theory, we establish a sustainable, transparent, and collaborative approach to theory development. FAIR theory equips scholars with a standard for systematically specifying and refining theories, bridging a critical gap in open research practices and supporting the renewed interest in theory development in psychology and beyond. FAIR theory provides a structured, cumulative framework for theory development, increasing efficiency and potentially accelerating the pace of cumulative knowledge acquisition.

https://doi.org/10.31234/osf.io/t53np_v1

“Implementing and Learning from a Summer Research Data Management Training Program for Student Researchers”

Background

This study explores a library-led research data management (RDM) training program at a Canadian post-secondary institution that targeted students participating in summer research assistantships as well as their faculty supervisors. This paper describes the program in detail and shares findings from a student reflection assignment about practicing RDM for the first time.

Methods

The RDM training program included four requirements: attending an introductory RDM session; attending a data management plan (DMP) workshop; submitting a DMP for feedback; and completing a reflection assignment. Where consent was obtained (n=19), reflection assignments were analyzed using a qualitative content analysis approach.

Results

35 faculty supervisors registered 53 students to participate. 62.2% (n=33) of students completed all components of the program. Perceived benefits of completing a DMP included improved project planning, supporting best practices, potential for data reuse, and team communication. Perceived challenges included the inflexibility of DMPs, difficulty populating DMPs, demands on researchers’ time, and lack of long-term utility. 73.6% of students (n=14/19) reported that building a DMP helped them with their summer projects.

Conclusion

Through instruction, practical engagement, and reflection within the context of real-world research, the program supported participants in learning about and practicing RDM, and provided insights for academic librarians who wish to refine or develop training in their local contexts as they continue to navigate emerging expectations from funders and publishers.

https://doi.org/10.21083/partnership.v19i2.7753

“Frontiers introduces FAIR² Data Management”

FAIR² Data Management leverages AI-assisted curation to structure research data for publication, making it easier to find, reuse, and analyze—both by humans and machines—so researchers can focus on discovery rather than data preparation. By making datasets shareable and optimized for reuse, FAIR² Data Management enhances research efficiency and reproducibility, accelerating breakthroughs in global health, planetary sustainability, and scientific innovation. . . .

FAIR² (FAIR Squared) extends the FAIR principles by defining a formal specification that makes research data AI-ready, aligned with Responsible AI principles, and structured for deep scientific reuse. Compatible with MLCommons Croissant’s AI-ready format, it integrates essential elements for scientific rigor, reproducibility, and interoperability. FAIR² ensures data is richly documented and linked to provenance, methodology, and a detailed data dictionary, creating a context-rich representation of each dataset. It also integrates with TensorFlow, JAX, and PyTorch, enabling AI-driven analysis and easy sharing on Kaggle and Hugging Face, amplifying its impact across disciplines.

https://tinyurl.com/3bwjbsw6

“Developing Practices for FAIR and Linked Data in Heritage Science”

Heritage Science has a lot to gain from the Open Science movement but faces major challenges due to the interdisciplinary nature of the field, as a vast array of technological and scientific methods can be applied to any imaginable material. Historical and cultural contexts are as significant as the methods and material properties, which is something the scientific templates for research data management rarely take into account. While the FAIR data principles are a good foundation, they do not offer enough practical help to researchers facing increasing demands from funders and collaborators. In order to identify the issues and needs that arise “on the ground floor”, the staff at the Heritage Laboratory at the Swedish National Heritage Board took part in a series of workshops with case studies. The results were used to develop guides for good data practices and a list of recommended online vocabularies for standardised descriptions, necessary for findable and interoperable data. However, the project also identified areas where there is a lack of useful vocabularies and the consequences this could have for discoverability of heritage studies on materials from areas of the world that have historically been marginalised by Western culture. If Heritage Science as a global field of study is to reach its full potential this must be addressed.

https://doi.org/10.1038/s40494-025-01598-x

“The Economic Impact of Open Science: A Scoping Review”

This paper summarises a comprehensive scoping review of the economic impact of Open Science (OS), examining empirical evidence from 2000 to 2023. It focuses on Open Access (OA), Open/FAIR Data (OFD), Open Source Software (OSS), and Open Methods, assessing their contributions to efficiency gains in research production, innovation enhancement, and economic growth. Evidence, although limited, indicates that OS accelerates research processes, reduces the related costs, and fosters innovation by improving access to data and resources, and that this ultimately generates economic growth. Specific sectors, such as life sciences, are researched more and the literature exhibits substantial gains, mainly thanks to OFD and OA. OSS supports productivity, while the very limited studies on Open Methods indicate benefits in terms of productivity gains and innovation enhancement. However, gaps persist in the literature, particularly in fields like Citizen Science and Open Evaluation, for which no empirical findings on economic impact could be detected. Despite limitations, empirical evidence on specific cases highlights economic benefits. This review underscores the need for further metrics and studies across diverse sectors and regions to fully capture OS’s economic potential.

https://doi.org/10.31222/osf.io/kqse5_v1

“Building Trustworthy AI Solutions: Integrating Artificial Intelligence Literacy into Records Management and Archival Systems”

This paper explores the essential role of Artificial Intelligence (AI) competencies and literacy in the fields of records management and archival practices, within the framework of the InterPARES Trust AI project. . . . The study employs two complementary approaches: (1) a detailed competency framework developed through literature reviews, interviews with archival professionals who have applied AI to the processing of records, and validation workshops with practitioners; and (2) a comprehensive AI literacy framework derived from multiple case studies and theoretical discussions. . . . Findings indicate that archival professionals can leverage AI in their work practices by acquiring basic AI literacy, practical AI skills, data-related skills, tool-testing and evaluation, adaptation of AI to their workflows, and by actively engaging in collaborative projects with information technology (IT) developers.

https://doi.org/10.48550/arXiv.2307.14852

“Datafication and Cultural Heritage Collections Data Infrastructures: Critical Perspectives on Documentation, Cataloguing and Data-sharing in Cultural Heritage Institutions”

The role of cultural heritage collections within the research ecosystem is rapidly changing. From often-passive primary source or reference point for humanities research, cultural heritage collections are now becoming an integral part of large-scale interdisciplinary inquiries using computationally driven methods and tools. This new status for cultural heritage collections, in the ‘collections-as-data’ era, would not be possible without foundational work that was and is still going on ‘behind the scenes’ in cultural heritage institutions through cataloguing, documentation and curation of cultural heritage records. This article assesses the landscape for cultural heritage collections data infrastructure in the UK through an empirical and critical perspective, presenting insights on the infrastructure that cultural heritage organisations use to record and manage their collections, exploring the range of systems being used, the levels of complexity or ease at which collections data can be accessed, and the shape of interactions between software suppliers, cultural heritage organisations, and third-party partners. The paper goes on to include a critical analysis of the findings based on the sector’s approach to ‘3s’, that is standards, skill sets and scale, and how that applies to different cultural heritage organisations throughout the data lifecycle, from data creation and stewardship to sharing and re-using.

https://doi.org/10.5334/johd.277

“Building as They Come: Comparative Case Studies of Co-constructing Data Visualization Services with Academic Communities”

Academic libraries are well-situated to be strong supporters of democratizing and building knowledge and expertise in the use of data and data visualization as they cut across all of academia, regardless of discipline or department. Within the past decade, many academic libraries across North America have added data visualization services to their offerings. This has been done in several ways, from existing librarians with related portfolios like GIS or research data learning new skills to libraries creating new positions with the focus on the portfolio on data visualization. This chapter presents and compares two case studies of building data visualization services at York University Libraries and McMaster University Library.

https://hdl.handle.net/10315/42647

“Data and Code Availability in Political Science Publications from 1995 to 2022”

In this paper, we assess the availability of reproduction archives in political science. By “reproduction archive,” we mean the data and code supporting quantitative research articles that allow others to reproduce the computations described in the published paper. We collect a random sample of quantitative research articles published in political science from 1995 to 2022. We find that, even in 2022, most quantitative research articles do not point to a reproduction archive. However, practices are improving. In 2014, when the DA-RT symposium was published in PS, about 12% of quantitative research articles pointed to the data and code. Eight years later, in 2022, that has increased to 31%. This underscores a massive shift in norms, requirements, and infrastructure. Still, only a minority of articles share the supporting data and code.

https://doi.org/10.31235/osf.io/a5yxe_v2

“Peer Review of Data Papers: Does It Achieve Expectations for Facilitating Data Sharing and Reuse?”

This paper presents a qualitative study of open peer review reports of data papers in the data journal Earth System Science Data. We examine to what extent the actual review practices of data papers align with identifying the most valuable datasets and promoting data reuse. We conclude that peer reviewers adopted a variety of criteria to evaluate data papers, but it is still challenging for reviewers to identify the most valuable datasets that should be reused. In addition, our findings demonstrate the correlation between data paper evaluations and subsequent reuse of the underlying datasets.

https://dx.doi.org/10.2139/ssrn.5130257

CODATA: “Official Publication of DDI Cross-Domain Integration (DDI-CDI) Version 1.0”

DDI-CDI extends traditional DDI metadata to describe data beyond the social, behavioral, and economic (SBE) domains, addressing the need for broader capabilities. It supports descriptions of event and sensor data (“long” data), key-value data (often associated with “big” data and no-SQL data), and multi-dimensional data. By integrating these with traditional “wide” (or “rectangular”) DDI data descriptions, DDI-CDI enables the management and production of integrated data sets from diverse sources.

Further description from “DDI-CDI (DDI Cross-Domain Integration)”:

DDI-CDI is a new standard which is designed to be used with research data from any domain. While it minimally describes metadata for cataloguing and citation, its fundamental purpose is to describe data and process. The specification is domain-neutral and covers the majority of data structures in common use today: Wide, Long, Multi-Dimensional and Key-Value. It offers, for the first time, a mechanism to interoperate disparate data from multiple disciplines and domains at the lowest level of granularity i.e. the datum itself. While it is designed to complement its siblings in the DDI Alliance Product Suite – DDI-Codebook and DDI-Lifecycle, which operate in the Social, Behavioral and Economic domain – it is also intended to work with a wide variety of other domain-specific and generic metadata specifications. Integration is a first-order consideration in DDI-CDI and so it is designed from the ground up to work well with controlled vocabularies from any domain as well as with other standards.

https://tinyurl.com/yvph3r68
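
As a rough illustration of the "wide", "long", and key-value layouts named in the DDI-CDI description above, the following sketch reshapes one small table between the three forms. It uses plain Python only; the sample records and the function names `wide_to_long` and `long_to_key_value` are hypothetical and are not part of DDI-CDI or any DDI tooling.

```python
# Hypothetical sample data (not taken from DDI-CDI): two units, two variables.
# "Wide" (rectangular) layout: one row per unit, one column per variable.
wide = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": 41, "income": 61000},
]


def wide_to_long(rows, key="id"):
    """Return a "long" layout: one row per (unit, variable, value) triple."""
    return [
        {key: row[key], "variable": name, "value": value}
        for row in rows
        for name, value in row.items()
        if name != key
    ]


def long_to_key_value(rows, key="id"):
    """Return a key-value layout keyed by the (unit, variable) pair."""
    return {(r[key], r["variable"]): r["value"] for r in rows}


long_rows = wide_to_long(wide)
kv = long_to_key_value(long_rows)

for row in long_rows:
    print(row)
print(kv)
```

Each datum (for example, the income of unit 2) is the same value in all three layouts; only its addressing changes, which is the sense in which a description at the level of the individual datum could let differently shaped sources be combined.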

“Data Stewardship Decoded: Mapping Its Diverse Manifestations and Emerging Relevance at a Time of AI”

Data stewardship has become a critical component of modern data governance, especially with the growing use of artificial intelligence (AI). Despite its increasing importance, the concept of data stewardship remains ambiguous and varies in its application. This paper explores four distinct manifestations of data stewardship to clarify its emerging position in the data governance landscape. These manifestations include a) data stewardship as a set of competencies and skills, b) a function or role within organizations, c) an intermediary organization facilitating collaborations, and d) a set of guiding principles. The paper subsequently outlines the core competencies required for effective data stewardship, explains the distinction between data stewards and Chief Data Officers (CDOs), and details the intermediary role of stewards in bridging gaps between data holders and external stakeholders. It also explores key principles aligned with the FAIR framework (Findable, Accessible, Interoperable, Reusable) and introduces the emerging principle of AI readiness to ensure data meets the ethical and technical requirements of AI systems. The paper emphasizes the importance of data stewardship in enhancing data collaboration, fostering public value, and managing data reuse responsibly, particularly in the era of AI. It concludes by identifying challenges and opportunities for advancing data stewardship, including the need for standardized definitions, capacity building efforts, and the creation of a professional association for data stewardship.

https://arxiv.org/abs/2502.10399

Data Curation Network: “New Format Data Curation Primers in 2024”

We’re excited to share three new data curation primers released by the Data Curation Network, focusing on critical formats and approaches in scientific and cultural data management: FITS (Flexible Image Transport System), TIFF (Tagged Image File Format), and Linked Data.

(Links added in the above.)

https://tinyurl.com/3kht2syn

U.S. Research Data Summit: Strengthening Cooperation Across Organizations and Sectors: Proceedings of a Workshop

On October 10-11, 2023, the National Academies of Sciences, Engineering, and Medicine hosted the U.S. Research Data Summit at the National Academy of Sciences Building in Washington, DC. The summit was undertaken by a planning committee organized under the U.S. National Committee for CODATA. The summit was informed by input from 29 organizations, including leaders from federal government agencies, the private sector, public and nonprofit organizations, and research institutions. This publication summarizes the presentations and discussion of the summit.

https://tinyurl.com/yjbuhkwz