Digital Curation & Digital Preservation

"Identifying the Most Important Facilitators of Open Research Data Sharing and Reuse in Epidemiology: A Mixed-Methods Study"

To understand how open research data sharing and reuse can be further improved in the field of Epidemiology, this study explores the facilitating role that infrastructural and institutional arrangements play in this research discipline. It addresses two research questions: 1) What influence do infrastructural and institutional arrangements have on open research data sharing and reuse practices in the field of Epidemiology? And 2) how could infrastructural and institutional instruments used in Epidemiology potentially be useful to other research disciplines? First, based on a systematic literature review, a conceptual framework of infrastructural and institutional instruments for open research data facilitation is developed. Second, the conceptual framework is applied in interviews with Epidemiology researchers. The interviews show that two infrastructural and institutional instruments have a very high influence on open research data sharing and reuse practices in the field of Epidemiology, namely (a) access to a powerful search engine that meets open data search needs and (b) support by data stewards and data managers. Third, infrastructural and institutional instruments with a medium, high, or very high influence were discussed in a research workshop involving data stewards and research data officers from different research fields. This workshop suggests that none of the influential instruments identified in the interviews are specific to Epidemiology. Some of our findings thus seem to apply to multiple other disciplines. This study contributes to Science by identifying field-specific facilitators and challenges for open research data in Epidemiology, while at the same time revealing that none of the identified influential infrastructural and institutional instruments were specific to this field. 
Practically, this implies that open data infrastructure developers, policymakers, and research funding organizations may apply certain infrastructural and institutional arrangements to multiple research disciplines to facilitate and enhance open research data sharing and reuse.

https://doi.org/10.1371/journal.pone.0297969

"From Data Creator to Data Reuser: Distance Matters"

Sharing research data is complex, labor-intensive, expensive, and requires infrastructure investments by multiple stakeholders. Open science policies focus on data release rather than on data reuse, yet reuse is also difficult, expensive, and may never occur. Investments in data management could be made more wisely by considering who might reuse data, how, why, for what purposes, and when. Data creators cannot anticipate all possible reuses or reusers; our goal is to identify factors that may aid stakeholders in deciding how to invest in research data, how to identify potential reuses and reusers, and how to improve data exchange processes. Drawing upon empirical studies of data sharing and reuse, we develop the theoretical construct of distance between data creator and data reuser, identifying six distance dimensions that influence the ability to transfer knowledge effectively: domain, methods, collaboration, curation, purposes, and time and temporality. These dimensions are primarily social in character, with associated technical aspects that can decrease — or increase — distances between creators and reusers. We identify the order of expected influence on data reuse and ways in which the six dimensions are interdependent. Our theoretical framing of the distance between data creators and prospective reusers leads to recommendations to four categories of stakeholders on how to make data sharing and reuse more effective: data creators, data reusers, data archivists, and funding agencies.

https://arxiv.org/abs/2402.07926

"NIST Releases Version 2.0 of Research Data Framework (RDaF)"

The NIST RDaF is a resource for the entire research data community, including both organizations and individuals engaged in research data management in any discipline, community members from industry, universities, government departments and agencies, funders, and scholarly publishers. . . .

[It is:]

a map of the research data management ecosystem, organized by a canonical data lifecycle, into topics and subtopics;

a dynamic guide for research data stakeholders to understand best practices in research data management and dissemination;

a resource for understanding costs, benefits, and risks associated with research data management;

and a consensus document based on inputs and conversations amongst stakeholders in research data.

Providing perhaps the most comprehensive view of the research data ecosystem to date, the NIST RDaF comprises: 6 lifecycle stages, 50 topics, and 335 subtopics (programmatic and operational activities, concepts, and other important factors relevant to research data management with definitions), 14 overarching themes, 8 “generic” profiles (samples for common job functions or roles), and over 1,000 informational references (standards, guidelines, and policies, that assist stakeholders in addressing that subtopic).

http://tinyurl.com/fbep35xu

"WARC-GPT: An Open-Source Tool for Exploring Web Archives Using AI"

Using WARC-GPT, you can ask specific questions in natural language against a collection of WARC files. Rather than relying on keyword searches and metadata filters to sort through search results, WARC-GPT provides a new starting point for search using multi-document full-text search with summarization to explore the contents of web archives. WARC-GPT lists the sources used to generate the response and relevant text excerpts, which you can use to verify the information provided and identify points of interest within a collection of web archives.

http://tinyurl.com/3vvpsyj9

"Realities of Academic Data Sharing (RADS) Initiative Releases Reports on Expenses of Making Data Publicly Accessible, Project Methodology"

This report presents data on the average yearly cost of DMS activities for institutional units, as well as direct DMS expenses incurred by researchers per funded research project. These expenses were then analyzed together, showing an average combined overall cost of $2,500,000 (with total institutional expenses ranging from approximately $800,000 to over $6,000,000).

http://tinyurl.com/5xsw32we

Paywall: "Open Educational Resources on Preservation: An Overview"

This article aims to provide an overview of the available open educational resources on preservation through an investigation of open educational resource platforms, finding open educational resources on preservation, and analysing them according to the theoretical background on preservation. This provides an understanding of what kinds of open educational resources exist in the field of preservation and also informs the way a new open educational resource should be created.

https://doi.org/10.1177/03400352231219660

"Google Will No Longer Back Up the Internet: Cached Webpages Are Dead"

Google will no longer be keeping a backup of the entire Internet. Google Search’s "cached" links have long been an alternative way to load a website that was down or had changed, but now the company is killing them off. Google "Search Liaison" Danny Sullivan confirmed the feature removal in an X post, saying the feature "was meant for helping people access pages when way back, you often couldn’t depend on a page loading. These days, things have greatly improved."

http://tinyurl.com/uznbyacn

"HERITRACE: Tracing Evolution and Bridging Data for Streamlined Curatorial Work in the GLAM Domain"

HERITRACE is a semantic data management system tailored for the GLAM sector. It is engineered to streamline data curation for non-technical users while also offering an efficient administrative interface for technical staff. The paper compares HERITRACE with other established platforms such as OmekaS, Semantic MediaWiki, Research Space, and CLEF, emphasizing its advantages in user friendliness, provenance management, change tracking, customization capabilities, and data integration. The system leverages SHACL for data modeling and employs the OpenCitations Data Model (OCDM) for provenance and change tracking, ensuring a harmonious blend of advanced technical features and user accessibility. Future developments include the integration of a robust authentication system and the expansion of data compatibility via the RDF Mapping Language (RML), enhancing HERITRACE’s utility in digital heritage management.

https://arxiv.org/abs/2402.00477

"Data Management in Distributed, Federated Research Infrastructures: The Case of EPOS"

Data management is a key activity when Open Data stewardship through services complying with the FAIR principles is required, as happens in many national and European initiatives. Existing guidelines and tools facilitate the drafting of Data Management Plans by focusing on a set of common parameters or questions. In this paper we describe how data management is carried out in EPOS, the European Research Infrastructure for providing access to integrated data and services in the solid Earth domain. EPOS relies on a federated model and is committed to remaining operational in the long term. In EPOS, five key dimensions were identified for the Federated Data Management, namely the management of: thematic data; e-infrastructure for data integration; community of data providers committed to data provision processes; sustainability; and policies. On the basis of the EPOS experience, which is to some extent applicable to other research infrastructures, we propose additional components that may extend the EU Horizon 2020 Data Management Guidelines template, thus comprehensively addressing the Federated Data Management in the context of distributed Research Infrastructures.

https://doi.org/10.5334/dsj-2024-005

"Do Disappearing Data Repositories Pose a Threat to Open Science and the Scholarly Record? "

Only a little more than half of the research data repositories in the sample have detailed strategies they use to mitigate data loss. It is important to note that none of the strategies analysed offers a permanent solution; instead, infrastructure maintenance requires continuous efforts. The burden of infrastructure maintenance and data preservation is currently placed on individual repositories alone; preservation systems comparable to those for scholarly texts, such as CLOCKSS, are not widespread and can be difficult to realise.

http://tinyurl.com/3snrhxpk

"DataCite Launches First Release of the Data Citation Corpus"

DataCite, in partnership with the Chan Zuckerberg Initiative (CZI), is delighted to announce the first release of the Data Citation Corpus. A major milestone in the Make Data Count initiative, the release makes eight million data citations openly available and usable for the first time via an interactive dashboard and public data file.

https://makedatacount.org/first-release-of-the-open-global-data-citation-corpus/

"Agile Research Data Management with Open Source: LinkAhead"

Research data management (RDM) in academic scientific environments is increasingly coming into focus as an important part of good scientific practice and as a topic with great potential for saving time and money. Nevertheless, there is a shortage of appropriate tools that fulfill the specific requirements of scientific research. We identified where the requirements in science deviate from other fields and proposed a list of requirements which RDM software should meet to become a viable option. We analyzed a number of currently available technologies and tool categories for matching these requirements and identified areas where no tools can satisfy researchers’ needs. Finally, we assessed the open-source RDMS (research data management system) LinkAhead for compatibility with the proposed features and found that it fulfills the requirements in the area of semantic, flexible data handling in which other tools show weaknesses.

https://doi.org/10.48694/inggrid.3866

"RDA Professionalising Data Stewardship — What Does a Career Track for Data Stewards Look Like?"

The report "What does a career track for data stewards look like?" provides an initial discussion of the results of the RDA IG Professionalising Data Stewardship career tracks survey completed in 2022. The survey asked respondents who self-identified as performing data stewardship roles about their job titles, educational background, match between educational background and area of professional activity, contract types, as well as future career perspectives. The report is of value to international and national projects and initiatives seeking to define and develop the professional role of data stewards, as well as to organizations employing data stewards that seek to define career progression pathways and to improve the job satisfaction and job security of data stewards. The report is also of value to the emerging communities of data stewards because it provides an evidence base for understanding which professionals already fulfill data stewardship roles and how these professionals perceive their career paths.

https://zenodo.org/records/10571388

"Towards a Shared Framework: A Classificatory Matrix for Teaching Data Standards"

Standards for research data can be a mystifying topic for both researchers and data professionals. A common source of confusion is that they are multipurpose: standards can (and should) be applied to both primary data and metadata, enabling a wide range of functions from the search features in a repository to the integration of disparate data sources. This paper reviews examples of classificatory approaches used by both librarians and researchers to describe data standards. This literature is synthesized into a classificatory matrix that can be used to map different types of standards. The matrix is constructed around two organizing principles: purpose (finding or using data) and type of information controlled (meaning or syntax). The objective of this classificatory exercise is to encourage further discussion about the misunderstandings between researchers and data support professionals and to spur further development of the educational resources needed to improve understanding and use of data standards.

https://doi.org/10.7191/jeslib.758

"Digital Scholarly Journals Are Poorly Preserved: A Study of 7 Million Articles"

This work reveals an alarming preservation deficit. Only 0.96% of Crossref members (n = 204) can be confirmed to digitally preserve over 75% of their content in three or more of the archives that we studied. (Note that when, in this article, we write "preserved," we mean "that we were able to confirm as preserved," as per the specified limitations of this study.) A slightly larger proportion, i.e., 8.5% (n = 1,797), preserved over 50% of their content in two or more archives. However, many members, i.e., 57.7% (n = 12,257), only met the threshold of having 25% of their material in a single archive. Most worryingly, 32.9% (n = 6,982) of Crossref members seem not to have any adequate digital preservation in place, which is against the recommendations of the Digital Preservation Coalition.

https://doi.org/10.31274/jlsc.16288
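
The percentages quoted above can be checked against the counts with a little arithmetic. The following is a minimal sketch, assuming the four groups do not overlap and together cover every Crossref member in the study; the excerpt does not state this, and the names `tiers` and `share` are illustrative only.

```python
# Counts (n values) quoted in the excerpt above, one per preservation tier.
tiers = {
    "over 75% of content in three or more archives": 204,
    "over 50% of content in two or more archives": 1797,
    "25% of content in a single archive": 12257,
    "no adequate preservation confirmed": 6982,
}

# Assumption: the tiers are mutually exclusive and exhaustive, so their
# sum is the total number of Crossref members in the sample.
total = sum(tiers.values())


def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100 * count / total


print(f"implied total members: {total}")
for label, count in tiers.items():
    print(f"{share(count, total):6.2f}%  {label}")
```

Under that assumption the four counts sum to 21,240 members, and each count divided by that total rounds to the percentage given in the excerpt.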

"On the Readiness of Scientific Data for a Fair and Transparent Use in Machine Learning"

To ensure the fairness and trustworthiness of machine learning (ML) systems, recent legislative initiatives and relevant research in the ML community have pointed out the need to document the data used to train ML models. Besides, data-sharing practices in many scientific domains have evolved in recent years for reproducibility purposes. In this sense, the adoption of these practices by academic institutions has encouraged researchers to publish their data and technical documentation in peer-reviewed publications such as data papers. In this study, we analyze how this scientific data documentation meets the needs of the ML community and regulatory bodies for its use in ML technologies. We examine a sample of 4041 data papers of different domains, assessing their completeness and coverage of the requested dimensions, and trends in recent years, putting special emphasis on the most and least documented dimensions. As a result, we propose a set of recommendation guidelines for data creators and scientific data publishers to increase their data’s preparedness for its transparent and fairer use in ML technologies.

https://arxiv.org/abs/2401.10304

Scaling Up: How Data Curation Can Help Address Key Issues in Qualitative Data Reuse and Big Social Research

This book explores the connections between qualitative data reuse, big social research, and data curation. A review of existing literature identifies the key issues of context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership. Through interviews of qualitative researchers, big social researchers, and data curators, the author further examines each key issue and produces new insights about how domain differences affect each community of practice’s viewpoints, different strategies that researchers and curators use to ensure responsible practice, and different perspectives on data curation.

https://doi.org/10.1007/978-3-031-49222-8

"Towards a Quality Indicator for Research Data Publications and Research Software publications — A Vision from the Helmholtz Association"

Research data and software are widely accepted as an outcome of scientific work. However, in comparison to text-based publications, there is not yet an established process to assess and evaluate the quality of research data and research software publications. This paper presents an attempt to fill this gap. Initiated by the Working Group Open Science of the Helmholtz Association, the Task Group Helmholtz Quality Indicators for Data and Software Publications is currently developing a quality indicator for research data and research software publications to be used within the Association. This report summarizes the group’s vision of what contributes to such an indicator. The proposed approach relies on generic, well-established concepts for quality criteria, such as the FAIR Principles and the COBIT Maturity Model. It deliberately does not limit itself to technical implementation possibilities, to avoid using an existing metric for a new purpose. The intention of this paper is to share the current state for further discussion with all stakeholders, particularly with other groups working on similar metrics but also with entities that use the metrics.

https://arxiv.org/abs/2401.08804

"Data Stewardship: Case Studies from North-American, Dutch, and Finnish Universities"

This work seeks to elaborate the picture of different data stewardship programs running in different institutional arrangements and research environments. Design/methodology/approach – Drawing from autoethnography and case study methods, this study described three distinct data stewardship programs from Purdue University (United States), Delft Technical University (Netherlands) and Aalto University (Finland). In addition, this work investigated the institutional arrangements and national research environments of the programs. The focus was on initiatives led by academic libraries or similar services. Findings – This work demonstrates that data stewardship may be understood differently within different national and institutional contexts. The data stewardship programs differed in terms of roles, organization and funding structures. Moreover, the mesh of policies and legislation, organizational structures, and national infrastructures differed.

https://arxiv.org/abs/2312.04092

"Fair Sharing of Health Data: A Systematic Review of Applicable Solutions"

Health science researchers face additional specific challenges. Firstly, ethical and legal issues are barriers regarding the sharing of IPD. Legislation, like the General Data Protection Regulation (GDPR [16]) in Europe or the Health Insurance Portability and Accountability Act (HIPAA [17]) in the USA, prevents research data from being openly shared. IPD can only be shared publicly after the removal of all information allowing the identification of the individual participants, unless explicit consent has been obtained from the individual participants. Furthermore, the legislation has been growing stronger over the years. State laws have emerged in the USA, like the CCPA in California [18], as well as European legislation such as the Convention 108 [19] or the proposal for a reform of ePrivacy legislation [20].

Secondly, health data are diverse and heterogeneous and can be of very different types and formats, depending on the field they belong to, e.g., imaging, genomics, and mass spectrometry. Handling these data requires specific expertise and tools which can usually only be found in the specialized, dedicated communities.

The objective of this paper is to identify and evaluate technical solutions to implement systematic data sharing in an academic context, in order to help researchers make their data FAIR. We will evaluate various software programs and online platforms used in academic projects to manage and store data through a systematic literature review focusing on the implementation of the FAIR principles and the ability to support sharing of Individual Participant Data (IPD).

https://doi.org/10.1007/s12553-023-00789-5

"Emerging Quality Assurance Practices in the Library of Congress Web Archives"

Building sustainable quality assurance practices is a challenge for today’s preservationists, who want to be sure that content preserved in web archives is not only the correct content, but in working order. This often means that archived web content should be replayed via Wayback rendering software in good fidelity when compared to the original website. The exponentially growing scale of web archives necessitates a multipronged approach to identify what is (and is not) being preserved, and where improvements can be made. This paper will explore actions that can take place iteratively throughout the web archiving life cycle, as part of a larger system of review where multiple individuals can contribute, including non-technical Library staff and subject matter experts. The processes described are part of a novel workflow in the Library of Congress Web Archiving Program.

https://tinyurl.com/2p9b4pve

Partial Paywall: The Nordic Model of Digital Archiving

Bringing together contributions from practitioners and academics to offer a range of international case studies, this book offers practical solutions for archivists in terms of governance, technologies and processes. It highlights and analyses the cornerstones of the Nordic model of archiving: reliance on standards; powerful regulatory instruments — especially in public sector archiving, including legislation; and collaboration between archivists and government agencies, and among different tiers of central and local government.

One of four open access chapters: "The Nordic Model of Digital Archiving."

https://doi.org/10.4324/9781003325406

"Software Preservation after the Internet"

Software preservation must consider knowledge management as a key challenge. We suggest a conceptualization of software preservation approaches that are available at different stages of the software lifecycle and can support memory institutions to assess the current state of software items in their collection, the capabilities of their infrastructure, and completeness and applicability of knowledge that is required to successfully steward the collection.

https://tinyurl.com/8y9svs7x

"E-preservation of Old and Rare Books: A Structured Approach for Creating a Digital Collection "

Antique books and old and rare documents are fragile and vulnerable to different hazards. Preserving them for an extended period is a real challenge. Since ancient times, people have expressed their knowledge in writing and kept records, and later generations have collected and stored these as antique materials. These can be seen in different museums, libraries, archives, individual households, and other places all over the world. Keeping these antique, old and rare books and documents in good condition is a challenge for librarians, conservators, preservation administrators, and others responsible for storing them. In this paper, details of the digital preservation of such a collection available in the Directorate of Historical and Antiquarian Studies (DHAS), Guwahati, Assam, India, are discussed. DHAS is a wing of the Government of Assam and is mainly mandated to collect, preserve and research historical and antiquarian resources. The collection of DHAS is one of the oldest collections and has been serving as a study and research centre in Assam since 1928. A special drive has been undertaken for the digital preservation of an identified part of the collection, with grant support from the National Archives of India. This paper discusses the entire project process, from the project proposal formulation to the structuring of the digital collection. The paper sequentially discusses the different steps of the digitization of a collection of 241 old and rare books from the main collection of DHAS.

http://www.ijdc.net/article/view/855

Paywall: "Data Curation Education: Cross-Disciplinary Analysis of Master’s Programs"

The main goal of this study is to analyze the course content from the syllabi of various programs to understand what is being taught in LIS schools throughout graduate-level education. Further, because the need for data curation is apparent across different disciplines, and thus not only LIS but also other disciplines have been offering data curation courses, this study also analyzed syllabi from other disciplines. . . . Our findings suggest a notable growth in LIS education in data curation since 2012, but LIS education still provides less training in technical skills. There was also a distinctive difference in educational approach to teach data curation between LIS (user- and service-oriented) and other disciplines (technical skills-focused), which brought different strengths and weaknesses in curriculum.

https://tinyurl.com/bdfjwjbh