Data Curation, Open Data, and Research Data Management – Page 3

Paywall: "The New Information Retrieval Problem: Data Availability"

In this paper, we discuss a method for exploring and locating datasets made available by scientists from federally funded projects in the US. The data pathways method was tested on federal awards. Here we describe the method and the results from analyzing fifty federal awards granted by the National Science Foundation to pursue data resources and their availability in publications, data repositories, or institutional repositories. The data pathways approach contributes to the development of a practical approach on availability that captures the current ways in which data are accessible from federally funded science projects, ranging from institutional repositories, journal data deposit, PI and project web pages, and science data platforms, among other found possibilities.

https://doi.org/10.1002/pra2.796

Paywall: "‘Garbage Bags Full of Files’: Exploring Sociotechnical Perceptions of Formats within the Recovery and Reuse of Scientific Data"

This paper explores the social and technical perceptions of physical and digital formats as they relate to work in the recovery and reuse of scientific data, specifically historical, archival, and defunct data sources. . . . Based on 23 qualitative interviews with practitioners conducting data recovery and reuse, ranging from marine biologists to data librarians, we study how they understand, engage with, and utilize formats within their data curation work. . . . The paper focuses on practitioner perceptions of formats around the following themes: how practitioners’ historical relationships to certain challenging formats inform their ongoing curation practices; the importance of contexts in prioritizing or ignoring formats within scientific curation work; and how formats reveal larger sociotechnical issues.

https://doi.org/10.1002/pra2.798

"Conceptualizing Slow Curation"

The pressure to do things quickly is a constant in our professional lives as data curators. But what if we slowed down our work? Taking inspiration from the Slow Movement and its various sub-genres, we propose the idea of Slow Curation, specifically for the application of curating research data. Data curation is the process of reviewing datasets to prepare them for sharing, use, and reuse. We have identified areas where Slow can be embraced in the Data Curation Network’s CURATE(D) model. We also advocate for a few ways, outside of curation, where data curators can collaboratively advocate for Slowing down our work for the better. Join us as we practice radical self-care and advocate for our communities by embracing Slow and easing our way out of Busy culture.

https://doi.org/10.7191/jeslib.740

"Implementation of a Federated Information System by Means of Reuse of Research Data Archived in Research Data Repositories"

At universities, research data is increasingly stored in research data repositories according to a data management plan (DMP) and thus made available for further use. The challenge of reusing hundreds, thousands, or millions of data sets is to obtain an overview of the data in a short period of time and to search through all the data. The high variability of the formats used to store research data requires a new approach to data reusability that focuses on the visualisation and searchability of archived research data, which can also be combined with each other. In this article, we present a practical DMP that describes how information systems can be created on demand by reusing research data archived in research data repositories and how these systems can be merged into a federated information system. As a result, in our projects, information systems have been created in minutes or a couple of hours with few resources. The initial effort to create a federated system remains; however, this allows federated searches to be performed. Extending a federated system to include other information systems can then be accomplished by making a few configurations and manageable adjustments to the source code.

https://doi.org/10.5334/dsj-2023-039

"Connecting Fragmented Support on Campus: Growing Research Data Services Programs Through Collaboration"

Research data services are provided by multiple units across and beyond the library, which is why communication and collaboration are paramount to building support for researchers. By exploring how Research Data Services (RDS) programs can function in the fragmented landscape of research support on campuses, we outline the role of collaboration in building programs. In this paper, we discuss building an RDS program by emphasizing three strategies for collaboration: collaborating within the library, collaborating across campus, and collaborating externally with those without direct ties to your organization. The aim of this paper is to offer attainable examples and strategies for building collaborations across campuses for libraries that have small or nascent RDS programs—how to approach and cultivate partnerships, how to set realistic goals, and how to work holistically within the fragmented academy.

https://tinyurl.com/9hbz49df

Paywall: "DMPFrame: A Conceptual Metadata Framework for Data Management Plans"

We have examined 12 open-source DMP tools, in particular, to evaluate the metadata adopted by these tools. The current study spots and highlights the gaps in the DMP metadata management in DMP tools and suggests DMPFrame as a conceptual framework addressing such gaps to improve the existing tools’ DMP metadata management practices. Based on the examined DMP tool’s metadata elements analysis and mapping, DMPFrame manages DMP metadata under 6 categories, namely, contributors, project, funding, organization, DMP, and output. The current study also suggests a systematic workflow that DMP tools could incorporate for metadata creation for DMPs.

https://doi.org/10.1080/19386389.2023.2268474

"Disappearing Repositories — Taking an Infrastructure Perspective on the Long-Term Availability of Research Data"

Currently, there is limited research investigating the phenomenon of research data repositories being shut down, and the impact this has on the long-term availability of data. This paper takes an infrastructure perspective on the preservation of research data by using a registry to identify 191 research data repositories that have been closed and presenting information on the shutdown process. The results show that 6.2 % of research data repositories indexed in the registry were shut down. The risks resulting in repository shutdown are varied. The median age of a repository when shutting down is 12 years. Strategies to prevent data loss at the infrastructure level are pursued to varying extent. 44 % of the repositories in the sample migrated data to another repository, and 12 % maintain limited access to their data collection. However, both strategies are not permanent solutions. Finally, the general lack of information on repository shutdown events as well as the effect on the findability of data and the permanence of the scholarly record are discussed.

https://arxiv.org/abs/2310.06712

"IFLA ARL Section’s ‘Inclusiveness through Openness’ Conference Proceedings Now Available!"

All videos and slides from this August's IFLA Academic & Research Libraries Section (ARL) satellite conference to the 2023 IFLA WLIC in Rotterdam are now available:

https://tinyurl.com/4cywvp9h

"The Rise of Open Science: Tracking the Evolution and Perceived Value of Data and Methods Link-Sharing Practices"

In recent years, funding agencies and journals increasingly advocate for open science practices (e.g. data and method sharing) to improve the transparency, access, and reproducibility of science. However, quantifying these practices at scale has proven difficult. In this work, we leverage a large-scale dataset of 1.1M papers from arXiv that are representative of the fields of physics, math, and computer science to analyze the adoption of data and method link-sharing practices over time and their impact on article reception. To identify links to data and methods, we train a neural text classification model to automatically classify URL types based on contextual mentions in papers. We find evidence that the practice of link-sharing to methods and data is spreading as more papers include such URLs over time. Reproducibility efforts may also be spreading because the same links are being increasingly reused across papers (especially in computer science); and these links are increasingly concentrated within fewer web domains (e.g. Github) over time. Lastly, articles that share data and method links receive increased recognition in terms of citation count, with a stronger effect when the shared links are active (rather than defunct). Together, these findings demonstrate the increased spread and perceived value of data and method sharing practices in open science.

https://arxiv.org/abs/2310.03193

"Introducing Open Data Editor (beta): Towards a No-Code Data App for Everyone "

Intuitive Data Editing: Open Data Editor (beta) provides a user-friendly, spreadsheet-like interface that allows you to view, edit, and validate your data effortlessly.

Data Transformation: Easily transform your data from one format to another with a wide range of supported data formats, including CSV, Excel, JSON, and more.

Data Validation: Ensure data quality and consistency with built-in validation checks that generate a visual validation report, making it super easy for you to clean your data.

Schema Management: Define and manage data schemas to ensure data consistency and compliance with standards.

Data Publishing: Seamlessly publish your data to the web or data portals. It is easy to publish the processed data to CKAN, Github and Zenodo with a single button click, making it accessible to a wider audience and increasing its impact.

Generative AI: Optionally add a generative AI provider to unlock many features based on chat-based language models. The feature is currently limited to OpenAI, but more providers will be added soon.

https://tinyurl.com/2xwcp87x

Kristin Briney: The Research Data Management Workbook

The Research Data Management Workbook is made up of a collection of exercises for researchers to improve their data management. The Workbook contains exercises across the data lifecycle, though the range of activities is not comprehensive. Instead, exercises focus on discrete practices within data management that are structured and can be reproduced by any researcher.

The book is divided into chapters, loosely by phases of the data lifecycle, with one or more exercises in each chapter. Every exercise comes with a description of its value within data management, instructions on how to do the exercise, original source of the exercise (when applicable), and the exercise itself.

https://tinyurl.com/2p8sk5xd

"Data Curation in Interdisciplinary and Highly Collaborative Research"

This paper provides a systematic analysis of publications that discuss data curation in interdisciplinary and highly collaborative research (IHCR). Using content analysis methodology, it examined 159 publications and identified patterns in definitions of interdisciplinarity, projects’ participants and methodologies, and approaches to data curation. The findings suggest that data is a prominent component in interdisciplinarity. In addition to crossing disciplinary and other boundaries, IHCR is defined as curating and integrating heterogeneous data and creating new forms of knowledge from it. Using personal experiences and descriptive approaches, the publications discussed challenges that data curation in IHCR faces, including an increased overhead in coordination and management, lack of consistent metadata practices, and custom infrastructure that makes interoperability across projects, domains, and repositories difficult. The paper concludes with suggestions for future research.

https://doi.org/10.2218/ijdc.v17i1.835

Scholarly Communication Librarianship and Open Knowledge

The book consists of three parts. Part I offers definitions of scholarly communication and scholarly communication librarianship and provides an introduction to the social, economic, technological, and policy/legal pressures that underpin and shape scholarly communication work in libraries. These pressures, which have framed ACRL’s understanding of scholarly communication for the better part of the past two decades, have unsettled many foundational assumptions and practices in the field, removing core pillars of scholarly communication as it was practiced in the twentieth century. These pressures have also cleared fresh ground, and scholarly communication practitioners have begun to seed the space with values and practices designed to renew and often improve the field. Part II begins with an introduction to "open," the core response to the pressures described in part I. This part offers a general overview of the idea of openness in scholarly communication followed by chapters on different permutations and practices of open, each edited by a recognized expert of these areas with authors of their selection. Amy Buckland edited chapter 2.1, "Open Access." Brianna Marshall edited chapter 2.2, "Open Data." Lillian Hogendoorn edited chapter 2.3, "Open Education." Micah Vandegrift edited chapter 2.4, "Open Science and Infrastructure." Each of them brought on incredible expertise through contributors whom they identified, through both original contributions and repurposing existing openly licensed work, which is something we want to model where possible. Part III consists of twenty-four concise perspectives, intersections, and case studies from practicing librarians and closely related stakeholders, which we hope will stimulate discussion and reflection on theory and implications for practice. In every single case, we’re really excited by the editors and authors and the ideas they bring to the whole. 
Each contribution features light pedagogical apparatuses like suggested further reading, discussion or reflection prompts, and potential activities. It’s all available for free and openly licensed with a Creative Commons Attribution Non-Commercial (CC BY-NC) license, so anyone is encouraged to grab whatever parts are useful and to adapt and repurpose and improve them to meet specific course goals and student needs within the confines of the license.

https://bit.ly/SCLAOK

"Data Reuse among Digital Humanities Scholars: A Qualitative Study of Practices, Challenges and Opportunities"

The study investigates the challenges and opportunities in reusing research data among digital humanities (DH) scholars. Its findings may serve as a case study for how disciplinary practices influence the ways in which scholars reuse data. . . .

The study found that lack of time and resources, inconsistent data practices, technical training gaps, labour intensity and difficulties in finding data were the most challenging. Participants revealed a number of enabling factors in data reuse as well, and chief among them were collaboration and autodidacticism as a feature of DH. The results indicate a gap between data reusers and data sharers — low rates of sharing reduce the amount of findable and accessible data available for reuse. Both data reusers and data sharers must begin to see themselves as embedded into the research data lifecycle within the research infrastructure.

https://tinyurl.com/4hy77dsz

"ACME-FAIR: a Guide for Research Performing Organisations (RPO)"

The overall purpose of ACME-FAIR is to help those managing and delivering relevant professional services to self-assess how they are enabling researchers and their colleagues to do just that. Each part deals with one of the key issues that Research Performing Organisations (RPO) face in establishing the capabilities to put the FAIR principles into practice. . . . Each of the 7 guides has a thematic introduction, an overview of the relevant capabilities, and a rubric for assessing the levels of maturity and community engagement for each capability.

https://tinyurl.com/yckfdjtd

"An Approach to Assess the Quality of Jupyter Projects Published by GLAM Institutions"

Jupyter Notebooks have become a powerful tool to foster use of cultural heritage collections by digital humanities researchers. Based on previous approaches for quality assessment, which have been adapted for cultural heritage collections, this paper proposes a methodology for assessing the quality of projects based on Jupyter Notebooks published by relevant GLAM institutions. A list of projects based on Jupyter Notebooks using cultural heritage data has been evaluated. Common features and best practices have been identified. A detailed analysis that can be useful for organizations interested in creating their own Jupyter Notebooks projects has been provided. Open issues requiring further work and additional avenues for exploration are outlined.

https://doi.org/10.1002/asi.24835

"Umbrella Data Management Plans to Integrate FAIR Data: Lessons From the ISIDORe and BY-COVID Consortia for Pandemic Preparedness"

The Horizon Europe project ISIDORe is dedicated to pandemic preparedness and responsiveness research. It brings together 17 research infrastructures (RIs) and networks to provide a broad range of services to infectious disease researchers. An efficient and structured treatment of data is central to ISIDORe’s aim to furnish seamless access to its multidisciplinary catalogue of services, and to ensure that users’ results are treated FAIRly. ISIDORe therefore requires a data management plan (DMP) covering both access management and research outputs, applicable over a broad range of disciplines, and compatible with the constraints and existing practices of its diverse partners.

Here, we describe how, to achieve that aim, we undertook an iterative, step-by-step, process to build a community-approved living document, identifying good practices and processes, on the basis of use cases, presented as proof of concepts. International fora such as the RDA and EOSC, and primarily the BY-COVID project, furnished registries, tools and online data platforms, as well as standards, and the support of data scientists. Together, these elements provide a path for building an umbrella, FAIR-compliant DMP, aligned as fully as possible with FAIR principles, which could also be applied as a framework for data management harmonisation in other large-scale, challenge-driven projects. Finally, we discuss how data management and reuse can be further improved through the use of knowledge models when writing DMPs and, how, in the future, an inter-RI network of data stewards could contribute to the establishment of a community of practice, to be integrated subsequently into planned trans-RI competence centres.

https://doi.org/10.5334/dsj-2023-035

"Data Management Plan Tools: Overview and Evaluation"

Data Management Plans (DMPs) are crucial for a structured research data management and often a mandatory part of research proposals. DMP tools support the development of DMPs. Among the variety of tools available, it can be difficult for researchers, data stewards and institutions to choose the one that is most appropriate for their specific needs and context. We evaluated 18 DMP tools according to 31 requirement parameters covering aspects relating to basic functions, technical aspects and user-friendliness. The highest total evaluation scores were reached by Data Stewardship Wizard (703.5), DMPTool (615.5) and RDMO NFDI4Ing (549.5). The tools evaluated satisfied between 10 % and 87 % of the requirement parameters. 11 tools cover at least half of the parameters. In terms of correlation among the tools, which indicates to which degree their scores in the different requirement parameters are alike, we found the highest correlation for ezDMP and GFBio DMPT. Regarding the relatedness between the tools, 85 % of the DMP tools were positively and 16 % negatively correlated. Accounting for the recent developments in the area of DMP tools, this study provides an up-to-date evaluation that can support tool developers in identifying potential improvements, and hosting institutions to select a tool suited to their specific needs.

https://tinyurl.com/yewhv8rn

"Understanding Barriers Affecting the Adoption and Usage of Open Access Data in the Context of Organizations"

Although the benefits of organizational adoption of open access data (OAD) are significant, most OAD-related projects fail because of organizational barriers and resistance to adoption. This study first aims to find these organizational barriers to adopting OAD to raise awareness of the obstacles organizations must overcome. Towards this aim, after conducting a systematic literature review (SLR) and an expert panel, a research model based on the Technology – Organization – Environment (TOE) framework is proposed in this study. As a result of the SLR, 97 barriers were identified from ten primary studies. After critically examining these barriers, a research model classifying 22 crucial barriers to organizational OAD adoption based on the TOE framework is proposed.

https://doi.org/10.1016/j.dim.2023.100049

Paywall: "Images as Metadata: A New Perspective for Describing Research Data"

Through studies and work developed over the last few years, we propose a new approach to description, where images can have a preponderant role in the description of data, assuming the role of metadata. We present several pieces of evidence, point out their challenges and determine the opportunities this new perspective can have in the research. Images have specific characteristics that can be leveraged in improving data description. Historical evidence establishes that images have always been used and produced in research, yet their representational ability has never been harnessed to describe data and give more context to the scientific process.

https://doi.org/10.1080/19386389.2023.2252722

"Tracing Data: A Survey Investigating Disciplinary Differences in Data Citation"

Data citations, or citations in reference lists to data, are increasingly seen as an important means to trace data reuse and incentivize data sharing. Although disciplinary differences in data citation practices have been well documented via scientometric approaches, we do not yet know how representative these practices are within disciplines. Nor do we yet have insight into researchers’ motivations for citing — or not citing — data in their academic work. Here, we present the results of the largest known survey (n = 2,492) to explicitly investigate data citation practices, preferences, and motivations, using a representative sample of academic authors by discipline, as represented in the Web of Science (WoS). We present findings about researchers’ current practices and motivations for reusing and citing data and also examine their preferences for how they would like their own data to be cited. We conclude by discussing disciplinary patterns in two broad clusters, focusing on patterns in the social sciences and humanities, and consider the implications of our results for tracing and rewarding data sharing and reuse.

https://doi.org/10.1162/qss_a_00264

"Expanding the Data Ark: An Attempt to Make the Data from Highly Cited Social Science Papers Publicly Available"

Access to scientific data can enable independent reuse and verification; however, most data are not available and become increasingly irrecoverable over time. This study aimed to retrieve and preserve important datasets from 160 of the most highly-cited social science articles published between 2008-2013 and 2015-2018. We asked authors if they would share data in a public repository — the Data Ark — or provide reasons if data could not be shared. Of the 160 articles, data for 117 (73%, 95% CI [67% – 80%]) were not available and data for 7 (4%, 95% CI [0% – 12%]) were available with restrictions. Data for 36 (22%, 95% CI [16% – 30%]) articles were available in unrestricted form: 29 of these datasets were already available and 7 datasets were made available in the Data Ark. Most authors did not respond to our data requests and a minority shared reasons for not sharing, such as legal or ethical constraints. These findings highlight an unresolved need to preserve important scientific datasets and increase their accessibility to the scientific community.

https://doi.org/10.31222/osf.io/w9crz

"Towards a Toolbox for Automated Assessment of Machine-Actionable Data Management Plans"

Most research funders require Data Management Plans (DMPs). The review process can be time consuming, since reviewers read text documents submitted by researchers and provide their feedback. Moreover, it requires specific expert knowledge in data stewardship, which is scarce. Machine-actionable Data Management Plans (maDMPs) and semantic technologies increase the potential for automatic assessment of information contained in DMPs. However, the level of automation and new possibilities are still not well-explored and leveraged. This paper discusses methods for the automation of DMP assessment. It goes beyond generating human-readable reports. It explores how the information contained in maDMPs can be used to provide automated pre-assessment or to fetch further information, allowing reviewers to better judge the content. We map the identified methods to various reviewer goals.

https://doi.org/10.5334/dsj-2023-028

"Engaging with Researchers and Raising Awareness of FAIR and Open Science through the FAIR+ Implementation Survey Tool (FAIRIST)"

Seven years after publication of the seminal paper on FAIR, which introduced the concept of making research outputs Findable, Accessible, Interoperable, and Reusable, researchers still struggle to understand how to implement the principles. For many researchers, FAIR promises long-term benefits for near-term effort, requires skills not yet acquired, and is one more thing in a long list of unfunded mandates and onerous requirements for scientists. Even for those required to, or who are convinced that they must make time for FAIR research practices, their preference is for just-in-time advice properly sized to the scientific artifacts and process. Because of the generality of most FAIR implementation guidance, it is difficult for a researcher to adjust the advice according to their situation. Technological advances, especially in the area of artificial intelligence (AI) and machine learning (ML), complicate FAIR adoption, as researchers and data stewards ponder how to make software, workflows, and models FAIR and reproducible. The FAIR+ Implementation Survey Tool (FAIRIST) mitigates the problem by integrating research requirements with research proposals in a systematic way. FAIRIST factors in new scholarly outputs, such as nanopublications and notebooks, and the various research artifacts related to AI research (data, models, workflows, and benchmarks). Researchers step through a self-serve survey process and receive a table ready for use in their data management plan (DMP) and/or work plan, while gaining awareness of the FAIR Principles and Open Science concepts. FAIRIST is a model that uses part of the proposal process as a way to do outreach and raise awareness of FAIR dimensions and considerations, while providing timely assistance for competitive proposals.

https://doi.org/10.5334/dsj-2023-032

"The Effects of Research Data Management Services: Associating the Data Curation Lifecycle with Open Research Output"

This study seeks to understand the relationship between research data management (RDM) services framed in the data curation life cycle and the production of open data. An electronic questionnaire was distributed to US researchers and RDM specialists, and the results were analyzed using Chi-Square tests for association. The data curation life cycle does associate with the production of open data and shareable research, but tasks like data management plans have stronger associations with the production of open data. The findings analyze the intersection of these concepts and provide insight into RDM services that facilitate the production of open data and shareable research.

https://doi.org/10.5860/crl.84.5.751