Data Curation, Open Data, and Research Data Management

"Prevalence and Predictors of Data and Code Sharing in the Medical and Health Sciences: Systematic Review with Meta-Analysis of Individual Participant Data"

The review found that public code sharing was persistently low across medical research. Declarations of data sharing were also low, increasing over time, but did not always correspond to actual sharing of data. The effectiveness of mandatory data sharing policies varied substantially by journal and type of data, a finding that might be informative for policy makers when designing policies and allocating resources to audit compliance.

https://doi.org/10.1136/bmj-2023-075767

Directions in Digital Scholarship: Support for Digital, Data-Intensive, and Computational Research in Academic Libraries

This report of a 2023 Coalition for Networked Information (CNI) initiative takes a broad look at library engagement with digital scholarship (DS) and examines connections with data-intensive and computational research over roughly the past five years and into the future. . . . To understand trends in DS programs, including attention to the impact of the pandemic, especially with reference to the importance of physical spaces and in-person programming, evidence was gathered from several sources, including online interviews with 12 library and DS leaders, profiles of 47 libraries’ DS programs, and conversations during two online forums representing a total of 24 institutions. Findings from these sources are analyzed and synthesized in this report.

https://tinyurl.com/398nzhcx

"Signing Data Citations Enables Data Verification and Citation Persistence"

Increasingly, digital datasets are being published with assigned identifiers, then cited in papers as the basis for repeatable experiments. To help future readers find and verify data, customary citations can be extended with content signatures, which can be introduced without having to replace existing identifiers such as DOIs and ARKs. That is, signatures can be seen as complementary identifiers to help keep specific versions of cited data findable and identifiable as they evolve and change locations. For example, if a DOI identifies an evolving dataset rather than a fixed version (i.e., content drift is expected), the DOI can safely be cited for the sake of attribution, metadata linking, and citation statistics (e.g., by Crossref (https://www.crossref.org) and DataCite (https://datacite.org)), while the content signature helps the reader find the exact content that was cited, possibly with assistance from metadata linked to the DOI. Additionally, a citation that includes both the DOI (for example) and content signature of a dataset creates a fixed mapping between the two identifiers. Then, unintentional content drift by the DOI can be detected and reported, and an alternative location may potentially be discovered by consulting public content signature registries.

https://doi.org/10.1038/s41597-023-02230-y
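
The idea of pairing a DOI with a content signature can be illustrated with a minimal sketch. It assumes the signature is a SHA-256 digest of the dataset's bytes; the `hash://sha256/` prefix and the function names are illustrative choices for this example, not a description of the authors' tooling.

```python
import hashlib


def content_signature(data: bytes) -> str:
    """Return a signature for in-memory content (assumed format: SHA-256 hex)."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def file_signature(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the same kind of signature for a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "hash://sha256/" + digest.hexdigest()


def verify_citation(data: bytes, cited_signature: str) -> bool:
    """Check whether retrieved content matches the signature in a citation."""
    return content_signature(data) == cited_signature


# Hypothetical dataset content, signed at citation time.
cited = content_signature(b"site,count\nA,3\n")

print(verify_citation(b"site,count\nA,3\n", cited))  # True: content unchanged
print(verify_citation(b"site,count\nA,4\n", cited))  # False: content has drifted
```

A reader who resolves the DOI and finds that the check fails knows the content has changed since it was cited, which is the drift-detection case the abstract describes.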

"Perceived Benefits of Open Data Are Improving but Scientists Still Lack Resources, Skills, and Rewards"

Addressing global scientific challenges requires the widespread sharing of consistent and trustworthy research data. Identifying the factors that influence widespread data sharing will help us understand the limitations and potential leverage points. We used two well-known theoretical frameworks, the Theory of Planned Behavior and the Technology Acceptance Model, to analyze three DataONE surveys published in 2011, 2015, and 2020. These surveys aimed to identify individual, social, and organizational influences on data-sharing behavior. In this paper, we report on the application of multiple factor analysis (MFA) on this combined, longitudinal, survey data to determine how these attitudes may have changed over time. The first two dimensions of the MFA were named willingness to share and satisfaction with resources based on the contributing questions and answers. Our results indicated that both dimensions are strongly influenced by individual factors such as perceived benefit, risk, and effort. Satisfaction with resources was significantly influenced by social and organizational factors such as the availability of training and data repositories. Researchers that improved in willingness to share are shown to be operating in domains with a high reliance on shared resources, are reliant on funding from national or federal sources, work in sectors where internal practices are mandated, and live in regions with highly effective communication networks. Significantly, satisfaction with resources was inversely correlated with willingness to share across all regions. We posit that this relationship results from researchers learning what resources they actually need only after engaging with the tools and procedures extensively.

https://doi.org/10.1057/s41599-023-01831-7

Video: "Dryad in the Community: New Data Sharing Mandates and the Role of Academic Libraries"

In this presentation, Dryad’s Head of Community Engagement, Sarah Lippincott, is joined by fellow presenters Michael Casp, Head of Production Division at J&J Editorial; Emma Molls, Director of Open Research & Publishing at University of Minnesota Libraries; and Alberto Pepe, Director of Strategy and Innovation at Wiley and Co-founder of Authorea. Sarah reviews some pertinent highlights from the Nelson memo and NIH policies, two of the major developments that will impact data sharing over the next few years, and concludes with a discussion on how libraries can help researchers move from data sharing to data publishing.

https://tinyurl.com/bdfd7axh

"It Takes a Researcher to Know a Researcher: Academic Librarian Perspectives Regarding Skills and Training for Research Data Support in Canada"

This study demonstrates that an in-depth qualitative portrait of data-related librarians within a national academic ecosystem provides valuable new insights regarding the perceived importance of conducting original empirical research to succeed in these roles.

https://doi.org/10.18438/eblip30297

University of Hawaii at Manoa: Needs Assessment for Data Management and Sharing Training Courses

Data management is an increasingly fundamental skill for graduate students and researchers in the biomedical sciences, especially as National Institutes of Health (NIH) and other funding agencies are now beginning to require data management and sharing plans as part of research. Since the University of Hawaii at Manoa (UHM) Library provided little support for this area and existing data management and sharing instructional content are either out of date or fail to address the unique needs of the UHM research community, the UHM Library took steps to establish data management and sharing instruction services to meet the specific needs of the UHM research community.

https://hdl.handle.net/10125/104944

"DataChat: Prototyping a Conversational Agent for Dataset Search and Visualization"

Data users need relevant context and research expertise to effectively search for and identify relevant datasets. Leading data providers, such as the Inter-university Consortium for Political and Social Research (ICPSR), offer standardized metadata and search tools to support data search. Metadata standards emphasize the machine-readability of data and its documentation. There are opportunities to enhance dataset search by improving users’ ability to learn about, and make sense of, information about data. Prior research has shown that context and expertise are two main barriers users face in effectively searching for, evaluating, and deciding whether to reuse data. In this paper, we propose a novel chatbot-based search system, DataChat, that leverages a graph database and a large language model to provide novel ways for users to interact with and search for research data. DataChat complements data archives’ and institutional repositories’ ongoing efforts to curate, preserve, and share research data for reuse by making it easier for users to explore and learn about available research data.

https://arxiv.org/abs/2305.18358

Digital Scholarship Has Released the Academic Libraries and Research Data Management Bibliography

The Academic Libraries and Research Data Management Bibliography includes over 345 selected English-language articles and books that are useful in understanding how academic libraries plan for, implement, provide, evaluate, and conduct studies about research data management (RDM) services. Most sources have been published from 2012 through 2023. It includes full abstracts for works under certain Creative Commons Licenses. It is available as a website and a website PDF with live links.

Digital Scholarship’s other bibliographies about research data curation include the Research Data Curation and Management Bibliography (over 800 works), the Research Data Publication and Citation Bibliography (over 225 works), and the Research Data Sharing and Reuse Bibliography (over 200 works).

"FAIR in Action — A Flexible Framework to Guide FAIRification"

The COVID-19 pandemic has highlighted the need for FAIR (Findable, Accessible, Interoperable, and Reusable) data more than any other scientific challenge to date. We developed a flexible, multi-level, domain-agnostic FAIRification framework, providing practical guidance to improve the FAIRness for both existing and future clinical and molecular datasets. We validated the framework in collaboration with several major public-private partnership projects, demonstrating and delivering improvements across all aspects of FAIR and across a variety of datasets and their contexts. We therefore managed to establish the reproducibility and far-reaching applicability of our approach to FAIRification tasks.

https://doi.org/10.1038/s41597-023-02167-2

"Rethinking Transparency and Rigor from a Qualitative Open Science Perspective"

To further complicate matters, many qualitative researchers would posit that while secondary data are a combination of the researcher’s perceptions and observations, even primary data, such as interview transcripts, are filtered to some extent through the researcher. This is because, in qualitative research, the researcher is an instrument of both data collection and analysis . . . .

The researcher-as-instrument tradition also complicates discussions around reproducibility (i.e., the ability for another researcher to look at someone’s data and reproduce the analyses), one of the key components of rigor as it is currently discussed in the open science movement (NIH, n.d.). Quantitative researchers’ focus on reproducibility is often contrary to the tenets of qualitative research, particularly in methodologies aiming to uncover new ways of knowing, such as constructivist and grounded theory approaches. If one understands the researcher as a data collection instrument and a filter through which data is processed, strict quantitative-focused reproducibility becomes less likely—not through misconduct or error, but because ultimately, people conduct research, and people are not likely to have exactly the same perspectives. Guidelines that reinforce reproducibility without addressing this tension are not going to be useful for all researchers.

https://bit.ly/3MEbtnk

"A Pilot Study to Locate Historic Scientific Data in a University Archive"

Historic data in analog (or print) format is a valuable resource that is utilized by scientists in many fields. This type of data may be found in various locations on university campuses including offices, labs, storage facilities, and archives. This study investigates whether biological data held in one institutional university archives could be identified, described, and thus made potentially useful for contemporary life scientists. Scientific data was located and approximately half of it was deemed to be of some value to current researchers and about 20% included enough information for the study to be repeated. Locating individual data sets in the collections at the University Archives at the University of Minnesota proved challenging. This preliminary work points to possible ways to move forward to make raw data in university archives collections more discoverable and likely to be reused. It raises questions that can help inform future work in this area.

https://bit.ly/41JBMNb

"Initial Insight Into Three Modes of Data Sharing: Prevalence of Primary Reuse, Data Integration and Dataset Release in Research Articles"

While data sharing has received research interest in recent times, its real status remains unclear, owing to its ambiguous concept. To understand the current status of data sharing, this study examined primary reuse, data integration, and dataset release as the actual practices of data sharing. A total of 963 articles, chosen from those published in 2018 and registered in the Web of Science global citation database, were manually checked. Existing data were reused in the mode of data integration (13.3%) as frequently as they were for the mode of primary reuse (12.1%). Dataset release was the least common mode (9.0%). The results show the variation in data sharing and indicate the need for standardization of data description in articles based on thorough registration and expansion in public data archives to close the loop that results in the virtuous cycle of research data.

https://doi.org/10.1002/leap.1546

"‘We Share All Data with Each Other’: Data-Sharing in Peer-to-Peer Relationships"

The analysis identifies three social forms of data-sharing in peer-to-peer relationships: (a) closed communal sharing, which is based on a feeling of belonging together; (b) closed associative sharing, in which the participants act on the basis of an agreement; and (c) open associative sharing, which is oriented to “institutional imperatives” (Merton) and to formal regulations. The study shows that far more data-sharing is occurring in scientific practice than seems to be apparent from a concept of open data alone.

https://doi.org/10.1007/s11024-023-09487-y

Open Science: A Practical Guide for Early-Career Researchers

Beginning researchers are an important link in the transition to Open Science, so this guide is aimed at PhD candidates, Research Master Students, and early-career researchers from all disciplines at Dutch universities and research institutes. [This guide will be very useful to non-Dutch researchers.] It is designed to accompany researchers in every step of their research, from the phase of preparing your research project and discovering relevant resources (chapter 2) to the phase of data collection and analysis (chapter 3), writing and publishing articles, data, and other research output (chapter 4), and outreach and assessment (chapter 5). Every chapter provides you with the best tools and practices to implement immediately.

https://doi.org/10.5281/zenodo.7716152

Digital Scholarship Has Released Digital Curation Certificate and Master’s Degree Programs

Digital Scholarship has released Digital Curation Certificate and Master’s Degree Programs. This document describes digital curation certificate and master’s degree programs in North America, identifying those that are online. It does not cover individualized certificate programs, such as those at Indiana University Bloomington or the University of Illinois Urbana-Champaign. Nor does it cover digital curation specializations within MLS and other master’s degree programs in iSchools. It is available as a website and a website PDF with live links.

"Spain Adopts National Open Access Strategy"

Spain has approved a four-year national strategy for open science, under which all outputs of publicly financed research will be made available free upon publication.

Under the strategy, open access will become the default mode for all research funded, directly or indirectly, with public funds. . . .

A budget of €23.8 million in 2023 will be maintained annually until 2027.

https://bit.ly/414w2gY

"How and Why Do Researchers Reference Data? A Study of Rhetorical Features and Functions of Data References in Academic Articles"

Data reuse is a common practice in the social sciences. While published data play an essential role in the production of social science research, they are not consistently cited, which makes it difficult to assess their full scholarly impact and give credit to the original data producers. Furthermore, it can be challenging to understand researchers’ motivations for referencing data. Like references to academic literature, data references perform various rhetorical functions, such as paying homage, signaling disagreement, or drawing comparisons. This paper studies how and why researchers reference social science data in their academic writing. We develop a typology to model relationships between the entities that anchor data references, along with their features (access, actions, locations, styles, types) and functions (critique, describe, illustrate, interact, legitimize). We illustrate the use of the typology by coding multidisciplinary research articles (n = 30) referencing social science data archived at the Inter-university Consortium for Political and Social Research (ICPSR). We show how our typology captures researchers’ interactions with data and purposes for referencing data. Our typology provides a systematic way to document and analyze researchers’ narratives about data use, extending our ability to give credit to data that support research.

https://doi.org/10.5334/dsj-2023-010
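
The coding scheme described in the abstract can be pictured as a small data structure. The sketch below is an assumption-laden illustration: the five functions and five feature dimensions are taken from the abstract, while the class names, field layout, and sample records are hypothetical and do not reproduce the authors' codebook.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Function(Enum):
    """The five rhetorical functions named in the abstract."""
    CRITIQUE = "critique"
    DESCRIBE = "describe"
    ILLUSTRATE = "illustrate"
    INTERACT = "interact"
    LEGITIMIZE = "legitimize"


# The five feature dimensions named in the abstract.
FEATURES = frozenset({"access", "actions", "locations", "styles", "types"})


@dataclass
class DataReference:
    """One coded data reference (hypothetical record layout)."""
    article_id: str
    dataset: str
    functions: set = field(default_factory=set)
    features: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject feature dimensions that are not part of the typology.
        unknown = set(self.features) - FEATURES
        if unknown:
            raise ValueError(f"unknown feature dimensions: {sorted(unknown)}")


def tally_functions(references):
    """Count how often each rhetorical function was coded."""
    return Counter(f for ref in references for f in ref.functions)


# Hypothetical coded references, for illustration only.
refs = [
    DataReference("article-01", "example-study-A",
                  {Function.DESCRIBE, Function.LEGITIMIZE},
                  {"locations": "methods section"}),
    DataReference("article-02", "example-study-A", {Function.DESCRIBE}),
]
counts = tally_functions(refs)
print(counts[Function.DESCRIBE])  # 2
```

Keeping functions as a set per reference reflects the possibility that a single reference serves more than one rhetorical purpose; that is a modeling assumption of this sketch.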

Common Scholarly Communication Infrastructure Landscape Review

Scholarly communication is a complicated sector, with numerous participants and multiple mechanisms for communicating and reviewing materials created in an increasing variety of formats by researchers across the globe.[1] In turn, the researcher who seeks to use the products of this system wishes to discover, access, and use relevant and trustworthy materials as effortlessly as possible. The work of driving efficiency into this complex sector while bringing its multiple strands together seamlessly for the reader (or, increasingly, for a computational user) rests on a foundation of infrastructure, much of it shared across multiple publishers. In this landscape review, we seek to provide a high-level overview of the shared infrastructure that supports scholarly communication.

https://doi.org/10.18665/sr.318775

"Data Sharing in the Context of Community-Engaged Research Partnerships"

Over the past 20 years, the National Institutes of Health (NIH) has implemented several policies designed to improve sharing of research data, such as the NIH public access policy for publications, NIH genomic data sharing policy, and National Cancer Institute (NCI) Cancer Moonshot public access and data sharing policy. . . . Important questions that we must consider as data sharing is expanded are to whom do benefits of data sharing accrue and to whom do benefits not accrue? In an era of growing efforts to engage diverse communities in research, we must consider the impact of data sharing for all research participants and the communities that they represent.

We examine the issue of data sharing through a community-engaged research lens, informed by a long-standing partnership between community-engaged researchers and a key community health organization (Kruse et al., 2022). We contend that without effective community engagement and rich contextual knowledge, biases resulting from data sharing can remain unchecked. We provide several recommendations that would allow better community engagement related to data sharing to ensure both community and researcher understanding of the issues involved and move toward shared benefits. By identifying good models for evaluating the impact of data sharing on communities that contribute data, and then using those models systematically, we will advance the consideration of the community perspective and increase the likelihood of benefits for all.

https://doi.org/10.1016/j.socscimed.2023.115895

"Estimating Social Bias in Data Sharing Behaviours: An Open Science Experiment"

Open data sharing is critical for scientific progress. Yet, many authors refrain from sharing scientific data, even when they have promised to do so. Through a preregistered, randomized audit experiment (N = 1,634), we tested possible ethnic, gender and status-related bias in scientists’ data-sharing willingness. 814 (54%) authors of papers where data were indicated to be ‘available upon request’ responded to our data requests, and 226 (14%) either shared or indicated willingness to share all or some data. While our preregistered hypotheses regarding bias in data-sharing willingness were not confirmed, we observed systematically lower response rates for data requests made by putatively Chinese treatments compared to putatively Anglo-Saxon treatments. Further analysis indicated a theoretically plausible heterogeneity in the causal effect of ethnicity on data-sharing. In interaction analyses, we found indications of lower responsiveness and data-sharing willingness towards male but not female data requestors with Chinese names. These disparities, which likely arise from stereotypic beliefs about male Chinese requestors’ trustworthiness and deservingness, impede scientific progress by preventing the free circulation of knowledge.

https://doi.org/10.1038/s41597-023-02129-8

"Building a Large-Scale Digital Library Search Interface Using the Libraries Online Catalog"

The Kentucky Digital Newspaper Program (KDNP) was born out of the University of Kentucky Libraries’ (UKL) work in the National Digital Newspaper Program (NDNP) that began in 2005. In early 2021, a team of specialists at UKL from library systems, digital archives, and metadata management was formed to explore a new approach to searching this content by leveraging the power of the library services platform (Alma) and discovery system (Primo VE) licensed from Ex Libris. The result was the creation of a dedicated Primo VE search interface that would include KDNP content as well as all Kentucky newspapers held on microfilm in the UKL system. This article will describe the journey from the question of whether we could harness the power of Alma and Primo VE to display KDNP content, to the methodology used in creating a new dedicated search interface that can be replicated to create custom search interfaces of your own.

https://journal.code4lib.org/articles/17257

Paywall: "We Need a Plan D"

Researchers, institutions and funders should collaborate to develop an overarching strategy for data preservation — a plan D. There will doubtless be calls for a ‘PubMed Central for data’. But what we really need is a federated system of repositories with functionality tailored to the information that they archive. This will require domain experts to agree standards for different types of data from different fields: what should be archived and when, which format, where, and for how long.

https://doi.org/10.1038/s41592-023-01817-y

"Science Journals Integrate Dryad to Simplify Data Deposition and Strengthen Scientific Reproducibility"

The Science family journals have announced a partnership with the nonprofit data repository Dryad that simplifies the process by which authors deposit data underlying new work — a critical step to facilitating data’s routine reuse. The partnership is yet another step taken by the Science journals to ensure data the scientific community requires to verify, replicate and reanalyze new research is openly available. . . .

Because the partnership with Dryad integrates Dryad’s platform with the Science family journals’ submission process, authors will have the option to deposit data at Dryad directly from the submission site of any Science family journal. As authors submit research to the journals, they will be prompted about data availability and welcome to deposit their data to any suitable disciplinary repository. But, if data do not yet have a home, authors will have the opportunity to upload their data to Dryad. . . .

To ensure that this service is widely available, the Science journals will cover costs of Dryad data publication for accepted papers.

http://bit.ly/43wtVoD

"Guest Post — Why Interoperability Matters for Open Research — And More Than Ever"

The question remains, why have we not achieved more in delivering connectivity across the research system? While funding for this kind of underpinning infrastructure is notable in its absence (or where it is available it is often too temporary in nature), the other major challenge is in securing adoption among the service providers (funders, publishers, and institutions among the key players) that would maximize the use and potential of building those connections. It is notoriously hard for organisations to tweak or adapt existing workflows and legacy systems and to demonstrate the benefits (and hence prioritise the work) at an individual organisation level that may seem obvious at a system level.

https://cutt.ly/K7hxFQz