Digital Repositories – DigitalKoans

"Enabling Preprint Discovery, Evaluation, and Analysis with Europe PMC"

Preprints provide an indispensable tool for rapid and open communication of early research findings. Preprints can also be revised and improved based on scientific commentary uncoupled from journal-organised peer review. The uptake of preprints in the life sciences has increased significantly in recent years, especially during the COVID-19 pandemic, when immediate access to research findings became crucial to address the global health emergency. With ongoing expansion of new preprint servers, improving discoverability of preprints is a necessary step to facilitate wider sharing of the science reported in preprints. To address the challenges of preprint visibility and reuse, Europe PMC, an open database of life science literature, began indexing preprint abstracts and metadata from several platforms in July 2018. Since then, Europe PMC has continued to increase coverage through addition of new servers, and expanded its preprint initiative to include the full text of preprints related to COVID-19 in July 2020 and then the full text of preprints supported by the Europe PMC funder consortium in April 2022. The preprint collection can be searched via the website and programmatically, with abstracts and the open access full text of COVID-19 and Europe PMC funder preprint subsets available for bulk download in a standard machine-readable JATS XML format. This enables automated information extraction for large-scale analyses of the preprint corpus, accelerating scientific research of the preprint literature itself. This publication describes steps taken to build trust, improve discoverability, and support reuse of life science preprints in Europe PMC. Here we discuss the benefits of indexing preprints alongside peer-reviewed publications, and challenges associated with this process.

https://doi.org/10.1371/journal.pone.0303005
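
The abstract above notes that the preprint collection can be searched programmatically. As a rough illustration only, the sketch below builds a search URL for the Europe PMC RESTful search service that is limited to preprint records. The endpoint address, the `SRC:PPR` source filter, and the helper name `build_preprint_query` are assumptions of this sketch and are not taken from the article; check them against the current Europe PMC documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL of the Europe PMC RESTful search service.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_preprint_query(terms: str, page_size: int = 25) -> str:
    """Return a search URL restricted to preprint records (hypothetical helper)."""
    params = {
        # SRC:PPR is assumed here to be the source filter for preprints.
        "query": f"({terms}) AND SRC:PPR",
        "format": "json",
        "pageSize": page_size,
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


url = build_preprint_query("COVID-19")
print(url)
```

The function only assembles the URL and makes no network request, so the result can be inspected before it is passed to an HTTP client.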

"Knowledge Infrastructures are Growing Up: The Case for Institutional (Data) Repositories 10 Years After the Holdren Memo"

Institutional data repositories are uniquely positioned to support researchers in sharing scholarly outputs. As funding agencies develop and institute policies for research data access and sharing, institutional data repositories have emerged as a critical feature in ecosystems for data stewardship and sharing. We show that institutional data repositories can meet and exceed the requirements and recommendations of federal data policy, thereby maximizing the benefits of data sharing. We present results of a mixed-method study which explores the adoption and usage of institutional repositories to share data from 2017 to 2023. Data from two previous studies were combined with data collected in 2023 on the data sharing solutions of Association of Research Libraries member institutions in the United States and Canada. The analysis of the aggregated data indicates that data stewardship has increased in both institutional repositories and institutional data repositories with an increase in complementary infrastructure to support data sharing. We then conduct an “infrastructural inversion” (Bowker & Star, 1999) to ‘surface invisible work’ of making data repositories function well, and demonstrate that institutional data repositories have advantages for providing sustainable stewardship, curation, and sharing of research data. Finally, we show that institutional data repositories may produce additional benefits through established infrastructure, local interoperability, and control.

https://doi.org/10.5334/dsj-2024-046

"‘Does It Feel like a Scientific Paper?’: A Qualitative Analysis of Preprint Servers’ Moderation and Quality Assurance Processes"

In recent years, preprints—i.e., scholarly manuscripts that have not been peer reviewed or published in a journal—have emerged as a major source of research communication and a critical component of open science. However, concerns have been raised about preprints’ potential to facilitate the spread of flawed or misleading research due to the lack of quality control performed by preprint servers. Yet, there is limited knowledge of how servers currently vet incoming content and how this impacts the openness and diversity of scholarly content. In this paper, we examine preprint servers’ moderation processes, the intentions underpinning them, and their potential effects through a qualitative analysis of in-depth interviews with 14 key preprint server personnel. We find a wide range of moderation processes, which vary depending on specific server contexts and needs and are motivated by a desire to prevent the spread of misinformation and protect trust in preprints and servers. Participants repeatedly emphasized the difference between their moderation processes and peer review, but in practice often applied similar criteria for delineating scientific from unscientific content. Moreover, moderation processes often relied on trust cues, such as article formats or author affiliations, as proxies for research quality, potentially introducing similar biases as have been found in traditional journal peer review. We discuss implications for the diversity of preprint content and authors, as well as the future of preprint servers within an evolving scholarly communication ecosystem.

https://doi.org/10.31222/osf.io/mp6ky

"Interview: Deciphering the Law: Hachette v. Internet Archive Pt. 1 (2023) with Dave Hansen"

This is the first in a series of interviews with those closely tied to the Hachette v. Internet Archive lawsuit. In March 2023, the court ruled against the Internet Archive and its use of the Emergency Lending Library, causing a ripple throughout the library and education fields. Below, find answers from JCEL contributors and copyright scholars Dave Hansen, Michelle Wu, and Kyle Courtney to some of the questions that the case elicited.

https://doi.org/10.17161/jcel.v7i2.21337

"Closing Gaps: A Model of Cumulative Curation and Preservation Levels for Trustworthy Digital Repositories"

Curation and preservation measures carried out by digital repository staff are an important building block in maintaining the accessibility and usability of digital resources over time. The measures adequate to achieve long-term usability for a given audience strongly depend on scenarios of (re)use, the (intended) users’ needs and skills, the organisational setting (e.g., mission, resources, policies), as well as the characteristics of the digital objects to be preserved. The assessment of curation and preservation measures also forms an important part of existing certification procedures for trustworthy digital repositories (TDRs) as offered, for example, by the CoreTrustSeal foundation, the nestor network, or ISO.

The digital curation community is presented with the challenge of finding community-, organisation-, and object-specific approaches to curation and preservation at the same time as defining the minimum level of curation and preservation measures expected from a TDR in sufficiently generic terms to ensure applicability to a wide array of repositories. Against this backdrop, this paper discusses the need for and benefits of community-agreed levels of curation and preservation to address this challenge, and considers the tiered model proposed by the CoreTrustSeal Board as an example.

The proposed model is then applied in an analysis of successful CoreTrustSeal applications from 2018–2022 in an effort to better understand the capacity of the curation and preservation levels to capture the respective practices of repositories and to identify potential gaps.

https://doi.org/10.2218/ijdc.v18i1.926

"Transparent Disclosure, Curation & Preservation of Dynamic Digital Resources"

This paper explores an enhanced curation lifecycle being developed at the UK Data Service (UKDS), with our Data Product Builder. Through a Graphical User Interface, we aim to provide the researcher with a tailored digital resource. We detail the threefold motivation behind this initiative: data dissemination scalability, researcher satisfaction and the reduction of nationwide duplication of research effort.

Subsequent sections detail the technical components and challenges involved. In addition to more standard data subsetting, filtering and linking components, this data dissemination platform offers dynamic disclosure assessments – identifying combinations of variables that present a potential disclosure risk. All components are underpinned by the Data Documentation Initiative’s new Cross-Domain Integration standard (DDI-CDI), designed to handle the many structures in which data may be organised.

Ever conscious of the scale of the task we are embarking on, we remain motivated by the need for such advances in data dissemination and optimistic of the feasibility of such a system to meet the needs of the researcher while balancing the data disclosivity concerns of the data depositor.

https://doi.org/10.2218/ijdc.v18i1.937

"Toward Enhanced Reusability: A Comparative Analysis of Metadata for Machine Learning Objects and Their Characteristics in Generalist and Specialist Repositories"

Objective: The rapidly increasing prevalence and application of machine learning (ML) across disciplines creates a pressing need to establish guidance for data curation professionals. However, we must first understand the characteristics of ML-related objects shared in generalist and specialist repositories and the extent to which repository metadata fields enable findability and reuse of ML objects.

Methods: We used a combination of API queries and web scraping to retrieve metadata for ML objects in eight commonly used generalist and ML-specific data repositories. We assessed both metadata schema and characteristics of deposited ML objects, within the context of the widely adopted FAIR Principles. We also calculated summary statistics for properties of objects, including number of objects per year, dataset size, domains represented, and availability of related resources.

Results: Generalist repositories excelled at providing provenance metadata, specifically unique identifiers, unambiguous citations, clear licenses, and related resources, while specialist repositories emphasized ML-specific descriptive metadata, such as number of attributes and instances and task type. In terms of object content, we noted a wide range of file formats, as well as licenses, all of which impact reusability.

Conclusions: Generalist repositories will benefit from some of the practices adopted by specialists, and specialist repositories will benefit from adopting proven data curation practices of generalist repositories. A step forward for repositories will be to invest more into use of labels and persistent identifiers to improve workflow documentation, provenance, and related resource linking of ML objects, which will increase their findability, interoperability, and reusability.

https://doi.org/10.7191/jeslib.685

Paywall: "Constructing Risk in Trustworthy Digital Repositories"

This article investigates the construction of risk within trustworthy digital repository audits. It contends that risk is a social construct, and social factors influence how stakeholders in digital preservation processes comprehend and react to risk.

https://doi.org/10.1108/JD-08-2023-0157

Paywall: "Data Quality Assurance Practices in Research Data Repositories — A Systematic Literature Review"

This study conducted a systematic analysis of data quality assurance (DQA) practices in RDRs, guided by activity theory and data quality literature, resulting in conceptualizing a data quality assurance model (DQAM) for RDRs. DQAM outlines a DQA process comprising evaluation, intervention, and communication activities and categorizes 17 quality dimensions into intrinsic and product-level data quality. It also details specific improvement actions for data products and identifies the essential roles, skills, standards, and tools for DQA in RDRs. By comparing DQAM with existing DQA models, the study highlights its potential to improve these models by adding a specific DQA activity structure.

https://doi.org/10.1002/asi.24948

"Internet Archive Forced to Remove 500,000 Books after Publishers’ Court Win"

As a result of book publishers successfully suing the Internet Archive (IA) last year, the free online library that strives to keep growing online access to books recently shrank by about 500,000 titles. . . .

To restore access, IA is now appealing, hoping to reverse the prior court’s decision by convincing the US Court of Appeals in the Second Circuit that IA’s controlled digital lending of its physical books should be considered fair use under copyright law. An April court filing shows that IA intends to argue that the publishers have no evidence that the e-book market has been harmed by the open library’s lending, and copyright law is better served by allowing IA’s lending than by preventing it. . . .

Freeland [Chris Freeland, IA’s director of library service] told Ars it could take months or even more than a year before a decision is reached in the case.

While IA fights to end the injunction, its other library services continue growing, IA has said. IA "may still digitize books for preservation purposes" and "provide access to our digital collections" through interlibrary loan and other means. IA can also continue lending out-of-print and public domain books.

https://tinyurl.com/47aws7z7

"Analyzing Research Data Repositories (RDR) from BRICS Nations: A Comprehensive Study"

As of March 2, 2024, re3data.org indexes a total of 3,192 Research Data Repositories (RDRs) worldwide, with BRICS nations contributing 195. China leads among BRICS nations, followed by India, Russia, and Brazil. . . . "House, tailor-made" software is widely used for creating RDRs, followed by Dataverse and DSpace. . . . Most repositories are disciplinary, followed by institutional ones. Most repositories specify data upload types, with "restricted" being the most common, followed by closed types. Open access is predominant in data access, followed by restricted access and embargo periods, while a small number restrict access entirely.

https://doi.org/10.1108/LM-04-2024-0040

"Biomedical Data Repository Concepts and Management Principles"

The demand for open data and open science is on the rise, fueled by expectations from the scientific community, calls to increase transparency and reproducibility in research findings, and developments such as the Final Data Management and Sharing Policy from the U.S. National Institutes of Health and a memorandum on increasing public access to federally funded research, issued by the U.S. Office of Science and Technology Policy. This paper explores the pivotal role of data repositories in biomedical research and open science, emphasizing their importance in managing, preserving, and sharing research data. Our objective is to familiarize readers with the functions of data repositories, set expectations for their services, and provide an overview of methods to evaluate their capabilities. The paper serves to introduce fundamental concepts and community-based guiding principles and aims to equip researchers, repository operators, funders, and policymakers with the knowledge to select appropriate repositories for their data management and sharing needs and foster a foundation for the open sharing and preservation of research data.

https://doi.org/10.1038/s41597-024-03449-z

"Understanding the Value of Curation: A Survey of US Data Repository Curation Practices and Perceptions"

Data curators play an important role in assessing data quality and take actions that may ultimately lead to better, more valuable data products. This study explores the curation practices of data curators working within US-based data repositories. We performed a survey in January 2021 to benchmark the levels of curation performed by repositories and assess the perceived value and impact of curation on the data sharing process. Our analysis included 95 responses from 59 unique data repositories. Respondents primarily were professionals working within repositories and examined curation performed within a repository setting. A majority (72.6%) of respondents reported that "data-level" curation was performed by their repository and around half reported their repository took steps to ensure interoperability and reproducibility of their repository’s datasets. Curation actions most frequently reported include checking for duplicate files, reviewing documentation, reviewing metadata, minting persistent identifiers, and checking for corrupt/broken files. The most "value-add" curation action across generalist, institutional, and disciplinary repository respondents was related to reviewing and enhancing documentation. Respondents reported high perceived impact of curation by their repositories on specific data sharing outcomes including usability, findability, understandability, and accessibility of deposited datasets; respondents associated with disciplinary repositories tended to perceive higher impact on most outcomes. Most survey participants strongly agreed that data curation by the repository adds value to the data sharing process and that it outweighs the effort and cost. We found some differences between institutional and disciplinary repositories, both in the reported frequency of specific curation actions as well as the perceived impact of data curation. Interestingly, we also found variation in the perceptions of those working within the same repository regarding the level and frequency of curation actions performed, which exemplifies the complexity of repository curation work. Our results suggest data curation may be better understood in terms of specific curation actions and outcomes than broadly defined curation levels and that more research is needed to understand the resource implications of performing these activities. We share these results to provide a more nuanced view of curation, and how curation impacts the broader data lifecycle and data sharing behaviors.

https://doi.org/10.1371/journal.pone.0301171

"The Puzzle of Large-Scale Digital Collections: Have We Reached an Inflection Point?"

Shared Collections allows institutions either to have JSTOR harvest their digital collections of documents, photos, and other special collections from a local Digital Asset Management System, or to create and share those same collections through JSTOR’s collection management tool. . . . While Shared Collections appears to represent a significant advance, the jury will be out for some time. The fundamental issues facing DPLA and Shared Collections are simply difficult, and the struggles with them have little or nothing to do with the skills or intentions of the capable people of both organizations. It is both a tough economic problem and an outcome of what we might call "rugged individualism in heritage collections": while shared descriptive efforts have been in place for books for more than a century, many standards for heritage collections have emerged since 2000. It’s a symptom of under-investment in cultural heritage in the United States.

https://rb.gy/597nkq

"A Census of Institutional Repositories at Regional Public Universities"

This study reports on the implementation of institutional repositories (IRs) at regional public universities (RPUs) in the United States and its territories. The author investigated repository platform choice, operation style, and content. More than half of RPUs have implemented an IR. The author discusses how these findings align with trends in previous research and explores the unique aspects of IRs at RPUs—particularly the prevalence of student works and special collections materials. For over two decades, institutional repositories (IRs) have been used at institutions of higher education to collect, preserve, and share the scholarly works of an institution. During that same time there have been an increasing number of studies looking at who has implemented an IR, the most popular IR platforms, and type and number of objects deposited in IRs. While some studies have looked at small or teaching-focused institutions, most of these studies have focused on IR implementations at large research-focused institutions.

https://tinyurl.com/yc2fs4r2

"Developing Open Access Resource Management Principles in a Consortial Environment: A University of California Model"

In the summer of 2021, the University of California (UC) migrated to a new integrated library system, called the Systemwide Integrated Library System project (SILS), which for the first time brought all ten UC campuses, two regional storage facilities, and the California Digital Library (CDL) together into one shared library system. With new potential for increased collaboration and cooperation, SILS leadership groups identified consortial open access (OA) resource management as a key opportunity in the new system, in alignment with UC’s priorities around discovery and access to library collections, as well as UC’s commitment to open access and transforming the scholarly communication landscape. This article discusses the formation of the UC Open Access Resource Management Task Force (OARMTF), a group charged to investigate what it would mean to consortially manage OA resources. Specifically, this article focuses on the OARMTF’s work setting out principles for OA resource management, which the authors hope may serve as a useful case study for other institutions or consortia interested in developing principles around OA resource management, as well as encourage more discussion and research into best practices for consortial management of OA resources.

https://doi.org/10.5860/lrts.68n1.8216

"Opening Up: A Global Context for Local Open Access Initiatives in Higher Education"

Open access policies and mandates can be a useful tool in persuading faculty at higher education institutions around the globe to produce and share open scholarship. But are such policies widely written, accepted, and adopted? Leveraging information found on the Registry of Open Access Repositories Mandatory Archiving Policies, this paper analyzes open access policies at higher education institutions worldwide. The data indicate that Europe holds the most policies, while fewer policies have been enacted in the Americas, Africa, Oceania, and Asia due to a myriad of barriers. Overall, better strategies to promote open access are needed, and such strategies may not necessarily take the form of an open access policy. My own investigation of global open access policies has informed my practices with respect to open access. In this paper, I demonstrate how librarians acting as policy entrepreneurs can assist with the promotion of open access at their institutions and then conclude with suggestions, solutions, and pathways beyond policy adoption to promote and advocate for open access.

https://tinyurl.com/2h3uz5n4

"Preprints, Journals and Openness: Disentangling Goals and Incentives"

I would argue that private funders such as the Gates Foundation or the Howard Hughes Medical Institute (HHMI) could provide material support through grants and policies for quality peer review, baking peer review into selection of grantees. Such an approach will require careful structures and mechanisms for reviewer selection, and measures of success, or we may run the risk of creating further inequities. Mind you, in many fields it is just hard to find good reviewers prepared to put in the effort required for a considered, thoughtful review. Societies, such as my own, could also consider material ways to support peer review more actively — a philosophical and practical approach to raising the profile of peer review at an early stage in the life of a researcher.

https://tinyurl.com/ymckyb9x

1 Million Images and Counting: "AI-Startup Launches Ever-Expanding Library of Free Stock Photos and Music"

StockCake is a new platform by AI startup Imaginary Machines. The site currently hosts more than a million pre-generated images. These images can be downloaded, used, and shared for free. There are no strings attached as all photos are in the public domain.

https://tinyurl.com/mvjd3683


2024 Fedora Technology Assessment Report

The Fedora Program Team, in collaboration with the Technology Working Group, designed a project to understand the specific Fedora-related priorities of using institutions, along with the capacity and available resources of both individuals and institutions to contribute to the Fedora community between 2024 and 2026. They collaborated with the Research and Innovation Division at Lyrasis to survey Fedora users. Responses were collected between November 2023 and January 31, 2024, and analyzed by Leigh A. Grinstead, Senior Digital Services Consultant from Lyrasis, an independent, nonprofit, research group.

https://tinyurl.com/2s4b4rec

"Support for OSF Preprint Infrastructure and Community Servers"

Numerous Ivy Plus Libraries Confederation (IPLC) partner institutions* will provide three years of financial support for the Center for Open Science’s OSF Preprints, an open source platform and infrastructure that enables the facilitation and discovery of scholarship. COS notes that submission and consumption of preprints continues to grow with "~150,000 preprints hosted across all of the current and prior preprint communities, and 1.7 million views on preprint pages since September 2023."

https://tinyurl.com/yn9nntvu

Advancing Ireland’s Open Repository Landscape: A Strategic Roadmap

This report presents an in-depth analysis and strategic roadmap for advancing the open repository landscape in Ireland. Drawing on comprehensive data from interviews, surveys, self-assessments, and both national and international initiatives, the document outlines the current status, challenges, and future prospects for open repositories in Ireland. Key findings highlight significant advancements in open access adherence. Despite these successes, persistent challenges such as metadata quality, resource limitations, and sustainability issues underscore the need for concerted effort and strategic planning.

The report proposes a forward-looking roadmap spanning from 2025 to 2030 and beyond, prioritising the enhancement of repository infrastructures, metadata quality improvement, open mandates promotion, technological advancements, capacity building, and fostering collaborative partnerships. This strategic vision aims to develop and encourage Ireland’s transition to open research, leveraging innovative practices and collaborative efforts to facilitate a more open, inclusive, and sustainable research environment. By addressing current limitations and embracing future opportunities, the roadmap sets the stage for a transformative shift in Ireland’s scholarly communication landscape, with potential significant impacts on researchers, institutions, and society at large.

https://doi.org/10.5281/zenodo.10810233

"Platformisation of Science: Conceptual Foundations"

The digital platforms we are dealing with in this article are auxiliary tools that do not produce anything themselves but provide an infrastructure for service providers and users to meet. They have potentially unlimited scaling potential and have become the central places of exchange. In academia, we can also observe that research and its communication become more digital and that digital services are aiming to become platforms. In this article we explore the concept of digital platforms and their potential impact on academic research, firstly addressing the question: To what extent can digital platforms be understood as a specific type of research infrastructure? We draw from recent literature on platforms and platformisation from different streams of scholarship and relate them to the science studies concept of research infrastructures, to eventually arrive at a framework for science platforms. Secondly, we aim to assess how science platforms may affect scholarly practice. Thirdly, we aim to assess to what extent science is platformised and how this interferes with scientific understandings of quality and autonomy. At the end of this article, we argue that the potential benefits of platform infrastructure for academic pursuits cannot be ignored, but the commercialization of the infrastructure for scholarly communication is a cause for concern. Ultimately, a nuanced and well-informed perspective on the impact of platformisation on academia is necessary to ensure that the academic community can maximize the benefits of digital infrastructures while mitigating negative consequences.

http://dx.doi.org/10.53377/lq.16693

"Enhancing the FAIRness of Arctic Research Data Through Semantic Annotation"

The National Science Foundation’s Arctic Data Center is the primary data repository for NSF-funded research conducted in the Arctic. There are major challenges in discovering and interpreting resources in a repository containing data as heterogeneous and interdisciplinary as those in the Arctic Data Center. This paper reports on advances in cyberinfrastructure at the Arctic Data Center that help address these issues by leveraging semantic technologies that enhance the repository’s adherence to the FAIR data principles and improve the Findability, Accessibility, Interoperability, and Reusability of digital resources in the repository. We describe the Arctic Data Center’s improvements. We use semantic annotation to bind metadata about Arctic data sets with concepts in web-accessible ontologies. The Arctic Data Center’s implementation of a semantic annotation mechanism is accompanied by the development of an extended search interface that increases the findability of data by allowing users to search for specific, broader, and narrower meanings of measurement descriptions, as well as through their potential synonyms. Based on research carried out by the DataONE project, we evaluated the potential impact of this approach, regarding the accessibility, interoperability, and reusability of measurement data. Arctic research often benefits from having additional data, typically from multiple, heterogeneous sources, that complement and extend the bases – spatially, temporally, or thematically – for understanding Arctic phenomena. These relevant data resources must be ‘found’, and ‘harmonized’ prior to integration and analysis. The findings of a case study indicated that the semantic annotation of measurement data enhances the capabilities of researchers to accomplish these tasks.

https://doi.org/10.5334/dsj-2024-002
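
The abstract above describes a search interface that matches specific, broader, and narrower meanings of a measurement description, along with synonyms. The toy sketch below shows one way such query expansion could work in principle. The mini-ontology, its concept names, and the functions `narrower` and `expand` are all hypothetical and are not drawn from the Arctic Data Center's implementation.

```python
# Hypothetical mini-ontology: each concept names its broader concept and synonyms.
ONTOLOGY = {
    "temperature": {"broader": None, "synonyms": ["temp"]},
    "sea surface temperature": {"broader": "temperature", "synonyms": ["SST"]},
    "air temperature": {"broader": "temperature", "synonyms": []},
}


def narrower(concept):
    """Return every concept that sits below `concept`, at any depth."""
    found = []
    for name, entry in ONTOLOGY.items():
        if entry["broader"] == concept:
            found.append(name)
            found.extend(narrower(name))
    return found


def expand(concept):
    """Return search terms: the concept, its narrower concepts, and all synonyms."""
    terms = []
    for name in [concept] + narrower(concept):
        terms.append(name)
        terms.extend(ONTOLOGY[name]["synonyms"])
    return terms


print(expand("temperature"))
```

Searching on the broad concept then also retrieves records annotated with any narrower concept or synonym, while a search on a narrow concept such as "air temperature" stays specific.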

"HERITRACE: Tracing Evolution and Bridging Data for Streamlined Curatorial Work in the GLAM Domain"

HERITRACE is a semantic data management system tailored for the GLAM sector. It is engineered to streamline data curation for non-technical users while also offering an efficient administrative interface for technical staff. The paper compares HERITRACE with other established platforms such as OmekaS, Semantic MediaWiki, Research Space, and CLEF, emphasizing its advantages in user friendliness, provenance management, change tracking, customization capabilities, and data integration. The system leverages SHACL for data modeling and employs the OpenCitations Data Model (OCDM) for provenance and change tracking, ensuring a harmonious blend of advanced technical features and user accessibility. Future developments include the integration of a robust authentication system and the expansion of data compatibility via the RDF Mapping Language (RML), enhancing HERITRACE’s utility in digital heritage management.

https://arxiv.org/abs/2402.00477