Digital Repositories – DigitalKoans

“Making Your Repository (More) Accessible”

Introduction: As colleges and universities make increasing and overdue efforts under the auspices of access, equity, and inclusion to make their resources accessible to all users, these efforts must extend to the institution’s online presence, including its institutional repository. IR managers must first ask what “accessible” means for compliance with university policies as well as the Americans with Disabilities Act (ADA), immediately followed by plans for both remediating existing content and imposing best practices on new content, amid current workflows and budgetary restraints.

Literature Review: Literature on the topic of accessibility in IRs has mostly focused on the need to make collections accessible and the challenges for doing so. Advice on how to navigate the actual process is harder to come by.

Description of Service: The University of Mississippi established a goal that everything going into its IR would use OCR software to convert images of text into searchable text and create a process by which patrons could request remediation of older content from the IR, whether documents or recordings. A combination of shared tools (including Equidox and SensusAccess) and interdepartmental partnerships has made a significant difference in making these digital collections proactively accessible.

Next Steps: We continue to maintain partnerships with units around campus, made challenging by frequent turnover as in-demand specialists take positions at other institutions. Despite our efforts to provide searchable text as a minimum level of service, OCR correction provides tags but not necessarily headings or alt-text. Hopefully, future versions of OCR editors will include such features.

https://doi.org/10.31274/jlsc.18308

“openRxiv Launch to Sustain and Expand Preprint Sharing in Life and Health Sciences”

Since their launches in 2013 and 2019, respectively, preprint servers bioRxiv and medRxiv have transformed how scientific findings are communicated. They have hosted more than 325,000 reports of new discoveries, enabling scientists worldwide to collaborate, iterate, and build upon each other’s work at an unprecedented pace. . . .

Establishing openRxiv aims to accelerate the value of these preprint servers, making it easier for these resources to grow and adapt. Created as services of Cold Spring Harbor Laboratory in partnership with other institutions, bioRxiv and medRxiv now move under openRxiv’s researcher-driven governance, ensuring that preprint sharing remains independent, sustainable, and responsive to researchers’ evolving needs.

https://tinyurl.com/2auerw5t

“DeepGreen—A Data Hub for the Distribution of Scholarly Articles From Publishers to Open Access Repositories in Germany”

DeepGreen is an automated delivery service for open access articles. Originally conceived to take advantage of the so-called open access component—a secondary publication right in Alliance and National licences in Germany to promote green open access—it aims to streamline open access processes by automating the distribution of full-text articles and metadata from publishers to repositories.

The service, developed by a consortium and funded by the German Research Foundation (DFG) in its initial phase, has successfully established itself as a national service, facilitating open access content distribution and contributing to Germany’s open access infrastructure.

As of December 2024, DeepGreen distributes articles from 14 publishers to 84 institutional repositories and 6 subject-specific repositories.

This article describes the role of the DeepGreen service in Germany, its collaboration with publishers and the potential of automated processes for storing articles in open access repositories, which, as publicly owned institutional infrastructures, ensure sustainable access and provide secure, redundant storage.

https://doi.org/10.1002/leap.70000

“Towards the Interoperability of Scholarly Repository Registries”

The enactment of Open Science relies on scholarly repositories that make research products findable and accessible, while scholarly repository registries maintain authoritative metadata and persistent identifiers (PIDs) to help researchers and infrastructure providers discover and access needed repositories. However, the proliferation of repositories targeting different research products (e.g., publications, data, and software) or serving specific disciplines has led to the creation of multiple registries whose scope is not mutually exclusive. . . . While favouring the existence of a plurality of registries, this paper advocates for their interoperability, which is essential to eliminate the aforementioned barriers and enable their full, unambiguous utilisation. We analyse the data models of four prominent registries—FAIRsharing, re3data, OpenDOAR, and ROAR—and classify their properties and overlap. We provide a crosswalk between their data models and suggest a common data model shared across the examined registries to pave the way toward interoperability. As a means of validation, we include a coverage evaluation of the proposed data model. The paper adopts a pragmatic approach towards scholarly registry interoperability and suggests a common metadata model to foster the exchange of information across these platforms.

https://doi.org/10.1007/s00799-025-00414-y
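
To make the idea of a crosswalk and a coverage evaluation concrete, here is a minimal sketch in Python. It is an illustration only, not the paper's model: the registry names (registry_a, registry_b), every property name, and the list of common fields are hypothetical placeholders and are not taken from FAIRsharing, re3data, OpenDOAR, or ROAR.

```python
# Hypothetical crosswalk: each registry's own property name -> a shared field name.
# None of these property names come from the real registries' data models.
CROSSWALK = {
    "registry_a": {"repositoryName": "name", "repositoryURL": "url", "subject": "subjects"},
    "registry_b": {"title": "name", "homepage": "url", "content_subjects": "subjects"},
}

# Hypothetical common data model shared across the registries.
COMMON_FIELDS = ("name", "url", "subjects", "identifier")


def to_common(registry, record):
    """Rename a registry-specific record's keys to the common field names.

    Properties with no entry in the crosswalk are dropped.
    """
    mapping = CROSSWALK[registry]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def coverage(registry):
    """Fraction of the common fields that the registry's crosswalk can populate."""
    mapped = set(CROSSWALK[registry].values())
    return len(mapped & set(COMMON_FIELDS)) / len(COMMON_FIELDS)
```

Keeping the mapping as plain data means a new registry can be added without changing the conversion code, and the coverage figure shows at a glance which common fields a registry cannot supply.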

“Open But Hidden: Open Access Gaps in the National Science Foundation Public Access Repository”

Introduction: In 2022, the U.S. government released new guidelines for making publicly funded research open and available. For the National Science Foundation (NSF), these policies reinforce requirements in place since 2016 for supported research to be submitted to the Public Access Repository (PAR).

Methods: To evaluate the public access compliance of research articles submitted to the NSF-PAR, this study searched for NSF-PAR records published between 2017 and 2021 from two research intensive institutions. Records were reviewed to determine whether the PAR held a deposited copy, as required by the 2016 policies, or provided a link out to publisher-held version(s).

Results: A total of 841 unique records were identified, all with publicly accessible versions. Yet only 42% had a deposited PDF version available in the repository as required by the NSF 2016 Public Access Policy. The remaining 58% directed instead to publisher-held versions. In total, only 55% of record links labeled “Full Text Available” directed users to a publicly accessible version with a single click.

Discussion: Records within PAR do not clearly direct users to the publicly accessible full text. In almost half of records, the most prominently displayed link directed users to a paywall version, even when a publicly available version existed. Records accessible only through the CHORUS (Clearing House for the Open Research of the United States) initiative were further obscured by requiring specialized navigation of publisher-owned sites.

Conclusion: Despite having a repository mandate since 2016, NSF compliance rates remain low. Additional support and/or oversight is needed to address the additional requirements introduced under the 2022 memo.

https://doi.org/10.31274/jlsc.17767

“Moving Open Repositories out of the Blind Spot of Initiatives to Correct the Scholarly Record”

Open repositories were created to enhance access and visibility of scholarly publications, driven by open science ideals emphasising transparency and accessibility. However, they lack mechanisms to update the status of corrected or retracted publications, posing a threat to the integrity of the scholarly record. To explore the scope of the problem, a manually verified corpus was examined: we extracted all the entries in the Crossref × Retraction Watch database for which the publication date of the corrected or retracted document ranged from 2013 to 2023. This corresponded to 24,430 entries with a DOI, which we use to query Unpaywall and identify their possible indexing in HAL, an open repository (second largest institutional repository worldwide). In most cases (91%), HAL does not mention corrections. While the study needs broader scope, it highlights the necessity of improving the role of open repositories in correction processes with better curation practices. We discuss how harvesting operations and the interoperability of platforms can maintain the integrity of the entire scholarly record. Not only will the open repositories avoid damaging its reliability through ambiguous reporting, but on the contrary, they will also strengthen it.

https://doi.org/10.1002/leap.1655
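
The lookup step described in this abstract (use a DOI to query Unpaywall, then check whether any open access copy sits in HAL) can be sketched roughly as follows. This is an assumption-laden illustration, not the authors' code: it works on already-downloaded Unpaywall-style records, relies on the `oa_locations` list and its `url` field, and assumes HAL copies can be recognised by the host names listed in `HAL_HOSTS`.

```python
from urllib.parse import urlparse

# Assumption: HAL copies are served from these hosts or their subdomains.
HAL_HOSTS = ("hal.science", "archives-ouvertes.fr")


def is_hal_url(url):
    """True if the URL's host is one of HAL_HOSTS or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in HAL_HOSTS)


def hal_locations(record):
    """Return the URLs of OA locations in an Unpaywall-style record that point at HAL."""
    locations = record.get("oa_locations") or []
    return [loc["url"] for loc in locations if loc.get("url") and is_hal_url(loc["url"])]


def split_by_hal_presence(records):
    """Partition record DOIs into those with a HAL copy and those without."""
    in_hal, not_in_hal = [], []
    for record in records:
        target = in_hal if hal_locations(record) else not_in_hal
        target.append(record["doi"])
    return in_hal, not_in_hal
```

Checking whether each HAL record actually mentions the correction or retraction would be a separate step that this sketch does not attempt.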

"Enabling Preprint Discovery, Evaluation, and Analysis with Europe PMC"

Preprints provide an indispensable tool for rapid and open communication of early research findings. Preprints can also be revised and improved based on scientific commentary uncoupled from journal-organised peer review. The uptake of preprints in the life sciences has increased significantly in recent years, especially during the COVID-19 pandemic, when immediate access to research findings became crucial to address the global health emergency. With ongoing expansion of new preprint servers, improving discoverability of preprints is a necessary step to facilitate wider sharing of the science reported in preprints. To address the challenges of preprint visibility and reuse, Europe PMC, an open database of life science literature, began indexing preprint abstracts and metadata from several platforms in July 2018. Since then, Europe PMC has continued to increase coverage through addition of new servers, and expanded its preprint initiative to include the full text of preprints related to COVID-19 in July 2020 and then the full text of preprints supported by the Europe PMC funder consortium in April 2022. The preprint collection can be searched via the website and programmatically, with abstracts and the open access full text of COVID-19 and Europe PMC funder preprint subsets available for bulk download in a standard machine-readable JATS XML format. This enables automated information extraction for large-scale analyses of the preprint corpus, accelerating scientific research of the preprint literature itself. This publication describes steps taken to build trust, improve discoverability, and support reuse of life science preprints in Europe PMC. Here we discuss the benefits of indexing preprints alongside peer-reviewed publications, and challenges associated with this process.

https://doi.org/10.1371/journal.pone.0303005
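
As a rough illustration of the programmatic search this abstract mentions, the sketch below builds a request URL for the Europe PMC REST search service, restricted to preprints. The endpoint, the `SRC:PPR` source filter, and the `format` and `pageSize` parameters reflect my understanding of the public API and should be checked against the current Europe PMC documentation. The helper names are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Europe PMC REST search service.
SEARCH_ENDPOINT = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def preprint_search_url(terms, page_size=25):
    """Build a search URL limited to preprints (assumed source code: PPR)."""
    query = f"SRC:PPR AND ({terms})"
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def fetch(url):
    """Download and parse a JSON response. Requires network access."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)
```

A call such as `fetch(preprint_search_url("long covid", page_size=10))` would return the parsed response. The sketch leaves the response structure uninterpreted because that layout is not described here.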

"Knowledge Infrastructures are Growing Up: The Case for Institutional (Data) Repositories 10 Years After the Holdren Memo"

Institutional data repositories are uniquely positioned to support researchers in sharing scholarly outputs. As funding agencies develop and institute policies for research data access and sharing, institutional data repositories have emerged as a critical feature in ecosystems for data stewardship and sharing. We show that institutional data repositories can meet and exceed the requirements and recommendations of federal data policy, thereby maximizing the benefits of data sharing. We present results of a mixed-method study which explores the adoption and usage of institutional repositories to share data from 2017 to 2023. Data from two previous studies were combined with data collected in 2023 on the data sharing solutions of Association of Research Libraries member institutions in the United States and Canada. The analysis of the aggregated data indicates that data stewardship has increased in both institutional repositories and institutional data repositories with an increase in complementary infrastructure to support data sharing. We then conduct an “infrastructural inversion” (Bowker & Star, 1999) to ‘surface invisible work’ of making data repositories function well, and demonstrate that institutional data repositories have advantages for providing sustainable stewardship, curation, and sharing of research data. Finally, we show that institutional data repositories may produce additional benefits through established infrastructure, local interoperability, and control.

https://doi.org/10.5334/dsj-2024-046

"’Does It Feel like a Scientific Paper?’: A Qualitative Analysis of Preprint Servers’ Moderation and Quality Assurance Processes"

In recent years, preprints—i.e., scholarly manuscripts that have not been peer reviewed or published in a journal—have emerged as a major source of research communication and a critical component of open science. However, concerns have been raised about preprints’ potential to facilitate the spread of flawed or misleading research due to the lack of quality control performed by preprint servers. Yet, there is limited knowledge of how servers currently vet incoming content and how this impacts the openness and diversity of scholarly content. In this paper, we examine preprint servers’ moderation processes, the intentions underpinning them, and their potential effects through a qualitative analysis of in-depth interviews with 14 key preprint server personnel. We find a wide range of moderation processes, which vary depending on specific server contexts and needs and are motivated by a desire to prevent the spread of misinformation and protect trust in preprints and servers. Participants repeatedly emphasized the difference between their moderation processes and peer review, but in practice often applied similar criteria for delineating scientific from unscientific content. Moreover, moderation processes often relied on trust cues, such as article formats or author affiliations, as proxies for research quality, potentially introducing similar biases as have been found in traditional journal peer review. We discuss implications for the diversity of preprint content and authors, as well as the future of preprint servers within an evolving scholarly communication ecosystem.

https://doi.org/10.31222/osf.io/mp6ky

"Interview: Deciphering the Law: Hachette v. Internet Archive Pt. 1 (2023) with Dave Hansen"

This is the first in a series of interviews with those closely tied to the Hachette v. Internet Archive lawsuit. In March 2023, the court ruled against the Internet Archive and its use of the Emergency Lending Library, causing a ripple throughout the library and education fields. Below, find the answers to some of the questions that the case elicited by JCEL contributors and copyright scholars Dave Hansen, Michelle Wu, and Kyle Courtney.

https://doi.org/10.17161/jcel.v7i2.21337

"Closing Gaps: A Model of Cumulative Curation and Preservation Levels for Trustworthy Digital Repositories"

Curation and preservation measures carried out by digital repository staff are an important building block in maintaining the accessibility and usability of digital resources over time. The measures adequate to achieve long-term usability for a given audience strongly depend on scenarios of (re)use, the (intended) users’ needs and skills, the organisational setting (e.g., mission, resources, policies), as well as the characteristics of the digital objects to be preserved. The assessment of curation and preservation measures also forms an important part of existing certification procedures for trustworthy digital repositories (TDRs) as offered, for example, by the CoreTrustSeal foundation, the nestor network, or ISO.

The digital curation community is presented with the challenge of finding community-, organisation-, and object-specific approaches to curation and preservation at the same time as defining the minimum level of curation and preservation measures expected from a TDR in sufficiently generic terms to ensure applicability to a wide array of repositories. Against this backdrop, this paper discusses the need for and benefits of community-agreed levels of curation and preservation to address this challenge, and considers the tiered model proposed by the CoreTrustSeal Board as an example.

The proposed model is then applied in an analysis of successful CoreTrustSeal applications from 2018–2022 in an effort to better understand the capacity of the curation and preservation levels to capture the respective practices of repositories and to identify potential gaps.

https://doi.org/10.2218/ijdc.v18i1.926

"Transparent Disclosure, Curation & Preservation of Dynamic Digital Resources"

This paper explores an enhanced curation lifecycle being developed at the UK Data Service (UKDS), with our Data Product Builder. Through a Graphical User Interface, we aim to provide the researcher with a tailored digital resource. We detail the threefold motivation behind this initiative: data dissemination scalability, researcher satisfaction and the reduction of nationwide duplication of research effort.

Subsequent sections detail the technical components and challenges involved. In addition to more standard data subsetting, filtering and linking components, this data dissemination platform offers dynamic disclosure assessments – identifying combinations of variables that present a potential disclosure risk. All components are underpinned by the Data Documentation Initiative’s new Cross-Domain Integration standard (DDI-CDI), designed to handle the many structures in which data may be organised.

Ever conscious of the scale of the task we are embarking on, we remain motivated by the need for such advances in data dissemination and optimistic of the feasibility of such a system to meet the needs of the researcher while balancing the data disclosivity concerns of the data depositor.

https://doi.org/10.2218/ijdc.v18i1.937

"Toward Enhanced Reusability: A Comparative Analysis of Metadata for Machine Learning Objects and Their Characteristics in Generalist and Specialist Repositories"

Objective: The rapidly increasing prevalence and application of machine learning (ML) across disciplines creates a pressing need to establish guidance for data curation professionals. However, we must first understand the characteristics of ML-related objects shared in generalist and specialist repositories and the extent to which repository metadata fields enable findability and reuse of ML objects.

Methods: We used a combination of API queries and web scraping to retrieve metadata for ML objects in eight commonly used generalist and ML-specific data repositories. We assessed both metadata schema and characteristics of deposited ML objects, within the context of the widely adopted FAIR Principles. We also calculated summary statistics for properties of objects, including number of objects per year, dataset size, domains represented, and availability of related resources.

Results: Generalist repositories excelled at providing provenance metadata, specifically unique identifiers, unambiguous citations, clear licenses, and related resources, while specialist repositories emphasized ML-specific descriptive metadata, such as number of attributes and instances and task type. In terms of object content, we noted a wide range of file formats, as well as licenses, all of which impact reusability.

Conclusions: Generalist repositories will benefit from some of the practices adopted by specialists, and specialist repositories will benefit from adopting proven data curation practices of generalist repositories. A step forward for repositories will be to invest more into use of labels and persistent identifiers to improve workflow documentation, provenance, and related resource linking of ML objects, which will increase their findability, interoperability, and reusability.

https://doi.org/10.7191/jeslib.685

Paywall: "Constructing Risk in Trustworthy Digital Repositories"

This article investigates the construction of risk within trustworthy digital repository audits. It contends that risk is a social construct, and social factors influence how stakeholders in digital preservation processes comprehend and react to risk.

https://doi.org/10.1108/JD-08-2023-0157

Paywall: "Data Quality Assurance Practices in Research Data Repositories — A Systematic Literature Review"

This study conducted a systematic analysis of data quality assurance (DQA) practices in RDRs, guided by activity theory and data quality literature, resulting in conceptualizing a data quality assurance model (DQAM) for RDRs. DQAM outlines a DQA process comprising evaluation, intervention, and communication activities and categorizes 17 quality dimensions into intrinsic and product-level data quality. It also details specific improvement actions for data products and identifies the essential roles, skills, standards, and tools for DQA in RDRs. By comparing DQAM with existing DQA models, the study highlights its potential to improve these models by adding a specific DQA activity structure.

https://doi.org/10.1002/asi.24948

"Internet Archive Forced to Remove 500,000 Books after Publishers’ Court Win"

As a result of book publishers successfully suing the Internet Archive (IA) last year, the free online library that strives to keep growing online access to books recently shrank by about 500,000 titles. . . .

To restore access, IA is now appealing, hoping to reverse the prior court’s decision by convincing the US Court of Appeals in the Second Circuit that IA’s controlled digital lending of its physical books should be considered fair use under copyright law. An April court filing shows that IA intends to argue that the publishers have no evidence that the e-book market has been harmed by the open library’s lending, and copyright law is better served by allowing IA’s lending than by preventing it. . . .

Freeland [Chris Freeland, IA’s director of library service] told Ars it could take months or even more than a year before a decision is reached in the case.

While IA fights to end the injunction, its other library services continue growing, IA has said. IA "may still digitize books for preservation purposes" and "provide access to our digital collections" through interlibrary loan and other means. IA can also continue lending out-of-print and public domain books.

https://tinyurl.com/47aws7z7

"Analyzing Research Data Repositories (RDR) from BRICS Nations: A Comprehensive Study"

As of March 2, 2024, re3data.org indexes a total of 3,192 Research Data Repositories (RDRs) worldwide, with BRICS nations contributing 195. China leads among BRICS nations, followed by India, Russia, and Brazil. . . . "House, tailor-made" software is widely used for creating RDRs, followed by Dataverse and DSpace. . . . Most repositories are disciplinary, followed by institutional ones. Most repositories specify data upload types, with "restricted" being the most common, followed by closed types. Open access is predominant in data access, followed by restricted access and embargo periods, while a small number restrict access entirely.

https://doi.org/10.1108/LM-04-2024-0040

"Biomedical Data Repository Concepts and Management Principles"

The demand for open data and open science is on the rise, fueled by expectations from the scientific community, calls to increase transparency and reproducibility in research findings, and developments such as the Final Data Management and Sharing Policy from the U.S. National Institutes of Health and a memorandum on increasing public access to federally funded research, issued by the U.S. Office of Science and Technology Policy. This paper explores the pivotal role of data repositories in biomedical research and open science, emphasizing their importance in managing, preserving, and sharing research data. Our objective is to familiarize readers with the functions of data repositories, set expectations for their services, and provide an overview of methods to evaluate their capabilities. The paper serves to introduce fundamental concepts and community-based guiding principles and aims to equip researchers, repository operators, funders, and policymakers with the knowledge to select appropriate repositories for their data management and sharing needs and foster a foundation for the open sharing and preservation of research data.

https://doi.org/10.1038/s41597-024-03449-z

"Understanding the Value of Curation: A Survey of US Data Repository Curation Practices and Perceptions"

Data curators play an important role in assessing data quality and take actions that may ultimately lead to better, more valuable data products. This study explores the curation practices of data curators working within US-based data repositories. We performed a survey in January 2021 to benchmark the levels of curation performed by repositories and assess the perceived value and impact of curation on the data sharing process. Our analysis included 95 responses from 59 unique data repositories. Respondents primarily were professionals working within repositories and examined curation performed within a repository setting. A majority (72.6%) of respondents reported that "data-level" curation was performed by their repository, and around half reported their repository took steps to ensure interoperability and reproducibility of their repository’s datasets. Curation actions most frequently reported include checking for duplicate files, reviewing documentation, reviewing metadata, minting persistent identifiers, and checking for corrupt/broken files. The most "value-add" curation action across generalist, institutional, and disciplinary repository respondents was related to reviewing and enhancing documentation. Respondents reported high perceived impact of curation by their repositories on specific data sharing outcomes including usability, findability, understandability, and accessibility of deposited datasets; respondents associated with disciplinary repositories tended to perceive higher impact on most outcomes. Most survey participants strongly agreed that data curation by the repository adds value to the data sharing process and that it outweighs the effort and cost. We found some differences between institutional and disciplinary repositories, both in the reported frequency of specific curation actions as well as the perceived impact of data curation. Interestingly, we also found variation in the perceptions of those working within the same repository regarding the level and frequency of curation actions performed, which exemplifies the complexity of repository curation work. Our results suggest data curation may be better understood in terms of specific curation actions and outcomes than broadly defined curation levels and that more research is needed to understand the resource implications of performing these activities. We share these results to provide a more nuanced view of curation, and how curation impacts the broader data lifecycle and data sharing behaviors.

https://doi.org/10.1371/journal.pone.0301171

"The Puzzle of Large-Scale Digital Collections: Have We Reached an Inflection Point?"

Shared Collections allows institutions either to have JSTOR harvest their digital collections of documents, photos, and other special collections from a local Digital Asset Management System, or to create and share those same collections through JSTOR’s collection management tool. . . . While Shared Collections appears to represent a significant advance, the jury will be out for some time. The fundamental issues facing DPLA and Shared Collections are simply difficult, and the struggles with them have little or nothing to do with the skills or intentions of the capable people of both organizations. It is both a tough economic problem and an outcome of what we might call "rugged individualism in heritage collections": while shared descriptive efforts have been in place for books for more than a century, many standards for heritage collections have emerged since 2000. It’s a symptom of under-investment in cultural heritage in the United States.

https://rb.gy/597nkq

"A Census of Institutional Repositories at Regional Public Universities"

This study reports on the implementation of institutional repositories (IRs) at regional public universities (RPUs) in the United States and its territories. The author investigated repository platform choice, operation style, and content. More than half of RPUs have implemented an IR. The author discusses how these findings align with trends in previous research and explores the unique aspects of IRs at RPUs—particularly the prevalence of student works and special collections materials. For over two decades, institutional repositories (IRs) have been used at institutions of higher education to collect, preserve, and share the scholarly works of an institution. During that same time there have been an increasing number of studies looking at who has implemented an IR, the most popular IR platforms, and type and number of objects deposited in IRs. While some studies have looked at small or teaching-focused institutions, most of these studies have focused on IR implementations at large research-focused institutions.

https://tinyurl.com/yc2fs4r2

"Developing Open Access Resource Management Principles in a Consortial Environment: A University of California Model"

In the summer of 2021, the University of California (UC) migrated to a new integrated library system, called the Systemwide Integrated Library System project (SILS), which for the first time brought all ten UC campuses, two regional storage facilities, and the California Digital Library (CDL) together into one shared library system. With new potential for increased collaboration and cooperation, SILS leadership groups identified consortial open access (OA) resource management as a key opportunity in the new system, in alignment with UC’s priorities around discovery and access to library collections, as well as UC’s commitment to open access and transforming the scholarly communication landscape. This article discusses the formation of the UC Open Access Resource Management Task Force (OARMTF), a group charged to investigate what it would mean to consortially manage OA resources. Specifically, this article focuses on the OARMTF’s work setting out principles for OA resource management, which the authors hope may serve as a useful case study for other institutions or consortia interested in developing principles around OA resource management, as well as encourage more discussion and research into best practices for consortial management of OA resources.

https://doi.org/10.5860/lrts.68n1.8216

"Opening Up: A Global Context for Local Open Access Initiatives in Higher Education"

Open access policies and mandates can be a useful tool in persuading faculty at higher education institutions around the globe to produce and share open scholarship. But are such policies widely written, accepted, and adopted? Leveraging information found on the Registry of Open Access Repositories Mandatory Archiving Policies, this paper analyzes open access policies at higher education institutions worldwide. The data indicate that Europe holds the most policies, while fewer policies have been enacted in the Americas, Africa, Oceania, and Asia due to a myriad of barriers. Overall, better strategies to promote open access are needed, and such strategies may not necessarily take the form of an open access policy. My own investigation of global open access policies has informed my practices with respect to open access. In this paper, I demonstrate how librarians acting as policy entrepreneurs can assist with the promotion of open access at their institutions and then conclude with suggestions, solutions, and pathways beyond policy adoption to promote and advocate for open access.

https://tinyurl.com/2h3uz5n4

"Preprints, Journals and Openness: Disentangling Goals and Incentives"

I would argue that private funders such as the Gates Foundation or the Howard Hughes Medical Institute (HHMI) could provide material support through grants and policies for quality peer review, baking peer review into selection of grantees. Such an approach will require careful structures and mechanisms for reviewer selection, and measures of success, or we may run the risk of creating further inequities. Mind you, in many fields it is just hard to find good reviewers prepared to put in the effort required for a considered, thoughtful review. Societies, such as my own, could also consider material ways to support peer review more actively — a philosophical and practical approach to raising the profile of peer review at an early stage in the life of a researcher.

https://tinyurl.com/ymckyb9x

1 Million Images and Counting: "AI-Startup Launches Ever-Expanding Library of Free Stock Photos and Music"

StockCake is a new platform by AI startup Imaginary Machines. The site currently hosts more than a million pre-generated images. These images can be downloaded, used, and shared for free. There are no strings attached as all photos are in the public domain.

https://tinyurl.com/mvjd3683
