Digital Archives and Special Collections

"Capturing Captions: Using AI to Identify and Analyse Image Captions in a Large Dataset of Historical Book Illustrations"

This article outlines how AI methods can be used to identify image captions in a large dataset of digitised historical book illustrations. This dataset includes over a million images from 68,000 books published between the eighteenth and early twentieth centuries, covering works of literature, history, geography, and philosophy. The article has two primary objectives. First, it suggests the added value of captions in making digitized illustrations more searchable by picture content in online archives. To further this objective, we describe the methods we have used to identify captions, which can effectively be re-purposed and applied in different contexts. Second, we suggest how this research leads to new understandings of the semantics and significance of the captions of historical book illustrations. The findings discussed here mark a critical intervention in the fields of digital humanities, book history, and illustration studies.

https://tinyurl.com/bdvjespp

"AI and Medical Images: Addressing Ethical Challenges to Provide Responsible Access to Historical Medical Illustrations"

This article examines the ethical considerations and broader issues around access to digitised historical medical images. These illustrations and, later, photographs are often extremely sensitive, representing disability, disease, gender, and race in potentially harmful and problematic ways. In particular, the original metadata for such images can include demeaning and sometimes racist terms. Some of these images show sexually explicit and violent content, as well as content that was obtained without informed consent. Hiding these sensitive images can be tempting, and yet, archives are meant to be used, not locked away. Through a series of interviews with 10 archivists, librarians, and researchers based in the UK and US, the authors show that improved access to medical illustrations is essential to produce new knowledge in the humanities and medical research, as well as to bridge the gap between historical and modern understandings of the human body. Improving access to medical illustration can also help to address the "gender data gap", which has acquired mainstream visibility thanks to the work of activists such as Caroline Criado-Perez, the author of Invisible Women: Data Bias in a World Designed for Men.

https://tinyurl.com/3jek7ey4

Paywall: "Social Media Archiving in Practice: A Troubled Landscape in Review"

This article looks at several notable examples of major social media data loss that have already taken place in recent history and examines one of the largest institutional attempts to preserve social media thus far in the Library of Congress’ Twitter Archive. By exploring some of the key challenges that arose during this attempt which ultimately grounded the project, this article aims to better understand what continues to keep the practice of social media archiving at bay, and what large scale changes might be necessary to make any further progress in the field.

https://doi.org/10.1080/0361526X.2024.2367405

"Improving Accessibility of Digitization Outputs: EODOPEN Project Research Findings"

In this contribution, the authors present the results of a trial implementation among EODOPEN partners regarding their digitization workflows, used delivery file formats and the resulting quality of OCR results, depending on the type of digitization output file format. It was shown that partners using the OCR tool ABBYY FineReader Professional and producing scanning outputs in tagged PDF and PDF/UA formats achieved better results according to set criteria.

https://doi.org/10.1108/DLP-09-2023-0080

"The Puzzle of Large-Scale Digital Collections: Have We Reached an Inflection Point?"

Shared Collections allows institutions either to have JSTOR harvest their digital collections of documents, photos, and other special collections from a local Digital Asset Management System, or to create and share those same collections through JSTOR’s collection management tool. . . . While Shared Collections appears to represent a significant advance, the jury will be out for some time. The fundamental issues facing DPLA and Shared Collections are simply difficult, and the struggles with them have little or nothing to do with the skills or intentions of the capable people of both organizations. It is both a tough economic problem and an outcome of what we might call "rugged individualism in heritage collections": while shared descriptive efforts have been in place for books for more than a century, many standards for heritage collections have emerged since 2000. It’s a symptom of under-investment in cultural heritage in the United States.

https://rb.gy/597nkq

"Artificial Intelligence’s Role in Digitally Preserving Historic Archives"

The term "Artificial Intelligence" (AI) is increasingly permeating public consciousness as it has gained more popularity in recent years, especially within the landscape of academia and libraries. AI in libraries has been a trending subject of interest for some time, as within the library there are numerous departments that serve a role in collectively contributing to the library’s mission. Consequently, it is imperative to consider AI’s influence on the digital preservation of historic documents. This paper delves into the historical evolution of preservation methods driven by technological advancements as, throughout history, libraries, archives, and museums have grappled with the challenge of preserving historical collections, while many of the traditional preservation methods are costly and involve a lot of manual (human) effort. AI being the catalyst for transformation could change this reality and perhaps redefine the process of preservation; thus, this paper explores the emerging trend of incorporating AI technology into preservation practices and provides predictions regarding the transformative role of Artificial Intelligence in preservation for the future. With that in mind, this paper addresses the following questions: could AI be what changes or creates a paradigm shift in how preservation is done?; and could it be the thing that will change the way history is safeguarded?

https://doi.org/10.1515/pdtc-2023-0050

Paywall: "Changes in Digital Collections and Their Metadata: A Longitudinal Study of UIUC Digital Library"

This article showcases the evolution of digital collections and their metadata at the University of Illinois Urbana-Champaign (UIUC) Library in the last 20 years. It discusses the growth of its collections and their characteristics, examines historical changes in the use of metadata elements, and explores responses to the changing nature of digitized and born-digital materials. Based on a large-scale data analysis of the digital collections and their metadata housed in UIUC Digital Library, the paper also examines the challenges and opportunities of the curation and management of digital collections and digital libraries in the future.

https://doi.org/10.1080/19386389.2024.2338015

Paywall: "For the People: How We Make Online LAM [Library, Archive, and Museum] Collections More Democratized"

The article critiques the misconception that online collections democratize artifact information for public consumption and explores the ways in which LAM institutions fall short of living up to their democratic ideals when it comes to digital collections projects. Inspired by others with similar critiques, the authors discuss how LAM institutions can better fulfill the ideal of accessible and equitable access to their collections. The article emphasizes the importance of five areas of digital collections projects: system design, metadata practices, digitization selection and prioritization, labor, and user participation and engagement.

https://doi.org/10.1080/1941126X.2024.2306042

2024 Fedora Technology Assessment Report

The Fedora Program Team, in collaboration with the Technology Working Group, designed a project to understand the specific Fedora-related priorities of using institutions, along with the capacity and available resources of both individuals and institutions to contribute to the Fedora community between 2024 and 2026. They collaborated with the Research and Innovation Division at Lyrasis to survey Fedora users. Responses were collected between November 2023 and January 31, 2024, and analyzed by Leigh A. Grinstead, Senior Digital Services Consultant from Lyrasis, an independent, nonprofit, research group.

https://tinyurl.com/2s4b4rec

"Responsible AI at the Vanderbilt Television News Archive: A Case Study"

We provide an overview of the use of machine-learning and artificial intelligence at the Vanderbilt Television News Archive (VTNA). After surveying our major initiatives to date, which include the full transcription of the collection using a custom language model deployed on Amazon Web Services (AWS), we address some ethical considerations we encountered, including the possibility of staff downsizing and misidentification of individuals in news recordings.

https://doi.org/10.7191/jeslib.805

"Islandora for Archival Access and Discovery"

This article is a case study describing the implementation of Islandora 2 to create a public online portal for the discovery, access, and use of archives and special collections materials at the University of Nevada, Las Vegas. The authors will explain how the goal of providing users with a unified point of access across diverse data (including finding aids, digital objects, and agents) led to the selection of Islandora 2 and they will discuss the benefits and challenges of using this open source software. They will describe the various steps of implementation, including custom development, migration from CONTENTdm, integration with ArchivesSpace, and developing new skills and workflows to use Islandora most effectively. As hindsight always provides additional perspective, the case study will also offer reflection on lessons learned since the launch, insights on open-source repository sustainability, and priorities for future development.

https://journal.code4lib.org/articles/17929

Partial Paywall: The Nordic Model of Digital Archiving

Bringing together contributions from practitioners and academics to offer a range of international case studies, this book offers practical solutions for archivists in terms of governance, technologies and processes. It highlights and analyses the cornerstones of the Nordic model of archiving: reliance on standards; powerful regulatory instruments — especially in public sector archiving, including legislation; and collaboration between archivists and government agencies, and among different tiers of central and local government.

One of four open access chapters: "The Nordic Model of Digital Archiving."

https://doi.org/10.4324/9781003325406

"E-preservation of Old and Rare Books: A Structured Approach for Creating a Digital Collection"

Antique books, old and rare documents are fragile and vulnerable to different hazards. Preserving them for an extended period is a real challenge. From ancient times people started expressing their knowledge by writing and keeping records and subsequently started collecting and storing these at later ages as antique materials. These can be seen in different museums, libraries, archives, individual households, and other places all over the world. Preserving and conserving these antique, old and rare books, documents etc. in good condition is a challenge for librarians, conservators, preservation administrators or persons associated with storing these. In this paper, details of the digital preservation of such a collection available in the Directorate of Historical and Antiquarian Studies (DHAS), Guwahati, Assam, India, are discussed. DHAS is a Government of Assam wing and is mainly mandated to collect, preserve and research historical and antiquarian resources. The collection of DHAS is one of the oldest collections and has been serving as a study and research centre in Assam since 1928. A special drive has been taken for the digital preservation of an identified part of the collection, with grant support from the National Archive of India. This paper discusses the entire project process starting from the project proposal formulation to the structuring of the digital collection. The paper sequentially discusses the different steps of the entire work of digitization of a collection of 241 old and rare books from the main collection of DHAS.

http://www.ijdc.net/article/view/855

"Finding the Right Platform: A Crosswalk of Academy-Owned and Open-Source Digital Publishing Platforms"

A key responsibility for many library publishers is to collaborate with authors to determine the best mechanisms for sharing and publishing research. Librarians are often asked to assist with a wide range of research outputs and publication types, including eBooks, digital humanities (DH) projects, scholarly journals, archival and thematic collections, and community projects. These projects can exist on a variety of platforms both for profit and academy owned. Additionally, over the past decade, more and more academy owned platforms have been created to support both library publishing programs. Library publishers who wish to emphasize open access and open-source publishing can feel overwhelmed by the proliferation of available academy-owned or -affiliated publishing platforms. For many of these platforms, documentation exists but can be difficult to locate and interpret. While experienced users can usually find and evaluate the available resources for a particular platform, this kind of documentation is often less useful to authors and librarians who are just starting a new publishing project and want to determine if a given platform will work for them. Because of the challenges involved in identifying and evaluating the various platforms, we created this comparative crosswalk to help library publishers (and potentially authors) determine which platforms are right for their services and authors’ needs.

https://hcommons.org/deposits/item/hc:59231/

"Creating a Full Multitenant Back End User Experience in Omeka S with the Teams Module"

When Omeka S appeared as a beta release in 2016, it offered the opportunity for researchers or larger organizations to publish multiple Omeka sites from the same installation. Multisite functionality was and continues to be a major advance for what had become the premier platform for scholarly digital exhibits produced by libraries, museums, researchers, and students. However, while geared to larger institutional contexts, Omeka S poses some user experience challenges on the back end for larger organizations with numerous users creating different sites. These challenges include a "cluttered" effect for many users seeing resources they do not need to access and data integrity challenges due to the possibility of users editing resources that other users need in their current state. The University of Illinois Library, drawing on two local use cases as well as two additional external use cases, developed the Teams module to address these challenges. This article describes the needs leading to the decision to create the module, the project requirement gathering process, and the implementation and ongoing development of Teams. The module and findings are likely to be of interest to other institutions adopting Omeka S but also, more generally, to libraries seeking to contribute successfully to larger open-source initiatives.

https://journal.code4lib.org/articles/17389

"Data Sharing Implementation in Top 10 Ophthalmology Journals in 2021"

Background/Aims: Deidentified individual participant data (IPD) sharing has been implemented in the International Committee of Medical Journal Editors journals since 2017. However, there were some published clinical trials that did not follow the new implemented policy. This study examines the number of clinical trials that endorsed IPD sharing policy among top ophthalmology journals.

Method: All published original articles in 2021 in 10 highest-ranking ophthalmology journals according to the 2020 journal impact factor were included. Clinical trials were determined by the WHO definition of clinical trials. Each article was then thoroughly searched for the IPD sharing statement either in the manuscript or in the clinical trial registry. We collected the number of published clinical trials that implemented IPD sharing policy as our primary outcome.

Results: 1852 published articles in top 10 ophthalmology journals were identified, and 9.45% were clinical trials. Of these clinical trials, 44% had clinical trial registrations and 49.14% declared IPD sharing statements. Only 42 (48.83%) clinical trials were willing to share IPD, and 5 (10.21%) of these share IPD via an online repository platform. In terms of sharing period, 37 clinical trials were willing to share right after the publication and only 2 showed the ending of sharing period.

Conclusion: This report shows that the number of clinical trials in top ophthalmology journals that endorsed the IPD sharing policy and the number of registrations is lower than half even though the policy has been implemented for several years. Future updates are necessary as policy evolves.

http://dx.doi.org/10.1136/bmjophth-2023-001276

Paywall: "Human-AI Interaction for Exploratory Search & Recommender Systems with Application to Cultural Heritage"

This dissertation introduces three primary contributions through publicly deployed systems and datasets. First, we demonstrate how the construction of large-scale cultural heritage datasets using machine learning can answer interdisciplinary questions in library & information science and the humanities (Chapter 2). Second, based on the feedback of users of these cultural heritage datasets, we introduce open faceted search, an extension of faceted search that leverages human-AI interaction affordances to empower users to define their own facets in an open domain fashion (Chapter 3). Third, encountering similar challenges with the deluge of scientific papers, we explore the question of how to improve recommender systems through human-AI interaction and tackle the broad challenge of advice taking for opaque machine learners (Chapter 4).

https://tinyurl.com/yc59txc5

"The Cloud, the Public Square, and Digital Public Archival Infrastructure"

Many governments have chosen to store their records in the cloud rather than invest in the increased digital infrastructure now required to manage them. . . . Yet, archivists and archival perspectives have not been much involved in public discussion of this change. . . . The shape of the emerging infrastructure underpinning the management of digital communication may well be the most significant lasting feature of the digital environment for societies and their archives. This article discusses why that development requires archival voices in the public square to address it.

https://doi.org/10.1007/s10502-023-09417-7

WorldFAIR Project (D13.2) Cultural Heritage Image Sharing Recommendations Report

Deliverable 13.2 aims to build on our understanding of what it means to support FAIR in the sharing of image data derived from GLAM collections. This report looks at previous efforts by the sector towards FAIR alignment and presents 5 recommendations designed to be implemented and tested at the DRI that are also broadly applicable to the work of the GLAMs. The recommendations are ultimately a roadmap for the Digital Repository of Ireland (DRI) to follow in improving repository services, as well as a call for continued dialogue around "what is FAIR?" within the cultural heritage research data landscape.

https://doi.org/10.5281/zenodo.7897243

"The Viability of Using an Open Source Locally Hosted AI for Creating Metadata in Digital Image Collections"

Artificial intelligence (AI) can support metadata creation for images by generating descriptions, titles, and keywords for digital collections in libraries. Many AI options are available, ranging from cloud-based corporate software solutions, including Microsoft Azure Custom Vision and Google Cloud Vision, to open-source locally hosted software packages. This case study examines the feasibility of deploying the open-source, locally hosted AI software, Sheeko, and the accuracy of the descriptions generated for images using two of the pre-trained models. The study aims to ascertain if Sheeko’s AI would be a viable solution for producing metadata in the form of descriptions, or titles for digital collections in Libraries and Cultural Resources at the University of Calgary.

https://journal.code4lib.org/articles/17186

"Building a Large-Scale Digital Library Search Interface Using the Libraries Online Catalog"

The Kentucky Digital Newspaper Program (KDNP) was born out of the University of Kentucky Libraries’ (UKL) work in the National Digital Newspaper Program (NDNP) that began in 2005. In early 2021, a team of specialists at UKL from library systems, digital archives, and metadata management was formed to explore a new approach to searching this content by leveraging the power of the library services platform (Alma) and discovery system (Primo VE) licensed from Ex Libris. The result was the creation of a dedicated Primo VE search interface that would include KDNP content as well as all Kentucky newspapers held on microfilm in the UKL system. This article will describe the journey from the question of whether we could harness the power of Alma and Primo VE to display KDNP content, to the methodology used in creating a new dedicated search interface that can be replicated to create custom search interfaces of your own.

https://journal.code4lib.org/articles/17257

"The Smithsonian Puts 4.5 Million High-Res Images Online and Into the Public Domain, Making Them Free to Use"

"Anyone can download, reuse, and remix these images at any time — for free under the Creative Commons Zero (CC0) license," write My Modern Met’s Jessica Stewart and Madeleine Muzdakis. "A dive into the 3D records shows everything from CAD models of the Apollo 11 command module to Horatio Greenough’s 1840 sculpture of George Washington."

http://bit.ly/3KBhZsV

"University of Oregon and Oregon State University Collaborate to Launch Oregon Digital"

The University of Oregon and Oregon State University are proud to announce the launch of Oregon Digital, a cultural heritage repository that brings together more than 500,000 digitized works from both universities, including unique digitized and born-digital collections. This collaborative effort includes historic and modern photographs, manuscripts, publications, and more.

https://library.uoregon.edu/node/7904

| Research Data Publication and Citation Bibliography | Research Data Sharing and Reuse Bibliography | Research Data Curation and Management Bibliography | Digital Scholarship |

"Designing Digital Discovery and Access Systems for Archival Description"

Archival description is often misunderstood by librarians, administrators, and technologists in ways that have seriously hindered the development of access and discovery systems. It is not widely understood that there is currently no off-the-shelf system that provides discovery and access to digital materials using archival methods. This article is an overview of the core differences between archival and bibliographic description, and discusses how to design access systems for born-digital and digitized materials using the affordances of archival metadata. It offers a custom indexer as a working example that adds the full text of digital content to an Arclight instance and argues that the extensibility of archival description makes it a perfect match for automated description. Finally, it argues that building archives-first discovery systems allows us to use our descriptive labor more thoughtfully, better enable digitization on demand, and overall make a larger volume of cultural heritage materials available online.

bit.ly/3DhKmcC

A Preservationist’s Guide to the DMCA Exemption for Software Preservation, 2nd Edition

In late 2021, the Library of Congress adopted several exemptions to the Digital Millennium Copyright Act (DMCA) provision prohibiting circumvention of technological measures that control access to copyrighted works. In other words, they created a set of exceptions to the general legal rule against cracking digital locks on things like DVDs, software, and video games. The exemptions are set out in regulations published by the Copyright Office. They went into effect on October 28, 2021 and last until October 28th, 2024. This guide is intended to help preservationists determine whether their activities are protected by the new exemptions. It includes important updates to the first edition to reflect changes in the rule to allow offsite access to non-game software, along with a few other technical changes.

https://doi.org/10.5281/zenodo.7328908
