“Sustainable Speed: Project Management and Productive Capacity in Projects Using AI”


[The article] examines the use of a handwritten transcription tool to generate full-text transcripts for a wide-ranging digitization project focused on slavery and the lives of enslaved people in the Colonial and Antebellum periods of the United States. It then reviews the challenges, rewards, and implications of incorporating tools like this from a project management perspective.

https://doi.org/10.1080/1941126X.2025.2580900

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

“Internet Archive’s Legal Fights Are Over, but Its Founder Mourns What Was Lost”


Kahle suggested that IA’s legal battles weren’t with creators or publishers so much as with large media companies that he thinks aren’t “satisfied with the restriction you get from copyright.”

“They want that and more,” Kahle said, pointing to e-book licenses that expire as proof that libraries increasingly aren’t allowed to own their collections. He also suspects that such companies wanted the Wayback Machine dead—but the Wayback Machine has survived and proved itself to be a unique and useful resource.

https://tinyurl.com/3dyeyu5y


“Extracting A Large Corpus from the Internet Archive, A Case Study”


The Internet Archive was founded on May 10, 1996, in San Francisco, CA. Since its inception, the archive has amassed an enormous corpus of content, including over 866 billion web pages, more than 42.5 million print materials, 13 million videos, and 14 million audio files. It is relatively easy to upload content to the Internet Archive. It is also easy to download individual objects by visiting their pages and clicking on specific links. However, downloading a large collection, such as thousands or even tens of thousands of items, is not as easy. This article outlines how The University of Kentucky Libraries downloaded over 86 thousand previously uploaded newspaper issues from the Internet Archive for local use. The process leveraged ChatGPT to automate the process of generating Python scripts that accessed the Internet Archive via its API (Application Programming Interface).
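The abstract does not say which Internet Archive endpoints the Kentucky scripts called, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes the public `advancedsearch.php` search endpoint and the `/download/` URL pattern; the collection name, file name, and the helper names `build_search_url`, `build_download_url`, and `list_identifiers` are hypothetical.

```python
import json
import time
from urllib.parse import quote, urlencode
from urllib.request import urlopen

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"
DOWNLOAD_ROOT = "https://archive.org/download"


def build_search_url(collection: str, rows: int = 100, page: int = 1) -> str:
    """Build a search URL that asks for item identifiers in one collection."""
    params = [
        ("q", f"collection:{collection}"),  # hypothetical collection name
        ("fl[]", "identifier"),             # return only the identifier field
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def build_download_url(identifier: str, filename: str) -> str:
    """Build the direct download URL for one file inside one item."""
    return f"{DOWNLOAD_ROOT}/{quote(identifier)}/{quote(filename)}"


def list_identifiers(collection: str, rows: int = 100, pause: float = 1.0):
    """Yield identifiers page by page, pausing between requests.

    This function performs network requests; it stops at the first empty page.
    """
    page = 1
    while True:
        with urlopen(build_search_url(collection, rows, page)) as response:
            docs = json.load(response)["response"]["docs"]
        if not docs:
            return
        for doc in docs:
            yield doc["identifier"]
        page += 1
        time.sleep(pause)  # be polite to the service between pages
```

The URL builders are kept separate from the network loop so they can be checked offline. For collections in the tens of thousands of items, paging through this endpoint may hit result limits, so a different endpoint or client library may be needed.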

https://tinyurl.com/42nf4u7x


“Mitigating Aggressive Crawler Traffic in the Age of Generative AI: A Collaborative Approach from the University of North Carolina at Chapel Hill Libraries”


The rise of aggressive, adaptive, and evasive web crawlers is a significant challenge for libraries and archives, causing service disruptions and overwhelming institutional resources. This article details the experiences of the University of North Carolina at Chapel Hill University Libraries in combating an unprecedented flood of crawler traffic. It describes the escalating mitigation efforts, from traditional client blocking to the implementation of more advanced techniques such as request throttling, regional traffic prioritization, novel facet-based bot detection, commercial Web Application Firewalls (WAFs), and ultimately, in-browser client verification with Cloudflare Turnstile. The article highlights the adaptive nature of these crawlers, the limitations of isolated institutional responses, and the critical lessons learned from mitigation efforts, including the issues introduced by residential proxy networks and the extreme scale of the traffic. Our experiences demonstrate the effectiveness of a multi-layered defense strategy that includes both commercial and library-specific solutions, such as facet-based bot detection. The article emphasizes the importance of community-wide collaboration, proposing future directions such as formalized knowledge sharing and the ongoing development of best practices to collectively address this evolving threat to open access and the stability of digital library services.

https://shorturl.at/7AP3I


Paywall: “Tracing the Past, Predicting the Future: A Systematic Review of AI in Archival Science”


Our study highlights how integrating artificial intelligence (AI) into archival science can help address these issues. We begin with a thorough analysis of 45 papers published between 2011 and 2023 that met our predetermined criteria. . . . We investigated the key AI techniques and their applications in archives and records management functions. Our findings highlight key AI-driven strategies that promise to streamline recordkeeping processes and improve data retrieval in the immediate future.

https://doi.org/10.1002/pra2.1286


Paywall: “Tracing the Research Trends in the Application of Digital Technology in Cultural Heritage: A Bibliometric Analysis Integrating Large Language Models”


This study identifies research trends, contributors, and collaborations in digital technologies applied to cultural heritage through a bibliometric and topic modelling analysis of 1,153 articles from the Web of Science database. The analysis revealed that these documents commenced in the year 2000, which can be attributed to the influence of the advent of digital technologies. . . . The analysis reveals that research predominantly originates from developed regions, identifies 15 key topics, and highlights five emerging areas with significant growth from 2000 to 2024. The findings provide actionable insights for advancing the application of digital technologies in cultural heritage research and identifying future priorities.

https://doi.org/10.1177/01655515251373084


“Finding Aids Unleashed: Iterative Development of a Portable Publication System”


New York University Libraries recently completed a redesign for their finding aids publishing service to replace an outdated XSLT stylesheet publishing method. The primary design goals focused on accessibility and usability for patrons, including improving the presentation of digital archival objects. In this article, we focus on the iterative process devised by a team of designers, developers, and archivists. We discuss our process for creating a data model to map Encoded Archival Description files exported from ArchivesSpace into JSON structured data for use with Hugo, an open-source static site generator. We present our overall systems design for the suite of microservices used to automate and scale this process. The new solution is available for other institutions to leverage for their finding aids.

https://tinyurl.com/39r44vfv


“Recommender Systems for Digital Humanities and Archives: Multistakeholder Evaluation, Scholarly Information Needs, and Multimodal Similarity”


Current digital archive interfaces often rely on search and browsing functionalities insufficient for complex research tasks. They fail to reveal latent connections between historical sources or support the serendipitous discovery that is vital to humanistic inquiry. The cultural heritage sector is characterized by numerous repositories and siloed digital archives, where aggregated exploration across collections is beneficial yet challenging to implement and maintain. . . .

This research addresses the central question: How can RecSys effectively support scholarly research and facilitate discoverable, understandable, and value-aware access to cultural heritage materials in digital archives? This question is thoroughly explored through three interconnected research areas that address major challenges that have been identified in prior work [9].

https://dl.acm.org/doi/10.1145/3705328.3748761


“The Discovery Adhocracy: Special Collections, Information Resource Management, and Scholarly Communications Departments Partnership”


The paper will explore how the library’s Special Collections, Information Resource Management, and Scholarly Communications departments are partnering to facilitate campus community publications and digital assets, while centering student voices. We will describe the roles of each department involved in the partnership and provide two case studies of how the partnership works in application. One of the case study publications, Toyon: Seven Decades of Student Driven Publishing, gave students the opportunity to work in research and writing teams to create an open-access book. These professional, paid positions provided experience for students to connect with alumni, conduct research in the archive, author a book, and design the layout. Two of the hired students took their experience with them into publishing-related, full-time positions within the university. The other case study publication, The North Coast Otters and Public Arts Initiative, connected the library’s digital archive with the publication to provide users with multiple images for each piece of artwork. Both works demonstrate the power of using discovery principles to engage readers with both publications and the archive, and are models for future publication and digital asset projects. The collaborations among the three library departments began informally, but coalesced over time into a regularly scheduled working group to address projects involving the entire scholarship cycle from research to authoring to publication to discovery. Besides providing students with opportunities to develop their resumes, the published products capture the history of the university, build connections between the library and the campus, and promote the university and library to the broader community.

https://doi.org/10.32473/cslp.v1i1.140040


“An Open Framework for Archival, Reproducible, and Transparent Science”


Digital computational outputs are now ubiquitous in the research workflow and the way in which these data are stored and cataloged is becoming more standardized across fields. However, even with accessible data and code, the barrier to recreating figures and reproducing scientific findings remains high. One element generally missing is the computing environment and associated pipelines in which the data and code are executed to generate figures. The archival, reproducible, and transparent science (ARTS) open framework incorporates containers, version control systems, and persistent archives through which all data, code, and figures related to a research project can be stored together, easily recreated, and serve as an accessible platform for long-term sharing and validation. If the underlying principles behind this framework are broadly adopted, it will improve the reproducibility and transparency of research.

https://arxiv.org/abs/2504.08171


“Web Archives Metadata Generation with GPT-4o: Challenges and Insights”


Current metadata creation for web archives is time consuming and costly due to reliance on human effort. This paper explores the use of GPT-4o for metadata generation within the Web Archive Singapore, focusing on scalability, efficiency, and cost effectiveness. We processed 112 Web ARChive (WARC) files using data reduction techniques, achieving a notable 99.9% reduction in metadata generation costs. By prompt engineering, we generated titles and abstracts, which were evaluated both intrinsically using Levenshtein distance and BERTScore, and extrinsically with human cataloguers using McNemar’s test. Results indicate that while our method offers significant cost savings and efficiency gains, human curated metadata maintains an edge in quality. The study identifies key challenges including content inaccuracies, hallucinations, and translation issues, suggesting that large language models (LLMs) should serve as complements rather than replacements for human cataloguers. Future work will focus on refining prompts, improving content filtering, and addressing privacy concerns through experimentation with smaller models. This research advances the integration of LLMs in web archiving, offering valuable insights into their current capabilities and outlining directions for future enhancements. The code is available at https://github.com/masamune-prog/warc2summary for further development and use by institutions facing similar challenges.
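As an illustration of the intrinsic evaluation the abstract mentions, here is a minimal sketch of Levenshtein distance between a generated title and a human-written one. It is not taken from the paper's repository; the length-based normalization in `normalized_similarity` is an assumption, since the abstract does not say how distances were scaled.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1], where 1.0 means identical strings.

    Dividing by the longer length is an assumed convention, not the paper's.
    """
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Edit distance only measures surface overlap, which is presumably why the authors pair it with BERTScore and human judgment.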

https://tinyurl.com/mtjdbeus


“AI Chatbots Need More Books to Learn From. These Libraries Are Opening Their Stacks”


Harvard’s newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter’s handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.

https://tinyurl.com/bdzxx8r7


Paywall: “A Checklist to Publish Collections as Data in GLAM Institutions”


The purpose of this study is to offer a checklist that can be used for both creating and evaluating digital collections, which are also sometimes referred to as data sets as part of the collections as data movement, suitable for computational use.

https://doi.org/10.1108/GKMC-06-2023-0195


“Exploring Emerging Technologies in Archiving and Preservation: Leveraging 3D Models, Interactive Environments, and AI Tools”


This article. . . explores how cultural heritage practitioners can leverage emerging technologies to enhance their work. . . . This article highlights AI applications and emerging technologies that can generate scripts without needing coding experience, create 3D models that increase accessibility and engagement, and develop virtual exhibits that extend the lifespan and reach of physical exhibits while providing additional interactive elements.

https://doi.org/10.1177/18758789251336085


“Can LLMs Categorize the Specialized Documents from Web Archives in a Better Way?”


The explosive growth of web archives presents a significant challenge: manually curating specialized document collections from this vast data. Existing approaches rely on supervised techniques, but recent advancements in Large Language Models (LLMs) offer new possibilities for automating collection creation. Large Language Models (LLMs) are demonstrating impressive performance on various tasks even without fine-tuning. This paper investigates the effectiveness of prompt design in achieving results comparable to fine-tuned models. We explore different prompting techniques for collecting specialized documents from web archives like UNT.edu, Michigan.gov, and Texas.gov. We then analyze the performance of LLMs under various prompt configurations. Our findings highlight the significant impact of incorporating task descriptions within prompts. Additionally, including the document type as justification for the search scope leads to demonstrably better results. This research suggests that well-crafted prompts can unlock the potential of LLMs for specialized tasks, potentially reducing reliance on resource-intensive fine-tuning. This research paves the way for automating specialized collection creation using LLMs and prompt engineering.

https://dl.acm.org/doi/10.1145/3677389.3702591


“Developing Practices for FAIR and Linked Data in Heritage Science”


Heritage Science has a lot to gain from the Open Science movement but faces major challenges due to the interdisciplinary nature of the field, as a vast array of technological and scientific methods can be applied to any imaginable material. Historical and cultural contexts are as significant as the methods and material properties, which is something the scientific templates for research data management rarely take into account. While the FAIR data principles are a good foundation, they do not offer enough practical help to researchers facing increasing demands from funders and collaborators. In order to identify the issues and needs that arise “on the ground floor”, the staff at the Heritage Laboratory at the Swedish National Heritage Board took part in a series of workshops with case studies. The results were used to develop guides for good data practices and a list of recommended online vocabularies for standardised descriptions, necessary for findable and interoperable data. However, the project also identified areas where there is a lack of useful vocabularies and the consequences this could have for discoverability of heritage studies on materials from areas of the world that have historically been marginalised by Western culture. If Heritage Science as a global field of study is to reach its full potential this must be addressed.

https://doi.org/10.1038/s40494-025-01598-x


“Building Trustworthy AI Solutions: Integrating Artificial Intelligence Literacy into Records Management and Archival Systems”


This paper explores the essential role of Artificial Intelligence (AI) competencies and literacy in the fields of records management and archival practices, within the framework of the InterPARES Trust AI project. . . . The study employs two complementary approaches: (1) a detailed competency framework developed through literature reviews, interviews with archival professionals who have applied AI to the processing of records, and validation workshops with practitioners; and (2) a comprehensive AI literacy framework derived from multiple case studies and theoretical discussions. . . . Findings indicate that archival professionals can leverage AI in their work practices by acquiring basic AI literacy, practical AI skills, data-related skills, tool-testing and evaluation, adaptation of AI to their workflows, and by actively engaging in collaborative projects with information technology (IT) developers.

https://doi.org/10.48550/arXiv.2307.14852


“Datafication and Cultural Heritage Collections Data Infrastructures: Critical Perspectives on Documentation, Cataloguing and Data-sharing in Cultural Heritage Institutions”


The role of cultural heritage collections within the research ecosystem is rapidly changing. From often-passive primary source or reference point for humanities research, cultural heritage collections are now becoming an integral part of large-scale interdisciplinary inquiries using computational-driven methods and tools. This new status for cultural heritage collections, in the ‘collections-as-data’ era, would not be possible without foundational work that was and is still going on ‘behind the scenes’ in cultural heritage institutions through cataloguing, documentation and curation of cultural heritage records. This article assesses the landscape for cultural heritage collections data infrastructure in the UK through an empirical and critical perspective, presenting insights on the infrastructure that cultural heritage organisations use to record and manage their collections, exploring the range of systems being used, the levels of complexity or ease at which collections data can be accessed, and the shape of interactions between software suppliers, cultural heritage organisations, and third-party partners. The paper goes on to include a critical analysis of the findings based on the sector’s approach to ‘3s’, that is standards, skill sets and scale, and how that applies to different cultural heritage organisations throughout the data lifecycle, from data creation, stewardship to sharing and re-using.

https://doi.org/10.5334/johd.277


“Copyright and Licencing for Cultural Heritage Collections as Data”


Cultural Heritage (CH) institutions have been exploring innovative ways to publish digital collections to facilitate reuse, through initiatives like Collections as data and the International GLAM Labs Community. When making a digital collection available for computational use, it is crucial to have reusable and machine-readable open licences and copyright terms. While existing studies address copyright for digital collections, this study focuses specifically on the unique requirements of collections as data. This research highlights both the legal and technical aspects of copyright concerning collections as data. It discusses permissible uses of copyrighted collections, emphasising the need for interoperable, machine-readable licences and open licences. By reviewing current literature and examples, this study presents best practices and examples to help CH institutions better navigate copyright and licencing issues, ultimately enhancing their ability to convert their content into collections as data for computational research.

https://doi.org/10.5334/johd.263


“Conceptualizing Aggregate-Level Description in Web Archives”


Web archives collections are often excluded from archival science discussions, and their description instead focuses on bibliographic approaches to item-level metadata. This article argues that web archives are best understood using approaches of archival description, focusing on a case study of the Danish Netarchive, a long-running national web archive. By capturing and preserving web sites for the purposes of legal deposit, the Netarchive creates and maintains historical records of the web. Examining the Netarchive’s systems and activities through the lens of archival representation, this article develops a typology of representational artifacts that support this work, including the use of database entities, wiki documentation, classification and management via Jira issues, and codes, identifiers, and structures embedded in network protocols themselves. The analysis considers how meaningful aggregations can be understood via these representational schemes, systems and architectures, and how the nature of born-networked records challenges concepts of singular, hierarchical orderings of records aggregations. The closing discussion proposes new modes of description that address these multiple interconnected systems, and raises questions about what this might mean for aggregate-level description in the context of digital and born-networked records more broadly.

https://doi.org/10.5334/johd.265


“Starting with the Digital Doesn’t Make it Easier: Developing Transparent Born Digital Acquisition Policies for Archives”


As organizations continue to overwhelmingly abandon all forms of paper-based record keeping, libraries are still adapting to increased offers of born digital archival donations. Simple misunderstandings or disconnects between the units facilitating donations and maintaining born-digital collections create pain-points for donor relations and can result in a lack of transparency over how their records may be processed. To facilitate better donor transparency and cross-area collaboration over born digital records, Special Collections and archives need comprehensive policies and shifts in training and collaboration paradigms. This paper analyses the intersections of born digital archiving, collection development policies, donor relations, human-supported AI tools, and digital records education within American academic libraries to propose a functional toolkit for born digital acquisitions. Unrealistic expectations of collection processing, retention, growth, and publication onto openly accessible platforms can quickly overwhelm a library’s digital collections team due to size, need for digital forensics work, copyright limitations, or other capacity-related issues. Intertwined within this discussion is an additional discourse over the need to carefully curate our digital spaces not only for practical cost reasons, but due to the environmental costs of massive data storage solutions. Through an analysis of the elements stated above, the paper will reflect on the need to integrate born digital materials into archival acquisition procedures and provide practical solutions to meet this need.

https://tinyurl.com/8r3ucesb


“Text Analysis of Archival Finding Aids; Collection Scoping and Beyond”


In this study, we examine the suitability of text analysis as a method for analyzing collection scope strengths across a repository’s physical archival holdings. We apply a tool for text analysis called Leximancer to analyze a corpus of archival finding aids to explore topical coverage. Leximancer results were highly aligned with the baseline subject heading analysis that we performed, but the concepts, themes, and co-occurring topic pairs surfaced by Leximancer suggest areas of collection strength and potential focus for new acquisitions. We discuss the potential applications of text analysis for internal library use including collection development, as well as potential implications for wider description, discovery, and access. Text analysis can accurately surface topical strengths and directly lead to insights that can inform future acquisition decisions and archival collection development policies.

https://tinyurl.com/mr45f8e7


“Capturing Captions: Using AI to Identify and Analyse Image Captions in a Large Dataset of Historical Book Illustrations”


This article outlines how AI methods can be used to identify image captions in a large dataset of digitised historical book illustrations. This dataset includes over a million images from 68,000 books published between the eighteenth and early twentieth centuries, covering works of literature, history, geography, and philosophy. The article has two primary objectives. First, it suggests the added value of captions in making digitized illustrations more searchable by picture content in online archives. To further this objective, we describe the methods we have used to identify captions, which can effectively be re-purposed and applied in different contexts. Second, we suggest how this research leads to new understandings of the semantics and significance of the captions of historical book illustrations. The findings discussed here mark a critical intervention in the fields of digital humanities, book history, and illustration studies.

https://tinyurl.com/bdvjespp


“AI and Medical Images: Addressing Ethical Challenges to Provide Responsible Access to Historical Medical Illustrations”


This article examines the ethical considerations and broader issues around access to digitised historical medical images. These illustrations and, later, photographs are often extremely sensitive, representing disability, disease, gender, and race in potentially harmful and problematic ways. In particular, the original metadata for such images can include demeaning and sometimes racist terms. Some of these images show sexually explicit and violent content, as well as content that was obtained without informed consent. Hiding these sensitive images can be tempting, and yet, archives are meant to be used, not locked away. Through a series of interviews with 10 archivists, librarians, and researchers based in the UK and US, the authors show that improved access to medical illustrations is essential to produce new knowledge in the humanities and medical research, as well as to bridge the gap between historical and modern understandings of the human body. Improving access to medical illustration can also help to address the "gender data gap", which has acquired mainstream visibility thanks to the work of activists such as Caroline Criado-Perez, the author of Invisible Women: Data Bias in a World Designed for Men.

https://tinyurl.com/3jek7ey4
