"Capturing Captions: Using AI to Identify and Analyse Image Captions in a Large Dataset of Historical Book Illustrations"


This article outlines how AI methods can be used to identify image captions in a large dataset of digitised historical book illustrations. This dataset includes over a million images from 68,000 books published between the eighteenth and early twentieth centuries, covering works of literature, history, geography, and philosophy. The article has two primary objectives. First, it suggests the added value of captions in making digitised illustrations more searchable by picture content in online archives. To further this objective, we describe the methods we have used to identify captions, which can effectively be re-purposed and applied in different contexts. Second, we suggest how this research leads to new understandings of the semantics and significance of the captions of historical book illustrations. The findings discussed here mark a critical intervention in the fields of digital humanities, book history, and illustration studies.

https://tinyurl.com/bdvjespp

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"AI and Medical Images: Addressing Ethical Challenges to Provide Responsible Access to Historical Medical Illustrations"


This article examines the ethical considerations and broader issues around access to digitised historical medical images. These illustrations and, later, photographs are often extremely sensitive, representing disability, disease, gender, and race in potentially harmful and problematic ways. In particular, the original metadata for such images can include demeaning and sometimes racist terms. Some of these images show sexually explicit and violent content, as well as content that was obtained without informed consent. Hiding these sensitive images can be tempting, and yet, archives are meant to be used, not locked away. Through a series of interviews with 10 archivists, librarians, and researchers based in the UK and US, the authors show that improved access to medical illustrations is essential to produce new knowledge in the humanities and medical research, as well as to bridge the gap between historical and modern understandings of the human body. Improving access to medical illustration can also help to address the "gender data gap", which has acquired mainstream visibility thanks to the work of activists such as Caroline Criado-Perez, the author of Invisible Women: Data Bias in a World Designed for Men.

https://tinyurl.com/3jek7ey4

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Academic Authors ‘Shocked’ After Taylor & Francis Sells Access to Their Research to Microsoft AI"


One of the biggest concerns raised by Clemens [Dr Ruth Alison Clemens] is over whether it is possible for Taylor & Francis’ authors to opt out of the AI partnership with Microsoft. Clemens told The Bookseller: "There is no clarity from Taylor & Francis about whether an opt-out policy is in place or on the cards. But as they did not inform their authors about the deal in the first place, any opt-out policy is now not functional."

Taylor & Francis was paid around $10 million for the license.

https://tinyurl.com/3yyarxnj

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Meta Releases the Biggest and Best Open-Source AI Model Yet"


Meta is releasing Llama 3.1, the largest-ever open-source AI model, which the company claims outperforms GPT-4o and Anthropic’s Claude 3.5 Sonnet on several benchmarks. It’s also making the Llama-based Meta AI assistant available in more countries and languages while adding a feature that can generate images based on someone’s specific likeness. . . .

Meta’s own implementation of Llama is its AI assistant, which is positioned as a general-purpose chatbot like ChatGPT and can be found [in a few weeks] in just about every part of Instagram, Facebook, and WhatsApp.

https://tinyurl.com/2cs552p4

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

Paywall: "Exploring the Use of Generative Artificial Intelligence in Systematic Searching: A Comparative Case Study of a Human Librarian, ChatGPT-4 and ChatGPT-4 Turbo"


The findings suggest that AI could expand the scope of search terms and queries, automating the more repetitive and formulaic aspects of the systematic-review process, while human expertise remains crucial in refining search terms and ensuring methodological rigor. Meanwhile, challenges remain for AI tools’ capacity to access subscription-based or proprietary databases and generate sophisticated search strategies.

https://doi.org/10.1177/03400352241263532

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

AI Is Running Out of New Training Data: "Consent in Crisis: The Rapid Decline of the AI Data Commons"


General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. . . . Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems.

https://tinyurl.com/4k56axzk

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"STM Statement Regarding Unlicensed Use of STM’s Members’ Content in the Training, Development, and Operation of AI Models"


The unlicensed use of STM’s members’ content in the training, development, and operation of AI models is of great concern to STM and to our members. Because STM’s members do not share a single jurisdiction, the particular actions and practices of a given AI developer with respect to a given domestic copyright law are too varied to enumerate here. However, regardless of legal nuances among jurisdictions, STM considers the conclusion to be the same — the collection of our members’ content and its use in AI training without authorization, compensation or attribution, amounts to infringement. We support the statements about third parties’ use of content in generative AI training and development that have been made by our sister organizations the International Publishers Association and the UK Publishers Association.

https://tinyurl.com/5n6zh9sy

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Google’s Wrong Answer to the Threat of AI — Stop Indexing Content"


"Google is no longer trying to index the entire web," writes Schmalbach [Vincent Schmalbach, SEO expert]. "In fact, it’s become extremely selective, refusing to index most content. This isn’t about content creators failing to meet some arbitrary standard of quality. Rather, it’s a fundamental change in how Google approaches its role as a search engine." The default setting from now on will be not to index content unless it is genuinely unique, authoritative and has ‘brand recognition’.

https://tinyurl.com/32t98fhu

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"ARL & CNI Release Deluxe Edition of AI-Influenced Future Scenarios for Research Environment"


This Deluxe Edition of the ARL/CNI AI Scenarios includes:

  • The Final Scenario Set: This final scenario set explores potential futures where AI plays a pivotal role, providing critical insights into the evolving challenges and opportunities for the research environment.
  • The Strategic Context Report: This report summarizes community feedback gathered through focus groups and interviews about an AI-influenced future for the research environment that were held in winter 2023–24 and spring 2024.
  • The Provocateur Interview Report: Featuring forward-thinking dialogues with industry leaders, these interviews challenge conventional wisdom and stimulate stretch thinking with regard to an AI-influenced future.

https://tinyurl.com/5n7xwc8c

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"A Real-World Test of Artificial Intelligence Infiltration of a University Examinations System: A ‘Turing Test’ Case Study"


The recent rise in artificial intelligence systems, such as ChatGPT, poses a fundamental problem for the educational sector. In universities and schools, many forms of assessment, such as coursework, are completed without invigilation. Therefore, students could hand in work as their own which is in fact completed by AI. Since the COVID pandemic, the sector has additionally accelerated its reliance on unsupervised ‘take home exams’. If students cheat using AI and this is undetected, the integrity of the way in which students are assessed is threatened. We report a rigorous, blind study in which we injected 100% AI written submissions into the examinations system in five undergraduate modules, across all years of study, for a BSc degree in Psychology at a reputable UK university. We found that 94% of our AI submissions were undetected. The grades awarded to our AI submissions were on average half a grade boundary higher than that achieved by real students. Across modules there was an 83.4% chance that the AI submissions on a module would outperform a random selection of the same number of real student submissions.

https://doi.org/10.1371/journal.pone.0305354

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"RIAA Sues Suno & Udio AI Music Generators For ‘Trampling’ on Copyright"


Major recording labels of the RIAA have filed a pair of broadly similar copyright lawsuits against two key generative AI music services. The owners of Udio and Suno stand accused of copying the labels’ music on a massive scale and the labels suggest that they’re already on the back foot. In pre-litigation correspondence, both were ‘evasive’ on content sources before citing fair use, which the RIAA notes only arises as a defense in cases of unauthorized use of copyright works.

https://tinyurl.com/p9tnycte

See also: "World’s Biggest Music Labels Sue Over AI Copyright."

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

Paywall: "AI Is Exhausting the Power Grid. Tech Firms Are Seeking a Miracle Solution."


In addition to fusion, tech giants are hoping to generate power through such futuristic schemes as small nuclear reactors hooked to individual computing centers and machinery that taps geothermal energy by boring 10,000 feet into the Earth’s crust. . . .

A recent Goldman Sachs analysis of energy that will power the AI boom into 2030. . . found data centers will account for 8 percent of total electricity use in the United States by 2030, a near tripling of their share today. New solar and wind energy will meet about 40 percent of that new power demand from data centers, the forecast said, while the rest will come from a vast expansion in the burning of natural gas. The new emissions created would be comparable to that of putting 15.7 million additional gas-powered cars on the road.

https://tinyurl.com/5fhwpc36

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Empowering Knowledge through AI: Open Scholarship Proactively Supporting Well Trained Generative AI"


Generative AI has taken the world by storm over the last few years, and the world of scholarly communications has not been immune to this. Most discussions in this area address how we can integrate these tools into our workflows, concerns about how researchers and students might misuse the technology, or the unauthorised use of copyrighted work. This article argues for a novel viewpoint that librarians and publishers should be encouraging the use of their scholarly content in the training of AI algorithms. Inclusion of scholarly works would advance the reliability and accuracy of the information in training datasets and ensure that this content is included in new knowledge discovery platforms. The article also argues that inclusion can be achieved by improving linkage to content and by making sure that licences explicitly allow inclusion in AI training datasets, and it advocates for a more collaborative approach to shaping the future of the information landscape in academia.

https://doi.org/10.1629/uksg.649

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Why Does AI Hallucinate?"


But none of these techniques will stop hallucinations fully. As long as large language models are probabilistic, there is an element of chance in what they produce. Roll 100 dice and you’ll get a pattern. Roll them again and you’ll get another. Even if the dice are, like large language models, weighted to produce some patterns far more often than others, the results still won’t be identical every time. Even one error in 1,000—or 100,000—adds up to a lot of errors when you consider how many times a day this technology gets used.
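The excerpt's closing arithmetic can be made concrete. Under the simplifying assumption (mine, not the article's) that each response errs independently with probability p, the chance of at least one error in n uses is 1 - (1 - p)^n, and the expected number of errors is p * n:

```python
# A back-of-the-envelope sketch of why rare errors add up at scale.
# Assumes each response errs independently with probability p;
# the numbers below are illustrative, not measured rates.

def p_at_least_one_error(p: float, n: int) -> float:
    """Probability of at least one error across n independent responses."""
    return 1 - (1 - p) ** n

def expected_errors(p: float, n: int) -> float:
    """Expected number of erroneous responses out of n."""
    return p * n

# Even "one error in 100,000" yields ~10 expected errors over a
# million daily uses, and at least one error is a near certainty:
print(expected_errors(1e-5, 1_000_000))        # 10.0
print(p_at_least_one_error(1e-5, 1_000_000))   # ~0.99995
```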

https://tinyurl.com/2w2y3d94

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Springer Nature Unveils Two New AI Tools to Protect Research Integrity"


Geppetto works by dividing the paper up into sections and uses its own algorithms to check the consistency of the text in each section. The sections are then given a score based on the probability that the text in them has been AI generated. The higher the score, the greater the probability of there being problems, initiating a human check by Springer Nature staff. Geppetto is already responsible for identifying hundreds of fake papers soon after submission, preventing them from being published — and from taking up editors’ and peer reviewers’ valuable time. . . .

SnappShot, also developed in-house, is an AI-assisted image integrity analysis tool. Currently used to analyse PDF files containing gel and blot images and look for duplications in those image types — another known integrity problem within the industry — this will be expanded to cover additional image types and integrity problems and speed up checks on papers.

https://tinyurl.com/3uxbvans

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Taylor & Francis Issues Expanded Guidance on AI Application for Authors, Editors and Reviewers"


Taylor & Francis has issued the latest iteration of its policy on the application of AI tools. The policy aims to promote ethical and transparent use of AI, while addressing the risks and challenges it can pose for research publishing.

From the policy:

Authors must clearly acknowledge within the article or book any use of Generative AI tools through a statement which includes: the full name of the tool used (with version number), how it was used, and the reason for use. For article submissions, this statement must be included in the Methods or Acknowledgments section. Book authors must disclose their intent to employ Generative AI tools at the earliest possible stage to their editorial contacts for approval — either at the proposal phase if known, or if necessary, during the manuscript writing phase. If approved, the book author must then include the statement in the preface or introduction of the book.

https://tinyurl.com/h3rfkynm

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Apple Is Putting ChatGPT in Siri for Free Later This Year"


Apple is partnering with OpenAI to put ChatGPT into Siri, the company announced at its WWDC 2024 keynote on Monday.

ChatGPT will be available for free in iOS 18 and macOS Sequoia later this year without an account, and Apple says that user queries won’t be logged. The popular chatbot will also be integrated into Apple’s systemwide writing tools.

https://tinyurl.com/29bs35b3

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

Generative AI: "The Impossibility of Fair LLMs"


The need for fair AI is increasingly clear in the era of general-purpose systems such as ChatGPT, Gemini, and other large language models (LLMs). However, the increasing complexity of human-AI interaction and its social impacts have raised questions of how fairness standards could be applied. Here, we review the technical frameworks that machine learning researchers have used to evaluate fairness, such as group fairness and fair representations, and find that their application to LLMs faces inherent limitations. We show that each framework either does not logically extend to LLMs or presents a notion of fairness that is intractable for LLMs, primarily due to the multitudes of populations affected, sensitive attributes, and use cases. To address these challenges, we develop guidelines for the more realistic goal of achieving fairness in particular use cases: the criticality of context, the responsibility of LLM developers, and the need for stakeholder participation in an iterative process of design and evaluation. Moreover, it may eventually be possible and even necessary to use the general-purpose capabilities of AI systems to address fairness challenges as a form of scalable AI-assisted alignment.

https://arxiv.org/abs/2406.03198

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Towards Conversational Discovery: New Discovery Applications for Scholarly Information in the Era of Generative Artificial Intelligence"


Here, we. . . discuss how GenAI is moving us towards conversational discovery and what this might mean for publishing, as well as potential future trends in information discovery.

AI-powered features include natural language search, concise summaries, and synthesis of research. . . .

It [Scopus AI] has the ability to use keywords from research abstracts to generate concept maps for each query. Dimensions Assistant offers well-structured explanations. . . . researchers can receive notifications each time content is generated. . . .

There are two types of AI/GenAI powered discovery systems: AI+ refers to native applications which can only be built based on GenAI (such as ChatGPT and Perplexity.ai), while +AI means AI/GenAI can be integrated to improve existing discovery tools and search engines such as Google and Bing.

https://tinyurl.com/53chtzu7

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"4 Types of Gen AI Risk and How to Mitigate Them"


Risk around using gen AI can be classified based on two factors: intent and usage. Accidental misapplication of gen AI is different from deliberate malpractices (intent). Similarly, using gen AI tools to create content is differentiated from consuming content that other parties may have created with gen AI (usage). To mitigate the risk of gen AI content misuse and misapplication, organizations need to develop the capabilities to detect, identify, and prevent the spread of such potentially misleading content.
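The intent-and-usage framing above amounts to a two-by-two matrix. A minimal sketch of that classification (the quadrant labels are my own paraphrase, not the article's exact terms):

```python
# The article classifies gen-AI risk along two axes: intent (accidental
# vs deliberate) and usage (creating content vs consuming it).  The
# quadrant descriptions below are illustrative paraphrases.

RISK_QUADRANTS = {
    ("accidental", "creating"):  "misapplication, e.g. publishing unchecked AI output",
    ("accidental", "consuming"): "unwitting reliance on AI-generated content",
    ("deliberate", "creating"):  "malpractice, e.g. fabricating misleading content",
    ("deliberate", "consuming"): "knowing amplification of misleading content",
}

def classify_risk(intent: str, usage: str) -> str:
    """Look up the risk quadrant for a given intent/usage pair."""
    return RISK_QUADRANTS[(intent, usage)]

print(classify_risk("accidental", "creating"))
```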

https://tinyurl.com/3shctfct

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

Generative AI Issues in Scholarly Publishing: "Guest Post: Jagged Edges of Conversational Interfaces Over Scholarly and Professional Content"


The fundamental tension is that unlike web distribution of static content, which has enormous scale advantages due to very low marginal costs, the RAG [Retrieval-Augmented Generation] pattern has high marginal costs (10-1000X) that scale linearly. While token costs remain high, for general scholarly applications outside of specialty practitioners, the central business or product challenge will be how to generate sufficient incremental revenue to offset the vastly higher compute costs to use GenAI technology to generate responses to queries.
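The cost asymmetry the excerpt describes can be illustrated with a toy per-query calculation. All figures below are hypothetical stand-ins, not actual vendor prices: a RAG answer pays token costs on every query, while serving a static page costs a tiny, roughly fixed amount.

```python
# Toy illustration of static serving vs RAG marginal cost per query.
# All prices are hypothetical placeholders, not real quotes.

def rag_cost_per_query(prompt_tokens: int, output_tokens: int,
                       price_per_1k_in: float, price_per_1k_out: float) -> float:
    """Dollar cost of one RAG answer: input tokens (query + retrieved
    passages stuffed into the context window) plus generated output tokens."""
    return (prompt_tokens / 1000) * price_per_1k_in \
         + (output_tokens / 1000) * price_per_1k_out

static_cost = 0.0001   # hypothetical marginal cost of serving one static page
rag_cost = rag_cost_per_query(
    prompt_tokens=4000,       # query plus retrieved context
    output_tokens=500,
    price_per_1k_in=0.005,    # hypothetical per-1k-token prices
    price_per_1k_out=0.015,
)
print(rag_cost)                    # 0.0275 dollars per query
print(rag_cost / static_cost)      # ~275x, within the 10-1000X range cited
```

Because the RAG cost scales linearly with query volume rather than amortizing like static distribution, the ratio, not the absolute price, is what drives the revenue problem the post describes.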

https://tinyurl.com/38x432h7

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

ARL Poll: "AI and Libraries: Strengths in a Digital Tomorrow"


The poll results from the ARL/CNI 2035 Scenarios exploration reveal diverse strengths that research libraries can harness as they navigate AI-influenced futures. These strengths underscore libraries’ vital role in maintaining information integrity and ensuring equitable access amidst the challenges posed by AI advancements. For libraries, these insights emphasize the importance of continuing to build on these core competencies while staying adaptive and responsive to emerging technological trends. Leveraging the ARL/CNI 2035 Scenarios and continued attention to the broader strategic landscape will enable libraries to be proactive and remain relevant and effective as custodians of knowledge in an increasingly digital and AI-driven world.

https://tinyurl.com/38mmuxnb

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Generative AI in Higher Education: A Global Perspective of Institutional Adoption Policies and Guidelines"


Integrating generative AI (GAI) into higher education is crucial for preparing a future generation of GAI-literate students. Yet a thorough understanding of the global institutional adoption policy remains absent, with most of the prior studies focused on the Global North and the promises and challenges of GAI, lacking a theoretical lens. This study utilizes the Diffusion of Innovations Theory to examine GAI adoption strategies in higher education across 40 universities from six global regions. It explores the characteristics of GAI innovation, including compatibility, trialability, and observability, and analyses the communication channels and roles and responsibilities outlined in university policies and guidelines. The findings reveal a proactive approach by universities towards GAI integration, emphasizing academic integrity, teaching and learning enhancement, and equity. Despite a cautious yet optimistic stance, a comprehensive policy framework is needed to evaluate the impacts of GAI integration and establish effective communication strategies that foster broader stakeholder engagement. The study highlights the importance of clear roles and responsibilities among faculty, students, and administrators for successful GAI integration, supporting a collaborative model for navigating the complexities of GAI in education. This study contributes insights for policymakers in crafting detailed strategies for its integration.

https://arxiv.org/abs/2405.11800

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

International Scientific Report on the Safety of Advanced AI


The interim report highlights several key takeaways, including:

  • General-purpose AI can be used to advance the public interest, leading to enhanced wellbeing, prosperity, and scientific discoveries.
  • According to many metrics, the capabilities of general-purpose AI are advancing rapidly. Whether there has been significant progress on fundamental challenges such as causal reasoning is debated among researchers.
  • Experts disagree on the expected pace of future progress of general-purpose AI capabilities, variously supporting the possibility of slow, rapid, or extremely rapid progress.
  • There is limited understanding of the capabilities and inner workings of general-purpose AI systems. Improving our understanding should be a priority.
  • Like all powerful technologies, current and future general-purpose AI can be used to cause harm. For example, malicious actors can use AI for large-scale disinformation and influence operations, fraud, and scams.
  • Malfunctioning general-purpose AI can also cause harm, for instance through biased decisions with respect to protected characteristics like race, gender, culture, age, and disability.
  • Future advances in general-purpose AI could pose systemic risks, including labour market disruption and economic power inequalities. Experts have different views on the risk of humanity losing control over AI in a way that could result in catastrophic outcomes.
  • Several technical methods (including benchmarking, red-teaming and auditing training data) can help to mitigate risks, though all current methods have limitations, and improvements are required.
  • The future of AI is uncertain, with a wide range of scenarios appearing possible. The decisions of societies and governments will significantly impact its future.

Report

https://tinyurl.com/3h7bdvzr

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Project Astra is the Future of AI at Google"


Google calls it Project Astra, and it’s a real-time, multimodal AI assistant that can see the world, knows what things are and where you left them, and can answer questions or help you do almost anything. In an incredibly impressive demo video that Hassabis swears is not faked or doctored in any way, an Astra user in Google’s London office asks the system to identify a part of a speaker, find their missing glasses, review code, and more. It all works practically in real time and in a very conversational way.

https://tinyurl.com/bkkfaxrd

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |