"Introducing OpenAI o1-preview: A New Series of Reasoning Models for Solving Hard Problems"


In our tests, the next model update performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o [the last model] correctly solved only 13% of problems, while the [new] reasoning model scored 83%. Their coding abilities were evaluated in contests and reached the 89th percentile in Codeforces competitions. . . .

As an early model, it doesn’t yet have many of the features that make ChatGPT useful, like browsing the web for information and uploading files and images. For many common cases GPT-4o will be more capable in the near term.

https://tinyurl.com/5ap6p996

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

Paywall: "Reshaping Academic Library Information Literacy Programs in the Advent of ChatGPT and Other Generative AI Technologies"


This article reports on three digital information literacy initiatives created by instruction librarians to support students’ use of generative AI technologies, namely ChatGPT, in academic library research. The cumulative and formative data gathered from the initiatives reveals a continuing need for academic libraries to provide information literacy instruction that guides students toward the ethical use of information and awareness of using generative AI tools in library research.

https://doi.org/10.1080/10875301.2024.2400132


"The AI-Copyright Trap"


As AI tools proliferate, policy makers are increasingly being called upon to protect creators and the cultural industries from the extractive, exploitative, and even existential threats posed by generative AI. In their haste to act, however, they risk running headlong into the Copyright Trap: the mistaken conviction that copyright law is the best tool to support human creators and culture in our new technological reality (when in fact it is likely to do more harm than good). It is a trap in the sense that it may satisfy the wants of a small group of powerful stakeholders, but it will harm the interests of the more vulnerable actors who are, perhaps, most drawn to it. Once entered, it will also prove practically impossible to escape. I identify three routes into the copyright trap in current AI debates: first is the “if value, then (property) right” fallacy; second is the idea that unauthorized copying is inherently wrongful; and third is the resurrection of the starving artist trope to justify copyright’s expansion.

https://tinyurl.com/bdett6ue


"Datacenters to Emit 3X More Carbon Dioxide Because of Generative AI"


The datacenter industry is set to emit 2.5 billion tonnes of greenhouse gas (GHG) emissions worldwide between now and the end of the decade, three times more than if generative AI had not been developed.

https://tinyurl.com/4vatmm8a


"Clarivate Report Unveils the Transformative Role of Artificial Intelligence on Shaping the Future of the Library"


The report combines feedback from a survey of more than 1,500 librarians from across the world with qualitative interviews, covering academic, national and public libraries. In addition to the downloadable report, the accompanying microsite’s dynamic and interactive data visualizations enable rapid comparative analyses according to regions and library types. . . .

Key findings of the report include:

  • Most libraries have an AI plan in place, or one in progress: Over 60% of respondents are evaluating or planning for AI integration.
  • AI adoption is the top tech priority: AI-powered tools for library users and patrons top the list of technology priorities for the next 12 months, according to 43% of respondents.
  • AI is advancing library missions: Key goals for those evaluating or implementing AI include supporting student learning (52%), research excellence (47%) and content discoverability (45%), aligning closely with the mission of libraries.
  • Librarians see promise and pitfalls in AI adoption: 42% believe AI can automate routine tasks, freeing librarians for strategic and creative activities. Levels of optimism vary regionally.
  • AI skills gaps and shrinking budgets are top concerns. Lack of expertise and budget constraints are seen as greater challenges than privacy and security issues: — Shrinking budgets: Almost half (47%) cite shrinking budgets as their greatest challenge. — Skills gap: 52% of respondents see upskilling as AI’s biggest impact on employment, yet nearly a third (32%) state that no training is available.
  • AI advancement will be led by IT: By combining the expertise of heads of IT with strategic investment and direction from senior leadership, libraries can move from consideration to implementation of AI in the coming years.
  • Regional priorities differ: Librarians’ views on other key topics such as sustainability, diversity, open access and open science show notable regional diversity.

https://tinyurl.com/9azeessa

Pulse of the Library report


"The AI Copyright Hype: Legal Claims That Didn’t Hold Up"


Over the past year, two dozen AI-related lawsuits and their myriad infringement claims have been winding their way through the court system. None have yet reached a jury trial. While we all anxiously await court rulings that can inform our future interaction with generative AI models, in the past few weeks, we are suddenly flooded by news reports with titles such as “US Artists Score Victory in Landmark AI Copyright Case,” “Artists Land a Win in Class Action Lawsuit Against A.I. Companies,” “Artists Score Major Win in Copyright Case Against AI Art Generators,” and the list goes on. The exuberant mood in these headlines mirrors the enthusiasm of people actually involved in this particular case (Andersen v. Stability AI). The plaintiffs’ lawyer calls the court’s decision “a significant step forward for the case.” “We won BIG,” writes the plaintiff on X.

In this blog post, we’ll explore the reality behind these headlines and statements. The “BIG” win in fact describes a portion of the plaintiffs’ claims surviving a pretrial motion to dismiss. If you are already familiar with the motion to dismiss per Federal Rules of Civil Procedure Rule 12(b)(6), please refer to Part II to find out what types of claims have been dismissed early on in the AI lawsuits.

https://tinyurl.com/rhmzkr8y


"AI Models Collapse When Trained on Recursively Generated Data"


Yet, although current LLMs. . ., including GPT-3, were trained on predominantly human-generated text, this may change. If the training data of most future models are also scraped from the web, then they will inevitably train on data produced by their predecessors. In this paper, we investigate what happens when text produced by, for example, a version of GPT forms most of the training dataset of following models. . . .

Model collapse is a degenerative process affecting generations of learned generative models, in which the data they generate end up polluting the training set of the next generation. Being trained on polluted data, they then mis-perceive reality. . . .

In our work, we demonstrate that training on samples from another generative model can induce a distribution shift, which—over time—causes model collapse. This in turn causes the model to mis-perceive the underlying learning task. To sustain learning over a long period of time, we need to make sure that access to the original data source is preserved and that further data not generated by LLMs remain available over time. The need to distinguish data generated by LLMs from other data raises questions about the provenance of content that is crawled from the Internet: it is unclear how content generated by LLMs can be tracked at scale. One option is community-wide coordination to ensure that different parties involved in LLM creation and deployment share the information needed to resolve questions of provenance. Otherwise, it may become increasingly difficult to train newer versions of LLMs without access to data that were crawled from the Internet before the mass adoption of the technology or direct access to data generated by humans at scale.

https://doi.org/10.1038/s41586-024-07566-y

See also: “When A.I.’s Output Is a Threat to A.I. Itself.”
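The degenerative process described in the excerpt can be illustrated with a toy experiment. What follows is a minimal Python sketch, not the paper's code: it assumes a one-dimensional Gaussian stands in for the "model", and the function name `simulate_recursive_fitting`, the sample size, and the number of generations are illustrative choices. Each generation fits a mean and standard deviation to samples drawn from the previous generation's fit, so the original data is never seen again after the first step.

```python
import random
import statistics


def simulate_recursive_fitting(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the fitted standard deviation at each generation, starting
    with the standard deviation of the original distribution.
    """
    rng = random.Random(seed)      # fixed seed so runs are repeatable
    mean, std = 0.0, 1.0           # stand-in for the original, human-made data
    history = [std]
    for _ in range(generations):
        # "Generate" data from the current model...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...then train the next model on that generated data only.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


stds = simulate_recursive_fitting()
print(f"generation 0 std: {stds[0]:.4f}")
print(f"generation {len(stds) - 1} std: {stds[-1]:.3e}")
```

With small samples the fitted spread tends to shrink from one generation to the next, a simple analogue of the original distribution's tails being lost. Mixing some of the original data back in at each step, as the excerpt recommends, would be the natural variation to try.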


"Artificial Intelligence Assisted Curation of Population Groups in Biomedical Literature"


Curation of the growing body of published biomedical research is of great importance to both the synthesis of contemporary science and the archiving of historical biomedical literature. Each of these tasks has become increasingly challenging given the expansion of journal titles, preprint repositories and electronic databases. Added to this challenge is the need for curation of biomedical literature across population groups to better capture study populations for improved understanding of the generalizability of findings. To address this, our study aims to explore the use of generative artificial intelligence (AI) in the form of large language models (LLMs) such as GPT-4 as an AI curation assistant for the task of curating biomedical literature for population groups. We conducted a series of experiments which qualitatively and quantitatively evaluate the performance of OpenAI’s GPT-4 in curating population information from biomedical literature. Using OpenAI’s GPT-4 and curation instructions, executed through prompts, we evaluate the ability of GPT-4 to classify study ‘populations’, ‘continents’ and ‘countries’ from a previously curated dataset of public health COVID-19 studies.

Using three different experimental approaches, we examined performance by: A) evaluation of accuracy (concordance with human curation) using both exact and approximate string matches within a single experimental approach; B) evaluation of accuracy across experimental approaches; and C) conducting a qualitative phenomenology analysis to describe and classify the nature of difference between human curation and GPT curation. Our study shows that GPT-4 has the potential to provide assistance in the curation of population groups in biomedical literature. Additionally, phenomenology provided key information for prompt design that further improved the LLM’s performance in these tasks. Future research should aim to improve prompt design, as well as explore other generative AI models to improve curation performance. An increased understanding of the populations included in research studies is critical for the interpretation of findings, and we believe this study provides keen insight on the potential to increase the scalability of population curation in biomedical studies.

https://doi.org/10.2218/ijdc.v18i1.950


"NVIDIA: Copyrighted Books Are Just Statistical Correlations to Our AI Models"


Earlier this year, several authors sued NVIDIA over alleged copyright infringement. The class action lawsuit alleged that the company’s AI models were trained on copyrighted works and specifically mentioned Books3 data [a database of over 180,000 pirated books]. Since this happened without permission, the rightsholders demand compensation. . . .

The company believes that AI companies should be allowed to use copyrighted books to train their AI models, as these books are made up of “uncopyrightable facts and ideas” that are already in the public domain. . . .

“[AI] Training measures statistical correlations in the aggregate, across a vast body of data, and encodes them into the parameters of a model. Plaintiffs do not try to claim a copyright over those statistical correlations, asserting instead that the training data itself is ‘copied’ for the purposes of infringement,” NVIDIA writes [to the court hearing the case].

According to NVIDIA, the lawsuit boils down to two related questions. First, whether the authors’ direct infringement claim is essentially an attempt to claim copyright on facts and grammar. Second, whether making copies of the books is fair use.

https://tinyurl.com/mpa6e8jj


"Artists Claim ‘Big’ Win in Copyright Suit Fighting AI Image Generators"


In an order on Monday, US district judge William Orrick denied key parts of motions to dismiss from Stability AI, Midjourney, Runway AI, and DeviantArt. The court will now allow artists to proceed with discovery on claims that AI image generators relying on Stable Diffusion violate both the Copyright Act and the Lanham Act, which protects artists from commercial misuse of their names and unique styles. . . .

While Orrick agreed with Midjourney that “plaintiffs have no protection over ‘simple, cartoony drawings’ or ‘gritty fantasy paintings,’” artists were able to advance a “trade dress” claim under the Lanham Act, too.

https://tinyurl.com/yd27cvar

"Trade Dress Infringement"


"The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery"


One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aids to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world’s most challenging problems.

https://arxiv.org/abs/2408.06292


"Wiley and Oxford University Press Confirm AI Partnerships as Cambridge University Press Offers ‘Opt-In’"


Wiley and Oxford University Press (OUP) told The Bookseller they have confirmed AI partnerships, with the availability of opt-ins and remuneration for authors appearing to vary. . . .

Meanwhile, Cambridge University Press has said it is talking to authors about opt-ins along with ‘fair remuneration’ before making any deals.

Hachette, HarperCollins, and Pan Macmillan have not made AI deals.

https://tinyurl.com/bdzax5sk


"What Happens When Your Publisher Licenses Your Work for AI Training?"


In a lot of cases, yes, publishers can license AI training rights without asking authors first. Many publishing contracts include a full and broad grant of rights–sometimes even a full transfer of copyright to the publisher for them to exploit those rights and to license the rights to third parties. . . .

Not all publishing contracts are so broad, however. For example, in the Model Publishing Contract for Digital Scholarship (which we have endorsed), the publisher’s sublicensing rights are limited and specifically defined, and profits resulting from any exploitation of a work must be shared with authors. . . .

There are lots of variations, and specific terms matter. Some publisher agreements are far more limited–transferring only limited publishing and subsidiary rights. . . .

This is further complicated by the fact that authors sometimes are entitled to reclaim their rights, such as through a rights reversion clause or copyright termination. . . .

We [the Authors Alliance] think it is certainly reasonable to be skeptical about the validity of blanket licensing schemes between large corporate rights holders and AI companies, at least when they are done at very large scale. Even though in some instances publishers do hold rights to license AI training, it is dubious whether they actually hold, and sufficiently document, all of the purported rights of all works being licensed for AI training.

https://tinyurl.com/53fnj9h7


"AI’s Future in Grave Danger from Nvidia’s Chokehold on Chips, Groups Warn"


Nvidia is currently “the world’s most valuable public company,” their letter said, worth more than $3 trillion after taking near-total control of the high-performance AI chip market. Particularly “astonishing,” the letter said, was Nvidia’s dominance in the market for GPU accelerator chips, which are at the heart of today’s leading AI.

According to the advocacy groups that strongly oppose Big Tech monopolies, Nvidia “now holds an 80 percent overall global market share in GPU chips and a 98 percent share in the data center market.” This “puts it in a position to crowd out competitors and set global pricing and the terms of trade,” the letter warned. . . .

https://tinyurl.com/y5c769nk


"European Artificial Intelligence Act Comes into Force"


The AI Act introduces a forward-looking definition of AI, based on a product safety and risk-based approach in the EU:

Minimal risk: Most AI systems, such as AI-enabled recommender systems and spam filters, fall into this category. These systems face no obligations under the AI Act due to their minimal risk to citizens’ rights and safety. Companies can voluntarily adopt additional codes of conduct.

Specific transparency risk: AI systems like chatbots must clearly disclose to users that they are interacting with a machine. Certain AI-generated content, including deep fakes, must be labelled as such, and users need to be informed when biometric categorisation or emotion recognition systems are being used. In addition, providers will have to design systems in a way that synthetic audio, video, text and images content is marked in a machine-readable format, and detectable as artificially generated or manipulated.

High risk: AI systems identified as high-risk will be required to comply with strict requirements, including risk-mitigation systems, high quality of data sets, logging of activity, detailed documentation, clear user information, human oversight, and a high level of robustness, accuracy, and cybersecurity. Regulatory sandboxes will facilitate responsible innovation and the development of compliant AI systems. Such high-risk AI systems include for example AI systems used for recruitment, or to assess whether somebody is entitled to get a loan, or to run autonomous robots.

Unacceptable risk: AI systems considered a clear threat to the fundamental rights of people will be banned. This includes AI systems or applications that manipulate human behaviour to circumvent users’ free will, such as toys using voice assistance encouraging dangerous behaviour of minors, systems that allow ‘social scoring’ by governments or companies, and certain applications of predictive policing. In addition, some uses of biometric systems will be prohibited, for example emotion recognition systems used at the workplace and some systems for categorising people or real time remote biometric identification for law enforcement purposes in publicly accessible spaces (with narrow exceptions).

https://tinyurl.com/32jy9pat


"AI and the Workforce: Industry Report Calls for Reskilling and Upskilling as 92 Percent of Technology Roles Evolve"


"The Transformational Opportunity of AI on ICT Jobs" report finds that 92 percent of jobs analyzed are expected to undergo either high or moderate transformation due to advancements in AI.

Led by Cisco, created by Consortium members, and analyzed by Accenture, the new report identifies essential trainings in AI literacy, data analytics and prompt engineering for workers seeking to adapt to the AI revolution.

The AI-Enabled ICT Workforce Consortium consists of Cisco, Accenture, Eightfold, Google, IBM, Indeed, Intel, Microsoft and SAP. Advisors include the American Federation of Labor and Congress of Industrial Organizations, CHAIN5, Communications Workers of America, DIGITALEUROPE, the European Vocational Training Association, Khan Academy and SMEUnited.

https://tinyurl.com/3hj8ypx2


"Is the AI Bubble about to Pop? Internal Documents Reveal OpenAI May Go Bankrupt within 12 Months"


Net losses for 2024 alone are expected to hit US$5 billion. . . .

The company spends US$7 billion on training its GPT models, with additional US$1.5 billion in staffing expenses.

It makes back anywhere between US$3.5 billion and US$4.5 billion in ChatGPT subscriptions and access fees. . . .

https://tinyurl.com/y8hen3ep
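The figures quoted above can be checked with simple arithmetic. The short Python sketch below uses only the numbers in the excerpt; the variable names are illustrative, and it assumes that training and staffing are the only costs and that subscriptions and access fees are the only revenue, which the full article may not support.

```python
# All amounts in billions of US dollars, taken from the excerpt above.
training_cost = 7.0
staffing_cost = 1.5
revenue_low, revenue_high = 3.5, 4.5

# Assumption: no other costs or income are counted.
total_cost = training_cost + staffing_cost

# The loss is largest when revenue is at the low end of the range.
loss_best_case = total_cost - revenue_high
loss_worst_case = total_cost - revenue_low

print(f"total cost: {total_cost} billion")
print(f"implied loss: {loss_best_case} to {loss_worst_case} billion")
```

Under those assumptions the implied shortfall is US$4 billion to US$5 billion, which is consistent with the projected net loss of US$5 billion at the low end of the revenue range.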


"Copyright Office Releases Part 1 of Artificial Intelligence Report, Recommends Federal Digital Replica Law"


Today, the U.S. Copyright Office is releasing Part 1 of its Report on the legal and policy issues related to copyright and artificial intelligence (AI), addressing the topic of digital replicas. This Part of the Report responds to the proliferation of videos, images, or audio recordings that have been digitally created or manipulated to realistically but falsely depict an individual. Given the gaps in existing legal protections, the Office recommends that Congress enact a new federal law that protects all individuals from the knowing distribution of unauthorized digital replicas. The Office also offers recommendations on the elements to be included in crafting such a law. . . .

The Report is being released in several Parts, beginning today. Forthcoming Parts will address the copyrightability of materials created in whole or in part by generative AI, the legal implications of training AI models on copyrighted works, licensing considerations, and the allocation of any potential liability.

https://tinyurl.com/yc2fhthm


"AI Is Complicating Plagiarism. How Should Scientists Respond?"


A central question is whether using unattributed content written entirely by a machine — rather than by a human — counts as plagiarism. Not necessarily, say many researchers. For example, the European Network for Academic Integrity, which includes universities and individuals, defines the prohibited or undeclared use of AI tools for writing as "unauthorized content generation" rather than as plagiarism as such.

https://www.nature.com/articles/d41586-024-02371-z


"Capturing Captions: Using AI to Identify and Analyse Image Captions in a Large Dataset of Historical Book Illustrations"


This article outlines how AI methods can be used to identify image captions in a large dataset of digitised historical book illustrations. This dataset includes over a million images from 68,000 books published between the eighteenth and early twentieth centuries, covering works of literature, history, geography, and philosophy. The article has two primary objectives. First, it suggests the added value of captions in making digitized illustrations more searchable by picture content in online archives. To further this objective, we describe the methods we have used to identify captions, which can effectively be re-purposed and applied in different contexts. Second, we suggest how this research leads to new understandings of the semantics and significance of the captions of historical book illustrations. The findings discussed here mark a critical intervention in the fields of digital humanities, book history, and illustration studies.

https://tinyurl.com/bdvjespp


"AI and Medical Images: Addressing Ethical Challenges to Provide Responsible Access to Historical Medical Illustrations"


This article examines the ethical considerations and broader issues around access to digitised historical medical images. These illustrations and, later, photographs are often extremely sensitive, representing disability, disease, gender, and race in potentially harmful and problematic ways. In particular, the original metadata for such images can include demeaning and sometimes racist terms. Some of these images show sexually explicit and violent content, as well as content that was obtained without informed consent. Hiding these sensitive images can be tempting, and yet, archives are meant to be used, not locked away. Through a series of interviews with 10 archivists, librarians, and researchers based in the UK and US, the authors show that improved access to medical illustrations is essential to produce new knowledge in the humanities and medical research, as well as to bridge the gap between historical and modern understandings of the human body. Improving access to medical illustration can also help to address the "gender data gap", which has acquired mainstream visibility thanks to the work of activists such as Caroline Criado-Perez, the author of Invisible Women: Data Bias in a World Designed for Men.

https://tinyurl.com/3jek7ey4


"Academic Authors ‘Shocked’ After Taylor & Francis Sells Access to Their Research to Microsoft AI"


One of the biggest concerns raised by Clemens [Dr Ruth Alison Clemens] is over whether it is possible for Taylor & Francis’ authors to opt out of the AI partnership with Microsoft. Clemens told The Bookseller: "There is no clarity from Taylor & Francis about whether an opt-out policy is in place or on the cards. But as they did not inform their authors about the deal in the first place, any opt-out policy is now not functional."

Taylor & Francis was paid around $10 million for the license.

https://tinyurl.com/3yyarxnj


"Meta Releases the Biggest and Best Open-Source AI Model Yet"


Meta is releasing Llama 3.1, the largest-ever open-source AI model, which the company claims outperforms GPT-4o and Anthropic’s Claude 3.5 Sonnet on several benchmarks. It’s also making the Llama-based Meta AI assistant available in more countries and languages while adding a feature that can generate images based on someone’s specific likeness. . . .

Meta’s own implementation of Llama is its AI assistant, which is positioned as a general-purpose chatbot like ChatGPT and can be found [in a few weeks] in just about every part of Instagram, Facebook, and WhatsApp.

https://tinyurl.com/2cs552p4


Paywall: "Exploring the Use of Generative Artificial Intelligence in Systematic Searching: A Comparative Case Study of a Human Librarian, ChatGPT-4 and ChatGPT-4 Turbo"


The findings suggest that AI could expand the scope of search terms and queries, automating the more repetitive and formulaic aspects of the systematic-review process, while human expertise remains crucial in refining search terms and ensuring methodological rigor. Meanwhile, challenges remain for AI tools’ capacity to access subscription-based or proprietary databases and generate sophisticated search strategies.

https://doi.org/10.1177/03400352241263532


AI Is Running Out of New Training Data: "Consent in Crisis: The Rapid Decline of the AI Data Commons"


General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. . . . Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems.

https://tinyurl.com/4k56axzk
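One widely used machine-readable consent signal for crawlers is a site's robots.txt file, which is assumed here to be among the "consent protocols" the excerpt refers to. The Python sketch below shows how such a restriction can be read with the standard library's `urllib.robotparser`. The robots.txt content, the helper name `crawl_permissions`, and the example.org URL are hypothetical; `GPTBot` is used as an example of an AI crawler's user-agent string, and `ExampleBrowser` is a made-up agent for comparison.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks one AI crawler and allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def crawl_permissions(robots_txt, agents, url):
    """Return {agent: True/False} for whether each agent may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


permissions = crawl_permissions(
    ROBOTS_TXT,
    ["GPTBot", "ExampleBrowser"],
    "https://example.org/articles/1",
)
print(permissions)
```

A check like this only reports what the site has declared. As the excerpt notes, the effect on training data depends on whether such restrictions are respected or enforced, and Terms of Service restrictions are not captured by robots.txt at all.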
