"The New York Times Prohibits AI Vendors from Devouring Its Content"


The new terms prohibit the use of Times content—which includes articles, videos, images, and metadata—for training any AI model without express written permission. In Section 2.1 of the TOS, the NYT says that its content is for the reader’s “personal, non-commercial use” and that non-commercial use does not include “the development of any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system.”

https://tinyurl.com/2cc4uhuc

| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Sites Scramble to Block ChatGPT Web Crawler after Instructions Emerge"


But for large website operators, the choice to block large language model (LLM) crawlers isn’t as easy as it may seem. Making some LLMs blind to certain website data will leave gaps of knowledge that could serve some sites very well (such as sites that don’t want to lose visitors if ChatGPT supplies their information for them), but it may also hurt others. For example, blocking content from future AI models could decrease a site’s or a brand’s cultural footprint if AI chatbots become a primary user interface in the future. As a thought experiment, imagine an online business declaring that it didn’t want its website indexed by Google in the year 2002—a self-defeating move when that was the most popular on-ramp for finding information online.

https://tinyurl.com/yc4mcejn

| Research Data Publication and Citation Bibliography | Research Data Sharing and Reuse Bibliography | Research Data Curation and Management Bibliography | Digital Scholarship |

"AI Can Crack Double Blind Peer Review — Should We Still Use It?"


However, in the era of artificial intelligence (AI) and big data, a pressing question arises: can an author’s identity be deduced even from an anonymized paper (in cases where the authors do not advertise their submitted article on social media)?

In a recent article we investigate this very question, by leveraging an artificial intelligence model trained on the largest authorship attribution dataset to date. . . . Focusing purely on well-established researchers with at least a few dozen publications, our work demonstrates that reliable author identification is possible.

https://tinyurl.com/2kbuh7wn

"Will Building LLMs [AI Large Language Models] Become the New Revenue Driver for Academic Publishing?"


In a world where peer-reviewed content holds value for Generative AI companies, the question arises whether content that is locked behind a paywall has greater value than OA content. . . . Will publishers who still have a lot of content locked up, such as IEEE or NEJM, retain the most valuable assets? Will publishers that limit licensing to more restrictive terms such as CC BY-NC and CC BY-NC-ND have revenue streams denied to those exclusively using CC BY licenses? . . . Could authors receive income from their work via a CMO (Collective Management of Copyright) license, regardless of the agreement they have with the publisher?

https://tinyurl.com/zm6u5spc

OpenAI’s New Web Crawler: GPTBot

OpenAI has released a brief overview of GPTBot.

GPTBot is OpenAI’s web crawler and can be identified by the following user agent and string.

User agent token: GPTBot

Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)

Usage

Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety. Below, we also share how to disallow GPTBot from accessing your site.

Disallowing GPTBot

To disallow GPTBot to access your site you can add the GPTBot to your site’s robots.txt:

User-agent: GPTBot
Disallow: /
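As a rough illustration of how the rule above behaves, the following minimal sketch uses Python's standard-library urllib.robotparser to test whether a crawler identifying itself with the GPTBot token may fetch a page. The helper name is_allowed, the example.com URL, and the SomeOtherBot agent are hypothetical and used only for illustration; this is not OpenAI's own tooling, and it only shows how a compliant parser would read the rule.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule quoted above, as it would appear in a site's robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it parses the text locally
    instead of downloading a live robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL (assumption)
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Other bot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

Because the rule names only GPTBot, crawlers with other user agent tokens are unaffected unless the file also contains rules for them.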

"Artificial Intelligence in Subject-Specific Library Work: Trends, Perspectives, and Opportunities"


The general implications of AI for libraries are much discussed in library literature. But while this discussion takes place at the library-wide level, there are also important implications for subject librarians due to the specific uses of AI in different professions and areas of study. These are often overlooked as these specializations tend to publish in subject-specific journals. This article aims to address this research gap by providing a comparison and thematic analysis of this literature. Subject-specific library journals in the areas of law, health sciences, business, and humanities and social sciences were searched to identify relevant journal articles that discussed AI. 139 articles were identified and tagged with at least one category that reflected the nature of the discussion around AI. The following analysis showed that literature related to law had the greatest number of articles by far, though the publishing activity in all disciplines has increased significantly in the last 10 years. This article explores these trends to gain a more comprehensive understanding of the implications for subject-specific library work.

https://doi.org/10.33137/cjal-rcbu.v9.39951

"Powering Research with Dimensions AI Assistant"


Imagine using AI to leverage the power of Dimensions with the click of a button. That’s exactly what you can do with Dimensions AI Assistant: your interaction with the world’s research knowledge is assisted by a powerful AI that takes you beyond keywords to a semantically rich summary with references, fully contextualizing the results and linking them with the literature. Digital Science has announced a closed beta release of Dimensions AI Assistant, which will allow users to achieve their goals quicker by helping them find the most relevant research and receive relevant synopses, leveraging the power of the Dimensions large language model, Dimensions General Science-BERT, and OpenAI’s GPT models.

https://tinyurl.com/4w2jfukt

"Elsevier takes Scopus to the Next Level with Generative AI"


Scopus AI will help early-career researchers and seasoned academics alike through:

  • Summarized views based on Scopus abstracts: Researchers obtain a concise and trustworthy snapshot of any research topic, complete with academic references, reducing lengthy reading time and the risk of hallucinations.
  • Easy navigation to “Go Deeper Links” for extended exploration: Scopus AI provides relevant queries for further exploration, leading to hidden insights in various research topics.
  • Natural language queries: Researchers can ask questions about a subject in a natural, conversational manner.
  • A soon-to-be-added graphical representation, offering new perspectives of interconnected research themes: Scopus AI visually maps search results, offering a comprehensive overview that allows researchers to navigate complex relationships easily.

https://tinyurl.com/27xxj465

Paywall: "Human-AI Interaction for Exploratory Search & Recommender Systems with Application to Cultural Heritage"


This dissertation introduces three primary contributions through publicly deployed systems and datasets. First, we demonstrate how the construction of large-scale cultural heritage datasets using machine learning can answer interdisciplinary questions in library & information science and the humanities (Chapter 2). Second, based on the feedback of users of these cultural heritage datasets, we introduce open faceted search, an extension of faceted search that leverages human-AI interaction affordances to empower users to define their own facets in an open domain fashion (Chapter 3). Third, encountering similar challenges with the deluge of scientific papers, we explore the question of how to improve recommender systems through human-AI interaction and tackle the broad challenge of advice taking for opaque machine learners (Chapter 4).

https://tinyurl.com/yc59txc5

Generative AI and the Future of Work in America


By 2030, activities that account for up to 30 percent of hours currently worked across the US economy could be automated—a trend accelerated by generative AI. However, we see generative AI enhancing the way STEM, creative, and business and legal professionals work rather than eliminating a significant number of jobs outright. Automation’s biggest effects are likely to hit other job categories. Office support, customer service, and food service employment could continue to decline. . . .

An additional 12 million occupational transitions may be needed by 2030. As people leave shrinking occupations, the economy could reweight toward higher-wage jobs. Workers in lower-wage jobs are up to 14 times more likely to need to change occupations than those in highest-wage positions, and most will need additional skills to do so successfully. Women are 1.5 times more likely to need to move into new occupations than men.

https://tinyurl.com/yn2xdt7p

Paywall: "An Initial Interpretation of the U.S. Department of Education’s AI Report: Implications and Recommendations for Academic Libraries"


This article provides an analysis of the U.S. Department of Education’s report on Artificial Intelligence (AI) and its implications for academic libraries. It delves into the report’s key points, including the importance of AI literacy, the need for educator involvement in AI design and implementation, and the necessity of preparing for AI related issues. The author discusses how these points impact academic libraries and offers actionable recommendations for library leaders. It emphasizes the need for libraries to promote AI literacy, involve librarians in AI implementation, develop guidelines for AI use, prepare for AI issues, and collaborate with other stakeholders.

https://doi.org/10.1016/j.acalib.2023.102761

"Reproducibility in Machine Learning-Driven Research"


Research is facing a reproducibility crisis, in which the results and findings of many studies are difficult or even impossible to reproduce. This is also the case in machine learning (ML) and artificial intelligence (AI) research. Often, this is the case due to unpublished data and/or source-code, and due to sensitivity to ML training conditions. Although different solutions to address this issue are discussed in the research community such as using ML platforms, the level of reproducibility in ML-driven research is not increasing substantially. Therefore, in this mini survey, we review the literature on reproducibility in ML-driven research with three main aims: (i) reflect on the current situation of ML reproducibility in various research fields, (ii) identify reproducibility issues and barriers that exist in these research fields applying ML, and (iii) identify potential drivers such as tools, practices, and interventions that support ML reproducibility. With this, we hope to contribute to decisions on the viability of different solutions for supporting ML reproducibility.

https://arxiv.org/abs/2307.10320

"Analyzing and Navigating Electronic Theses and Dissertations"


This research is aimed at building tools and techniques for discovering and accessing the knowledge buried in ETDs, as well as to support end-user services for digital libraries, such as document browsing and long document navigation. First, we review several machine learning models that can be used to support such services. Next, to support a comprehensive evaluation of different models, as well as to train models that are tailored to the ETD data, we introduce several new datasets from the ETD domain. To minimize the resources required to develop high quality training datasets required for supervised training, a novel AI-aided annotation method is also discussed. Finally, we propose techniques and frameworks to support the various digital library services such as search, browsing, and recommendation.

https://tinyurl.com/33ay562h

Webinar Recording: "ACRL LDG A Mutualistic View of AI in the Library or a Continuation of Craft by Thomas Padilla"


During this session, Thomas Padilla [Deputy Director, Archiving and Data Services at the Internet Archive] will present a critical and generative position aimed at empowering GLAM professionals on their journey to develop a mutually beneficial relationship with AI. The discussion will cover the individual, organizational, and community impacts of AI in the library landscape.

https://www.youtube.com/watch?v=hh5PTyBT6AA

"Wikipedia’s Moment of Truth"


The new A.I. chatbots have typically swallowed Wikipedia’s corpus. . . . While estimates of its influence can vary, Wikipedia is probably the most important single source in the training of A.I. models. "Without Wikipedia, generative A.I. wouldn’t exist," says Nicholas Vincent. Yet as bots like ChatGPT become increasingly popular and sophisticated, Vincent and some of his colleagues wonder what will happen if Wikipedia, outflanked by A.I. that has cannibalized it, suffers from disuse and dereliction.

https://tinyurl.com/bdbxrdbk

"Meta Is Expanding Its Generative A.I. Arsenal with a New Tool It’s Touting as a ‘State-of-the-Art’ Breakthrough"


Currently, there is a divide between A.I. image generators and A.I. text generators, like OpenAI’s ChatGPT. . . . Meta’s tool breaks down that divide with a model that allows for the input and generation of text and images, and allows for the creation of captions (or image-to-text generation) and images with "super-resolution."

https://tinyurl.com/mr25z6zd

"The Future of Academic Publishing"


Ultimately, we might be forced to rethink publication. If scientific research is mostly read by machines, the question arises of whether it is relevant to package it into a single coherent narrative that is adapted to the limitations of human cognition. This seems like a lot of busywork for scientists. We could unbundle scientific research from the constraints of journal formatting, as suggested by Neuromatch Open Publishing. In this view, research will be a living compendium of code, datasets, graphs and narrative content remixable and always up to date. Open and freely accessible research will be more valuable and influential because it will be seen by LLMs.

https://doi.org/10.1038/s41562-023-01637-2

"Authors Join the Brewing Legal Battle over AI"


Neither Meta nor OpenAI has yet responded to the author suits. But multiple copyright lawyers told PW on background that the claims likely face an uphill battle in court. Even if the suits get past the threshold issues associated with the alleged copying at issue and how AI training actually works—which is no sure thing—lawyers say there is ample case law to suggest fair use. For example, a recent case against plagiarism detector TurnItIn.com held that works could be ingested to create a database used to expose plagiarism by students. The landmark Kelly v. Arriba Soft case held that the reproduction and display of photos as thumbnails was fair use.

https://tinyurl.com/bddvrykh

"Writing with CHATGPT: An Illustration of Its Capacity, Limitations & Implications for Academic Writers"


Rather than being alarmed or anxious, writers need to understand ChatGPT’s strengths and weaknesses. It is better at structure than it is at content. It is a good brainstorming tool (think titles, outlines, counter-arguments), but you must double check everything it tells you, especially if you’re outside your domain of expertise. It can provide summaries of complex ideas, and connect them with other ideas, but only if you have put a lot of thought into the incremental prompting needed to shift it from its generic default and train it to focus on what you care about. Its access to information is limited to what it was originally trained on, therefore your own training phase is essential to identify gaps and inaccuracies. It can be used for labor, such as reformatting abstracts or reducing the length of sections, but it can’t replace the thinking a writer does to determine why some paragraphs or ideas deserve more words and others can be cut back. It can be inaccurate: in fact, rather stubbornly so, persisting with inaccuracies even after they are pointed out, while at the same time presenting its next attempt as corrected.

https://doi.org/10.5334/pme.1072

"CORE-GPT: Combining Open Access Research and Large Language Models for Credible, Trustworthy Question Answering"


In this paper, we present CORE-GPT, a novel question-answering platform that combines GPT-based language models and more than 32 million full-text open access scientific articles from CORE. We first demonstrate that GPT3.5 and GPT4 cannot be relied upon to provide references or citations for generated text. We then introduce CORE-GPT which delivers evidence-based answers to questions, along with citations and links to the cited papers, greatly increasing the trustworthiness of the answers and reducing the risk of hallucinations. CORE-GPT’s performance was evaluated on a dataset of 100 questions covering the top 20 scientific domains in CORE, resulting in 100 answers and links to 500 relevant articles. The quality of the provided answers and relevance of the links were assessed by two annotators. Our results demonstrate that CORE-GPT can produce comprehensive and trustworthy answers across the majority of scientific domains, complete with links to genuine, relevant scientific articles.

https://arxiv.org/abs/2307.04683

"Claude 2: ChatGPT Rival Launches Chatbot That Can Summarise a Novel"


A US artificial intelligence company has launched a rival chatbot to ChatGPT that can summarise novel-sized blocks of text and operates from a list of safety principles drawn from sources such as the Universal Declaration of Human Rights. . . .

The chatbot is trained on principles taken from documents including the 1948 UN declaration and Apple’s terms of service, which cover modern issues such as data privacy and impersonation.

https://tinyurl.com/ms44eccd

10 AI Researchers on How AI Can Either Improve the World or Destroy It

Steve Rose of The Guardian interviews the experts.

Five Ways AI Could Improve the World: ‘We Can Cure All Diseases, Stabilise Our Climate, Halt Poverty’

Five Ways AI Might Destroy the World: ‘Everyone on Earth Could Fall over Dead in the Same Second’

"SSP Conference Debate: AI and the Integrity of Scholarly Publishing"


At the annual meeting of the Society for Scholarly Publishing held in Portland, Oregon last month, the closing plenary session was a formal debate on the proposition "Resolved: Artificial intelligence will fatally undermine the integrity of scholarly publishing." Arguing in favor of the proposition was Tim Vines, founder of DataSeer and a Scholarly Kitchen Chef. Arguing against was Jessica Miles, Vice President for Strategy and Investments at Holtzbrinck Publishing Group.

https://tinyurl.com/ururdfvw

AI Is Training AI: "Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks"


Large language models (LLMs) are remarkable data annotators. They can be used to generate high-fidelity supervised training data, as well as survey and experimental data. With the widespread adoption of LLMs, human gold-standard annotations are key to understanding the capabilities of LLMs and the validity of their results. However, crowdsourcing, an important, inexpensive way to obtain human annotations, may itself be impacted by LLMs, as crowd workers have financial incentives to use LLMs to increase their productivity and income. To investigate this concern, we conducted a case study on the prevalence of LLM usage by crowd workers. We reran an abstract summarization task from the literature on Amazon Mechanical Turk and, through a combination of keystroke detection and synthetic text classification, estimate that 33-46% of crowd workers used LLMs when completing the task. Although generalization to other, less LLM-friendly tasks is unclear, our results call for platforms, researchers, and crowd workers to find new ways to ensure that human data remain human, perhaps using the methodology proposed here as a stepping stone.

https://arxiv.org/abs/2306.07899

"OCLC Introduces AI-generated Book Recommendations in WorldCat.org and WorldCat Find beta"


OCLC is beta testing book recommendations generated by artificial intelligence (AI) in WorldCat.org, the website that allows users to explore the collections of thousands of libraries through a single search. Searchers can now obtain AI-enabled book recommendations for print and e-books and then look for those items in libraries near them. The AI-generated book recommendations beta is now available in WorldCat.org and WorldCat Find, the mobile app extension for WorldCat.org.

https://tinyurl.com/44j4ascr
