Stanford: Artificial Intelligence Index Report 2024


AI has surpassed human performance on several benchmarks, including some in image classification, visual reasoning, and English understanding. Yet it trails behind on more complex tasks like competition-level mathematics, visual commonsense reasoning, and planning. . . .

According to AI Index estimates, the training costs of state-of-the-art AI models have reached unprecedented levels. For example, OpenAI’s GPT-4 used an estimated $78 million worth of compute to train, while Google’s Gemini Ultra cost $191 million for compute. . . .

New research from the AI Index reveals a significant lack of standardization in responsible AI reporting. Leading developers, including OpenAI, Google, and Anthropic, primarily test their models against different responsible AI benchmarks. This practice complicates efforts to systematically compare the risks and limitations of top AI models. . . .

Despite a decline in overall AI private investment last year, funding for generative AI surged, nearly octupling from 2022 to reach $25.2 billion.

https://tinyurl.com/53wsjxyj

"Is ChatGPT Transforming Academics’ Writing Style?"


Based on one million arXiv papers submitted from May 2018 to January 2024, we assess the textual density of ChatGPT’s writing style in their abstracts by means of a statistical analysis of word frequency changes. Our model is calibrated and validated on a mixture of real abstracts and ChatGPT-modified abstracts (simulated data) after a careful noise analysis. We find that ChatGPT is having an increasing impact on arXiv abstracts, especially in the field of computer science, where the fraction of ChatGPT-revised abstracts is estimated to be approximately 35%, if we take the output of one of the simplest prompts, "revise the following sentences", as a baseline. We conclude with an analysis of both positive and negative aspects of the penetration of ChatGPT into academics’ writing style.
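
As a rough illustration of the frequency-shift estimation the authors describe, the sketch below solves a one-word mixture model for the fraction of revised abstracts. It is not the paper's code: the marker word and every number are invented for the example, and the actual study calibrates on simulated ChatGPT-revised abstracts using many words.

```python
# Toy version of estimating what fraction of abstracts were LLM-revised from a
# shift in a marker word's frequency. Single-word sketch with invented numbers;
# the paper's calibration and noise analysis are far more involved.

def estimate_revised_fraction(f_observed: float, f_human: float, f_llm: float) -> float:
    """Solve f_observed = (1 - a) * f_human + a * f_llm for the mixture weight a."""
    if f_llm == f_human:
        raise ValueError("Marker word is uninformative: identical frequencies.")
    a = (f_observed - f_human) / (f_llm - f_human)
    return min(max(a, 0.0), 1.0)  # clamp to a valid proportion

# Hypothetical per-word rates of a marker word (e.g., "delve") in pre-ChatGPT
# abstracts, in simulated ChatGPT-revised abstracts, and in the current corpus.
print(estimate_revised_fraction(f_observed=0.012, f_human=0.002, f_llm=0.030))  # ≈ 0.36
```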

https://arxiv.org/abs/2404.08627v1

"Generative AI Can Turn Your Most Precious Memories into Photos That Never Existed"


Dozens of people have now had their memories turned into images in this way via Synthetic Memories, a project run by Domestic Data Streamers. The studio uses generative image models, such as OpenAI’s DALL-E, to bring people’s memories to life. Since 2022, the studio, which has received funding from the UN and Google, has been working with immigrant and refugee communities around the world to create images of scenes that have never been photographed, or to re-create photos that were lost when families left their previous homes.

https://tinyurl.com/yekzh6sy

"Is ChatGPT Corrupting Peer Review? Telltale Words Hint at AI Use"


A study that identified buzzword adjectives that could be hallmarks of AI-written text in peer-review reports suggests that researchers are turning to ChatGPT and other artificial intelligence (AI) tools to evaluate others’ work. . . .

Their analysis suggests that up to 17% of the peer-review reports have been substantially modified by chatbots — although it’s unclear whether researchers used the tools to construct reviews from scratch or just to edit and improve written drafts.
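
A bare-bones version of the underlying signal — buzzword adjectives turning up far more often in recent reviews than in pre-chatbot ones — might look like the sketch below. The word list echoes adjectives cited in the coverage, but the two sample reviews are invented and the study's corpus-level estimation is considerably more careful.

```python
# Compare how often "telltale" adjectives appear in reviews written before and
# after chatbots became widely available. Illustrative only.
from collections import Counter
import re

BUZZWORDS = {"commendable", "meticulous", "innovative", "intricate", "notable"}

def buzzword_rate(reviews: list[str]) -> float:
    """Buzzword occurrences per 1,000 tokens across a set of review texts."""
    tokens = [t for r in reviews for t in re.findall(r"[a-z']+", r.lower())]
    counts = Counter(tokens)
    hits = sum(counts[w] for w in BUZZWORDS)
    return 1000 * hits / max(len(tokens), 1)

pre_llm_reviews = ["The methods section is sound but the evaluation is thin."]
recent_reviews = ["This commendable and meticulous study offers notable insights."]
print(buzzword_rate(pre_llm_reviews), buzzword_rate(recent_reviews))
```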

https://www.nature.com/articles/d41586-024-01051-2

"AI Race Heats Up as OpenAI, Google and Mistral Release New Models"


OpenAI, Google, and the French artificial intelligence startup Mistral have all released new versions of their frontier AI models within 12 hours of one another, as the industry prepares for a burst of activity over the summer.

The unprecedented flurry of releases comes as the sector readies for the expected launch of the next major version of GPT, the system that underpins OpenAI’s hit chatbot ChatGPT.

https://tinyurl.com/36zmymwp

"Towards a Books Data Commons for AI Training"


This white paper describes ways of building a books data commons: a responsibly designed, broadly accessible data set of digitized books to be used in training AI models. This report, written in partnership with Creative Commons and Proteus Strategies, is based on a series of workshops that brought together practitioners building AI models, legal and policy scholars, and experts working with collections of digitized books.

In the paper, we first explain why books matter for AI training and how broader access could be beneficial. We then summarize two tracks that might be considered for developing such a resource, highlighting existing projects that help foreground the potential challenges. One track relies on public domain and permissively licensed books, while the other depends on exceptions to copyright to enable training on in-copyright books. The report also presents several key design choices and next steps that could advance further development of this approach.

https://tinyurl.com/2fu47552

"PubTator 3.0: An AI-Powered Literature Resource for Unlocking Biomedical Knowledge"


PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0’s online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.
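
For readers who want to try the API mentioned above, a minimal query sketch using the requests library follows. The base URL, endpoint path, parameter names, and response keys are assumptions for illustration only; check the official PubTator 3.0 API documentation for the exact interface.

```python
# Minimal sketch of querying the PubTator 3.0 web API. Endpoint details below
# are assumed for illustration; verify them against the official documentation.
import requests

BASE = "https://www.ncbi.nlm.nih.gov/research/pubtator3-api"  # assumed base URL

def search_entity_pair(query: str, page: int = 1) -> dict:
    """Run a free-text / entity-pair search and return the JSON response."""
    resp = requests.get(f"{BASE}/search/", params={"text": query, "page": page}, timeout=30)
    resp.raise_for_status()
    return resp.json()

# Example: articles linking a gene and a disease (entity-pair style query).
results = search_entity_pair("BRCA1 AND breast cancer")
for hit in results.get("results", [])[:5]:   # "results" key is an assumption
    print(hit.get("pmid"), hit.get("title"))
```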

https://doi.org/10.1093/nar/gkae235

"Advancing the Search Frontier with AI Agents"


As many of us in the information retrieval (IR) research community know and appreciate, search is far from being a solved problem. Millions of people struggle with tasks on search engines every day. Often, their struggles relate to the intrinsic complexity of their task and the failure of search systems to fully understand the task and serve relevant results. The task motivates the search, creating the gap/problematic situation that searchers attempt to bridge/resolve and drives search behavior as they work through different task facets. Complex search tasks require more than support for rudimentary fact finding or re-finding. Research on methods to support complex tasks includes work on generating query and website suggestions, personalizing and contextualizing search, and developing new search experiences, including those that span time and space. The recent emergence of generative artificial intelligence (AI) and the arrival of assistive agents, based on this technology, has the potential to offer further assistance to searchers, especially those engaged in complex tasks. There are profound implications from these advances for the design of intelligent systems and for the future of search itself. This article, based on a keynote by the author at the 2023 ACM SIGIR Conference, explores these issues and how AI agents are advancing the frontier of search system capabilities, with a special focus on information interaction and complex task completion.

https://arxiv.org/abs/2311.01235

"How Tech Giants Cut Corners to Harvest Data for A.I."


The volume of data is crucial [to train AIs]. Leading chatbot systems have learned from pools of digital text spanning as many as three trillion words, or roughly twice the number of words stored in Oxford University’s Bodleian Library, which has collected manuscripts since 1602. The most prized data, A.I. researchers said, is high-quality information, such as published books and articles, which have been carefully written and edited by professionals. . . .

Tech companies are so hungry for new data that some are developing "synthetic" information. This is not organic data created by humans, but text, images and code that A.I. models produce — in other words, the systems learn from what they themselves generate.

https://tinyurl.com/3uxuwekh

"Unleashing the Power of AI. A Systematic Review of Cutting-Edge Techniques in AI-Enhanced Scientometrics, Webometrics, and Bibliometrics"


Findings: (i) Regarding scientometrics, the application of AI yields various distinct advantages, such as conducting analyses of publications, citations, research impact prediction, collaboration, research trend analysis, and knowledge mapping, in a more objective and reliable framework. (ii) In terms of webometrics, AI algorithms are able to enhance web crawling and data collection, web link analysis, web content analysis, social media analysis, web impact analysis, and recommender systems. (iii) Moreover, automation of data collection, analysis of citations, disambiguation of authors, analysis of co-authorship networks, assessment of research impact, text mining, and recommender systems are considered as the potential of AI integration in the field of bibliometrics.
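
As one concrete instance of the bibliometric tasks listed above, the sketch below builds a small co-authorship network and ranks authors by degree centrality. The paper records are invented; a real pipeline would first pull metadata from a bibliographic source and disambiguate author names before any analysis.

```python
# Build a weighted co-authorship graph from a few invented paper records and
# rank authors by degree centrality (a simple indicator of collaboration breadth).
from itertools import combinations
import networkx as nx

papers = [
    {"title": "Paper A", "authors": ["Garcia", "Chen", "Okafor"]},
    {"title": "Paper B", "authors": ["Chen", "Okafor"]},
    {"title": "Paper C", "authors": ["Garcia", "Singh"]},
]

G = nx.Graph()
for paper in papers:
    for a, b in combinations(paper["authors"], 2):  # every co-author pair
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

for author, score in sorted(nx.degree_centrality(G).items(), key=lambda x: -x[1]):
    print(author, round(score, 2))
```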

https://arxiv.org/abs/2403.18838

"Now You Can Use ChatGPT without an Account"


OpenAI will no longer require an account to use ChatGPT, the company’s free AI platform. However, this only applies to ChatGPT, as other OpenAI products, like DALL-E 3, cost money to access and will still require an account for access. . . .

OpenAI said it introduced "additional content safeguards for this experience," including blocking prompts in a wider range of categories, but did not expound more on what these categories are. The option to opt out of model training will still be available, even to those without accounts.

https://tinyurl.com/582ehjhm

Paywall: "Developing a Foundation for the Informational Needs of Generative AI Users through the Means of Established Interdisciplinary Relationships"


University faculty immediately had many questions and concerns in response to the public proliferation of generative artificial intelligence programs leveraging large language models to generate complex text responses to simple prompts. Librarians at the University of South Florida (USF) pooled their skills and existing relationships with faculty and professional staff across campus to provide information that answered common questions raised by those faculty on generative artificial intelligence usage within research-related topics. Faculty concerns regarding plagiarism, how to instruct students to use the new tools, and how to discern the reliability of information generated by artificial intelligence tools were placed at the forefront.

https://doi.org/10.1016/j.acalib.2024.102876

"Generative AI for Trustworthy, Open, and Equitable Scholarship"


We focus on the potential of GenAI to address known problems for the alignment of science practice and its underlying core values. As institutions culturally charged with the curation and preservation of the world’s knowledge and cultural heritage, libraries are deeply invested in promoting a durable, trustworthy, and sustainable scholarly knowledge commons. With public trust in academia and in research waning [reference] and in the face of recent high-profile instances of research misconduct [reference], the scholarly community must act swiftly to develop policies, frameworks, and tools for leveraging the power of GenAI in ways that enhance, rather than erode, the trustworthiness of scientific communications, the breadth of scientific impact, and the public’s trust in science, academia, and research.

https://doi.org/10.21428/e4baedd9.567bfd15

"Evolving AI Strategies in Libraries: Insights from Two Polls of ARL Member Representatives over Nine Months—Report Published"


To effectively chart this [AI] transition, two quick polls were conducted among members of the Association of Research Libraries (ARL) to capture changing perspectives on the potential impact of AI, assess the extent of AI exploration and implementation within libraries, and identify AI applications relevant to the current library environment.

Today, ARL has released the results of the two polls—analyzing and juxtaposing the outcomes of these two surveys to better understand how library leaders are managing the complexities of integrating AI into their operations and services. The report also includes recommendations for ARL research libraries.

https://tinyurl.com/2t9nywcv

Report

"TDM & AI Rights Reserved? Fair Use & Evolving Publisher Copyright Statements"


Earlier this year, we noticed that some academic publishers have revised the copyright notices on their websites to state they reserve rights to text and data mining (TDM) and AI training (for example, see the website footers for Elsevier and Wiley). . . . SPARC asked Kyle K. Courtney, Director of Copyright and Information Policy for Harvard Library, to address key questions regarding these revised copyright statements and the continuing viability of fair use justifications for TDM.

https://tinyurl.com/4prkfbb3

"Use ‘Jan’ to Chat with AI without the Privacy Concerns"


Jan is a free and open source application that makes it easy to download multiple large language models and start chatting with them. There are simple installers for Windows, macOS, and Linux. Now, this isn’t perfect. The models aren’t necessarily as good as the latest ones from OpenAI or Google, and depending on how powerful your computer is, the results might take a while to come in.
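
Recent versions of Jan can also expose a local, OpenAI-compatible API server for the models it downloads, which is part of its privacy appeal. The sketch below assumes that server is enabled and that the port and model name shown match your setup; both are assumptions, so check Jan's settings for the actual values.

```python
# Chat with a locally hosted model through Jan's OpenAI-compatible endpoint.
# The port (1337) and model name are assumptions; adjust to your installation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1337/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="mistral-7b-instruct",  # whichever model you downloaded in Jan
    messages=[{"role": "user", "content": "Summarize the idea of local LLMs in one sentence."}],
)
print(response.choices[0].message.content)
```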

https://tinyurl.com/4m8p4b82

"Human-Centered Explainable Artificial Intelligence: An Annual Review of Information Science and Technology (Arist) Paper"


Explainability is central to trust and accountability in artificial intelligence (AI) applications. The field of human-centered explainable AI (HCXAI) arose as a response to mainstream explainable AI (XAI) which was focused on algorithmic perspectives and technical challenges, and less on the needs and contexts of the non-expert, lay user. HCXAI is characterized by putting humans at the center of AI explainability. . . . This review identifies the foundational ideas of HCXAI, how those concepts are operationalized in system design, how legislation and regulations might normalize its objectives, and the challenges that HCXAI must address as it matures as a field.

https://doi.org/10.1002/asi.24889

"Exploring the Potential of Large Language Models and Generative Artificial Intelligence (GPT): Applications in Library and Information Science"


The presented study offers a systematic overview of the potential application of large language models (LLMs) and generative artificial intelligence tools, notably the GPT model and the ChatGPT interface, within the realm of library and information science (LIS). The paper supplements and extends the outcomes of a comprehensive information survey on the subject matter with the author’s own experiences and examples showcasing possible applications, demonstrated through illustrative instances. This study does not involve testing available LLMs or selecting the most suitable tool; instead, it targets information professionals, specialists, librarians, and scientists, aiming to inspire them in various ways.

https://doi.org/10.1177/09610006241241066

"The Latest ‘Crisis’ — Is the Research Literature Overrun with ChatGPT- and LLM-generated Articles?"


Elsevier has been under the spotlight this month for publishing a paper that contains a clearly ChatGPT-written portion of its introduction. The first sentence of the paper’s Introduction reads, "Certainly, here is a possible introduction for your topic:. . . ." To date, the article remains unchanged, and unretracted. A second paper, containing the phrase "I’m very sorry, but I don’t have access to real-time information or patient-specific data, as I am an AI language model" was subsequently found, and similarly remains unchanged. This has led to a spate of amateur bibliometricians scanning the literature for similar common AI-generated phrases, with some alarming results.
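
The kind of scan those "amateur bibliometricians" are running can be approximated in a few lines. The phrase list below simply echoes the examples quoted in the piece; any serious screening would need to exclude papers that legitimately quote such phrases, such as articles about this very problem.

```python
# Flag article text containing stock chatbot phrases that should never survive
# editing. Phrase list mirrors the examples quoted above; illustrative only.
import re

TELLTALE_PHRASES = [
    r"certainly,? here is a possible introduction",
    r"as an ai language model",
    r"i don't have access to real-?time information",
]

def flag_suspect_text(text: str) -> list[str]:
    """Return the telltale phrases found in a piece of text (case-insensitive)."""
    return [p for p in TELLTALE_PHRASES if re.search(p, text, flags=re.IGNORECASE)]

sample = "Certainly, here is a possible introduction for your topic: ..."
print(flag_suspect_text(sample))
```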

https://tinyurl.com/4a8bjmzy

"Fair Use Rights to Conduct Text and Data Mining and Use Artificial Intelligence Tools Are Essential for UC Research and Teaching"


The UC Libraries invest more than $60 million each year licensing systemwide electronic content needed by scholars for these and other studies. (Indeed, the $60 million figure represents license agreements made at the UC systemwide and multi-campus levels. But each individual campus also licenses electronic resources, adding millions more in total expenditures.) Our libraries secure campus access to a broad range of digital resources including books, scientific journals, databases, multimedia resources, and other materials. In doing so, the UC Libraries must negotiate licensing terms that ensure scholars can make both lawful and comprehensive use of the materials the libraries have procured. Increasingly, however, publishers and vendors are presenting libraries with content license agreements that attempt to preclude, or charge additional and unsupportable fees for, fair uses like training AI tools in the course of conducting TDM. . . .

If the UC Libraries are unable to protect these fair uses, UC scholars will be at the mercy of publishers aggregating and controlling what may be done with the scholarly record. Further, UC scholars’ pursuit of knowledge will be disproportionately stymied relative to academic colleagues in other global regions, given that a large proportion of other countries preclude contractual override of research exceptions.

Indeed, in more than forty countries—including all those within the European Union (EU)—publishers are prohibited from using contracts to abrogate exceptions to copyright in non-profit scholarly and educational contexts. Article 3 of the EU’s Directive on Copyright in the Digital Single Market preserves the right for scholars within research organizations and cultural heritage institutions (like those researchers at UC) to conduct TDM for scientific research, and further proscribes publishers from invalidating this exception by license agreements (see Article 7). Moreover, under AI regulations recently adopted by the European Parliament, copyright owners may not opt out of having their works used in conjunction with artificial intelligence tools in TDM research—meaning copyrighted works must remain available for scientific research that is reliant on AI training, and publishers cannot override these AI training rights through contract. Publishers are thus obligated to—and do—preserve fair use-equivalent research exceptions for TDM and AI within the EU, and can do so in the United States, too. . . .

In all events, adaptable licensing language can address publishers’ concerns by reiterating that the licensed products may be used with AI tools only to the extent that doing so would not: i. create a competing or commercial product or service for use by third parties; ii. unreasonably disrupt the functionality of the subscribed products; or iii. reproduce or redistribute the subscribed products for third parties. In addition, license agreements can require commercially reasonable security measures (as also required in the EU) to extinguish the risk of content dissemination beyond permitted uses. In sum, these licensing terms can replicate the research rights that are unequivocally reserved for scholars elsewhere.

https://tinyurl.com/4fvpdz35

"Microsoft Is Developing Tech That Would Let Users Write with Their Eyes, a Huge Win for Accessibility"


Microsoft published a new patent for a device called the Eye-Gaze, which would allow users to communicate and interact with electronic devices without the use of hands and fingers for typing. . . .

The only other peripheral that comes to mind that’s remotely similar to the Eye-Gaze is the Apple Vision Pro, but that’s in a mixed reality setting which still requires some hand movements.

https://tinyurl.com/2s443y86

"Responsible Artificial Intelligence: A Structured Literature Review"


Our research endeavors to advance the concept of responsible artificial intelligence (AI), a topic of increasing importance within EU policy discussions. The EU has recently issued several publications emphasizing the necessity of trust in AI, underscoring the dual nature of AI as both a beneficial tool and a potential weapon. This dichotomy highlights the urgent need for international regulation. Concurrently, there is a need for frameworks that guide companies in AI development, ensuring compliance with such regulations. Our research aims to assist lawmakers and machine learning practitioners in navigating the evolving landscape of AI regulation, identifying focal areas for future attention. This paper introduces a comprehensive and, to our knowledge, the first unified definition of responsible AI. Through a structured literature review, we elucidate the current understanding of responsible AI. Drawing from this analysis, we propose an approach for developing a future framework centered around this concept. Our findings advocate for a human-centric approach to Responsible AI. This approach encompasses the implementation of AI methods with a strong emphasis on ethics, model explainability, and the pillars of privacy, security, and trust.

https://arxiv.org/abs/2403.06910

"An OpenAI Spinoff Has Built an AI Model That Helps Robots Learn Tasks Like Humans"


Now three of OpenAI’s early research scientists say the startup they spun off in 2017, called Covariant, has solved that problem and unveiled a system that combines the reasoning skills of large language models with the physical dexterity of an advanced robot. . . .

This represents a leap forward, Chen told me, in robots that can adapt to their environment using training data rather than the complex, task-specific code that powered the previous generation of industrial robots. It’s also a step toward worksites where managers can issue instructions in human language without concern for the limitations of human labor. ("Pack 600 meal-prep kits for red pepper pasta using the following recipe. Take no breaks!")

https://tinyurl.com/3nek7xx2

Paywall: "The Obscene Energy Demands of A.I."


It’s been estimated that ChatGPT is responding to something like two hundred million requests per day, and, in so doing, is consuming more than half a million kilowatt-hours of electricity. (For comparison’s sake, the average U.S. household consumes twenty-nine kilowatt-hours a day.)
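
Taking the quoted estimates at face value, the back-of-envelope arithmetic works out as follows; both inputs are the article's rough figures, not measured values.

```python
# Back-of-envelope arithmetic using the figures quoted above (both are estimates).
requests_per_day = 200_000_000          # "something like two hundred million requests"
chatgpt_kwh_per_day = 500_000           # "more than half a million kilowatt-hours"
household_kwh_per_day = 29              # average U.S. household

wh_per_request = chatgpt_kwh_per_day * 1000 / requests_per_day
equivalent_households = chatgpt_kwh_per_day / household_kwh_per_day
print(f"~{wh_per_request:.1f} Wh per request")           # ~2.5 Wh
print(f"~{equivalent_households:,.0f} U.S. households")  # ~17,241
```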

https://tinyurl.com/ynrd4k4p

Generative AI in Higher Education: The Product Landscape


Since last fall, Ithaka S+R has been partnering with 19 colleges and universities from the US and Canada to assess GAI’s impact on higher education and make evidence-based, proactive decisions about how to manage the far-ranging effects of GAI.[3] As part of this project, Ithaka S+R has been cataloging GAI applications geared towards teaching, learning, and research in the higher education context. Today, we are excited to make our Product Tracking tool (https://sr.ithaka.org/our-work/generative-ai-product-tracker/) publicly available. . . .

This issue brief is designed to enrich the descriptive data captured in the Product Tracker. In the brief’s first section, we provide a typology of existing products and value propositions. In the second, we offer observations about what the product landscape suggests about the future of teaching, learning, and research practices, and speculations on the near-term future of the academic GAI market.

https://doi.org/10.18665/sr.320394
